Cross-sectional study

In medical research, social science, and biology, a cross-sectional study (also known as a cross-sectional analysis, transverse study, or prevalence study) is a type of observational study that analyzes data from a population, or a representative subset, at a specific point in time; that is, it analyzes cross-sectional data.


In economics, cross-sectional studies typically involve the use of cross-sectional regression, in order to sort out the existence and magnitude of causal effects of one independent variable upon a dependent variable of interest at a given point in time. They differ from time series analysis, in which the behavior of one or more economic aggregates is traced through time.

In medical research, cross-sectional studies differ from case-control studies in that they aim to provide data on the entire population under study, whereas case-control studies typically include only individuals who have developed a specific condition and compare them with a matched sample, often a tiny minority, of the rest of the population. Cross-sectional studies are descriptive studies (neither longitudinal nor experimental). Unlike case-control studies, they can be used to describe not only the odds ratio but also absolute risks and relative risks from prevalences (sometimes called the prevalence risk ratio, or PRR). [1] [2] They may be used to describe some feature of the population, such as prevalence of an illness, but cannot prove cause and effect. Longitudinal studies differ from both in making repeated observations on members of the study population over a period of time.
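As a minimal sketch of the measures mentioned above, the following Python snippet computes group prevalences, the prevalence ratio, and the odds ratio from a 2×2 table. The counts and the helper name `two_by_two_measures` are hypothetical, chosen only for illustration; they are not taken from any cited study.

```python
def two_by_two_measures(exposed_cases, exposed_noncases,
                        unexposed_cases, unexposed_noncases):
    """Summarise a 2x2 cross-sectional table (hypothetical helper)."""
    # Prevalence = cases / everyone observed in that exposure group.
    prev_exposed = exposed_cases / (exposed_cases + exposed_noncases)
    prev_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)
    return {
        "prevalence_exposed": prev_exposed,
        "prevalence_unexposed": prev_unexposed,
        # Ratio of the two prevalences (the "PRR" of the text).
        "prevalence_ratio": prev_exposed / prev_unexposed,
        # Cross-product ratio of the four cells.
        "odds_ratio": (exposed_cases * unexposed_noncases)
                      / (exposed_noncases * unexposed_cases),
    }


# Invented counts: 100 exposed and 100 unexposed people surveyed at one time.
measures = two_by_two_measures(40, 60, 20, 80)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts the odds ratio (about 2.67) is larger than the prevalence ratio (2.0), a gap that tends to appear when the condition is common in the surveyed population.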


Cross-sectional studies involve data collected at a defined time. They are often used to assess the prevalence of acute or chronic conditions, but cannot be used to answer questions about the causes of disease or the results of intervention. Cross-sectional data cannot be used to infer causality because temporality is not known. They may also be described as censuses. Cross-sectional studies may involve special data collection, including questions about the past, but they often rely on data originally collected for other purposes. They are moderately expensive, and are not suitable for the study of rare diseases. Difficulty in recalling past events may also contribute bias.


The use of routinely collected data allows large cross-sectional studies to be made at little or no expense. This is a major advantage over other forms of epidemiological study. A natural progression has been suggested from cheap cross-sectional studies of routinely collected data which suggest hypotheses, to case-control studies testing them more specifically, then to cohort studies and trials which cost much more and take much longer, but may give stronger evidence. In a cross-sectional survey, a specific group is looked at to see if an activity, say alcohol consumption, is related to the health effect being investigated, say cirrhosis of the liver. If alcohol use is correlated with cirrhosis of the liver, this would support the hypothesis that alcohol use may be associated with cirrhosis.


Routine data may not be designed to answer the specific question.

Routinely collected data does not normally describe which variable is the cause and which is the effect. Cross-sectional studies using data originally collected for other purposes are often unable to include data on confounding factors, other variables that affect the relationship between the putative cause and effect. For example, data only on present alcohol consumption and cirrhosis would not allow the role of past alcohol use, or of other causes, to be explored. Cross-sectional studies are very susceptible to recall bias.

Most case-control studies collect specifically designed data on all participants, including data fields designed to allow the hypothesis of interest to be tested. However, in issues where strong personal feelings may be involved, specific questions may be a source of bias. For example, past alcohol consumption may be incorrectly reported by an individual wishing to reduce their personal feelings of guilt. Such bias may be less in routinely collected statistics, or effectively eliminated if the observations are made by third parties, for example taxation records of alcohol by area.

In addition, there may be a cohort effect, in which differences in social and environmental influences are treated as developmental changes due to ageing. [3] Because such differences follow generational and ethnic groupings (a group of people who experienced a common historical event is affected by a common influence), it is difficult to establish the causal effect of the event.

Weaknesses of aggregated data

Cross-sectional studies can contain individual-level data (one record per individual, for example, in national health surveys). However, in modern epidemiology it may be impossible to survey the entire population of interest, so cross-sectional studies often involve secondary analysis of data collected for another purpose. In many such cases, no individual records are available to the researcher, and group-level information must be used. Major sources of such data are often large institutions like the Census Bureau or the Centers for Disease Control in the United States. Recent census data is not provided on individuals; in the UK, for example, individual census data is released only after a century. Instead, data is aggregated, usually by administrative area. Inferences about individuals based on aggregate data are weakened by the ecological fallacy. There is also the potential for committing the "atomistic fallacy", in which assumptions about aggregated counts are made based on the aggregation of individual-level data (such as averaging census tracts to calculate a county average). For example, it might be true that there is no correlation between infant mortality and family income at the city level, while still being true that there is a strong relationship between infant mortality and family income at the individual level. All aggregate statistics are subject to compositional effects, so that what matters is not only the individual-level relationship between income and infant mortality, but also the proportions of low-, middle-, and high-income individuals in each city. Because case-control studies are usually based on individual-level data, they do not have this problem.
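The gap between city-level and individual-level relationships can be sketched with a small, fully invented data set. In the Python snippet below, each hypothetical city holds three families with an income and a made-up risk score; the city names, the numbers, and the `pearson` helper are assumptions for illustration only. Within every city the two variables are perfectly negatively correlated, yet the correlation between the city averages is zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: city -> (family incomes, risk scores), one entry per family.
cities = {
    "A": ([8, 10, 12], [7, 5, 3]),
    "B": ([18, 20, 22], [8, 6, 4]),
    "C": ([28, 30, 32], [7, 5, 3]),
}

# Individual-level view: correlation computed inside each city.
within_city = {name: pearson(inc, risk) for name, (inc, risk) in cities.items()}

# Aggregate view: only one (mean income, mean risk) pair per city is available.
mean_income = [sum(inc) / len(inc) for inc, _ in cities.values()]
mean_risk = [sum(risk) / len(risk) for _, risk in cities.values()]
city_level = pearson(mean_income, mean_risk)

print("within-city correlations:", within_city)
print("city-level correlation:", city_level)
```

A researcher who saw only the three city averages would find no association, although in this constructed example the relationship inside each city is as strong as it can be.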


In economics, cross-sectional analysis has the advantage of avoiding various complicating aspects of the use of data drawn from various points in time, such as serial correlation of residuals. It also has the advantage that the data analysis itself does not need an assumption that the nature of the relationships between variables is stable over time, though this comes at the cost of requiring caution if the results for one time period are to be assumed valid at some different point in time.

An example of cross-sectional analysis in economics is the regression of money demand (the amounts that various people hold in highly liquid financial assets) at a particular time upon their income, total financial wealth, and various demographic factors. Each data point is for a particular individual or family, and the regression is conducted on a statistical sample drawn at one point in time from the entire population of individuals or families. In contrast, an intertemporal analysis of money demand would use data on an entire country's holdings of money at each of various points in time, and would regress that on contemporaneous (or near-contemporaneous) income, total financial wealth, and some measure of interest rates. The cross-sectional study has the advantage that it can investigate the effects of various demographic factors (age, for example) on individual differences; but it has the disadvantage that it cannot find the effect of interest rates on money demand, because in the cross-sectional study at a particular point in time all observed units are faced with the same current level of interest rates.
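The interest-rate limitation can be illustrated with a deliberately simplified sketch. The Python snippet below fits a one-regressor least-squares slope, which is a simplification of the multi-variable regression described above; the household figures and the helper `ols_slope` are hypothetical. Income varies across households, so its slope can be estimated, whereas the interest rate is identical for every household in the cross-section, so no slope can be computed for it.

```python
def ols_slope(xs, ys):
    """Slope of a one-regressor least-squares fit, or None if xs never varies."""
    if len(set(xs)) < 2:
        # A regressor with a single value carries no information about its effect.
        return None
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented cross-section of five households observed at the same moment.
income = [20, 30, 40, 50, 60]        # hypothetical income units
money_held = [12, 17, 22, 27, 32]    # hypothetical liquid asset holdings
interest_rate = [3.0] * 5            # same prevailing rate (percent) for everyone

income_slope = ols_slope(income, money_held)
rate_slope = ols_slope(interest_rate, money_held)

print("slope on income:", income_slope)
print("slope on interest rate:", rate_slope)
```

Returning `None` for a constant regressor is a design choice made for this sketch; it avoids dividing by a zero sum of squares and makes the lack of identification explicit.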

Related Research Articles

A cohort study is a particular form of longitudinal study that samples a cohort, performing a cross-section at intervals through time. It is a type of panel study where the individuals in the panel share a common characteristic.

Dependent and independent variables: concept in mathematical modeling, statistical modeling and experimental sciences

Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or demand that they depend, by some law or rule, on the values of other variables. Independent variables, in turn, are not seen as depending on any other variable in the scope of the experiment in question. In this sense, some common independent variables are time, space, density, mass, fluid flow rate, and previous values of some observed value of interest used to predict future values.

An ecological fallacy is a formal fallacy in the interpretation of statistical data that occurs when inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong. 'Ecological fallacy' is a term that is sometimes used to describe the fallacy of division, which is not a statistical fallacy. The four common statistical ecological fallacies are: confusion between ecological correlations and individual correlations, confusion between group average and total average, Simpson's paradox, and confusion between higher average and higher likelihood.

Spurious relationship: apparent, but false, correlation between causally-independent variables

In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a certain third, unseen factor.

A market anomaly in a financial market is predictability that seems to be inconsistent with theories of asset prices. Standard theories include the capital asset pricing model and the Fama-French Three Factor Model, but a lack of agreement among academics about the proper theory leads many to refer to anomalies without a reference to a benchmark theory. Indeed, many academics simply refer to anomalies as "return predictors", avoiding the problem of defining a benchmark theory.

Panel (data) analysis is a statistical method, widely used in social science, epidemiology, and econometrics to analyze two-dimensional panel data. The data are usually collected over time and over the same individuals and then a regression is run over these two dimensions. Multidimensional analysis is an econometric method in which data are collected over more than two dimensions.

In statistics and econometrics, panel data and longitudinal data are both multi-dimensional data involving measurements over time. Panel data is a subset of longitudinal data where observations are for the same subjects each time.

In statistics and econometrics, a cross-sectional regression is a type of regression in which the explained and explanatory variables are all associated with the same single period or point in time. This type of cross-sectional analysis is in contrast to a time-series regression or longitudinal regression in which the variables are considered to be associated with a sequence of points in time.

Clinical study design is the formulation of trials and experiments, as well as observational studies in medical, clinical and other types of research involving human beings. The goal of a clinical study is to assess the safety, efficacy, and / or the mechanism of action of an investigational medicinal product (IMP) or procedure, or new drug or device that is in development, but potentially not yet approved by a health authority. It can also be to investigate a drug, device or procedure that has already been approved but is still in need of further investigation, typically with respect to long-term effects or cost-effectiveness.

A nested case–control (NCC) study is a variation of a case–control study in which cases and controls are drawn from the population in a fully enumerated cohort.

Confounding: variable in statistics

In statistics, a confounder is a variable that influences both the dependent variable and independent variable, causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations. The existence of confounders is an important quantitative explanation why correlation does not imply causation.

Cross-sectional data, or a cross section of a study population, in statistics and econometrics, is a type of data collected by observing many subjects at one point or period of time. The analysis might also have no regard to differences in time. Analysis of cross-sectional data usually consists of comparing the differences among selected subjects.

Modifiable areal unit problem: statistical bias encountered when point-based measures of spatial phenomena are aggregated into districts

The modifiable areal unit problem (MAUP) is a source of statistical bias that can significantly impact the results of statistical hypothesis tests. MAUP affects results when point-based measures of spatial phenomena are aggregated into districts, for example, population density or illness rates. The resulting summary values are influenced by both the shape and scale of the aggregation unit.

Multilevel models are statistical models of parameters that vary at more than one level. An example could be a model of student performance that contains measures for individual students as well as measures for classrooms within which the students are grouped. These models can be seen as generalizations of linear models, although they can also extend to non-linear models. These models became much more popular after sufficient computing power and software became available.

Observational study: draws inferences from a sample to a population where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints

In fields such as epidemiology, social sciences, psychology and statistics, an observational study draws inferences from a sample to a population where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints. One common observational study is about the possible effect of a treatment on subjects, where the assignment of subjects into a treated group versus a control group is outside the control of the investigator. This is in contrast with experiments, such as randomized controlled trials, where each subject is randomly assigned to a treated group or a control group. Because they lack an assignment mechanism, observational studies naturally present difficulties for inferential analysis.

Repeated measures design is a research design that involves multiple measures of the same variable taken on the same or matched subjects either under different conditions or over two or more time periods. For instance, repeated measurements are collected in a longitudinal study in which change over time is assessed.

In statistics, econometrics, political science, epidemiology, and related disciplines, a regression discontinuity design (RDD) is a quasi-experimental pretest-posttest design that aims to determine the causal effects of interventions by assigning a cutoff or threshold above or below which an intervention is assigned. By comparing observations lying closely on either side of the threshold, it is possible to estimate the average treatment effect in environments in which randomisation is unfeasible. However, it remains impossible to make true causal inference with this method alone, as it does not automatically reject causal effects by any potential confounding variable. First applied by Donald Thistlethwaite and Donald Campbell (1960) to the evaluation of scholarship programs, the RDD has become increasingly popular in recent years. Recent study comparisons of randomised controlled trials (RCTs) and RDDs have empirically demonstrated the internal validity of the design.

Meta-regression is defined to be a meta-analysis that uses regression analysis to combine, compare, and synthesize research findings from multiple studies while adjusting for the effects of available covariates on a response variable. A meta-regression analysis aims to reconcile conflicting studies or corroborate consistent ones; a meta-regression analysis is therefore characterized by the collated studies and their corresponding data sets—whether the response variable is study-level data or individual participant data. A data set is aggregate when it consists of summary statistics such as the sample mean, effect size, or odds ratio. On the other hand, individual participant data are in a sense raw in that all observations are reported with no abridgment and therefore no information loss. Aggregate data are easily compiled through internet search engines and therefore not expensive. However, individual participant data are usually confidential and are only accessible within the group or organization that performed the studies.

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.


  1. Schmidt, CO; Kohlmann, T (2008). "When to use the odds ratio or the relative risk?". International Journal of Public Health. 53 (3): 165–167. doi:10.1007/s00038-008-7068-3. PMID 19127890.
  2. Lee, James (1994). "Odds Ratio or Relative Risk for Cross-Sectional Data?". International Journal of Epidemiology. 23 (1): 201–203. doi:10.1093/ije/23.1.201. PMID 8194918.
  3. Ryder, Norman B. (1965). "The Cohort as a Concept in the Study of Social Change". American Sociological Review. 30 (6): 843–861. doi:10.2307/2090964. ISSN 0003-1224.