External validity is the validity of applying the conclusions of a scientific study outside the context of that study.In other words, it is the extent to which the results of a study can be generalized to and across other situations, people, stimuli, and times. In contrast, internal validity is the validity of conclusions drawn within the context of a particular study. Because general conclusions are almost always a goal in research, external validity is an important property of any study. Mathematical analysis of external validity concerns a determination of whether generalization across heterogeneous populations is feasible, and devising statistical and computational methods that produce valid generalizations.
"A threat to external validity is an explanation of how you might be wrong in making a generalization from the findings of a particular study."In most cases, generalizability is limited when the effect of one factor (i.e. the independent variable) depends on other factors. Therefore, all threats to external validity can be described as statistical interactions. Some examples include:
Note that a study's external validity is limited by its internal validity. If a causal inference made within a study is invalid, then generalizations of that inference to other contexts will also be invalid.
Cook and Campbellmade the crucial distinction between generalizing to some population and generalizing across subpopulations defined by different levels of some background factor. Lynch has argued that it is almost never possible to generalize to meaningful populations except as a snapshot of history, but it is possible to test the degree to which the effect of some cause on some dependent variable generalizes across subpopulations that vary in some background factor. That requires a test of whether the treatment effect being investigated is moderated by interactions with one or more background factors.
Whereas enumerating threats to validity may help researchers avoid unwarranted generalizations, many of those threats can be disarmed, or neutralized in a systematic way, so as to enable a valid generalization. Specifically, experimental findings from one population can be "re-processed", or "re-calibrated" so as to circumvent population differences and produce valid generalizations in a second population, where experiments cannot be performed. Pearl and Bareinboimclassified generalization problems into two categories: (1) those that lend themselves to valid re-calibration, and (2) those where external validity is theoretically impossible. Using graph-based calculus, they derived a necessary and sufficient condition for a problem instance to enable a valid generalization, and devised algorithms that automatically produce the needed re-calibration, whenever such exists. This reduces the external validity problem to an exercise in graph theory, and has led some philosophers to conclude that the problem is now solved.
An important variant of the external validity problem deals with selection bias, also known as sampling bias—that is, bias created when studies are conducted on non-representative samples of the intended population. For example, if a clinical trial is conducted on college students, an investigator may wish to know whether the results generalize to the entire population, where attributes such as age, education, and income differ substantially from those of a typical student. The graph-based method of Bareinboim and Pearl identifies conditions under which sample selection bias can be circumvented and, when these conditions are met, the method constructs an unbiased estimator of the average causal effect in the entire population. The main difference between generalization from improperly sampled studies and generalization across disparate populations lies in the fact that disparities among populations are usually caused by preexisting factors, such as age or ethnicity, whereas selection bias is often caused by post-treatment conditions, for example, patients dropping out of the study, or patients selected by severity of injury. When selection is governed by post-treatment factors, unconventional re-calibration methods are required to ensure bias-free estimation, and these methods are readily obtained from the problem's graph.
If age is judged to be a major factor causing treatment effect to vary from individual to individual, then age differences between the sampled students and the general population would lead to a biased estimate of the average treatment effect in that population. Such bias can be corrected though by a simple re-weighing procedure: We take the age-specific effect in the student subpopulation and compute its average using the age distribution in the general population. This would give us an unbiased estimate of the average treatment effect in the population. If, on the other hand, the relevant factor that distinguishes the study sample from the general population is in itself affected by the treatment, then a different re-weighing scheme need be invoked. Calling this factor Z, we again average the z-specific effect of X on Y in the experimental sample, but now we weigh it by the "causal effect" of X on Z. In other words, the new weight is the proportion of units attaining level Z=z had treatment X=x been administered to the entire population. This interventional probability, often written , can sometimes be estimated from observational studies in the general population.
A typical example of this nature occurs when Z is a mediator between the treatment and outcome, For instance, the treatment may be a cholesterol-reducing drug, Z may be cholesterol level, and Y life expectancy. Here, Z is both affected by the treatment and a major factor in determining the outcome, Y. Suppose that subjects selected for the experimental study tend to have higher cholesterol levels than is typical in the general population. To estimate the average effect of the drug on survival in the entire population, we first compute the z-specific treatment effect in the experimental study, and then average it using as a weighting function. The estimate obtained will be bias-free even when Z and Y are confounded—that is, when there is an unmeasured common factor that affects both Z and Y.
The precise conditions ensuring the validity of this and other weighting schemes are formulated in Bareinboim and Pearl, 2016and Bareinboim et al., 2014.
In many studies and research designs, there may be a trade-off between internal validity and external validity:Attempts to increase internal validity may also limit the generalizability of the findings, and vice versa. This situation has led many researchers call for "ecologically valid" experiments. By that they mean that experimental procedures should resemble "real-world" conditions. They criticize the lack of ecological validity in many laboratory-based studies with a focus on artificially controlled and constricted environments. Some researchers think external validity and ecological validity are closely related in the sense that causal inferences based on ecologically valid research designs often allow for higher degrees of generalizability than those obtained in an artificially produced lab environment. However, this again relates to the distinction between generalizing to some population (closely related to concerns about ecological validity) and generalizing across subpopulations that differ on some background factor. Some findings produced in ecologically valid research settings may hardly be generalizable, and some findings produced in highly controlled settings may claim near-universal external validity. Thus, external and ecological validity are independent—a study may possess external validity but not ecological validity, and vice versa.
Within the qualitative research paradigm, external validity is replaced by the concept of transferability. Transferability is the ability of research results to transfer to situations with similar parameters, populations and characteristics.
It is common for researchers to claim that experiments are by their nature low in external validity. Some claim that many drawbacks can occur when following the experimental method. By the virtue of gaining enough control over the situation so as to randomly assign people to conditions and rule out the effects of extraneous variables, the situation can become somewhat artificial and distant from real life.
There are two kinds of generalizability at issue:
However, both of these considerations pertain to Cook and Campbell's concept of generalizing to some target population rather than the arguably more central task of assessing the generalizability of findings from an experiment across subpopulations that differ from the specific situation studied and people who differ from the respondents studied in some meaningful way.
Critics of experiments suggest that external validity could be improved by the use of field settings (or, at a minimum, realistic laboratory settings) and by the use of true probability samples of respondents. However, if one's goal is to understand generalizability across subpopulations that differ in situational or personal background factors, these remedies do not have the efficacy in increasing external validity that is commonly ascribed to them. If background factor X treatment interactions exist of which the researcher is unaware (as seems likely), these research practices can mask a substantial lack of external validity. Dipboye and Flanagan, writing about industrial and organizational psychology, note that the evidence is that findings from one field setting and from one lab setting are equally unlikely to generalize to a second field setting.Thus, field studies are not by their nature high in external validity and laboratory studies are not by their nature low in external validity. It depends in both cases whether the particular treatment effect studied would change with changes in background factors that are held constant in that study. If one's study is "unrealistic" on the level of some background factor that does not interact with the treatments, it has no effect on external validity. It is only if an experiment holds some background factor constant at an unrealistic level and if varying that background factor would have revealed a strong Treatment x Background factor interaction, that external validity is threatened.
Research in psychology experiments attempted in universities is often criticized for being conducted in artificial situations and that it cannot be generalized to real life.To solve this problem, social psychologists attempt to increase the generalizability of their results by making their studies as realistic as possible. As noted above, this is in the hope of generalizing to some specific population. Realism per se does not help the make statements about whether the results would change if the setting were somehow more realistic, or if study participants were placed in a different realistic setting. If only one setting is tested, it is not possible to make statements about generalizability across settings.
However, many authors conflate external validity and realism. There is more than one way that an experiment can be realistic:
This is referred to the extent to which an experiment is similar to real-life situations as the experiment's mundane realism.
It is more important to ensure that a study is high in psychological realism—how similar the psychological processes triggered in an experiment are to psychological processes that occur in everyday life.
Psychological realism is heightened if people find themselves engrossed in a real event. To accomplish this, researchers sometimes tell the participants a cover story—a false description of the study's purpose. If however, the experimenters were to tell the participants the purpose of the experiment then such a procedure would be low in psychological realism. In everyday life, no one knows when emergencies are going to occur and people do not have time to plan responses to them. This means that the kinds of psychological processes triggered would differ widely from those of a real emergency, reducing the psychological realism of the study.
People don't always know why they do what they do, or what they do until it happens. Therefore, describing an experimental situation to participants and then asking them to respond normally will produce responses that may not match the behavior of people who are actually in the same situation. We cannot depend on people's predictions about what they would do in a hypothetical situation; we can only find out what people will really do when we construct a situation that triggers the same psychological processes as occur in the real world.
Social psychologists study the way in which people, in general, are susceptible to social influence. Several experiments have documented an interesting, unexpected example of social influence, whereby the mere knowledge that others were present reduced the likelihood that people helped.
The only way to be certain that the results of an experiment represent the behaviour of a particular population is to ensure that participants are randomly selected from that population. Samples in experiments cannot be randomly selected just as they are in surveys because it is impractical and expensive to select random samples for social psychology experiments. It is difficult enough to convince a random sample of people to agree to answer a few questions over the telephone as part of a political poll, and such polls can cost thousands of dollars to conduct. Moreover, even if one somehow was able to recruit a truly random sample, there can be unobserved heterogeneity in the effects of the experimental treatments... A treatment can have a positive effect on some subgroups but a negative effect on others. The effects shown in the treatment averages may not generalize to any subgroup.
Many researchers address this problem by studying basic psychological processes that make people susceptible to social influence, assuming that these processes are so fundamental that they are universally shared. Some social psychologist processes do vary in different cultures and in those cases, diverse samples of people have to be studied.
The ultimate test of an experiment's external validity is replication — conducting the study over again, generally with different subject populations or in different settings. Researchers will often use different methods, to see if they still get the same results.
When many studies of one problem are conducted, the results can vary. Several studies might find an effect of the number of bystanders on helping behaviour, whereas a few do not. To make sense out of this, there is a statistical technique called meta-analysis that averages the results of two or more studies to see if the effect of an independent variable is reliable. A meta analysis essentially tells us the probability that the findings across the results of many studies are attributable to chance or to the independent variable. If an independent variable is found to have an effect in only one of 20 studies, the meta-analysis will tell you that that one study was an exception and that, on average, the independent variable is not influencing the dependent variable. If an independent variable is having an effect in most of the studies, the meta-analysis is likely to tell us that, on average, it does influence the dependent variable.
There can be reliable phenomena that are not limited to the laboratory. For example, increasing the number of bystanders has been found to inhibit helping behaviour with many kinds of people, including children, university students, and future ministers;in Israel; in small towns and large cities in the U.S.; in a variety of settings, such as psychology laboratories, city streets, and subway trains; and with a variety of types of emergencies, such as seizures, potential fires, fights, and accidents, as well as with less serious events, such as having a flat tire. Many of these replications have been conducted in real-life settings where people could not possibly have known that an experiment was being conducted.
When conducting experiments in psychology, some believe that there is always a trade-off between internal and external validity—
Some researchers believe that a good way to increase external validity is by conducting field experiments. In a field experiment, people's behavior is studied outside the laboratory, in its natural setting. A field experiment is identical in design to a laboratory experiment, except that it is conducted in a real-life setting. The participants in a field experiment are unaware that the events they experience are in fact an experiment. Some claim that the external validity of such an experiment is high because it is taking place in the real world, with real people who are more diverse than a typical university student sample. However, as real-world settings differ dramatically, findings in one real-world setting may or may not generalize to another real-world setting.
Neither internal nor external validity is captured in a single experiment. Social psychologists opt first for internal validity, conducting laboratory experiments in which people are randomly assigned to different conditions and all extraneous variables are controlled. Other social psychologists prefer external validity to control, conducting most of their research in field studies, and many do both. Taken together, both types of studies meet the requirements of the perfect experiment. Through replication, researchers can study a given research question with maximal internal and external validity.
The design of experiments is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variation. The term is generally associated with experiments in which the design introduces conditions that directly affect the variation, but may also refer to the design of quasi-experiments, in which natural conditions that influence the variation are selected for observation.
Social psychology is the scientific study of how the thoughts, feelings, and behaviors of individuals are influenced by the actual, imagined, and implied presence of others, 'imagined' and 'implied presences' referring to the internalized social norms that humans are influenced by even when they are alone.
An experiment is a procedure carried out to support or refute a hypothesis. Experiments provide insight into cause-and-effect by demonstrating what outcome occurs when a particular factor is manipulated. Experiments vary greatly in goal and scale but always rely on repeatable procedure and logical analysis of the results. There also exist natural experimental studies.
Validity is the main extent to which a concept, conclusion or measurement is well-founded and likely corresponds accurately to the real world. The word "valid" is derived from the Latin validus, meaning strong. The validity of a measurement tool is the degree to which the tool measures what it claims to measure. Validity is based on the strength of a collection of different types of evidence described in greater detail below.
Experimental psychology refers to work done by those who apply experimental methods to psychological study and the processes that underlie it. Experimental psychologists employ human participants and animal subjects to study a great many topics, including sensation & perception, memory, cognition, learning, motivation, emotion; developmental processes, social psychology, and the neural substrates of all of these.
Selection bias is the bias introduced by the selection of individuals, groups, or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase "selection bias" most often refers to the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may be false.
In statistics, an effect size is a number measuring the strength of the relationship between two variables in a population, or a sample-based estimate of that quantity. It can refer to the value of a statistic calculated from a sample of data, the value of a parameter for a hypothetical population, or to the equation that operationalizes how statistics or parameters lead to the effect size value. Examples of effect sizes include the correlation between two variables, the regression coefficient in a regression, the mean difference, or the risk of a particular event happening. Effect sizes complement statistical hypothesis testing, and play an important role in power analyses, sample size planning, and in meta-analyses. The cluster of data-analysis methods concerning effect sizes is referred to as estimation statistics.
In psychology, an attribution bias or attributional bias is a cognitive bias that refers to the systematic errors made when people evaluate or try to find reasons for their own and others' behaviors. People constantly make attributions—judgements and assumptions about why people behave in certain ways. However, attributions do not always accurately reflect reality. Rather than operating as objective perceivers, people are prone to perceptual errors that lead to biased interpretations of their social world. Attribution biases are present in everyday life. For example, when a driver cuts someone off, the person who has been cut off is often more likely to attribute blame to the reckless driver's inherent personality traits rather than situational circumstances. Additionally, there are many different types of attribution biases, such as the ultimate attribution error, fundamental attribution error, actor-observer bias, and hostile attribution bias. Each of these biases describes a specific tendency that people exhibit when reasoning about the cause of different behaviors.
In psychology, the false consensus effect, also known as consensus bias, is a pervasive cognitive bias that causes people to “see their own behavioral choices and judgments as relatively common and appropriate to existing circumstances”. In other words, they assume that their personal qualities, characteristics, beliefs, and actions are relatively widespread through the general population.
In the behavioral sciences, ecological validity is often used to refer to the judgment of whether a given study's variables and conclusions are sufficiently relevant to its population. Authors like Brewer claim a tie of this use to the original meaning in Brunswik advocating "representative design". However, Brunswik's concern was whether test and extrapolation settings matched in terms of correlations among cues – not some more general notion or realism. Those using this altered sense of ecological validity are not founding this in any statistical conception of the generality of findings across settings or statistical statements related to the methodological validity of a study. Essentially, ecological validity is an impressionistic commentary on the relative strength of a study's implication(s) for policy, society, culture, etc., rather than on inferences related to the given variables. It is a subjective similarity judgment as entailed in Tversky and Kahneman's representativeness heuristic.
Field experiments are experiments carried out outside of laboratory settings.
Internal validity is the extent to which a piece of evidence supports a claim about cause and effect, within the context of a particular study. It is one of the most important properties of scientific studies and is an important concept in reasoning about evidence more generally. Internal validity is determined by how well a study can rule out alternative explanations for its findings. It contrasts with external validity, the extent to which results can justify conclusions about other contexts.
In statistics, a confounder is a variable that influences both the dependent variable and independent variable, causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations.
In fields such as epidemiology, social sciences, psychology and statistics, an observational study draws inferences from a sample to a population where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints. One common observational study is about the possible effect of a treatment on subjects, where the assignment of subjects into a treated group versus a control group is outside the control of the investigator. This is in contrast with experiments, such as randomized controlled trials, where each subject is randomly assigned to a treated group or a control group. Observational studies, for lacking an assignment mechanism, naturally present difficulties for inferential analysis.
A web-based experiment or Internet-based experiment is an experiment that is conducted over the Internet. In such experiments, the Internet is either "a medium through which to target larger and more diverse samples with reduced administrative and financial costs" or "a field of social science research in its own right." Psychology and Internet studies are probably the disciplines that have used these experiments most widely, although a range of other disciplines including political science and economics also use web-based experiments. Within psychology most web-based experiments are conducted in the areas of cognitive psychology and social psychology. This form of experimental setup has become increasingly popular because researchers can cheaply collect large amounts of data from a wider range of locations and people. A web-based experiment is a type of online research method. Web based experiments have become significantly more widespread since the COVID-19 pandemic, as researchers have been unable to conduct lab-based experiments.
Statistical conclusion validity is the degree to which conclusions about the relationship among variables based on the data are correct or "reasonable". This began as being solely about whether the statistical conclusion about the relationship of the variables was correct, but now there is a movement towards moving to "reasonable" conclusions that use: quantitative, statistical, and qualitative data. Fundamentally, two types of errors can occur: type I and type II. Statistical conclusion validity concerns the qualities of the study that make these types of errors more likely. Statistical conclusion validity involves ensuring the use of adequate sampling procedures, appropriate statistical tests, and reliable measurement procedures.
A quasi-experiment is an empirical interventional study used to estimate the causal impact of an intervention on target population without random assignment. Quasi-experimental research shares similarities with the traditional experimental design or randomized controlled trial, but it specifically lacks the element of random assignment to treatment or control. Instead, quasi-experimental designs typically allow the researcher to control the assignment to the treatment condition, but using some criterion other than random assignment.
In the statistical analysis of observational data, propensity score matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM attempts to reduce the bias due to confounding variables that could be found in an estimate of the treatment effect obtained from simply comparing outcomes among units that received the treatment versus those that did not. Paul R. Rosenbaum and Donald Rubin introduced the technique in 1983.
Although often used interchangeably, confounds and artifacts refer to two different kinds of threats to the validity of social psychological research.
Observational methods in psychological research entail the observation and description of a subject's behavior. Researchers utilizing the observational method can exert varying amounts of control over the environment in which the observation takes place. This makes observational research a sort of middle ground between the highly controlled method of experimental design and the less structured approach of conducting interviews.