Dependent and independent variables

Last updated

Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or demand that they depend, by some law or rule (e.g., by a mathematical function), on the values of other variables. Independent variables, in turn, are not seen as depending on any other variable in the scope of the experiment in question. [lower-alpha 1] In this sense, some common independent variables are time, space, density, mass, fluid flow rate, [1] [2] and previous values of some observed value of interest (e.g. human population size) to predict future values (the dependent variable). [3]

Contents

Of the two, it is always the dependent variable whose variation is being studied, by altering inputs, also known as regressors in a statistical context. In an experiment, any variable that the experimenter manipulates[ clarification needed ] can be called an independent variable. Models and experiments test the effects that the independent variables have on the dependent variables. Sometimes, even if their influence is not of direct interest, independent variables may be included for other reasons, such as to account for their potential confounding effect.

In single variable calculus, a function is typically graphed with the horizontal axis representing the independent variable and the vertical axis representing the dependent variable. In this function, y is the dependent variable and x is the independent variable. Polynomialdeg2.svg
In single variable calculus, a function is typically graphed with the horizontal axis representing the independent variable and the vertical axis representing the dependent variable. In this function, y is the dependent variable and x is the independent variable.

Mathematics

In mathematics, a function is a rule for taking an input (in the simplest case, a number or set of numbers) [5] and providing an output (which may also be a number). [5] A symbol that stands for an arbitrary input is called an independent variable, while a symbol that stands for an arbitrary output is called a dependent variable. [6] The most common symbol for the input is x, and the most common symbol for the output is y; the function itself is commonly written y = f(x). [6] [7]

It is possible to have multiple independent variables or multiple dependent variables. For instance, in multivariable calculus, one often encounters functions of the form z = f(x,y), where z is a dependent variable and x and y are independent variables. [8] Functions with multiple outputs are often referred to as vector-valued functions.

Modeling

In mathematical modeling, the dependent variable is studied to see if and how much it varies as the independent variables vary. In the simple stochastic linear model yi = a + bxi + ei the term yi is the ith value of the dependent variable and xi is the ith value of the independent variable. The term ei is known as the "error" and contains the variability of the dependent variable not explained by the independent variable.

With multiple independent variables, the model is yi = a + bxi,1 + bxi,2 + ... + bxi,n + ei, where n is the number of independent variables.[ citation needed ]

The linear regression model is now discussed. To use linear regression, a scatter plot of data is generated with X as the independent variable and Y as the dependent variable. This is also called a bivariate dataset, (x1, y1)(x2, y2) ...(xi, yi). The simple linear regression model takes the form of Yi = a + Bxi + Ui, for i = 1, 2, ... , n. In this case, Ui, ... ,Un are independent random variables. This occurs when the measurements do not influence each other. Through propagation of independence, the independence of Ui implies independence of Yi, even though each Yi has a different expectation value. Each Ui has an expectation value of 0 and a variance of σ2. [9]

Expectation of Yi Proof: [9]

The line of best fit for the bivariate dataset takes the form y = α + βx and is called the regression line. α and β correspond to the intercept and slope, respectively. [9]

Simulation

In simulation, the dependent variable is changed in response to changes in the independent variables.

Statistics

In an experiment, the variable manipulated by an experimenter is something that is proven to work, called an independent variable. [10] The dependent variable is the event expected to change when the independent variable is manipulated. [11]

In data mining tools (for multivariate statistics and machine learning), the dependent variable is assigned a role as target variable (or in some tools as label attribute), while an independent variable may be assigned a role as regular variable. [12] Known values for the target variable are provided for the training data set and test data set, but should be predicted for other data. The target variable is used in supervised learning algorithms but not in unsupervised learning.

Statistics synonyms

Depending on the context, an independent variable is sometimes called a "predictor variable", regressor, covariate , "manipulated variable", "explanatory variable", exposure variable (see reliability theory), "risk factor" (see medical statistics), "feature" (in machine learning and pattern recognition) or "input variable". [13] [14] In econometrics, the term "control variable" is usually used instead of "covariate". [15] [16] [17] [18] [19]

"Explanatory variable" is preferred by some authors over "independent variable" when the quantities treated as independent variables may not be statistically independent or independently manipulable by the researcher. [20] [21] If the independent variable is referred to as an "explanatory variable" then the term "response variable" is preferred by some authors for the dependent variable. [14] [20] [21]

From the Economics community, the independent variables are also called exogenous.

Depending on the context, a dependent variable is sometimes called a "response variable", "regressand", "criterion", "predicted variable", "measured variable", "explained variable", "experimental variable", "responding variable", "outcome variable", "output variable", "target" or "label". [14] In economics endogenous variables are usually referencing the target.

"Explained variable" is preferred by some authors over "dependent variable" when the quantities treated as "dependent variables" may not be statistically dependent. [22] If the dependent variable is referred to as an "explained variable" then the term "predictor variable" is preferred by some authors for the independent variable. [22]

Variables may also be referred to by their form: continuous or categorical, which in turn may be binary/dichotomous, nominal categorical, and ordinal categorical, among others.

An example is provided by the analysis of trend in sea level by Woodworth (1987). Here the dependent variable (and variable of most interest) was the annual mean sea level at a given location for which a series of yearly values were available. The primary independent variable was time. Use was made of a covariate consisting of yearly values of annual mean atmospheric pressure at sea level. The results showed that inclusion of the covariate allowed improved estimates of the trend against time to be obtained, compared to analyses which omitted the covariate.

Other variables

A variable may be thought to alter the dependent or independent variables, but may not actually be the focus of the experiment. So that the variable will be kept constant or monitored to try to minimize its effect on the experiment. Such variables may be designated as either a "controlled variable", "control variable", or "fixed variable".

Extraneous variables, if included in a regression analysis as independent variables, may aid a researcher with accurate response parameter estimation, prediction, and goodness of fit, but are not of substantive interest to the hypothesis under examination. For example, in a study examining the effect of post-secondary education on lifetime earnings, some extraneous variables might be gender, ethnicity, social class, genetics, intelligence, age, and so forth. A variable is extraneous only when it can be assumed (or shown) to influence the dependent variable. If included in a regression, it can improve the fit of the model. If it is excluded from the regression and if it has a non-zero covariance with one or more of the independent variables of interest, its omission will bias the regression's result for the effect of that independent variable of interest. This effect is called confounding or omitted variable bias; in these situations, design changes and/or controlling for a variable statistical control is necessary.

Extraneous variables are often classified into three types:

  1. Subject variables, which are the characteristics of the individuals being studied that might affect their actions. These variables include age, gender, health status, mood, background, etc.
  2. Blocking variables or experimental variables are characteristics of the persons conducting the experiment which might influence how a person behaves. Gender, the presence of racial discrimination, language, or other factors may qualify as such variables.
  3. Situational variables are features of the environment in which the study or research was conducted, which have a bearing on the outcome of the experiment in a negative way. Included are the air temperature, level of activity, lighting, and the time of day.

In modelling, variability that is not covered by the independent variable is designated by and is known as the "residual", "side effect", "error", "unexplained share", "residual variable", "disturbance", or "tolerance".

Examples

In a study measuring the influence of different quantities of fertilizer on plant growth, the independent variable would be the amount of fertilizer used. The dependent variable would be the growth in height or mass of the plant. The controlled variables would be the type of plant, the type of fertilizer, the amount of sunlight the plant gets, the size of the pots, etc.
In a study of how different doses of a drug affect the severity of symptoms, a researcher could compare the frequency and intensity of symptoms when different doses are administered. Here the independent variable is the dose and the dependent variable is the frequency/intensity of symptoms.
In measuring the amount of color removed from beetroot samples at different temperatures, temperature is the independent variable and amount of pigment removed is the dependent variable.
The taste varies with the amount of sugar added in the coffee. Here, the sugar is the independent variable, while the taste is the dependent variable.

See also

Notes

  1. Even if the existing dependency is invertible (e.g., by finding the inverse function when it exists), the nomenclature is kept if the inverse dependency is not the object of study in the experiment.

Related Research Articles

Econometrics is the application of statistical methods to economic data in order to give empirical content to economic relationships. More precisely, it is "the quantitative analysis of actual economic phenomena based on the concurrent development of theory and observation, related by appropriate methods of inference". An introductory economics textbook describes econometrics as allowing economists "to sift through mountains of data to extract simple relationships". Jan Tinbergen is one of the two founding fathers of econometrics. The other, Ragnar Frisch, also coined the term in the sense in which it is used today.

Least squares Approximation method in statistics

The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems by minimizing the sum of the squares of the residuals made in the results of every single equation.

In statistics and econometrics, particularly in regression analysis, a dummy variable is one that takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. They can be thought of as numeric stand-ins for qualitative facts in a regression model, sorting data into mutually exclusive categories.

Analysis of covariance (ANCOVA) is a general linear model which blends ANOVA and regression. ANCOVA evaluates whether the means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) often called a treatment, while statistically controlling for the effects of other continuous variables that are not of primary interest, known as covariates (CV) or nuisance variables. Mathematically, ANCOVA decomposes the variance in the DV into variance explained by the CV(s), variance explained by the categorical IV, and residual variance. Intuitively, ANCOVA can be thought of as 'adjusting' the DV by the group means of the CV(s).

Regression analysis Set of statistical processes for estimating the relationships among variables

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.

In statistics and in particular in regression analysis, a design matrix, also known as model matrix or regressor matrix and often denoted by X, is a matrix of values of explanatory variables of a set of objects. Each row represents an individual object, with the successive columns corresponding to the variables and their specific values for that object. The design matrix is used in certain statistical models, e.g., the general linear model. It can contain indicator variables that indicate group membership in an ANOVA, or it can contain values of continuous variables.

Coefficient of determination Indicator for how well data points fit a line or curve

In statistics, the coefficient of determination, also spelt coëfficient, denoted R2 or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term, in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.

Simple linear regression

In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

Confounding Variable in statistics

In statistics, a confounder is a variable that influences both the dependent variable and independent variable, causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations.

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables.

Proportional hazards models are a class of survival models in statistics. Survival models relate the time that passes, before some event occurs, to one or more covariates that may be associated with that quantity of time. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. For example, taking a drug may halve one's hazard rate for a stroke occurring, or, changing the material from which a manufactured component is constructed may double its hazard rate for failure. Other types of survival models such as accelerated failure time models do not exhibit proportional hazards. The accelerated failure time model describes a situation where the biological or mechanical life history of an event is accelerated.

Omnibus tests are a kind of statistical test. They test whether the explained variance in a set of data is significantly greater than the unexplained variance, overall. One example is the F-test in the analysis of variance. There can be legitimate significant effects within a model even if the omnibus test is not significant. For instance, in a model with two independent variables, if only one variable exerts a significant effect on the dependent variable and the other does not, then the omnibus test may be non-significant. This fact does not affect the conclusions that may be drawn from the one significant variable. In order to test effects within an omnibus test, researchers often use contrasts.

Mediation (statistics)

In statistics, a mediation model seeks to identify and explain the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third hypothetical variable, known as a mediator variable. Rather than a direct causal relationship between the independent variable and the dependent variable, a mediation model proposes that the independent variable influences the (non-observable) mediator variable, which in turn influences the dependent variable. Thus, the mediator variable serves to clarify the nature of the relationship between the independent and dependent variables.

In statistics, the variance inflation factor (VIF) is the ratio (quotient) of the variance of estimating some parameter in a model that includes multiple other terms (parameters) by the variance of a model constructed using only one term. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity. Cuthbert Daniel claims to have invented the concept behind the variance inflation factor, but did not come up with the name.

In the statistical analysis of observational data, propensity score matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM attempts to reduce the bias due to confounding variables that could be found in an estimate of the treatment effect obtained from simply comparing outcomes among units that received the treatment versus those that did not. Paul R. Rosenbaum and Donald Rubin introduced the technique in 1983.

In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator. The effect of a moderating variable is characterized statistically as an interaction; that is, a categorical or quantitative variable that affects the direction and/or strength of the relation between dependent and independent variables. Specifically within a correlational analysis framework, a moderator is a third variable that affects the zero-order correlation between two other variables, or the value of the slope of the dependent variable on the independent variable. In analysis of variance (ANOVA) terms, a basic moderator effect can be represented as an interaction between a focal independent variable and a factor that specifies the appropriate conditions for its operation.

In statistics and machine learning, lasso is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. It was originally introduced in geophysics, and later by Robert Tibshirani, who coined the term.

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

References

  1. Aris, Rutherford (1994). Mathematical modelling techniques. Courier Corporation.
  2. Boyce, William E.; Richard C. DiPrima (2012). Elementary differential equations. John Wiley & Sons.
  3. Alligood, Kathleen T.; Sauer, Tim D.; Yorke, James A. (1996). Chaos an introduction to dynamical systems. Springer New York.
  4. Hastings, Nancy Baxter. Workshop calculus: guided exploration with review. Vol. 2. Springer Science & Business Media, 1998. p. 31
  5. 1 2 Carlson, Robert. A concrete introduction to real analysis. CRC Press, 2006. p.183
  6. 1 2 Stewart, James. Calculus. Cengage Learning, 2011. Section 1.1
  7. Anton, Howard, Irl C. Bivens, and Stephen Davis. Calculus Single Variable. John Wiley & Sons, 2012. Section 0.1
  8. Larson, Ron, and Bruce Edwards. Calculus. Cengage Learning, 2009. Section 13.1
  9. 1 2 3 Dekking, Frederik Michel (2005), A modern introduction to probability and statistics: understanding why and how, Springer, ISBN   1-85233-896-2, OCLC   783259968
  10. http://onlinestatbook.com/2/introduction/variables.html
  11. Random House Webster's Unabridged Dictionary. Random House, Inc. 2001. Page 534, 971. ISBN   0-375-42566-7.
  12. English Manual version 1.0 Archived 2014-02-10 at the Wayback Machine for RapidMiner 5.0, October 2013.
  13. Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP. ISBN   0-19-920613-9 (entry for "independent variable")
  14. 1 2 3 Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP. ISBN   0-19-920613-9 (entry for "regression")
  15. Gujarati, Damodar N.; Porter, Dawn C. (2009). "Terminology and Notation". Basic Econometrics (Fifth international ed.). New York: McGraw-Hill. p. 21. ISBN   978-007-127625-2.
  16. Wooldridge, Jeffrey (2012). Introductory Econometrics: A Modern Approach (Fifth ed.). Mason, OH: South-Western Cengage Learning. pp. 22–23. ISBN   978-1-111-53104-1.
  17. Last, John M., ed. (2001). A Dictionary of Epidemiology (Fourth ed.). Oxford UP. ISBN   0-19-514168-7.
  18. Everitt, B. S. (2002). The Cambridge Dictionary of Statistics (2nd ed.). Cambridge UP. ISBN   0-521-81099-X.
  19. Woodworth, P. L. (1987). "Trends in U.K. mean sea level". Marine Geodesy. 11 (1): 57–87. doi:10.1080/15210608709379549.
  20. 1 2 Everitt, B.S. (2002) Cambridge Dictionary of Statistics, CUP. ISBN   0-521-81099-X
  21. 1 2 Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP. ISBN   0-19-920613-9
  22. 1 2 Ash Narayan Sah (2009) Data Analysis Using Microsoft Excel, New Delhi. ISBN   978-81-7446-716-4