# Design matrix

Last updated

In statistics and in particular in regression analysis, a design matrix, also known as model matrix or regressor matrix and often denoted by X, is a matrix of values of explanatory variables of a set of objects. Each row represents an individual object, with the successive columns corresponding to the variables and their specific values for that object. The design matrix is used in certain statistical models, e.g., the general linear model. [1] [2] [3] It can contain indicator variables (ones and zeros) that indicate group membership in an ANOVA, or it can contain values of continuous variables.

## Contents

The design matrix contains data on the independent variables (also called explanatory variables) in statistical models which attempt to explain observed data on a response variable (often called a dependent variable) in terms of the explanatory variables. The theory relating to such models makes substantial use of matrix manipulations involving the design matrix: see for example linear regression. A notable feature of the concept of a design matrix is that it is able to represent a number of different experimental designs and statistical models, e.g., ANOVA, ANCOVA, and linear regression.[ citation needed ]

## Definition

The design matrix is defined to be a matrix ${\displaystyle X}$ such that ${\displaystyle X_{ij}}$ (the jth column of the ith row of ${\displaystyle X}$) represents the value of the jth variable associated with the ith object.

A regression model which is a linear combination of the explanatory variables may therefore be represented via matrix multiplication as

${\displaystyle y=X\beta ,}$

where X is the design matrix, ${\displaystyle \beta }$ is a vector of the model's coefficients (one for each variable), and y is the vector of predicted outputs for each object.

## Size

The matrix of data has dimension n-by-p, where n is the number of samples observed, and p is the number of variables (features) measured in all samples. [4] [5]

In this representation different rows typically represent different repetitions of an experiment, while columns represent different types of data (say, the results from particular probes). For example, suppose an experiment is run where 10 people are pulled off the street and asked four questions. The data matrix M would be a 10×4 matrix (meaning 10 rows and 4 columns). The datum in row i and column j of this matrix would be the answer of the ith person to the jth question.

## Examples

### Arithmetic mean

The design matrix for an arithmetic mean is a column vector of ones.

### Simple linear regression

This section gives an example of simple linear regression—that is, regression with only a single explanatory variable—with seven observations. The seven data points are {yi, xi}, for i = 1, 2, …, 7. The simple linear regression model is

${\displaystyle y_{i}=\beta _{0}+\beta _{1}x_{i}+\varepsilon _{i},\,}$

where ${\displaystyle \beta _{0}}$ is the y-intercept and ${\displaystyle \beta _{1}}$ is the slope of the regression line. This model can be represented in matrix form as

${\displaystyle {\begin{bmatrix}y_{1}\\y_{2}\\y_{3}\\y_{4}\\y_{5}\\y_{6}\\y_{7}\end{bmatrix}}={\begin{bmatrix}1&x_{1}\\1&x_{2}\\1&x_{3}\\1&x_{4}\\1&x_{5}\\1&x_{6}\\1&x_{7}\end{bmatrix}}{\begin{bmatrix}\beta _{0}\\\beta _{1}\end{bmatrix}}+{\begin{bmatrix}\varepsilon _{1}\\\varepsilon _{2}\\\varepsilon _{3}\\\varepsilon _{4}\\\varepsilon _{5}\\\varepsilon _{6}\\\varepsilon _{7}\end{bmatrix}}}$

where the first column of 1s in the design matrix allows estimation of the y-intercept while the second column contains the x-values associated with the corresponding y-values.

### Multiple regression

This section contains an example of multiple regression with two covariates (explanatory variables): w and x. Again suppose that the data consist of seven observations, and that for each observed value to be predicted (${\displaystyle y_{i}}$), values wi and xi of the two covariates are also observed. The model to be considered is

${\displaystyle y_{i}=\beta _{0}+\beta _{1}w_{i}+\beta _{2}x_{i}+\varepsilon _{i}}$

This model can be written in matrix terms as

${\displaystyle {\begin{bmatrix}y_{1}\\y_{2}\\y_{3}\\y_{4}\\y_{5}\\y_{6}\\y_{7}\end{bmatrix}}={\begin{bmatrix}1&w_{1}&x_{1}\\1&w_{2}&x_{2}\\1&w_{3}&x_{3}\\1&w_{4}&x_{4}\\1&w_{5}&x_{5}\\1&w_{6}&x_{6}\\1&w_{7}&x_{7}\end{bmatrix}}{\begin{bmatrix}\beta _{0}\\\beta _{1}\\\beta _{2}\end{bmatrix}}+{\begin{bmatrix}\varepsilon _{1}\\\varepsilon _{2}\\\varepsilon _{3}\\\varepsilon _{4}\\\varepsilon _{5}\\\varepsilon _{6}\\\varepsilon _{7}\end{bmatrix}}}$

Here the 7×3 matrix on the right side is the design matrix.

### One-way ANOVA (cell means model)

This section contains an example with a one-way analysis of variance (ANOVA) with three groups and seven observations. The given data set has the first three observations belonging to the first group, the following two observations belonging to the second group and the final two observations belonging to the third group. If the model to be fit is just the mean of each group, then the model is

${\displaystyle y_{ij}=\mu _{i}+\varepsilon _{ij}}$

which can be written

${\displaystyle {\begin{bmatrix}y_{1}\\y_{2}\\y_{3}\\y_{4}\\y_{5}\\y_{6}\\y_{7}\end{bmatrix}}={\begin{bmatrix}1&0&0\\1&0&0\\1&0&0\\0&1&0\\0&1&0\\0&0&1\\0&0&1\end{bmatrix}}{\begin{bmatrix}\mu _{1}\\\mu _{2}\\\mu _{3}\end{bmatrix}}+{\begin{bmatrix}\varepsilon _{1}\\\varepsilon _{2}\\\varepsilon _{3}\\\varepsilon _{4}\\\varepsilon _{5}\\\varepsilon _{6}\\\varepsilon _{7}\end{bmatrix}}}$

In this model ${\displaystyle \mu _{i}}$ represents the mean of the ${\displaystyle i}$th group.

### One-way ANOVA (offset from reference group)

The ANOVA model could be equivalently written as each group parameter ${\displaystyle \tau _{i}}$ being an offset from some overall reference. Typically this reference point is taken to be one of the groups under consideration. This makes sense in the context of comparing multiple treatment groups to a control group and the control group is considered the "reference". In this example, group 1 was chosen to be the reference group. As such the model to be fit is

${\displaystyle y_{ij}=\mu +\tau _{i}+\varepsilon _{ij}}$

with the constraint that ${\displaystyle \tau _{1}}$ is zero.

${\displaystyle {\begin{bmatrix}y_{1}\\y_{2}\\y_{3}\\y_{4}\\y_{5}\\y_{6}\\y_{7}\end{bmatrix}}={\begin{bmatrix}1&0&0\\1&0&0\\1&0&0\\1&1&0\\1&1&0\\1&0&1\\1&0&1\end{bmatrix}}{\begin{bmatrix}\mu \\\tau _{2}\\\tau _{3}\end{bmatrix}}+{\begin{bmatrix}\varepsilon _{1}\\\varepsilon _{2}\\\varepsilon _{3}\\\varepsilon _{4}\\\varepsilon _{5}\\\varepsilon _{6}\\\varepsilon _{7}\end{bmatrix}}}$

In this model ${\displaystyle \mu }$ is the mean of the reference group and ${\displaystyle \tau _{i}}$ is the difference from group ${\displaystyle i}$ to the reference group. ${\displaystyle \tau _{1}}$ is not included in the matrix because its difference from the reference group (itself) is necessarily zero.

## Related Research Articles

In statistics, the Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. The errors do not need to be normal, nor do they need to be independent and identically distributed. The requirement that the estimator be unbiased cannot be dropped, since biased estimators exist with lower variance. See, for example, the James–Stein estimator, ridge regression, or simply any degenerate estimator.

In statistics, the logistic model is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1, with a sum of one.

In statistics, a sequence of random variables is homoscedastic if all its random variables have the same finite variance. This is also known as homogeneity of variance. The complementary notion is called heteroscedasticity. The spellings homoskedasticity and heteroskedasticity are also frequently used.

In mathematics, a Lie superalgebra is a generalisation of a Lie algebra to include a Z2-grading. Lie superalgebras are important in theoretical physics where they are used to describe the mathematics of supersymmetry. In most of these theories, the even elements of the superalgebra correspond to bosons and odd elements to fermions.

In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the given dataset and those predicted by the linear function of the independent variable.

In differential geometry, a tensor density or relative tensor is a generalization of the tensor field concept. A tensor density transforms as a tensor field when passing from one coordinate system to another, except that it is additionally multiplied or weighted by a power W of the Jacobian determinant of the coordinate transition function or its absolute value. A distinction is made among (authentic) tensor densities, pseudotensor densities, even tensor densities and odd tensor densities. Sometimes tensor densities with a negative weight W are called tensor capacity. A tensor density can also be regarded as a section of the tensor product of a tensor bundle with a density bundle.

Functional data analysis (FDA) is a branch of statistics that analyzes data providing information about curves, surfaces or anything else varying over a continuum. In its most general form, under an FDA framework each sample element is considered to be a function. The physical continuum over which these functions are defined is often time, but may also be spatial location, wavelength, probability, etc.

In statistics, the ordered logit model is an ordinal regression model—that is, a regression model for ordinal dependent variables—first considered by Peter McCullagh. For example, if one question on a survey is to be answered by a choice among "poor", "fair", "good", and "excellent", and the purpose of the analysis is to see how well that response can be predicted by the responses to other questions, some of which may be quantitative, then ordered logistic regression may be used. It can be thought of as an extension of the logistic regression model that applies to dichotomous dependent variables, allowing for more than two (ordered) response categories.

In statistics, binomial regression is a regression analysis technique in which the response has a binomial distribution: it is the number of successes in a series of independent Bernoulli trials, where each trial has probability of success . In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.

The Newman–Penrose (NP) formalism is a set of notation developed by Ezra T. Newman and Roger Penrose for general relativity (GR). Their notation is an effort to treat general relativity in terms of spinor notation, which introduces complex forms of the usual variables used in GR. The NP formalism is itself a special case of the tetrad formalism, where the tensors of the theory are projected onto a complete vector basis at each point in spacetime. Usually this vector basis is chosen to reflect some symmetry of the spacetime, leading to simplified expressions for physical observables. In the case of the NP formalism, the vector basis chosen is a null tetrad: a set of four null vectors—two real, and a complex-conjugate pair. The two real members asymptotically point radially inward and radially outward, and the formalism is well adapted to treatment of the propagation of radiation in curved spacetime. The Weyl scalars, derived from the Weyl tensor, are often used. In particular, it can be shown that one of these scalars— in the appropriate frame—encodes the outgoing gravitational radiation of an asymptotically flat system.

In statistics, Bayesian linear regression is an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference. When the regression model has errors that have a normal distribution, and if a particular form of prior distribution is assumed, explicit results are available for the posterior probability distributions of the model's parameters.

The hat operator is a mathematical notation with various uses in different branches of science and mathematics.

Viscoplasticity is a theory in continuum mechanics that describes the rate-dependent inelastic behavior of solids. Rate-dependence in this context means that the deformation of the material depends on the rate at which loads are applied. The inelastic behavior that is the subject of viscoplasticity is plastic deformation which means that the material undergoes unrecoverable deformations when a load level is reached. Rate-dependent plasticity is important for transient plasticity calculations. The main difference between rate-independent plastic and viscoplastic material models is that the latter exhibit not only permanent deformations after the application of loads but continue to undergo a creep flow as a function of time under the influence of the applied load.

A fiber-reinforced composite (FRC) is a composite building material that consists of three components:

1. the fibers as the discontinuous or dispersed phase,
2. the matrix as the continuous phase, and
3. the fine interphase region, also known as the interface.

A Sommerfeld expansion is an approximation method developed by Arnold Sommerfeld for a certain class of integrals which are common in condensed matter and statistical physics. Physically, the integrals represent statistical averages using the Fermi–Dirac distribution.

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

In statistics and in machine learning, a linear predictor function is a linear function of a set of coefficients and explanatory variables, whose value is used to predict the outcome of a dependent variable. This sort of function usually comes in linear regression, where the coefficients are called regression coefficients. However, they also occur in various types of linear classifiers, as well as in various other models, such as principal component analysis and factor analysis. In many of these models, the coefficients are referred to as "weights".

In electromagnetism, a branch of fundamental physics, the matrix representations of the Maxwell's equations are a formulation of Maxwell's equations using matrices, complex numbers, and vector calculus. These representations are for a homogeneous medium, an approximation in an inhomogeneous medium. A matrix representation for an inhomogeneous medium was presented using a pair of matrix equations. A single equation using 4 × 4 matrices is necessary and sufficient for any homogeneous medium. For an inhomogeneous medium it necessarily requires 8 × 8 matrices.

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

## References

1. Everitt, B. S. (2002). Cambridge Dictionary of Statistics (2nd ed.). Cambridge, UK: Cambridge University Press. ISBN   0-521-81099-X.
2. Box, G. E. P.; Tiao, G. C. (1992) [1973]. Bayesian Inference in Statistical Analysis. New York: John Wiley and Sons. ISBN   0-471-57428-7. (Section 8.1.1)
3. Timm, Neil H. (2007). Applied Multivariate Analysis. Springer Science & Business Media. p. 107.
4. Johnson, Richard A; Wichern, Dean W (2001). Applied Multivariate Statistical Analysis. Pearson. pp. 111–112. ISBN   0131877151.
5. "Basic Concepts for Multivariate Statistics p.2" (PDF). SAS Institute.
• Verbeek, Albert (1984). "The Geometry of Model Selection in Regression". In Dijkstra, Theo K. (ed.). Misspecification Analysis. New York: Springer. pp. 20–36. ISBN   0-387-13893-5.