Median polish

The median polish is a simple and robust exploratory data analysis procedure proposed by the statistician John Tukey. Its purpose is to fit an additive model to data in a two-way layout table (usually, results from a factorial experiment), decomposing each cell into an overall median plus a row effect plus a column effect.

Median polish uses the medians obtained from the rows and the columns of a two-way table to iteratively calculate the row and column effects on the data. Because the iterative procedure uses medians rather than means, the results are resistant to outliers.

Model for a two-way table

Suppose an experiment observes the variable Y under the influence of two other variables. We can arrange the data in a two-way table in which one variable is constant along the rows and the other is constant along the columns. Let i and j denote the positions of the rows and columns (e.g. $y_{ij}$ denotes the value of Y in the ith row and jth column). Then we can write a simple linear regression equation:

$y_{ij} = b_0 + b_1 x_i + b_2 z_j + \varepsilon_{ij}$

where $b_0$, $b_1$, $b_2$ are constants, $x_i$ and $z_j$ are values associated with the rows and columns, respectively, and $\varepsilon_{ij}$ is the residual.

The equation can be further simplified if no $x_i$ and $z_j$ values are available for the analysis:

$y_{ij} = m + c_i + d_j + \varepsilon_{ij}$

where $m$ denotes the overall effect (the fitted overall median), and $c_i$ and $d_j$ denote the row and column effects, respectively.
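As a small worked illustration (the numbers here are hypothetical, chosen only to keep the arithmetic short), the table

$y = \begin{pmatrix} 1 & 3 \\ 2 & 8 \end{pmatrix}$

admits the additive fit

$m = 3.5, \quad (c_1, c_2) = (-1.5, 1.5), \quad (d_1, d_2) = (-2, 2), \quad \varepsilon = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix},$

so that, for example, $y_{22} = m + c_2 + d_2 + \varepsilon_{22} = 3.5 + 1.5 + 2 + 1 = 8$. This fit is exactly what the procedure described in the next section produces.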

Procedure

To carry out median polish:

(1) Find the median of each row, then find the median of those row medians and record it as the overall effect.

(2) Subtract the row median from each element in its row; do this for every row.

(3) Subtract the overall effect from each row median.

(4) Do the same for the columns of the residual table, and add the overall effect obtained from the column operations to the overall effect obtained from the row operations.

(5) Repeat steps (1)–(4) until the changes in the row and column medians are negligible; a sketch of the full procedure in code follows.
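A minimal sketch of this procedure in Python (assuming NumPy is available; the function name median_polish, its arguments, and the convergence rule are illustrative choices, not a standard API):

    import numpy as np

    def median_polish(table, max_iter=10, tol=1e-6):
        """Fit overall + row + column effects to a two-way table by median polish.

        Returns (overall, row_eff, col_eff, residuals) such that
        table[i, j] ~= overall + row_eff[i] + col_eff[j] + residuals[i, j].
        """
        residuals = np.asarray(table, dtype=float).copy()
        n_rows, n_cols = residuals.shape
        overall = 0.0
        row_eff = np.zeros(n_rows)
        col_eff = np.zeros(n_cols)

        for _ in range(max_iter):
            # Row sweep (steps 1-3): remove row medians and fold the median
            # of the accumulated column effects into the overall effect.
            row_med = np.median(residuals, axis=1)
            residuals -= row_med[:, np.newaxis]
            row_eff += row_med
            shift = np.median(col_eff)
            col_eff -= shift
            overall += shift

            # Column sweep (step 4): the same operations applied to columns.
            col_med = np.median(residuals, axis=0)
            residuals -= col_med[np.newaxis, :]
            col_eff += col_med
            shift = np.median(row_eff)
            row_eff -= shift
            overall += shift

            # Step 5: stop once the row and column medians barely change.
            if np.abs(row_med).sum() + np.abs(col_med).sum() < tol:
                break

        return overall, row_eff, col_eff, residuals

For the hypothetical 2×2 table used above, median_polish([[1, 3], [2, 8]]) returns an overall effect of 3.5, row effects (-1.5, 1.5), column effects (-2, 2), and the residual table [[1, -1], [-1, 1]], matching the fit worked out by hand.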

