Projection pursuit

Projection pursuit (PP) is a statistical technique that involves finding the most "interesting" possible projections of multidimensional data. Often, projections that deviate more from a normal distribution are considered more interesting. As each projection is found, the data are reduced by removing the component along that projection, and the process is repeated to find new projections; this is the "pursuit" aspect that motivated the technique known as matching pursuit. [1] [2]

The idea of projection pursuit is to locate the projection or projections from high-dimensional space to low-dimensional space that reveal the most detail about the structure of the data set. Once an interesting set of projections has been found, existing structures (clusters, surfaces, etc.) can be extracted and analyzed separately.
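In symbols: if each observation is a vector x in R^p, a one-dimensional projection along a unit vector w is the scalar w^T x, and projection pursuit chooses w to maximize a projection index I that scores how "interesting" the projected data are. A standard formalization (the specific index varies by method) is

    \[ w^{\ast} \;=\; \underset{\lVert w \rVert = 1}{\arg\max}\; I\!\left(w^{\top} x_{1}, \ldots, w^{\top} x_{n}\right) \]

Two-dimensional versions search over pairs of orthonormal directions in the same way.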

Projection pursuit has been widely used for blind source separation, so it is very important in independent component analysis. Projection pursuit seeks one projection at a time such that the extracted signal is as non-Gaussian as possible. [3]
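As a concrete illustration, the sketch below implements this one-at-a-time scheme in Python, assuming kurtosis as the projection index and the fixed-point update popularized by FastICA as the optimizer; the function names, defaults, and tolerances are illustrative choices, not a reference implementation.

    import numpy as np

    def whiten(X):
        """Center and whiten X (n_samples, d) so its covariance is the identity."""
        Xc = X - X.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        return Xc @ vecs / np.sqrt(vals)

    def projection_pursuit(X, n_directions, n_iter=200, seed=0):
        """Greedily extract maximally non-Gaussian projections from whitened X.

        Each direction maximizes the absolute excess kurtosis of X @ w via a
        fixed-point update; Gram-Schmidt deflation keeps each new direction
        orthogonal to those already found, so the component along each
        "interesting" projection is removed before the next search.
        """
        rng = np.random.default_rng(seed)
        directions = []
        for _ in range(n_directions):
            w = rng.standard_normal(X.shape[1])
            w /= np.linalg.norm(w)
            for _ in range(n_iter):
                s = X @ w                                    # current projection
                # Kurtosis-based fixed-point step (pow3 nonlinearity):
                # w <- E[x (w^T x)^3] - 3 w, valid for whitened data.
                w_new = (X * s[:, None] ** 3).mean(axis=0) - 3.0 * w
                for u in directions:                         # deflation step
                    w_new -= (w_new @ u) * u
                w_new /= np.linalg.norm(w_new)
                converged = abs(w_new @ w) > 1.0 - 1e-10     # equal up to sign
                w = w_new
                if converged:
                    break
            directions.append(w)
        return np.array(directions)

For example, with X = whiten(S @ A.T), where S holds two independent uniformly distributed (hence sub-Gaussian) sources and A is a random mixing matrix, projection_pursuit(X, 2) recovers the sources up to sign, permutation, and scale.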

History

The projection pursuit technique was originally proposed and experimented with by Kruskal. [4] Related ideas occur in Switzer (1970), "Numerical classification", pp. 31–43 in Computer Applications in the Earth Sciences: Geostatistics, and in Switzer and Wright (1971), "Numerical classification of eocene nummulitids", Mathematical Geology, pp. 297–311. The first successful implementation is due to Jerome H. Friedman and John Tukey (1974), who coined the name "projection pursuit".

The original purpose of projection pursuit was to machine-pick "interesting" low-dimensional projections of a high-dimensional point cloud by numerically maximizing a certain objective function or projection index. [5]

Several years later, Friedman and Stuetzle extended the idea behind projection pursuit, adding projection pursuit regression (PPR), projection pursuit classification (PPC), and projection pursuit density estimation (PPDE).
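For reference, PPR (the best known of these extensions) approximates the regression function as a sum of smooth "ridge functions" of projections, with both the directions and the functions estimated from the data:

    \[ \hat{f}(x) \;=\; \sum_{j=1}^{M} g_{j}\!\left(\beta_{j}^{\top} x\right) \]

where each \beta_j is a projection direction, each g_j is a data-estimated smooth function, and the number of terms M is chosen by the analyst or by a stopping rule.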

Features

The most exciting feature of projection pursuit is that it is one of the very few multivariate methods able to bypass the "curse of dimensionality" caused by the fact that high-dimensional space is mostly empty. In addition, projection pursuit is able to ignore irrelevant (i.e. noisy and information-poor) variables. This is a distinct advantage over methods based on interpoint distances, such as minimum spanning trees, multidimensional scaling, and most clustering techniques.

Many of the methods of classical multivariate analysis turn out to be special cases of projection pursuit. Examples are principal component analysis and discriminant analysis, and the quartimax and oblimax methods in factor analysis.
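As an illustration of this correspondence (these are standard identities, not specific to any one reference): choosing the variance of the projected data as the index,

    \[ I_{\text{PCA}}(w) \;=\; \operatorname{Var}\!\left(w^{\top} x\right), \]

and maximizing over unit vectors w recovers the leading principal component, while choosing the Fisher criterion

    \[ I_{\text{LDA}}(w) \;=\; \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w} \]

(with between-class and within-class scatter matrices S_B and S_W) recovers the leading discriminant direction.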

One serious drawback of projection pursuit methods is their high computational cost: the projection index generally has no closed-form optimum and must be maximized numerically over many candidate directions.

Related Research Articles

Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.

Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions. Nonparametric statistics is based on either being distribution-free or having a specified distribution but with the distribution's parameters unspecified. Nonparametric statistics includes both descriptive statistics and statistical inference. Nonparametric tests are often used when the assumptions of parametric tests are violated.

Chemometrics is the science of extracting information from chemical systems by data-driven means. Chemometrics is inherently interdisciplinary, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to address problems in chemistry, biochemistry, medicine, biology and chemical engineering. In this way, it mirrors other interdisciplinary fields, such as psychometrics and econometrics.

In mathematics, a time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.

John Wilder Tukey was an American mathematician and statistician, best known for the development of the fast Fourier transform (FFT) algorithm and the box plot. The Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller–Tukey lemma all bear his name. He is also credited with coining the term "bit" and with the first published use of the word "software".

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling task, and it thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and making transformations of variables as needed. EDA encompasses IDA.

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.

Partial least squares regression is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares discriminant analysis (PLS-DA) is a variant used when Y is categorical.

In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern recognition, classification and regression. Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition. The concept of "feature" is related to that of explanatory variable used in statistical techniques such as linear regression.

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

The Unistat computer program is a statistical data analysis tool featuring two modes of operation: The stand-alone user interface is a complete workbench for data input, analysis and visualization while the Microsoft Excel add-in mode extends the features of the mainstream spreadsheet application with powerful analytical capabilities.

Sammon mapping or Sammon projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality by trying to preserve the structure of inter-point distances in high-dimensional space in the lower-dimension projection.

In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the sample size, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking.

Multilinear subspace learning is an approach to dimensionality reduction that operates on a data tensor, formed either from observations that have been vectorized or from observations that are matrices concatenated together. Examples of such data tensors include images (2D/3D), video sequences (3D/4D), and hyperspectral cubes (3D/4D).

Targeted projection pursuit is a type of statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore very complex data to find features or patterns of potential interest.

Jacqueline Meulman is a Dutch statistician and professor emerita of Applied Statistics at the Mathematical Institute of Leiden University.

Jerome Harold Friedman is an American statistician, consultant and Professor of Statistics at Stanford University, known for his contributions in the field of statistics and data mining.

The Grand Tour is a technique developed by Daniel Asimov in 1985, which is used to explore multivariate statistical data by means of an animation. The animation, or "movie", consists of a series of distinct views of the data as seen from different directions, displayed on a computer screen, that appear to change continuously and that get closer and closer to all possible views. This allows a human- or computer-based evaluation of these views, with the goal of detecting patterns that will convey useful information about the data.

The following outline is provided as an overview of and topical guide to machine learning. Machine learning is a subfield of soft computing within computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.

References

  1. Friedman, J. H.; Tukey, J. W. (September 1974). "A Projection Pursuit Algorithm for Exploratory Data Analysis" (PDF). IEEE Transactions on Computers. C-23 (9): 881–890. doi:10.1109/T-C.1974.224051. ISSN 0018-9340.
  2. Jones, M. C.; Sibson, R. (1987). "What is Projection Pursuit?". Journal of the Royal Statistical Society, Series A. 150 (1): 1–37. doi:10.2307/2981662. JSTOR 2981662.
  3. Stone, James V. (2004). Independent Component Analysis: A Tutorial Introduction. Cambridge, Massachusetts: The MIT Press. ISBN 0-262-69315-1.
  4. Kruskal, J. B. (1969). "Toward a practical method which helps uncover the structure of a set of observations by finding the line transformation which optimizes a new 'index of condensation'". In Milton, R. C.; Nelder, J. A. (eds.), Statistical Computation. New York: Academic Press, pp. 427–440.
  5. Huber, P. J. (June 1985). "Projection pursuit". The Annals of Statistics. 13 (2): 435–475. doi:10.1214/aos/1176349519.