Cluster-weighted modeling

In data mining, cluster-weighted modeling (CWM) is an algorithm-based approach to non-linear prediction of outputs (dependent variables) from inputs (independent variables), based on density estimation using a set of models (clusters), each of which is notionally appropriate in a sub-region of the input space. The overall approach works in the joint input-output space, and an initial version was proposed by Neil Gershenfeld. [1] [2]

Basic form of model

The procedure for cluster-weighted modeling of an input-output problem can be outlined as follows. [2] In order to construct predicted values for an output variable y from an input variable x, the modeling and calibration procedure arrives at a joint probability density function, p(y,x). Here the "variables" might be univariate, multivariate, or time series. For convenience, model parameters are not indicated in the notation here, and several different treatments of them are possible, including setting them to fixed values as a step in the calibration or treating them using a Bayesian analysis. The required predicted values are obtained by constructing the conditional probability density p(y|x), from which the prediction using the conditional expected value can be obtained, with the conditional variance providing an indication of uncertainty.
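
Written out in the same notation, the prediction step just described is a direct application of the standard identities:

    p(y \mid x) = \frac{p(y,x)}{\int p(y,x)\,dy}, \qquad
    \hat{y}(x) = \mathrm{E}[y \mid x] = \int y\, p(y \mid x)\, dy, \qquad
    \mathrm{Var}(y \mid x) = \int \bigl(y - \hat{y}(x)\bigr)^2\, p(y \mid x)\, dy.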

The important step of the modeling is that p(y,x) is assumed to take the following form, as a mixture model:

    p(y,x) = \sum_{j=1}^{n} w_j\, p_j(y,x),

where n is the number of clusters and {wj} are weights that sum to one. The functions pj(y,x) are joint probability density functions that relate to each of the n clusters. These functions are modeled using a decomposition into a conditional and a marginal density:

    p_j(y,x) = p_j(y \mid x)\, p_j(x),

where:

  • pj(y|x) is a model for predicting y given x, and given that the input-output pair should be associated with cluster j on the basis of the value of x. This model might be a regression model in the simplest cases.
  • pj(x) is formally a density for values of x, given that the input-output pair should be associated with cluster j. The relative sizes of these functions across the clusters determine whether a particular value of x is associated with any given cluster center. This density might be a Gaussian function centered at a parameter representing the cluster center.

As in regression analysis, it is important to consider preliminary data transformations as part of the overall modeling strategy if the core components of the model are to be simple regression models for the cluster-wise conditional densities pj(y|x) and normal distributions for the cluster-weighting densities pj(x).
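
To make the pieces above concrete, the following is a minimal sketch of the prediction step under those simple choices: Gaussian cluster-weighting densities pj(x) and local linear models for the conditional expectation of pj(y|x). It is a sketch rather than a reference implementation; the function names (gaussian_pdf, cwm_predict), the restriction to scalar x, and the example parameter values are all illustrative assumptions, and the calibration step that would produce these parameters is not shown.

import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of the univariate normal N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def cwm_predict(x, weights, mus, input_vars, slopes, intercepts):
    """Cluster-weighted prediction E[y | x] for a scalar input x.

    Assumed per-cluster components (illustrative, following the text):
      p_j(x)     = N(x; mus[j], input_vars[j])    -- cluster-weighting density
      E_j[y | x] = slopes[j] * x + intercepts[j]  -- local linear model
    """
    # Unnormalized responsibility of each cluster for this input: w_j * p_j(x).
    resp = weights * gaussian_pdf(x, mus, input_vars)
    resp = resp / resp.sum()
    # The prediction is the responsibility-weighted mix of the local models.
    return float(np.dot(resp, slopes * x + intercepts))

# Example: two clusters centred at x = -2 and x = +2, each with its own line.
weights = np.array([0.5, 0.5])
mus = np.array([-2.0, 2.0])
input_vars = np.array([1.0, 1.0])
slopes = np.array([1.0, -1.0])
intercepts = np.array([0.0, 4.0])
print(cwm_predict(0.0, weights, mus, input_vars, slopes, intercepts))  # 2.0

The same responsibilities would also yield the conditional variance if each local model carried its own noise variance; in practice, all of these parameters are typically calibrated together with an expectation-maximization-style procedure over the joint density.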

General versions

The basic CWM algorithm gives a single output cluster for each input cluster. However, CWM can be extended so that multiple output clusters are associated with the same input cluster. [3] Each cluster in CWM is localized to a Gaussian input region, which contains its own trainable local model. [4] CWM is recognized as a versatile inference algorithm offering simplicity, generality, and flexibility; even when a feedforward layered network might be preferred, it is sometimes used as a "second opinion" on the nature of the training problem. [5]

The original form proposed by Gershenfeld describes two innovations:

  • Enabling CWM to work with continuous streams of data
  • Addressing the problem of local minima encountered by the CWM parameter-adjustment process [5]

CWM can be used to classify media in printer applications, using at least two parameters to generate an output that has a joint dependency on the input parameters. [6]

References

  1. Gershenfeld, N. (1997). "Nonlinear Inference and Cluster-Weighted Modeling". Annals of the New York Academy of Sciences. 808: 18–24. Bibcode:1997NYASA.808...18G. doi:10.1111/j.1749-6632.1997.tb51651.x. S2CID 85736539.
  2. Gershenfeld, N.; Schoner, B.; Metois, E. (1999). "Cluster-weighted modelling for time-series analysis". Nature. 397 (6717): 329–332. Bibcode:1999Natur.397..329G. doi:10.1038/16873. S2CID 204990873.
  3. Feldkamp, L.A.; Prokhorov, D.V.; Feldkamp, T.M. (2001). "Cluster-weighted modeling with multiclusters". IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222). Vol. 3. pp. 1710–1714. doi:10.1109/IJCNN.2001.938419. ISBN 0-7803-7044-9. S2CID 60819260.
  4. Boyden, Edward S. "Tree-based Cluster Weighted Modeling: Towards A Massively Parallel Real-Time Digital Stradivarius" (PDF). Cambridge, MA: MIT Media Lab.
  5. Prokhorov, Danil V.; Feldkamp, Lee A.; Feldkamp, Timothy M. "A New Approach to Cluster-Weighted Modeling" (PDF). Dearborn, MI: Ford Research Laboratory.
  6. Gao, Jun; Allen, Ross R. (2003-07-24). "Cluster-Weighted Modeling for Media Classification". Palo Alto, CA: World Intellectual Property Organization. Archived from the original on 2012-12-12.