In statistics, cluster analysis is the algorithmic grouping of objects into homogeneous groups based on numerical measurements. Model-based clustering [1] bases this on a statistical model for the data, usually a mixture model. This has several advantages, including a principled statistical basis for clustering, and ways to choose the number of clusters, to choose the best clustering model, to assess the uncertainty of the clustering, and to identify outliers that do not belong to any group.
Suppose that for each of $n$ observations we have data on $d$ variables, denoted by $y_i = (y_{i,1}, \ldots, y_{i,d})$ for observation $i$. Then model-based clustering expresses the probability density function of $y_i$ as a finite mixture, or weighted average, of $G$ component probability density functions:
$$p(y_i) = \sum_{g=1}^{G} \tau_g f_g(y_i \mid \theta_g),$$
where $f_g$ is a probability density function with parameter $\theta_g$, and $\tau_g$ is the corresponding mixture probability, with $\sum_{g=1}^{G} \tau_g = 1$. Then in its simplest form, model-based clustering views each component of the mixture model as a cluster, estimates the model parameters, and assigns each observation to the cluster corresponding to its most likely mixture component.
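The assignment rule can be sketched in a few lines of Python. This is a minimal illustration for univariate normal components, not a reference implementation; the function names and the two-component parameter values are assumptions made up for the example.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a univariate normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_probs(y, weights, means, sds):
    """Posterior probability that y came from each mixture component."""
    joint = [w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)  # this is the mixture density evaluated at y
    return [j / total for j in joint]

def assign(y, weights, means, sds):
    """Index of the most likely mixture component for y."""
    probs = posterior_probs(y, weights, means, sds)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical two-component mixture; parameters chosen only for illustration.
weights, means, sds = [0.6, 0.4], [0.0, 5.0], [1.0, 1.0]
labels = [assign(y, weights, means, sds) for y in [-0.3, 0.8, 4.1, 6.0]]
```

The posterior probabilities also quantify the uncertainty of each assignment: an observation whose largest posterior probability is close to 1 is clustered with little ambiguity.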
The most common model for continuous data is that $f_g$ is a multivariate normal distribution with mean vector $\mu_g$ and covariance matrix $\Sigma_g$, so that $\theta_g = (\mu_g, \Sigma_g)$. This defines a Gaussian mixture model. The parameters of the model, $\tau_g$ and $\theta_g$ for $g = 1, \ldots, G$, are typically estimated by maximum likelihood estimation using the expectation-maximization algorithm (EM); see also EM algorithm and GMM model.
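As a rough sketch of how EM estimates such a model, the following Python code fits a univariate Gaussian mixture. It is a simplified illustration under stated assumptions (one dimension, a fixed number of iterations, user-supplied starting means, and made-up toy data), not the procedure used by any particular package.

```python
import math

def normal_pdf(y, mean, var):
    """Density of a univariate normal distribution with the given variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, means, n_iter=100):
    """Fit a univariate Gaussian mixture by EM, starting from the given means."""
    G, n = len(means), len(data)
    weights = [1.0 / G] * G
    overall_mean = sum(data) / n
    # Start every component at the overall variance of the data.
    variances = [sum((y - overall_mean) ** 2 for y in data) / n] * G
    means = list(means)
    for _ in range(n_iter):
        # E step: posterior membership probabilities z[i][g].
        z = []
        for y in data:
            joint = [weights[g] * normal_pdf(y, means[g], variances[g]) for g in range(G)]
            total = sum(joint)
            z.append([j / total for j in joint])
        # M step: weighted updates of proportions, means and variances.
        for g in range(G):
            n_g = sum(z[i][g] for i in range(n))
            weights[g] = n_g / n
            means[g] = sum(z[i][g] * data[i] for i in range(n)) / n_g
            variances[g] = sum(z[i][g] * (data[i] - means[g]) ** 2 for i in range(n)) / n_g
    return weights, means, variances

# Toy data with two well-separated groups (made up for illustration).
data = [0.1, -0.4, 0.3, 0.0, -0.2, 5.2, 4.8, 5.1, 4.7, 5.3]
weights, means, variances = em_gmm_1d(data, means=[-1.0, 6.0])
```

A practical implementation would also monitor the log-likelihood for convergence and guard against components whose variance collapses toward zero.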
Bayesian inference is also often used for inference about finite mixture models. [2] The Bayesian approach also allows for the case where the number of components, $G$, is infinite, using a Dirichlet process prior, yielding a Dirichlet process mixture model for clustering. [3]
An advantage of model-based clustering is that it provides statistically principled ways to choose the number of clusters. Each different choice of the number of groups $G$ corresponds to a different mixture model. Then standard statistical model selection criteria such as the Bayesian information criterion (BIC) can be used to choose $G$. [4] The integrated completed likelihood (ICL) [5] is a different criterion designed to choose the number of clusters rather than the number of mixture components in the model; these will often be different if highly non-Gaussian clusters are present.
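The selection step can be sketched as follows. The code assumes the "larger is better" form of the criterion, BIC = 2 log L - k log n, where k is the number of free parameters and n the number of observations; some references use the opposite sign. The log-likelihood values are hypothetical placeholders, not results from any data set.

```python
import math

def n_free_params(G, d):
    """Free parameters of a G-component Gaussian mixture with unconstrained covariances."""
    # (G - 1) mixing proportions, G mean vectors, G symmetric covariance matrices.
    return (G - 1) + G * d + G * d * (d + 1) // 2

def bic(loglik, n_params, n_obs):
    """BIC in the 'larger is better' form: 2 * loglik - n_params * log(n_obs)."""
    return 2.0 * loglik - n_params * math.log(n_obs)

# Hypothetical maximized log-likelihoods for G = 1..4; not from any real data set.
n_obs, d = 200, 2
logliks = {1: -900.0, 2: -820.0, 3: -790.0, 4: -785.0}
scores = {G: bic(ll, n_free_params(G, d), n_obs) for G, ll in logliks.items()}
best_G = max(scores, key=scores.get)
```

With these placeholder values the penalty term outweighs the small gain in log-likelihood from a fourth component, so three components are selected.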
For data with high dimension, $d$, using a full covariance matrix for each mixture component requires estimation of many parameters, which can result in a loss of precision, generalizability and interpretability. Thus it is common to use more parsimonious component covariance matrices exploiting their geometric interpretation. Gaussian clusters are ellipsoidal, with their volume, shape and orientation determined by the covariance matrix. Consider the eigendecomposition of the covariance matrix
$$\Sigma_g = \lambda_g D_g A_g D_g^T,$$
where $D_g$ is the matrix of eigenvectors of $\Sigma_g$, $A_g = \operatorname{diag}\{A_{1,g}, \ldots, A_{d,g}\}$ is a diagonal matrix whose elements are proportional to the eigenvalues of $\Sigma_g$ in descending order, and $\lambda_g$ is the associated constant of proportionality. Then $\lambda_g$ controls the volume of the ellipsoid, $A_g$ its shape, and $D_g$ its orientation. [6] [7]
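The decomposition of a covariance matrix into a volume constant, a shape matrix and an orientation matrix can be illustrated numerically. The sketch below uses NumPy and assumes one common normalization, in which the shape matrix has determinant 1 so that the volume constant is the geometric mean of the eigenvalues; other normalizations are in use, and the example matrix is arbitrary.

```python
import numpy as np

def decompose_covariance(sigma):
    """Return (lam, A, D) with sigma = lam * D @ A @ D.T and det(A) = 1."""
    evals, evecs = np.linalg.eigh(sigma)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]           # reorder to descending
    evals, D = evals[order], evecs[:, order]
    d = sigma.shape[0]
    lam = float(np.prod(evals) ** (1.0 / d))  # volume: geometric mean of the eigenvalues
    A = np.diag(evals / lam)                  # shape: normalized eigenvalues
    return lam, A, D

# Arbitrary positive-definite matrix used only as an example.
sigma = np.array([[4.0, 1.5], [1.5, 1.0]])
lam, A, D = decompose_covariance(sigma)
reconstructed = lam * D @ A @ D.T
```

Multiplying the three factors back together recovers the original matrix, which is a convenient check that the eigenvalues and eigenvectors were reordered consistently.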
Each of the volume, shape and orientation of the clusters can be constrained to be equal (E) or allowed to vary (V); the shape can also be spherical, with identical eigenvalues, and the orientation can be aligned with the coordinate axes, both denoted by I. This yields 14 possible clustering models, shown in this table:
Model | Description | # Parameters |
---|---|---|
EII | Spherical, equal volume | 1 |
VII | Spherical, varying volume | 9 |
EEI | Diagonal, equal volume & shape | 4 |
VEI | Diagonal, equal shape | 12 |
EVI | Diagonal, equal volume, varying shape | 28 |
VVI | Diagonal, varying volume & shape | 36 |
EEE | Equal | 10 |
VEE | Equal shape & orientation | 18 |
EVE | Equal volume & orientation | 34 |
VVE | Equal orientation | 42 |
EEV | Equal volume & shape | 58 |
VEV | Equal shape | 66 |
EVV | Equal volume | 82 |
VVV | Varying | 90 |
It can be seen that many of these models are more parsimonious, with far fewer covariance parameters than the unconstrained model, which has 90 parameters when $G = 9$ and $d = 4$.
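The parameter counts in the table can be reproduced with a short function. The sketch below counts covariance parameters only (not means or mixing proportions) and takes 9 clusters and 4 variables, values inferred from the table entries themselves; the function name is an illustrative choice.

```python
def covariance_params(model, G, d):
    """Count covariance parameters for a three-letter model code such as 'VEV'."""
    volume, shape, orientation = model
    n_volume = G if volume == "V" else 1
    # Spherical models (shape 'I') have no shape or orientation parameters.
    if shape == "I":
        return n_volume
    n_shape = (d - 1) * (G if shape == "V" else 1)
    # Diagonal models (orientation 'I') are axis-aligned, so no rotation is estimated.
    if orientation == "I":
        return n_volume + n_shape
    n_orient = (d * (d - 1) // 2) * (G if orientation == "V" else 1)
    return n_volume + n_shape + n_orient

models = ["EII", "VII", "EEI", "VEI", "EVI", "VVI", "EEE",
          "VEE", "EVE", "VVE", "EEV", "VEV", "EVV", "VVV"]
counts = {m: covariance_params(m, G=9, d=4) for m in models}
```

Each count is the sum of one volume parameter per distinct volume, $d - 1$ shape parameters per distinct shape, and $d(d-1)/2$ rotation parameters per distinct orientation.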
Several of these models correspond to well-known heuristic clustering methods. For example, k-means clustering is equivalent to estimation of the EII clustering model using the classification EM algorithm. [8] The Bayesian information criterion (BIC) can be used to choose the best clustering model as well as the number of clusters. It can also be used as the basis for a method to choose the variables in the clustering model, eliminating variables that are not useful for clustering. [9] [10]
Different Gaussian model-based clustering methods have been developed with an eye to handling high-dimensional data. These include the pgmm method, [11] which is based on the mixture of factor analyzers model, and the HDclassif method, based on the idea of subspace clustering. [12]
The mixture-of-experts framework extends model-based clustering to include covariates. [13] [14]
We illustrate the method with a dataset consisting of three measurements (glucose, insulin, sspg) on 145 subjects for the purpose of diagnosing diabetes and the type of diabetes present. [15] The subjects were clinically classified into three groups: normal, chemical diabetes and overt diabetes, but we use this information only for evaluating clustering methods, not for classifying subjects.
The BIC plot shows the BIC values for each combination of the number of clusters, $G$, and the clustering model from the Table. Each curve corresponds to a different clustering model. The BIC favors 3 groups, which corresponds to the clinical assessment. It also favors the unconstrained covariance model, VVV. This fits the data well, because the normal patients have low values of both sspg and insulin, while the distributions of the chemical and overt diabetes groups are elongated, but in different directions. Thus the volumes, shapes and orientations of the three groups are clearly different, and so the unconstrained model is appropriate, as selected by the model-based clustering method.
The classification plot shows the classification of the subjects by model-based clustering. The classification was quite accurate, with a 12% error rate relative to the clinical classification. Other well-known clustering methods performed worse, with higher error rates: single-linkage clustering with 46%, average-linkage clustering with 30%, complete-linkage clustering also with 30%, and k-means clustering with 28%.
An outlier in clustering is a data point that does not belong to any of the clusters. One way of modeling outliers in model-based clustering is to include an additional mixture component that is very dispersed, with for example a uniform distribution. [6] [16] Another approach is to replace the multivariate normal densities by $t$-distributions, [17] with the idea that the long tails of the $t$-distribution would ensure robustness to outliers. However, this is not breakdown-robust. [18] A third approach is the "tclust" or data trimming approach [19] which excludes observations identified as outliers when estimating the model parameters.
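The first approach, a dispersed uniform component, can be illustrated with a small sketch. It assumes univariate data, a uniform noise component on an interval that covers all observations, and made-up parameter values; the function names are illustrative.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a univariate normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def noise_probability(y, tau_noise, lo, hi, weights, means, sds):
    """Posterior probability that y belongs to the uniform noise component.

    Assumes lo <= y <= hi, so the uniform density is 1 / (hi - lo).
    """
    noise = tau_noise / (hi - lo)
    clusters = sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))
    return noise / (noise + clusters)

# Hypothetical model: one N(0, 1) cluster with weight 0.95 plus 5% uniform noise on [-10, 10].
args = dict(tau_noise=0.05, lo=-10.0, hi=10.0, weights=[0.95], means=[0.0], sds=[1.0])
p_center = noise_probability(0.0, **args)   # point in the middle of the cluster
p_far = noise_probability(6.0, **args)      # point six standard deviations away
```

A point near the cluster center receives a very small noise probability, while a point far in the tail is attributed almost entirely to the noise component and can be flagged as an outlier.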
Sometimes one or more clusters deviate strongly from the Gaussian assumption. If a Gaussian mixture is fitted to such data, a strongly non-Gaussian cluster will often be represented by several mixture components rather than a single one. In that case, cluster merging can be used to find a better clustering. [20] A different approach is to use mixtures of complex component densities to represent non-Gaussian clusters. [21] [22]
Clustering multivariate categorical data is most often done using the latent class model. This assumes that the data arise from a finite mixture model, where within each cluster the variables are independent.
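Under the within-cluster independence assumption, the likelihood of an observation given a class is simply a product over variables. The sketch below illustrates this for binary variables; the class proportions and response probabilities are hypothetical values chosen for the example.

```python
def class_posteriors(x, class_probs, item_probs):
    """Posterior class probabilities for one binary response vector x.

    item_probs[g][j] is P(x_j = 1 | class g); variables are independent within a class.
    """
    joint = []
    for tau, probs in zip(class_probs, item_probs):
        lik = tau
        for xj, p in zip(x, probs):
            lik *= p if xj == 1 else (1.0 - p)
        joint.append(lik)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class model with three binary variables.
class_probs = [0.5, 0.5]
item_probs = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3]]
post = class_posteriors([1, 1, 0], class_probs, item_probs)
```

The observation is then assigned to the class with the largest posterior probability, exactly as in the continuous case.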
Mixed data arise when variables are of different types, such as continuous, categorical or ordinal data. A latent class model for mixed data assumes local independence between the variables. [23] The location model relaxes the local independence assumption. [24] The clustMD approach assumes that the observed variables are manifestations of underlying continuous Gaussian latent variables. [25]
The simplest model-based clustering approach for multivariate count data is based on finite mixtures with locally independent Poisson distributions, similar to the latent class model. More realistic approaches allow for dependence and overdispersion in the counts. [26] These include methods based on the multivariate Poisson distribution, the multivariate Poisson-log normal distribution, the integer-valued autoregressive (INAR) model and the Gaussian Cox model.
Sequence data consist of sequences of categorical values from a finite set of possibilities, such as life course trajectories. Model-based clustering approaches include group-based trajectory and growth mixture models [27] and a distance-based mixture model. [28]
Rank data arise when individuals rank objects in order of preference. The data are then ordered lists of objects, arising in voting, education, marketing and other areas. Model-based clustering methods for rank data include mixtures of Plackett-Luce models and mixtures of Benter models, [29] [30] and mixtures of Mallows models. [31]
Network data consist of the presence, absence or strength of connections between individuals or nodes, and are widespread in the social sciences and biology. The stochastic blockmodel carries out model-based clustering of the nodes in a network by assuming that there is a latent clustering and that connections are formed independently given the clustering. [32] The latent position cluster model assumes that each node occupies a position in an unobserved latent space, that these positions arise from a mixture of Gaussian distributions, and that presence or absence of a connection is associated with distance in the latent space. [33]
Much of the model-based clustering software is in the form of a publicly and freely available R package. Many of these are listed in the CRAN Task View on Cluster Analysis and Finite Mixture Models. [34] The most used such package is mclust, [35] [36] which is used to cluster continuous data and has been downloaded over 8 million times. [37]
The poLCA package [38] clusters categorical data using the latent class model. The clustMD package [25] clusters mixed data, including continuous, binary, ordinal and nominal variables.
The flexmix package [39] does model-based clustering for a range of component distributions. The mixtools package [40] can cluster different data types. Both flexmix and mixtools implement model-based clustering with covariates.
Model-based clustering was invented in 1950 by Paul Lazarsfeld for clustering multivariate discrete data, in the form of the latent class model. [41]
In 1959, Lazarsfeld gave a lecture on latent structure analysis at the University of California-Berkeley, where John H. Wolfe was an M.A. student. This led Wolfe to think about how to do the same thing for continuous data, and in 1965 he did so, proposing the Gaussian mixture model for clustering. [42] [43] He also produced the first software for estimating it, called NORMIX. Day (1969), working independently, was the first to publish a journal article on the approach. [44] However, Wolfe deserves credit as the inventor of model-based clustering for continuous data.
Murtagh and Raftery (1984) developed a model-based clustering method based on the eigenvalue decomposition of the component covariance matrices. [45] McLachlan and Basford (1988) was the first book on the approach, advancing methodology and sparking interest. [46] Banfield and Raftery (1993) coined the term "model-based clustering", introduced the family of parsimonious models, described an information criterion for choosing the number of clusters, proposed the uniform model for outliers, and introduced the mclust software. [6] Celeux and Govaert (1995) showed how to perform maximum likelihood estimation for the models. [7] Thus, by 1995 the core components of the methodology were in place, laying the groundwork for extensive development since then.