Gower's distance

Last updated

In statistics, Gower's distance between two mixed-type objects is a similarity measure that can handle different types of data within the same dataset and is particularly useful in cluster analysis or other multivariate statistical techniques. Data can be binary, ordinal, or continuous variables. It works by normalizing the differences between each pair of variables and then computing a weighted average of these differences. The distance was defined in 1971 by Gower [1] and it takes values between 0 and 1 with smaller values indicating higher similarity.

Contents

Definition

For two objects and having descriptors, the similarity is defined as:

where the are non-negative weights usually set to [2] and is the similarity between the two objects regarding their -th variable. If the variable is binary or ordinal, the values of are 0 or 1, with 1 denoting equality. If the variable is continuous, with being the range of -th variable and thus ensuring . As a result, the overall similarity between two objects is the weighted average of the similarities calculated for all their descriptors. [3]

In its original exposition, the distance does not treat ordinal variables in a special manner. In the 1990s, first Kaufman and Rousseeuw [4] and later Podani [5] suggested extensions where the ordering of an ordinal feature is used. For example, Podani obtains relative rank differences as with being the ranks corresponding to the ordered categories of the -th variable.

Software implementations

Many programming languages and statistical packages, such as R, Python, etc., include implementations of Gower's distance. The implementations may follow Kaufmann and Rousseeuw's extensions, which change the similarity for continuous variables to [6]


Language/programFunctionRef.
R StatMatch::gower.dist(X)
Python gower.gower_matrix(X)

Related Research Articles

<span class="mw-page-title-main">Legendre polynomials</span> System of complete and orthogonal polynomials

In mathematics, Legendre polynomials, named after Adrien-Marie Legendre (1782), are a system of complete and orthogonal polynomials with a wide number of mathematical properties and numerous applications. They can be defined in many ways, and the various definitions highlight different aspects as well as suggest generalizations and connections to different mathematical structures and physical and numerical applications.

<span class="mw-page-title-main">Helmholtz free energy</span> Thermodynamic potential

In thermodynamics, the Helmholtz free energy is a thermodynamic potential that measures the useful work obtainable from a closed thermodynamic system at a constant temperature (isothermal). The change in the Helmholtz energy during a process is equal to the maximum amount of work that the system can perform in a thermodynamic process in which temperature is held constant. At constant temperature, the Helmholtz free energy is minimized at equilibrium.

<span class="mw-page-title-main">Multidimensional scaling</span> Set of related ordination techniques used in information visualization

Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a data set. MDS is used to translate distances between each pair of objects in a set into a configuration of points mapped into an abstract Cartesian space.

In data mining and statistics, hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:

In statistics, propagation of uncertainty is the effect of variables' uncertainties on the uncertainty of a function based on them. When the variables are the values of experimental measurements they have uncertainties due to measurement limitations which propagate due to the combination of variables in the function.

In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. Though, in more broad terms, a similarity function may also satisfy metric axioms.

<span class="mw-page-title-main">Jaccard index</span> Measure of similarity and diversity between sets

The Jaccard index is a statistic used for gauging the similarity and diversity of sample sets. It is defined in general taking the ratio of two sizes, the intersection size divided by the union size, also called intersection over union (IoU).

In statistics, a rank correlation is any of several statistics that measure an ordinal association — the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the ordering labels "first", "second", "third", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. For example, two common nonparametric methods of significance that use rank correlation are the Mann–Whitney U test and the Wilcoxon signed-rank test.

The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution. In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values is distribution-free. However, the test is most often used in contexts where a family of distributions is being tested, in which case the parameters of that family need to be estimated and account must be taken of this in adjusting either the test-statistic or its critical values. When applied to testing whether a normal distribution adequately describes a set of data, it is one of the most powerful statistical tools for detecting most departures from normality. K-sample Anderson–Darling tests are available for testing whether several collections of observations can be modelled as coming from a single population, where the distribution function does not have to be specified.

<span class="mw-page-title-main">Rand index</span> Measure of similarity between two data clusterings

The Rand index or Rand measure in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. The Rand index is the accuracy of determining if a link belongs within a cluster or not.

Kendall's W is a non-parametric statistic for rank correlation. It is a normalization of the statistic of the Friedman test, and can be used for assessing agreement among raters and in particular inter-rater reliability. Kendall's W ranges from 0 to 1.

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient, is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938, though Gustav Fechner had proposed a similar measure in the context of time series in 1897.

In mathematics, a univariate polynomial of degree n with real or complex coefficients has n complex roots, if counted with their multiplicities. They form a multiset of n points in the complex plane. This article concerns the geometry of these points, that is the information about their localization in the complex plane that can be deduced from the degree and the coefficients of the polynomial.

In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in space, where is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high leverage observation. Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted i.e., to be influential points. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the hat matrix.

In ecology and biology, the Bray–Curtis dissimilarity is a statistic used to quantify the dissimilarity in species composition between two different sites, based on counts at each site. It is named after J. Roger Bray and John T. Curtis who first presented it in a paper in 1957.

In digital image and video processing, a color layout descriptor (CLD) is designed to capture the spatial distribution of color in an image. The feature extraction process consists of two parts: grid based representative color selection and discrete cosine transform with quantization.

Curvilinear coordinates can be formulated in tensor calculus, with important applications in physics and engineering, particularly for describing transportation of physical quantities and deformation of matter in fluid mechanics and continuum mechanics.

The Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is a multi-criteria decision analysis method, which was originally developed by Ching-Lai Hwang and Yoon in 1981 with further developments by Yoon in 1987, and Hwang, Lai and Liu in 1993. TOPSIS is based on the concept that the chosen alternative should have the shortest geometric distance from the positive ideal solution (PIS) and the longest geometric distance from the negative ideal solution (NIS). A dedicated book in the fuzzy context was published in 2021

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known. These data exist on an ordinal scale, one of four levels of measurement described by S. S. Stevens in 1946. The ordinal scale is distinguished from the nominal scale by having a ranking. It also differs from the interval scale and ratio scale by not having category widths that represent equal increments of the underlying attribute.

Graph cut optimization is a combinatorial optimization method applicable to a family of functions of discrete variables, named after the concept of cut in the theory of flow networks. Thanks to the max-flow min-cut theorem, determining the minimum cut over a graph representing a flow network is equivalent to computing the maximum flow over the network. Given a pseudo-Boolean function , if it is possible to construct a flow network with positive weights such that

References

  1. Gower, John C (1971). "A general coefficient of similarity and some of its properties". Biometrics. 27 (4): 857–871. doi:10.2307/2528823. JSTOR   2528823 . Retrieved 2024-06-03.
  2. Borg, Ingwer; Groenen, Patrick J. F. (2005). Modern multidimensional scaling: theory and applications (2 ed.). New York [Heidelberg]: Springer. pp. 124–125. ISBN   978-0387-25150-9.
  3. Legendre, Pierre; Legendre, Louis (2012). Numerical ecology (Third English ed.). Amsterdam: Elsevier. pp. 278–280. ISBN   978-0-444-53868-0.
  4. Kaufman, Leonard; Rousseeuw, Peter J. (1990). Finding groups in data: an introduction to cluster analysis. New York: Wiley. pp. 35–36. ISBN   9780471878766.
  5. Podani, János (May 1999). "Extending Gower's general coefficient of similarity to ordinal characters". Taxon. 48 (2): 331–340. doi:10.2307/1224438. JSTOR   1224438.
  6. D'Orazio, Marcello. "gower.dist {StatMatch}". The R Project for Statistical Computing. Retrieved 31 October 2024.