BIRCH

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. [1] With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm. [2] An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database.

Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively", [1] beating DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10-year Test of Time Award in 2006. [3]

Problem with previous methods

Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case in which a data set was too large to fit in main memory. As a result, maintaining high clustering quality while minimizing the cost of additional I/O (input/output) operations incurred a lot of overhead. Furthermore, most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally for each clustering decision and do not perform heuristic weighting based on the distance between these data points.

Advantages with BIRCH

It is local in that each clustering decision is made without scanning all data points and currently existing clusters. It exploits the observation that the data space is not usually uniformly occupied and not every data point is equally important. It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also an incremental method that does not require the whole data set in advance.

Algorithm

The BIRCH algorithm takes as input a set of N data points, represented as real-valued vectors, and a desired number of clusters K. It operates in four phases, the second of which is optional.

The first phase builds a clustering feature (CF) tree out of the data points, a height-balanced tree data structure, defined as follows:

Given a set of $N$ $d$-dimensional data points, the clustering feature of the set is the triple $CF = (N, LS, SS)$, where $LS = \sum_{i=1}^{N} x_i$ is the linear sum and $SS = \sum_{i=1}^{N} x_i^2$ is the square sum of the data points. Clustering features are additive: the CF of the union of two disjoint sets is obtained by adding the components of their CFs. Clustering features are organized in a CF tree, a height-balanced tree with two parameters: the branching factor $B$ and the threshold $T$. Each non-leaf node contains at most $B$ entries of the form $[CF_i, \mathit{child}_i]$, where $\mathit{child}_i$ is a pointer to the $i$-th child node and $CF_i$ is the clustering feature summarizing that subtree. A leaf node contains a bounded number of entries, each a single clustering feature, plus pointers to the previous and next leaf so that all leaves are chained together. Every leaf entry must satisfy the threshold requirement: its diameter (or alternatively its radius) has to be less than $T$.
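
As an illustration, a classic clustering feature can be kept as a small triple together with its additive update. The sketch below is not from the original paper; the class and method names are hypothetical.

```python
import numpy as np

class ClusteringFeature:
    """Classic BIRCH clustering feature: count N, linear sum LS, square sum SS."""

    def __init__(self, n, ls, ss):
        self.n = n                               # number of points summarized
        self.ls = np.asarray(ls, dtype=float)    # linear sum of the points
        self.ss = float(ss)                      # sum of squared norms of the points

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x, x @ x)

    def absorb(self, other):
        """Additivity: CF_A + CF_B = (N_A + N_B, LS_A + LS_B, SS_A + SS_B)."""
        self.n += other.n
        self.ls = self.ls + other.ls
        self.ss += other.ss

# Two points merged into one feature.
cf = ClusteringFeature.from_point([1.0, 2.0])
cf.absorb(ClusteringFeature.from_point([3.0, 4.0]))
print(cf.n, cf.ls, cf.ss)  # 2 [4. 6.] 30.0
```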

In the second phase, the algorithm scans all the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing outliers and grouping crowded subclusters into larger ones. This phase is marked optional in the original presentation of BIRCH.

In the third phase, an existing clustering algorithm is used to cluster all leaf entries. Here an agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented by their CF vectors. It also provides the flexibility of allowing the user to specify either the desired number of clusters or the desired diameter threshold for clusters. After this phase a set of clusters is obtained that captures the major distribution patterns in the data. There might, however, remain minor and localized inaccuracies, which can be handled by the optional fourth phase: the centroids of the clusters produced in phase three are used as seeds, and the data points are redistributed to their closest seed to obtain a new set of clusters. This phase also offers the option of discarding outliers: a point that is too far from its closest seed can be treated as an outlier.
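
The fourth phase is a single pass over the data. A minimal sketch, assuming Euclidean distances and an illustrative outlier_cutoff parameter (not part of the original formulation), could look as follows.

```python
import numpy as np

def redistribute(points, seeds, outlier_cutoff=np.inf):
    """Assign each point to its nearest seed; points farther than the cutoff get label -1."""
    points = np.asarray(points, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    # Pairwise squared Euclidean distances between all points and all seeds.
    d2 = ((points[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    nearest = np.sqrt(d2[np.arange(len(points)), labels])
    labels[nearest > outlier_cutoff] = -1
    return labels

print(redistribute([[0, 0], [0.2, 0.1], [5, 5], [40, 40]],
                   seeds=[[0, 0], [5, 5]], outlier_cutoff=10.0))  # [0 0 1 -1]
```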

Calculations with the clustering features

Given only the clustering feature $CF = (N, LS, SS)$, the centroid, radius, and diameter of the summarized points can be calculated without knowledge of the underlying actual values:

$C = LS / N$ (centroid)
$R = \sqrt{SS/N - (LS/N)^2}$ (radius, the average distance of the points from the centroid)
$D = \sqrt{\frac{2 N \cdot SS - 2 \cdot LS^2}{N (N-1)}}$ (diameter, the average pairwise distance between points)

In multidimensional cases the square root should be replaced with a suitable norm.
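
Assuming the squared terms above are replaced by squared Euclidean norms in the multidimensional case, these measures can be evaluated directly from the triple. A small sketch; the helper names are illustrative.

```python
import numpy as np

def centroid(n, ls, ss):
    """Centroid C = LS / N."""
    return ls / n

def radius(n, ls, ss):
    """R = sqrt(SS/N - ||LS/N||^2), the RMS distance of the points to the centroid."""
    c = ls / n
    return np.sqrt(max(ss / n - c @ c, 0.0))  # clamp guards against round-off

def diameter(n, ls, ss):
    """D = sqrt((2*N*SS - 2*||LS||^2) / (N*(N-1))), the RMS pairwise distance."""
    return np.sqrt(max((2 * n * ss - 2 * (ls @ ls)) / (n * (n - 1)), 0.0))

pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
n, ls, ss = len(pts), pts.sum(axis=0), float((pts ** 2).sum())
print(centroid(n, ls, ss), radius(n, ls, ss), diameter(n, ls, ss))
```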

BIRCH uses the distances D0 to D3 to find the nearest leaf, then the radius R or the diameter D to decide whether to absorb the data into the existing leaf or whether to add a new leaf.
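
The insertion test can be sketched as follows: tentatively merge the new point into the closest leaf entry and keep the merge only if the resulting radius (or diameter) stays below the threshold T. This is an illustrative sketch, not the original implementation; the function names are hypothetical.

```python
import numpy as np

def radius(n, ls, ss):
    """R = sqrt(SS/N - ||LS/N||^2) computed from a (N, LS, SS) triple."""
    c = ls / n
    return np.sqrt(max(ss / n - c @ c, 0.0))

def try_absorb(leaf, point, threshold):
    """Merge two (N, LS, SS) triples and keep the merge only if the radius stays within T."""
    n, ls, ss = leaf[0] + point[0], leaf[1] + point[1], leaf[2] + point[2]
    if radius(n, ls, ss) <= threshold:
        return (n, ls, ss)   # merged entry replaces the old leaf entry
    return None              # caller must create a new leaf entry (possibly splitting the node)

x = np.array([1.1, 0.9])
print(try_absorb((2, np.array([2.0, 2.0]), 4.0), (1, x, float(x @ x)), threshold=0.5))
```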

Numerical issues in BIRCH clustering features

Unfortunately, there are numerical issues associated with the use of the term $SS$ in BIRCH. When subtracting $SS/N - (LS/N)^2$ or similar terms in the other distances such as $D_2$, catastrophic cancellation can occur and yield poor precision, and in some cases the result can even become negative (so that the square root is undefined). [2] This can be resolved by using BETULA cluster features instead, which store the count $N$, the mean $\mu$, and the sum of squared deviations $S$, based on numerically more reliable online algorithms to calculate variance. For these features, a similar additivity theorem holds. When storing a vector, respectively a matrix, for the squared deviations, the resulting BIRCH CF-tree can also be used to accelerate Gaussian mixture modeling with the expectation–maximization algorithm, in addition to k-means clustering and hierarchical agglomerative clustering.

Instead of storing the linear sum $LS$ and the sum of squares $SS$, we can instead store the mean and the squared deviations from the mean in each cluster feature $CF' = (N, \mu, S)$, [4] where

$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$ is the mean of the points, and
$S = \sum_{i=1}^{N} \lVert x_i - \mu \rVert^2$ is the sum of squared deviations from the mean.

The main difference here is that S is computed relative to the center, instead of relative to the origin.

A single point $x$ can be cast into a cluster feature $CF_x = (1, x, 0)$. In order to combine two cluster features $CF_A = (N_A, \mu_A, S_A)$ and $CF_B = (N_B, \mu_B, S_B)$, we use

$N_{AB} = N_A + N_B$
$\mu_{AB} = \mu_A + \frac{N_B}{N_{AB}} (\mu_B - \mu_A)$
$S_{AB} = S_A + S_B + N_B \, (\mu_B - \mu_A) \cdot (\mu_B - \mu_{AB})$

These update rules are numerically more reliable (cf. online computation of the variance), as they avoid the subtraction of two similar squared values. The centroid is simply the node center vector $\mu$ and can directly be used for distance computations using, e.g., the Euclidean or Manhattan distance. The radius simplifies to $R = \sqrt{S/N}$ and the diameter to $D = \sqrt{2S/(N-1)}$.
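
A numerically stable merge following these update rules might look like the sketch below, assuming $S$ is stored as a single scalar (the squared deviations summed over all dimensions); the names are illustrative.

```python
import numpy as np

def betula_point(x):
    """Cast a single point into a BETULA cluster feature (N, mu, S)."""
    return (1, np.asarray(x, dtype=float), 0.0)

def betula_merge(a, b):
    """Combine two BETULA cluster features without catastrophic cancellation."""
    (na, mua, sa), (nb, mub, sb) = a, b
    n = na + nb
    mu = mua + (nb / n) * (mub - mua)
    s = sa + sb + nb * float((mub - mua) @ (mub - mu))
    return (n, mu, s)

def radius(cf):
    n, _, s = cf
    return np.sqrt(s / n)

def diameter(cf):
    n, _, s = cf
    return np.sqrt(2 * s / (n - 1))

cf = betula_point([0.0, 0.0])
for p in ([2.0, 0.0], [1.0, 3.0]):
    cf = betula_merge(cf, betula_point(p))
print(cf, radius(cf), diameter(cf))  # same radius/diameter as with (N, LS, SS)
```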

We can now compute the different distances D0 to D4 used in the BIRCH algorithm as: [4]

$D_0 = \lVert \mu_A - \mu_B \rVert_2$ (Euclidean distance of the centroids)
$D_1 = \lVert \mu_A - \mu_B \rVert_1$ (Manhattan distance of the centroids)
$D_2^2 = \frac{S_A}{N_A} + \frac{S_B}{N_B} + \lVert \mu_A - \mu_B \rVert_2^2$ (average inter-cluster distance)
$D_3^2 = \frac{2 S_{AB}}{N_{AB} - 1}$ (average intra-cluster distance, i.e. the diameter of the merged cluster)
$D_4^2 = S_{AB} - S_A - S_B = \frac{N_A N_B}{N_A + N_B} \lVert \mu_A - \mu_B \rVert_2^2$ (variance increase of the merge)

These distances can also be used to initialize the distance matrix for hierarchical clustering, depending on the chosen linkage. For accurate hierarchical clustering and k-means clustering, we also need to use the node weight $N$.
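
A direct translation of these formulas into code could look as follows; this is a sketch over (N, μ, S) triples with an illustrative function name.

```python
import numpy as np

def birch_distances(a, b):
    """D0, D1 and the squared D2..D4 for two BETULA cluster features (N, mu, S)."""
    (na, mua, sa), (nb, mub, sb) = a, b
    delta = mua - mub
    d0 = float(np.sqrt(delta @ delta))                     # Euclidean centroid distance
    d1 = float(np.abs(delta).sum())                        # Manhattan centroid distance
    d2_sq = sa / na + sb / nb + float(delta @ delta)       # average inter-cluster distance^2
    d4_sq = (na * nb / (na + nb)) * float(delta @ delta)   # variance increase of the merge
    d3_sq = 2 * (sa + sb + d4_sq) / (na + nb - 1)          # diameter^2 of the merged cluster
    return d0, d1, d2_sq, d3_sq, d4_sq

a = (3, np.array([0.0, 0.0]), 2.0)
b = (2, np.array([3.0, 4.0]), 1.0)
print(birch_distances(a, b))  # (5.0, 7.0, 26.17, 16.5, 30.0)
```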

Clustering step

The CF-tree provides a compressed summary of the data set, but the leaves themselves provide only a poor clustering of the data. In a second step, the leaves can be clustered using, e.g.,

  1. k-means clustering, where leaves are weighted by their number of points N (a sketch of this weighted variant follows the table below),
  2. k-means++, by sampling cluster features proportional to $N \cdot \min_{c} \lVert \mu - c \rVert^2$, where the $c$ are the previously chosen centers and $(N, \mu, S)$ is the BETULA cluster feature,
  3. Gaussian mixture modeling, where the variance S can also be taken into account and, if the leaves store covariances, also the covariances,
  4. Hierarchical agglomerative clustering, where the linkage can be initialized using the following equivalence of linkages to BIRCH distances: [5]
Correspondence between HAC linkages and BIRCH distances [5]

  HAC linkage    BIRCH distance
  UPGMA          D2²
  WPGMA          D0²
  Ward           2·D4²
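
As referenced in item 1 above, here is a sketch of k-means over the leaf entries, with each leaf centroid weighted by its count N. The helper is assumed for illustration and is not a library routine.

```python
import numpy as np

def weighted_kmeans(centroids, weights, k, iters=20, seed=0):
    """Lloyd iterations over CF-tree leaf centroids, each weighted by its point count N."""
    rng = np.random.default_rng(seed)
    centers = centroids[rng.choice(len(centroids), size=k, replace=False)]
    assign = np.zeros(len(centroids), dtype=int)
    for _ in range(iters):
        d2 = ((centroids[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                w = weights[mask]
                centers[j] = (w[:, None] * centroids[mask]).sum(axis=0) / w.sum()
    return centers, assign

leaf_centroids = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.1], [4.8, 5.0]])
leaf_weights = np.array([10.0, 3.0, 7.0, 8.0])   # N of each leaf entry
centers, assign = weighted_kmeans(leaf_centroids, leaf_weights, k=2)
print(centers, assign)
```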

Availability
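
BIRCH implementations are available, for example, in the ELKI data mining framework (which also includes the BETULA variant) and in scikit-learn. Below is a minimal scikit-learn usage sketch; the parameter values are illustrative only.

```python
import numpy as np
from sklearn.cluster import Birch

# Two well-separated Gaussian blobs as toy data.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(5.0, 0.3, size=(100, 2))])

# threshold: maximum radius of a leaf subcluster before a new one is started;
# branching_factor: maximum number of CF entries per node;
# n_clusters: global clustering applied to the leaf entries (phase 3).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(np.bincount(labels))  # roughly 100 points per cluster
```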

References

  1. Zhang, T.; Ramakrishnan, R.; Livny, M. (1996). "BIRCH: an efficient data clustering method for very large databases". Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD '96). pp. 103–114. doi:10.1145/233269.233324.
  2. Lang, Andreas; Schubert, Erich (2020). "BETULA: Numerically Stable CF-Trees for BIRCH Clustering". Similarity Search and Applications. pp. 281–296. arXiv:2006.12881. doi:10.1007/978-3-030-60936-8_22. ISBN 978-3-030-60935-1.
  3. "2006 SIGMOD Test of Time Award". Archived from the original on 2010-05-23.
  4. Lang, Andreas; Schubert, Erich (2022). "BETULA: Fast clustering of large data with improved BIRCH CF-Trees". Information Systems. 108: 101918. doi:10.1016/j.is.2021.101918.
  5. Schubert, Erich; Lang, Andreas (2022). "5.1 Data Aggregation for Hierarchical Clustering". Machine Learning under Resource Constraints – Fundamentals. De Gruyter. pp. 215–226. arXiv:2309.02552. ISBN 978-3-11-078594-4.