SUBCLU

Last updated December 08, 2022 • 2 min readFrom Wikipedia, The Free Encyclopedia

SUBCLU is an algorithm for clustering high-dimensional data by Karin Kailing, Hans-Peter Kriegel and Peer Kröger.^[1] It is a subspace clustering algorithm that builds on the density-based clustering algorithm DBSCAN. SUBCLU can find clusters in axis-parallel subspaces, and uses a bottom-up, greedy strategy to remain efficient.

Approach

SUBCLU uses a monotonicity criteria: if a cluster is found in a subspace $S$ , then each subspace $T\subseteq S$ also contains a cluster. However, a cluster $C\subseteq DB$ in subspace $S$ is not necessarily a cluster in $T\subseteq S$ , since clusters are required to be maximal, and more objects might be contained in the cluster in $T$ that contains $C$ . However, a density-connected set in a subspace $S$ is also a density-connected set in $T\subseteq S$ .

This downward-closure property is utilized by SUBCLU in a way similar to the Apriori algorithm: first, all 1-dimensional subspaces are clustered. All clusters in a higher-dimensional subspace will be subsets of the clusters detected in this first clustering. SUBCLU hence recursively produces $k+1$ -dimensional candidate subspaces by combining $k$ -dimensional subspaces with clusters sharing $k-1$ attributes. After pruning irrelevant candidates, DBSCAN is applied to the candidate subspace to find out if it still contains clusters. If it does, the candidate subspace is used for the next combination of subspaces. In order to improve the runtime of DBSCAN, only the points known to belong to clusters in one $k$ -dimensional subspace (which is chosen to contain as little clusters as possible) are considered. Due to the downward-closure property, other point cannot be part of a $k+1$ -dimensional cluster anyway.

Pseudocode

SUBCLU takes two parameters, $\epsilon \!\,$ and $MinPts$ , which serve the same role as in DBSCAN. In a first step, DBSCAN is used to find 1D-clusters in each subspace spanned by a single attribute:

${\mathtt {SUBCLU}}(DB,eps,MinPts)$

S_{1}:=\emptyset

C_{1}:=\emptyset

{\mathtt {for\,each}}\,a\in Attributes

C^{\{a\}}={\mathtt {DBSCAN}}(DB,\{a\},eps,MinPts)\!\,

{\mathtt {if}}(C^{\{a\}}\neq \emptyset )

S_{1}:=S_{1}\cup \{a\}

C_{1}:=C_{1}\cup C^{\{a\}}

{\mathtt {end\,if}}

{\mathtt {end\,for}}

// In a second step,

k+1

-dimensional clusters are built from

k

-dimensional ones:

k:=1\!\,

{\mathtt {while}}(C_{k}\neq \emptyset )

{\mathtt {CandS}}_{k+1}:={\mathtt {GenerateCandidateSubspaces}}(S_{k})\!\,

{\mathtt {for\,each}}\,cand\in {\mathtt {CandS}}_{k+1}

{\mathtt {bestSubspace:=}}\min _{s\in S_{k}\wedge s\subset cand}\sum _{C_{i}\in C^{s}}|C_{i}|

C^{cand}:=\emptyset

{\mathtt {for\,each\,cluster}}\,cl\in C^{\mathtt {bestSubspace}}

C^{cand}:=C^{cand}\cup {\mathtt {DBSCAN}}(cl,cand,eps,MinPts)

{\mathtt {if}}\,(C^{cand}\neq \emptyset )

S_{k+1}:=S_{k+1}\cup cand

C_{k+1}:=C_{k+1}\cup C^{cand}

{\mathtt {end\,if}}

{\mathtt {end\,for}}

{\mathtt {end\,for}}

k:=k+1\!\,

{\mathtt {end\,while}}

${\mathtt {end}}\!\,$

The set $S_{k}$ contains all the $k$ -dimensional subspaces that are known to contain clusters. The set $C_{k}$ contains the sets of clusters found in the subspaces. The $bestSubspace$ is chosen to minimize the runs of DBSCAN (and the number of points that need to be considered in each run) for finding the clusters in the candidate subspaces.

Candidate subspaces are generated much alike the Apriori algorithm generates the frequent itemset candidates: Pairs of the $k$ -dimensional subspaces are compared, and if they differ in one attribute only, they form a $k+1$ -dimensional candidate. However, a number of irrelevant candidates are found as well; they contain a $k$ -dimensional subspace that does not contain a cluster. Hence, these candidates are removed in a second step:

${\mathtt {GenerateCandidateSubspaces}}(S_{k})$

{\mathtt {CandS}}_{k+1}:=\emptyset

{\mathtt {for\,each}}\,s_{1}\in S_{k}

{\mathtt {for\,each}}\,s_{2}\in S_{k}

{\mathtt {if}}\,(s_{1}\,{\mathtt {and}}\,s_{2}\,\,{\mathtt {differ\,\,in\,\,exactly\,\,one\,\,attribute}})

{\mathtt {CandS}}_{k+1}:={\mathtt {CandS}}_{k+1}\cup \{s_{1}\cup s_{2}\}

{\mathtt {end\,if}}

{\mathtt {end\,for}}

{\mathtt {end\,for}}

// Pruning of irrelevant candidate subspaces

{\mathtt {for\,each}}\,cand\in {\mathtt {CandS}}_{k+1}

{\mathtt {for\,each}}\,k{\texttt {-element}}\,s\subset cand

{\mathtt {if}}\,(s\not \in S_{k})

{\mathtt {CandS}}_{k+1}={\mathtt {CandS}}_{k+1}\setminus \{cand\}

{\mathtt {end\,if}}

{\mathtt {end\,for}}

{\mathtt {end\,for}}

${\mathtt {end}}\,\!$

Availability

An example implementation of SUBCLU is available in the ELKI framework.

Related Research Articles

In mathematics, the concept of a measure is a generalization and formalization of geometrical measures and other common notions, such as mass and probability of events. These seemingly distinct concepts have many similarities and can often be treated together in a single mathematical context. Measures are foundational in probability theory, integration theory, and can be generalized to assume negative values, as with electrical charge. Far-reaching generalizations of measure are widely used in quantum physics and physics in general.

This is a glossary of some terms used in the branch of mathematics known as topology. Although there is no absolute distinction between different areas of topology, the focus here is on general topology. The following definitions are also fundamental to algebraic topology, differential topology and geometric topology.

In mathematics, a topological vector space is one of the basic structures investigated in functional analysis. A topological vector space is a vector space that is also a topological space with the property that the vector space operations are also continuous functions. Such a topology is called a vector topology and every topological vector space has a uniform topological structure, allowing a notion of uniform convergence and completeness. Some authors also require that the space is a Hausdorff space. One of the most widely studied categories of TVSs are locally convex topological vector spaces. This article focuses on TVSs that are not necessarily locally convex. Banach spaces, Hilbert spaces and Sobolev spaces are other well-known examples of TVSs.

In mathematics, a quadric or quadric surface, is a generalization of conic sections. It is a hypersurface in a (D + 1)-dimensional space, and it is defined as the zero set of an irreducible polynomial of degree two in D + 1 variables; for example, D = 1 in the case of conic sections. When the defining polynomial is not absolutely irreducible, the zero set is generally not considered a quadric, although it is often called a degenerate quadric or a reducible quadric.

In mathematics, an invariant subspace of a linear mapping T : V → V i.e. from some vector space V to itself, is a subspace W of V that is preserved by T; that is, T(W) ⊆ W.

<span class="mw-page-title-main">Cluster analysis</span> Grouping a set of objects by similarity

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

In linear algebra and functional analysis, the min-max theorem, or variational theorem, or Courant–Fischer–Weyl min-max principle, is a result that gives a variational characterization of eigenvalues of compact Hermitian operators on Hilbert spaces. It can be viewed as the starting point of many results of similar nature.

In the mathematical fields of linear algebra and functional analysis, the orthogonal complement of a subspace W of a vector space V equipped with a bilinear form B is the set W^⊥ of all vectors in V that are orthogonal to every vector in W. Informally, it is called the perp, short for perpendicular complement. It is a subspace of V.

Pregeometry, and in full combinatorial pregeometry, are essentially synonyms for "matroid". They were introduced by Gian-Carlo Rota with the intention of providing a less "ineffably cacophonous" alternative term. Also, the term combinatorial geometry, sometimes abbreviated to geometry, was intended to replace "simple matroid". These terms are now infrequently used in the study of matroids.

In functional and convex analysis, and related disciplines of mathematics, the polar set $is a special convex set associated to any subset of a vector space lying in the dual space The bipolar of a subset is the polar of but lies in .$

In additive combinatorics, Freiman's theorem is a central result which indicates the approximate structure of sets whose sumset is small. It roughly states that if $is small, then can be contained in a small generalized arithmetic progression.$

In mathematics, the tautological bundle is a vector bundle occurring over a Grassmannian in a natural tautological way: for a Grassmannian of $-dimensional subspaces of, given a point in the Grassmannian corresponding to a -dimensional vector subspace, the fiber over is the subspace itself. In the case of projective space the tautological bundle is known as the tautological line bundle.$

In topology and related fields of mathematics, a sequential space is a topological space whose topology can be completely characterized by its convergent/divergent sequences. They can be thought of as spaces that satisfy a very weak axiom of countability, and all first-countable spaces are sequential.

In the mathematical discipline of functional analysis, the concept of a compact operator on Hilbert space is an extension of the concept of a matrix acting on a finite-dimensional vector space; in Hilbert space, compact operators are precisely the closure of finite-rank operators in the topology induced by the operator norm. As such, results from matrix theory can sometimes be extended to compact operators using similar arguments. By contrast, the study of general operators on infinite-dimensional spaces often requires a genuinely different approach.

In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed together, marking as outliers points that lie alone in low-density regions . DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

In mathematics, a cardinal function is a function that returns cardinal numbers.

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce many measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.

A sum-of-squares optimization program is an optimization problem with a linear cost function and a particular type of constraint on the decision variables. These constraints are of the form that when the decision variables are used as coefficients in certain polynomials, those polynomials should have the polynomial SOS property. When fixing the maximum degree of the polynomials involved, sum-of-squares optimization is also known as the Lasserre hierarchy of relaxations in semidefinite programming.

In mathematics, a polyadic space is a topological space that is the image under a continuous function of a topological power of an Alexandroff one-point compactification of a discrete space.

References

↑ Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data . In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246-257, 2004.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data . In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246-257, 2004.

[1]