Automatic clustering algorithms

Last updated

Automatic clustering algorithms are algorithms that can perform clustering without prior knowledge of data sets. In contrast with other cluster analysis techniques, automatic clustering algorithms can determine the optimal number of clusters even in the presence of noise and outlier points. [1] [ needs context ]

Contents

Centroid-based

Given a set of n objects, centroid-based algorithms create k partitions based on a dissimilarity function, such that k≤n. A major problem in applying this type of algorithm is determining the appropriate number of clusters for unlabeled data. Therefore, most research in clustering analysis has been focused on the automation of the process.

Automated selection of k in a K-means clustering algorithm, one of the most used centroid-based clustering algorithms, is still a major problem in machine learning. The most accepted solution to this problem is the elbow method. It consists of running k-means clustering to the data set with a range of values, calculating the sum of squared errors for each, and plotting them in a line chart. If the chart looks like an arm, the best value of k will be on the "elbow". [2]

Another method that modifies the k-means algorithm for automatically choosing the optimal number of clusters is the G-means algorithm. It was developed from the hypothesis that a subset of the data follows a Gaussian distribution. Thus, k is increased until each k-means center's data is Gaussian. This algorithm only requires the standard statistical significance level as a parameter and does not set limits for the covariance of the data. [3]

Connectivity-based (hierarchical clustering)

Connectivity-based clustering or hierarchical clustering is based on the idea that objects have more similarities to other nearby objects than to those further away. Therefore, the generated clusters from this type of algorithm will be the result of the distance between the analyzed objects.

Hierarchical models can either be divisive, where partitions are built from the entire data set available, or agglomerating, where each partition begins with a single object and additional objects are added to the set. [4] Although hierarchical clustering has the advantage of allowing any valid metric to be used as the defined distance, it is sensitive to noise and fluctuations in the data set and is more difficult to automate.

Methods have been developed to improve and automate existing hierarchical clustering algorithms [5] such as an automated version of single linkage hierarchical cluster analysis (HCA). This computerized method bases its success on a self-consistent outlier reduction approach followed by the building of a descriptive function which permits defining natural clusters. Discarded objects can also be assigned to these clusters. Essentially, one needs not to resort to external parameters to identify natural clusters. Information gathered from HCA, automated and reliable, can be resumed in a dendrogram with the number of natural clusters and the corresponding separation, an option not found in classical HCA. This method includes the two following steps: outliers being removed (this is applied in many filtering applications) and an optional classification allowing expanding clusters with the whole set of objects. [6]

BIRCH (balanced iterative reducing and clustering using hierarchies) is an algorithm used to perform connectivity-based clustering for large data-sets. [7] It is regarded as one of the fastest clustering algorithms, but it is limited because it requires the number of clusters as an input. Therefore, new algorithms based on BIRCH have been developed in which there is no need to provide the cluster count from the beginning, but that preserves the quality and speed of the clusters. The main modification is to remove the final step of BIRCH, where the user had to input the cluster count, and to improve the rest of the algorithm, referred to as tree-BIRCH, by optimizing a threshold parameter from the data. In this resulting algorithm, the threshold parameter is calculated from the maximum cluster radius and the minimum distance between clusters, which are often known. This method proved to be efficient for data sets of tens of thousands of clusters. If going beyond that amount, a supercluster splitting problem is introduced. For this, other algorithms have been developed, like MDB-BIRCH, which reduces super cluster splitting with relatively high speed. [8]

Density-based

Unlike partitioning and hierarchical methods, density-based clustering algorithms are able to find clusters of any arbitrary shape, not only spheres.

The density-based clustering algorithm uses autonomous machine learning that identifies patterns regarding geographical location and distance to a particular number of neighbors. It is considered autonomous because a priori knowledge on what is a cluster is not required. [9] This type of algorithm provides different methods to find clusters in the data. The fastest method is DBSCAN, which uses a defined distance to differentiate between dense groups of information and sparser noise. Moreover, HDBSCAN can self-adjust by using a range of distances instead of a specified one. Lastly, the method OPTICS creates a reachability plot based on the distance from neighboring features to separate noise from clusters of varying density.

These methods still require the user to provide the cluster center and cannot be considered automatic. The Automatic Local Density Clustering Algorithm (ALDC) is an example of the new research focused on developing automatic density-based clustering. ALDC works out local density and distance deviation of every point, thus expanding the difference between the potential cluster center and other points. This expansion allows the machine to work automatically. The machine identifies cluster centers and assigns the points that are left by their closest neighbor of higher density. [10]

In the automation of data density to identify clusters, research has also been focused on artificially generating the algorithms. For instance, the Estimation of Distribution Algorithms guarantees the generation of valid algorithms by the directed acyclic graph (DAG), in which nodes represent procedures (building block) and edges represent possible execution sequences between two nodes. Building Blocks determine the EDA's alphabet or, in other words, any generated algorithm. Clustering algorithms artificially generated are compared to DBSCAN, a manual algorithm, in experimental results. [11]

Related Research Articles

Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, where a small portion of the data is tagged, and self-supervision. Some researchers consider self-supervised learning a form of unsupervised learning.

In data mining and statistics, hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:

<span class="mw-page-title-main">Cluster analysis</span> Grouping a set of objects by similarity

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. Therefore, it also can be interpreted as an outlier detection method. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles at SRI International in 1981. They used RANSAC to solve the location determination problem (LDP), where the goal is to determine the points in the space that project onto an image into a set of landmarks with known locations.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances, but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.

The k-medoids problem is a clustering problem similar to k-means. The name was coined by Leonard Kaufman and Peter J. Rousseeuw with their PAM algorithm. Both the k-means and k-medoids algorithms are partitional and attempt to minimize the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers, and thereby allows for greater interpretability of the cluster centers than in k-means, where the center of a cluster is not necessarily one of the input data points. Furthermore, k-medoids can be used with arbitrary dissimilarity measures, whereas k-means generally requires Euclidean distance for efficient solutions. Because k-medoids minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances, it is more robust to noise and outliers than k-means.

<span class="mw-page-title-main">Spectral clustering</span> Clustering methods

In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed, and marks as outliers points that lie alone in low-density regions . DBSCAN is one of the most commonly used and cited clustering algorithms.

In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, tries to identify objects of a specific class amongst all objects, by primarily learning from a training set containing only the objects of that class, although there exist variants of one-class classifiers where counter-examples are used to further refine the classification boundary. This is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or the operational status of a nuclear plant as 'normal': In this scenario, there are few, if any, examples of catastrophic system states; only the statistics of normal operation are known.

BIRCH is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources. In most cases, BIRCH only requires a single scan of the database.

Silhouette is a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. It was proposed by Belgian statistician Peter Rousseeuw in 1987.

Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.

Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander. Its basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the points of the database are (linearly) ordered such that spatially closest points become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that must be accepted for a cluster so that both points belong to the same cluster. This is represented as a dendrogram.

CURE is an efficient data clustering algorithm for large databases. Compared with K-means clustering it is more robust to outliers and able to identify clusters having non-spherical shapes and size variances.

In computer science, data stream clustering is defined as the clustering of data that arrive continuously such as telephone records, multimedia data, financial transactions etc. Data stream clustering is usually studied as a streaming algorithm and the objective is, given a sequence of points, to construct a good clustering of the stream, using a small amount of memory and time.

<span class="mw-page-title-main">ELKI</span> Data mining framework

ELKI is a data mining software framework developed for use in research and teaching. It was originally created by the database systems research unit at the Ludwig Maximilian University of Munich, Germany, led by Professor Hans-Peter Kriegel. The project has continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours.

In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Image segmentation techniques labels the pixels to be a part of pixels with certain characteristics at a particular time. Here, the pixels are segmented depending on its relative movement over a period of time i.e. the time of the video sequence.

References

  1. Outlier
  2. "Using the elbow method to determine the optimal number of clusters for k-means clustering". bl.ocks.org. Retrieved 2018-11-12.
  3. Hamerly, Greg; Elkan, Charles (9 December 2003). Sebastian Thrun; Lawrence K Saul; Bernhard H Schölkopf (eds.). Learning the k in k-means (PDF). Proceedings of the 16th International Conference on Neural Information Processing Systems. Whistler, British Columbia, Canada: MIT Press. pp. 281–288. Archived from the original (PDF) on 16 October 2022. Retrieved 3 November 2022.{{cite conference}}: CS1 maint: date and year (link)
  4. "Introducing Clustering II: Clustering Algorithms - GameAnalytics". GameAnalytics. 2014-05-20. Retrieved 2018-11-06.
  5. Almeida, J.A.S.; Barbosa, L.M.S.; Pais, A.A.C.C.; Formosinho, S.J. (June 2007). "Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering" (PDF). Chemometrics and Intelligent Laboratory Systems. 87 (2). Elsevier: 208–217. doi:10.1016/j.chemolab.2007.01.005. hdl:10316/5042. Archived from the original (PDF) on 15 November 2018. Retrieved 3 November 2022.
  6. Almeida, J.A.S.; Barbosa, L.M.S.; Pais, A.A.C.C.; Formosinho, S.J. (2007-06-15). "Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering" (PDF). Chemometrics and Intelligent Laboratory Systems. 87 (2): 208–217. doi:10.1016/j.chemolab.2007.01.005. hdl: 10316/5042 . ISSN   0169-7439.
  7. Zhang, Tian; Ramakrishnan, Raghu; Livny, Miron; Zhang, Tian; Ramakrishnan, Raghu; Livny, Miron (1996-06-01). "BIRCH: an efficient data clustering method for very large databases, BIRCH: an efficient data clustering method for very large databases". ACM SIGMOD Record. 25 (2): 103, 103–114, 114. doi: 10.1145/235968.233324 . ISSN   0163-5808.
  8. Lorbeer, Boris; Kosareva, Ana; Deva, Bersant; Softić, Dženan; Ruppel, Peter; Küpper, Axel (2018-03-01). "Variations on the Clustering Algorithm BIRCH". Big Data Research. 11: 44–53. doi: 10.1016/j.bdr.2017.09.002 . ISSN   2214-5796.
  9. "How Density-based Clustering works—ArcGIS Pro | ArcGIS Desktop". pro.arcgis.com. Retrieved 2018-11-05.
  10. "An algorithm for automatic recognition of cluster centers based on local density clustering - IEEE Conference Publication". doi:10.1109/CCDC.2017.7978726. S2CID   23267464.{{cite journal}}: Cite journal requires |journal= (help)
  11. "AutoClustering: An estimation of distribution algorithm for the automatic generation of clustering algorithms - IEEE Conference Publication". CiteSeerX   10.1.1.308.9977 . doi:10.1109/CEC.2012.6252874.{{cite journal}}: Cite journal requires |journal= (help)