In the study of hierarchical clustering, Dasgupta's objective is a measure of the quality of a clustering, defined from a similarity measure on the elements to be clustered. It is named after Sanjoy Dasgupta, who formulated it in 2016. [1] Its key property is that, when the similarity comes from an ultrametric space, the optimal clustering for this quality measure follows the underlying structure of the ultrametric space. In this sense, clustering methods that produce good clusterings for this objective can be expected to approximate the ground truth underlying the given similarity measure. [2]
In Dasgupta's formulation, the input to a clustering problem consists of similarity scores between certain pairs of elements, represented as an undirected graph , with the elements as its vertices and with non-negative real weights on its edges. Large weights indicate elements that should be considered more similar to each other, while small weights or missing edges indicate pairs of elements that are not similar. A hierarchical clustering can be described as a tree (not necessarily a binary tree) whose leaves are the elements to be clustered; the clusters are then the subsets of elements descending from each tree node, and the size of any cluster is its number of elements. For each edge of the input graph, let denote the weight of edge and let denote the smallest cluster of a given clustering that contains both and . Then Dasgupta defines the cost of a clustering to be [1]
The optimal clustering for this objective is NP-hard to find. However, it is possible to find a clustering that approximates the minimum value of the objective in polynomial time by a divisive (top-down) clustering algorithm that repeatedly subdivides the elements using an approximation algorithm for the sparsest cut problem, the problem of finding a partition that minimizes the ratio of the total weight of cut edges to the total number of cut pairs. [1] Equivalently, for purposes of approximation, one may minimize the ratio of the total weight of cut edges to the number of elements on the smaller side of the cut. Using the best known approximation for the sparsest cut problem, the approximation ratio of this approach is . [3]
In computer science and optimization theory, the max-flow min-cut theorem states that in a flow network, the maximum amount of flow passing from the source to the sink is equal to the total weight of the edges in a minimum cut, i.e., the smallest total weight of the edges which if removed would disconnect the source from the sink.
In graph theory, an independent set, stable set, coclique or anticlique is a set of vertices in a graph, no two of which are adjacent. That is, it is a set of vertices such that for every two vertices in , there is no edge connecting the two. Equivalently, each edge in the graph has at most one endpoint in . A set is independent if and only if it is a clique in the graph's complement. The size of an independent set is the number of vertices it contains. Independent sets have also been called "internally stable sets", of which "stable set" is a shortening.
In computer science and operations research, approximation algorithms are efficient algorithms that find approximate solutions to optimization problems with provable guarantees on the distance of the returned solution to the optimal one. Approximation algorithms naturally arise in the field of theoretical computer science as a consequence of the widely believed P ≠ NP conjecture. Under this conjecture, a wide class of optimization problems cannot be solved exactly in polynomial time. The field of approximation algorithms, therefore, tries to understand how closely it is possible to approximate optimal solutions to such problems in polynomial time. In an overwhelming majority of the cases, the guarantee of such algorithms is a multiplicative one expressed as an approximation ratio or approximation factor i.e., the optimal solution is always guaranteed to be within a (predetermined) multiplicative factor of the returned solution. However, there are also many approximation algorithms that provide an additive guarantee on the quality of the returned solution. A notable example of an approximation algorithm that provides both is the classic approximation algorithm of Lenstra, Shmoys and Tardos for scheduling on unrelated parallel machines.
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.
In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.
The set cover problem is a classical question in combinatorics, computer science, operations research, and complexity theory.
A Euclidean minimum spanning tree of a finite set of points in the Euclidean plane or higher-dimensional Euclidean space connects the points by a system of line segments with the points as endpoints, minimizing the total length of the segments. In it, any two points can reach each other along a path through the line segments. It can be found as the minimum spanning tree of a complete graph with the points as vertices and the Euclidean distances between points as edge weights.
In graph theory, a dominating set for a graph G is a subset D of its vertices, such that any vertex of G is in D, or has a neighbor in D. The domination numberγ(G) is the number of vertices in a smallest dominating set for G.
In graph theory, a cut is a partition of the vertices of a graph into two disjoint subsets. Any cut determines a cut-set, the set of edges that have one endpoint in each subset of the partition. These edges are said to cross the cut. In a connected graph, each cut-set determines a unique cut, and in some cases cuts are identified with their cut-sets rather than with their vertex partitions.
The Christofides algorithm or Christofides–Serdyukov algorithm is an algorithm for finding approximate solutions to the travelling salesman problem, on instances where the distances form a metric space . It is an approximation algorithm that guarantees that its solutions will be within a factor of 3/2 of the optimal solution length, and is named after Nicos Christofides and Anatoliy I. Serdyukov ; the latter discovered it independently in 1976.
In computer science and graph theory, Karger's algorithm is a randomized algorithm to compute a minimum cut of a connected graph. It was invented by David Karger and first published in 1993.
In mathematics, a graph partition is the reduction of a graph to a smaller graph by partitioning its set of nodes into mutually exclusive groups. Edges of the original graph that cross between the groups will produce edges in the partitioned graph. If the number of resulting edges is small compared to the original graph, then the partitioned graph may be better suited for analysis and problem-solving than the original. Finding a partition that simplifies graph analysis is a hard problem, but one that has applications to scientific computing, VLSI circuit design, and task scheduling in multiprocessor computers, among others. Recently, the graph partition problem has gained importance due to its application for clustering and detection of cliques in social, pathological and biological networks. For a survey on recent trends in computational methods and applications see Buluc et al. (2013). Two common examples of graph partitioning are minimum cut and maximum cut problems.
In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
In a graph, a maximum cut is a cut whose size is at least the size of any other cut. That is, it is a partition of the graph's vertices into two complementary sets S and T, such that the number of edges between S and T is as large as possible. Finding such a cut is known as the max-cut problem.
Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance.
In graph theory and computer science, a dense subgraph is a subgraph with many edges per vertex. This is formalized as follows: let G = (V, E) be an undirected graph and let S = (VS, ES) be a subgraph of G. Then the density of S is defined to be:
In graph drawing, a circular layout is a style of drawing that places the vertices of a graph on a circle, often evenly spaced so that they form the vertices of a regular polygon.
Approximate max-flow min-cut theorems are mathematical propositions in network flow theory. Approximate max-flow min-cut theorems deal with the relationship between maximum flow rate ("max-flow") and minimum cut ("min-cut") in a multi-commodity flow problem. The theorems have enabled the development of approximation algorithms for use in graph partition and related problems.
In graph theory, the cutwidth of an undirected graph is the smallest integer with the following property: there is an ordering of the vertices of the graph, such that every cut obtained by partitioning the vertices into earlier and later subsets of the ordering is crossed by at most edges. That is, if the vertices are numbered , then for every , the number of edges with and is at most .
In graph theory, the carving width of a graph is a number, defined from the graph, that describes the number of edges separating the clusters in a hierarchical clustering of the graph vertices.