Cluster labeling

Last updated

In natural language processing and information retrieval, cluster labeling is the problem of picking descriptive, human-readable labels for the clusters produced by a document clustering algorithm; standard clustering algorithms do not typically produce any such labels. Cluster labeling algorithms examine the contents of the documents per cluster to find a labeling that summarize the topic of each cluster and distinguish the clusters from each other.

Contents

Differential cluster labeling

Differential cluster labeling labels a cluster by comparing term distributions across clusters, using techniques also used for feature selection in document classification, such as mutual information and chi-squared feature selection. Terms having very low frequency are not the best in representing the whole cluster and can be omitted in labeling a cluster. By omitting those rare terms and using a differential test, one can achieve the best results with differential cluster labeling. [1]

Pointwise mutual information

In the fields of probability theory and information theory, mutual information measures the degree of dependence of two random variables. The mutual information of two variables X and Y is defined as:

where p(x, y) is the joint probability distribution of the two variables, p1(x) is the probability distribution of X, and p2(y) is the probability distribution of Y.

In the case of cluster labeling, the variable X is associated with membership in a cluster, and the variable Y is associated with the presence of a term. [2] Both variables can have values of 0 or 1, so the equation can be rewritten as follows:

In this case, p(C = 1) represents the probability that a randomly selected document is a member of a particular cluster, and p(C = 0) represents the probability that it isn't. Similarly, p(T = 1) represents the probability that a randomly selected document contains a given term, and p(T = 0) represents the probability that it doesn't. The joint probability distribution function p(C, T) represents the probability that two events occur simultaneously. For example, p(0, 0) is the probability that a document isn't a member of cluster c and doesn't contain term t; p(0, 1) is the probability that a document isn't a member of cluster C and does contain term T; and so on.

Chi-Squared Selection

The Pearson's chi-squared test can be used to calculate the probability that the occurrence of an event matches the initial expectations. In particular, it can be used to determine whether two events, A and B, are statistically independent. The value of the chi-squared statistic is:

where Oa,b is the observed frequency of a and b co-occurring, and Ea,b is the expected frequency of co-occurrence.

In the case of cluster labeling, the variable A is associated with membership in a cluster, and the variable B is associated with the presence of a term. Both variables can have values of 0 or 1, so the equation can be rewritten as follows:

For example, O1,0 is the observed number of documents that are in a particular cluster but don't contain a certain term, and E1,0 is the expected number of documents that are in a particular cluster but don't contain a certain term. Our initial assumption is that the two events are independent, so the expected probabilities of co-occurrence can be calculated by multiplying individual probabilities: [3]

E1,0 = N * P(C = 1) * P(T = 0)

where N is the total number of documents in the collection.

Cluster-Internal Labeling

Cluster-internal labeling selects labels that only depend on the contents of the cluster of interest. No comparison is made with the other clusters. Cluster-internal labeling can use a variety of methods, such as finding terms that occur frequently in the centroid or finding the document that lies closest to the centroid.

Centroid Labels

A frequently used model in the field of information retrieval is the vector space model, which represents documents as vectors. The entries in the vector correspond to terms in the vocabulary. Binary vectors have a value of 1 if the term is present within a particular document and 0 if it is absent. Many vectors make use of weights that reflect the importance of a term in a document, and/or the importance of the term in a document collection. For a particular cluster of documents, we can calculate the centroid by finding the arithmetic mean of all the document vectors. If an entry in the centroid vector has a high value, then the corresponding term occurs frequently within the cluster. These terms can be used as a label for the cluster. One downside to using centroid labeling is that it can pick up words like "place" and "word" that have a high frequency in written text, but have little relevance to the contents of the particular cluster.

Contextualized centroid labels

A simple, cost-effective way of overcoming the above limitation is to embed the centroid terms with the highest weight in a graph structure that provides a context for their interpretation and selection. [4] In this approach, a term-term co-occurrence matrix referred as is first built for each cluster . Each cell represents the number of times term co-occurs with term within a certain window of text (a sentence, a paragraph, etc.) In a second stage, a similarity matrix is obtained by multiplying with its transpose. We have . Being the dot product of two normalized vectors and , denotes the cosine similarity between terms and . The so obtained can then be used as the weighted adjacency matrix of a term similarity graph. The centroid terms are part of this graph, and they thus can be interpreted and scored by inspecting the terms that surround them in the graph.

Title labels

An alternative to centroid labeling is title labeling. Here, we find the document within the cluster that has the smallest Euclidean distance to the centroid, and use its title as a label for the cluster. One advantage to using document titles is that they provide additional information that would not be present in a list of terms. However, they also have the potential to mislead the user, since one document might not be representative of the entire cluster.

External knowledge labels

Cluster labeling can be done indirectly using external knowledge such as pre-categorized knowledge such as the one of Wikipedia. [5] In such methods, a set of important cluster text features are first extracted from the cluster documents. These features then can be used to retrieve the (weighted) K-nearest categorized documents from which candidates for cluster labels can be extracted. The final step involves the ranking of such candidates. Suitable methods are such that are based on a voting or a fusion process which is determined using the set of categorized documents and the original cluster features.

Combining Several Cluster Labelers

The cluster labels of several different cluster labelers can be further combined to obtain better labels. For example, Linear Regression can be used to learn an optimal combination of labeler scores. [6] A more sophisticated technique is based on a fusion approach and analysis of the cluster labels decision stability of various labelers. [7]

Related Research Articles

<span class="mw-page-title-main">Support vector machine</span> Set of methods for supervised statistical learning

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve high accuracy levels.

Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power. These activities can be viewed as two facets of the same field of application, and they have undergone substantial development over the past few decades.

<span class="mw-page-title-main">Logistic regression</span> Statistical model for a binary dependent variable

In statistics, the logistic model is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression analysis, logistic regression is estimating the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

The information bottleneck method is a technique in information theory introduced by Naftali Tishby, Fernando C. Pereira, and William Bialek. It is designed for finding the best tradeoff between accuracy and complexity (compression) when summarizing a random variable X, given a joint probability distribution p(X,Y) between X and an observed relevant variable Y - and self-described as providing "a surprisingly rich framework for discussing a variety of problems in signal processing and learning".

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.

<span class="mw-page-title-main">Dirichlet distribution</span> Probability distribution

In probability and statistics, the Dirichlet distribution, often denoted , is a family of continuous multivariate probability distributions parameterized by a vector of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD). Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact, the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.

In probability theory and statistics, the Rademacher distribution is a discrete probability distribution where a random variate X has a 50% chance of being +1 and a 50% chance of being -1.

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances, but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.

Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.

<span class="mw-page-title-main">Jaccard index</span> Measure of similarity and diversity between sets

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is frequently referred to as the Critical Success Index in meteorology. It was later developed independently by Paul Jaccard, originally giving the French name coefficient de communauté, and independently formulated again by T. Tanimoto. Thus, the Tanimoto index or Tanimoto coefficient are also used in some fields. However, they are identical in generally taking the ratio of Intersection over Union. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

<span class="mw-page-title-main">Softmax function</span> Smooth approximation of one-hot arg max

The softmax function, also known as softargmax or normalized exponential function, converts a vector of K real numbers into a probability distribution of K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes, based on Luce's choice axiom.

In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution. It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector , and an observation drawn from a multinomial distribution with probability vector p and number of trials n. The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data is collected. The compounding corresponds to a Pólya urn scheme. It is frequently encountered in Bayesian statistics, machine learning, empirical Bayes methods and classical statistics as an overdispersed multinomial distribution.

In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.

Ranking of query is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

The Rocchio algorithm is based on a method of relevance feedback found in information retrieval systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio algorithm was developed using the vector space model. Its underlying assumption is that most users have a general conception of which documents should be denoted as relevant or irrelevant. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the search engine's recall, and possibly the precision as well. The number of relevant and irrelevant documents allowed to enter a query is dictated by the weights of the a, b, c variables listed below in the Algorithm section.

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity probable and feasible.

<span class="mw-page-title-main">Nearest centroid classifier</span>

In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation. When applied to text classification using word vectors containing tf*idf weights to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback.

References

  1. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge UP, 2008. Cluster Labeling. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/cluster-labeling-1.html>.
  2. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge UP, 2008. Mutual Information. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html>.
  3. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge UP, 2008. Chi2 Feature Selection. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html>.
  4. Francois Role, Moahmed Nadif. Beyond cluster labeling: Semantic interpretation of clusters’ contents using a graph representation. Knowledge-Based Systems, Volume 56, January, 2014: 141-155
  5. David Carmel, Haggai Roitman, Naama Zwerdling. Enhancing cluster labeling using wikipedia. SIGIR 2009: 139-146
  6. David Carmel, Haggai Roitman, Naama Zwerdling. Enhancing cluster labeling using wikipedia. SIGIR 2009: 139-146
  7. Haggai Roitman, Shay Hummel, Michal Shmueli-Scheuer. A fusion approach to cluster labeling. SIGIR 2014: 883-886