Pachinko allocation

Last updated January 01, 2025

In machine learning and natural language processing, the pachinko allocation model (PAM) is a topic model. Topic models are a suite of algorithms for uncovering the hidden thematic structure of a collection of documents.^[1] PAM improves upon earlier topic models such as latent Dirichlet allocation (LDA) by modeling correlations between topics in addition to the word correlations that constitute topics, which gives it more flexibility and greater expressive power than LDA.^[2] Although it was first described and implemented in the context of natural language processing, the model may have applications in other fields such as bioinformatics. It is named for pachinko machines, a game popular in Japan in which metal balls bounce down through a dense array of pins until they land in bins at the bottom.^[3]

History

Pachinko allocation was first described by Wei Li and Andrew McCallum in 2006.^[3] The idea was extended to hierarchical pachinko allocation by Li, McCallum, and David Mimno in 2007.^[4] Also in 2007, Li, David Blei, and McCallum proposed a nonparametric Bayesian prior for PAM based on a variant of the hierarchical Dirichlet process (HDP).^[2] The model has been implemented in the MALLET software package published by McCallum's group at the University of Massachusetts Amherst.

Model

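PAM connects the words of a vocabulary V and a set of topics T with an arbitrary directed acyclic graph (DAG) in which topic nodes occupy the interior levels and words are the leaves. Each topic is therefore a distribution over its children, which may be words or other topics.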
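The DAG structure can be illustrated with a minimal generative sketch. The toy graph, the node names, the function names, and the symmetric Dirichlet concentration `alpha` below are all hypothetical choices made for illustration; the sketch also assumes that every topic's multinomial over its children is redrawn for each document, which is a simplification rather than a description of any particular published implementation.

```python
import random

# Hypothetical toy DAG: interior nodes are topics, leaves are vocabulary words.
# "sub_2" has two parents, so the structure is a DAG rather than a tree.
DAG = {
    "root": ["super_A", "super_B"],
    "super_A": ["sub_1", "sub_2"],
    "super_B": ["sub_2", "sub_3"],
    "sub_1": ["gene", "protein"],
    "sub_2": ["protein", "model"],
    "sub_3": ["model", "topic"],
}


def sample_dirichlet(alpha, rng):
    """Draw one point from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_document(dag, n_words, rng, alpha=1.0):
    """Generate one document as a list of words plus the topic path of each word."""
    # Assumption: one multinomial over children per topic node, drawn per document
    # from a symmetric Dirichlet with concentration `alpha`.
    theta = {
        topic: sample_dirichlet([alpha] * len(children), rng)
        for topic, children in dag.items()
    }
    words, paths = [], []
    for _ in range(n_words):
        node, path = "root", ["root"]
        while node in dag:  # keep descending until a leaf (a word) is reached
            node = rng.choices(dag[node], weights=theta[node], k=1)[0]
            path.append(node)
        words.append(node)
        paths.append(path)
    return words, paths


rng = random.Random(0)
words, paths = sample_document(DAG, n_words=20, rng=rng)
```

Each word is produced by a walk from the root through topic nodes to a leaf, much like a ball dropping through the pins of a pachinko machine. Because a sub-topic can be reached from more than one super-topic, co-occurrence of sub-topics within a document is captured by the nodes above them.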

The probability of generating a whole corpus is the product of the probabilities for every document:^[3]

$P(\mathbf{D} \mid \alpha) = \prod_{d} P(d \mid \alpha)$
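In practice a product over many documents is usually evaluated in log space, where it becomes a sum. The sketch below assumes the per-document log-probabilities have already been computed by some inference procedure; the numbers and the function name `corpus_log_likelihood` are hypothetical and serve only to show the arithmetic.

```python
import math


def corpus_log_likelihood(doc_log_probs):
    """Return log P(D | alpha) as the sum of the per-document log P(d | alpha)."""
    return math.fsum(doc_log_probs)


# Hypothetical per-document log-probabilities (illustrative values only).
doc_log_probs = [-1050.2, -987.6, -1203.9]
total = corpus_log_likelihood(doc_log_probs)

# Exponentiating even a single term underflows to 0.0 in double precision,
# which is why the product itself is not computed directly.
underflowed = math.exp(doc_log_probs[0])
```

Summing logarithms gives the same ranking of parameter settings as the product, since the logarithm is monotonic.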

References

[1] Blei, David. "Topic modeling". Archived from the original on 2 October 2012. Retrieved 4 October 2012.
[2] Li, Wei; Blei, David; McCallum, Andrew (2007). "Nonparametric Bayes Pachinko Allocation". Twenty-Third Conference on Uncertainty in Artificial Intelligence. arXiv:1206.5270.
[3] Li, Wei; McCallum, Andrew (2006). "Pachinko allocation: DAG-structured mixture models of topic correlations" (PDF). Proceedings of the 23rd international conference on Machine learning - ICML '06. pp. 577–584. doi:10.1145/1143844.1143917. ISBN 1595933832. S2CID 13160178.
[4] Mimno, David; Li, Wei; McCallum, Andrew (2007). "Mixtures of hierarchical topics with Pachinko allocation" (PDF). Proceedings of the 24th international conference on Machine learning. pp. 633–640. doi:10.1145/1273496.1273576. ISBN 9781595937933. S2CID 6045658.
[5] Hofmann, Thomas (1999). "Probabilistic Latent Semantic Indexing" (PDF). Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval. Archived from the original (PDF) on 14 December 2010.
[6] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I; Lafferty, John (January 2003). "Latent Dirichlet allocation". Journal of Machine Learning Research. 3: 993–1022. Archived from the original on 1 May 2012. Retrieved 19 July 2010.

External links

Mixtures of Hierarchical Topics with Pachinko Allocation, a video recording of David Mimno presenting HPAM in 2007.
