Topic model

In statistics and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
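The mixture intuition above can be made concrete with a small calculation. The following is a minimal sketch, assuming two hypothetical topics whose word probabilities are invented for illustration; the names `TOPICS` and `word_probabilities` are not from any library. It computes the probability of each word in a document as the topic-weighted sum of the per-topic word probabilities.

```python
# Hypothetical toy topics: each maps words to probabilities that sum to 1.
# The numbers are made up for illustration only.
TOPICS = {
    "dogs": {"dog": 0.3, "bone": 0.2, "the": 0.3, "is": 0.2},
    "cats": {"cat": 0.3, "meow": 0.2, "the": 0.3, "is": 0.2},
}


def word_probabilities(mixture):
    """Return p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    probs = {}
    for topic, weight in mixture.items():
        for word, p in TOPICS[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs


# A document that is 90% about dogs and 10% about cats.
doc = word_probabilities({"dogs": 0.9, "cats": 0.1})

dog_mass = doc["dog"] + doc["bone"]  # probability mass on dog-specific words
cat_mass = doc["cat"] + doc["meow"]  # probability mass on cat-specific words
ratio = dog_mass / cat_mass          # roughly 9 in this toy setup
```

In this toy setup the ratio comes out at about 9 only because both topics put the same share of their probability on topic-specific words; function words such as "the" keep the same probability whatever the mixture.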

Topic models are also referred to as probabilistic topic models, a term for statistical algorithms that discover the latent semantic structures of an extensive text body. In the age of information, the amount of written material we encounter each day is simply beyond our processing capacity. Topic models can help organize large collections of unstructured text and offer insights for understanding them. Originally developed as a text-mining tool, topic models have been used to detect instructive structures in data such as genetic information, images, and networks. They also have applications in other fields such as bioinformatics [1] and computer vision. [2]

History

An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998. [3] Another one, called probabilistic latent semantic analysis (PLSA), was created by Thomas Hofmann in 1999. [4] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSA. Developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002, LDA introduces sparse Dirichlet prior distributions over document-topic and topic-word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words. [5] Other topic models are generally extensions of LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Hierarchical latent tree analysis (HLTA) is an alternative to LDA that models word co-occurrence using a tree of latent variables; the states of the latent variables, which correspond to soft clusters of documents, are interpreted as topics.
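The role of the sparse Dirichlet priors in LDA can be illustrated by simulating the generative story that the model assumes. The following is a minimal sketch, not an inference algorithm: it only draws synthetic documents. The function names, the symmetric priors `alpha` and `beta`, and the toy vocabulary are assumptions made for illustration.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha, beta, seed=0):
    """Simulate documents from an LDA-style generative process (illustrative only)."""
    rng = random.Random(seed)
    # Topic-word distributions: a small beta concentrates each topic on few words.
    topic_word = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    docs, doc_topic = [], []
    for _ in range(n_docs):
        # Document-topic proportions: a small alpha means few topics per document.
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]   # pick a topic
            w = rng.choices(vocab, weights=topic_word[z])[0]     # pick a word from it
            words.append(w)
        docs.append(words)
        doc_topic.append(theta)
    return docs, doc_topic, topic_word


vocab = ["dog", "bone", "cat", "meow", "the", "is"]
docs, doc_topic, topic_word = generate_corpus(
    vocab, n_topics=2, n_docs=3, doc_len=20, alpha=0.1, beta=0.1
)
```

With concentration values well below 1, most of the probability mass in each draw tends to fall on a few entries, which is the sparsity the priors are meant to encode. Fitting LDA to real data means inverting this process, which the sketch does not attempt.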

Animation of the topic detection process in a document-word matrix through biclustering. Every column corresponds to a document, every row to a word. A cell stores the frequency of a word in a document, with dark cells indicating high word frequencies. This procedure groups documents that use similar words, as it groups words occurring in a similar set of documents. Such groups of words are then called topics. More usual topic models, such as LDA, only group documents, based on a more sophisticated and probabilistic mechanism.
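The matrix described in the caption can be built in a few lines. The following is a minimal sketch, assuming whitespace tokenization and three made-up sentences as documents. It constructs only the word-by-document count matrix (rows are words, columns are documents) and does not perform the biclustering step itself.

```python
from collections import Counter


def word_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[i] in docs[j]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One row per word, one column per document.
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix


docs = ["the dog chased the bone", "the cat said meow", "dog and cat"]
vocab, matrix = word_document_matrix(docs)
the_row = matrix[vocab.index("the")]  # counts of "the" in each document
dog_row = matrix[vocab.index("dog")]  # counts of "dog" in each document
```

A biclustering method would then reorder the rows and columns of `matrix` so that dense blocks of counts become visible.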

Topic models for context information

Approaches for temporal information include Block and Newman's determination of the temporal dynamics of topics in the Pennsylvania Gazette during 1728–1800. Griffiths & Steyvers used topic modeling on abstracts from the journal PNAS to identify topics that rose or fell in popularity from 1991 to 2001, whereas Lamba & Madhusudhan [6] used topic modeling on full-text research articles retrieved from the DJLIT journal from 1981 to 2018. In the field of library and information science, Lamba & Madhusudhan [6] [7] [8] [9] applied topic modeling to different Indian resources such as journal articles and electronic theses and dissertations (ETDs). Nelson [10] has analyzed change in topics over time in the Richmond Times-Dispatch to understand social and political changes and continuities in Richmond during the American Civil War. Yang, Torget and Mihalcea applied topic modeling methods to newspapers from 1829 to 2008. Mimno used topic modeling with 24 journals on classical philology and archaeology spanning 150 years to look at how topics in the journals change over time and how the journals become more different or similar over time.

Yin et al. [11] introduced a topic model for geographically distributed documents, where document positions are explained by latent regions which are detected during inference.

Chang and Blei [12] included network information between linked documents in the relational topic model, to model the links between websites.

The author-topic model by Rosen-Zvi et al. [13] models the topics associated with authors of documents to improve the topic detection for documents with authorship information.

HLTA was applied to a collection of recent research papers published at major AI and Machine Learning venues. The resulting model is called The AI Tree. The resulting topics are used to index the papers at aipano.cse.ust.hk to help researchers track research trends and identify papers to read, and help conference organizers and journal editors identify reviewers for submissions.

To improve the qualitative aspects and coherency of generated topics, some researchers have explored the efficacy of "coherence scores", that is, measures of how well computer-extracted clusters (i.e. topics) align with a human benchmark. [14] [15] Coherence scores are also used as metrics for optimising the number of topics to extract from a document corpus. [16]
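One family of coherence measures scores a topic by how often its top words occur together in the same documents. The following is a minimal sketch of that idea, using average pairwise pointwise mutual information (PMI) over document-level co-occurrence. The function name, the smoothing constant `eps`, and the toy documents are assumptions for illustration, and this is not necessarily the exact metric used in the cited works.

```python
import math
from itertools import combinations


def pmi_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words over document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for a, b in combinations(top_words, 2):
        joint = prob(a, b)
        # eps keeps the logarithm finite when a pair never co-occurs.
        scores.append(math.log((joint + eps) / (prob(a) * prob(b) + eps)))
    return sum(scores) / len(scores)


documents = [
    ["dog", "bone", "bark"],
    ["dog", "bone"],
    ["cat", "meow"],
    ["cat", "meow", "purr"],
]
coherent = pmi_coherence(["dog", "bone"], documents)  # words that co-occur
mixed = pmi_coherence(["dog", "meow"], documents)     # words that never co-occur
```

A topic whose top words appear together scores higher than one that mixes unrelated words. Comparing average scores across models fitted with different numbers of topics is one way such a score could guide the choice of topic count.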

Algorithms

In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for maximum likelihood fit. A survey by D. Blei describes this suite of algorithms. [17] Several groups of researchers, starting with Papadimitriou et al., [3] have attempted to design algorithms with provable guarantees. Assuming that the data were actually generated by the model in question, they try to design algorithms that provably find the model that was used to create the data. Techniques used here include singular value decomposition (SVD) and the method of moments. In 2012 an algorithm based upon non-negative matrix factorization (NMF) was introduced that also generalizes to topic models with correlations among topics. [18]
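To give a feel for how a matrix factorization can expose topics, the following is a minimal sketch of non-negative matrix factorization using the generic multiplicative update rules for squared error. It is a heuristic illustration and not the provable algorithm of [18]. The toy document-by-word count matrix, the helper names, and the iteration count are assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and the product W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x words) into W (docs x topics) and H (topics x words)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_rows)]
    return w, h


# Toy counts: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
]
w0, h0 = nmf(v, n_topics=2, n_iter=0)  # random initialization only
w, h = nmf(v, n_topics=2)              # after the multiplicative updates
error_before = frobenius_error(v, w0, h0)
error_after = frobenius_error(v, w, h)
```

Each row of `h` can be read as a topic's unnormalized word weights, and each row of `w` as a document's topic weights. The updates keep both factors non-negative, and the reconstruction error does not increase from one iteration to the next.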

In 2017, neural networks were leveraged in topic modeling to speed up inference, [19] an approach that was later extended to a weakly supervised version. [20]

In 2018 a new approach to topic models was proposed, based on the stochastic block model. [21]

With the recent development of large language models (LLMs), topic modeling has leveraged LLMs through contextual embeddings [22] and fine-tuning. [23]

Applications of topic models

To quantitative biomedicine

Topic models are also used in other contexts. For example, uses of topic models in biology and bioinformatics research have emerged. [24] Recently, topic models have been used to extract information from datasets of genomic samples from cancers. [25] In this case, topics are biological latent variables to be inferred.

To analysis of music and creativity

Topic models can be used for analysis of continuous signals like music. For instance, they were used to quantify how musical styles change over time and to identify the influence of specific artists on later music creation. [26]

References

  1. Blei, David (April 2012). "Probabilistic Topic Models". Communications of the ACM. 55 (4): 77–84. doi:10.1145/2133806.2133826. S2CID 753304.
  2. Cao, Liangliang, and Li Fei-Fei. "Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes." 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007.
  3. Papadimitriou, Christos; Raghavan, Prabhakar; Tamaki, Hisao; Vempala, Santosh (1998). "Latent semantic indexing". Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS '98. pp. 159–168. doi:10.1145/275487.275505. ISBN 978-0897919968. S2CID 1479546. Archived from the original (Postscript) on 2013-05-09. Retrieved 2012-04-17.
  4. Hofmann, Thomas (1999). "Probabilistic Latent Semantic Indexing" (PDF). Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval. Archived from the original (PDF) on 2010-12-14.
  5. Blei, David M.; Ng, Andrew Y.; Jordan, Michael I.; Lafferty, John (January 2003). "Latent Dirichlet allocation". Journal of Machine Learning Research. 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.
  6. Lamba, Manika (2019). "Mapping of topics in DESIDOC Journal of Library and Information Technology, India: a study". Scientometrics. 120 (2): 477–505. doi:10.1007/s11192-019-03137-5. ISSN 0138-9130. S2CID 174802673.
  7. Lamba, Manika (2019). "Metadata Tagging and Prediction Modeling: Case Study of DESIDOC Journal of Library and Information Technology (2008-2017)". World Digital Libraries. 12: 33–89. doi:10.18329/09757597/2019/12103 (inactive 1 November 2024). ISSN 0975-7597.
  8. Lamba, Manika (2019). "Author-Topic Modeling of DESIDOC Journal of Library and Information Technology (2008-2017), India". Library Philosophy and Practice.
  9. Lamba, Manika (2018). Metadata Tagging of Library and Information Science Theses: Shodhganga (2013-2017) (PDF). ETD2018:Beyond the boundaries of Rims and Oceans. Taiwan, Taipei.
  10. Nelson, Rob. "Mining the Dispatch". Mining the Dispatch. Digital Scholarship Lab, University of Richmond. Retrieved 26 March 2021.
  11. Yin, Zhijun (2011). "Geographical topic discovery and comparison". Proceedings of the 20th international conference on World wide web. pp. 247–256. doi:10.1145/1963405.1963443. ISBN 9781450306324. S2CID 17883132.
  12. Chang, Jonathan (2009). "Relational Topic Models for Document Networks" (PDF). Aistats. 9: 81–88.
  13. Rosen-Zvi, Michal (2004). "The author-topic model for authors and documents". Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence: 487–494. arXiv:1207.4169.
  14. Nikolenko, Sergey (2017). "Topic modelling for qualitative studies". Journal of Information Science. 43: 88–102. doi:10.1177/0165551515617393. S2CID 30657489.
  15. Reverter-Rambaldi, Marcel (2022). Topic Modelling in Spontaneous Speech Data (Honours thesis). Australian National University. doi:10.25911/M1YF-ZF55.
  16. Newman, David (2010). "Automatic evaluation of topic coherence". Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics: 100–108.
  17. Blei, David M. (April 2012). "Introduction to Probabilistic Topic Models" (PDF). Comm. ACM. 55 (4): 77–84. doi:10.1145/2133806.2133826. S2CID 753304.
  18. Sanjeev Arora; Rong Ge; Ankur Moitra (April 2012). "Learning Topic Models—Going beyond SVD". arXiv: 1204.1956 [cs.LG].
  19. Miao, Yishu; Grefenstette, Edward; Blunsom, Phil (2017). "Discovering Discrete Latent Topics with Neural Variational Inference". Proceedings of the 34th International Conference on Machine Learning. PMLR: 2410–2419. arXiv:1706.00359.
  20. Xu, Weijie; Jiang, Xiaoyu; Sengamedu Hanumantha Rao, Srinivasan; Iannacci, Francis; Zhao, Jinjin (2023). "vONTSS: vMF based semi-supervised neural topic modeling with optimal transport". Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg, PA, USA: Association for Computational Linguistics: 4433–4457. arXiv:2307.01226. doi:10.18653/v1/2023.findings-acl.271.
  21. Martin Gerlach; Tiago Peixoto; Eduardo Altmann (2018). "A network approach to topic models". Science Advances. 4 (7): eaaq1360. arXiv:1708.01677. Bibcode:2018SciA....4.1360G. doi:10.1126/sciadv.aaq1360. PMC 6051742. PMID 30035215.
  22. Bianchi, Federico; Terragni, Silvia; Hovy, Dirk (2021). "Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence". Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 759–766. doi:10.18653/v1/2021.acl-short.96.
  23. Xu, Weijie; Hu, Wenxiang; Wu, Fanyou; Sengamedu, Srinivasan (2023). "DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM". Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA, USA: Association for Computational Linguistics: 9040–9057. arXiv:2310.15296. doi:10.18653/v1/2023.findings-emnlp.606.
  24. Liu, L.; Tang, L.; et al. (2016). "An overview of topic modeling and its current applications in bioinformatics". SpringerPlus. 5 (1): 1608. doi:10.1186/s40064-016-3252-8. PMC 5028368. PMID 27652181. S2CID 16712827.
  25. Valle, F.; Osella, M.; Caselle, M. (2020). "A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data". Cancers. 12 (12): 3799. doi:10.3390/cancers12123799. PMC 7766023. PMID 33339347. S2CID 229325007.
  26. Shalit, Uri; Weinshall, Daphna; Chechik, Gal (2013-05-13). "Modeling Musical Influence with Topic Models". Proceedings of the 30th International Conference on Machine Learning. PMLR: 244–252.