Gensim

Original author(s): Radim Řehůřek
Developer(s): RARE Technologies Ltd.
Initial release: 2009
Stable release: 4.3.2 [1] / 24 August 2023
Repository: github.com/RaRe-Technologies/gensim
Written in: Python
Operating system: Linux, Windows, macOS
Type: Information retrieval
License: LGPL
Website: radimrehurek.com/gensim/

Gensim is an open-source library for unsupervised topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities, using modern statistical machine learning.


Gensim is implemented in Python and Cython for performance. It is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning packages that target only in-memory processing.
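As a minimal sketch of this streaming pattern (the file corpus.txt and its one-document-per-line layout are assumptions for illustration), a corpus object can yield one bag-of-words vector at a time, so the full collection never has to fit in RAM:

```python
from gensim import corpora, models

class StreamedCorpus:
    """Iterate over a file lazily, yielding one bag-of-words
    document at a time instead of loading everything into RAM."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:  # hypothetical layout: one document per line
                yield self.dictionary.doc2bow(line.lower().split())

# Build the token -> id mapping in one streamed pass over the file.
dictionary = corpora.Dictionary(
    line.lower().split()
    for line in open("corpus.txt", encoding="utf-8")
)

# TfidfModel consumes the corpus as a stream, one document at a time.
tfidf = models.TfidfModel(StreamedCorpus("corpus.txt", dictionary))
```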

Main Features

Gensim includes streamed, parallelized implementations of fastText,[2] word2vec and doc2vec,[3] as well as latent semantic analysis (LSA, LSI, SVD), non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), tf-idf and random projections.[4]
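A short sketch of the typical bag-of-words-to-LDA workflow, using three invented toy documents (real use needs a much larger corpus):

```python
from gensim import corpora, models

# Toy stand-ins for real documents.
docs = [
    "human interface computer",
    "survey user computer system response time",
    "graph trees network",
]
texts = [doc.split() for doc in docs]

dictionary = corpora.Dictionary(texts)                 # token -> integer id
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

# Online LDA; further batches can be folded in later with lda.update().
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())
```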

Some of the novel online algorithms in Gensim were also published in Scalability of Semantic Analysis in Natural Language Processing, the 2011 PhD dissertation of Radim Řehůřek, the creator of Gensim.[5]

Uses of Gensim

The Gensim library has been used and cited in over 1,400 commercial and academic applications as of 2018,[6] in a diverse array of disciplines from medicine to insurance claim analysis to patent search.[7] The software has been covered in several news articles, podcasts and interviews.[8][9][10]

Free and Commercial Support

The open-source code is developed and hosted on GitHub,[11] and public support forums are maintained on Google Groups[12] and Gitter.[13]

Gensim is commercially supported by the company rare-technologies.com, which also provides student mentorships and academic thesis projects for Gensim via its Student Incubator programme.[14]

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

<span class="mw-page-title-main">Orange (software)</span>

Orange is an open-source data visualization, machine learning and data mining toolkit. It features a visual programming front-end for explorative qualitative data analysis and interactive data visualization.

<span class="mw-page-title-main">Weka (software)</span> Suite of machine learning software written in Java

Waikato Environment for Knowledge Analysis (Weka) is a collection of free machine learning and data analysis software licensed under the GNU General Public License. It was developed at the University of Waikato, New Zealand, and is the companion software to the book Data Mining: Practical Machine Learning Tools and Techniques.

<span class="mw-page-title-main">Distributional semantics</span> Field of linguistics

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.

The outline of natural-language processing is an overview of and topical guide to the field.

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

In natural language processing (NLP), a word embedding is a representation of a word used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word such that words closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of words and their usage in context. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that they capture the semantic and syntactic qualities of words; as such, a simple mathematical function can indicate the level of semantic similarity between the words represented by those vectors.
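As an illustrative sketch, Gensim's Word2Vec class can be trained on a (here invented, far too small) corpus and then queried for neighbours and similarities in the learned vector space:

```python
from gensim.models import Word2Vec

# Invented toy corpus; real training needs millions of words.
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["dog", "chases", "the", "ball"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# Nearest neighbours of a word in the learned vector space.
print(model.wv.most_similar("king", topn=3))

# Cosine similarity between two word vectors.
print(model.wv.similarity("king", "queen"))
```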

Apache SystemDS is an open source ML system for the end-to-end data science lifecycle.

GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. It is developed as an open-source project at Stanford and was launched in 2014. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families: global matrix factorization and local context window methods.
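Pretrained GloVe vectors are distributed as plain-text files without the word2vec header line; a sketch of loading such a file into Gensim (the filename is a placeholder for a file from the Stanford GloVe release, and no_header=True assumes Gensim 4.x):

```python
from gensim.models import KeyedVectors

# Placeholder filename; download glove.6B.zip from the Stanford
# GloVe project and unzip it first.
glove = KeyedVectors.load_word2vec_format(
    "glove.6B.100d.txt", binary=False, no_header=True
)
print(glove.most_similar("frog", topn=5))
```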

The outline of machine learning is an overview of and topical guide to the field.

A latent space, also known as a latent feature space or embedding space, is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another. Position within the latent space can be viewed as being defined by a set of latent variables that emerge from the resemblances among the objects.

References

  1. "Release 4.3.2". 24 August 2023. Retrieved 18 September 2023.
  2. Scalable *2vec training
  3. Deep learning with word2vec and Gensim
  4. Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks
  5. Řehůřek, Radim (2011). "Scalability of Semantic Analysis in Natural Language Processing" (PDF). Retrieved 27 January 2015. my open-source gensim software package that accompanies this thesis
  6. Gensim academic citations
  7. Commercial adopters of Gensim
  8. Podcast.__init__ episode #71 on Gensim
  9. Interview with Radim Řehůřek, creator of Gensim
  10. "DecisionStats Interview Radim Řehůřek Gensim #python". 8 December 2015.
  11. Gensim source code on GitHub
  12. Gensim mailing list on Google Groups
  13. Gensim chat room on Gitter
  14. Gensim open source Incubator