FastText

fastText
Developer(s): Facebook's AI Research (FAIR) lab [1]
Initial release: November 9, 2015
Stable release: 0.9.2 [2] / April 28, 2020
Repository: github.com/facebookresearch/fastText
Written in: C++, Python
Platform: Linux, macOS, Windows
Type: Machine learning library
License: MIT License
Website: fasttext.cc

fastText is a library for learning word embeddings and text classification created by Facebook's AI Research (FAIR) lab. [3] [4] [5] [6] The library allows one to train either an unsupervised or a supervised learning algorithm for obtaining vector representations of words. Facebook makes pretrained models available for 294 languages. [7] [8] Several papers describe the techniques used by fastText. [9] [10] [11] [12]
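The following minimal Python sketch shows typical use of the library's official Python bindings; the training-file names are placeholders, and the options shown (such as model="skipgram") are only one possible configuration, not a prescription from the sources above.

```python
# A rough sketch of typical fastText usage via its Python bindings
# (file names are placeholders; the training files must exist locally).
import fasttext

# Unsupervised training: learn skip-gram word vectors from raw text.
embedding_model = fasttext.train_unsupervised("corpus.txt", model="skipgram")

# Because fastText builds vectors from character n-grams, it can return
# a vector even for a word that never appeared in the training corpus.
vec = embedding_model.get_word_vector("unseenword")
print(vec.shape)

# Supervised training: text classification on lines formatted as
# "__label__<label> <text>".
classifier = fasttext.train_supervised("labels.train")
labels, probabilities = classifier.predict("which dish should I bake this in?")
print(labels, probabilities)
```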


Related Research Articles

Morphological parsing, in natural language processing, is the process of determining the morphemes from which a given word is constructed. It must be able to distinguish between orthographic rules and morphological rules. For example, the word 'foxes' can be decomposed into 'fox' and 'es'.
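As a toy illustration of that decomposition (not a real morphological parser, which would typically rely on finite-state rules), a short sketch might separate the plural suffix while undoing the orthographic rule that inserts an 'e' after a sibilant:

```python
# Toy illustration only: strip an English plural suffix while undoing the
# orthographic rule that inserts "e" after sibilants (fox + -s -> foxes).
def parse_plural(word: str):
    if word.endswith("es") and word[:-2].endswith(("x", "s", "z", "ch", "sh")):
        return word[:-2], "-s"   # orthographic "e" removed: foxes -> fox + -s
    if word.endswith("s"):
        return word[:-1], "-s"   # plain plural: cats -> cat + -s
    return word, None            # no plural suffix found

print(parse_plural("foxes"))  # ('fox', '-s')
print(parse_plural("cats"))   # ('cat', '-s')
```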


Maluuba is a Canadian technology company conducting research in artificial intelligence and language understanding. Founded in 2011, the company was acquired by Microsoft in 2017.


In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.
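As an illustration of "closer in the vector space", the sketch below compares made-up three-dimensional vectors using cosine similarity; real embeddings typically have hundreds of dimensions and are learned from large corpora.

```python
# Illustrative only: tiny made-up embeddings; real word vectors usually
# have 100-300+ dimensions and are learned from large corpora.
import numpy as np

embeddings = {
    "cat": np.array([0.9, 0.1, 0.3]),
    "dog": np.array([0.8, 0.2, 0.35]),
    "car": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: similar meaning
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # lower: less related
```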

Adversarial machine learning is the study of attacks on machine learning algorithms, and of the defenses against such attacks. A May 2020 survey found that practitioners report a dire need for better protection of machine learning systems in industrial applications.

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of words and their usage in context. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that they capture the semantic and syntactic qualities of words; as such, a simple mathematical function such as cosine similarity can indicate the level of semantic similarity between the words represented by those vectors.
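A brief training sketch using the third-party Gensim library (a toolkit chosen here purely for illustration, not named in the sources above) on a toy corpus; meaningful vectors require a far larger corpus.

```python
# Sketch using the Gensim library's Word2Vec implementation (an assumption;
# the article above does not prescribe a particular toolkit).
from gensim.models import Word2Vec

# A toy corpus of pre-tokenized sentences; real training needs far more text.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

# sg=1 selects the skip-gram variant; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["king"].shape)                 # the learned 50-dimensional vector
print(model.wv.most_similar("king", topn=2))  # nearest words by cosine similarity
```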


In artificial intelligence, a differentiable neural computer (DNC) is a memory-augmented neural network (MANN) architecture, which is typically recurrent in its implementation. The model was published in 2016 by Alex Graves et al. of DeepMind.

Matroid, Inc. is a computer vision company that offers a platform for creating computer vision models, called detectors, to search visual media for objects, persons, events, emotions, and actions. Matroid provides real-time notifications once the object of interest has been detected, as well as the ability to search past events.

Semantic spaces in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: Vocabulary mismatch and ambiguity of natural language.

In natural language processing, a sentence embedding refers to a numeric representation of a sentence in the form of a vector of real numbers which encodes meaningful semantic information.
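One simple baseline, among many approaches, is to average the word vectors of a sentence's tokens; the sketch below uses made-up vectors purely for illustration.

```python
# One simple baseline for a sentence embedding: average the word vectors
# of the tokens. Vectors here are made up; any trained embeddings would do.
import numpy as np

word_vectors = {
    "the": np.array([0.1, 0.0, 0.2]),
    "cat": np.array([0.9, 0.1, 0.3]),
    "sleeps": np.array([0.4, 0.6, 0.1]),
}

def sentence_embedding(tokens):
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3)

print(sentence_embedding(["the", "cat", "sleeps"]))
```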


Tomáš Mikolov is a Czech computer scientist working in the field of machine learning. In March 2020, Mikolov became a senior research scientist at the Czech Institute of Informatics, Robotics and Cybernetics.


A transformer is a deep learning architecture based on the multi-head attention mechanism. It is notable for not containing any recurrent units, and thus requires less training time than earlier recurrent architectures such as long short-term memory (LSTM); its later variants have been widely adopted for training large language models on large datasets such as the Wikipedia corpus and Common Crawl. Input text is split into n-grams encoded as tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. Although the transformer paper was published in 2017, the softmax-based attention mechanism had already been proposed in 2014 for machine translation, and the Fast Weight Controller, similar to a transformer, was proposed in 1992.
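The core operation of the multi-head attention mechanism described above is scaled dot-product attention; a single-head sketch follows, with random matrices standing in for the learned projections and a toy sequence of four tokens.

```python
# Single-head scaled dot-product attention over a toy sequence of 4 tokens.
# The projection matrices are random stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

x = rng.normal(size=(seq_len, d_model))        # token embeddings
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_k)                # similarity of each query to each key
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax: "soft" attention weights
output = weights @ V                           # each token as a weighted mix of values

print(weights.round(2))   # rows sum to 1: how much each token attends to the others
print(output.shape)       # (4, 8): contextualized token representations
```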

Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state-of-the-art models. It was introduced in October 2018 by researchers at Google. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."

Seq2seq is a family of machine learning approaches used for natural language processing. Applications include language translation, image captioning, conversational models, and text summarization. Seq2seq uses sequence transformation: it turns one sequence into another sequence.

Machine learning-based attention is a mechanism which intuitively mimics cognitive attention. It calculates "soft" weights for each word, or more precisely for its embedding, in the context window. These weights can be computed either in parallel or sequentially. "Soft" weights can change during each runtime, in contrast to "hard" weights, which are (pre-)trained and fine-tuned and remain frozen afterwards.

Prompt engineering is the process of structuring text that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.

Meta AI is an artificial intelligence laboratory that belongs to Meta Platforms Inc. Meta AI intends to develop various forms of artificial intelligence, improving augmented and artificial reality technologies. Meta AI is an academic research laboratory focused on generating knowledge for the AI community. This is in contrast to Facebook's Applied Machine Learning (AML) team, which focuses on practical applications of its products.

Lê Viết Quốc, or in romanized form Quoc Viet Le, is a Vietnamese-American computer scientist and a machine learning pioneer at Google Brain, which he established with others from Google. He co-invented the doc2vec and seq2seq models in natural language processing. Le also initiated and led the AutoML initiative at Google Brain, including the proposal of neural architecture search.

Devi Parikh is an American computer scientist.

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks following a transformer architecture.

References

  1. Mannes, John. "Facebook's fastText library is now optimized for mobile". TechCrunch. Retrieved 12 January 2018.
  2. Onur Çelebi (2020-04-28). "facebookresearch/fastText/releases/tag/v0.9.2". Facebook. Retrieved 2020-11-21.
  3. Mannes, John. "Facebook's fastText library is now optimized for mobile". TechCrunch. Retrieved 12 January 2018.
  4. Ryan, Kevin J. "Facebook's New Open Source Software Can Learn 1 Billion Words in 10 Minutes". Inc. Retrieved 12 January 2018.
  5. Low, Cherlynn. "Facebook is open-sourcing its AI bot-building research". Engadget. Retrieved 12 January 2018.
  6. Mannes, John. "Facebook's Artificial Intelligence Research lab releases open source fastText on GitHub". TechCrunch. Retrieved 12 January 2018.
  7. Sabin, Dyani. "Facebook Makes A.I. Program Available in 294 Languages". Inverse. Retrieved 12 January 2018.
  8. "Wiki word vectors". fastText. Retrieved 26 November 2020.
  9. "References · fastText". fasttext.cc. Retrieved 2021-09-08.
  10. Bojanowski, Piotr; Grave, Edouard; Joulin, Armand; Mikolov, Tomas (2017-06-19). "Enriching Word Vectors with Subword Information". arXiv:1607.04606 [cs.CL].
  11. Joulin, Armand; Grave, Edouard; Bojanowski, Piotr; Mikolov, Tomas (2016-08-09). "Bag of Tricks for Efficient Text Classification". arXiv:1607.01759 [cs.CL].
  12. Joulin, Armand; Grave, Edouard; Bojanowski, Piotr; Douze, Matthijs; Jégou, Hervé; Mikolov, Tomas (2016-12-12). "FastText.zip: Compressing text classification models". arXiv:1612.03651 [cs.CL].