Sentence embedding

In natural language processing, a sentence embedding is a representation of a sentence as a vector of real numbers that encodes meaningful semantic information. [1] [2] [3] [4] [5] [6] [7] [8]

State-of-the-art embeddings are based on the learned hidden-layer representations of dedicated sentence transformer models. BERT pioneered an approach involving the use of a dedicated [CLS] token prepended to the beginning of each sentence input to the model; the final hidden state vector of this token encodes information about the sentence and can be fine-tuned for use in sentence classification tasks. In practice, however, BERT's sentence embedding with the [CLS] token achieves poor performance, often worse than simply averaging non-contextual word embeddings. SBERT later achieved superior sentence embedding performance [9] by fine-tuning BERT's [CLS] token embeddings with a siamese neural network architecture on the SNLI dataset.
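
As a rough illustration, the following sketch extracts the [CLS] hidden state as a sentence embedding and contrasts it with simple mean pooling. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; these choices are illustrative, not prescribed by the methods above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any BERT-style encoder would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "Sentence embeddings encode meaning as vectors."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The tokenizer prepends [CLS], so its final hidden state is at position 0.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)

# Alternative baseline: mean-pool the token states, which often outperforms
# the raw [CLS] vector when the model has not been fine-tuned.
mask = inputs["attention_mask"].unsqueeze(-1)
mean_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
```

Without fine-tuning, both vectors are only crude sentence representations; SBERT-style training is what makes them reliable for similarity comparisons.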

Other approaches are loosely based on the idea of distributional semantics applied to sentences. Skip-Thought trains an encoder-decoder structure for the task of predicting neighboring sentences, though this has been shown to achieve worse performance than approaches such as InferSent or SBERT.

An alternative direction is to aggregate word embeddings, such as those returned by Word2vec, into sentence embeddings. The most straightforward approach is to simply average the word vectors, known as continuous bag-of-words (CBOW). [10] However, more elaborate solutions based on word vector quantization have also been proposed. One such approach is the vector of locally aggregated word embeddings (VLAWE), [11] which demonstrated performance improvements in downstream text classification tasks.
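
A minimal sketch of the averaging baseline, using a toy embedding table in place of a trained Word2vec model (the vocabulary and three-dimensional vectors here are placeholders):

```python
import numpy as np

# Toy word-embedding table; a trained Word2vec model would supply
# vectors with dimensions in the hundreds.
embeddings = {
    "the": np.array([0.1, 0.3, -0.2]),
    "cat": np.array([0.7, -0.1, 0.4]),
    "sat": np.array([-0.3, 0.5, 0.2]),
}

def average_sentence_embedding(sentence: str) -> np.ndarray:
    """Average the vectors of in-vocabulary words (CBOW-style baseline)."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no in-vocabulary words in sentence")
    return np.mean(vectors, axis=0)

print(average_sentence_embedding("The cat sat"))  # 3-dimensional vector
```

Out-of-vocabulary words are simply skipped here; real implementations often use subword fallbacks or weighted means instead of a plain average.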

Applications

In recent years, sentence embedding has seen a growing level of interest due to its applications in natural-language-queryable knowledge bases through the use of vector indexing for semantic search. LangChain, for instance, utilizes sentence transformers for purposes of indexing documents. In particular, an index is built by generating embeddings for chunks of documents and storing (document chunk, embedding) tuples. Then, given a query in natural language, the embedding for the query is generated, and a top-k similarity search between the query embedding and the document chunk embeddings retrieves the most relevant document chunks as context information for question answering tasks. This approach is also known formally as retrieval-augmented generation. [12]
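
A minimal sketch of the top-k retrieval step, assuming the chunk and query embeddings have already been produced by some sentence-embedding model (the shapes and random data below are purely illustrative):

```python
import numpy as np

def top_k_chunks(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 3):
    """Return indices of the k chunks most cosine-similar to the query."""
    # Normalizing rows makes dot products equal to cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    similarities = c @ q                        # shape: (num_chunks,)
    return np.argsort(similarities)[::-1][:k]   # best first

# Illustrative data: 5 document chunks embedded in a 4-dimensional space.
rng = np.random.default_rng(0)
chunk_embeddings = rng.normal(size=(5, 4))
query_embedding = rng.normal(size=4)
print(top_k_chunks(query_embedding, chunk_embeddings, k=2))
```

Production systems typically replace this brute-force scan with an approximate nearest-neighbor index once the number of chunks grows large.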

Though not as predominant as BERTScore, sentence embeddings are commonly used for sentence similarity evaluation, for example when optimizing a large language model's generation parameters by comparing candidate sentences against reference sentences. By using the cosine similarity of the sentence embeddings of candidate and reference sentences as the evaluation function, a grid-search algorithm can be utilized to automate hyperparameter optimization [citation needed].
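
A sketch of such an embedding-based evaluation function; encode stands in for any sentence-embedding model and is a hypothetical callable, not a specific library API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate(candidates, references, encode) -> float:
    """Mean cosine similarity between candidate and reference embeddings."""
    scores = [cosine_similarity(encode(c), encode(r))
              for c, r in zip(candidates, references)]
    return float(np.mean(scores))
```

A grid search would then call evaluate once per combination of generation parameters (temperature, top-p, and so on) and keep the best-scoring configuration.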

Evaluation

A way of testing sentence encodings is to apply them to the Sentences Involving Compositional Knowledge (SICK) corpus [13] for both entailment (SICK-E) and relatedness (SICK-R).

In [14] the best results are obtained using a BiLSTM network trained on the Stanford Natural Language Inference (SNLI) Corpus. The Pearson correlation coefficient for SICK-R is 0.885 and the accuracy for SICK-E is 86.3%. A slight improvement over these scores is presented in [15] (SICK-R: 0.888 and SICK-E: 87.8%) using a concatenation of bidirectional gated recurrent units.

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of computer science and information retrieval. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. To this end, natural language processing often borrows ideas from theoretical linguistics. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.

In natural language processing, textual entailment (TE), also known as natural language inference (NLI), is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text.

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov and colleagues at Google and published in 2013.

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.

Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex.

Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Applications of semantic parsing include machine translation, question answering, ontology induction, automated reasoning, and code generation. The phrase was first used in the 1970s by Yorick Wilks as the basis for machine translation programs working with only semantic representations. Semantic parsing is one of the important tasks in computational linguistics and natural language processing.

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as in semantic parsing and the generation of new samples to expand existing corpora.

Triplet loss is a loss function for machine learning algorithms where a reference input (the anchor) is compared to a matching input (the positive) and a non-matching input (the negative). The distance from the anchor to the positive is minimized, and the distance from the anchor to the negative is maximized. An early formulation equivalent to triplet loss was introduced for metric learning from relative comparisons by M. Schultz and T. Joachims in 2003.

Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages. The library is built on top of Apache Spark and its Spark ML library.

A transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation, and on the Fast Weight Controller, a similar architecture proposed in 1992.

Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state-of-the-art models. It was introduced in October 2018 by researchers at Google. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."

ELMo is a word embedding method for representing a sequence of words as a corresponding sequence of vectors. Character-level tokens are taken as the inputs to a bidirectional LSTM which produces word-level embeddings. Like BERT, ELMo embeddings are context-sensitive, producing different representations for words that share the same spelling but have different meanings (homonyms) such as "bank" in "river bank" and "bank balance".

Machine learning-based attention is a mechanism which intuitively mimics cognitive attention. It calculates "soft" weights for each word, more precisely for its embedding, in the context window. These weights can be computed either in parallel or sequentially. "Soft" weights can change during each runtime, in contrast to "hard" weights, which are (pre-)trained and fine-tuned and remain frozen afterwards.
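
As a rough illustration of how such soft weights are computed in parallel, here is a minimal scaled dot-product attention sketch (the shapes and random data are illustrative only):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Each output row is a soft-weighted mix of the value vectors."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # rows sum to 1, input-dependent
    return weights @ V

# Illustrative: 4 tokens with 8-dimensional query/key/value vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```

The weights here are "soft" in the sense above: they are recomputed from the inputs at every run rather than being frozen, pre-trained parameters.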

A vision transformer (ViT) is a transformer designed for computer vision. A ViT breaks down an input image into a series of patches, serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

Perceiver is a transformer adapted to be able to process non-textual data, such as images, sounds and video, and spatial data. Transformers underlie other notable systems such as BERT and GPT-3, which preceded Perceiver. It adopts an asymmetric attention mechanism to distill inputs into a latent bottleneck, allowing it to learn from large amounts of heterogeneous data. Perceiver matches or outperforms specialized models on classification tasks.

Prompt engineering is the process of structuring text that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.

DisCoCat is a mathematical framework for natural language processing which uses category theory to unify distributional semantics with the principle of compositionality. The grammatical derivations in a categorial grammar are interpreted as linear maps acting on the tensor product of word vectors to produce the meaning of a sentence or a piece of text. String diagrams are used to visualise information flow and reason about natural language semantics.

References

  1. Paper Summary: Evaluation of sentence embeddings in downstream and linguistic probing tasks
  2. Barkan, Oren; Razin, Noam; Malkiel, Itzik; Katz, Ori; Caciularu, Avi; Koenigstein, Noam (2019). "Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding". arXiv: 1908.05161 [cs.LG].
  3. The Current Best of Universal Word Embeddings and Sentence Embeddings
  4. Cer, Daniel; Yang, Yinfei; Kong, Sheng-yi; Hua, Nan; Limtiaco, Nicole; John, Rhomni St.; Constant, Noah; Guajardo-Cespedes, Mario; Yuan, Steve; Tar, Chris; Sung, Yun-Hsuan; Strope, Brian; Kurzweil, Ray (2018). "Universal Sentence Encoder". arXiv: 1803.11175 [cs.CL].
  5. Wu, Ledell; Fisch, Adam; Chopra, Sumit; Adams, Keith; Bordes, Antoine; Weston, Jason (2017). "StarSpace: Embed All the Things!". arXiv: 1709.03856 [cs.CL].
  6. Sanjeev Arora, Yingyu Liang, and Tengyu Ma. "A simple but tough-to-beat baseline for sentence embeddings.", 2016; openreview:SyK00v5xx.
  7. Trifan, Mircea; Ionescu, Bogdan; Gadea, Cristian; Ionescu, Dan (2015). "A graph digital signal processing method for semantic analysis". 2015 IEEE 10th Jubilee International Symposium on Applied Computational Intelligence and Informatics. pp. 187–192. doi:10.1109/SACI.2015.7208196. ISBN 978-1-4799-9911-8. S2CID 17099431.
  8. Basile, Pierpaolo; Caputo, Annalina; Semeraro, Giovanni (2012). "A Study on Compositional Semantics of Words in Distributional Spaces". 2012 IEEE Sixth International Conference on Semantic Computing. pp. 154–161. doi:10.1109/ICSC.2012.55. ISBN 978-1-4673-4433-3. S2CID 552921.
  9. Reimers, Nils; Gurevych, Iryna (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv: 1908.10084 [cs.CL].
  10. Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013-09-06). "Efficient Estimation of Word Representations in Vector Space". arXiv: 1301.3781 [cs.CL].
  11. Ionescu, Radu Tudor; Butnaru, Andrei (2019). "Vector of Locally-Aggregated Word Embeddings (VLAWE): A Novel Document-level Representation". Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics: 363–369. doi:10.18653/v1/N19-1033. S2CID 85500146.
  12. Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman; Küttler, Heinrich; Lewis, Mike; Yih, Wen-tau; Rocktäschel, Tim; Riedel, Sebastian; Kiela, Douwe (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". arXiv: 2005.11401 [cs.CL].
  13. Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. "A SICK cure for the evaluation of compositional distributional semantic models." In LREC, pp. 216–223. 2014.
  14. Conneau, Alexis; Kiela, Douwe; Schwenk, Holger; Barrault, Loic; Bordes, Antoine (2017). "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data". arXiv: 1705.02364 [cs.CL].
  15. Subramanian, Sandeep; Trischler, Adam; Bengio, Yoshua; Pal, Christopher J (2018). "Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning". arXiv: 1804.00079 [cs.CL].