Paraphrasing (computational linguistics)

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection. [1] Paraphrasing is also useful in the evaluation of machine translation, [2] as well as in semantic parsing [3] and the generation [4] of new samples to expand existing corpora. [5]

Paraphrase generation

Multiple sequence alignment

Barzilay and Lee [5] proposed a method to generate paraphrases through the use of monolingual parallel corpora, namely news articles covering the same event on the same day. Training consists of using multiple-sequence alignment to generate sentence-level paraphrases from an unannotated corpus.

This is achieved by first clustering similar sentences together using n-gram overlap. Recurring patterns are then found within each cluster using multiple-sequence alignment. The positions of argument words are determined by finding areas of high variability within each cluster, that is, the slots between words shared by more than 50% of a cluster's sentences. Pairings between patterns from different corpora are then found by comparing the variable words they share. Finally, new paraphrases can be generated by choosing a matching cluster for a source sentence, then substituting the source sentence's arguments into any number of patterns in the cluster.
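
The initial clustering step can be illustrated with a short sketch. The following is a minimal, hypothetical implementation of grouping sentences by word n-gram overlap; the Jaccard-style overlap measure, the threshold, the n-gram order, and the greedy single-pass strategy are illustrative assumptions rather than the exact procedure of Barzilay and Lee.

```python
def ngrams(sentence, n=3):
    """Return the set of word n-grams in a sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard-style overlap between the n-gram sets of two sentences."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def cluster_sentences(sentences, threshold=0.3, n=3):
    """Greedy single-pass clustering: a sentence joins the first cluster
    whose representative sentence it overlaps with above the threshold."""
    clusters = []
    for s in sentences:
        for cluster in clusters:
            if ngram_overlap(s, cluster[0], n) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```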

Phrase-based machine translation

Paraphrases can also be generated through the use of phrase-based translation, as proposed by Bannard and Callison-Burch. [6] The chief concept consists of aligning phrases in a pivot language to produce potential paraphrases in the original language. For example, the phrase "under control" in an English sentence is aligned with the phrase "unter Kontrolle" in its German counterpart. The phrase "unter Kontrolle" is then found in another German sentence with the aligned English phrase being "in check", a paraphrase of "under control".

The probability distribution can be modeled as $\Pr(e_2 \mid e_1)$, the probability that phrase $e_2$ is a paraphrase of phrase $e_1$, which is equivalent to $\Pr(e_2 \mid f)\Pr(f \mid e_1)$ summed over all $f$, a potential phrase translation in the pivot language. Additionally, the sentence $S$ containing $e_1$ is added as a prior to add context to the paraphrase. Thus the optimal paraphrase, $\hat{e}_2$, can be modeled as:

$$\hat{e}_2 = \operatorname*{arg\,max}_{e_2 \neq e_1} \Pr(e_2 \mid e_1, S) = \operatorname*{arg\,max}_{e_2 \neq e_1} \sum_{f} \Pr(e_2 \mid f, S)\,\Pr(f \mid e_1, S)$$

$\Pr(e_2 \mid f)$ and $\Pr(f \mid e_1)$ can be approximated by simply taking their frequencies in the aligned corpus. Adding $S$ as a prior is modeled by calculating the probability of forming the sentence $S$ when $e_1$ is substituted with $e_2$.
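
The core pivoting computation, without the sentence prior $S$, can be sketched as follows. The phrase pairs and counts are invented toy data, and relative-frequency estimation stands in for the full translation-model probabilities described in the paper.

```python
from collections import Counter, defaultdict

# Toy phrase-alignment counts extracted from a bilingual corpus.
# Keys are (english_phrase, german_phrase) pairs; values are how often
# the two phrases were aligned. The numbers are invented for illustration.
alignment_counts = Counter({
    ("under control", "unter kontrolle"): 8,
    ("in check", "unter kontrolle"): 3,
    ("under control", "im griff"): 2,
    ("in check", "im griff"): 1,
})

def conditional_probs(counts):
    """Estimate p(f | e) and p(e | f) by relative frequency."""
    e_totals, f_totals = defaultdict(int), defaultdict(int)
    for (e, f), c in counts.items():
        e_totals[e] += c
        f_totals[f] += c
    p_f_given_e = {(f, e): c / e_totals[e] for (e, f), c in counts.items()}
    p_e_given_f = {(e, f): c / f_totals[f] for (e, f), c in counts.items()}
    return p_f_given_e, p_e_given_f

def paraphrase_prob(e1, e2, counts):
    """p(e2 | e1) = sum over pivot phrases f of p(e2 | f) * p(f | e1)."""
    p_f_given_e, p_e_given_f = conditional_probs(counts)
    pivots = {f for (e, f) in counts if e == e1}
    return sum(p_e_given_f.get((e2, f), 0.0) * p_f_given_e.get((f, e1), 0.0)
               for f in pivots)

print(paraphrase_prob("under control", "in check", alignment_counts))
```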

Long short-term memory

There has been success in using long short-term memory (LSTM) models to generate paraphrases. [7] In short, the model consists of an encoder and a decoder, both implemented using variations of a stacked residual LSTM. First, the encoding LSTM takes a one-hot encoding of the words in a sentence as input and produces a final hidden vector, which represents the input sentence. The decoding LSTM takes the hidden vector as input and generates a new sentence, terminating in an end-of-sentence token. The encoder and decoder are trained to take a phrase and reproduce the one-hot distribution of a corresponding paraphrase by minimizing perplexity using simple stochastic gradient descent. New paraphrases are generated by inputting a new phrase to the encoder and passing its output to the decoder.
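
A minimal sketch of such an encoder-decoder LSTM is shown below, assuming PyTorch. The stacked residual connections of the cited work are omitted for brevity, and the vocabulary size, dimensions, and training data are hypothetical.

```python
import torch
import torch.nn as nn

class Seq2SeqLSTM(nn.Module):
    """Minimal encoder-decoder LSTM; the residual connections between
    stacked layers used in the cited work are omitted for brevity."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence into final hidden/cell states.
        _, state = self.encoder(self.embed(src_ids))
        # Decode the paraphrase conditioned on the encoder state
        # (teacher forcing: the gold paraphrase tokens are the decoder input).
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # logits over the vocabulary at each position

# Hypothetical usage: a batch of 4 sentence pairs, each 10 tokens long.
model = Seq2SeqLSTM(vocab_size=10000)
src = torch.randint(0, 10000, (4, 10))
tgt = torch.randint(0, 10000, (4, 10))
logits = model(src, tgt)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), tgt.reshape(-1))
loss.backward()  # minimizing cross-entropy is equivalent to minimizing perplexity
```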

Transformers

With the introduction of Transformer models, paraphrase generation approaches improved their ability to generate text by scaling neural network parameters and heavily parallelizing training through feed-forward layers. [8] These models are so fluent in generating text that human experts often cannot identify whether an example was human-authored or machine-generated. [9] Transformer-based paraphrase generation relies on autoencoding, autoregressive, or sequence-to-sequence (seq2seq) methods. Autoencoding models predict word replacement candidates with a one-hot distribution over the vocabulary, while autoregressive and seq2seq models generate new text conditioned on the source, predicting one word at a time. [10] [11] More advanced efforts also exist to make paraphrasing controllable according to predefined quality dimensions, such as semantic preservation or lexical diversity. [12] Many Transformer-based paraphrase generation methods rely on unsupervised learning to leverage large amounts of training data and scale their methods. [13] [14]
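
A seq2seq approach of this kind can be sketched with the Hugging Face transformers library. The model identifier below is a placeholder for any sequence-to-sequence checkpoint fine-tuned on paraphrase pairs (for example, a T5 or BART variant); beam search with several returned sequences is one common, though not the only, decoding choice.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "paraphrase-model" is a placeholder identifier, not a real checkpoint name.
MODEL_ID = "paraphrase-model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

source = "The company reported a sharp rise in quarterly profits."
inputs = tokenizer(source, return_tensors="pt")

# Beam search with several returned sequences yields a handful of candidates.
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=3, max_length=40)
for candidate in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(candidate)
```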

Paraphrase recognition

Recursive autoencoders

Paraphrase recognition has been attempted by Socher et al. [1] through the use of recursive autoencoders. The main concept is to produce a vector representation of a sentence and its components by recursively applying an autoencoder. Since paraphrases should have similar vector representations, the resulting representations are compared and then fed as input into a neural network for classification.

Given a sentence $W$ with $m$ words, the autoencoder is designed to take two $n$-dimensional word embeddings as input and produce an $n$-dimensional vector as output. The same autoencoder is applied to every pair of words in $W$ to produce $\lfloor m/2 \rfloor$ vectors. The autoencoder is then applied recursively with the new vectors as inputs until a single vector is produced. Given an odd number of inputs, the first vector is forwarded as-is to the next level of recursion. The autoencoder is trained to reproduce every vector in the full recursion tree, including the initial word embeddings.

Given two sentences $W_1$ and $W_2$ of length 4 and 3 respectively, the autoencoders would produce 7 and 5 vector representations, including the initial word embeddings. The Euclidean distance is then taken between every combination of vectors in $W_1$ and $W_2$ to produce a similarity matrix $S$. $S$ is then subjected to a dynamic min-pooling layer to produce a fixed-size matrix. Since the similarity matrices are not uniform in size across sentence pairs, $S$ is split into roughly even sections. The output is then normalized to have mean 0 and standard deviation 1 and is fed into a fully connected layer with a softmax output. The dynamic pooling to softmax model is trained using pairs of known paraphrases.
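
The comparison and pooling stage can be sketched as follows. The pooled grid size and the random vectors are illustrative stand-ins, and min-pooling over roughly even sections is an approximation of the dynamic pooling layer described above rather than a reproduction of the original implementation.

```python
import numpy as np

def similarity_matrix(vectors_a, vectors_b):
    """Pairwise Euclidean distances between two lists of sentence-tree vectors."""
    A = np.asarray(vectors_a)               # shape (len_a, n)
    B = np.asarray(vectors_b)               # shape (len_b, n)
    diff = A[:, None, :] - B[None, :, :]
    return np.linalg.norm(diff, axis=-1)    # shape (len_a, len_b)

def dynamic_min_pool(S, pooled_size=4):
    """Split S into a pooled_size x pooled_size grid of roughly even
    sections and keep the minimum distance within each section."""
    rows = np.array_split(np.arange(S.shape[0]), pooled_size)
    cols = np.array_split(np.arange(S.shape[1]), pooled_size)
    pooled = np.empty((pooled_size, pooled_size))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = S[np.ix_(r, c)].min()
    return pooled

# Toy example: 7 and 5 vectors of dimension 50, as in the example above.
rng = np.random.default_rng(0)
S = similarity_matrix(rng.normal(size=(7, 50)), rng.normal(size=(5, 50)))
pooled = dynamic_min_pool(S, pooled_size=4)
features = (pooled - pooled.mean()) / pooled.std()   # normalize before the classifier
```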

Skip-thought vectors

Skip-thought vectors are an attempt to create a vector representation of the semantic meaning of a sentence, similar to the skip-gram model. [15] Skip-thought vectors are produced through the use of a skip-thought model, which consists of three key components: an encoder and two decoders. Given a corpus of documents, the skip-thought model is trained to take a sentence as input and encode it into a skip-thought vector. The skip-thought vector is used as input for both decoders; one attempts to reproduce the previous sentence and the other the following sentence in its entirety. The encoder and decoder can be implemented through the use of a recurrent neural network (RNN) or an LSTM.

Since paraphrases carry the same semantic meaning, they should have similar skip-thought vectors. Thus a simple logistic regression can be trained to good performance using the absolute difference and component-wise product of two skip-thought vectors as input.
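
A minimal sketch of this classifier is shown below, assuming scikit-learn and NumPy. The skip-thought vectors here are random placeholders of the 4800-dimensional size used in the original paper; in practice they would come from a trained skip-thought encoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(u, v):
    """Feature vector for a sentence pair: |u - v| concatenated with u * v."""
    u, v = np.asarray(u), np.asarray(v)
    return np.concatenate([np.abs(u - v), u * v])

# Placeholder data: skip-thought vectors for 200 sentence pairs, with labels
# indicating whether each pair is a paraphrase (1) or not (0).
rng = np.random.default_rng(0)
vectors_a = rng.normal(size=(200, 4800))
vectors_b = rng.normal(size=(200, 4800))
labels = rng.integers(0, 2, size=200)

X = np.array([pair_features(u, v) for u, v in zip(vectors_a, vectors_b)])
classifier = LogisticRegression(max_iter=1000).fit(X, labels)
print(classifier.predict_proba(X[:1]))    # probability the first pair is a paraphrase
```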

Transformers

Similar to how Transformer models influenced paraphrase generation, their application to identifying paraphrases has shown great success. Models such as BERT can be adapted with a binary classification layer and trained end-to-end on identification tasks. [16] [17] Transformers achieve strong results when transferring between domains and paraphrasing techniques compared to more traditional machine learning methods such as logistic regression. Other successful methods based on the Transformer architecture include adversarial learning and meta-learning. [18] [19]
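
A minimal sketch of this setup using the Hugging Face transformers library is shown below. The classification head of the "bert-base-uncased" checkpoint is randomly initialized, and in practice the model would be fine-tuned over a full training set with an optimizer rather than the single pair and single backward pass shown here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# BERT with a binary classification head on top of the pooled representation.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# The two sentences are encoded together as a single sequence pair.
inputs = tokenizer("The cat sat on the mat.",
                   "A cat was sitting on the mat.",
                   return_tensors="pt")
labels = torch.tensor([1])                 # 1 = paraphrase, 0 = not a paraphrase

outputs = model(**inputs, labels=labels)   # returns loss and logits
outputs.loss.backward()                    # one fine-tuning step (optimizer omitted)
```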

Evaluation

Multiple methods can be used to evaluate paraphrases. Since paraphrase recognition can be posed as a classification problem, most standard evaluation metrics such as accuracy, F1 score, or an ROC curve apply reasonably well. However, F1 scores are difficult to calculate because of the trouble of producing a complete list of paraphrases for a given phrase, and because good paraphrases are dependent upon context. A metric designed to counter these problems is ParaMetric. [20] ParaMetric aims to calculate the precision and recall of an automatic paraphrase system by comparing its automatic alignment of paraphrases to a manual alignment of similar phrases. Since ParaMetric simply rates the quality of phrase alignment, it can also be used to rate paraphrase generation systems, assuming they use phrase alignment as part of their generation process. A notable drawback of ParaMetric is the large and exhaustive set of manual alignments that must be created before a rating can be produced.

The evaluation of paraphrase generation has difficulties similar to those of evaluating machine translation. The quality of a paraphrase depends on its context, whether it is being used as a summary, and how it is generated, among other factors. Additionally, a good paraphrase is usually lexically dissimilar from its source phrase. The simplest method of evaluating paraphrase generation is through the use of human judges. Unfortunately, evaluation by human judges tends to be time-consuming. Automated approaches to evaluation prove challenging, as evaluation is essentially a problem as difficult as paraphrase recognition. While originally used to evaluate machine translations, bilingual evaluation understudy (BLEU) has also been used successfully to evaluate paraphrase generation models. However, paraphrases often have several lexically different but equally valid solutions, which hurts BLEU and similar evaluation metrics. [21]

Metrics specifically designed to evaluate paraphrase generation include paraphrase in n-gram change (PINC) [21] and paraphrase evaluation metric (PEM), [22] along with the aforementioned ParaMetric. PINC is designed to be used with BLEU and to help cover its inadequacies. Since BLEU has difficulty measuring lexical dissimilarity, PINC measures the lack of n-gram overlap between a source sentence and a candidate paraphrase. It is essentially the Jaccard distance between the two sentences, excluding n-grams that appear in the source sentence, to maintain some semantic equivalence. PEM, on the other hand, attempts to evaluate the "adequacy, fluency, and lexical dissimilarity" of paraphrases by returning a single-value heuristic calculated using n-gram overlap in a pivot language. However, a large drawback of PEM is that it must be trained using large, in-domain parallel corpora as well as human judges; [21] this amounts to training a paraphrase recognition system in order to evaluate a paraphrase generation system.
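
A sketch of the PINC computation, following the description above, is shown below. The whitespace tokenization and the maximum n-gram order (here 4, as in BLEU) are assumptions rather than requirements of the metric.

```python
def ngrams(tokens, n):
    """All word n-grams of order n in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """Average, over n-gram orders 1..max_n, of the fraction of the
    candidate's n-grams that do NOT appear in the source sentence."""
    src, cand = source.lower().split(), candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue
        overlap = len(cand_ngrams & ngrams(src, n)) / len(cand_ngrams)
        scores.append(1.0 - overlap)
    return sum(scores) / len(scores) if scores else 0.0

print(pinc("the project is under control", "the project is in check"))
```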

The Quora Question Pairs Dataset, which contains hundreds of thousands of duplicate questions, has become a common dataset for the evaluation of paraphrase detectors. [23] The most consistently reliable detectors all use the Transformer architecture and rely on large amounts of pre-training on more general data before being fine-tuned on the question pairs.

References

  1. Socher, Richard; Huang, Eric; Pennington, Jeffrey; Ng, Andrew; Manning, Christopher (2011), "Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection", Advances in Neural Information Processing Systems 24, archived from the original on 2018-01-06, retrieved 2017-12-29
  2. Callison-Burch, Chris (October 25–27, 2008). Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii. pp. 196–205.
  3. Berant, Jonathan, and Percy Liang. "Semantic parsing via paraphrasing." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2014.
  4. Wahle, Jan Philip; Ruas, Terry; Kirstein, Frederic; Gipp, Bela (2022). "How Large Language Models are Transforming Machine-Paraphrase Plagiarism". Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Online and Abu Dhabi, United Arab Emirates: 952–963. arXiv: 2210.03568 . doi:10.18653/v1/2022.emnlp-main.62.
  5. Barzilay, Regina; Lee, Lillian (May–June 2003). Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment. Proceedings of HLT-NAACL 2003.
  6. Bannard, Colin; Callison-Burch, Chris (2005). Paraphrasing Bilingual Parallel Corpora. Proceedings of the 43rd Annual Meeting of the ACL. Ann Arbor, Michigan. pp. 597–604.
  7. Prakash, Aaditya; Hasan, Sadid A.; Lee, Kathy; Datla, Vivek; Qadir, Ashequl; Liu, Joey; Farri, Oladimeji (2016), Neural Paraphrase Generation with Stacked Residual LSTM Networks, arXiv: 1610.03098 , Bibcode:2016arXiv161003098P
  8. Zhou, Jianing; Bhat, Suma (2021). "Paraphrase Generation: A Survey of the State of the Art". Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. pp. 5075–5086. doi: 10.18653/v1/2021.emnlp-main.414 . S2CID   243865349.
  9. Dou, Yao; Forbes, Maxwell; Koncel-Kedziorski, Rik; Smith, Noah; Choi, Yejin (2022). "Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 7250–7274. doi: 10.18653/v1/2022.acl-long.501 . S2CID   247315430.
  10. Liu, Xianggen; Mou, Lili; Meng, Fandong; Zhou, Hao; Zhou, Jie; Song, Sen (2020). "Unsupervised Paraphrasing by Simulated Annealing". Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics: 302–312. doi: 10.18653/v1/2020.acl-main.28 . S2CID   202537332.
  11. Wahle, Jan Philip; Ruas, Terry; Meuschke, Norman; Gipp, Bela (2021). "Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection". 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL). Champaign, IL, USA: IEEE. pp. 226–229. arXiv: 2103.12450 . doi:10.1109/JCDL52503.2021.00065. ISBN   978-1-6654-1770-9. S2CID   232320374.
  12. Bandel, Elron; Aharonov, Ranit; Shmueli-Scheuer, Michal; Shnayderman, Ilya; Slonim, Noam; Ein-Dor, Liat (2022). "Quality Controlled Paraphrase Generation". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 596–609. doi: 10.18653/v1/2022.acl-long.45 .
  13. Lee, John Sie Yuen; Lim, Ho Hung; Carol Webster, Carol (2022). "Unsupervised Paraphrasability Prediction for Compound Nominalizations". Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics. pp. 3254–3263. doi: 10.18653/v1/2022.naacl-main.237 . S2CID   250390695.
  14. Niu, Tong; Yavuz, Semih; Zhou, Yingbo; Keskar, Nitish Shirish; Wang, Huan; Xiong, Caiming (2021). "Unsupervised Paraphrasing with Pretrained Language Models". Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. pp. 5136–5150. doi: 10.18653/v1/2021.emnlp-main.417 . S2CID   237497412.
  15. Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Ruslan; Zemel, Richard; Torralba, Antonio; Urtasun, Raquel; Fidler, Sanja (2015), Skip-Thought Vectors, arXiv: 1506.06726 , Bibcode:2015arXiv150606726K
  16. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota: Association for Computational Linguistics: 4171–4186. doi:10.18653/v1/N19-1423. S2CID   52967399.
  17. Wahle, Jan Philip; Ruas, Terry; Foltýnek, Tomáš; Meuschke, Norman; Gipp, Bela (2022), Smits, Malte (ed.), "Identifying Machine-Paraphrased Plagiarism", Information for a Better World: Shaping the Global Future, Cham: Springer International Publishing, vol. 13192, pp. 393–413, arXiv: 2103.11909 , doi:10.1007/978-3-030-96957-8_34, ISBN   978-3-030-96956-1, S2CID   232307572 , retrieved 2022-10-06
  18. Nighojkar, Animesh; Licato, John (2021). "Improving Paraphrase Detection with the Adversarial Paraphrasing Task". Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics. pp. 7106–7116. doi: 10.18653/v1/2021.acl-long.552 . S2CID   235436269.
  19. Dopierre, Thomas; Gravier, Christophe; Logerais, Wilfried (2021). "ProtAugment: Intent Detection Meta-Learning through Unsupervised Diverse Paraphrasing". Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics. pp. 2454–2466. doi: 10.18653/v1/2021.acl-long.191 . S2CID   236460333.
  20. Callison-Burch, Chris; Cohn, Trevor; Lapata, Mirella (2008). ParaMetric: An Automatic Evaluation Metric for Paraphrasing (PDF). Proceedings of the 22nd International Conference on Computational Linguistics. Manchester. pp. 97–104. doi: 10.3115/1599081.1599094 . S2CID   837398.
  21. Chen, David; Dolan, William (2011). Collecting Highly Parallel Data for Paraphrase Evaluation. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon. pp. 190–200.
  22. Liu, Chang; Dahlmeier, Daniel; Ng, Hwee Tou (2010). PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. MIT, Massachusetts. pp. 923–932.
  23. "Paraphrase Identification on Quora Question Pairs". Papers with Code.