Text simplification

Text simplification is an operation used in natural language processing to change, enhance, classify, or otherwise process an existing body of human-readable text so that its grammar and structure are greatly simplified while the underlying meaning and information remain the same. Text simplification is an important area of research because of communication needs in an increasingly complex and interconnected world dominated by science, technology, and new media. Natural human languages, however, pose a problem: they ordinarily contain large vocabularies and complex constructions that machines, no matter how fast and well programmed, cannot easily process. Researchers have found that they can reduce this linguistic diversity with methods of semantic compression, which limit and simplify the set of words used in a given text.

Example

Text simplification is illustrated with an example used by Siddharthan (2006). [1] The first sentence below contains two relative clauses and one conjoined verb phrase, and a text simplification system aims to rewrite it as the group of simpler sentences that follows:

  Also contributing to the firmness in copper, the analyst noted, was a report by Chicago purchasing agents, which precedes the full purchasing agents report that is due out today and gives an indication of what the full report might hold.

  The analyst noted that also contributing to the firmness in copper was a report by Chicago purchasing agents. The Chicago report precedes the full purchasing agents report. The Chicago report gives an indication of what the full report might hold. The full report is due out today.

One approach to text simplification is lexical simplification via lexical substitution: a two-step process of first identifying complex words and then replacing them with simpler synonyms. A key challenge is identifying the complex words, which is typically done by a machine learning classifier trained on labeled data. The classical labeling method, asking research subjects to mark words as either simple or complex, has proven problematic; researchers have found that they can obtain higher consistency across more levels of complexity by instead asking labelers to sort the words presented to them in order of complexity. [2]
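
To make the two steps concrete, the following minimal sketch (in Python, using NLTK's WordNet interface) substitutes a word-frequency threshold for the trained complexity classifier described above; the frequency scores and threshold are invented for illustration.

    # A minimal sketch of the two-step pipeline, assuming NLTK and its
    # WordNet data are installed (pip install nltk; nltk.download('wordnet')).
    # A word-frequency threshold stands in for the trained classifier
    # mentioned above; the scores and threshold are invented.
    from nltk.corpus import wordnet

    FREQ = {"help": 6.2, "assist": 4.8, "use": 6.5, "utilize": 4.1, "car": 6.0}
    THRESHOLD = 5.0  # words scoring below this count as "complex"

    def is_complex(word):
        """Step 1: complex word identification (frequency proxy)."""
        return FREQ.get(word.lower(), 0.0) < THRESHOLD

    def simpler_synonym(word):
        """Step 2: substitute the most frequent WordNet synonym, if any."""
        candidates = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()
            if lemma.name().lower() != word.lower()
        }
        score, best = max([(FREQ.get(c, 0.0), c) for c in candidates],
                          default=(0.0, word))
        return best if score > FREQ.get(word.lower(), 0.0) else word

    def simplify(tokens):
        return [simpler_synonym(t) if is_complex(t) else t for t in tokens]

    print(simplify(["please", "utilize", "the", "car"]))
    # -> ['please', 'use', 'the', 'car']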

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Psycholinguistics or psychology of language is the study of the interrelation between linguistic factors and psychological aspects. The discipline is mainly concerned with the mechanisms by which language is processed and represented in the mind and brain; that is, the psychological and neurobiological factors that enable humans to acquire, use, comprehend, and produce language.

Lexical semantics, as a subfield of linguistic semantics, is the study of word meanings. It includes the study of how words structure their meaning, how they act in grammar and compositionality, and the relationships between the distinct senses and uses of a word.

Head-driven phrase structure grammar (HPSG) is a highly lexicalized, constraint-based grammar developed by Carl Pollard and Ivan Sag. It is a type of phrase structure grammar, as opposed to a dependency grammar, and it is the immediate successor to generalized phrase structure grammar. HPSG draws from other fields such as computer science and uses Ferdinand de Saussure's notion of the sign. It uses a uniform formalism and is organized in a modular way which makes it attractive for natural language processing.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.
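
As a toy illustration, the sketch below (Python with NLTK, whose availability is assumed) defines a small formal grammar and parses a sentence against it:

    # A toy parser: a small context-free grammar (the formal grammar)
    # applied to a five-word sentence. Assumes NLTK is installed.
    import nltk

    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> Det N
        VP  -> V NP
        Det -> 'the'
        N   -> 'dog' | 'cat'
        V   -> 'chased'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the dog chased the cat".split()):
        tree.pretty_print()  # draws the parse tree as ASCII art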

Readability is the ease with which a reader can understand a written text. In natural language, the readability of text depends on its content and its presentation. Researchers have used various factors to measure readability, such as the average length of a text's sentences and words; one such measure is sketched below.
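
A common example is the Flesch reading ease score, computed as 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word); this minimal sketch implements it with a rough heuristic syllable counter:

    # The Flesch reading ease score; higher means easier. The syllable
    # counter is a rough heuristic, not dictionary-accurate.
    def count_syllables(word):
        word = word.lower().strip(".,;:!?")
        vowels = "aeiouy"
        count, prev_vowel = 0, False
        for ch in word:
            is_vowel = ch in vowels
            if is_vowel and not prev_vowel:
                count += 1
            prev_vowel = is_vowel
        if word.endswith("e") and count > 1:
            count -= 1  # crude silent-e adjustment
        return max(count, 1)

    def flesch_reading_ease(text):
        sentences = max(text.count(".") + text.count("!") + text.count("?"), 1)
        words = text.split()
        syllables = sum(count_syllables(w) for w in words)
        return (206.835 - 1.015 * (len(words) / sentences)
                        - 84.6 * (syllables / len(words)))

    print(round(flesch_reading_ease("The cat sat on the mat."), 1))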

In semantics, mathematical logic and related disciplines, the principle of compositionality is the principle that the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them. This principle is also called Frege's principle, because Gottlob Frege is widely credited with the first modern formulation of it. The principle was never explicitly stated by Frege, and it was arguably already assumed by George Boole decades before Frege's work.

Frame semantics, a theory of meaning derived from the work of Charles J. Fillmore and others, serves as the foundation for FrameNet. The fundamental notion is simple: the meanings of most words are best understood in terms of a semantic frame, a description of a certain kind of event, relation, or entity and its participants. As an illustration, the idea of cooking normally involves a cook (Cook), the meal being prepared (Food), a container (Container) to hold the food while it is being cooked, and a heating instrument.
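
As a rough sketch, the cooking frame described above might be represented as a mapping from frame elements to the text spans that fill them; the role names here follow the paragraph rather than FrameNet's actual inventory:

    # A rough sketch: the cooking frame as a mapping from frame elements
    # to the spans that fill them. Role names follow the paragraph above,
    # not FrameNet's actual inventory.
    cooking_frame = {
        "Cook": "the chef",
        "Food": "the rice",
        "Container": "a saucepan",
        "Heating_instrument": "a gas stove",
    }
    # A frame-parsing system would fill these slots from a sentence such as
    # "The chef cooked the rice in a saucepan on a gas stove."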

In linguistics, nominalization or nominalisation is the use of a word that is not a noun as a noun, or as the head of a noun phrase. This change in functional category can occur through morphological transformation, but it does not always require one. Nominalization can refer, for instance, to the process of producing a noun from another part of speech by adding a derivational affix, but it can also refer to the complex noun that is formed as a result.

<span class="mw-page-title-main">Syntax (programming languages)</span> Set of rules defining correctly structured programs

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

<span class="mw-page-title-main">Treebank</span>

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
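
For illustration, a Penn Treebank-style bracketing (one common treebank annotation format) can be parsed and displayed with NLTK; the sentence and its annotation are invented for this sketch:

    # A Penn Treebank-style bracketing, parsed and displayed with NLTK;
    # the sentence and its annotation are invented for this sketch.
    from nltk import Tree

    annotation = ("(S (NP (DT The) (NN cat)) "
                  "(VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))")
    Tree.fromstring(annotation).pretty_print()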

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
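
A minimal numerical sketch: if words are represented by co-occurrence counts over a few context words (the counts below are invented), cosine similarity between the resulting vectors approximates semantic similarity:

    # The distributional hypothesis in miniature: words represented by
    # (invented) co-occurrence counts over the context words
    # "drink", "engine", "road"; similar distributions -> high cosine.
    import numpy as np

    vectors = {
        "car":   np.array([0.0, 8.0, 9.0]),
        "truck": np.array([0.0, 7.0, 8.0]),
        "tea":   np.array([9.0, 0.0, 1.0]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(vectors["car"], vectors["truck"]))  # high: similar contexts
    print(cosine(vectors["car"], vectors["tea"]))    # low: different contexts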

<span class="mw-page-title-main">Biolinguistics</span> Study of the biology and evolution of language

Biolinguistics can be defined as the study of the biology and evolution of language. It is highly interdisciplinary, drawing on fields such as biology, linguistics, psychology, anthropology, mathematics, and neurolinguistics to explain the formation of language, and it seeks to provide a framework for understanding the fundamentals of the faculty of language. The term was introduced by Massimo Piattelli-Palmarini, professor of Linguistics and Cognitive Science at the University of Arizona, at an international meeting at the Massachusetts Institute of Technology (MIT) in 1971. Biolinguistics, also called the biolinguistic enterprise or the biolinguistic approach, is believed to have its origins in Noam Chomsky's and Eric Lenneberg's work on language acquisition, which began in the 1950s as a reaction to the then-dominant behaviorist paradigm. Fundamentally, biolinguistics challenges the view of human language acquisition as a behavior based on stimulus-response interactions and associations; Chomsky and Lenneberg argued instead for innate knowledge of language. In the 1960s, Chomsky proposed the Language Acquisition Device (LAD), a hypothetical tool for language acquisition that only humans are born with. Similarly, Lenneberg (1967) formulated the Critical Period Hypothesis, whose main idea is that language acquisition is biologically constrained. These works are regarded as pioneering in the shaping of biolinguistic thought and as the beginning of a change in paradigm in the study of language.

Sentence processing takes place whenever a reader or listener processes a language utterance, either in isolation or in the context of a conversation or a text. Many studies of the human language comprehension process have focused on reading of single utterances (sentences) without context. Extensive research has shown that language comprehension is affected by context preceding a given utterance as well as many other factors.

Computational lexicology is a branch of computational linguistics concerned with the use of computers in the study of the lexicon. It has been more narrowly described by some scholars as the use of computers in the study of machine-readable dictionaries. It is distinguished from computational lexicography, which more properly is the use of computers in the construction of dictionaries, though some researchers have used the two terms as synonyms.

In natural language processing, semantic role labeling is the process of assigning labels to words or phrases in a sentence that indicate their semantic role in the sentence, such as that of an agent, goal, or result.
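
As a small illustration, the labels for one sentence might look like the following; the exact role inventory varies across annotation schemes such as PropBank and FrameNet:

    # Hypothetical role labels for one sentence; the exact inventory
    # varies across schemes such as PropBank and FrameNet.
    sentence = "Mary sold the book to John"
    roles = {
        "predicate": "sold",
        "agent": "Mary",       # the seller
        "theme": "the book",   # the thing sold
        "recipient": "John",   # the buyer
    }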

In natural language processing, semantic compression is a process of compacting a lexicon used to build a textual document by reducing language heterogeneity, while maintaining text semantics. As a result, the same ideas can be represented using a smaller set of words.
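
A minimal sketch of the idea: map each word to a canonical representative of its synonym set, shrinking the lexicon while keeping the gist. The mapping below is hand-built for illustration; published approaches derive it from lexical resources such as WordNet plus frequency data:

    # A minimal sketch: map each word to a canonical representative of its
    # synonym set, shrinking the lexicon while keeping the gist. The mapping
    # is hand-built here; published approaches derive it from lexical
    # resources such as WordNet plus frequency data.
    CANONICAL = {
        "automobile": "car", "vehicle": "car",
        "purchase": "buy", "acquire": "buy",
        "large": "big", "huge": "big",
    }

    def compress(tokens):
        return [CANONICAL.get(t.lower(), t) for t in tokens]

    print(compress(["He", "will", "purchase", "a", "huge", "automobile"]))
    # -> ['He', 'will', 'buy', 'a', 'big', 'car']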

Lexical simplification is a sub-task of text simplification. It can be defined as any lexical substitution task that reduces text complexity.

<span class="mw-page-title-main">Word2vec</span> Models used to produce word embeddings

Word2vec is a technique for natural language processing (NLP) published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that they capture the semantic and syntactic qualities of words; as such, a simple mathematical function can indicate the level of semantic similarity between the words represented by those vectors.
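
The sketch below trains a toy model with the Gensim library's Word2Vec implementation; a real model would need a far larger corpus than these few sentences:

    # A toy run of Gensim's Word2Vec implementation; a real model needs a
    # far larger corpus than these few sentences.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
        ["the", "cat", "chased", "the", "dog"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                     epochs=50)

    print(model.wv["cat"][:5])           # first few vector components
    print(model.wv.most_similar("cat"))  # nearest neighbours by cosine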

Scott Andrew Crossley is an American linguist. He is a professor of applied linguistics at Vanderbilt University, United States. His research focuses on natural language processing and the application of computational tools and machine learning algorithms in learning analytics including second language acquisition, second language writing, and readability. His main interest area is the development and use of natural language processing tools in assessing writing quality and text difficulty.

References

  1. Siddharthan, Advaith (28 March 2006). "Syntactic Simplification and Text Cohesion". Research on Language and Computation. 4 (1): 77–109. doi:10.1007/s11168-006-9011-1. S2CID 14619244.
  2. Gooding, Sian; Kochmar, Ekaterina; Sarkar, Advait; Blackwell, Alan (August 2019). "Comparative judgments are more consistent than binary classification for labelling word complexity". Proceedings of the 13th Linguistic Annotation Workshop: 208–214. doi:10.18653/v1/W19-4024. Retrieved 22 November 2019.