Semantic compression

In natural language processing, semantic compression is a process of compacting a lexicon used to build a textual document (or a set of documents) by reducing language heterogeneity, while maintaining text semantics. As a result, the same ideas can be represented using a smaller set of words.

In most applications, semantic compression is a lossy compression: the increased prolixity does not compensate for the lexical compression, and the original document cannot be reconstructed in a reverse process.

By generalization

Semantic compression is achieved in two steps, using frequency dictionaries and a semantic network:

  1. determining cumulated term frequencies to identify the target lexicon,
  2. replacing less frequent terms with their hypernyms (generalization) from the target lexicon. [1]

Step 1 requires assembling word frequencies and information on semantic relationships, specifically hyponymy. Moving upwards in the word hierarchy, a cumulated concept frequency is calculated by adding the sum of the hyponyms' cumulated frequencies to the frequency of their hypernym: $f_{cum}(k_i) = f(k_i) + \sum_{k_j \in \mathrm{hypo}(k_i)} f_{cum}(k_j)$, where $k_i$ is a hypernym of the terms $k_j$. Then a desired number of words with the top cumulated frequencies is chosen to build the target lexicon.
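The sketch below illustrates this computation on a tiny hand-made hypernym hierarchy with invented frequency counts; the hierarchy, the numbers, and the choice of k are assumptions made only for the example, not data or code from the cited work.

```python
from collections import defaultdict

# Hand-made hypernym relation (child -> parent), e.g. "honey bee" IS-A "insect".
hypernym_of = {
    "honey bee": "insect",
    "paper wasp": "insect",
    "insect": "animal",
    "pigeon": "bird",
    "bird": "animal",
}

# Invented raw term frequencies observed in some corpus.
freq = {
    "honey bee": 12, "paper wasp": 7, "insect": 30,
    "pigeon": 3, "bird": 20, "animal": 25,
}

def cumulated_frequencies(freq, hypernym_of):
    """f_cum(w) = f(w) + sum of f_cum(v) over all direct hyponyms v of w."""
    children = defaultdict(list)
    for child, parent in hypernym_of.items():
        children[parent].append(child)

    memo = {}
    def f_cum(term):
        if term not in memo:
            memo[term] = freq.get(term, 0) + sum(f_cum(c) for c in children[term])
        return memo[term]

    return {term: f_cum(term) for term in set(freq) | set(hypernym_of.values())}

cum = cumulated_frequencies(freq, hypernym_of)
# cum == {'honey bee': 12, 'paper wasp': 7, 'insect': 49,
#         'pigeon': 3, 'bird': 23, 'animal': 97}

# Keep the k terms with the highest cumulated frequencies as the target lexicon.
k = 3
target_lexicon = set(sorted(cum, key=cum.get, reverse=True)[:k])
# target_lexicon == {'animal', 'insect', 'bird'}
```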

In the second step, compression mapping rules are defined for the remaining words, so that every occurrence of a less frequent hyponym is treated as its hypernym in the output text.
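Continuing the same illustrative assumptions, this mapping step can be sketched as a lookup that walks from a dropped term up its hypernym chain until it reaches a term that belongs to the target lexicon:

```python
# Illustrative compression mapping (same assumed hierarchy and target lexicon
# as in the previous sketch).
hypernym_of = {
    "honey bee": "insect",
    "paper wasp": "insect",
    "insect": "animal",
    "pigeon": "bird",
    "bird": "animal",
}
target_lexicon = {"animal", "insect", "bird"}

def generalize(term):
    """Walk up the hypernym chain until a target-lexicon term is reached."""
    while term not in target_lexicon and term in hypernym_of:
        term = hypernym_of[term]
    return term  # terms outside the hierarchy are left unchanged

tokens = ["paper wasp", "and", "honey bee", "build", "nests"]
print([generalize(t) for t in tokens])
# ['insect', 'and', 'insect', 'build', 'nests']
```

In a real system, the hierarchy would come from a semantic network such as WordNet and the counts from a frequency dictionary, rather than from hand-made tables as here.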

Example

The text fragment below has been processed by semantic compression; less frequent words have been replaced by their hypernyms in the output.

They are both nest building social insects, but paper wasps and honey bees organize their colonies in very different ways. In a new study, researchers report that despite their differences, these insects rely on the same network of genes to guide their social behavior. The study appears in the Proceedings of the Royal Society B: Biological Sciences. Honey bees and paper wasps are separated by more than 100 million years of evolution, and there are striking differences in how they divvy up the work of maintaining a colony.

The procedure outputs the following text:

They are both facility building insect, but insects and honey insects arrange their biological groups in very different structure. In a new study, researchers report that despite their difference of opinions, these insects act the same network of genes to steer their party demeanor. The study appears in the proceeding of the institution bacteria Biological Sciences. Honey insects and insect are separated by more than hundred million years of organic processes, and there are impinging differences of opinions in how they divvy up the work of affirming a biological group.

Implicit semantic compression

A natural tendency to keep natural language expressions concise can be perceived as a form of implicit semantic compression: omitting words that carry little meaning, as well as redundant meaningful words (especially to avoid pleonasms). [2]

Applications and advantages

In the vector space model, compacting the lexicon leads to a reduction of dimensionality, which results in lower computational complexity and has a positive influence on efficiency.
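As a small illustration (using an assumed hyponym-to-hypernym mapping, not the output of any particular system), replacing hyponyms with target-lexicon hypernyms shrinks the vocabulary that defines the dimensions of a bag-of-words representation:

```python
# Illustrative only: the compressed lexicon spans fewer dimensions in a
# bag-of-words vector space than the original surface vocabulary.
docs = [
    ["honey bee", "insect", "colony"],
    ["paper wasp", "insect", "colony"],
    ["pigeon", "bird", "nest"],
]

# Assumed hyponym -> target-lexicon hypernym mapping.
mapping = {"honey bee": "insect", "paper wasp": "insect", "pigeon": "bird"}

original_vocab = {term for doc in docs for term in doc}
compressed_vocab = {mapping.get(term, term) for doc in docs for term in doc}

print(len(original_vocab), len(compressed_vocab))  # 7 4
```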

Semantic compression is advantageous in information retrieval tasks, improving their effectiveness in terms of both precision and recall. [3] This is due to more precise descriptors: the effect of language diversity is reduced, language redundancy is limited, and the lexicon moves a step towards a controlled dictionary.

As in the example above, it is possible to display the output as natural text (by re-applying inflection and adding stop words).

See also

  * Generalization
  * WordNet
  * Synonym
  * Hypernymy and hyponymy
  * Zipf–Mandelbrot law
  * Latent semantic analysis
  * Document retrieval
  * Semantic similarity
  * Lexical chain
  * Statistical semantics
  * Ontology learning
  * Word list
  * Stemming
  * Word2vec
  * Automatic taxonomy construction
  * Outline of natural language processing

References

  1. Ceglarek, D.; Haniewicz, K.; Rutkowski, W. (2010). "Semantic Compression for Specialised Information Retrieval Systems". Advances in Intelligent Information and Database Systems. Studies in Computational Intelligence. Vol. 283. pp. 111–121. doi:10.1007/978-3-642-12090-9_10. ISBN 978-3-642-12089-3.
  2. Percova, N.N. (1982). "On the types of semantic compression of text". COLING '82: Proceedings of the 9th Conference on Computational Linguistics. Vol. 2. pp. 229–231. doi:10.3115/990100.990155. ISBN 0-444-86393-1. S2CID 33742593.
  3. Ceglarek, D.; Haniewicz, K.; Rutkowski, W. (2010). "Quality of semantic compression in classification". Proceedings of the 2nd International Conference on Computational Collective Intelligence: Technologies and Applications. Vol. 1. Springer. pp. 162–171. ISBN 978-3-642-16692-1.