Computational lexicology

Computational lexicology is a branch of computational linguistics concerned with the use of computers in the study of the lexicon. Some scholars (Amsler, 1980) have described it more narrowly as the use of computers in the study of machine-readable dictionaries. It is distinguished from computational lexicography, which more properly refers to the use of computers in the construction of dictionaries, though some researchers have used the two terms synonymously.

History

Computational lexicology emerged as a separate discipline within computational linguistics with the appearance of machine-readable dictionaries, starting with the creation of the machine-readable tapes of the Merriam-Webster Seventh Collegiate Dictionary and the Merriam-Webster New Pocket Dictionary in the 1960s by John Olney et al. at System Development Corporation. Today, computational lexicology is best known through the creation and applications of WordNet. As the computing power available to researchers increased over time, computational lexicology came to be applied widely in text analysis. In 1987, Byrd, Calzolari, Chodorow, and others developed computational tools for text analysis. In particular, the model was designed to coordinate the associations involving the senses of polysemous words. [1]

Study of lexicon

Computational lexicology has contributed to the understanding of the content and limitations of print dictionaries for computational purposes (i.e. it clarified that the previous work of lexicography was not sufficient for the needs of computational linguistics). Through the work of computational lexicologists, almost every portion of a print dictionary entry has been studied, including:

  1. what constitutes a headword - used to generate spelling correction lists;
  2. what variants and inflections the headword forms - used to empirically understand morphology;
  3. how the headword is delimited into syllables;
  4. how the headword is pronounced - used in speech generation systems;
  5. the parts of speech the headword takes on - used for POS taggers;
  6. any special subject or usage codes assigned to the headword - used to identify text document subject matter;
  7. the headword's definitions and their syntax - used as an aid to the disambiguation of words in context;
  8. the etymology of the headword - used to characterize text vocabulary by its languages of origin;
  9. the example sentences;
  10. the run-ons (additional words and multi-word expressions that are formed from the headword); and
  11. related words such as synonyms and antonyms.
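
As an illustration only, the entry components listed above could be held in a record like the following Python sketch. The class and field names (`DictionaryEntry`, `Sense`, `spelling_list`, `pos_lexicon`) and the sample entry for "bank" are hypothetical and do not reproduce the layout of any particular machine-readable dictionary. The two helper functions show how a spelling list and a part-of-speech lexicon might be derived from such records.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Sense:
    """One definition of a headword (hypothetical structure)."""
    definition: str
    examples: List[str] = field(default_factory=list)
    subject_codes: List[str] = field(default_factory=list)


@dataclass
class DictionaryEntry:
    """Fields mirror the parts of a print dictionary entry listed above."""
    headword: str
    inflections: List[str] = field(default_factory=list)
    syllables: List[str] = field(default_factory=list)
    pronunciation: str = ""
    parts_of_speech: List[str] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)
    etymology_languages: List[str] = field(default_factory=list)
    run_ons: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)


def spelling_list(entries: List[DictionaryEntry]) -> List[str]:
    """Collect headwords, inflections, and run-ons into a sorted word list."""
    words: Set[str] = set()
    for entry in entries:
        words.add(entry.headword)
        words.update(entry.inflections)
        words.update(entry.run_ons)
    return sorted(words)


def pos_lexicon(entries: List[DictionaryEntry]) -> Dict[str, Set[str]]:
    """Map each headword to the set of parts of speech it can take."""
    lexicon: Dict[str, Set[str]] = {}
    for entry in entries:
        lexicon.setdefault(entry.headword, set()).update(entry.parts_of_speech)
    return lexicon


# Illustrative entry; the content is invented for the example.
bank = DictionaryEntry(
    headword="bank",
    inflections=["banks", "banked", "banking"],
    syllables=["bank"],
    parts_of_speech=["noun", "verb"],
    senses=[
        Sense("the land alongside a river"),
        Sense("an institution that keeps and lends money",
              subject_codes=["finance"]),
    ],
    run_ons=["bankable"],
)

words = spelling_list([bank])
tags = pos_lexicon([bank])
```

Keeping each component in its own field is one possible design; it lets a downstream program use only the parts it needs, for example a spelling checker that ignores definitions entirely.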

Many computational linguists were disenchanted with the print dictionaries as a resource for computational linguistics because they lacked sufficient syntactic and semantic information for computer programs. The work on computational lexicology quickly led to efforts in two additional directions.

Successors to Computational Lexicology

First, collaborative activities between computational linguists and lexicographers led to an understanding of the role that corpora played in creating dictionaries. Most computational lexicologists moved on to build large corpora to gather the basic data that lexicographers had used to create dictionaries. The ACL/DCI (Data Collection Initiative) and the LDC (Linguistic Data Consortium) went down this path. The advent of markup languages led to the creation of tagged corpora that could be more easily analyzed to create computational linguistic systems. Part-of-speech tagged corpora and semantically tagged corpora were created in order to test and develop POS taggers and word-sense disambiguation technology.

The second direction was toward the creation of Lexical Knowledge Bases (LKBs). A Lexical Knowledge Base was deemed to be what a dictionary should be for computational linguistic purposes, especially for computational lexical semantic purposes. It was to have the same information as a print dictionary, but with the meanings of the words and the appropriate links between senses made fully explicit. Many researchers began creating the resources they wished dictionaries had been, had they been created for use in computational analysis. WordNet can be considered such a development, as can newer efforts at describing syntactic and semantic information such as the FrameNet work of Fillmore. Outside of computational linguistics, the ontology work of artificial intelligence can be seen as an evolutionary effort to build a lexical knowledge base for AI applications.
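
The idea of explicit links between senses can be sketched as a small graph. The Python below is a minimal illustration under stated assumptions: the class names, the sense identifiers such as `bank.n.1`, the glosses, and the `hypernym` relation label are all invented for the example and are not the data model or API of WordNet, FrameNet, or any other resource.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SenseNode:
    """A single word sense with typed links to other senses."""
    sense_id: str
    gloss: str
    links: Dict[str, List[str]] = field(default_factory=dict)


class LexicalKnowledgeBase:
    """Toy lexical knowledge base: senses are nodes, relations are edges."""

    def __init__(self) -> None:
        self.senses: Dict[str, SenseNode] = {}

    def add(self, sense_id: str, gloss: str) -> None:
        self.senses[sense_id] = SenseNode(sense_id, gloss)

    def link(self, source: str, relation: str, target: str) -> None:
        self.senses[source].links.setdefault(relation, []).append(target)

    def chain(self, sense_id: str, relation: str) -> List[str]:
        """Follow the first link of one relation repeatedly from a sense."""
        path: List[str] = []
        seen = {sense_id}
        current = sense_id
        while True:
            targets = self.senses[current].links.get(relation, [])
            # Stop at the top of the chain or if a cycle is detected.
            if not targets or targets[0] in seen:
                break
            current = targets[0]
            seen.add(current)
            path.append(current)
        return path


# Invented senses and glosses, for illustration only.
lkb = LexicalKnowledgeBase()
lkb.add("bank.n.1", "sloping land beside a body of water")
lkb.add("slope.n.1", "a stretch of ground that rises or falls")
lkb.add("landform.n.1", "a natural feature of the earth's surface")
lkb.link("bank.n.1", "hypernym", "slope.n.1")
lkb.link("slope.n.1", "hypernym", "landform.n.1")

hypernyms = lkb.chain("bank.n.1", "hypernym")
```

Unlike a print definition, where the more general term has to be inferred from the wording, here the relation between the senses is stored directly and can be traversed by a program.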

Standardization

Optimizing the production, maintenance and extension of computational lexicons is one of the crucial aspects impacting NLP. The main problem is interoperability: different lexicons are frequently incompatible. The most common question is how to merge two lexicons, or fragments of lexicons. A secondary problem is that a lexicon is usually tailored to a specific NLP program and is difficult to use within other NLP programs or applications.
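
To make the merging problem concrete, here is a minimal Python sketch. It assumes two toy lexicons that record only parts of speech but label them differently; the lexicon contents, the tag names, and the mapping `b_to_shared` are invented for the example. Real lexicons differ in far more than tag names, which is part of why a common data model is useful.

```python
from typing import Dict, Set

Lexicon = Dict[str, Set[str]]


def normalize(lexicon: Lexicon, tag_map: Dict[str, str]) -> Lexicon:
    """Rewrite a lexicon's part-of-speech labels into a shared tagset.

    Labels missing from tag_map are kept unchanged.
    """
    return {
        word: {tag_map.get(tag, tag) for tag in tags}
        for word, tags in lexicon.items()
    }


def merge(lex_a: Lexicon, lex_b: Lexicon) -> Lexicon:
    """Union two lexicons that already use the same tagset."""
    merged: Lexicon = {word: set(tags) for word, tags in lex_a.items()}
    for word, tags in lex_b.items():
        merged.setdefault(word, set()).update(tags)
    return merged


# Two hypothetical lexicons with incompatible labels.
lex_a: Lexicon = {"run": {"N", "V"}, "fast": {"ADJ"}}
lex_b: Lexicon = {"run": {"verb"}, "fast": {"adverb"}, "lexicon": {"noun"}}

# Hypothetical mapping from the second lexicon's labels to the first's.
b_to_shared = {"noun": "N", "verb": "V", "adverb": "ADV", "adjective": "ADJ"}

merged = merge(lex_a, normalize(lex_b, b_to_shared))
```

Merging without the normalization step would leave "run" tagged with both `V` and `verb`, two labels for the same category, which is the kind of incompatibility described above.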

In this respect, the various data models of computational lexicons have been studied by ISO/TC37 since 2003 within the Lexical Markup Framework project, which led to an ISO standard in 2008.

Related Research Articles

Computational linguistics has, since the 2020s, become a near-synonym of either natural language processing or language technology, with deep learning approaches, such as large language models, having replaced most of the specific approaches previously used in the field.

A lexicon is the vocabulary of a language or branch of knowledge. In linguistics, a lexicon is a language's inventory of lexemes. The word lexicon derives from Greek word λεξικόν, neuter of λεξικός meaning 'of or for words'.

Lexicography is the study of lexicons, and is divided into two separate academic disciplines. It is the art of compiling dictionaries.

Lexicology is the branch of linguistics that analyzes the lexicon of a specific language. A word is the smallest meaningful unit of a language that can stand on its own, and is made up of small components called morphemes and even smaller elements known as phonemes, or distinguishing sounds. Lexicology examines every feature of a word – including formation, spelling, origin, usage, and definition.

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic but can often come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, and inference.

Charles J. Fillmore was an American linguist and Professor of Linguistics at the University of California, Berkeley. He received his Ph.D. in Linguistics from the University of Michigan in 1961. Fillmore spent ten years at Ohio State University and a year as a Fellow at the Center for Advanced Study in the Behavioral Sciences at Stanford University before joining Berkeley's Department of Linguistics in 1971. Fillmore was extremely influential in the areas of syntax and lexical semantics.

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

Beryl T. "Sue" Atkins was a British lexicographer, specialising in computational lexicography, who pioneered the creation of bilingual dictionaries from corpus data.

A machine-readable dictionary (MRD) is a dictionary stored as machine-readable data instead of being printed on paper. It is an electronic dictionary and lexical database.

Language resource management: Lexical markup framework (LMF) is the International Organization for Standardization (ISO/TC37) standard for natural language processing (NLP) and machine-readable dictionary (MRD) lexicons. The scope is standardization of principles and methods relating to language resources in the contexts of multilingual communication.

In digital lexicography, natural language processing, and digital humanities, a lexical resource is a language resource consisting of data regarding the lexemes of the lexicon of one or more languages e.g., in the form of a database.

SemEval is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive.

BabelNet is a multilingual lexicalized semantic network and ontology developed at the NLP group of the Sapienza University of Rome. BabelNet was automatically created by linking Wikipedia to the most popular computational lexicon of the English language, WordNet. The integration is done using an automatic mapping and by filling in lexical gaps in resource-poor languages by using statistical machine translation. The result is an encyclopedic dictionary that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations. Additional lexicalizations and definitions are added by linking to free-license wordnets, OmegaWiki, the English Wiktionary, Wikidata, FrameNet, VerbNet and others. Similarly to WordNet, BabelNet groups words in different languages into sets of synonyms, called Babel synsets. For each Babel synset, BabelNet provides short definitions in many languages harvested from both WordNet and Wikipedia.

UBY-LMF is a format for standardizing lexical resources for Natural Language Processing (NLP). UBY-LMF conforms to the ISO standard for lexicons: LMF, designed within the ISO-TC37, and constitutes a so-called serialization of this abstract standard. In accordance with the LMF, all attributes and other linguistic terms introduced in UBY-LMF refer to standardized descriptions of their meaning in ISOCat.

UBY is a large-scale lexical-semantic resource for natural language processing (NLP) developed at the Ubiquitous Knowledge Processing Lab (UKP) in the Department of Computer Science of the Technische Universität Darmstadt. UBY is based on the ISO standard Lexical Markup Framework (LMF) and combines information from several expert-constructed and collaboratively constructed resources for English and German.

In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, but has been a point of focal activity for several W3C community groups, research projects, and infrastructure efforts since then.

Mona Talat Diab is a computer science professor at George Washington University and a research scientist with Facebook AI. Her research focuses on natural language processing, computational linguistics, cross lingual/multilingual processing, computational socio-pragmatics, Arabic language processing, and applied machine learning.

References

  1. Byrd, Roy J., Nicoletta Calzolari, Martin S. Chodorow, Judith L. Klavans, Mary S. Neff, and Omneya A. Rizk. "Tools and methods for computational lexicology." Computational Linguistics 13, no. 3-4 (1987): 219-240.

Amsler, Robert A. "The Structure of the Merriam-Webster Pocket Dictionary." Ph.D. dissertation, The University of Texas at Austin, 1980.