DATR

DATR is a language for lexical knowledge representation.[1] Lexical knowledge is encoded as a network of nodes, each of which carries a set of attribute definitions; a node can represent a word or a word form.
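
The flavour of such a network can be illustrated with a short sketch. The following Python fragment is an informal toy model only, not an implementation of DATR's evaluation semantics; the node names, attribute paths and values are invented for illustration. It shows the core idea: attribute paths are resolved by default inheritance from a parent node, with more specific nodes overriding inherited values.

    # Toy illustration of default inheritance over attribute paths,
    # loosely in the spirit of a DATR node network. Names and values
    # are invented; real DATR has a richer path-based semantics.

    NODES = {
        "Verb": {
            "parent": None,
            "attrs": {("syn", "cat"): "verb",
                      ("mor", "past"): "<root> + ed"},   # default past-tense pattern
        },
        "Walk": {
            "parent": "Verb",
            "attrs": {("mor", "root"): "walk"},
        },
        "Sing": {
            "parent": "Verb",
            "attrs": {("mor", "root"): "sing",
                      ("mor", "past"): "sang"},          # overrides the inherited default
        },
    }

    def lookup(node, path):
        """Return the value of an attribute path, inheriting from parent nodes."""
        while node is not None:
            attrs = NODES[node]["attrs"]
            if path in attrs:
                return attrs[path]
            node = NODES[node]["parent"]
        return None

    print(lookup("Walk", ("mor", "past")))   # '<root> + ed'  (inherited default)
    print(lookup("Sing", ("mor", "past")))   # 'sang'         (local override)

In DATR proper, a query such as Walk:<mor past> is answered by a comparable default-inheritance mechanism, together with path extension and quoted (global) versus unquoted (local) path descriptors that this sketch does not model.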

DATR was developed in the late 1980s by Roger Evans, Gerald Gazdar and Bill Keller,[2][3] and was used extensively in the 1990s; the standard specification is contained in the Evans and Gazdar RFC, available on the Sussex website (below). DATR has been implemented in a variety of programming languages, and several implementations are available on the internet, including an RFC-compliant implementation at the Bielefeld website (below).

DATR is still used for encoding inheritance networks in various linguistic and non-linguistic domains and is under discussion as a standard notation for the representation of lexical information.

Related Research Articles

In linguistics, syntax is the set of rules, principles, and processes that govern the structure of sentences in a given language, usually including word order. The term syntax is also used to refer to the study of such principles and processes. The goal of many syntacticians is to discover the syntactic rules common to all languages.

WordNet

WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples. WordNet can thus be seen as a combination and extension of a dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. WordNet was first created in the English language, and the English WordNet database and software tools have been released under a BSD-style license and are freely available for download from the WordNet website.
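
These relations can be queried programmatically. The Python sketch below uses NLTK's WordNet interface; it assumes the nltk package and its WordNet corpus data are installed, and the particular words chosen are illustrative.

    # Browsing WordNet synsets and semantic relations with NLTK
    # (assumes nltk is installed and the 'wordnet' corpus has been downloaded).
    from nltk.corpus import wordnet as wn

    for syn in wn.synsets("car"):
        print(syn.name(), "-", syn.definition())      # each sense with its gloss

    car = wn.synset("car.n.01")
    print([lemma.name() for lemma in car.lemmas()])   # synonyms in the synset
    print([h.name() for h in car.hypernyms()])        # more general concepts
    print([m.name() for m in car.part_meronyms()])    # parts of a car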

In computational linguistics, word-sense disambiguation (WSD) is the open problem of identifying which sense of a word is used in a sentence. Solutions to this problem affect other areas of language processing, such as discourse analysis, the relevance of search-engine results, anaphora resolution, coherence, and inference.
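
A classic baseline for the problem is gloss overlap in the style of the Lesk algorithm. The sketch below is a deliberately simplified version using WordNet glosses via NLTK; the example sentence is invented, and NLTK also ships a ready-made implementation as nltk.wsd.lesk.

    # Simplified Lesk-style disambiguation: choose the WordNet sense whose
    # dictionary gloss shares the most words with the sentence context.
    from nltk.corpus import wordnet as wn

    def simple_lesk(context_words, ambiguous_word):
        context = set(w.lower() for w in context_words)
        best_sense, best_overlap = None, -1
        for sense in wn.synsets(ambiguous_word):
            gloss = set(sense.definition().lower().split())
            overlap = len(gloss & context)
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    sentence = "I deposited the cheque at the bank on the corner".split()
    print(simple_lesk(sentence, "bank"))   # the sense whose gloss best matches the context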

Natural-language understanding (NLU) or natural-language interpretation (NLI) is a subtopic of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an AI-hard problem.

Head-driven phrase structure grammar (HPSG) is a highly lexicalized, constraint-based grammar developed by Carl Pollard and Ivan Sag. It is a type of phrase structure grammar, as opposed to a dependency grammar, and it is the immediate successor to generalized phrase structure grammar. HPSG draws from other fields such as computer science and uses Ferdinand de Saussure's notion of the sign. It uses a uniform formalism and is organized in a modular way which makes it attractive for natural language processing.

Tree-adjoining grammar (TAG) is a grammar formalism defined by Aravind Joshi. Tree-adjoining grammars are somewhat similar to context-free grammars, but the elementary unit of rewriting is the tree rather than the symbol. Whereas context-free grammars have rules for rewriting symbols as strings of other symbols, tree-adjoining grammars have rules for rewriting the nodes of trees as other trees.
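
The adjunction operation can be made concrete with a small sketch. In the Python fragment below, trees are plain nested lists and the grammar fragment ("John runs", "really") is invented purely to show the splice; it is not a general TAG implementation.

    # Toy illustration of adjunction: an auxiliary tree for "really" is
    # adjoined at the VP node of an initial tree for "John runs".
    # Trees are nested lists of the form [label, child1, child2, ...].

    initial = ["S", ["NP", "John"], ["VP", ["V", "runs"]]]
    auxiliary = ["VP", ["Adv", "really"], ["VP*"]]    # "VP*" marks the foot node

    def adjoin(tree, label, aux):
        """Adjoin aux at every node labelled `label` (the example has exactly one)."""
        if isinstance(tree, str):
            return tree
        if tree[0] == label:
            # splice: the original subtree replaces the foot node of the auxiliary tree
            return [aux[0]] + [tree if child == [label + "*"]
                               else adjoin(child, label, aux)
                               for child in aux[1:]]
        return [tree[0]] + [adjoin(child, label, aux) for child in tree[1:]]

    print(adjoin(initial, "VP", auxiliary))
    # ['S', ['NP', 'John'], ['VP', ['Adv', 'really'], ['VP', ['V', 'runs']]]]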

Generalized phrase structure grammar (GPSG) is a framework for describing the syntax and semantics of natural languages. It is a type of constraint-based phrase structure grammar. Constraint-based grammars define certain syntactic configurations as ungrammatical for a given language and assume that everything not thus excluded is grammatical within that language. Phrase structure grammars base their framework on constituency relationships, seeing the words in a sentence as ranked, with some words dominating the others. For example, in the sentence "The dog runs", "runs" is seen as dominating "dog" since it is the main focus of the sentence. This view stands in contrast to dependency grammars, which base their assumed structure on the relationship between a single word in a sentence and its dependents.

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content, as opposed to lexicographical similarity. Such measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained by comparing information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness: semantic relatedness includes any relation between two terms, while semantic similarity includes only "is a" relations. For example, "car" is similar to "bus", but it is also related to "road" and "driving".
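
One concrete measure of this kind is path similarity over WordNet's "is a" hierarchy, shown in the sketch below. It assumes NLTK and its WordNet data are installed; the scores themselves depend on the WordNet version, so no specific values are claimed here.

    # Taxonomic ("is a") similarity via WordNet path similarity.
    from nltk.corpus import wordnet as wn

    car = wn.synset("car.n.01")
    bus = wn.synset("bus.n.01")
    road = wn.synset("road.n.01")

    # "car"/"bus" typically score higher than "car"/"road": car and bus are
    # taxonomically similar, while car and road are merely related.
    print(car.path_similarity(bus))
    print(car.path_similarity(road))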

Gerald James Michael Gazdar is a linguist and computer scientist.

Frame semantics is a theory of linguistic meaning developed by Charles J. Fillmore that extends his earlier case grammar. It relates linguistic semantics to encyclopedic knowledge. The basic idea is that one cannot understand the meaning of a single word without access to all the essential knowledge that relates to that word. For example, one would not be able to understand the word "sell" without knowing anything about the situation of commercial transfer, which also involves, among other things, a seller, a buyer, goods, money, the relation between the money and the goods, the relations between the seller and the goods and the money, the relation between the buyer and the goods and the money and so on. Thus, a word activates, or evokes, a frame of semantic knowledge relating to the specific concept to which it refers.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
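
A minimal count-based sketch of the idea is given below: each word is represented by its co-occurrence counts with a handful of context words, and distributional similarity is approximated by the cosine of the count vectors. The corpus counts are invented for illustration.

    # Count-based illustration of the distributional hypothesis: words with
    # similar co-occurrence profiles get similar vectors. Counts are invented.
    import numpy as np

    # context words:        drink  pour  hot  drive  park
    vectors = {
        "coffee": np.array([5, 3, 4, 0, 0], dtype=float),
        "tea":    np.array([6, 4, 5, 0, 0], dtype=float),
        "car":    np.array([0, 0, 1, 7, 5], dtype=float),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(vectors["coffee"], vectors["tea"]))   # high: similar distributions
    print(cosine(vectors["coffee"], vectors["car"]))   # low: dissimilar distributions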

Generative lexicon (GL) is a theory of linguistic semantics which focuses on the distributed nature of compositionality in natural language. The first major work outlining the framework is James Pustejovsky's 1991 article "The Generative Lexicon". Subsequent important developments are presented in Pustejovsky and Boguraev (1993), Bouillon (1997), and Busa (1996). The first unified treatment of GL was given in Pustejovsky (1995). Unlike purely verb-based approaches to compositionality, generative lexicon attempts to spread the semantic load across all constituents of the utterance. Central to the philosophical perspective of GL are two major lines of inquiry: (1) How is it that we are able to deploy a finite number of words in our language in an unbounded number of contexts? (2) Are lexical information and the representations used in composing meanings separable from our commonsense knowledge?

Meaning–text theory (MTT) is a theoretical linguistic framework, first put forward in Moscow by Aleksandr Žolkovskij and Igor Mel’čuk, for the construction of models of natural language. The theory provides a large and elaborate basis for linguistic description and, due to its formal character, lends itself particularly well to computer applications, including machine translation, phraseology, and lexicography.

Quantitative comparative linguistics is the use of quantitative analysis as applied to comparative linguistics.

Linguistics is the scientific study of language. It involves an analysis of language form, language meaning, and language in context, as well as an analysis of the social, cultural, historical, and political factors that influence language.

BabelNet

BabelNet is a multilingual lexicalized semantic network and ontology developed at the NLP group of the Sapienza University of Rome. BabelNet was automatically created by linking Wikipedia to the most popular computational lexicon of the English language, WordNet. The integration is performed by an automatic mapping, with lexical gaps in resource-poor languages filled using statistical machine translation. The result is an encyclopedic dictionary that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations. Additional lexicalizations and definitions are added by linking to free-license wordnets, OmegaWiki, the English Wiktionary, Wikidata, FrameNet, VerbNet and others. Similarly to WordNet, BabelNet groups words in different languages into sets of synonyms, called Babel synsets. For each Babel synset, BabelNet provides short definitions in many languages harvested from both WordNet and Wikipedia.

Word embedding is any of a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension.
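
The sketch below shows the mechanics with gensim's Word2Vec; the parameter names follow the gensim 4.x API, and the three-sentence corpus is invented and far too small to yield meaningful vectors.

    # Training a toy Word2Vec embedding model with gensim (gensim 4.x API).
    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
        ["a", "dog", "chased", "a", "cat"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

    print(model.wv["cat"].shape)          # a 50-dimensional real-valued vector
    print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space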

Ann Alicia Copestake is professor of Computational Linguistics and head of the Department of Computer Science and Technology at the University of Cambridge and a fellow of Wolfson College, Cambridge.

Dynamic Syntax (DS) is a grammar formalism and linguistic theory whose overall aim is to explain the real-time twin processes of language understanding and production. Under the DS approach, syntactic knowledge is understood as the ability to incrementally analyse the structure and content of spoken and written language in context and in real time. While it posits representations similar to those used in Combinatory Categorial Grammars (CCG), it builds those representations left-to-right, going word by word. Thus it differs from other syntactic models, which generally abstract away from features of everyday conversation such as interruption, backtracking, and self-correction. Moreover, it differs from other approaches in that it does not postulate an independent level of syntactic structure over words.

References

  1. Ooi, Vincent B. Y. (1998). Computer Corpus Lexicography. Edinburgh University Press. pp. 97–100. ISBN 978-0-7486-0815-7. Retrieved 20 February 2013.
  2. Evans, Roger; Gazdar, Gerald (1996). "DATR: A language for lexical knowledge representation" (PDF). Computational Linguistics. 22 (2): 167–216. Archived from the original (PDF) on 2006-06-19. Retrieved 2014-03-17.
  3. Keller, Bill (1996). "An evaluation semantics for DATR theories" (PDF). Proceedings of the 16th Conference on Computational Linguistics, Volume 2. Association for Computational Linguistics. Archived from the original (PDF) on 2014-03-17. Retrieved 2014-03-17.