Temporal annotation is the study of how to automatically add semantic information regarding time to natural language documents. It plays a role in natural language processing and computational linguistics.
Temporal annotation involves the application of a semantic annotation to a document. Significant temporal annotation standards include TimeML, ISO-TimeML and TIDES. [1] These standards typically include annotations for some or all of temporal expressions (or timexes), events, temporal relations, temporal signals, [2] and temporal relation types.
In natural language texts, events may be associated with times; e.g., they may start or end at a given at a time. Events are also associated with other events, like occurring before or after them. We call these relations temporal relations. Temporal relation typing classifies the relation between two arguments, and is an important and difficult sub-task of figuring out all the temporal information in a document. Allen's interval algebra is one scheme for types of temporal relations. Rule-engineering and machine learning approaches to temporal annotation have both been successful, though achieving high performance in temporal relation typing remains a difficult task.
Successful temporal annotation enables systems to find out when facts asserted in texts are true, to build timelines, [3] to extract plans, and to discover mentions of change. This has had applications in many domains, such as information extraction, digital history, [4] processing social media, and clinical text mining.
The TempEval task series sets a shared temporal annotation task, and has run at SemEval three times, attracting system entries from around the world. [5] [6] [7] The task originally centred on determining the types of temporal relations only. In TempEval-2 and -3, this expanded to include event and timex annotation. In addition, the i2b2 clinical evaluation shared task was a temporal annotation exercise in 2012, which attracted a great deal of interest. [8]
In computational linguistics, word-sense disambiguation (WSD) is an open problem concerned with identifying which sense of a word is used in a sentence. The solution to this issue impacts other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, and inference.
Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".
James Pustejovsky is an American computer scientist. He is the TJX Feldberg professor of computer science at Brandeis University in Waltham, Massachusetts, United States. His expertise includes theoretical and computational modeling of language, specifically: Computational linguistics, Lexical semantics, Knowledge representation, temporal and spatial reasoning and Extraction. His main topics of research are Natural language processing generally, and in particular, the computational analysis of linguistic meaning.
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.
Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies developed through studies in this field are frequently applied to the biomedical and molecular biology literature available through services such as PubMed.
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data. The exploitation of treebank data has been important ever since the first large-scale treebank, The Penn Treebank, was published. However, although originating in computational linguistics, the value of treebanks is becoming more widely appreciated in linguistics research as a whole. For example, annotated treebank data has been crucial in syntactic research to test linguistic theories of sentence structure against large quantities of naturally occurring examples.
The sequence between semantic related ordered words is classified as a lexical chain. A lexical chain is a sequence of related words in writing, spanning short or long distances. A chain is independent of the grammatical structure of the text and in effect it is a list of words that captures a portion of the cohesive structure of the text. A lexical chain can provide a context for the resolution of an ambiguous term and enable identification of the concept that the term represents.
PropBank is a corpus that is annotated with verbal propositions and their arguments—a "proposition bank". Although "PropBank" refers to a specific corpus produced by Martha Palmer et al., the term propbank is also coming to be used as a common noun referring to any corpus that has been annotated with propositions and their arguments.
Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.
A temporal expression in a text is a sequence of tokens that denote time, that is express a point in time, a duration or a frequency. Examples:
He was born on <TIMEX>6 May, 1980</TIMEX>.
The show lasted <TIMEX>7 minutes</TIMEX>.
The pump circulates the water <TIMEX>every 2 hours</TIMEX>.
Referring expression generation (REG) is the subtask of natural language generation (NLG) that received most scholarly attention. While NLG is concerned with the conversion of non-linguistic information into natural language, REG focuses only on the creation of referring expressions that identify specific entities called targets.
Lexical substitution is the task of identifying a substitute for a word in the context of a clause. For instance, given the following text: "After the match, replace any remaining fluid deficit to prevent chronic dehydration throughout the tournament", a substitute of game might be given.
TimeML is a set of rules for encoding documents electronically. It is defined in the TimeML Specification version 1.2.1 developed by several efforts, led in large part by the Laboratory for Linguistics and Computation at Brandeis University.
ISO 24617-1:2009, ISO-TimeML is the International Organization for Standardization ISO/TC37 standard for time and event markup and annotation. The scope is standardization of principles and methods relating to the annotation of temporal events in the contexts of electronic documentation and language.
SemEval is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive.
Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criteria is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.
The following outline is provided as an overview of and topical guide to natural language processing:
The Bulgarian Sense-annotated Corpus (BulSemCor) is a structured corpus of Bulgarian texts in which each lexical item is assigned a sense tag. BulSemCor was created by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.