Named entity

Last updated

In information extraction, a named entity is a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include Barack Obama, New York City, Volkswagen Golf, or anything else that can be named. Named entities can simply be viewed as entity instances (e.g., New York City is an instance of a city).

From a historical perspective, the term Named Entity was coined during the MUC-6 evaluation campaign [1] and contained ENAMEX (entity name expressions e.g. persons, locations and organizations) and NUMEX (numerical expression).

A more formal definition can be derived from the rigid designator by Saul Kripke. In the expression "Named Entity", the word "Named" aims to restrict the possible set of entities to only those for which one or many rigid designators stands for the referent. [2] A designator is rigid when it designates the same thing in every possible world. On the contrary, flaccid designators may designate different things in different possible worlds.

As an example, consider the sentence, "Biden is the president of the United States". Both "Biden" and the "United States" are named entities since they refer to specific objects (Joe Biden and United States). However, "president" is not a named entity since it can be used to refer to many different objects in different worlds (in different presidential periods referring to different persons, or even in different countries or organizations referring to different people). Rigid designators usually include proper names as well as certain natural terms like biological species and substances.

There is also a general agreement in the Named Entity Recognition community to consider temporal and numerical expressions as named entities, such as amounts of money and other types of units, which may violate the rigid designator perspective.

The task of recognizing named entities in text is Named Entity Recognition while the task of determining the identity of the named entities mentioned in text is called Named Entity Disambiguation. Both tasks require dedicated algorithms and resources to be addressed. [3]

See also

Related Research Articles

Natural language processing Field of computer science and linguistics

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

A proper noun is a noun that identifies a single entity and is used to refer to that entity, such as Africa, Jupiter, Sarah, or Amazon, as distinguished from a common noun, which is a noun that refers to a class of entities and may be used when referring to instances of a specific class. Some proper nouns occur in plural form, and then they refer to groups of entities considered as unique. Proper nouns can also occur in secondary applications, for example modifying nouns, or in the role of common nouns. The detailed definition of the term is problematic and, to an extent, governed by convention.

Text mining, also referred to as text data mining, similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can differ three different perspectives of text mining: information extraction, data mining, and a KDD process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction

In modal logic and the philosophy of language, a term is said to be a rigid designator or absolute substantial term when it designates the same thing in all possible worlds in which that thing exists. A designator is persistently rigid if it also designates nothing in all other possible worlds. A designator is obstinately rigid if it designates the same thing in every possible world, period, whether or not that thing exists in that world. Rigid designators are contrasted with connotative terms, non-rigid or flaccid designators, which may designate different things in different possible worlds.

In the philosophy of language and modal logic, a term is said to be a non-rigid designator or connotative term if it does not extensionally designate the same object in all possible worlds. This is in contrast to a rigid designator, which does designate the same object in all possible worlds in which that object exists, and does not designate anything else in those worlds in which that object does not exist. The term was coined by Saul Kripke in his 1970 lecture series at Princeton University, later published as the book Naming and Necessity.

Name resolution can refer to any process that further identifies an object or entity from an associated, not-necessarily-unique alphanumeric name:

In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions in a text refer to the same person or thing; they have the same referent, e.g. Bill said he would come; the proper noun Bill and the pronoun he refer to the same person, namely to Bill. Coreference is the main concept underlying binding phenomena in the field of syntax. The theory of binding explores the syntactic relationship that exists between coreferential expressions in sentences and texts. When two expressions are coreferential, the one is usually a full form and the other is an abbreviated form. Linguists use indices to show coreference, as with the i index in the example Billi said hei would come. The two expressions with the same reference are coindexed, hence in this example Bill and he are coindexed, indicating that they should be interpreted as coreferential.

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

A direct reference theory is a theory of language that claims that the meaning of a word or expression lies in what it points out in the world. The object denoted by a word is called its referent. Criticisms of this position are often associated with Ludwig Wittgenstein.

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, also more difficult data domains can be analyzed, e.g., news texts where authors typically express their opinion/sentiment less explicitly.

The Message Understanding Conferences (MUC) were initiated and financed by DARPA to encourage the development of new and better methods of information extraction. The character of this competition—many concurrent research teams competing against one another—required the development of standards for evaluation, e.g. the adoption of metrics like precision and recall.

Reference is a relationship between objects in which one object designates, or acts as a means by which to connect to or link to, another object. The first object in this relation is said to refer to the second object. It is called a name for the second object. The second object, the one to which the first object refers, is called the referent of the first object. A name is usually a phrase or expression, or some other symbolic representation. Its referent may be anything – a material object, a person, an event, an activity, or an abstract concept.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

Automatic content extraction (ACE) is a research program for developing advanced information extraction technologies convened by the NIST from 1999 to 2008, succeeding MUC and preceding Text Analysis Conference.

The following outline is provided as an overview of and topical guide to natural language processing:

Entity linking

In natural language processing, entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN) is the task of assigning a unique identity to entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". Entity linking is different from named-entity recognition (NER) in that NER identifies the occurrence of a named entity in text but it does not identify which specific entity it is.

NetOwl is a suite of multilingual text and identity analytics products that analyze big data in the form of text data – reports, web, social media, etc. – as well as structured entity data about people, organizations, places, and things.

In semantics and text extraction, name resolution refers to the ability of text mining software to determine which actual person, actor, or object a particular use of a name refers to. It can also be referred to as entity resolution.

In natural language processing, open information extraction (OIE) is the task of generating a structured, machine-readable representation of the information in text, usually in the form of triples or n-ary propositions.

References

  1. Grishman, Ralph; Sundheim, Beth (1996). Design of the MUC-6 evaluation (PDF). TIPSTER '96 Proceedings.
  2. Nadeau, David; Sekine, Satoshi (2007). A survey of named entity recognition and classification (PDF). Lingvisticae Investigationes.
  3. Nouvel, Damien; Ehrmann, Maud; Rosset, Sophie (2015). Wiley (ed.). Named Entities for Computational Linguistics. ISBN   978-1-84821-838-3.