Name resolution (semantics and text extraction)

Last updated

In semantics and text extraction, name resolution refers to the ability of text mining software to determine which actual person, actor, or object a particular use of a name refers to. It can also be referred to as entity resolution.

Contents

Name resolution in simple text

For example, in the text mining field, software frequently needs to interpret the following text:

John gave Edward the book. He then stood up and called to John to come back into the room.

In these sentences, the software must determine whether the pronoun "he" refers to "John", or "Edward" from the first sentence. The software must also determine whether the "John" referred to in the second sentence is the same as the "John" in the first sentence, or a third person whose name also happens to be "John". Such examples apply to almost all languages, and not only English.

Name resolution across documents

Frequently, this type of name resolution is also used across documents, for example to determine whether the "George Bush" referenced in an old newspaper article as President of the United States (George H. W. Bush) is the same person as the "George Bush" mentioned in a separate news article years later about a man who is running for President (George W. Bush.) Because many people may have the same name, analysts and software must take into account substantially more information than only a name to determine whether two identical references ("George Bush") actually refer to the same specific entity or person.

Name/entity resolution in text extraction and semantics is a notoriously difficult problem, in part because in many cases there is not sufficient information to make an accurate determination. Numerous partial solutions exist that rely on specific contextual clues found in the data, but there is no currently known general solution.

The problem is sometimes referred to as name disambiguation and, for digital libraries, author disambiguation.

For examples of software that might provide name resolution benefits, see also:

See also

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of computer science and information retrieval. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. To this end, natural language processing often borrows ideas from theoretical linguistics. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

<span class="mw-page-title-main">Semantics</span> Study of meaning in language

Semantics is the study of linguistic meaning. It examines what meaning is, how words get their meaning, and how the meaning of a complex expression depends on its parts. Part of this process involves the distinction between sense and reference. Sense is given by the ideas and concepts associated with an expression while reference is the object to which an expression points. Semantics contrasts with syntax, which studies the rules that dictate how to create grammatically correct sentences, and pragmatics, which investigates how people use language in communication.

The Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) standard originally designed as a data model for metadata. It has come to be used as a general method for description and exchange of graph data. RDF provides a variety of syntax notations and data serialization formats, with Turtle currently being the most widely used notation.

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction.

An annotation is extra information associated with a particular point in a document or other piece of information. It can be a note that includes a comment or explanation. Annotations are sometimes presented in the margin of book pages. For annotations of different digital media, see web annotation and text annotation.

In linguistics, anaphora is the use of an expression whose interpretation depends upon another expression in context. In a narrower sense, anaphora is the use of an expression that depends specifically upon an antecedent expression and thus is contrasted with cataphora, which is the use of an expression that depends upon a postcedent expression. The anaphoric (referring) term is called an anaphor. For example, in the sentence Sally arrived, but nobody saw her, the pronoun her is an anaphor, referring back to the antecedent Sally. In the sentence Before her arrival, nobody saw Sally, the pronoun her refers forward to the postcedent Sally, so her is now a cataphor. Usually, an anaphoric expression is a pro-form or some other kind of deictic expression. Both anaphora and cataphora are species of endophora, referring to something mentioned elsewhere in a dialog or text.

In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions refer to the same person or thing; they have the same referent. For example, in Bill said Alice would arrive soon, and she did, the words Alice and she refer to the same person.

Named-entity recognition (NER) (also known as (named)entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in The Wall Street Journal corpus denote abbreviations. Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang.

<span class="mw-page-title-main">DBpedia</span> Online database project

DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web using OpenLink Virtuoso. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.

In information extraction, a named entity is a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include Barack Obama, New York City, Volkswagen Golf, or anything else that can be named. Named entities can simply be viewed as entity instances.

A bound variable pronoun is a pronoun that has a quantified determiner phrase (DP) – such as every, some, or who – as its antecedent.

Logophoricity is a phenomenon of binding relation that may employ a morphologically different set of anaphoric forms, in the context where the referent is an entity whose speech, thoughts, or feelings are being reported. This entity may or may not be distant from the discourse, but the referent must reside in a clause external to the one in which the logophor resides. The specially-formed anaphors that are morphologically distinct from the typical pronouns of a language are known as logophoric pronouns, originally coined by the linguist Claude Hagège. The linguistic importance of logophoricity is its capability to do away with ambiguity as to who is being referred to. A crucial element of logophoricity is the logophoric context, defined as the environment where use of logophoric pronouns is possible. Several syntactic and semantic accounts have been suggested. While some languages may not be purely logophoric, logophoric context may still be found in those languages; in those cases, it is common to find that in the place where logophoric pronouns would typically occur, non-clause-bounded reflexive pronouns appear instead.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

The following outline is provided as an overview of and topical guide to natural-language processing:

<span class="mw-page-title-main">Entity linking</span> Concept in Natural Language Processing

In natural language processing, entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN) is the task of assigning a unique identity to entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". Entity linking is different from named-entity recognition (NER) in that NER identifies the occurrence of a named entity in text but it does not identify which specific entity it is.

NetOwl is a suite of multilingual text and identity analytics products that analyze big data in the form of text data – reports, web, social media, etc. – as well as structured entity data about people, organizations, places, and things.

<span class="mw-page-title-main">Infobox</span> Template used to collect and present a subset of information about a subject

An infobox is a digital or physical table used to collect and present a subset of information about its subject, such as a document. It is a structured document containing a set of attribute–value pairs, and in Wikipedia represents a summary of information about the subject of an article. In this way, they are comparable to data tables in some aspects. When presented within the larger document it summarizes, an infobox is often presented in a sidebar format.

References