In natural language processing, Entity Linking, also referred to as named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN), [1] is the task of assigning a unique identity to entities (such as famous individuals, locations, or companies) mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to first identify "Paris" and "France" as named entities, and then to determine that "Paris" refers to the city of Paris rather than to Paris Hilton or any other entity that could be called "Paris", and that "France" refers to the country. The Entity Linking task is composed of three subtasks. First, Named Entity Recognition extracts named entities from the text. Second, for each named entity, candidate entities are generated from a Knowledge Base (e.g. Wikipedia, Wikidata, DBpedia); this step is called candidate generation, and its main challenge is ensuring that the correct entity appears in the candidate set. Lastly, the correct entity is chosen from the candidate set; this step is called disambiguation.
One way to see the Entity Linking task is as the combination of the NER task followed by the NEL task, where NEL is the combination of candidate generation and disambiguation.
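A minimal sketch of this three-step pipeline is shown below. It assumes spaCy (with the "en_core_web_sm" model installed) for named entity recognition, and uses a toy alias table with prior popularity scores as a stand-in for a real knowledge base; the identifiers and scores are illustrative only.

```python
# Minimal entity-linking pipeline sketch: NER -> candidate generation -> disambiguation.
# Assumes spaCy ("en_core_web_sm") for NER; the alias table and priors are toy stand-ins
# for a real knowledge base such as Wikidata.
import spacy

nlp = spacy.load("en_core_web_sm")

# Toy candidate index: surface form -> candidate entities with prior popularity scores.
ALIAS_TABLE = {
    "Paris": [("Q90", "Paris (city)", 0.87), ("Q47899", "Paris Hilton", 0.07)],
    "France": [("Q142", "France (country)", 0.95)],
}

def link_entities(text):
    results = []
    for ent in nlp(text).ents:                      # 1. named entity recognition
        candidates = ALIAS_TABLE.get(ent.text, [])  # 2. candidate generation
        if not candidates:
            results.append((ent.text, None))        # no candidate found -> NIL
            continue
        best = max(candidates, key=lambda c: c[2])  # 3. disambiguation (highest prior)
        results.append((ent.text, best[:2]))
    return results

print(link_entities("Paris is the capital of France"))
# e.g. [('Paris', ('Q90', 'Paris (city)')), ('France', ('Q142', 'France (country)'))]
```

The exact output depends on which spans the NER model detects; in a real system the disambiguation step would use context rather than a popularity prior alone.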
In entity linking, words of interest (names of persons, locations and companies) are mapped from an input text to corresponding unique entities in a target knowledge base. Words of interest are called named entities (NEs), mentions, or surface forms. The target knowledge base depends on the intended application, but for entity linking systems intended to work on open-domain text it is common to use knowledge bases derived from Wikipedia (such as Wikidata or DBpedia). [1] [2] In this case, each individual Wikipedia page is regarded as a separate entity. Entity linking techniques that map named entities to Wikipedia entities are also called wikification. [3]
Considering again the example sentence "Paris is the capital of France", the expected output of an entity linking system is the knowledge-base entry for the city of Paris and the one for France, for example their Wikipedia pages https://en.wikipedia.org/wiki/Paris and https://en.wikipedia.org/wiki/France. These uniform resource locators (URLs) can be used as unique uniform resource identifiers (URIs) for the entities in the knowledge base. Using a different knowledge base will return different URIs, but for knowledge bases built starting from Wikipedia there exist one-to-one URI mappings. [4]
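Such mappings can be looked up programmatically. The following is a minimal sketch, assuming network access to the public MediaWiki API, that retrieves the Wikidata URI associated with a Wikipedia page title; error handling is omitted for brevity.

```python
# Sketch: map a Wikipedia page title to its Wikidata URI via the MediaWiki API.
# Assumes network access to en.wikipedia.org.
import requests

def wikipedia_to_wikidata(title):
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "pageprops",
            "ppprop": "wikibase_item",
            "titles": title,
            "format": "json",
        },
        timeout=10,
    )
    pages = resp.json()["query"]["pages"]
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            return f"https://www.wikidata.org/wiki/{qid}"
    return None

print(wikipedia_to_wikidata("Paris"))   # https://www.wikidata.org/wiki/Q90
print(wikipedia_to_wikidata("France"))  # https://www.wikidata.org/wiki/Q142
```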
In most cases, knowledge bases are manually built, [5] but in applications where large text corpora are available, the knowledge base can be inferred automatically from the available text. [6]
Entity linking is a critical step to bridge web data with knowledge bases, which is beneficial for annotating the huge amount of raw and often noisy data on the Web and contributes to the vision of the Semantic Web. [7] In addition to entity linking, there are other critical steps, including but not limited to event extraction [8] and event linking. [9]
Entity linking is beneficial in fields that need to extract abstract representations from text, as it happens in text analysis, recommender systems, semantic search and chatbots. In all these fields, concepts relevant to the application are separated from text and other non-meaningful data. [10] [11]
For example, a common task performed by search engines is to find documents that are similar to one given as input, or to find additional information about the persons that are mentioned in it. Consider a sentence that contains the expression "the capital of France": without entity linking, a search engine that looks at the content of documents would not be able to directly retrieve documents containing the word "Paris", leading to so-called false negatives (FN). Even worse, the search engine might produce spurious matches (false positives, FP), such as retrieving documents referring to "France" as a country.
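A toy illustration of how linked entity identifiers can make such matches explicit is sketched below; it assumes the documents have already been annotated by an upstream entity linker, and the Wikidata identifiers are used as the indexing keys.

```python
# Toy entity-aware inverted index: documents are matched on linked entity IDs
# rather than on surface strings, so "the capital of France" and "Paris"
# can retrieve each other once both are linked to the same entity (Q90).
from collections import defaultdict

# Assumed output of an upstream entity linker: document id -> set of entity IDs.
annotated_docs = {
    "doc1": {"Q90"},          # "... the capital of France ..." linked to Paris (Q90)
    "doc2": {"Q90", "Q142"},  # mentions both Paris (Q90) and France (Q142)
    "doc3": {"Q142"},         # about France only; not returned for a Paris query
}

index = defaultdict(set)
for doc_id, entities in annotated_docs.items():
    for entity in entities:
        index[entity].add(doc_id)

def search(entity_id):
    return sorted(index.get(entity_id, set()))

print(search("Q90"))  # ['doc1', 'doc2'] -- doc1 is no longer a false negative
```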
Many approaches orthogonal to entity linking exist to retrieve documents similar to an input document, for example latent semantic analysis (LSA) or comparing document embeddings obtained with doc2vec. However, these techniques do not allow the same fine-grained control that is offered by entity linking, as they return other documents instead of creating high-level representations of the original one. For example, obtaining schematic information about "Paris", as presented by Wikipedia infoboxes, would be much less straightforward, or sometimes even unfeasible, depending on the query complexity. [12]
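As a point of comparison, a minimal LSA pipeline is sketched below using scikit-learn (an assumption; no specific library is prescribed here): it retrieves similar documents but produces no entity-level representation of their content.

```python
# Minimal LSA sketch: TF-IDF followed by truncated SVD, then cosine similarity
# between documents. It finds similar documents but yields no entity-level structure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Paris is the capital of France.",
    "The capital of France is a major European city.",
    "Paris Hilton is an American media personality.",
]

tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(cosine_similarity(lsa[0:1], lsa[1:]))  # similarity of doc 0 to docs 1 and 2
```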
Moreover, entity linking has been used to improve the performance of information retrieval systems [1] and to improve search performance on digital libraries. [13] Entity linking is also a key input for semantic search. [14] [15]
There are various difficulties in performing entity linking. Some of these are intrinsic to the task, [16] such as text ambiguity. Others are relevant in real-world use, such as scalability and execution time.
Moreover, in some cases the target knowledge base does not contain the correct entity for a mention; the system should then return an empty, or NIL, entity link. Knowing when to return a NIL prediction is not straightforward, and many approaches have been proposed. Examples are thresholding a confidence score in the entity linking system, and including a NIL entity in the knowledge base, which is treated as any other entity. However, in some cases, linking to an incorrect but related entity may be more useful to the user than having no result at all. [16]
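A minimal sketch of the confidence-thresholding strategy is given below, assuming the linker already produces a score for each candidate; the threshold value is illustrative and would normally be tuned on held-out data.

```python
# Sketch of NIL prediction by thresholding: if the best candidate's confidence
# falls below a (tuned) threshold, return NIL instead of a forced link.
NIL = None
THRESHOLD = 0.5  # illustrative value; in practice tuned on held-out data

def resolve(mention, scored_candidates):
    """scored_candidates: list of (entity_id, confidence) pairs from the linker."""
    if not scored_candidates:
        return NIL
    entity_id, confidence = max(scored_candidates, key=lambda c: c[1])
    return entity_id if confidence >= THRESHOLD else NIL

print(resolve("Paris", [("Q90", 0.91), ("Q47899", 0.04)]))  # 'Q90'
print(resolve("Foo Bar", [("Q123", 0.12)]))                 # None (NIL)
```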
Entity linking is related to other concepts, and definitions are often blurry and vary slightly between authors. The following examples on the sentence "Paris is the capital of France" illustrate the difference:
Entity linking: the mentions "Paris" and "France" are linked to the knowledge-base entries for the city of Paris and for the country France.
Named-entity recognition: the mentions are only detected and classified, as in "[Paris]City is the capital of [France]Country", without being linked to a knowledge base.
Coreference resolution: in "Paris is the capital of France. It is also the largest city in France.", the pronoun "It" is resolved to the mention "Paris", without deciding which real-world entity "Paris" refers to.
Entity linking has been a hot topic in industry and academia for the last decade. Many challenges are unsolved, but many entity linking systems have been proposed, with widely different strengths and weaknesses. [24]
Broadly speaking, modern entity linking systems can be divided into two categories: text-based approaches, which rely on textual features extracted from large text corpora (for example term co-occurrence statistics and word embeddings), and graph-based approaches, which exploit the structure of large knowledge graphs to capture the context and relatedness of entities.
Often entity linking systems use both knowledge graphs and textual features extracted from, for example, the text corpora used to build the knowledge graphs themselves. [21] [22]
The seminal work by Cucerzan in 2007 proposed one of the first entity linking systems. Specifically, it tackled the task of wikification, that is, linking textual mentions to Wikipedia pages. [25] The system categorizes Wikipedia pages into entity, disambiguation, or list pages. The set of entities mentioned in each entity page is used to build that entity's context. The final step is a collective disambiguation performed by comparing binary vectors of hand-crafted features built from each entity's context. Cucerzan's system is still used as a baseline in recent work. [27]
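A much simplified sketch of that vector-comparison idea is shown below: each candidate's context is encoded as a binary feature vector and compared with the document's feature vector. The feature sets and scoring are toy stand-ins, not Cucerzan's actual hand-crafted features.

```python
# Simplified sketch of comparing binary context vectors: the candidate whose
# context shares the most features with the document context is preferred.
# The feature sets below are illustrative only.
def binary_overlap(features_a, features_b):
    return len(features_a & features_b)  # dot product of binary indicator vectors

document_context = {"capital", "france", "city", "europe"}
candidate_contexts = {
    "Paris (city)": {"capital", "france", "city", "seine"},
    "Paris Hilton": {"celebrity", "television", "hotel"},
}

best = max(candidate_contexts,
           key=lambda c: binary_overlap(candidate_contexts[c], document_context))
print(best)  # Paris (city)
```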
Rao et al. [16] proposed a two-step algorithm to link named entities to entities in a target knowledge base. First, candidate entities are chosen using string matching, acronyms, and known aliases. Then, the best link among the candidates is chosen with a ranking support vector machine (SVM) that uses linguistic features.
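The candidate-generation step of such a two-step system can be sketched as follows; the alias and acronym tables are hypothetical, and the ranking SVM of the second step is omitted.

```python
# Sketch of candidate generation by exact string match, known aliases and acronyms.
# The tables are illustrative; the ranking SVM of the second step is not shown.
KB_ENTITIES = {"International Business Machines", "Paris", "France"}
ALIASES = {"Big Blue": "International Business Machines"}
ACRONYMS = {"IBM": "International Business Machines"}

def generate_candidates(mention):
    candidates = set()
    if mention in KB_ENTITIES:
        candidates.add(mention)                    # exact string match
    if mention in ALIASES:
        candidates.add(ALIASES[mention])           # known alias
    if mention.isupper() and mention in ACRONYMS:
        candidates.add(ACRONYMS[mention])          # acronym expansion
    return candidates

print(generate_candidates("IBM"))       # {'International Business Machines'}
print(generate_candidates("Big Blue"))  # {'International Business Machines'}
```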
Recent systems, such as the one by Tsai et al., [23] use word embeddings obtained with a skip-gram model as language features, and can be applied to any language for which a large corpus for building word embeddings is available. Like most entity linking systems, it has two steps: an initial candidate selection, followed by ranking with a linear SVM.
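A hedged sketch of that ranking step follows: each mention-candidate pair is described by embedding-similarity features and scored with a linear SVM. Random vectors stand in for the skip-gram embeddings and the labels are toy data, so this only illustrates the mechanics, not the original system.

```python
# Sketch: rank candidates with a linear SVM over embedding-similarity features.
# Random vectors stand in for skip-gram embeddings; training labels are toy data.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy training set: one feature vector per (mention, candidate) pair:
# [context-candidate embedding similarity, candidate prior popularity]
X_train = rng.random((40, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 1.0).astype(int)  # 1 = correct link
ranker = LinearSVC(C=1.0).fit(X_train, y_train)

def rank(candidates, context_vec, embeddings, priors):
    feats = np.array([[cosine(context_vec, embeddings[c]), priors[c]] for c in candidates])
    scores = ranker.decision_function(feats)     # higher = more likely the correct entity
    return candidates[int(np.argmax(scores))]

embeddings = {"Paris (city)": rng.random(50), "Paris Hilton": rng.random(50)}
priors = {"Paris (city)": 0.87, "Paris Hilton": 0.07}
context_vec = embeddings["Paris (city)"] + 0.1 * rng.random(50)  # toy context vector
print(rank(["Paris (city)", "Paris Hilton"], context_vec, embeddings, priors))
```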
Various approaches have been tried to tackle the problem of entity ambiguity. The seminal approach of Milne and Witten uses supervised learning using the anchor texts of Wikipedia entities as training data. [28] Other approaches also collected training data based on unambiguous synonyms. [29]
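One widely used signal derived from Wikipedia anchor texts is the "commonness" prior, the fraction of times an anchor string links to a given page. A toy computation is shown below; the counts are illustrative, not real Wikipedia statistics.

```python
# Toy computation of the "commonness" prior P(entity | anchor text),
# estimated from how often an anchor string links to each page.
# The counts below are illustrative, not real Wikipedia statistics.
from collections import Counter

anchor_counts = {
    "Paris": Counter({"Paris": 9500, "Paris Hilton": 300, "Paris (mythology)": 200}),
}

def commonness(anchor, entity):
    counts = anchor_counts[anchor]
    return counts[entity] / sum(counts.values())

print(round(commonness("Paris", "Paris"), 3))         # 0.95
print(round(commonness("Paris", "Paris Hilton"), 3))  # 0.03
```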
Modern entity linking systems also use large knowledge graphs created from knowledge bases such as Wikipedia, besides textual features generated from input documents or text corpora. Moreover, multilingual entity linking based on natural language processing (NLP) is difficult, because it requires either large text corpora, which are absent for many languages, or hand-crafted grammar rules, which are widely different between languages. Graph-based entity linking uses features of the graph topology or multi-hop connections between entities, which are hidden to simple text analysis.
Han et al. propose the creation of a disambiguation graph (a subgraph of the knowledge base which contains candidate entities). [2] This graph is used for collective ranking to select the best candidate entity for each textual mention.
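A minimal sketch of building such a disambiguation graph follows, using networkx; the mention-candidate sets and edge weights are toy values rather than relatedness scores computed from a real knowledge base.

```python
# Sketch: build a disambiguation graph whose nodes are candidate entities and
# whose weighted edges encode semantic relatedness taken from the knowledge base.
# Candidates and weights below are toy values.
import networkx as nx

mentions = {
    "Paris": ["Paris (city)", "Paris Hilton"],
    "France": ["France"],
}

graph = nx.Graph()
for candidates in mentions.values():
    graph.add_nodes_from(candidates)
graph.add_edge("Paris (city)", "France", weight=0.9)   # strongly related in the KB
graph.add_edge("Paris Hilton", "France", weight=0.1)   # weakly related

# Collective ranking: for each mention, prefer the candidate best connected to the rest.
for mention, candidates in mentions.items():
    best = max(candidates, key=lambda c: graph.degree(c, weight="weight"))
    print(mention, "->", best)   # Paris -> Paris (city), France -> France
```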
Another famous approach is AIDA, [30] which uses a series of complex graph algorithms and a greedy algorithm that identifies coherent mentions on a dense subgraph by also considering context similarities and vertex importance features to perform collective disambiguation. [26]
Alhelbawy et al. presented an entity linking system that uses PageRank to perform collective entity linking on a disambiguation graph, and to understand which entities are more strongly related to each other and would therefore represent a better linking. [20] Graph ranking (or vertex ranking) algorithms such as PageRank (PR) and Hyperlink-Induced Topic Search (HITS) aim to score nodes according to their relative importance in the graph.
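Candidate scoring of this kind can be sketched with networkx's PageRank implementation on a toy disambiguation graph; keeping, for each mention, the highest-ranked candidate is a simplification of the cited system.

```python
# Sketch: score candidate entities with PageRank on a toy disambiguation graph
# and keep, for each mention, the highest-ranked candidate.
import networkx as nx

graph = nx.Graph()
graph.add_edge("Paris (city)", "France", weight=0.9)
graph.add_edge("Paris (city)", "Seine", weight=0.7)
graph.add_edge("Paris Hilton", "France", weight=0.1)

scores = nx.pagerank(graph, weight="weight")
candidates_for_paris = ["Paris (city)", "Paris Hilton"]
best = max(candidates_for_paris, key=scores.get)
print(best, round(scores[best], 3))   # "Paris (city)" and its PageRank score
```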
Mathematical expressions (symbols and formulae) can be linked to semantic entities (e.g., Wikipedia articles [31] or Wikidata items [32] ) labeled with their natural language meaning. This is essential for disambiguation, since symbols may have different meanings (e.g., "E" can be "energy" or "expectation value", etc.). [33] [32] The math entity linking process can be facilitated and accelerated through annotation recommendation, e.g., using the "AnnoMathTeX" system that is hosted by Wikimedia. [34] [35] [36]
To facilitate the reproducibility of Mathematical Entity Linking (MathEL) experiments, the benchmark MathMLben was created. [37] [38] It contains formulae from Wikipedia, arXiv and the NIST Digital Library of Mathematical Functions (DLMF). Formula entries in the benchmark are labeled and augmented with Wikidata markup. [32] Furthermore, distributions of mathematical notation were examined for two large corpora from the arXiv [39] and zbMATH [40] repositories. Mathematical Objects of Interest (MOI) are identified as potential candidates for MathEL. [41]
Besides linking to Wikipedia, Schubotz [38] and Scharpf et al. [32] describe linking mathematical formula content to Wikidata, both in MathML and LaTeX markup. To extend classical citations with mathematical (formula) citations, they call for a Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR) challenge to elaborate automated MathEL. Their FCD approach yields a recall of 68% for retrieving equivalent representations of frequent formulae, and 72% for extracting the formula name from the surrounding text on the NTCIR [42] arXiv dataset. [36]
Natural language processing (NLP) is an interdisciplinary subfield of computer science and artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches in machine learning and deep learning.
The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.
Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.
Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP) that is concerned with building systems that automatically answer questions that are posed by humans in a natural language.
An annotation is extra information associated with a particular point in a document or other piece of information. It can be a note that includes a comment or explanation. Annotations are sometimes presented in the margin of book pages. For annotations of different digital media, see web annotation and text annotation.
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".
Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Bibliographic coupling, like co-citation, is a similarity measure that uses citation analysis to establish a similarity relationship between documents. Bibliographic coupling occurs when two works reference a common third work in their bibliographies. It is an indication that a probability exists that the two works treat a related subject matter.
Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input and expanding the search query to match additional documents.
Plagiarism detection or content similarity detection is the process of locating instances of plagiarism or copyright infringement within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others.
Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.
Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data may, for example, consist of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data.
Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.
In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.
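A compact sketch of that construction is given below using scikit-learn; the three "concept documents" stand in for a real Wikipedia-scale corpus.

```python
# Sketch of explicit semantic analysis: each word is its column of the TF-IDF
# matrix (its weights over the concept corpus), and a document is the centroid
# of the vectors of its words. The tiny corpus stands in for Wikipedia.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

concept_corpus = [
    "Paris is the capital and largest city of France.",
    "France is a country in western Europe.",
    "A capital city is the seat of a country's government.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(concept_corpus).toarray()   # shape: (concepts, terms)
vocab = vectorizer.vocabulary_

def word_vector(word):
    return tfidf[:, vocab[word]]                 # the word's column: weight per concept

def document_vector(words):
    return np.mean([word_vector(w) for w in words if w in vocab], axis=0)

print(document_vector(["capital", "france"]))    # interpretable weights over the 3 concepts
```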
In geographic information systems, toponym resolution is the relationship process between a toponym, i.e. the mention of a place, and an unambiguous spatial footprint of the same place.
Author name disambiguation is the process of disambiguation and record linkage applied to the names of individual people. The process could, for example, distinguish individuals with the name "John Smith".
In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a graph-structured data model or topology to represent and operate on data. Knowledge graphs are often used to store interlinked descriptions of entities – objects, events, situations or abstract concepts – while also encoding the free-form semantics or relationships underlying these entities.