Relationship extraction


Relationship extraction is the task of detecting and classifying mentions of semantic relationships within a set of artifacts, typically text or XML documents. The task is closely related to information extraction (IE), but IE additionally requires the removal of repeated relations (disambiguation) and generally refers to the extraction of many different kinds of relationships.


Concept and applications

The concept of relationship extraction was first introduced during the 7th Message Understanding Conference (MUC-7) in 1998. [1] Relationship extraction involves identifying relations between entities, and it usually focuses on the extraction of binary relations. [2] Application domains where relationship extraction is useful include extracting gene–disease relationships [3] and protein–protein interactions. [4]

Current relationship extraction studies use machine learning methods, which typically approach the task as a classification problem. [1] Never-Ending Language Learning (NELL) is a semantic machine learning system, developed by a research team at Carnegie Mellon University, that extracts relationships from the open web.
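
To make the classification framing concrete, the following is a minimal sketch using scikit-learn. It assumes a toy corpus in which each entity pair's sentence context is labeled with the relation that holds between the two entities; the marker scheme, features, and model are illustrative stand-ins, not the setup of any cited system.

```python
# Minimal sketch of relationship extraction framed as supervised
# classification (hypothetical toy data; real systems use far richer
# features or pretrained encoders).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each training instance is the sentence context around an entity pair,
# labeled with the relation that holds between the two entities.
train_contexts = [
    "<e1>BRCA1</e1> mutations are associated with <e2>breast cancer</e2>",
    "<e1>aspirin</e1> inhibits <e2>COX-2</e2>",
    "<e1>Paris</e1> is the capital of <e2>France</e2>",
    "<e1>insulin</e1> binds to the <e2>insulin receptor</e2>",
]
train_labels = ["gene-disease", "drug-target", "capital-of", "protein-protein"]

# Bag-of-words features over the context; a linear classifier predicts
# one relation label per entity-pair mention.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_contexts, train_labels)

print(model.predict(["<e1>TP53</e1> mutations are associated with <e2>cancer</e2>"]))
```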

Approaches

There are several methods for extracting relationships, including text-based relationship extraction. These methods rely on pretrained relationship-structure information, or they learn that structure in order to reveal relationships. [5] Another approach uses domain ontologies. [6] [7] Yet another approach relies on the visual detection of meaningful relationships in the parametric values of objects listed in a data table, whose positions shift as the table is automatically permuted under the software user's control. The poor coverage, rarity, and development cost of structured resources such as semantic lexicons (e.g. WordNet, UMLS) and domain ontologies (e.g. the Gene Ontology) have given rise to new approaches based on broad, dynamic background knowledge from the Web. For instance, the ARCHILES technique [8] uses only Wikipedia and search-engine page counts to acquire coarse-grained relations for constructing lightweight ontologies.
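
As an illustration of the text-based, pattern-driven style of extraction, here is a minimal sketch that matches hand-written lexico-syntactic patterns against sentences. The patterns are hypothetical stand-ins for the learned or curated relationship structures that real systems use.

```python
# Minimal sketch of pattern-based relation extraction using
# hand-written lexico-syntactic patterns (hypothetical patterns;
# production systems learn or mine such templates).
import re

PATTERNS = [
    (re.compile(r"(\w[\w ]*?) is caused by (\w[\w ]*)"), "caused-by"),
    (re.compile(r"(\w[\w ]*?) interacts with (\w[\w ]*)"), "interacts-with"),
    (re.compile(r"(\w[\w ]*?) is a (\w[\w ]*)"), "is-a"),
]

def extract_relations(sentence):
    """Return (subject, relation, object) triples matched in the sentence."""
    triples = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(sentence):
            triples.append((match.group(1).strip(), relation, match.group(2).strip()))
    return triples

print(extract_relations("Scurvy is caused by vitamin C deficiency"))
# [('Scurvy', 'caused-by', 'vitamin C deficiency')]
```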

The extracted relationships can be represented using a variety of formalisms and languages. One such representation language for data on the Web is RDF (Resource Description Framework).
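
As a sketch of how an extracted relation might be serialized, the following uses the rdflib Python library to encode a single triple in an example namespace; the namespace and property name are illustrative.

```python
# Sketch of representing an extracted relation as an RDF triple with
# rdflib (the namespace and relation name are illustrative).
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)
# Triple: (subject, predicate, object) for the extracted relation
# (Paris, capitalOf, France).
g.add((EX["Paris"], EX["capitalOf"], EX["France"]))

print(g.serialize(format="turtle"))
```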

More recently, end-to-end systems that jointly learn to extract entity mentions and their semantic relations have been proposed, with strong potential to obtain high performance. [9]
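
As a rough illustration of the biaffine scoring idea used in such end-to-end systems, [9] the following PyTorch sketch scores every relation label for an ordered pair of entity-span representations. The dimensions and layer composition are illustrative, not the published architecture.

```python
# Toy PyTorch sketch of biaffine scoring for relation extraction in the
# spirit of [9]; dimensions and layer names are illustrative.
import torch
import torch.nn as nn

class BiaffineRelationScorer(nn.Module):
    def __init__(self, hidden_dim, num_relations):
        super().__init__()
        # One bilinear form per relation type, plus a linear term over
        # the concatenated pair representation.
        self.bilinear = nn.Bilinear(hidden_dim, hidden_dim, num_relations)
        self.linear = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, head, tail):
        # head, tail: (batch, hidden_dim) encoder states of two entity spans.
        # Returns one score per relation label for each ordered pair.
        return self.bilinear(head, tail) + self.linear(torch.cat([head, tail], dim=-1))

scorer = BiaffineRelationScorer(hidden_dim=128, num_relations=5)
head = torch.randn(4, 128)  # hypothetical span representations
tail = torch.randn(4, 128)
print(scorer(head, tail).shape)  # torch.Size([4, 5])
```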

Most of the reported systems have demonstrated their approach on English datasets. However, data and systems have been described for other languages, e.g., Russian [10] and Vietnamese. [11]

Datasets

Researchers have constructed multiple datasets for benchmarking relationship extraction methods. [12] One such dataset is DocRED, a document-level relationship extraction dataset released in 2019. It uses relations from Wikidata and text from the English Wikipedia. [12] The dataset has been used by other researchers, and a prediction competition has been set up on CodaLab. [13] [14]
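
The following sketch reads the relation instances from one document in a DocRED-style JSON file. The field names (sents, vertexSet, labels) follow the published DocRED format, while the file path is a hypothetical local copy of the dataset.

```python
# Sketch of reading one document from a DocRED-style JSON file; field
# names follow the published DocRED format, the path is hypothetical.
import json

with open("train_annotated.json") as f:   # hypothetical local copy
    documents = json.load(f)

doc = documents[0]
sentences = doc["sents"]          # list of tokenized sentences
entities = doc["vertexSet"]       # entity clusters; each mention has sent_id/pos
for label in doc["labels"]:       # document-level relation instances
    head = entities[label["h"]][0]["name"]
    tail = entities[label["t"]][0]["name"]
    print(head, label["r"], tail) # relation "r" is a Wikidata property id, e.g. "P17"
```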

Related Research Articles

<span class="mw-page-title-main">Semantic network</span> Knowledge base that represents semantic relations between concepts in a network

A semantic network, or frame network, is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between concepts, mapping or connecting semantic fields. A semantic network may be instantiated as, for example, a graph database or a concept map. Typical standardized semantic networks are expressed as semantic triples.

In computer science and information science, an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject.

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases, this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video, and documents, can also be seen as information extraction.

An annotation is extra information associated with a particular point in a document or other piece of information. It can be a note that includes a comment or explanation. Annotations are sometimes presented in the margin of book pages. For annotations of different digital media, see web annotation and text annotation.

Relational data mining is the data mining technique for relational databases. Unlike traditional data mining algorithms, which look for patterns in a single table, relational data mining algorithms look for patterns among multiple tables (relational patterns). For most types of propositional patterns, there are corresponding relational patterns. For example, there are relational classification rules, relational regression trees, and relational association rules.

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".

Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.

Semantic integration is the process of interrelating information from diverse sources, for example calendars and to do lists, email archives, presence information, documents of all sorts, contacts, search results, and advertising and marketing relevance derived from them. In this regard, semantics focuses on the organization of and action upon information by acting as an intermediary between heterogeneous data sources, which may conflict not only by structure but also context or value.

<span class="mw-page-title-main">Ontology learning</span> Automatic creation of ontologies

Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.

<span class="mw-page-title-main">DBpedia</span> Online database project

DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

Coupled Pattern Learner (CPL) is a machine learning algorithm which couples the semi-supervised learning of categories and relations to forestall the problem of semantic drift associated with bootstrap learning methods.

<span class="mw-page-title-main">BabelNet</span> Multilingual semantic network and encyclopedic dictionary

BabelNet is a multilingual lexicalized semantic network and ontology developed at the NLP group of the Sapienza University of Rome. BabelNet was automatically created by linking Wikipedia to the most popular computational lexicon of the English language, WordNet. The integration is done using an automatic mapping and by filling in lexical gaps in resource-poor languages by using statistical machine translation. The result is an encyclopedic dictionary that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations. Additional lexicalizations and definitions are added by linking to free-license wordnets, OmegaWiki, the English Wiktionary, Wikidata, FrameNet, VerbNet and others. Similarly to WordNet, BabelNet groups words in different languages into sets of synonyms, called Babel synsets. For each Babel synset, BabelNet provides short definitions in many languages harvested from both WordNet and Wikipedia.

<span class="mw-page-title-main">Entity linking</span>

In natural language processing, entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD), or named-entity normalization (NEN), is the task of assigning a unique identity to entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". Entity linking is different from named-entity recognition (NER) in that NER identifies the occurrence of a named entity in text but does not identify which specific entity it is.

<span class="mw-page-title-main">Word2vec</span> Models used to produce word embeddings

Word2vec is a technique for natural language processing (NLP) published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that they capture the semantic and syntactic qualities of words; as such, a simple mathematical function can indicate the level of semantic similarity between the words represented by those vectors.
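
A minimal gensim sketch of this idea, training word2vec on a toy corpus and querying the similarity of two word vectors (the corpus and hyperparameters are illustrative, not meaningful training settings):

```python
# Minimal gensim sketch: train word2vec on a toy corpus and query
# vector similarity (corpus and hyperparameters are illustrative).
from gensim.models import Word2Vec

sentences = [
    ["the", "car", "drove", "down", "the", "road"],
    ["the", "bus", "drove", "down", "the", "road"],
    ["she", "parked", "the", "car"],
    ["he", "parked", "the", "bus"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["car"].shape)              # each word maps to a 50-dim vector
print(model.wv.similarity("car", "bus"))  # cosine similarity of the two vectors
```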

Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Applications of semantic parsing include machine translation, question answering, ontology induction, automated reasoning, and code generation. The phrase was first used in the 1970s by Yorick Wilks as the basis for machine translation programs working with only semantic representations.

<span class="mw-page-title-main">Knowledge graph</span> Type of knowledge base

In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a graph-structured data model or topology to integrate data. Knowledge graphs are often used to store interlinked descriptions of entities – objects, events, situations or abstract concepts – while also encoding the semantics underlying the used terminology.

Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs. The biocuration of biomedical knowledge is made possible by the cooperative work of biocurators, software developers and bioinformaticians and is at the base of the work of biological databases.

<span class="mw-page-title-main">Knowledge graph embedding</span> Dimensionality reduction of graph-based semantic data objects [machine learning task]

In representation learning, knowledge graph embedding (KGE), also referred to as knowledge representation learning (KRL), or multi-relation learning, is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning. Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction.
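
As one concrete instance of such an embedding, the following numpy sketch uses the well-known TransE scoring function, under which a triple (h, r, t) is plausible when the vectors satisfy h + r ≈ t. The vectors here are random stand-ins rather than trained embeddings.

```python
# Tiny numpy sketch of the TransE scoring idea behind many knowledge
# graph embeddings: a triple (h, r, t) is plausible when h + r ≈ t
# (vectors here are random stand-ins, not trained embeddings).
import numpy as np

rng = np.random.default_rng(0)
dim = 20
head = rng.normal(size=dim)      # embedding of the head entity
relation = rng.normal(size=dim)  # embedding of the relation
tail = head + relation + rng.normal(scale=0.01, size=dim)  # near-perfect tail

def transe_score(h, r, t):
    """Lower distance means a more plausible triple under TransE."""
    return np.linalg.norm(h + r - t)

print(transe_score(head, relation, tail))                  # small: plausible
print(transe_score(head, relation, rng.normal(size=dim)))  # large: implausible
```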

References

  1. Ning, Huansheng (2019). Cyberspace Data and Intelligence, and Cyber-Living, Syndrome, and Health: International 2019 Cyberspace Congress, CyberDI and CyberLife, Beijing, China, December 16–18, 2019, Proceedings, Part II. Singapore: Springer Nature. p. 260. ISBN 978-981-15-1924-6.
  2. Nasar, Zara; Jaffry, Syed Waqar; Malik, Muhammad Kamran (2021-02-11). "Named Entity Recognition and Relation Extraction: State-of-the-Art". ACM Computing Surveys. 54 (1): 20:1–20:39. doi:10.1145/3445965. ISSN 0360-0300. S2CID 233353895.
  3. Hong-Woo Chun; Yoshimasa Tsuruoka; Jin-Dong Kim; Rie Shiba; Naoki Nagata; Teruyoshi Hishiki; Jun-ichi Tsujii (2006). "Extraction of Gene-Disease Relations from Medline Using Domain Dictionaries and Machine Learning". Pacific Symposium on Biocomputing. CiteSeerX 10.1.1.105.9656.
  4. Minlie Huang; Xiaoyan Zhu; Yu Hao; Donald G. Payan; Kunbin Qu; Ming Li (2004). "Discovering patterns to extract protein-protein interactions from full texts". Bioinformatics. 20 (18): 3604–3612. doi:10.1093/bioinformatics/bth451. PMID 15284092.
  5. Tickoo, Omesh; Iyer, Ravi (2016). Making Sense of Sensors: End-to-End Algorithms and Infrastructure Design from Wearable-Devices to Data Centers. Portland: Apress. p. 68. ISBN 978-1-4302-6592-4.
  6. T. C. Rindflesch; L. Tanabe; J. N. Weinstein; L. Hunter (2000). "EDGAR: Extraction of drugs, genes, and relations from the biomedical literature". Proc. Pacific Symposium on Biocomputing. pp. 514–525. PMC 2709525.
  7. C. Ramakrishnan; K. J. Kochut; A. P. Sheth (2006). "A Framework for Schema-Driven Relationship Discovery from Unstructured Text". Proc. International Semantic Web Conference. pp. 583–596.
  8. W. Wong; W. Liu; M. Bennamoun (2009). "Acquiring Semantic Relations using the Web for Constructing Lightweight Ontologies". Proc. 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). doi:10.1007/978-3-642-01307-2_26.
  9. Dat Quoc Nguyen; Karin Verspoor (2019). "End-to-end neural relation extraction using deep biaffine attention". Proceedings of the 41st European Conference on Information Retrieval (ECIR). arXiv:1812.11275. doi:10.1007/978-3-030-15712-8_47.
  10. Elena Bruches; Alexey Pauls; Tatiana Batura; Vladimir Isachenko (14 December 2020). "Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian". arXiv:2011.09817. Wikidata Q104419957.
  11. Pham Quang Nhat Minh (18 December 2020). "An Empirical Study of Using Pre-trained BERT Models for Vietnamese Relation Extraction Task at VLSP 2020". arXiv:2012.10275. ISSN 2331-8422. Wikidata Q104418048.
  12. Yuan Yao; Deming Ye; Peng Li; et al. (2019). "DocRED: A Large-Scale Document-Level Relation Extraction Dataset". Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: 764–777. arXiv:1906.06127. doi:10.18653/V1/P19-1074. Wikidata Q104419388.
  13. Wang Xu; Kehai Chen; Tiejun Zhao (21 December 2020). "Document-Level Relation Extraction with Reconstruction". arXiv:2012.11384. ISSN 2331-8422. Wikidata Q104417795.
  14. "DocRED Competition". CodaLab.