Open information extraction

In natural language processing, open information extraction (OIE) is the task of generating a structured, machine-readable representation of the information in text, usually in the form of triples or n-ary propositions.

Overview

A proposition can be understood as a truth-bearer, a textual expression of a potential fact (e.g., "Dante wrote the Divine Comedy"), represented in a structure amenable to computers [e.g., ("Dante", "wrote", "Divine Comedy")]. An OIE extraction normally consists of a relation and a set of arguments. For instance, ("Dante", "passed away in", "Ravenna") is a proposition formed by the relation "passed away in" and the arguments "Dante" and "Ravenna". The first argument is usually referred to as the subject, while the second is considered to be the object. [1]
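
As a minimal sketch, such a triple might be represented in code as follows; the `Extraction` dataclass and its field names are hypothetical, not part of any particular OIE system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Extraction:
    """An OIE proposition: a relation phrase plus its arguments."""
    subject: str   # first argument, e.g. "Dante"
    relation: str  # relation phrase, e.g. "passed away in"
    object: str    # second argument, e.g. "Ravenna"

e = Extraction("Dante", "passed away in", "Ravenna")
print((e.subject, e.relation, e.object))
```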

The extraction is said to be a textual representation of a potential fact because its elements are not linked to a knowledge base and because the factual nature of the proposition has not yet been established. In the above example, transforming the extraction into a full-fledged fact would first require linking, if possible, the relation and the arguments to a knowledge base. Second, the truth of the extraction would need to be determined. In computer science, transforming OIE extractions into ontological facts is known as relation extraction.

In fact, OIE can be seen as the first step in a wide range of deeper text understanding tasks such as relation extraction, knowledge-base construction, question answering, and semantic role labeling. The extracted propositions can also be used directly in end-user applications such as structured search (e.g., retrieve all propositions with "Dante" as subject).
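
As a toy illustration of the structured-search use case, assuming extractions are stored as plain tuples:

```python
# A toy store of extracted triples (subject, relation, object).
triples = [
    ("Dante", "wrote", "Divine Comedy"),
    ("Dante", "passed away in", "Ravenna"),
    ("Faust", "made a pact with", "the devil"),
]

# Structured search: retrieve all propositions with "Dante" as subject.
print([t for t in triples if t[0] == "Dante"])
```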

OIE was first introduced by TextRunner, [2] developed at the University of Washington Turing Center headed by Oren Etzioni. Later methods such as Reverb, [3] OLLIE, [4] ClausIE [5] and CSD [6] helped to shape the OIE task by characterizing some of its aspects. At a high level, all of these approaches use a set of patterns to generate extractions; depending on the particular approach, these patterns are either hand-crafted or learned.
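
To convey the pattern-based flavor of these systems without reproducing any particular one, here is a sketch with a single hand-crafted lexical pattern; real systems match patterns over part-of-speech tags or parse trees, and the pattern, relation list, and `extract` helper below are purely illustrative:

```python
import re

# One hand-crafted pattern: "<subject> <relation phrase> <object>",
# with the relation phrase drawn from a small closed list.
PATTERN = re.compile(
    r"^(?P<subj>[A-Z][\w ]*?)\s+"
    r"(?P<rel>wrote|was born in|died in|passed away in)\s+"
    r"(?P<obj>.+?)\.?$"
)

def extract(sentence: str):
    m = PATTERN.match(sentence)
    return (m.group("subj"), m.group("rel"), m.group("obj")) if m else None

print(extract("Dante wrote the Divine Comedy."))
# ('Dante', 'wrote', 'the Divine Comedy')
```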

OIE systems and contributions

Reverb [3] argued for producing meaningful relations that capture the information in the input text more accurately. For instance, given the sentence "Faust made a pact with the devil", it would be erroneous to produce the extraction ("Faust", "made", "a pact"), since it is not adequately informative. A more precise extraction would be ("Faust", "made a pact with", "the devil"). Reverb also argued against the generation of overspecific relations.
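
A rough sketch of this idea follows, approximating Reverb's syntactic constraint on relation phrases (a verb, optionally followed by intervening words and ending in a preposition or particle) with coarse part-of-speech tags. It assumes spaCy with its `en_core_web_sm` model is installed; the `relation_span` helper is hypothetical, and the real system additionally applies a lexical constraint and a learned confidence function:

```python
import re
import spacy  # assumes spaCy and the en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")

def relation_span(sentence: str):
    """Approximate Reverb's relation-phrase shapes (V | V P | V W* P)
    by matching a regex over one coarse POS letter per token."""
    doc = nlp(sentence)
    tags = "".join(
        "V" if t.pos_ in ("VERB", "AUX")
        else "P" if t.pos_ in ("ADP", "PART")
        else "W" if t.pos_ in ("NOUN", "ADJ", "ADV", "DET", "PRON")
        else "O"
        for t in doc
    )
    m = re.search(r"V+(?:W*P)?", tags)  # extend past the verb to a preposition
    return doc[m.start():m.end()].text if m else None

print(relation_span("Faust made a pact with the devil."))
# expected: 'made a pact with' rather than the uninformative 'made'
```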

OLLIE [4] stressed two important aspects of OIE. First, it pointed to the lack of factuality of extracted propositions. For instance, in a sentence like "If John studies hard, he will pass the exam", it would be inaccurate to consider ("John", "will pass", "the exam") as a fact. Additionally, the authors indicated that an OIE system should be able to extract non-verb-mediated relations, which account for a significant portion of the information expressed in natural language text. For instance, in the sentence "Obama, the former US president, was born in Hawaii", an OIE system should be able to recognize the proposition ("Obama", "is", "former US president").
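
A minimal sketch of one non-verb-mediated pattern, apposition, using a dependency parse (again assuming spaCy and `en_core_web_sm`; the `appositive_extractions` helper is hypothetical, and OLLIE itself learns open pattern templates rather than using a single fixed rule):

```python
import spacy  # assumes spaCy and the en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")

def appositive_extractions(sentence: str):
    """Derive synthetic ("X", "is", "Y") propositions from appositives,
    one kind of non-verb-mediated relation."""
    doc = nlp(sentence)
    out = []
    for tok in doc:
        if tok.dep_ == "appos":  # appositive attached to the noun it renames
            subj = tok.head.text
            obj = " ".join(t.text for t in tok.subtree if not t.is_punct)
            out.append((subj, "is", obj))
    return out

print(appositive_extractions("Obama, the former US president, was born in Hawaii."))
# e.g. [('Obama', 'is', 'the former US president')], parser permitting
```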

ClausIE [5] introduced the connection between grammatical clauses, propositions, and OIE extractions. The authors stated that, as each grammatical clause expresses a proposition, every verb-mediated proposition can be identified solely by recognizing the set of clauses expressed in each sentence. This implies that correctly recognizing the set of propositions in an input sentence requires understanding its grammatical structure. The authors studied the case of English, which admits only seven clause types, so identifying each proposition requires defining only seven grammatical patterns.
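
The seven clause types in question are the standard ones of English grammar, distinguished by which complements the verb takes; the example sentences below are illustrative:

```python
# The seven basic English clause types (S = subject, V = verb,
# O = object, C = complement, A = adverbial).
CLAUSE_TYPES = {
    "SV":   "Albert Einstein died.",
    "SVA":  "Albert Einstein remained in Princeton.",
    "SVC":  "Albert Einstein is smart.",
    "SVO":  "Albert Einstein has won the Nobel Prize.",
    "SVOO": "The academy gave Albert Einstein the Nobel Prize.",
    "SVOA": "The doorman showed Albert Einstein to his office.",
    "SVOC": "Albert Einstein declared the meeting open.",
}
```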

The finding also established a separation between the recognition of propositions and their materialization. In a first step, propositions can be identified without any consideration of their final form, in a domain-independent and unsupervised way, mostly based on linguistic principles. In a second step, the information can be represented according to the requirements of the underlying application, without conditioning the identification phase.

Consider the sentence "Albert Einstein was born in Ulm and died in Princeton". The first step will recognize the two propositions ("Albert Einstein", "was born", "in Ulm") and ("Albert Einstein", "died", "in Princeton"). Once the information has been correctly identified, the propositions can take the particular form required by the underlying application [e.g., ("Albert Einstein", "was born in", "Ulm") and ("Albert Einstein", "died in", "Princeton")].
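
A minimal sketch of such a materialization step (the `materialize` helper and its input format are hypothetical):

```python
def materialize(subj: str, verb: str, adverbial: str):
    """Second step: reshape an identified proposition for the application,
    here by folding the adverbial's preposition into the relation phrase."""
    prep, _, rest = adverbial.partition(" ")
    return (subj, f"{verb} {prep}", rest)

print(materialize("Albert Einstein", "was born", "in Ulm"))
# ('Albert Einstein', 'was born in', 'Ulm')
print(materialize("Albert Einstein", "died", "in Princeton"))
# ('Albert Einstein', 'died in', 'Princeton')
```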

CSD [6] introduced the idea of minimality in OIE. It holds that computers can make better use of extractions expressed in a compact way, which is especially important in sentences with subordinate clauses. In these cases, CSD suggests generating nested extractions. For example, for the sentence "The Embassy said that 6,700 Americans were in Pakistan", CSD generates two extractions: [i] ("6,700 Americans", "were", "in Pakistan") and [ii] ("The Embassy", "said", "that [i]"). This is usually known as reification.
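
Extending the earlier hypothetical `Extraction` sketch, nesting can be represented by letting the object slot hold either a string or another extraction:

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Extraction:
    subject: str
    relation: str
    object: Union[str, "Extraction"]  # an argument may itself be an extraction

inner = Extraction("6,700 Americans", "were", "in Pakistan")  # [i]
outer = Extraction("The Embassy", "said", inner)              # [ii] reifies [i]
print(outer)
```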

References

  1. Del Corro, Luciano. "Methods for Open Information Extraction and Sense Disambiguation on Natural Language Text" (PDF).
  2. Banko, Michele; Cafarella, Michael; Soderland, Stephen; Broadhead, Matt; Etzioni, Oren (2007). "Open Information Extraction from the Web" (PDF). Conference on Artificial Intelligence.
  3. Fader, Anthony; Soderland, Stephen; Etzioni, Oren (2011). "Identifying Relations for Open Information Extraction" (PDF). EMNLP.
  4. Mausam; Schmitz, Michael; Soderland, Stephen; Bart, Robert; Etzioni, Oren (2012). "Open Language Learning for Information Extraction" (PDF). EMNLP.
  5. Del Corro, Luciano; Gemulla, Rainer (2013). "ClausIE: Clause-Based Open Information Extraction" (PDF). WWW.
  6. Bast, Hannah; Haussmann, Elmar (2013). "Open Information Extraction via Contextual Sentence Decomposition". ICSC.