Information extraction


Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). [1] Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video, and documents, can be seen as information extraction.


Recent advances in NLP techniques have significantly improved performance compared to earlier approaches. [2] An example is the extraction from newswire reports of corporate mergers, such as denoted by the formal relation:

MergerBetween(company1, company2, date),

from an online news sentence such as:

"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow automated reasoning about the logical form of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.
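As a minimal illustration of these goals using the merger example above, the following Python sketch (a hand-written pattern, not any particular system's method) pulls the acquirer and the acquired company out of the sample sentence and produces a structured record; the pattern, the field names, and the handling of the date are assumptions made only for this sketch.

import re

# Sample newswire sentence from the example above.
sentence = "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

# A hand-written pattern of the form "<Acquirer> announced their acquisition of <Acquired>".
# The crude company-name sub-pattern (runs of capitalized words) is a simplifying assumption.
pattern = re.compile(
    r"(?P<acquirer>[A-Z][\w.]*(?: [A-Z][\w.]*)*) announced (?:their|its) acquisition of "
    r"(?P<acquired>[A-Z][\w.]*(?: [A-Z][\w.]*)*)"
)

match = pattern.search(sentence)
if match:
    # A structured record corresponding to the MergerBetween(company1, company2, date) relation.
    record = {
        "company1": match.group("acquirer"),
        "company2": match.group("acquired"),
        "date": "yesterday",  # a real system would resolve this relative date to a calendar date
    }
    print(record)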

Information extraction is part of a greater puzzle which deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display. The discipline of information retrieval (IR) [3] has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. Another complementary approach is that of natural language processing (NLP), which has modelled human language processing with considerable success when taking into account the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks in between those of IR and NLP.

In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differs in the details. As an example, consider a group of newswire articles on Latin American terrorism, with each article presumed to be based upon one or more terrorist acts. For any given IE task, we also define a template, which is a case frame (or a set of case frames) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem is required to "understand" an attack article only enough to find data corresponding to the slots in this template.
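Such a template can be sketched as a simple case frame; the Python dataclass below only illustrates the slot structure described above, with field names taken from the terrorism example and no claim about how the slots get filled.

from dataclasses import dataclass
from typing import Optional

# Case frame ("template") for the terrorism example: one instance per extracted event.
@dataclass
class TerrorismEvent:
    perpetrator: Optional[str] = None
    victim: Optional[str] = None
    weapon: Optional[str] = None
    date: Optional[str] = None

# An IE system reading one attack article would fill whichever slots the text supports
# and leave the remaining slots empty.
event = TerrorismEvent()
print(event)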

History

Information extraction dates back to the late 1970s in the early days of NLP. [4] An early commercial system from the mid-1980s was JASPER, built for Reuters by the Carnegie Group Inc. with the aim of providing real-time financial news to financial traders. [5]

Beginning in 1987, IE was spurred by a series of Message Understanding Conferences (MUC), a competition-based conference series [6] that focused on a succession of specific extraction domains.

Considerable support came from the U.S. Defense Advanced Research Projects Agency (DARPA), which wished to automate mundane tasks performed by government analysts, such as scanning newspapers for possible links to terrorism.[citation needed]

Present significance

The present significance of IE pertains to the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents [7] and advocates that more of the content be made available as a web of data. [8] Until this transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge contained within these documents can be made more accessible for machine processing by transforming it into relational form, or by marking it up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted. [9]
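The "scan documents and populate a database" application can be sketched with the Python standard library alone; the example below reuses the hand-written merger pattern from the earlier sketch, and the table and column names are assumptions made for illustration.

import re
import sqlite3

documents = [
    "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp.",
    # further documents would come from files or a news feed
]

pattern = re.compile(
    r"(?P<acquirer>[A-Z][\w.]*(?: [A-Z][\w.]*)*) announced (?:their|its) acquisition of "
    r"(?P<acquired>[A-Z][\w.]*(?: [A-Z][\w.]*)*)"
)

conn = sqlite3.connect(":memory:")  # an on-disk path would be used in practice
conn.execute("CREATE TABLE mergers (acquirer TEXT, acquired TEXT, source TEXT)")

# Scan every document and store each extracted relation as one database row.
for doc in documents:
    for m in pattern.finditer(doc):
        conn.execute(
            "INSERT INTO mergers VALUES (?, ?, ?)",
            (m.group("acquirer"), m.group("acquired"), doc),
        )
conn.commit()

print(conn.execute("SELECT acquirer, acquired FROM mergers").fetchall())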

Tasks and subtasks

Applying information extraction to text is linked to the problem of text simplification in order to create a structured view of the information present in free text. The overall goal is to create text that is more easily machine-readable for processing the sentences. Typical IE tasks and subtasks include:

This list is not exhaustive, the exact scope of IE activities is not commonly agreed upon, and many approaches combine multiple IE sub-tasks in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.
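For instance, named-entity recognition, one of the most common IE sub-tasks, can be run with an off-the-shelf NLP library. The sketch below assumes spaCy and its small English model (en_core_web_sm) are installed; it is meant only to show the kind of structured output such a sub-task produces, not to endorse a particular tool.

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp.")

# Each recognized entity comes with its character span and a predicted type (ORG, GPE, DATE, ...).
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)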

IE on non-text documents is becoming an increasingly interesting topic[when?] in research, and information extracted from multimedia documents can now[when?] be expressed in a high-level structure, as is done for text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.

World Wide Web applications

IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for IE systems that help people cope with the enormous amount of data available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and the layout formats available in online texts. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically.

Wrappers typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent effort on adaptive information extraction motivates the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured texts.
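As a sketch of what a wrapper amounts to, the rule below targets one assumed page layout: a product catalog whose items carry fixed markup around each field. The HTML snippet, class names, and fields are invented for illustration; the rule is accurate on this layout and useless on any other, which is exactly the brittleness described above.

import re

# One page from an assumed product catalog; the layout is invented for this sketch.
page = """
<ul>
  <li class="product"><span class="name">Widget A</span><span class="price">$9.99</span></li>
  <li class="product"><span class="name">Widget B</span><span class="price">$14.50</span></li>
</ul>
"""

# A wrapper rule in the classic left/right-delimiter style: take the text that sits
# between fixed markup strings.
rule = re.compile(
    r'<span class="name">(?P<name>.*?)</span><span class="price">(?P<price>.*?)</span>'
)

for m in rule.finditer(page):
    print({"name": m.group("name"), "price": m.group("price")})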

A recent[when?] development is Visual Information Extraction, [16] [17] which relies on rendering a webpage in a browser and creating rules based on the proximity of regions in the rendered page. This helps in extracting entities from complex web pages that may exhibit a visual pattern but lack a discernible pattern in the HTML source code.
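The proximity idea can be sketched without a browser. Given text regions with their rendered positions (hard-coded below; a real system would obtain them by rendering the page, e.g. in a headless browser), regions that sit close together vertically are grouped into one record. The coordinates, threshold, and region texts are assumptions made for illustration.

# Each region: (text, x, y), approximating where the text lands on the rendered page.
regions = [
    ("Foo Inc.", 100, 210),
    ("$120M", 320, 212),
    ("Bar Corp.", 100, 260),
    ("$85M", 320, 258),
]

Y_THRESHOLD = 10  # regions within 10 pixels vertically are treated as the same visual row

rows = []
for text, x, y in sorted(regions, key=lambda r: r[2]):
    if rows and abs(rows[-1][-1][2] - y) <= Y_THRESHOLD:
        rows[-1].append((text, x, y))
    else:
        rows.append([(text, x, y)])

# Each visual row becomes one extracted record, read left to right.
for row in rows:
    print([text for text, x, y in sorted(row, key=lambda r: r[1])])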

Approaches

The following standard approaches are now widely accepted:

Numerous other approaches exist for IE, including hybrid approaches that combine some of the standard approaches previously listed.
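One common family of approaches, in contrast to the hand-written pattern shown earlier, casts extraction as token classification: each token gets a few simple features and a label saying whether it belongs to an entity of interest. The toy training data, the features, and the use of scikit-learn below are assumptions made for illustration, not a description of any particular system.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(tokens, i):
    # Minimal per-token features; real systems use much richer ones (context windows,
    # gazetteers, embeddings) and structured sequence models rather than per-token classifiers.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

# Toy training data: each token is labeled 1 if it is part of a company name, else 0.
train_sents = [
    (["Foo", "Inc.", "bought", "Bar", "Corp.", "yesterday"], [1, 1, 0, 1, 1, 0]),
    (["Shares", "of", "Baz", "Ltd.", "fell"], [0, 0, 1, 1, 0]),
]

X, y = [], []
for tokens, labels in train_sents:
    for i, label in enumerate(labels):
        X.append(features(tokens, i))
        y.append(label)

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(X, y)

test = ["Qux", "Inc.", "announced", "a", "merger"]
preds = model.predict([features(test, i) for i in range(len(test))])
print([(tok, int(label)) for tok, label in zip(test, preds)])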

Free or open source software and services

See also


Related Research Articles

<span class="mw-page-title-main">Natural language processing</span> Field of linguistics and computer science

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.

Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated in documents.

Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.

Noisy text analytics is a process of information extraction whose goal is to automatically extract structured or semistructured information from noisy unstructured text data. While text analytics is a growing and mature field that has great value because of the huge amounts of data being produced, processing of noisy text is gaining in importance because many common applications produce noisy text data. Noisy unstructured text data is found in informal settings such as online chat, text messages, e-mails, message boards, newsgroups, blogs, wikis and web pages. Also, text produced by processing spontaneous speech using automatic speech recognition and printed or handwritten text using optical character recognition contains processing noise. Text produced under such circumstances is typically highly noisy, containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing letter case information, pause-filling words such as “um” and “uh”, and other texting and speech disfluencies. Such text can be seen in large amounts in contact centers, chat rooms, optical character recognition (OCR) of text documents, short message service (SMS) text, etc. Documents with historical language can also be considered noisy with respect to today's knowledge about the language. Such text contains important historical, religious and ancient medical knowledge that is useful. The nature of the noisy text produced in all these contexts warrants moving beyond traditional text analysis techniques.

<span class="mw-page-title-main">General Architecture for Text Engineering</span>

General Architecture for Text Engineering or GATE is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many natural language processing tasks, including information extraction in many languages.

A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE additionally requires the removal of repeated relations (disambiguation) and generally refers to the extraction of many different relationships.

Data extraction is the act or process of retrieving data out of data sources for further data processing or data storage. The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.

In information extraction, a named entity is a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include Barack Obama, New York City, Volkswagen Golf, or anything else that can be named. Named entities can simply be viewed as entity instances.

<span class="mw-page-title-main">Apache OpenNLP</span> Open-source natural language processing library

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as language detection, tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing and coreference resolution. These tasks are usually required to build more advanced text processing services.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

<span class="mw-page-title-main">Apache cTAKES</span> Natural language processing system

Apache cTAKES: clinical Text Analysis and Knowledge Extraction System is an open-source Natural Language Processing (NLP) system that extracts clinical information from electronic health record unstructured text. It processes clinical notes, identifying types of clinical named entities — drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context, and negated/not negated.

The outline of natural language processing is provided as an overview of and topical guide to natural-language processing.

<span class="mw-page-title-main">Entity linking</span> Concept in Natural Language Processing

In natural language processing, entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN), is the task of assigning a unique identity to entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". Entity linking is different from named-entity recognition (NER) in that NER identifies the occurrence of a named entity in text but does not identify which specific entity it is.

NetOwl is a suite of multilingual text and identity analytics products that analyze big data in the form of text data – reports, web, social media, etc. – as well as structured entity data about people, organizations, places, and things.

Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages. The library is built on top of Apache Spark and its Spark ML library.

Document AI or Document Intelligence is a technology that uses natural language processing (NLP) and machine learning (ML) to train computer models to simulate a human review of documents.

References

  1. Kariampuzha, William; Alyea, Gioconda; Qu, Sue; Sanjak, Jaleal; Mathé, Ewy; Sid, Eric; Chatelaine, Haley; Yadaw, Arjun; Xu, Yanji; Zhu, Qian (2023). "Precision information extraction for rare disease epidemiology at scale". Journal of Translational Medicine. 21 (1): 157. doi:10.1186/s12967-023-04011-y. PMC 9972634. PMID 36855134.
  2. Christina Niklaus, Matthias Cetto, André Freitas, and Siegfried Handschuh. 2018. A Survey on Open Information Extraction. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3866–3878, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  3. Freitag, Dayne (2000). "Machine Learning for Information Extraction in Informal Domains" (PDF). Kluwer Academic Publishers. Printed in the Netherlands.
  4. Cowie, Jim; Wilks, Yorick (1996). Information Extraction (PDF). p. 3. CiteSeerX 10.1.1.61.6480. S2CID 10237124. Archived from the original (PDF) on 2019-02-20.
  5. Andersen, Peggy M.; Hayes, Philip J.; Huettner, Alison K.; Schmandt, Linda M.; Nirenburg, Irene B.; Weinstein, Steven P. (1992). "Automatic Extraction of Facts from Press Releases to Generate News Stories". Proceedings of the Third Conference on Applied Natural Language Processing. pp. 170–177. CiteSeerX 10.1.1.14.7943. doi:10.3115/974499.974531. S2CID 14746386.
  6. Marco Costantino, Paolo Coletti, Information Extraction in Finance, Wit Press, 2008. ISBN 978-1-84564-146-7.
  7. "Linked Data - The Story So Far" (PDF).
  8. "Tim Berners-Lee on the next Web". Archived from the original on 2011-04-10. Retrieved 2010-03-27.
  9. R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33–69.[dead link]
  10. Dat Quoc Nguyen and Karin Verspoor (2019). "End-to-end neural relation extraction using deep biaffine attention". Proceedings of the 41st European Conference on Information Retrieval (ECIR). arXiv:1812.11275. doi:10.1007/978-3-030-15712-8_47.
  11. Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). "A framework for information extraction from tables in biomedical literature". International Journal on Document Analysis and Recognition. 22 (1): 55–78. arXiv:1902.10031. Bibcode:2019arXiv190210031M. doi:10.1007/s10032-019-00317-0. S2CID 62880746.
  12. Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in biomedical documents (PDF) (PhD). University of Manchester.
  13. Milosevic N, Gregson C, Hernandez R, Nenadic G (June 2016). "Disentangling the Structure of Tables in Scientific Literature" (PDF). Natural Language Processing and Information Systems. Lecture Notes in Computer Science. Vol. 21. pp. 162–174. doi:10.1007/978-3-319-41754-7_14. ISBN 978-3-319-41753-0. S2CID 19538141.
  14. Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in biomedical documents (PDF) (PhD). University of Manchester.
  15. Zils, A.; Pachet, F.; Delerue, O.; Gouyon, F. (2002). "Automatic Extraction of Drum Tracks from Polyphonic Music Signals". Proceedings of WedelMusic, Darmstadt, Germany. Archived 2017-08-29 at the Wayback Machine.
  16. Chenthamarakshan, Vijil; Desphande, Prasad M; Krishnapuram, Raghu; Varadarajan, Ramakrishnan; Stolze, Knut (2015). "WYSIWYE: An Algebra for Expressing Spatial and Textual Rules for Information Extraction". arXiv:1506.08454 [cs.CL].
  17. Baumgartner, Robert; Flesca, Sergio; Gottlob, Georg (2001). "Visual Web Information Extraction with Lixto". pp. 119–128. CiteSeerX 10.1.1.21.8236.
  18. Peng, F.; McCallum, A. (2006). "Information extraction from research papers using conditional random fields". Information Processing & Management. 42 (4): 963. doi:10.1016/j.ipm.2005.09.002.
  19. Shimizu, Nobuyuki; Hass, Andrew (2006). "Extracting Frame-based Knowledge Representation from Route Instructions" (PDF). Archived from the original (PDF) on 2006-09-01. Retrieved 2010-03-27.