Table extraction

Table extraction is the process of recognizing and separating a table from a larger document, possibly also recognizing its individual rows, columns, or cells. It may be regarded as a special form of information extraction.

Table extraction from webpages can take advantage of the HTML elements that exist specifically for tables, e.g., the "table" tag, and programming libraries may implement table extraction from such markup. The Python pandas library, for example, can extract tables from HTML webpages via its read_html() function.
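
A minimal sketch of this approach (the HTML snippet and column names are purely illustrative, and pandas needs an HTML parser backend such as lxml installed):

    import pandas as pd
    from io import StringIO

    # A toy HTML document containing one table; in practice this would be a
    # downloaded webpage.
    html = StringIO("""
    <table>
      <tr><th>Country</th><th>Capital</th></tr>
      <tr><td>France</td><td>Paris</td></tr>
      <tr><td>Japan</td><td>Tokyo</td></tr>
    </table>
    """)

    # read_html() returns a list of DataFrames, one per <table> element found.
    tables = pd.read_html(html)
    print(tables[0])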

More challenging is table extraction from PDFs or scanned images, where there is usually no table-specific machine-readable markup.[1] Systems that extract data from tables in scientific PDFs have been described.[2][3]
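
For text-based (non-scanned) PDFs, one open-source option not named in this article is the pdfplumber library; a minimal sketch, assuming a file report.pdf with a table on its first page (scanned images would first need OCR):

    import pdfplumber  # third-party PDF parsing library

    # extract_tables() infers table structure from ruling lines and text
    # positions, so the result is heuristic and may need manual checking.
    with pdfplumber.open("report.pdf") as pdf:
        first_page = pdf.pages[0]
        for table in first_page.extract_tables():
            for row in table:  # each table is a list of rows (lists of cell strings)
                print(row)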

Wikipedia presents part of its information in tables; around 3.5 million tables can be extracted from the English Wikipedia.[4] Some of these tables have a specific format, e.g., the so-called infoboxes. Large-scale extraction of Wikipedia infoboxes is one of the sources for DBpedia.[5]
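
Because Wikipedia renders its tables (including infoboxes) as HTML tables, the same read_html() mechanism can be used to harvest them article by article; a hedged sketch, in which the article URL and User-Agent string are only examples:

    import requests
    import pandas as pd
    from io import StringIO

    # Download one article and parse every HTML table it contains.
    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
    html = requests.get(url, headers={"User-Agent": "table-extraction-demo"}).text
    tables = pd.read_html(StringIO(html))

    print(f"Found {len(tables)} tables in {url}")
    print(tables[0].head())  # the first table; on many articles this is the infobox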

Commercial web services for table extraction exist, e.g., Amazon Textract, Google's Document AI, IBM Watson Discovery, and Microsoft Form Recognizer.[1] Open-source tools also exist, e.g., PDFFigures 2.0, which has been used in Semantic Scholar.[6] In a comparison published in 2017, researchers found the proprietary program ABBYY FineReader to yield the best PDF table extraction performance among the six tools evaluated.[7] In a 2023 benchmark evaluation,[8] Adobe Extract,[9] a cloud-based API that employs Adobe's Sensei AI platform,[10] performed best for table extraction among the five tools evaluated.
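
As one illustration of calling such a service, a minimal sketch using Amazon Textract via the boto3 SDK; the filename is hypothetical, and AWS credentials, region configuration, and service charges are assumed:

    import boto3

    # Textract's synchronous AnalyzeDocument call accepts a single-page
    # image or PDF passed as raw bytes.
    with open("scanned_invoice.png", "rb") as f:
        document_bytes = f.read()

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["TABLES"],  # request table analysis specifically
    )

    # The response is a flat list of blocks; TABLE blocks point to CELL blocks
    # through their relationships, from which rows and columns can be rebuilt.
    tables = [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
    print(f"Detected {len(tables)} table(s)")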

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

PDF: Portable Document Format, a digital file format

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images and other information needed to display it. PDF has its roots in "The Camelot Project" initiated by Adobe co-founder John Warnock in 1991. PDF was standardized as ISO 32000 in 2008. The latest edition, ISO 32000-2:2020, was published in December 2020.

Semantic Web: Extension of the Web to facilitate data exchange

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

Wiktionary: Multilingual online dictionary

Wiktionary is a multilingual, web-based project to create a free content dictionary of terms in all natural languages and in a number of artificial languages. These entries may contain definitions, images for illustration, pronunciations, etymologies, inflections, usage examples, quotations, related terms, and translations of terms into other languages, among other features. It is collaboratively edited via a wiki. Its name is a portmanteau of the words wiki and dictionary. It is available in 192 languages and in Simple English. Like its sister project Wikipedia, Wiktionary is run by the Wikimedia Foundation, and is written collaboratively by volunteers, dubbed "Wiktionarians". Its wiki software, MediaWiki, allows almost anyone with access to the website to create and edit entries.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction.

Reference management software, citation management software, or bibliographic management software is software that stores a database of bibliographic records and produces bibliographic citations (references) for those records, needed in scholarly research. Once a record has been stored, it can be used time and again in generating bibliographies, such as lists of references in scholarly books and articles. Modern reference management applications can usually be integrated with word processors so that a reference list in one of the many different bibliographic formats required by publishers and scholarly journals is produced automatically as an article is written, reducing the risk that a cited source is not included in the reference list. They will also have a facility for importing bibliographic records from bibliographic databases.

A recommender system, or a recommendation system, is a subclass of information filtering system that provides suggestions for items that are most pertinent to a particular user. Recommender systems are particularly useful when an individual needs to choose an item from a potentially overwhelming number of items that a service may offer.

An annotation is extra information associated with a particular point in a document or other piece of information. It can be a note that includes a comment or explanation. Annotations are sometimes presented in the margin of book pages. For annotations of different digital media, see web annotation and text annotation.

Linked data: Structured data and method for its publication

In computing, linked data is structured data which is interlinked with other data so it becomes more useful through semantic queries. It builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages only for human readers, it extends them to share information in a way that can be read automatically by computers. Part of the vision of linked data is for the Internet to become a global database.

A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE additionally requires the removal of repeated relations (disambiguation) and generally refers to the extraction of many different relationships.

DBpedia: Online database project

DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web using OpenLink Virtuoso. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.

YAGO (database): Open-source information repository

YAGO is an open source knowledge base developed at the Max Planck Institute for Informatics in Saarbrücken. It is automatically extracted from Wikipedia and other sources.

Value sensitive design (VSD) is a theoretically grounded approach to the design of technology that accounts for human values in a principled and comprehensive manner. VSD originated within the field of information systems design and human-computer interaction to address design issues within the fields by emphasizing the ethical values of direct and indirect stakeholders. It was developed by Batya Friedman and Peter Kahn at the University of Washington starting in the late 1980s and early 1990s. Later, in 2019, Batya Friedman and David Hendry wrote a book on this topic called "Value Sensitive Design: Shaping Technology with Moral Imagination". Value Sensitive Design takes human values into account in a well-defined manner throughout the whole process. Designs are developed using an investigation consisting of three phases: conceptual, empirical and technological. These investigations are intended to be iterative, allowing the designer to modify the design continuously.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

Entity linking: Concept in natural language processing

In natural language processing, entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN), is the task of assigning a unique identity to entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". Entity linking is different from named-entity recognition (NER) in that NER identifies the occurrence of a named entity in text but does not identify which specific entity it is.

Infobox: Template used to collect and present a subset of information about a subject

An infobox is a digital or physical table used to collect and present a subset of information about its subject, such as a document. It is a structured document containing a set of attribute–value pairs, and in Wikipedia represents a summary of information about the subject of an article. In this way, they are comparable to data tables in some aspects. When presented within the larger document it summarizes, an infobox is often presented in a sidebar format.

Semantic Scholar is a research tool for scientific literature powered by artificial intelligence. It is developed at the Allen Institute for AI and was publicly released in November 2015. Semantic Scholar uses modern techniques in natural language processing to support the research process, for example by providing automatically generated summaries of scholarly papers. The Semantic Scholar team is actively researching the use of artificial intelligence in natural language processing, machine learning, human–computer interaction, and information retrieval.

UMBEL

UMBEL is a logically organized knowledge graph of 34,000 concepts and entity types that can be used in information science for relating information from disparate sources to one another. It was retired at the end of 2019. UMBEL was first released in July 2008. Version 1.00 was released in February 2011. Its current release is version 1.50.

Knowledge graph: Type of knowledge base

In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a graph-structured data model or topology to represent and operate on data. Knowledge graphs are often used to store interlinked descriptions of entities – objects, events, situations or abstract concepts – while also encoding the semantics or relationships underlying these entities.

Soufflé is an open source parallel logic programming language, influenced by Datalog. Soufflé includes both an interpreter and a compiler that targets parallel C++. Soufflé has been used to build static analyzers, disassemblers, and tools for binary reverse engineering. Soufflé is considered by academic researchers to be high-performance and "state of the art," and is often used in benchmarks in academic papers.

References

  1. Douglas Burdick; Marina Danilevsky; Alexandre V. Evfimievski; Yannis Katsis; Nancy Wang (August 2020). "Table extraction and understanding for scientific and enterprise applications". Proceedings of the VLDB Endowment. 13 (12): 3433–3436. doi:10.14778/3415478.3415563. ISSN 2150-8097. Wikidata Q108170445.
  2. Wenhao Yu; Wei Peng; Yu Shu; Qingkai Zeng; Meng Jiang (19 April 2020). Experimental Evidence Extraction System in Data Science with Hybrid Table Features and Ensemble Learning. pp. 951–961. doi:10.1145/3366423.3380174. ISBN 978-1-4503-7023-3. Wikidata Q108172460.
  3. Benno Kruit; Hongyu He; Jacopo Urbani (1 November 2020). Tab2Know: Building a Knowledge Base from Tables in Scientific Papers. Lecture Notes in Computer Science. pp. 349–365. arXiv:2107.13306. doi:10.1007/978-3-030-62419-4_20. ISBN 978-3-030-62419-4. Wikidata Q101086651.
  4. Tobias Bleifuß; Leon Bornemann; Dmitri V. Kalashnikov; Felix Naumann; Divesh Srivastava (17 August 2021). "The Secret Life of Wikipedia Tables" (PDF). Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores. CEUR Workshop Proceedings: 20–26. Wikidata Q108215401.
  5. Sören Auer; Christian Bizer; Georgi Kobilarov; Jens Lehmann; Richard Cyganiak; Zachary Ives (2007). DBpedia: A Nucleus for a Web of Open Data. Lecture Notes in Computer Science. pp. 722–735. doi:10.1007/978-3-540-76298-0_52. ISBN 978-3-540-76297-3. Wikidata Q27910422.
  6. Christopher Clark; Santosh Divvala (2016). PDFFigures 2.0: Mining figures from research papers. ISBN 978-1-4503-4229-2. Wikidata Q108172042.
  7. Andreiwid Sheffer Corrêa; Pär-Ola Zander (7 June 2017). Unleashing Tabular Content to Open Data: A Survey on PDF Table Extraction Methods and Tools. doi:10.1145/3085228.3085278. Wikidata Q108173686.
  8. Norman Meuschke; Apurva Jagdale; Timo Spinde; Jelena Mitrović; Bela Gipp (2023). "A Benchmark of PDF Information Extraction Tools Using a Multi-task and Multi-domain Evaluation Framework for Academic Documents". In Isaac Sserwanga; Anne Goulding; Heather Moulaison-Sandy; Jia Tina Du (eds.). Information for a Better World: Normality, Virtuality, Physicality, Inclusivity. Vol. 13972. Cham: Springer Nature Switzerland. pp. 383–405. doi:10.1007/978-3-031-28032-0_31. ISBN 978-3-031-28031-3.
  9. "Adobe PDF Extract API". Adobe. Retrieved 15 March 2024.
  10. "Experience Cloud AI Services with Adobe Sensei". Adobe. Retrieved 15 March 2024.