Machine-readable data

Last updated

Machine-readable data, or computer-readable data, is data in a format that can be processed by a computer. Machine-readable data must be structured data. [1]

Attempts to create machine-readable data occurred as early as the 1960s. At the same time that seminal developments in machine-reading and natural-language processing were releasing (like Weizenbaum's ELIZA), people were anticipating the success of machine-readable functionality and attempting to create machine-readable documents. One such example was musicologist Nancy B. Reich's creation of a machine-readable catalog of composer William Jay Sydeman's works in 1966.

In the United States, the OPEN Government Data Act of 14 January 2019 defines machine-readable data as "data in a format that can be easily processed by a computer without human intervention while ensuring no semantic meaning is lost." The law directs U.S. federal agencies to publish public data in such a manner, [2] ensuring that "any public data asset of the agency is machine-readable". [3]

Machine-readable data may be classified into two groups: human-readable data that is marked up so that it can also be read by machines (e.g. microformats, RDFa, HTML), and data file formats intended principally for processing by machines (CSV, RDF, XML, JSON). These formats are only machine readable if the data contained within them is formally structured; exporting a CSV file from a badly structured spreadsheet does not meet the definition.

Machine readable is not synonymous with digitally accessible. A digitally accessible document may be online, making it easier for humans to access via computers, but its content is much harder to extract, transform, and process via computer programming logic if it is not machine-readable. [4]

Extensible Markup Language (XML) is designed to be both human- and machine-readable, and Extensible Stylesheet Language Transformation (XSLT) is used to improve the presentation of the data for human readability. For example, XSLT can be used to automatically render XML in Portable Document Format (PDF). Machine-readable data can be automatically transformed for human-readability but, generally speaking, the reverse is not true.

For purposes of implementation of the Government Performance and Results Act (GPRA) Modernization Act, the Office of Management and Budget (OMB) defines "machine readable format" as follows: "Format in a standard computer language (not English text) that can be read automatically by a web browser or computer system. (e.g.; xml). Traditional word processing documents and portable document format (PDF) files are easily read by humans but typically are difficult for machines to interpret. Other formats such as extensible markup language (XML), (JSON), or spreadsheets with header columns that can be exported as comma separated values (CSV) are machine readable formats. As HTML is a structural markup language, discreetly labeling parts of the document, computers are able to gather document components to assemble tables of contents, outlines, literature search bibliographies, etc. It is possible to make traditional word processing documents and other formats machine readable but the documents must include enhanced structural elements." [5]

See also

Related Research Articles

<span class="mw-page-title-main">Markup language</span> Modern system for annotating a document

Markuplanguage refers to a text-encoding system consisting of a set of symbols inserted in a text document to control its structure, formatting, or the relationship between its parts. Markup is often used to control the display of the document or to enrich its content to facilitating automated processing. A markup language is a set of rules governing what markup information may be included in a document and how it is combined with the content of the document in a way to facilitate use by humans and computer programs. The idea and terminology evolved from the "marking up" of paper manuscripts, which is traditionally written with a red pen or blue pencil on authors' manuscripts.

In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of object-oriented objects does not include any of their associated methods with which they were previously linked.

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

XSLT is a language originally designed for transforming XML documents into other XML documents, or other formats such as HTML for web pages, plain text or XSL Formatting Objects, which may subsequently be converted to other formats, such as PDF, PostScript and PNG. Support for JSON and plain-text transformation was added in later updates to the XSLT 1.0 specification.

DocBook is a semantic markup language for technical documentation. It was originally intended for writing technical documents related to computer hardware and software, but it can be used for any other sort of documentation.

YAML is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax which intentionally differs from Standard Generalized Markup Language (SGML). It uses both Python-style indentation to indicate nesting, and a more compact format that uses [...] for lists and {...} for maps thus JSON files are valid YAML 1.2.

<span class="mw-page-title-main">Human-readable medium</span> Presentation of data for humans to read

A human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans.

XSL-FO is a markup language for XML document formatting that is most often used to generate PDF files. XSL-FO is part of XSL, a set of W3C technologies designed for the transformation and formatting of XML data. The other parts of XSL are XSLT and XPath. Version 1.1 of XSL-FO was published in 2006.

An XML editor is a markup language editor with added functionality to facilitate the editing of XML. This can be done using a plain text editor, with all the code visible, but XML editors have added facilities like tag completion and menus and buttons for tasks that are common in XML editing, based on data supplied with document type definition (DTD) or the XML tree.

Data exchange is the process of taking data structured under a source schema and transforming it into a target schema, so that the target data is an accurate representation of the source data. Data exchange allows data to be shared between different computer programs.

A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free.

XQuery is a query and functional programming language that queries and transforms collections of structured and unstructured data, usually in the form of XML, text and with vendor-specific extensions for other data formats. The language is developed by the XML Query working group of the W3C. The work is closely coordinated with the development of XSLT by the XSL Working Group; the two groups share responsibility for XPath, which is a subset of XQuery.

<span class="mw-page-title-main">Sheetster</span>

Sheetster is a GPL open-source Web Spreadsheet and a Java Application Server created by Extentech Inc. The product was created for the enterprise and small and medium-sized businesses as an Open Source alternative to closed document management systems.

The Office Open XML file formats are a set of file formats that can be used to represent electronic office documents. There are formats for word processing documents, spreadsheets and presentations as well as specific formats for material such as mathematical formulae, graphics, bibliographies etc.

A machine-readable document is a document whose content can be readily processed by computers. Such documents are distinguished from machine-readable data by virtue of having sufficient structure to provide the necessary context to support the business processes for which they are created.

References

  1. "Machine readable". opendatahandbook.org. Retrieved 2019-07-22.
  2. "HR4174". stratml.us.
  3. "HR4174". stratml.us.
  4. "A Primer on Machine Readability for Online Documents and Data". Data.gov. 2012-09-24. Retrieved 2015-02-27.
  5. OMB Circular A-11, Part 6 Archived 2020-04-22 at the Wayback Machine , Preparation, Submission, and Execution of the Budget