Machine-readable document

Last updated October 22, 2024

A machine-readable document is a document whose content can be readily processed by computers. Such documents are distinguished from more general machine-readable data by virtue of having further structure to provide the necessary context to support the business processes for which they are created.

Definition

Data without context is meaningless and lacks the four essential characteristics of trustworthy business records specified in ISO 15489 Information and documentation – Records management:^[1]

Reliability
Authenticity
Integrity
Usability

The vast bulk of information is unstructured data and, from a business perspective, that means it is "immature", i.e., Level 1 (chaotic) of the Capability Maturity Model. Such immaturity fosters inefficiency, diminishes quality, and limits effectiveness. Unstructured information is also ill-suited for records management functions, provides inadequate evidence for legal purposes, drives up the cost of discovery in litigation, and makes access and usage needlessly cumbersome in routine, ongoing business processes.

There are at least four aspects to machine-readability:

First, words or phrases should be discretely delineated (tagged) so that computer software and/or hardware logic can be applied to them as individual conceptual elements.
Second, the semantics of each element should be specified so that computers can help human beings achieve a common understanding of their meanings and potential usages.
Third, if the relationships among the individual elements are also specified, computers can automatically apply inferences to them, thereby further relieving human beings of the burden of trying to understand them, particularly for purposes of inquiry, discovery, and analysis.
Fourth, if the structures of the documents in which the elements occur are also specified, human understanding is further enhanced and the data becomes more reliable for legal and business-quality purposes.
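
As a minimal sketch of these aspects, assume a small hypothetical XML document. The element names (`Plan`, `Goal`, `Objective`, `Name`) and the `id` and `supports` attributes are invented for illustration and do not follow any published schema. The sketch covers delineation, relationships, and structure; the semantics of each element would normally be defined in a separate schema or vocabulary, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are illustrative only.
PLAN_XML = """
<Plan>
  <Name>Records Modernization</Name>
  <Goal id="g1">
    <Name>Trustworthy records</Name>
    <Objective id="o1" supports="g1">
      <Name>Tag all new records</Name>
    </Objective>
  </Goal>
</Plan>
"""

def summarize(xml_text):
    """Return (goal name, objective name) pairs found in the document."""
    root = ET.fromstring(xml_text)
    pairs = []
    # Delineation: each Goal and Objective is a discretely tagged element.
    for goal in root.findall("Goal"):
        goal_name = goal.findtext("Name")
        # Relationships and structure: nesting plus the 'supports' attribute
        # record which objective belongs to which goal.
        for objective in goal.findall("Objective"):
            if objective.get("supports") == goal.get("id"):
                pairs.append((goal_name, objective.findtext("Name")))
    return pairs

print(summarize(PLAN_XML))
```

Because the elements are tagged and nested, the program can retrieve each goal and its objectives without guessing at the meaning of free text.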

As early as 1983, the U.S. Government Accountability Office (GAO) began emphasizing the benefits of machine-readable information.^[2] Even earlier, in 1981, GAO began reporting on the problem of inadequate record-keeping practices in the U.S. federal government.^[3] Such deficiencies are not unique to government, and advances in information technology mean that most information is now "born digital" and thus potentially far more easily managed by automated means.^[4] However, in testimony to Congress in 2010, GAO highlighted problems with managing electronic records, and as recently as 2015, GAO continued to report inadequacies in the performance of Executive Branch agencies in meeting records management requirements.^[5]^[6] Moreover, more than a decade after a major and formerly highly respected auditing firm, Arthur Andersen, met its demise due to a records destruction scandal, record-keeping practices became a central issue in the 2016 Presidential election.

On January 4, 2011, President Obama signed H.R. 2142, the Government Performance and Results Act (GPRA) Modernization Act of 2010 (GPRAMA), into law as P.L. 111-352. Section 10 of GPRAMA requires U.S. federal agencies to publish their strategic and performance plans and reports in searchable, machine-readable format.^[7] Additionally, in 2013, he issued Executive Order 13642, Making Open and Machine Readable the New Default for Government Information in general.^[8] On July 28, 2016, the Office of Management and Budget (OMB) followed up by including in the revised issuance of Circular A-130 direction for agencies to use open, machine-readable formats,^[9] and to publish "public information online in a manner that promotes analysis and reuse for the widest possible range of purposes",^[10] meaning that the information is both publicly accessible and machine-readable. On January 14, 2019, President Trump signed into law H.R. 4174,^[11] the OPEN Government Data Act (OGDA), which codifies in law the requirement for agencies to make their public data assets available in machine-readable format. On June 28, 2019, in Circular A-11,^[12] OMB expressed intent to begin complying with section 10 of GPRAMA.^[13]

In support of such policy direction, technological advancement is enabling more efficient and effective management and use of machine-readable electronic records. Document-oriented databases have been developed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Extensible Markup Language (XML) is a World Wide Web Consortium (W3C) Recommendation setting forth rules for encoding documents in a format that is both human-readable and machine-readable. Many XML editor tools have been developed, and most, if not all, major information technology applications support XML to greater or lesser degrees. The fact that XML itself is an open, standard, machine-readable format makes it relatively easy for application developers to add such support.

The W3C's accompanying XML Schema (XSD) Recommendation specifies how to formally describe the elements in an XML document. With respect to the specification of XML schemas, the Organization for the Advancement of Structured Information Standards (OASIS) is a leading standards-developing organization. However, many technical developers prefer to work with JSON, and to define the structure of JSON data for validation, documentation, and interaction control, JSON Schema was developed by the Internet Engineering Task Force (IETF).
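
The idea of validating a document against a declared structure can be sketched as follows. This is a hand-rolled check written only with the Python standard library; it is not an implementation of XSD or JSON Schema, and the `RECORD_SHAPE` description and its field names are hypothetical.

```python
import json

# Hypothetical schema-like description: field name -> required Python type.
# This is NOT JSON Schema; it only mimics the idea of declaring structure.
RECORD_SHAPE = {"title": str, "created": str, "pages": int}

def validate(document_text, shape):
    """Return a list of problems; an empty list means the document conforms."""
    data = json.loads(document_text)
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    problems = []
    for field, expected_type in shape.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems

good = '{"title": "Annual report", "created": "2019-06-28", "pages": 12}'
bad = '{"title": "Annual report", "pages": "twelve"}'

print(validate(good, RECORD_SHAPE))  # no problems reported
print(validate(bad, RECORD_SHAPE))   # one missing field, one wrong type
```

A real schema language expresses far more than this sketch, such as nesting, cardinality, and value constraints, but the principle is the same: the expected structure is stated once, and software checks each document against it.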

The Portable Document Format (PDF) is a file format used to present documents in a manner independent of application software, hardware, and operating systems. Each PDF file encapsulates a complete description of the presentation of the document, including the text, fonts, graphics, and other information needed to display it. PDF/A is an ISO-standardized version of PDF specialized for use in the archiving and long-term preservation of electronic documents. PDF/A-3 allows embedding of other file formats, including XML, into PDF/A-conforming documents, thus potentially providing the best of both human- and machine-readability. The W3C's XSL-FO (XSL Formatting Objects) markup language is commonly used to generate PDF files.

Metadata, data about data, can be used to organize electronic resources, provide digital identification, and support the archiving and preservation of resources. In well-structured, machine-readable electronic records, the content can be repurposed as both data and metadata. In the context of electronic record-keeping systems, the terms "management" and "metadata" are virtually synonymous. Given proper metadata, records management functions can be automated, thereby reducing the risk of spoliation of evidence and other fraudulent manipulations of records. Moreover, such records can be used to automate the process of auditing data maintained in databases, thereby reducing the risk of single points of failure associated with the Machiavellian concept of a single source of truth.
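
As a rough sketch of how metadata can drive automated records management, the following assumes a hypothetical record layout in which each record carries its creation date, a retention period in years, and a SHA-256 digest of its content. The field names and the disposal rule are assumptions made for illustration and are not taken from any records management standard.

```python
import hashlib
from datetime import date

def make_record(content, created, retention_years):
    """Bundle content with hypothetical management metadata."""
    return {
        "content": content,
        "metadata": {
            "created": created,
            "retention_years": retention_years,
            # Digest recorded at capture time, used later to detect alteration.
            "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        },
    }

def is_intact(record):
    """True if the content still matches the digest stored in its metadata."""
    digest = hashlib.sha256(record["content"].encode("utf-8")).hexdigest()
    return digest == record["metadata"]["sha256"]

def eligible_for_disposal(record, today):
    """True once the retention period stated in the metadata has elapsed."""
    meta = record["metadata"]
    created = meta["created"]
    target_year = created.year + meta["retention_years"]
    try:
        expiry = created.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        expiry = created.replace(year=target_year, day=28)
    return today >= expiry

record = make_record("Strategic plan text", date(2011, 1, 4), 7)
print(is_intact(record))
print(eligible_for_disposal(record, date(2016, 9, 8)))

record["content"] = "Altered text"  # simulate tampering after capture
print(is_intact(record))
```

Both decisions, whether the record is unaltered and whether it may be disposed of, are made from the metadata alone, which is what allows them to be automated.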

Blockchains allow the creation and maintenance of continuously growing lists of records that are secured against tampering and revision. A key feature is that every node in a decentralized system has a copy of the blockchain, so there is no single point of failure subject to manipulation and fraud.
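
The tamper-evidence property can be sketched with a simple hash chain. This is a single-process illustration under simplifying assumptions: it omits the decentralization, replication, and consensus mechanisms that real blockchains rely on, and the block layout and function names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first block

def block_hash(index, data, previous_hash):
    """Hash a block's fields in a stable (sorted-key) JSON encoding."""
    payload = json.dumps(
        {"index": index, "data": data, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_block(chain, data):
    """Append a record whose hash also covers the previous block's hash."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "data": data,
        "previous_hash": previous_hash,
        "hash": block_hash(index, data, previous_hash),
    })

def is_valid(chain):
    """Recompute every hash and check each link to its predecessor."""
    previous_hash = GENESIS_HASH
    for index, block in enumerate(chain):
        if block["index"] != index or block["previous_hash"] != previous_hash:
            return False
        expected = block_hash(block["index"], block["data"], block["previous_hash"])
        if block["hash"] != expected:
            return False
        previous_hash = block["hash"]
    return True

ledger = []
for entry in ["record created", "record amended", "record archived"]:
    append_block(ledger, entry)
print(is_valid(ledger))

ledger[1]["data"] = "record destroyed"  # simulate a revision of history
print(is_valid(ledger))
```

Because each block's hash covers the hash of the block before it, changing any earlier record invalidates the chain from that point on unless every later hash is recomputed, which copies held by other nodes would expose.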


References

[1] "NARA Guidance on Managing Web Records". National Archives. August 15, 2016.
[2] "Better Use Of Information Technology Can Reduce The Burden Of Federal Paperwork" (PDF). gao.gov. 1983-04-11. Retrieved 2019-07-25.
[3] "FEDERAL RECORDS MANAGEMENT: A History of Neglect". gao.gov. 1981-02-24. Retrieved 2016-09-08.
[4] "Defining "Born Digital": An Essay by Ricky Erway, OCLC Research" (PDF). oclc.org. 2010-11-30. Retrieved 2016-09-08.
[5] "INFORMATION MANAGEMENT: The Challenges of Managing Electronic Records, Statement of Valerie C. Melvin, Director, Information Management and Human Capital Issues" (PDF). gao.gov. 2010-06-17. Retrieved 2016-09-08.
[6] "INFORMATION MANAGEMENT: Additional Actions Are Needed to Meet Requirements of the Managing Government Records Directive". gao.gov. 2015-05-14. Retrieved 2016-09-08.
[7] "GPRAMA SEC. 10. FORMAT OF PERFORMANCE PLANS AND REPORTS". congress.gov. 2011-01-04. Archived from the original on 2016-04-13. Retrieved 2016-09-08.
[8] "Executive Order 13642 in open, standard, machine-readable Strategy Markup Language format". whitehouse.gov. 2013-05-09. Archived from the original on 2016-03-03. Retrieved 2016-09-08.
[9] "StrategicPlan Circular No. A-130, Managing Information as a Strategic Resource, Objective d.5.a: Interoperability, APIs & Machine-Readability".
[10] "StrategicPlan Circular No. A-130, Managing Information as a Strategic Resource, Objective e.2.a: Publication".
[11] Ryan, Paul D. (January 14, 2019). "Text - H.R.4174 - 115th Congress (2017-2018): Foundations for Evidence-Based Policymaking Act of 2018". www.congress.gov.
[12] "PREPARATION, SUBMISSION, AND EXECUTION OF THE BUDGET" (PDF). whitehouse.gov. 2019-06-28. Retrieved 2019-07-25.
[13] "StrategicPlan Circular No. A-130, Managing Information as a Strategic Resource, Objective Machine-Readability".

External links

OMB M-13-13, Open Data Policy: Managing Information as an Asset, which requires agencies to use open, machine-readable, data format standards
NARA Guidance on Managing Web Records, January 2005, which outlines the characteristics of trustworthy records.
Driving a Stake in the Heart of the Capone Consultancy Method of Records Management: Best Practices for Correcting Non-Records Non-Policy Nonsense, March 9, 2015
The U.S. Code, which includes the term "machine-readable" over 50 times as of September 10, 2016

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
