Machine-readable document

A machine-readable document is a document whose content can be readily processed by computers. Such documents are distinguished from more general machine-readable data by virtue of having further structure to provide the necessary context to support the business processes for which they are created.

Definition

Data without context (language use) is meaningless and lacks the four essential characteristics of trustworthy business records specified in ISO 15489, Information and documentation -- Records management: authenticity, reliability, integrity, and usability. [1]

The vast bulk of information is unstructured data and, from a business perspective, that means it is "immature", i.e., Level 1 (chaotic) of the Capability Maturity Model. Such immaturity fosters inefficiency, diminishes quality, and limits effectiveness. Unstructured information is also ill-suited for records management functions, provides inadequate evidence for legal purposes, drives up the cost of discovery in litigation, and makes access and usage needlessly cumbersome in routine, ongoing business processes.

There are at least four aspects to machine-readability:

As early as 1983, the U.S. Government Accountability Office (GAO) began emphasizing the benefits of machine-readable information. [2] Even earlier, in 1981, GAO began reporting on the problem of inadequate record-keeping practices in the U.S. federal government. [3] Such deficiencies are not unique to government, and advances in information technology mean that most information is now "born digital" and thus potentially far more easily managed by automated means. [4] However, in testimony to Congress in 2010, GAO highlighted problems with managing electronic records, and as recently as 2015, GAO continued to report inadequacies in the performance of Executive Branch agencies in meeting records management requirements. [5] [6] Moreover, more than a decade after a major and formerly highly respected auditing firm, Arthur Andersen, met its demise due to a records destruction scandal, record-keeping practices became a central issue in the 2016 Presidential election.

On January 4, 2011, President Obama signed H.R. 2142, the Government Performance and Results Act (GPRA) Modernization Act of 2010 (GPRAMA), into law as P.L. 111-352. Section 10 of GPRAMA requires U.S. federal agencies to publish their strategic and performance plans and reports in searchable, machine-readable format. [7] Additionally, in 2013, he issued Executive Order 13642, Making Open and Machine Readable the New Default for Government Information, extending the principle to government information in general. [8] On July 28, 2016, the Office of Management and Budget (OMB) followed up by including in the revised issuance of Circular A-130 direction for agencies to use open, machine-readable formats, [9] and to publish "public information online in a manner that promotes analysis and reuse for the widest possible range of purposes", [10] meaning that the information is both publicly accessible and machine-readable. On January 14, 2019, President Trump signed into law H.R. 4174, [11] the Foundations for Evidence-Based Policymaking Act of 2018, which includes the OPEN Government Data Act (OGDA) and codifies in law the requirement for agencies to make their public data assets available in machine-readable format. On June 28, 2019, in Circular A-11, [12] OMB expressed intent to begin complying with section 10 of GPRAMA. [13]

In support of such policy direction, technological advancement is enabling more efficient and effective management and use of machine-readable electronic records. Document-oriented databases have been developed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Extensible Markup Language (XML) is a World Wide Web Consortium (W3C) Recommendation setting forth rules for encoding documents in a format that is both human-readable and machine-readable. Many XML editor tools have been developed, and most, if not all, major information technology applications support XML to a greater or lesser degree. The fact that XML itself is an open, standard, machine-readable format makes it relatively easy for application developers to provide that support.
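To illustrate what "both human-readable and machine-readable" means in practice, the following minimal sketch parses a small XML document with Python's standard `xml.etree.ElementTree` module. The document and its element names (`plan`, `goal`, `objective`) are hypothetical and are not taken from any published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical plan document; the element names are illustrative only.
PLAN_XML = """<plan>
  <title>Records Modernization Plan</title>
  <goal id="g1">
    <name>Publish plans in machine-readable form</name>
    <objective>Convert existing reports to XML</objective>
    <objective>Validate documents before release</objective>
  </goal>
</plan>"""


def summarize_plan(xml_text):
    """Return the plan title and a mapping of goal names to objective lists."""
    root = ET.fromstring(xml_text)
    goals = {}
    for goal in root.findall("goal"):
        name = goal.findtext("name")
        goals[name] = [obj.text for obj in goal.findall("objective")]
    return root.findtext("title"), goals


title, goals = summarize_plan(PLAN_XML)
print(title)
for name, objectives in goals.items():
    print(f"- {name}: {len(objectives)} objectives")
```

The same text that a person can read directly is navigated by the program through its element structure, which is the context that distinguishes a machine-readable document from undifferentiated text.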

The W3C's accompanying XML Schema (XSD) Recommendation specifies how to formally describe the elements in an XML document. With respect to the specification of XML schemas, the Organization for the Advancement of Structured Information Standards (OASIS) is a leading standards-developing organization. However, many technical developers prefer to work with JSON. To define the structure of JSON data for validation, documentation, and interaction control, JSON Schema was developed, with its specifications published as Internet Engineering Task Force (IETF) Internet-Drafts.
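The idea of describing a document's structure formally and then checking instances against that description can be sketched as follows. This is not an implementation of JSON Schema or XSD: it is a hand-written check that understands only three JSON Schema keywords (`type`, `required`, `properties`) and four type names, and the schema shown is hypothetical. A production system would use a full validator library instead.

```python
import json

# A hypothetical schema using a small subset of JSON Schema keywords.
SCHEMA = {
    "type": "object",
    "required": ["title", "goals"],
    "properties": {
        "title": {"type": "string"},
        "goals": {"type": "array"},
    },
}

# Map JSON Schema type names to Python types (subset only).
TYPE_MAP = {
    "object": dict,
    "array": list,
    "string": str,
    "boolean": bool,
}


def check(instance, schema):
    """Return a list of error messages; an empty list means the check passed."""
    errors = []
    expected = schema.get("type")
    if expected in TYPE_MAP and not isinstance(instance, TYPE_MAP[expected]):
        return [f"expected {expected}, got {type(instance).__name__}"]
    if isinstance(instance, dict):
        # Report every required property that is absent.
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"missing required property: {key}")
        # Recurse into properties that are present and described.
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(f"{key}: {e}" for e in check(instance[key], subschema))
    return errors


good = json.loads('{"title": "Plan", "goals": []}')
bad = json.loads('{"title": 42}')
print(check(good, SCHEMA))
print(check(bad, SCHEMA))
```

The first document passes with no errors; the second is reported as missing `goals` and as having a `title` of the wrong type.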

The Portable Document Format (PDF) is a file format used to present documents in a manner independent of application software, hardware, and operating systems. Each PDF file encapsulates a complete description of the presentation of the document, including the text, fonts, graphics, and other information needed to display it. PDF/A is an ISO-standardized version of PDF specialized for use in the archiving and long-term preservation of electronic documents. PDF/A-3 allows embedding of other file formats, including XML, into PDF/A-conforming documents, thus potentially providing the best of both human- and machine-readability. The W3C's XSL-FO (XSL Formatting Objects) markup language is commonly used to generate PDF files.

Metadata, data about data, can be used to organize electronic resources, provide digital identification, and support the archiving and preservation of resources. In well-structured, machine-readable electronic records, the content can be repurposed as both data and metadata. In the context of electronic record-keeping systems, the terms "management" and "metadata" are virtually synonymous. Given proper metadata, records management functions can be automated, thereby reducing the risk of spoliation of evidence and other fraudulent manipulations of records. Moreover, such records can be used to automate the process of auditing data maintained in databases, thereby reducing the risk of single points of failure associated with the Machiavellian concept of a single source of truth.
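As a minimal sketch of how metadata can drive automated records management, the following example attaches two pieces of metadata to each record: a retention period and a content digest. The `Record` fields, the retention rule expressed in whole years, and the sample records are all assumptions made for illustration; they do not represent any particular record-keeping standard or system.

```python
from dataclasses import dataclass, replace
from datetime import date
import hashlib


@dataclass(frozen=True)
class Record:
    record_id: str
    created: date
    retention_years: int  # hypothetical retention rule, in whole years
    content: bytes
    sha256: str           # digest captured when the record was filed


def file_record(record_id, created, retention_years, content):
    """Create a record and capture a content digest as integrity metadata."""
    digest = hashlib.sha256(content).hexdigest()
    return Record(record_id, created, retention_years, content, digest)


def is_intact(record):
    """True if the content still matches the digest stored in the metadata."""
    return hashlib.sha256(record.content).hexdigest() == record.sha256


def due_for_disposition(records, today):
    """Return IDs of records whose retention period has elapsed."""
    due = []
    for r in records:
        # Compare (year, month, day) tuples to avoid invalid dates such as
        # February 29 in a non-leap year.
        expiry = (r.created.year + r.retention_years, r.created.month, r.created.day)
        if (today.year, today.month, today.day) >= expiry:
            due.append(r.record_id)
    return due


records = [
    file_record("memo-001", date(2010, 3, 1), 7, b"budget memo"),
    file_record("plan-002", date(2020, 6, 15), 10, b"strategic plan"),
]
print(due_for_disposition(records, date(2024, 1, 1)))

# Simulate an alteration made after filing: the stored digest no longer matches.
tampered = replace(records[1], content=b"altered plan")
print(is_intact(records[1]), is_intact(tampered))
```

Because both decisions depend only on the metadata, they can be applied uniformly across a collection without a person inspecting each record.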

Blockchain is a new technology for maintaining continuously growing lists of records secured from tampering and revision. A key feature is that every node in a decentralized system has a copy of the blockchain, so there is no single point of failure subject to manipulation and fraud.
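The tamper-evidence property can be illustrated with a simple hash chain, in which each block stores the hash of the block before it. This sketch covers only that linking mechanism on a single machine; it omits the distribution of copies across nodes and the consensus process that a real blockchain relies on, and the block layout and function names are assumptions made for illustration.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, record, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "record": record, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, record):
    """Append a record, linking it to the hash of the last block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "record": record,
        "prev_hash": prev_hash,
        "hash": block_hash(index, record, prev_hash),
    })


def verify(chain):
    """Recompute every hash and link; return False on any mismatch."""
    prev_hash = GENESIS_HASH
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["record"], block["prev_hash"]):
            return False
        prev_hash = block["hash"]
    return True


chain = []
for text in ["record A filed", "record B filed", "record A amended"]:
    append_block(chain, text)
print(verify(chain))

# Altering an earlier record invalidates its stored hash.
chain[1]["record"] = "record B deleted"
print(verify(chain))
```

Changing any earlier record changes its hash, which breaks the link recorded in every later block, so a revision cannot go unnoticed unless the entire remainder of the chain is rewritten.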

References

  1. "NARA Guidance on Managing Web Records". National Archives. 2016-08-15.
  2. "Better Use Of Information Technology Can Reduce The Burden Of Federal Paperwork" (PDF). gao.gov. 1983-04-11. Retrieved 2019-07-25.
  3. "FEDERAL RECORDS MANAGEMENT: A History of Neglect". gao.gov. 1981-02-24. Retrieved 2016-09-08.
  4. "Defining "Born Digital": An Essay by Ricky Erway, OCLC Research" (PDF). oclc.org. 2010-11-30. Retrieved 2016-09-08.
  5. "INFORMATION MANAGEMENT: The Challenges of Managing Electronic Records, Statement of Valerie C. Melvin, Director, Information Management and Human Capital Issues" (PDF). gao.gov. 2010-06-17. Retrieved 2016-09-08.
  6. "INFORMATION MANAGEMENT: Additional Actions Are Needed to Meet Requirements of the Managing Government Records Directive". gao.gov. 2015-05-14. Retrieved 2016-09-08.
  7. "GPRAMA SEC. 10. FORMAT OF PERFORMANCE PLANS AND REPORTS". congress.gov. 2011-01-04. Archived from the original on 2016-04-13. Retrieved 2016-09-08.
  8. "Executive Order 13642 in open, standard, machine-readable Strategy Markup Language format". whitehouse.gov. 2013-05-09. Archived from the original on 2016-03-03. Retrieved 2016-09-08.
  9. "StrategicPlan Circular No. A-130, Managing Information as a Strategic Resource, Objective d.5.a: Interoperability, APIs & Machine-Readability".
  10. "StrategicPlan Circular No. A-130, Managing Information as a Strategic Resource, Objective e.2.a: Publication".
  11. Ryan, Paul D. "Text - H.R.4174 - 115th Congress (2017-2018): Foundations for Evidence-Based Policymaking Act of 2018". congress.gov. 2019-01-14.
  12. "PREPARATION, SUBMISSION, AND EXECUTION OF THE BUDGET" (PDF). whitehouse.gov. 2019-06-28. Retrieved 2019-07-25.
  13. "StrategicPlan Circular No. A-130, Managing Information as a Strategic Resource, Objective Machine-Readability".