Round-trip format conversion

The term round-trip is used in document conversion, particularly conversion involving markup languages such as XML and SGML. Round-tripping consists of converting a document in format A (docA) to one in format B (docB) and then back again to format A (docA′). If docA and docA′ are identical then there has been no information loss and the round-trip has been successful. [1] More generally, it means converting from any data representation to another and back again, including from one data structure to another.

Common use cases

  1. Databases: When migrating data between different database systems or formats, round-tripping validates that data remains consistent after conversion.
  2. File Formats: Converting a document between formats, such as from Microsoft Word to OpenDocument Format and back, to check document fidelity. [2] Another example is converting a document from HTML to Markdown and back, then checking the result for consistency. [3]
  3. Serialization and Deserialization: Converting objects to a storable or transmittable format (like JSON or XML) and back into objects without losing data. [1]

In the context of graph databases, round-tripping can validate conversions between different graph models, such as from the Resource Description Framework (RDF) to Property Graphs and back, ensuring the original semantics and structure are preserved. [4]
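
The serialization use case listed above can be sketched in a few lines of Python. This is only an illustration using the standard `json` module; the helper name `round_trips` and the sample records are assumptions made for the example, not part of any cited source. It shows one object that survives the round trip and one that does not, because JSON has no tuple type and only allows string keys.

```python
import json

def round_trips(obj):
    """Return True if obj is unchanged after JSON serialization and deserialization."""
    return json.loads(json.dumps(obj)) == obj

# Plain dicts with string keys, lists, strings and integers map directly onto JSON.
record = {"name": "docA", "tags": ["xml", "sgml"], "version": 2}
record_ok = round_trips(record)

# Integer keys come back as strings and tuples come back as lists.
lossy = {1: "integer key", "point": (3, 4)}
lossy_ok = round_trips(lossy)

print(record_ok, lossy_ok)
```

A check of this kind is often written as a unit test, so that a change to the data model that breaks the round trip is caught early.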

Information loss

When a document in one format is converted to another there is likely to be information loss. For example, suppose an HTML document is saved as plain text (*.txt). Then all the markup (structure, formatting, superscripts, …) will be lost. Compound documents will frequently lose information on images and other embedded objects. If the text file is converted back to the original format, information will necessarily be missing.

A similar effect happens with image formats. Some formats such as JPEG achieve compression through a small amount of information loss. If a lossless file, such as a BMP or PNG file, is converted to JPEG and back again then the result will be different from the original (although it may be visually very similar).
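
The effect can be imitated without an image library. The sketch below is not JPEG; it uses uniform quantization of a few sample values as a simplified stand-in for a lossy encoding step, and the function names and the step size of 8 are assumptions chosen for the illustration. The restored values are close to the originals but not equal to them.

```python
def quantize(samples, step=8):
    """Lossy 'encode': keep only which bucket of width `step` each sample falls in."""
    return [s // step for s in samples]

def dequantize(codes, step=8):
    """'Decode': map each bucket back to the value at its midpoint."""
    return [c * step + step // 2 for c in codes]

original = [0, 17, 100, 128, 255]
restored = dequantize(quantize(original))

# Largest difference between an original sample and its restored value.
max_error = max(abs(a - b) for a, b in zip(original, restored))

print(restored != original, max_error)
```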

Just because the initial and final documents are not bitwise identical does not mean there is information loss. Some formats have undefined fields, or fields where the contents have no impact on the result.
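
As a small illustration, two JSON documents can differ byte for byte while carrying the same information, since whitespace between tokens and the order of object keys do not affect the parsed value. The sample document below is made up for the example.

```python
import json

doc_a = b'{"title": "Report", "pages": 12}'

# Re-serialize with different indentation and key order.
doc_a_prime = json.dumps(json.loads(doc_a), indent=2, sort_keys=True).encode("utf-8")

bitwise_identical = doc_a == doc_a_prime
same_information = json.loads(doc_a) == json.loads(doc_a_prime)

print(bitwise_identical, same_information)
```

Comparing parsed values rather than raw bytes is one way to test a round trip for formats with this kind of freedom.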

Markup languages

Markup languages such as XML can, in principle, hold any information and so the process docA → docX → docA′ could be designed to avoid information loss. It is now common to convert legacy formats to XML formats because they have greater interoperability and a wider set of available tools. Thus it is possible to convert Word documents to an XML format and reimport them.

The XML document should contain identical information to the legacy format. An important condition is that the round trip (legacy → XML → legacy′) should result in effectively identical documents. Because some document structures allow some flexibility in content order, whitespace, case-sensitivity, etc., it is useful to have a means of canonicalizing the legacy format. The full round trip may then be:

legacy → canonicalLegacy → XML → legacy′ → canonicalLegacy′

If canonicalLegacy = canonicalLegacy′ then the round trip has been successful.
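
The chain above can be sketched in Python under stated assumptions. The "legacy" format here is a hypothetical one-pair-per-line `key = value` text format in which key case, padding around `=` and line order carry no meaning; the XML layout (`doc` and `field` elements) and all function names are likewise invented for the illustration.

```python
import xml.etree.ElementTree as ET

def canonicalize(legacy: str) -> str:
    """Trim whitespace, lower-case keys and sort the lines of the hypothetical format."""
    pairs = []
    for line in legacy.splitlines():
        if not line.strip():
            continue
        key, value = line.split("=", 1)
        pairs.append((key.strip().lower(), value.strip()))
    return "\n".join(f"{k}={v}" for k, v in sorted(pairs))

def legacy_to_xml(legacy: str) -> str:
    """Convert canonical 'key=value' lines to a small XML document."""
    root = ET.Element("doc")
    for line in legacy.splitlines():
        key, value = line.split("=", 1)
        ET.SubElement(root, "field", name=key).text = value
    return ET.tostring(root, encoding="unicode")

def xml_to_legacy(xml_text: str) -> str:
    """Convert back, deliberately using a different line order and padding."""
    root = ET.fromstring(xml_text)
    return "\n".join(f"{f.get('name')} = {f.text or ''}" for f in reversed(list(root)))

legacy = "Title = Round trip\nauthor=  A. Writer\n"
canonical_legacy = canonicalize(legacy)
legacy_prime = xml_to_legacy(legacy_to_xml(canonical_legacy))
canonical_legacy_prime = canonicalize(legacy_prime)

round_trip_ok = canonical_legacy == canonical_legacy_prime
print(legacy_prime != canonical_legacy, round_trip_ok)
```

The reverse converter produces text that differs from its input, yet the comparison of canonical forms still succeeds, which is the point of canonicalizing before comparing.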

Character encodings

Unicode follows a principle of round-trip compatibility with older standardized legacy encodings, so converting documents to Unicode does not lose information and they can be converted back. To achieve this, Unicode compatibility characters have been introduced.
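
A brief sketch using Python's built-in `shift_jis` codec illustrates the idea; the choice of Shift JIS and of this particular character is an assumption made for the example. Shift JIS encodes half-width and full-width katakana separately, and Unicode keeps the distinction by providing a half-width compatibility character. If that character is folded away by NFKC normalization, the original bytes can no longer be recovered.

```python
import unicodedata

legacy_bytes = b"\xb1"                       # half-width katakana "a" in Shift JIS
text = legacy_bytes.decode("shift_jis")      # a Unicode compatibility character

# Legacy -> Unicode -> legacy reproduces the original bytes.
round_trip_ok = text.encode("shift_jis") == legacy_bytes

# NFKC replaces the compatibility character with the ordinary full-width katakana,
# which Shift JIS encodes with different bytes.
normalized = unicodedata.normalize("NFKC", text)
normalized_bytes = normalized.encode("shift_jis")

print(round_trip_ok, normalized_bytes == legacy_bytes)
```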

Limitation

An application can claim to round-trip and be dishonest. For example, it may save the original data from docA as a field in docX, so the reverse transformation to docA′ simply extracts that field. While this may be needed for some cases, the idea of a round-trip conversion is to go through another format representation or data structure and back again. With such a strategy, even a small change to the converted document can mean that it cannot be converted back to the original format.
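
The following sketch shows one way such a shortcut could look and why it is fragile. Both converters are hypothetical: the forward conversion is lossy (it upper-cases the text) but hides this by stashing the original in an extra field, and the field names `body` and `original` are invented for the example.

```python
def to_doc_x(doc_a: str) -> dict:
    """Lossy conversion that also stashes the untouched original."""
    return {"body": doc_a.upper(), "original": doc_a}

def to_doc_a(doc_x: dict) -> str:
    """'Reverse conversion' that ignores the body and returns the stash."""
    return doc_x["original"]

doc_a = "Hello, round trip"
doc_x = to_doc_x(doc_a)

# The round trip appears to succeed, but only because of the stashed copy.
claims_round_trip = to_doc_a(doc_x) == doc_a

# After an edit made in format X, the stash no longer matches the document.
doc_x["body"] = doc_x["body"].replace("HELLO", "GOODBYE")
stale = to_doc_a(doc_x)

print(claims_round_trip, stale)
```

The edit made in format X is silently dropped, because the reverse conversion never reads the part of the document that was changed.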

Usage

The term appears to be common, but is not reported in dictionaries. A typical usage occurs on a 1999 xml-dev thread, but the term is likely to have been used before this. [5]

References

  1. Boyer, John; Gao, Sandy; Malaika, Susan; Maximilien, Michael; Salz, Rich; Simeon, Jerome (2011). "Experiences with JSON and XML Transformations". W3C Workshop on Data and Services Integration. 9.
  2. "Is LibreOffice compatible with Microsoft Office?". Ask LibreOffice. 20 December 2022. Retrieved 26 September 2024.
  3. "How To Use Python-Markdown to Convert Markdown Text to HTML". DigitalOcean. Retrieved 26 September 2024.
  4. Hartig, Olaf (2014). "Reconciliation of RDF* and property graphs". arXiv preprint arXiv:1409.3288.
  5. Kesselman, Joseph “keshlam” (March 25, 1999). "Round-trip issues". XML-dev. IBM Research. Gathering and replying to several comments [including CDATA]