Data exchange is the process of taking data structured under a source schema and transforming it into a target schema, so that the target data is an accurate representation of the source data. [1] Data exchange allows data to be shared between different computer programs.
It is similar to the related concept of data integration except that data is actually restructured (with possible loss of content) in data exchange. There may be no way to transform an instance given all of the constraints. Conversely, there may be numerous ways to transform the instance (possibly infinitely many), in which case a "best" choice of solutions has to be identified and justified.
In some domains, a few dozen different source and target schema (proprietary data formats) may exist. An "exchange" or "interchange format" is often developed for a single domain, and then necessary routines (mappings) are written to (indirectly) transform/translate each and every source schema to each and every target schema by using the interchange format as an intermediate step. [2] That requires a lot less work than writing and debugging the hundreds of different routines that would be required to directly translate each and every source schema directly to each and every target schema.
Examples of these transformative interchange formats include:
There are two types of data exchange: broadcast data exchange vs peer-to-peer (unicast) data exchange. [9]
In a broadcast network, data is transmitted simultaneously to all participants. Just as a conference call, all participants get the exact same information from the speaker at the same time. [10]
In a peer-to-peer (unicast) data exchange model, data is sent only to the targeted receiver defined by a specific address. Just as a telephone call or a email, information only flows between two network participants. [11]
A data interchange (or exchange) language/format is a language that is domain-independent and can be used for data from any kind of discipline. [12] They have "evolved from being markup and display-oriented to further support the encoding of metadata that describes the structural attributes of the information." [13]
Practice has shown that certain types of formal languages are better suited for this task than others, since their specification is driven by a formal process instead of particular software implementation needs. For example, XML is a markup language that was designed to enable the creation of dialects (the definition of domain-specific sublanguages). [14] However, it does not contain domain-specific dictionaries or fact types. Beneficial to a reliable data exchange is the availability of standard dictionaries-taxonomies and tools libraries such as parsers, schema validators, and transformation tools.[ citation needed ]
The following is a partial list of popular generic languages used for data exchange in multiple domains.
Name/Abbreviation | Schemas | Flexible | Semantic verification | Dictionary | Information Model | Synonyms and homonyms | Dialecting | Web standard | Transformations | Lightweight | Human readable | Compatibility |
---|---|---|---|---|---|---|---|---|---|---|---|---|
RDF | Yes [1] | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Partial | Subset of Semantic web |
XML | Yes [2] | Yes | No | No | No | No | Yes | Yes | Yes | No | Yes | subset of SGML, HTML |
Atom | Yes | Unknown | Unknown | Unknown | No | Unknown | Yes | Yes | Yes | No | No | XML dialect |
JSON | No | Unknown | Unknown | Unknown | No | Unknown | No | Yes | No | Yes | Yes | subset of YAML |
YAML | No [3] | Unknown | Unknown | Unknown | No | Unknown | No | No | No [3] | Yes | Yes [4] | superset of JSON |
REBOL | Yes [7] | Yes | No | Yes | No | Yes | Yes | No | Yes [7] | Yes | Yes [5] | |
Gellish | Yes | Yes | Yes | Yes [8] | No | Yes | Yes | ISO | No | Yes | Partial [6] | SQL, RDF/XML, OWL |
Nomenclature
Notes:
The popularity of XML for data exchange on the World Wide Web has several reasons. First of all, it is closely related to the preexisting standards Standard Generalized Markup Language (SGML) and Hypertext Markup Language (HTML), and as such a parser written to support these two languages can be easily extended to support XML as well. For example, XHTML has been defined as a format that is formal XML, but understood correctly by most (if not all) HTML parsers. [14]
YAML is a language that was designed to be human-readable (and as such to be easy to edit with any standard text editor). Its notion often is similar to reStructuredText or a Wiki syntax, who also try to be readable both by humans and computers. YAML 1.2 also includes a shorthand notion that is compatible with JSON, and as such any JSON document is also valid YAML; this however does not hold the other way. [16]
REBOL is a language that was designed to be human-readable and easy to edit using any standard text editor. To achieve that it uses a simple free-form syntax with minimal punctuation and a rich set of datatypes. REBOL datatypes like URLs, emails, date and time values, tuples, strings, tags, etc. respect the common standards. REBOL is designed to not need any additional meta-language, being designed in a metacircular fashion. The metacircularity of the language is the reason why, e.g., the Parse dialect used (not exclusively) for definitions and transformations of REBOL dialects is also itself a dialect of REBOL. [17] REBOL was used as a source of inspiration for JSON. [18]
Gellish English is a formalized subset of natural English, which includes a simple grammar and a large extensible English Dictionary-Taxonomy that defines the general and domain specific terminology (terms for concepts), whereas the concepts are arranged in a subtype-supertype hierarchy (a taxonomy), which supports inheritance of knowledge and requirements. The Dictionary-Taxonomy also includes standardized fact types (also called relation types). The terms and relation types together can be used to create and interpret expressions of facts, knowledge, requirements and other information. Gellish can be used in combination with SQL, RDF/XML, OWL and various other meta-languages. The Gellish standard is a combination of ISO 10303-221 (AP221) and ISO 15926. [19]
Rebol is a cross-platform data exchange language and a multi-paradigm dynamic programming language designed by Carl Sassenrath for network communications and distributed computing. It introduces the concept of dialecting: small, optimized, domain-specific languages for code and data, which is also the most notable property of the language according to its designer Carl Sassenrath:
Although it can be used for programming, writing functions, and performing processes, its greatest strength is the ability to easily create domain-specific languages or dialects
In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of object-oriented objects does not include any of their associated methods with which they were previously linked.
The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":
The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.
Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.
In communications and computing, a machine-readable medium is a medium capable of storing data in a format easily readable by a digital computer or a sensor. It contrasts with human-readable medium and data.
Abstract Syntax Notation One (ASN.1) is a standard interface description language (IDL) for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.
X3D is a set of royalty-free ISO/IEC standards for declaratively representing 3D computer graphics. X3D includes multiple graphics file formats, programming-language API definitions, and run-time specifications for both delivery and integration of interactive network-capable 3D data. X3D version 4.0 has been approved by Web3D Consortium, and is under final review by ISO/IEC as a revised International Standard (IS).
YAML(see § History and name) is a human-readable data serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax that intentionally differs from Standard Generalized Markup Language (SGML). It uses Python-style indentation to indicate nesting and does not require quotes around most string values.
In computing, configuration files are files used to configure the parameters and initial settings for some computer programs or applications, server processes and operating system settings.
The Text Encoding Initiative (TEI) is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s. The community currently runs a mailing list, meetings and conference series, and maintains the TEI technical standard, a journal, a wiki, a GitHub repository and a toolchain.
United Nations/Electronic Data Interchange for Administration, Commerce and Transport (UN/EDIFACT) is an international standard for electronic data interchange (EDI) developed for the United Nations and approved and published by UNECE, the UN Economic Commission for Europe.
JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers.
GXL is designed to be a standard exchange format for graphs. GXL is an extensible markup language (XML) sublanguage and the syntax is given by an XML document type definition (DTD). This exchange format offers an adaptable and flexible means to support interoperability between graph-based tools.
An INI file is a configuration file for computer software that consists of a text-based content with a structure and syntax comprising key–value pairs for properties, and sections that organize the properties. The name of these configuration files comes from the filename extension INI, for initialization, used in the MS-DOS operating system which popularized this method of software configuration. The format has become an informal standard in many contexts of configuration, but many applications on other operating systems use different file name extensions, such as conf and cfg.
Gellish is an ontology language for data storage and communication, designed and developed by Andries van Renssen since mid-1990s. It started out as an engineering modeling language but evolved into a universal and extendable conceptual data modeling language with general applications. Because it includes domain-specific terminology and definitions, it is also a semantic data modelling language and the Gellish modeling methodology is a member of the family of semantic modeling methodologies.
A Canonical S-expression is a binary encoding form of a subset of general S-expression. It was designed for use in SPKI to retain the power of S-expressions and ensure canonical form for applications such as digital signatures while achieving the compactness of a binary form and maximizing the speed of parsing.
Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.
This is a comparison of data serialization formats, various ways to convert complex objects to sequences of bits. It does not include markup languages used exclusively as document file formats.
The transformation routines will constitute a language and syntax which must be discipline and machine independent.