Data exchange

Data exchange is the process of taking data structured under a source schema and transforming it into a target schema, so that the target data is an accurate representation of the source data. [1] Data exchange allows data to be shared between different computer programs.

It is similar to the related concept of data integration except that data is actually restructured (with possible loss of content) in data exchange. There may be no way to transform an instance given all of the constraints. Conversely, there may be numerous ways to transform the instance (possibly infinitely many), in which case a "best" choice of solutions has to be identified and justified.
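As a rough illustration of restructuring with loss of content, the following minimal sketch maps a record from one schema to another. The schemas `SourceEmployee` and `TargetPerson`, their fields, and the `exchange` function are hypothetical and chosen only for this example. Splitting the name at the first space is one of several defensible choices, which is the kind of decision a "best" solution has to justify.

```python
from dataclasses import dataclass


@dataclass
class SourceEmployee:
    """Hypothetical source schema."""
    full_name: str
    department: str
    office_phone: str


@dataclass
class TargetPerson:
    """Hypothetical target schema: the name is split and there is no phone field."""
    given_name: str
    family_name: str
    unit: str


def exchange(src: SourceEmployee) -> TargetPerson:
    # Assumption: the first space separates given name from family name.
    given, _, family = src.full_name.partition(" ")
    # office_phone has no counterpart in the target schema, so it is dropped.
    return TargetPerson(given_name=given, family_name=family, unit=src.department)


person = exchange(SourceEmployee("Ada Lovelace", "Research", "555-0100"))
print(person)
```

A name such as "Mary Ann Smith" could be split in more than one way under this target schema, so the mapping rule itself is a design choice rather than something the schemas dictate.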

Single-domain data exchange

In some domains, a few dozen different source and target schemas (proprietary data formats) may exist. An "exchange" or "interchange" format is often developed for a single domain, and the necessary routines (mappings) are then written to transform each source schema to each target schema indirectly, using the interchange format as an intermediate step. [2] That requires far less work than writing and debugging the hundreds of routines that would be needed to translate every source schema directly to every target schema.
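The saving can be made concrete with a small sketch. With n source formats and m target formats, direct translation needs n × m routines, while routing through an interchange format needs only n + m. The two line formats below (`csv_line` and `kv_line`) and the dictionary used as the intermediate representation are assumptions made up for this illustration.

```python
def direct_routines(n_sources: int, n_targets: int) -> int:
    """One routine per (source, target) pair."""
    return n_sources * n_targets


def hub_routines(n_sources: int, n_targets: int) -> int:
    """One routine into the interchange format per source, one out per target."""
    return n_sources + n_targets


# Hypothetical formats; a plain dict plays the role of the interchange format.
to_hub = {
    "csv_line": lambda text: dict(zip(("name", "qty"), text.split(","))),
    "kv_line": lambda text: dict(pair.split("=") for pair in text.split(";")),
}
from_hub = {
    "csv_line": lambda rec: f"{rec['name']},{rec['qty']}",
    "kv_line": lambda rec: f"name={rec['name']};qty={rec['qty']}",
}


def convert(text: str, source: str, target: str) -> str:
    # Every conversion is two steps: source -> interchange -> target.
    return from_hub[target](to_hub[source](text))


print(direct_routines(20, 20), hub_routines(20, 20))  # 400 40
print(convert("bolt,12", "csv_line", "kv_line"))      # name=bolt;qty=12
```

Adding a new format in this arrangement means writing one routine in and one routine out, rather than one per existing format.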

Examples of these transformative interchange formats include:

Data exchange methods

There are two types of data exchange: broadcast and peer-to-peer (unicast). [9]

In a broadcast network, data is transmitted simultaneously to all participants. As in a conference call, all participants get exactly the same information from the speaker at the same time. [10]

In a peer-to-peer (unicast) data exchange model, data is sent only to the targeted receiver, which is identified by a specific address. As with a telephone call or an email, information flows only between two network participants. [11]
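The difference between the two delivery models can be sketched with an in-memory stand-in for a network. The `Network` class, its method names, and the choice not to echo a broadcast back to its sender are assumptions for illustration only and do not model any real protocol.

```python
class Network:
    """Toy in-memory network: one inbox (a list) per participant address."""

    def __init__(self):
        self.inboxes = {}

    def join(self, address):
        self.inboxes[address] = []

    def broadcast(self, sender, message):
        # Every participant receives the same message.
        # Assumption: the sender does not receive its own broadcast.
        for address, inbox in self.inboxes.items():
            if address != sender:
                inbox.append((sender, message))

    def unicast(self, sender, receiver, message):
        # Only the inbox at the given address receives the message.
        self.inboxes[receiver].append((sender, message))


net = Network()
for name in ("alice", "bob", "carol"):
    net.join(name)

net.broadcast("alice", "meeting at 10")
net.unicast("alice", "bob", "bring the report")

print(net.inboxes["bob"])    # both messages
print(net.inboxes["carol"])  # only the broadcast
```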

Data exchange languages

A data interchange (or exchange) language or format is a domain-independent language that can be used for data from any discipline. [12] Such languages have "evolved from being markup and display-oriented to further support the encoding of metadata that describes the structural attributes of the information." [13]

Practice has shown that certain types of formal languages are better suited to this task than others, since their specification is driven by a formal process rather than by the needs of a particular software implementation. For example, XML is a markup language that was designed to enable the creation of dialects (the definition of domain-specific sublanguages). [14] However, it does not contain domain-specific dictionaries or fact types. Reliable data exchange benefits from the availability of standard dictionaries and taxonomies, and of tool libraries such as parsers, schema validators, and transformation tools. [citation needed]

The following is a partial list of popular generic languages used for data exchange in multiple domains.


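| Name/Abbreviation | Schemas | Flexible | Semantic verification | Dictionary | Information Model | Synonyms and homonyms | Dialecting | Web standard | Transformations | Lightweight | Human readable | Compatibility |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RDF | Yes [note 1] | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Partial | Subset of Semantic web |
| XML | Yes [note 2] | Yes | No | No | No | No | Yes | Yes | Yes | No | Yes | subset of SGML, HTML |
| Atom | Yes | Unknown | Unknown | Unknown | No | Unknown | Yes | Yes | Yes | No | No | XML dialect |
| JSON | No | Unknown | Unknown | Unknown | No | Unknown | No | Yes | No | Yes | Yes | subset of YAML |
| YAML | No [note 3] | Unknown | Unknown | Unknown | No | Unknown | No | No | No [note 3] | Yes | Yes [note 4] | superset of JSON |
| REBOL | Yes [note 7] | Yes | No | Yes | No | Yes | Yes | No | Yes [note 7] | Yes | Yes [note 5] | |
| Gellish | Yes | Yes | Yes | Yes [note 8] | No | Yes | Yes | ISO | No | Yes | Partial [note 6] | SQL, RDF/XML, OWL |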

Nomenclature

Notes:

  1. RDF is a schema-flexible language.
  2. The schema of XML contains a very limited grammar and vocabulary.
  3. Available as an extension.
  4. In the default format, not the compact syntax.
  5. The syntax is fairly simple (the language was designed to be human-readable); the dialects may require domain knowledge.
  6. The standardized fact types are denoted by standardized English phrases, whose interpretation and use need some training.
  7. The Parse dialect is used to specify, validate, and transform dialects.
  8. The English version includes a Gellish English Dictionary-Taxonomy that also includes standardized fact types (= kinds of relations).

XML for data exchange

XML is popular for data exchange on the World Wide Web for several reasons. First, it is closely related to the pre-existing standards Standard Generalized Markup Language (SGML) and Hypertext Markup Language (HTML), so a parser written to support these two languages can easily be extended to support XML as well. For example, XHTML has been defined as a format that is formally XML but is understood correctly by most (if not all) HTML parsers. [14]
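Because XML fixes only the syntax, a generic parser can read any dialect, and the receiving program supplies the meaning of the element names. The sketch below uses Python's standard `xml.etree.ElementTree` module; the `order` and `item` elements and their attributes are a made-up dialect used purely as an example.

```python
import xml.etree.ElementTree as ET

# A small document in a hypothetical "order" dialect.
document = """<order id="42">
  <item sku="A1" quantity="2"/>
  <item sku="B7" quantity="1"/>
</order>"""

root = ET.fromstring(document)

# The parser yields a generic tree; interpreting "sku" and "quantity"
# is up to the receiving application.
items = [(item.get("sku"), int(item.get("quantity"))) for item in root.findall("item")]

print(root.tag, root.get("id"), items)  # order 42 [('A1', 2), ('B7', 1)]
```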

YAML for data exchange

YAML is a language that was designed to be human-readable (and as such to be easy to edit with any standard text editor). Its notation is often similar to reStructuredText or wiki syntax, which also try to be readable by both humans and computers. YAML 1.2 also includes a shorthand notation that is compatible with JSON, so any JSON document is also valid YAML; the converse, however, does not hold. [16]

REBOL for data exchange

REBOL is a language that was designed to be human-readable and easy to edit using any standard text editor. To achieve that, it uses a simple free-form syntax with minimal punctuation and a rich set of datatypes. REBOL datatypes such as URLs, emails, date and time values, tuples, strings, and tags respect the common standards. Because REBOL was designed in a metacircular fashion, it does not need any additional meta-language. This metacircularity is why, for example, the Parse dialect used (not exclusively) for defining and transforming REBOL dialects is itself a dialect of REBOL. [17] REBOL was used as a source of inspiration for JSON. [18]

Gellish for data exchange

Gellish English is a formalized subset of natural English. It includes a simple grammar and a large, extensible English Dictionary-Taxonomy that defines general and domain-specific terminology (terms for concepts). The concepts are arranged in a subtype-supertype hierarchy (a taxonomy), which supports inheritance of knowledge and requirements. The Dictionary-Taxonomy also includes standardized fact types (also called relation types). The terms and relation types together can be used to create and interpret expressions of facts, knowledge, requirements and other information. Gellish can be used in combination with SQL, RDF/XML, OWL and various other meta-languages. The Gellish standard is a combination of ISO 10303-221 (AP221) and ISO 15926. [19]
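The idea of inheritance through a subtype-supertype hierarchy can be sketched generically as follows. This is not Gellish syntax: the concepts, the relation phrases "is a kind of" and "has property", and the helper functions are all hypothetical and are not taken from the Gellish dictionary. The sketch also assumes each concept has at most one direct supertype.

```python
# Facts as (left term, relation type, right term); all entries are invented.
facts = [
    ("centrifugal pump", "is a kind of", "pump"),
    ("pump", "is a kind of", "equipment"),
    ("equipment", "has property", "mass"),
    ("pump", "has property", "capacity"),
]


def supertypes(concept):
    """Walk up the taxonomy, assuming one direct supertype per concept."""
    chain = []
    current = concept
    while True:
        parents = [o for s, r, o in facts if s == current and r == "is a kind of"]
        if not parents:
            return chain
        current = parents[0]
        chain.append(current)


def properties(concept):
    """Collect properties stated for the concept and inherited from its supertypes."""
    result = []
    for c in [concept] + supertypes(concept):
        result += [o for s, r, o in facts if s == c and r == "has property"]
    return result


print(supertypes("centrifugal pump"))  # ['pump', 'equipment']
print(properties("centrifugal pump"))  # ['capacity', 'mass']
```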

References

  1. Doan, A.; Halevy, A.; Ives, Z. (2012). Principles of Data Integration. Morgan Kaufmann. p. 276.
  2. Arenas, M.; Barceló, P.; Libkin, L.; Murlak, F. (2014). Foundations of Data Exchange. Cambridge University Press. pp. 1–11. ISBN 9781107016163. Retrieved 25 May 2018.
  3. Clancy, J.J. (2012). "Chapter 1: Directions for Engineering Data Exchange for Computer Aided Design and Manufacturing". In Wang, P.C.C. (ed.). Advances in CAD/CAM: Case Studies. Springer Science & Business Media. pp. 1–36. ISBN 9781461328193. Retrieved 25 May 2018.
  4. Kalish, C.E.; Mayer, M.F. (November 1981). "DIF: A format for data exchange between application programs". BYTE Magazine: 174.
  5. "About ODF". OpenDoc Society. Retrieved 25 May 2018.
  6. Zhu, X. (2016). GIS for Environmental Applications: A practical approach. Routledge. ISBN 9781134094509. Retrieved 25 May 2018.
  7. "KML Reference". Google Inc. 21 January 2016. Retrieved 25 May 2018.
  8. Martins, R.M.F.; Lourenço, N.C.C.; Horta, N.C.G. (2012). Generating Analog IC Layouts with LAYGEN II. Springer Science & Business Media. p. 34. ISBN 9783642331466. Retrieved 25 May 2018.
  9. Heidarzadeh, A.; Sprintson, A. (2017-03-30). "Optimal exchange of data over broadcast networks with adversaries". 2016 Information Theory and Applications Workshop (ITA). ISBN 978-1-5090-2529-9, via IEEE.
  10. "What is a Broadcast?". IONOS Digital Guide. 2023-03-20. Retrieved 2024-04-03.
  11. "Unicast". IONOS Digital Guide. 2023-03-23. Retrieved 2024-04-03.
  12. Billingsley, F.C. (1988). "General Data Interchange Language". ISPRS Archives. 27 (B3): 80–91. Retrieved 25 May 2018. The transformation routines will constitute a language and syntax which must be discipline and machine independent.
  13. Nurseitov, N.; Paulson, M.; Reynolds, R.; Izurieta, C. (2009). "Comparison of JSON and XML Data Interchange Formats: A Case Study". Scenario: 157–162.
  14. Lewis, J.; Moscovitz, M. (2009). AdvancED CSS. APress. pp. 5–6. ISBN 9781430219323. Retrieved 25 May 2018.
  15. "human-readable". Oxford Dictionaries. Oxford University Press. Archived from the original on May 30, 2018. Retrieved 29 May 2018.
  16. Bendersky, E. (22 November 2008). "JSON is YAML, but YAML is not JSON". Eli Bendersky's website. Retrieved 29 May 2018.
  17. Sassenrath, C. (2000). "The REBOL Scripting Language". Dr. Dobb's Journal. 25 (314): 64–8. Retrieved 29 May 2018.
  18. Sassenrath, C. (13 December 2012). "On JSON and REBOL". REBOL.com. Retrieved 29 May 2018.
  19. van Renssen, A.; Vermaas, P.E.; Zwart, S.D. (2007). "A Taxonomy of Functions in Gellish English". Proceedings from the International Conference on Engineering Design 2007: DS42_P_230. Retrieved 29 May 2018.