Data Format Description Language

Data Format Description Language (DFDL, often pronounced daff-o-dil) is a modeling language for describing general text and binary data in a standard way. It was published as an Open Grid Forum Recommendation [1] in February 2021, and in April 2024 was published as an ISO standard. [2]

A DFDL model or schema allows any text or binary data to be read (or "parsed") from its native format and presented as an instance of an information set. An information set is a logical representation of the data contents, independent of the physical format. For example, two records could be in different formats, because one has fixed-length fields and the other uses delimiters, but they could contain exactly the same data, and both would be represented by the same information set. The same DFDL schema also allows data to be taken from an instance of an information set and written out (or "serialized") to its native format.
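The idea that two physical formats can share one information set can be sketched in a few lines of Python. This is only an illustration, not a DFDL processor: the field layout, the function names (`parse_delimited`, `parse_fixed`, `serialize_delimited`) and the sample records are all hypothetical.

```python
from typing import Dict, Union

Infoset = Dict[str, Union[str, int]]

# Hypothetical fixed-width layout: (field name, width in characters).
FIXED_LAYOUT = [("name", 8), ("age", 3), ("county", 10), ("country", 2)]
FIELD_NAMES = [name for name, _ in FIXED_LAYOUT]


def _to_infoset(raw: Dict[str, str]) -> Infoset:
    """Turn raw text fields into logical values (age becomes an integer)."""
    return {
        "name": raw["name"],
        "age": int(raw["age"]),
        "county": raw["county"],
        "country": raw["country"],
    }


def parse_delimited(record: str, separator: str = ",") -> Infoset:
    """Parse a record whose fields are separated by a delimiter."""
    values = record.split(separator)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    return _to_infoset(dict(zip(FIELD_NAMES, values)))


def parse_fixed(record: str) -> Infoset:
    """Parse a record whose fields have fixed widths and are space-padded."""
    raw, pos = {}, 0
    for name, width in FIXED_LAYOUT:
        raw[name] = record[pos:pos + width].strip()
        pos += width
    return _to_infoset(raw)


def serialize_delimited(infoset: Infoset, separator: str = ",") -> str:
    """Write an information set back out in the delimited format."""
    return separator.join(str(infoset[name]) for name in FIELD_NAMES)


# Two physical representations of the same (made-up) data.
delimited = "alice,29,Kent,UK"
fixed = "alice".ljust(8) + "29".rjust(3) + "Kent".ljust(10) + "UK"

# Both parse to the same logical content.
same = parse_delimited(delimited) == parse_fixed(fixed)
```

Here `same` evaluates to `True`, and `serialize_delimited(parse_fixed(fixed))` reproduces the delimited record, showing the round trip through the logical representation. In DFDL itself, a schema describes each format declaratively, so no format-specific code of this kind needs to be written.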

DFDL is descriptive and not prescriptive. DFDL is not a data format, nor does it impose the use of any particular data format. Instead, it provides a standard way of describing many different kinds of data formats. This approach has several advantages. [3] It allows an application author to design an appropriate data representation according to their requirements while describing it in a standard way that can be shared, enabling multiple programs to directly interchange the data.

DFDL achieves this by building upon the facilities of W3C XML Schema 1.0. A subset of XML Schema is used, enough to enable the modeling of non-XML data. The motivations for this approach are to avoid inventing a completely new schema language, and to make it easy to convert general text and binary data, via a DFDL information set, into a corresponding XML document.
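As a rough sketch of that last step, the following Python snippet turns a flat information set into an XML document using the standard library. The helper name `infoset_to_xml`, the root element name and the sample values are assumptions made for illustration; a real DFDL processor would derive the element names and structure from the schema.

```python
import xml.etree.ElementTree as ET


def infoset_to_xml(root_name: str, infoset: dict) -> str:
    """Render a flat information set as an XML document string."""
    root = ET.Element(root_name)
    for field, value in infoset.items():
        child = ET.SubElement(root, field)
        child.text = str(value)  # logical values are written as text content
    return ET.tostring(root, encoding="unicode")


# Hypothetical information set for one record.
person = {"name": "alice", "age": 29, "county": "Kent", "country": "UK"}
xml_text = infoset_to_xml("person", person)
```

Because the logical model is already expressed in XML Schema terms, each item in the information set maps directly onto an element of the same name.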

Educational material is available in the form of DFDL Tutorials, videos and several hands-on DFDL labs.

History

DFDL was created in response to a need for grid APIs to be able to understand data regardless of source. A language capable of modeling a wide variety of existing text and binary data formats was needed. A working group was established at the Global Grid Forum (which later became the Open Grid Forum) in 2003 to create a specification for such a language.

A decision was made early on to base the language on a subset of W3C XML Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe non-XML physical representations. This is an established approach that was already being used by 2003 in commercial systems. DFDL takes this approach and evolves it into an open standard capable of describing many text or binary data formats.

Work continued on the language, resulting in the publication of a DFDL 1.0 specification as OGF Proposed Recommendation GFD.174 in January 2011.

The official OGF Recommendation is now GFD.240, published in February 2021, which obsoletes all prior versions and incorporates all issues noted to date (it is also available as HTML). A summary of DFDL and its features is available at the OGF. Any issues with the specification are tracked using GitHub issue trackers.

In April 2024, DFDL was published as ISO/IEC 23415:2024 by way of the ISO Publicly Available Specification (PAS) process. The standard is available from ISO but will remain publicly available from the Open Grid Forum as well.

Implementations

Implementations of DFDL processors that can parse and serialize data using DFDL schemas are available.

A public repository for DFDL schemas that describe commercial and scientific data formats has been established on GitHub. DFDL schemas for formats such as UN/EDIFACT, NACHA, MIL-STD-2045, NITF, and ISO 8583 are available for free download.

Example

Take as an example the following text data stream which gives the name, age and location of a person:

The logical model for this data can be described by the following fragment of an XML Schema document. The order, names, types and cardinality of the fields are expressed by the XML schema model.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>
  <xs:complexType name="person_type">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="age" type="xs:short"/>
      <xs:element name="county" type="xs:string"/>
      <xs:element name="country" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>

To additionally model the physical representation of the data stream, DFDL augments the XML schema fragment with annotations on the xs:element and xs:sequence objects, as follows:

<xs:schema xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>
  <xs:complexType name="person_type">
    <xs:sequence>
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:sequence encoding="ASCII" sequenceKind="ordered" separator="," separatorType="infix" separatorPolicy="required"/>
        </xs:appinfo>
      </xs:annotation>
      <xs:element name="name" type="xs:string">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:element lengthKind="delimited" encoding="ASCII"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
      <xs:element name="age" type="xs:short">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:element representation="text" lengthKind="delimited" encoding="ASCII" textNumberRep="standard" textNumberPattern="#0" textNumberBase="10"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
      <xs:element name="county" type="xs:string">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:element lengthKind="delimited" encoding="ASCII"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
      <xs:element name="country" type="xs:string">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:element lengthKind="delimited" encoding="ASCII"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:schema>

The property attributes on these DFDL annotations express that the data are represented in an ASCII text format, with fields being of variable length and delimited by commas.
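To make the meaning of these properties more concrete, the following Python sketch hand-codes roughly what a processor would do when guided by the annotated schema above. It is a simplified illustration under stated assumptions, not a DFDL implementation: the function names `parse_person` and `unparse_person` and the sample record are hypothetical, `textNumberPattern` and `separatorPolicy` are not modeled, and only the encoding, the infix separator and the range of `xs:short` are honored.

```python
from typing import Dict

# Properties copied from the dfdl:sequence annotation in the schema above.
SEQUENCE_PROPS = {"encoding": "ASCII", "separator": ",", "separatorType": "infix"}

# Field order and logical types taken from the xs:sequence model.
FIELDS = [
    ("name", "xs:string"),
    ("age", "xs:short"),
    ("county", "xs:string"),
    ("country", "xs:string"),
]

SHORT_MIN, SHORT_MAX = -32768, 32767  # value range of xs:short


def parse_person(data: bytes) -> Dict[str, object]:
    """Parse one comma-delimited ASCII record into a logical record."""
    text = data.decode(SEQUENCE_PROPS["encoding"])  # fails on non-ASCII bytes
    # An infix separator appears only between fields, so a plain split works.
    parts = text.split(SEQUENCE_PROPS["separator"])
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    result: Dict[str, object] = {}
    for (name, xs_type), raw in zip(FIELDS, parts):
        if xs_type == "xs:short":
            value = int(raw)  # base 10, matching textNumberBase="10"
            if not SHORT_MIN <= value <= SHORT_MAX:
                raise ValueError(f"{name}={value} is out of range for xs:short")
            result[name] = value
        else:
            result[name] = raw
    return result


def unparse_person(infoset: Dict[str, object]) -> bytes:
    """Write a logical record back out as comma-delimited ASCII bytes."""
    text = SEQUENCE_PROPS["separator"].join(str(infoset[name]) for name, _ in FIELDS)
    return text.encode(SEQUENCE_PROPS["encoding"])


# Hypothetical sample record.
record = parse_person(b"alice,29,Kent,UK")
```

The point of the sketch is that the separator, encoding and field types are data rather than logic. In DFDL that data lives in the schema, so the same generic processor can handle a different format simply by being given a different schema.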

An alternative, more compact syntax is also provided, where DFDL properties are carried as non-native attributes on the XML Schema objects themselves.

<xs:schema xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>
  <xs:complexType name="person_type">
    <xs:sequence dfdl:encoding="ASCII" dfdl:sequenceKind="ordered" dfdl:separator="," dfdl:separatorType="infix" dfdl:separatorPolicy="required">
      <xs:element name="name" type="xs:string" dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>
      <xs:element name="age" type="xs:short" dfdl:representation="text" dfdl:lengthKind="delimited" dfdl:encoding="ASCII" dfdl:textNumberRep="standard" dfdl:textNumberPattern="##0" dfdl:textNumberBase="10"/>
      <xs:element name="county" type="xs:string" dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>
      <xs:element name="country" type="xs:string" dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
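One way to see that the two syntaxes carry the same information is to extract the DFDL properties from each form and compare them. The Python sketch below does this for a single element using the standard library's ElementTree. The helper `dfdl_properties` is a hypothetical name introduced here, and the two fragments are cut-down versions of the `name` element from the examples above, with namespace declarations added so that each fragment parses on its own.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DFDL = "http://www.ogf.org/dfdl/dfdl-1.0/"

# Annotation syntax: properties sit on dfdl:element inside xs:appinfo.
LONG_FORM = f"""
<xs:element xmlns:xs="{XS}" xmlns:dfdl="{DFDL}" name="name" type="xs:string">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:element lengthKind="delimited" encoding="ASCII"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
"""

# Compact syntax: properties are dfdl-qualified attributes on xs:element.
SHORT_FORM = f"""
<xs:element xmlns:xs="{XS}" xmlns:dfdl="{DFDL}" name="name" type="xs:string"
            dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>
"""


def dfdl_properties(element_xml: str) -> dict:
    """Collect DFDL properties from either syntax into a plain dict."""
    elem = ET.fromstring(element_xml.strip())
    props = {}
    # Annotation syntax: unqualified attributes on the nested dfdl:element.
    path = f"{{{XS}}}annotation/{{{XS}}}appinfo/{{{DFDL}}}element"
    for annotation in elem.findall(path):
        props.update(annotation.attrib)
    # Compact syntax: attributes in the DFDL namespace on the element itself.
    prefix = f"{{{DFDL}}}"
    for key, value in elem.attrib.items():
        if key.startswith(prefix):
            props[key[len(prefix):]] = value
    return props


long_props = dfdl_properties(LONG_FORM)
short_props = dfdl_properties(SHORT_FORM)
equivalent = long_props == short_props
```

For these fragments `equivalent` is `True`: both forms yield `lengthKind="delimited"` and `encoding="ASCII"`. The sketch ignores other ways of supplying properties, such as defaults defined elsewhere in a schema.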

Features

The goal of DFDL is to provide a rich modeling language capable of representing any text or binary data format. The 1.0 release is a major step towards this goal. The capability includes support for:

References