Data Format Description Language

Last updated December 10, 2024

Data Format Description Language (DFDL, often pronounced daff-o-dil) is a modeling language for describing general text and binary data in a standard way. It was published as an Open Grid Forum Recommendation^[1] in February 2021, and in April 2024 was published as an ISO standard.^[2]

A DFDL model or schema allows any text or binary data to be read (or "parsed") from its native format and to be presented as an instance of an information set. (An information set is a logical representation of the data contents, independent of the physical format. For example, two records could be in different formats, because one has fixed-length fields and the other uses delimiters, but they could contain exactly the same data, and would both be represented by the same information set). The same DFDL schema also allows data to be taken from an instance of an information set and written out (or "serialized") to its native format.
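The idea of one information set behind two physical formats can be sketched in a few lines of Python. This is only an illustration of the concept, not a DFDL processor: the field names, the fixed-length widths and the sample values are all assumptions made for this sketch.

```python
# Illustrative only: two physical formats, one logical information set.
# Field names, widths and sample values are assumptions for this sketch.

FIELDS = ["name", "age", "country"]
WIDTHS = {"name": 8, "age": 3, "country": 2}  # hypothetical fixed-length layout


def parse_delimited(text, separator=","):
    """Read a separator-delimited record into a logical record (a dict)."""
    values = text.split(separator)
    if len(values) != len(FIELDS):
        raise ValueError("wrong number of fields")
    record = dict(zip(FIELDS, values))
    record["age"] = int(record["age"])
    return record


def parse_fixed(text):
    """Read a fixed-length record, stripping the space padding."""
    record, pos = {}, 0
    for field in FIELDS:
        record[field] = text[pos:pos + WIDTHS[field]].strip()
        pos += WIDTHS[field]
    record["age"] = int(record["age"])
    return record


def unparse_delimited(record, separator=","):
    """Write a logical record back out in the delimited format."""
    return separator.join(str(record[field]) for field in FIELDS)


delimited = "Alice,42,UK"
fixed = "Alice".ljust(8) + "42".rjust(3) + "UK"

# Both physical records yield the same logical content.
print(parse_delimited(delimited) == parse_fixed(fixed))
# A record read from one format can be written out in the other.
print(unparse_delimited(parse_fixed(fixed)))
```

In DFDL the two parsing functions would not be hand-written; the differences between the formats would instead be captured declaratively in two schemas that share the same logical structure.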

DFDL is descriptive and not prescriptive. DFDL is not a data format, nor does it impose the use of any particular data format. Instead, it provides a standard way of describing many different kinds of data formats. This approach has several advantages.^[3] It allows an application author to design an appropriate data representation according to their requirements while describing it in a standard way which can be shared, enabling multiple programs to directly interchange the data.

DFDL achieves this by building upon the facilities of W3C XML Schema 1.0. A subset of XML Schema is used, enough to enable the modeling of non-XML data. The motivations for this approach are to avoid inventing a completely new schema language, and to make it easy to convert general text and binary data, via a DFDL information set, into a corresponding XML document.

Educational material is available in the form of DFDL Tutorials, videos and several hands-on DFDL labs.

History

DFDL was created in response to a need for grid APIs to be able to understand data regardless of source. A language was needed capable of modeling a wide variety of existing text and binary data formats. A working group was established at the Global Grid Forum (which later became the Open Grid Forum) in 2003 to create a specification for such a language.

A decision was made early on to base the language on a subset of W3C XML Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe non-XML physical representations. This is an established approach that was already being used by 2003 in commercial systems. DFDL takes this approach and evolves it into an open standard capable of describing many text or binary data formats.

Work continued on the language, resulting in the publication of a DFDL 1.0 specification as OGF Proposed Recommendation GFD.174 in January 2011.

The official OGF Recommendation is now GFD.240, published in February 2021, which obsoletes all prior versions and incorporates resolutions for all issues noted to date (it is also available as HTML). A summary of DFDL and its features is available at the OGF. Issues with the specification are tracked using GitHub issue trackers.

In April 2024, DFDL was published as ISO/IEC 23415:2024 by way of the ISO Publicly Available Standards (PAS) process. The standard is available from ISO but will remain publicly available from the Open Grid Forum as well.

Implementations

Implementations of DFDL processors that can parse and serialize data using DFDL schemas are available.

- IBM has multiple DFDL implementations:
  - A production-ready DFDL 1.0 streaming parser, modeler and visual tester.^[4] This is available in several IBM products, including IBM App Connect Enterprise (formerly known as IBM Integration Bus). A free developer edition is available.
  - IBM z/TPF DFDL, which is part of the IBM mainframe z/Transaction Processing Facility.
- Apache Daffodil is an open-source DFDL processor with both a parser and an unparser, an IDE that is an extension of VSCode, and integrations into Apache NiFi, the XML Calabash XProc pipeline engine, and Smooks. It remains under active development.
- The European Space Agency project S2G Data Viewer includes a parser, DFDL4S,^[5] that implements a subset of the DFDL 1.0 specification.

A public repository for DFDL schemas that describe commercial and scientific data formats has been established on GitHub. DFDL schemas for formats like UN/EDIFACT, NACHA, MIL-STD-2045, NITF, and ISO8583 are available for free download.

Example

Take as an example the following text data stream which gives the name, age and location of a person:

The logical model for this data can be described by the following fragment of an XML Schema document. The order, names, types and cardinality of the fields are expressed by the XML schema model.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>
  <xs:complexType name="person_type">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="age" type="xs:short"/>
      <xs:element name="county" type="xs:string"/>
      <xs:element name="country" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>

To additionally model the physical representation of the data stream, DFDL augments the XML schema fragment with annotations on the xs:element and xs:sequence objects, as follows:

<xs:schema xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
           xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>
  <xs:complexType name="person_type">
    <xs:sequence>
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:sequence encoding="ASCII" sequenceKind="ordered"
                         separator="," separatorType="infix"
                         separatorPolicy="required"/>
        </xs:appinfo>
      </xs:annotation>
      <xs:element name="name" type="xs:string">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:element lengthKind="delimited" encoding="ASCII"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
      <xs:element name="age" type="xs:short">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:element representation="text" lengthKind="delimited"
                          encoding="ASCII" textNumberRep="standard"
                          textNumberPattern="#0" textNumberBase="10"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
      <xs:element name="county" type="xs:string">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:element lengthKind="delimited" encoding="ASCII"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
      <xs:element name="country" type="xs:string">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:element lengthKind="delimited" encoding="ASCII"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:schema>

The property attributes on these DFDL annotations express that the data are represented in an ASCII text format, with fields being of variable length and delimited by commas.

An alternative, more compact syntax is also provided, where DFDL properties are carried as non-native attributes on the XML Schema objects themselves.

<xs:schema xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
           xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>
  <xs:complexType name="person_type">
    <xs:sequence dfdl:encoding="ASCII" dfdl:sequenceKind="ordered"
                 dfdl:separator="," dfdl:separatorType="infix"
                 dfdl:separatorPolicy="required">
      <xs:element name="name" type="xs:string"
                  dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>
      <xs:element name="age" type="xs:short"
                  dfdl:representation="text" dfdl:lengthKind="delimited"
                  dfdl:encoding="ASCII" dfdl:textNumberRep="standard"
                  dfdl:textNumberPattern="##0" dfdl:textNumberBase="10"/>
      <xs:element name="county" type="xs:string"
                  dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>
      <xs:element name="country" type="xs:string"
                  dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
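To make concrete what the person_type description conveys, the following Python sketch hard-codes the same facts (field order, names, logical types, a comma as infix separator, ASCII encoding) and uses them to drive both reading and writing. It is a toy stand-in for a DFDL processor, not an implementation of one; the sample record and the names MODEL, parse and unparse are assumptions made for illustration.

```python
# A toy, description-driven parser/unparser; not a DFDL processor.
# MODEL mirrors person_type: field order, names and logical types.
MODEL = [("name", str), ("age", int), ("county", str), ("country", str)]
SEPARATOR = ","                     # separator=",", used between fields (infix)
ENCODING = "ascii"                  # encoding="ASCII"
SHORT_RANGE = range(-32768, 32768)  # value space of xs:short


def parse(data: bytes) -> dict:
    """Turn the physical bytes into a logical record."""
    values = data.decode(ENCODING).split(SEPARATOR)
    if len(values) != len(MODEL):
        raise ValueError(f"expected {len(MODEL)} fields, got {len(values)}")
    infoset = {}
    for (name, logical_type), raw in zip(MODEL, values):
        value = logical_type(raw)
        if logical_type is int and value not in SHORT_RANGE:
            raise ValueError(f"{name} is out of range for xs:short")
        infoset[name] = value
    return infoset


def unparse(infoset: dict) -> bytes:
    """Write a logical record back out in its physical format."""
    text = SEPARATOR.join(str(infoset[name]) for name, _ in MODEL)
    return text.encode(ENCODING)


sample = b"Jane,32,Kent,UK"  # hypothetical record, invented for this sketch
person = parse(sample)
print(person)
print(unparse(person) == sample)
```

The point of the sketch is that a single description serves both directions: nothing in parse or unparse is specific to a person record beyond what MODEL and the two constants state.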

Features

The goal of DFDL is to provide a rich modeling language capable of representing any text or binary data format. The 1.0 release is a major step towards this goal. The capability includes support for:

Text data types such as strings, numbers, zoned decimals, calendars and Booleans
Binary data types such as two's complement integers, BCD, packed decimals, floats, calendars and Booleans
Fixed length data and data delimited by text or binary markup
Language data structures found in languages like COBOL, C and PL/I
Industry standards such as CSV, SWIFT, FIX, HL7, X12, HIPAA, EDIFACT, ISO 8583
Any encoding and endianness
Bit data of arbitrary length
Pattern languages for text numbers and calendars
Ordered, unordered and floating content
Default values on parsing and serializing
Nil values capability for handling out-of-band data
Fixed and variable arrays
XPath 2.0 expression language including variables to model dynamic data
Speculative parsing and other mechanisms to resolve choices and optionality
Validation to XML Schema 1.0 rules
A scoping mechanism that allows common property values to be applied at multiple annotation points
Hiding elements in the data from the information set
Calculating element values for the information set
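Of the features listed, speculative parsing is perhaps the least self-explanatory. The general idea can be sketched in Python as follows: each alternative of a choice is attempted in turn, and a failure causes the parser to backtrack and try the next one. This is a minimal sketch of the concept only; the branch functions and the ParseError class are hypothetical, and the precise rules a DFDL processor follows are those of the specification, not of this code.

```python
# Toy illustration of speculative parsing; all names are hypothetical.
from datetime import date


class ParseError(Exception):
    """Raised when a branch cannot parse the data."""


def parse_number(text):
    try:
        return {"number": int(text)}
    except ValueError as exc:
        raise ParseError(str(exc)) from exc


def parse_date(text):
    try:
        return {"date": date.fromisoformat(text)}
    except ValueError as exc:
        raise ParseError(str(exc)) from exc


def parse_text(text):
    # Fallback branch: any string is acceptable as plain text.
    return {"text": text}


def parse_choice(text, branches):
    """Try each branch in order; the first one that succeeds is selected."""
    for branch in branches:
        try:
            return branch(text)
        except ParseError:
            continue  # backtrack and try the next alternative
    raise ParseError("no branch of the choice matched")


BRANCHES = [parse_number, parse_date, parse_text]

for value in ["42", "2024-04-01", "hello"]:
    print(value, "->", parse_choice(value, BRANCHES))
```

The order of the branches matters in this sketch: placing parse_text first would make it match every input, so the more specific alternatives come first.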

References

[1] OGF GFD.240

[2] ISO/IEC 23415:2024

[3] The Syntax of Data, Mike Beckerle blog

[4] IBM DFDL 1.0

[5] DFDL4S

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.