SDXF

SDXF (Structured Data eXchange Format) is a data serialization format defined by RFC 3072. [1] It allows arbitrary structured data of different types to be assembled in one file for exchange between arbitrary computers.

The ability to serialize arbitrary data into a self-describing format is reminiscent of XML, but unlike XML, SDXF is a binary format rather than a text format, so SDXF data cannot be viewed or modified with a text editor. The maximum length of a datum (composite or elementary) encoded using SDXF is 16,777,215 bytes (2^24 − 1, one byte less than 16 MiB).

Technical structure format

SDXF data can express arbitrary levels of structural depth. Data elements are self-documenting, meaning that the metadata (numeric, character string or structure) are encoded into the data elements. The design of this format is simple and transparent: computer programs access SDXF data with the help of well-defined functions, exempting programmers from learning the precise data layout.

The word "exchange" in the name reflects another kind of transparency: the SDXF functions provide an architecture-independent conversion of the data. Serializations can be exchanged among computers (via direct network connection, file transfer or CD) without further measures. The SDXF functions on the receiving side handle the adaptation to the local architecture.
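Byte order is one concrete case of the architectural adaptation described above. The following minimal sketch is not the SDXF library: it uses Python's standard `struct` module, and the function names `encode_int32` and `decode_int32` are hypothetical. It assumes, for illustration only, that numbers travel in big-endian (network) byte order; RFC 3072 is the authority on the actual wire representation.

```python
import struct
import sys

def encode_int32(value: int) -> bytes:
    """Serialize a signed 32-bit integer in big-endian order on any host."""
    return struct.pack(">i", value)

def decode_int32(payload: bytes) -> int:
    """Turn the wire bytes back into a native integer on the receiving host."""
    (value,) = struct.unpack(">i", payload)
    return value

wire = encode_int32(123456)
# The wire bytes are identical whether the host is little- or big-endian.
print(sys.byteorder, wire.hex(), decode_int32(wire))
```

Because the byte order is fixed by the format string rather than by the host, sender and receiver need no knowledge of each other's architecture.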

Structured data is data whose patterns are predictable and more complex than strings of text. [2]

Example

A commercial example: two companies want to exchange digital invoices. The invoices have the following hierarchical nested structure:

INVOICE
│
├─ INVOICE_NO
├─ DATE
├─ ADDRESS_SENDER
│    ├─ NAME
│    ├─ NAME
│    ├─ STREET
│    ├─ ZIP
│    ├─ CITY
│    └─ COUNTRY
├─ ADDRESS_RECIPIENT
│    ├─ NAME
│    ├─ NAME
│    ├─ STREET
│    ├─ ZIP
│    ├─ CITY
│    └─ COUNTRY
├─ INVOICE_SUM
├─ SINGLE_ITEMS
│    ├─ SINGLE_ITEM
│    │    ├─ QUANTITY
│    │    ├─ ITEM_NUMBER
│    │    ├─ ITEM_TEXT
│    │    ├─ CHARGE
│    │    └─ SUM
│    └─ ...
├─ CONDITIONS
...

Structure

The basic element is a chunk. An SDXF serialization is itself a chunk. A chunk can consist of a set of smaller chunks. Chunks are composed of a header prefix of six bytes, followed by data. The header contains a chunk identifier as a 2-byte binary number (Chunk_ID), the chunk length and type. It may contain additional information about compression, encryption and more.
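The six-byte header can be sketched as follows. The 2-byte Chunk_ID and the total size of six bytes come from the text above. The split of the remaining four bytes into one flags byte (type and options) followed by a 3-byte length is an assumption, chosen because it matches the 16,777,215-byte limit, as is the big-endian byte order. The names `pack_header` and `unpack_header` are hypothetical, and the flag values are not interpreted here.

```python
import struct

HEADER_SIZE = 6
MAX_LENGTH = 0xFFFFFF  # 16,777,215: the largest value a 3-byte length field can hold

def pack_header(chunk_id: int, flags: int, length: int) -> bytes:
    """Build a 6-byte header: 2-byte ID, 1 flags byte, 3-byte length (assumed layout)."""
    if not 1 <= chunk_id <= 0xFFFF:
        raise ValueError("Chunk_ID must be in the range 1 to 65535")
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit into one byte")
    if not 0 <= length <= MAX_LENGTH:
        raise ValueError("length must fit into three bytes")
    return struct.pack(">HB", chunk_id, flags) + length.to_bytes(3, "big")

def unpack_header(buf: bytes) -> tuple:
    """Split a header into (chunk_id, flags, length)."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("a header needs at least six bytes")
    chunk_id, flags = struct.unpack_from(">HB", buf, 0)
    length = int.from_bytes(buf[3:6], "big")
    return chunk_id, flags, length

header = pack_header(chunk_id=1, flags=3, length=4)
print(header.hex(), unpack_header(header))
```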

The chunk type indicates whether the data consists of text (a string of characters), a binary number (integer or floating point), or whether the chunk is a composite of other chunks.

Structured chunks enable the programmer to pack hierarchical constructions such as the INVOICE above into an SDXF structure as follows. Every named term (INVOICE, INVOICE_NO, DATE, ADDRESS_SENDER, etc.) is given a unique number in the range 1 to 65535 (a 2-byte unsigned binary integer). The outermost chunk is constructed with the ID INVOICE (that is, with the associated numerical Chunk_ID) as a structured chunk on level 1. This INVOICE chunk is filled with other chunks on level 2 and beyond: INVOICE_NO, DATE, ADDRESS_SENDER, ADDRESS_RECIPIENT, INVOICE_SUM, SINGLE_ITEMS, CONDITIONS. Some level 2 chunks are structured in turn, such as the two addresses and SINGLE_ITEMS.
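The mapping of names to numbers and the nesting of chunks described above can be sketched as below. This is an illustration under stated assumptions, not the encoding mandated by RFC 3072: the ID numbers, the type codes (`STRUCTURED`, `NUMERIC`, `CHAR`), the 4-byte integer width, and the header layout (2-byte ID, one type byte, 3-byte length, big-endian) are all hypothetical choices, as is the `encode` function itself.

```python
from enum import IntEnum

class ChunkID(IntEnum):
    """Hypothetical assignment of unique numbers to the named terms."""
    INVOICE = 1
    INVOICE_NO = 2
    DATE = 3
    ADDRESS_SENDER = 4
    NAME = 5
    COUNTRY = 6

# Hypothetical type codes stored in the third header byte.
STRUCTURED, NUMERIC, CHAR = 1, 3, 4

def encode(chunk_id: int, value) -> bytes:
    """Encode one chunk; a list of (id, value) pairs becomes a structured chunk."""
    if isinstance(value, list):
        type_code = STRUCTURED
        body = b"".join(encode(cid, v) for cid, v in value)
    elif isinstance(value, int):
        type_code = NUMERIC
        body = value.to_bytes(4, "big", signed=True)
    else:
        type_code = CHAR
        body = value.encode("utf-8")
    if len(body) > 0xFFFFFF:
        raise ValueError("chunk body exceeds 16,777,215 bytes")
    header = int(chunk_id).to_bytes(2, "big") + bytes([type_code]) + len(body).to_bytes(3, "big")
    return header + body

invoice = encode(ChunkID.INVOICE, [
    (ChunkID.INVOICE_NO, 123456),
    (ChunkID.DATE, "2005-06-17"),
    (ChunkID.ADDRESS_SENDER, [          # level 2 chunk that is itself structured
        (ChunkID.NAME, "Peter Somebody"),
        (ChunkID.COUNTRY, "France"),
    ]),
])
print(len(invoice), invoice[:6].hex())
```

The whole serialization is a single chunk whose body is the concatenation of its child chunks, which is why an SDXF serialization is itself a chunk.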

For a precise description, see page 2 of the RFC or the description of the SDXF format on Pinpi.com. [3]

SDXF lets programmers work on SDXF structures with a compact set of functions. There are only a few of them:

To read Chunks, the following functions are used:
init
Initializes the parameter structure and links it to the existing Chunk.
enter
Steps into a structured Chunk; the first Chunk of this structure is then ready to be processed.
leave
Leaves the current structure. This structure is already current.
next
Goes to the next Chunk if one exists (otherwise it leaves the current structure).
extract
Transfers (and adapts) data from the current Chunk into a program variable.
select
Searches for the next Chunk with a given Chunk ID and makes it current.
To build Chunks, the following functions are used:
init
Initializes the parameter structure and links it to an empty output buffer in which a new Chunk is created.
create
Creates a new Chunk and appends it to the current structure (if one exists).
append
Appends a complete Chunk to an SDXF structure.
leave
Leaves the current structure. This structure is already current.
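To make the cursor-style reading functions above more concrete, here is a toy reader. It is a sketch, not the SDXF library API: the `Reader` class, the `chunk` helper, the type codes, and the header layout (2-byte ID, one type byte, 3-byte length, big-endian) are assumptions made for illustration. Two simplifications are also assumptions: `next` only clears the `ok` flag at the end of a structure instead of leaving it, and `leave` makes the structure that was left the current chunk again.

```python
STRUCTURED, NUMERIC, CHAR = 1, 3, 4   # hypothetical type codes
HEADER = 6                            # header size in bytes

def chunk(chunk_id: int, type_code: int, body: bytes) -> bytes:
    """Helper that builds the test data: header followed by body."""
    return chunk_id.to_bytes(2, "big") + bytes([type_code]) + len(body).to_bytes(3, "big") + body

class Reader:
    def __init__(self, buf: bytes):            # 'init': link to an existing chunk
        self.buf, self.pos, self.end = buf, 0, len(buf)
        self.stack = []                         # (pos, end) of enclosing structures
        self.ok = self.pos < self.end           # plays the role of a return code

    def _header(self):
        p = self.pos
        return (int.from_bytes(self.buf[p:p + 2], "big"), self.buf[p + 2],
                int.from_bytes(self.buf[p + 3:p + 6], "big"))

    @property
    def chunk_id(self) -> int:
        return self._header()[0]

    def enter(self):                            # step into a structured chunk
        _, type_code, length = self._header()
        if type_code != STRUCTURED:
            raise ValueError("current chunk is not structured")
        self.stack.append((self.pos, self.end))
        self.pos += HEADER
        self.end = self.pos + length
        self.ok = self.pos < self.end

    def leave(self):                            # return to the enclosing level
        self.pos, self.end = self.stack.pop()
        self.ok = True

    def next(self):                             # advance to the following sibling
        self.pos += HEADER + self._header()[2]
        self.ok = self.pos < self.end

    def extract(self):                          # copy and convert the current value
        _, type_code, length = self._header()
        body = self.buf[self.pos + HEADER:self.pos + HEADER + length]
        if type_code == NUMERIC:
            return int.from_bytes(body, "big", signed=True)
        if type_code == CHAR:
            return body.decode("utf-8")
        raise ValueError("structured chunk: use enter() instead")

    def select(self, wanted: int) -> bool:      # find the next chunk with this ID
        while self.ok and self.chunk_id != wanted:
            self.next()
        return self.ok

address = chunk(4, STRUCTURED, chunk(5, CHAR, b"Peter Somebody") + chunk(6, CHAR, b"France"))
invoice = chunk(1, STRUCTURED, chunk(2, NUMERIC, (123456).to_bytes(4, "big", signed=True))
                + chunk(3, CHAR, b"2005-06-17") + address)

r = Reader(invoice)
r.enter()                  # into INVOICE; INVOICE_NO is now current
invoice_no = r.extract()
r.select(4)                # skip DATE, stop at ADDRESS_SENDER
r.enter()
r.select(6)                # COUNTRY
country = r.extract()
r.leave()                  # back to ADDRESS_SENDER
r.leave()                  # back to INVOICE
print(invoice_no, country)
```

The reader never copies the buffer: every operation only moves an offset, which is why a handful of functions is enough to traverse structures of any depth.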

The following pseudocode creates an invoice: [4]

init(sdx, buffersize=1000);                                   // initialize the SDXF parameter structure sdx
create(sdx, ID=INVOICE, datatype=STRUCTURED);                 // start of the main structure
    create(sdx, ID=INVOICE_NO, datatype=NUMERIC, value=123456);   // create an elementary Chunk
    create(sdx, ID=DATE, datatype=CHAR, value="2005-06-17");      // once more
    create(sdx, ID=ADDRESS_SENDER, datatype=STRUCTURED);          // substructure
        create(sdx, ID=NAME, datatype=CHAR, value="Peter Somebody");  // elementary Chunk inside this substructure
        ...
        create(sdx, ID=COUNTRY, datatype=CHAR, value="France");       // the last one inside this substructure
    leave(sdx);                                                   // close the substructure ADDRESS_SENDER
    ...
leave(sdx);                                                   // close the main structure INVOICE

Pseudocode to extract the INVOICE structure could look like this:

init(sdx, container=pointer to an SDXF structure);   // initialize the SDXF parameter structure sdx
enter(sdx);                                          // step into the INVOICE structure
while (sdx.rc == SDX_RC_ok)
{
    switch (sdx.Chunk_ID)
    {
        case INVOICE_NO:
            extract(sdx);
            invno = sdx.value;              // extract puts integer values into the parameter field 'value'
            break;
        case DATE:
            extract(sdx);
            strcpy(invdate, sdx.data);      // sdx.data is a pointer to the extracted character string
            break;
        case ADDRESS_SENDER:
            enter(sdx);                     // 'enter' is used because ADDRESS_SENDER is a structured Chunk
            do while (sdx.rc == SDX_RC_ok)  // inner loop
                ...
            break;
        ...
    }
    next(sdx);                              // advance to the next Chunk; without this the loop would not progress
}

SDXF is not designed to be human-readable or to be modified with text editors. A related editable format is SDEF (Structured Data Editable Format). [5]

References

  1. Wildgrube, Max (March 2001). "Structured Data Exchange Format (SDXF)". RFC 3072.
  2. It may be argued that "structured" is used here in the same sense as in structured programming — like there are no gotos in a (strictly) structured program, there are no pointers/references in SDXF. This need not be how the name arose, however.
  3. "SDXF - 2. Description of the SDXF Format". Pinpi.com. Retrieved 2013-09-10.
  4. "6.3 The Project PRNT: a complete example". PINPI. Retrieved 2013-09-10.
  5. "SDEF Site (from Archive.org)". Archived from the original on 2016-03-07.