Fast Infoset

Fast Infoset (or FI) is an international standard that specifies a binary encoding format for the XML Information Set (XML Infoset) as an alternative to the XML document format. It aims to provide more efficient serialization than the text-based XML format.

FI is effectively a lossless compression for XML, analogous to gzip: the original formatting is lost, but no information is lost in the conversion from XML to FI and back to XML. While the purpose of general-purpose compression is to reduce physical data size, FI aims to optimize both document size and processing performance.
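The difference between losing formatting and losing information can be illustrated without an FI library. The sketch below uses canonical XML from Python's standard library (xml.etree.ElementTree.canonicalize, Python 3.8 or later) as a rough stand-in for comparing infosets. The sample documents and the helper name same_information are assumptions made for illustration, and canonical XML only approximates infoset equality.

```python
from xml.etree.ElementTree import canonicalize

# Two serializations that differ only in formatting: quote style,
# whitespace inside tags, attribute order, and empty-element syntax.
original = """<doc  b='2'   a="1"><empty/><item>x &amp; y</item></doc>"""
round_tripped = """<doc a="1" b="2"><empty></empty><item>x &amp; y</item></doc>"""

# A document whose information really differs (attribute b has another value).
changed = """<doc a="1" b="3"><empty/><item>x &amp; y</item></doc>"""


def same_information(xml_a, xml_b):
    """Compare two XML strings after canonicalization (hypothetical helper)."""
    return canonicalize(xml_a) == canonicalize(xml_b)


print(same_information(original, round_tripped))
print(same_information(original, changed))
```

The first comparison treats the two texts as equivalent even though they differ byte for byte, which is the sense in which a round trip through a binary encoding may change formatting while preserving content.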

The Fast Infoset specification is defined by both the ITU-T and the ISO/IEC standards bodies. FI is officially defined in ITU-T Rec. X.891 and ISO/IEC 24824-1, under the title Fast Infoset. The standard was published by ITU-T on May 14, 2005, and by ISO on May 4, 2007. The Fast Infoset standard document can be downloaded from the ITU website. Though the document does not assert intellectual property (IP) restrictions on implementation or use, page ii warns that notices have been received and that the subject may not be completely free of IP assertions.

A common misconception is that FI requires ASN.1 tool support. Although the formal specification uses ASN.1 notation, the standard includes Encoding Control Notation (ECN) and ASN.1 tools are not required by implementations.

An alternative to FI is FleXPath. [1]

Structure

The underlying file format is ASN.1, with tag/length/value blocks. Text values of attributes and elements are stored with length prefixes rather than end delimiters, and data segments do not require escaping of special characters. The equivalent of an end tag (a "terminator") is needed only at the end of a list of child elements. Binary data is transmitted in native format, and need not be converted to a transmission format such as base64.
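A minimal sketch of why length prefixes remove the need for escaping and base64 follows. It is not the FI wire format: the fixed 4-byte big-endian length and the function names write_chunk and read_chunks are assumptions chosen for readability, whereas the standard defines its own compact length encodings.

```python
import struct


def write_chunk(data):
    """Prefix a byte string with its length (assumed 4-byte big-endian)."""
    return struct.pack(">I", len(data)) + data


def read_chunks(stream):
    """Split a stream of length-prefixed chunks back into byte strings."""
    chunks = []
    pos = 0
    while pos < len(stream):
        (length,) = struct.unpack_from(">I", stream, pos)
        pos += 4
        chunks.append(stream[pos:pos + length])
        pos += length
    return chunks


# Text containing characters that XML would need to escape.
text = "a < b && c".encode("utf-8")
# Raw binary data that XML would normally carry as base64.
blob = bytes([0, 255, 60, 62, 38])

stream = write_chunk(text) + write_chunk(blob)
decoded = read_chunks(stream)
print(decoded[0].decode("utf-8"))
print(decoded[1] == blob)
```

Because the reader knows in advance how many bytes belong to each value, no byte inside the value can be mistaken for markup, so the payload is stored unchanged.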

Fast Infoset is a higher-level format built on ASN.1 forms and notation. Element and attribute names are stored within the octet stream, unlike traditional ASN.1 encoding schemes. In consequence, the conventional XML file can be recovered from the binary stream without reference to the XML Schema, and the XML Schema need not be expressed as an ASN.1 definition. (ASN.1 "Tags" are just type names, e.g. String, Integer, or complex types.) ASN.1 together with ECN is used to define the file format.

An index table is built for most strings, which includes element and attribute names, and their values. This means that the text of repeated tags and values only appears once per document.
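The indexing idea can be sketched as follows. This is a simplified illustration rather than the encoding defined by the standard: the token shapes ("lit", "ref"), the zero-based index numbering, and the function names are assumptions. The decoder rebuilds the same table as it reads, so the table itself does not have to be transmitted.

```python
def encode_strings(strings):
    """Emit each distinct string once; later occurrences become table indexes."""
    table = {}
    tokens = []
    for s in strings:
        if s in table:
            tokens.append(("ref", table[s]))
        else:
            table[s] = len(table)  # next free index (zero-based by assumption)
            tokens.append(("lit", s))
    return tokens


def decode_strings(tokens):
    """Rebuild the original sequence, growing the table from the literals seen."""
    table = []
    result = []
    for kind, value in tokens:
        if kind == "lit":
            table.append(value)
            result.append(value)
        else:
            result.append(table[value])
    return result


# Element names as they might occur in a document with repeated records.
names = ["item", "price", "item", "price", "item"]
tokens = encode_strings(names)
print(tokens)
print(decode_strings(tokens) == names)
```

In a document with many repeated records, most names and frequently repeated values would shrink to small integers after their first occurrence.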

Implementations

Reference implementation

A Java implementation of the FI specification is available as part of the Eclipse Implementation of JAXB. The library is open source and is distributed under the terms of the Apache License 2.0. Several projects use this implementation, including the reference implementation for JAX-WS used in Eclipse Metro.

The QtitanFastInfoset implementation for C++ is available under commercial license as a component for the Qt framework.

Performance

Because Fast Infoset documents are compressed as part of the XML generation process, producing them is much faster than applying Zip-style compression algorithms to an XML stream, although the output is not as well compressed.

SAX-type parsing performance of Fast Infoset is also much faster than parsing performance of XML 1.0, even without any Zip-style compression. Typical increases in parsing speed observed for the reference Java implementation are a factor of 10 over Java Xerces, and a factor of 4 over the Piccolo driver (one of the fastest Java-based XML parsers). [2] [3] [4]

Typical applications

Portable devices – Mobile devices typically have low-bandwidth data connections and slower CPUs. Fast Infoset uses less bandwidth than XML and is faster to process, which makes it well suited to such devices.

Storing large volumes of data – When storing XML to either file or database, the volume of data a system produces can often exceed reasonable limits, with a number of detriments: the access times go up as more data is read, CPU load goes up as XML data takes more power to process, and storage costs go up. By storing XML data in Fast Infoset format, data volume may be reduced by as much as 80 percent.

Passing XML through the Internet – When an application passes data over the internet, network bandwidth can be a major bottleneck, seriously degrading the performance of client applications and limiting the server's power to process requests.[citation needed] Reducing the size of data transferred across the internet reduces the time required to send or receive the message, and increases the number of transactions a server can process per hour.

References

  1. Amer-Yahia, Sihem, Laks VS Lakshmanan, and Shashank Pandit. "FleXPath: flexible structure and full-text querying for XML." Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM, 2004.
  2. "Fast Infoset performance reports". 2005-10-06. Archived from the original on 2011-08-07. Retrieved 2007-10-11.
  3. "Japex Report: ParsingPerformance". 2005-01-10. Archived from the original on 2011-08-07. Retrieved 2007-10-11.
  4. "Japex Report: SizePerformance". 2005-01-10. Archived from the original on 2011-08-07. Retrieved 2007-10-11.