VTD-XML

Developer(s): XimpleWare
Stable release: 2.13_4 / July 14, 2017
Operating system: Portable
Platform: Java, C#, C and C++
Type: XML parser/indexer/slicer/editor library
License: GPL and proprietary license
Website: vtd-xml.sourceforge.io, ximpleware.wordpress.com

Virtual Token Descriptor for eXtensible Markup Language (VTD-XML) refers to a collection of cross-platform XML processing technologies centered on a non-extractive, [1] [2] "document-centric" parsing technique called Virtual Token Descriptor (VTD). Depending on the perspective, VTD-XML can be viewed as an XML parser, an indexer, an incremental content modifier, a slicer/splitter/assembler, or an editor/eraser.

VTD-XML is developed by XimpleWare and dual-licensed under the GPL and a proprietary license. It was originally written in Java, but is now also available in C, [14] C++ and C#.

Basic concept

Non-extractive, document-centric parsing

Traditionally, a lexical analyzer represents tokens (the small units of indivisible character values) as discrete string objects. This approach is designated extractive parsing. In contrast, non-extractive tokenization mandates that one keeps the source text intact, and uses offsets and lengths to describe those tokens.
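The difference between the two approaches can be sketched in a few lines of Java. This is a minimal illustration on whitespace-separated text, not XML, and the class and method names are hypothetical; it only shows that the non-extractive variant records offset/length pairs and leaves the source bytes untouched.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Contrasts extractive and non-extractive tokenization of whitespace-separated text. */
final class TokenizationSketch {

    /** Extractive: every token becomes a new String object copied out of the source. */
    static List<String> extractive(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Non-extractive: return {offset, length} pairs that point into the intact source. */
    static List<int[]> nonExtractive(byte[] source) {
        List<int[]> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length) {
            while (i < source.length && isSpace(source[i])) {
                i++; // skip whitespace between tokens
            }
            int start = i;
            while (i < source.length && !isSpace(source[i])) {
                i++; // scan to the end of the token
            }
            if (i > start) {
                tokens.add(new int[] {start, i - start});
            }
        }
        return tokens;
    }

    /** Materialize a token as a String only when the application needs its text. */
    static String text(byte[] source, int[] token) {
        return new String(source, token[0], token[1], StandardCharsets.UTF_8);
    }

    private static boolean isSpace(byte b) {
        return b == ' ' || b == '\n' || b == '\t' || b == '\r';
    }
}
```

In the non-extractive variant no string is allocated until `text` is called, which is the property the rest of the article builds on.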

Virtual token descriptor

Virtual Token Descriptor (VTD) applies the concept of non-extractive, document-centric parsing to XML processing. A VTD record uses a 64-bit integer to encode the offset, length, token type and nesting depth of a token in an XML document. Because all VTD records are 64 bits in length, they can be stored efficiently and managed as an array. [15]
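As a rough sketch of how four fields can share one 64-bit integer, the following Java class packs and unpacks a record with shifts and masks. The field widths used here (4-bit type, 8-bit depth, 20-bit length, 2 reserved bits, 30-bit offset) are an assumption made for illustration, and the class name is hypothetical; the exact layout should be taken from the VTD specification rather than from this example.

```java
/** Packs the four fields of a token descriptor into one 64-bit long (assumed layout). */
final class VtdRecordSketch {
    // Assumed layout, from high bits to low bits:
    // type (4) | depth (8) | length (20) | reserved (2) | offset (30)
    static final int TYPE_SHIFT = 60;
    static final int DEPTH_SHIFT = 52;
    static final int LENGTH_SHIFT = 32;
    static final long TYPE_MASK = 0xFL;
    static final long DEPTH_MASK = 0xFFL;
    static final long LENGTH_MASK = 0xFFFFFL;
    static final long OFFSET_MASK = 0x3FFFFFFFL;

    static long pack(int type, int depth, int length, int offset) {
        requireFits(type, TYPE_MASK, "type");
        requireFits(depth, DEPTH_MASK, "depth");
        requireFits(length, LENGTH_MASK, "length");
        requireFits(offset, OFFSET_MASK, "offset");
        return ((long) type << TYPE_SHIFT)
                | ((long) depth << DEPTH_SHIFT)
                | ((long) length << LENGTH_SHIFT)
                | (long) offset;
    }

    static int type(long record) {
        return (int) ((record >>> TYPE_SHIFT) & TYPE_MASK);
    }

    static int depth(long record) {
        return (int) ((record >>> DEPTH_SHIFT) & DEPTH_MASK);
    }

    static int length(long record) {
        return (int) ((record >>> LENGTH_SHIFT) & LENGTH_MASK);
    }

    static int offset(long record) {
        return (int) (record & OFFSET_MASK);
    }

    /** Rejects values that would overflow their field and corrupt neighbouring fields. */
    private static void requireFits(int value, long mask, String name) {
        if (value < 0 || value > mask) {
            throw new IllegalArgumentException(name + " out of range: " + value);
        }
    }
}
```

Because every record is a single `long`, a whole document's tokens fit in a `long[]`, with no per-token object allocation.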

Location cache

Location Caches (LC) build on VTD records to provide efficient random access. Organized as tables, with one table per nesting depth level, LCs contain entries modeling an XML document's element hierarchy. An LC entry is a 64-bit integer encoding a pair of 32-bit values. The upper 32 bits identify the VTD record for the corresponding element. The lower 32 bits identify that element's first child in the LC at the next lower nesting level.
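The entry layout described above can be sketched as follows. The upper/lower 32-bit split comes from the text; the use of -1 as a "no child" marker and the `childRange` lookup strategy are assumptions added for illustration, and all names are hypothetical.

```java
/** Sketch of location-cache entries: upper 32 bits = VTD index, lower 32 bits = first child. */
final class LocationCacheSketch {
    /** Assumed sentinel for an element without child elements. */
    static final int NO_CHILD = -1;

    static long entry(int vtdIndex, int firstChildIndex) {
        // Mask the lower half so a negative sentinel does not overwrite the upper half.
        return ((long) vtdIndex << 32) | (firstChildIndex & 0xFFFFFFFFL);
    }

    static int vtdIndex(long entry) {
        return (int) (entry >>> 32);
    }

    static int firstChild(long entry) {
        return (int) entry;
    }

    /**
     * Returns {start, end} (end exclusive) of the children of parentTable[i] in the
     * table for the next nesting level. The range ends where the next parent that
     * has children begins, or at the end of the child table.
     */
    static int[] childRange(long[] parentTable, int i, int childTableSize) {
        int start = firstChild(parentTable[i]);
        if (start == NO_CHILD) {
            return new int[] {0, 0};
        }
        int end = childTableSize;
        for (int j = i + 1; j < parentTable.length; j++) {
            int next = firstChild(parentTable[j]);
            if (next != NO_CHILD) {
                end = next;
                break;
            }
        }
        return new int[] {start, end};
    }
}
```

Navigating from an element to its children then becomes an array lookup rather than a pointer traversal through node objects.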

Benefits

Overview

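Virtually all the core benefits of VTD-XML are inherent to non-extractive, document-centric parsing, which provides these characteristics: the source XML text is kept intact without decoding, the internal representation is inherently persistent, and the document hierarchy is represented by 64-bit integers rather than by objects.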

Combining those characteristics permits thinking of XML purely as syntax (bits, bytes, offsets, lengths, fragments, namespace-compensated fragments, and document composition) instead of the serialization/deserialization of objects. This is a powerful way to think about XML/SOA applications.

Conformance

VTD-XML conforms strictly to XML 1.0 (except the DTD part) and Namespaces in XML 1.0. It essentially conforms to the XPath 1.0 specification, with some subtle differences in the underlying data model, and extends it with XPath 2.0 built-in functions.

Simplicity

As parser

When used in parsing mode, VTD-XML is a general-purpose, high-performance [17] XML parser that compares favorably with others.

As indexer

Because of the inherent persistence of VTD-XML, developers can write the internal representation of a parsed XML document to disk and later reload it to avoid repetitive parsing. To this end, XimpleWare has introduced VTD+XML as a binary packaging format combining VTD, LC and the XML text. It can typically be viewed in one of the following two ways:

XML content modifier

Because VTD-XML keeps the XML text intact without decoding, an application that intends to modify the content of XML only needs to modify the portions most relevant to the changes. This is in stark contrast with DOM, SAX, or StAX parsing, which incur the cost of parsing and re-serialization no matter how small the changes are.

Since VTDs refer to document elements by their offsets, changes to the length of elements occurring earlier in a document require adjustments to VTDs referring to all later elements. However, those adjustments are integer additions, albeit to many integers in multiple tables, so they are quick.
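The adjustment described above amounts to one pass over the record array. The sketch below assumes, for illustration only, that the offset occupies the low 30 bits of each 64-bit record; the class and method names are hypothetical.

```java
/** Shifts the offsets of all records at or after an edit point (assumed 30-bit offset field). */
final class OffsetShiftSketch {
    static final long OFFSET_MASK = (1L << 30) - 1;

    /**
     * Adds delta to the offset of every record whose token starts at or after
     * editOffset. A positive delta models an insertion, a negative one a deletion.
     */
    static void shiftAfter(long[] records, int editOffset, int delta) {
        for (int i = 0; i < records.length; i++) {
            long offset = records[i] & OFFSET_MASK;
            if (offset >= editOffset) {
                long shifted = offset + delta;
                if (shifted < 0 || shifted > OFFSET_MASK) {
                    throw new IllegalArgumentException("offset out of range after shift: " + shifted);
                }
                // Keep the upper fields, replace only the offset bits.
                records[i] = (records[i] & ~OFFSET_MASK) | shifted;
            }
        }
    }

    static int offset(long record) {
        return (int) (record & OFFSET_MASK);
    }
}
```

The cost is linear in the number of records after the edit point, but each step is a mask and an addition on a primitive value.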

XML slicer/splitter/assembler

An application based on VTD-XML can also use offsets and lengths to address tokens, or element fragments. This allows XML documents to be manipulated like arrays of bytes.
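A minimal sketch of this byte-level view, using only the Java standard library rather than the VTD-XML API: given the offset and length of element fragments (which a parser would supply), fragments can be cut out or reassembled in a different order without re-serializing anything. The names below are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Treats an XML document as a byte array addressed by {offset, length} fragments. */
final class SliceSketch {

    /** Copies one fragment out of the document. */
    static byte[] slice(byte[] doc, int offset, int length) {
        return Arrays.copyOfRange(doc, offset, offset + length);
    }

    /** Builds a new document from a wrapper element and fragments of the original, in the given order. */
    static byte[] assemble(byte[] doc, String openTag, int[][] fragments, String closeTag) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] open = openTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = closeTag.getBytes(StandardCharsets.UTF_8);
        out.write(open, 0, open.length);
        for (int[] f : fragments) {
            out.write(doc, f[0], f[1]); // raw bytes, no parsing or escaping involved
        }
        out.write(close, 0, close.length);
        return out.toByteArray();
    }
}
```

This sketch ignores namespace declarations inherited from ancestor elements; a fragment that relies on them would need those declarations carried along when it is moved.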

XML editor/eraser

Used as an editor/eraser, VTD-XML can directly edit or erase the underlying byte content of the XML text, provided that the token is at least as long as the intended new content. An immediate benefit of this approach is that the application can reuse the original VTD and LC right away. In contrast, when using VTD-XML to incrementally update an XML document, an application needs to reparse the updated document before it can process it.
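The in-place case can be sketched as an overwrite that refuses to grow the token. Padding the unused tail with spaces is an assumption made for this illustration (it changes the whitespace of the text node), and the names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Overwrites a token's bytes in place when the replacement fits in the original span. */
final class InPlaceEditSketch {

    /**
     * Writes replacement over doc[offset, offset + length) and pads the remainder
     * with spaces. Returns false, leaving doc untouched, if the replacement is
     * longer than the token, since that would require moving later bytes.
     */
    static boolean overwrite(byte[] doc, int offset, int length, String replacement) {
        byte[] bytes = replacement.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > length) {
            return false;
        }
        System.arraycopy(bytes, 0, doc, offset, bytes.length);
        Arrays.fill(doc, offset + bytes.length, offset + length, (byte) ' ');
        return true;
    }
}
```

Because no byte after the token moves, every existing offset and length stays valid, which is why the original records can be reused without reparsing.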

An editor can be made smart enough to track the location of each token, permitting new, longer tokens to replace existing, shorter tokens by merely addressing the new token in separate memory outside that used to store the original document. Likewise, when reordering the document, element text does not need to be copied; only the LCs need to be updated. When a complete, contiguous XML document is needed, such as when saving it, the disparate parts can be reassembled into a new, contiguous document.

Other benefits

VTD-XML also pioneers the non-blocking, stateless XPath evaluation approach. [citation needed]

Weaknesses

VTD-XML also exhibits a few noticeable shortcomings:

Areas of applications

General-purpose replacement for DOM or SAX

Because of VTD-XML's performance and memory advantages, it covers a larger portion of XML use cases than either DOM or SAX. [18]

XPath over huge XML documents

The extended edition of VTD-XML, combined with a 64-bit JVM, makes XPath-based processing possible over huge XML documents (up to 256 GB in size).

For SOA/WS/XML security

The combination of VTD-XML's high performance and incremental-update capability helps [19] [20] [21] SOA/WS/XML security applications achieve the desired level of quality of service.

For SOA/WS/XML intermediary

VTD-XML is well suited for SOA intermediary applications such as XML routers/switches/gateways, Enterprise Service Buses, and services aggregation points. All those applications perform the basic "store and forward" operations for which retaining the original XML is critical for minimizing latency. VTD-XML's incremental update capability also contributes significantly to the forwarding performance.

VTD-XML's random-access capability lends itself well to XPath-based XML routing/switching/filtering common in AJAX and SOA deployment.

Intelligent SOA/WS/XML Load-balancing and Offloading

When an XML document travels through several middle-tier SOA components, the first component to receive the message can, after inspecting the XML document, send it downstream in the VTD+XML format so that later components avoid repetitive parsing, thus improving throughput.

By the same token, an intelligent SOA load balancer can choose to generate VTD+XML for incoming/outgoing SOAP messages to offload XML parsing from the application servers that receive those messages.

XML persistence data store

When viewed from the perspective of native XML persistence, VTD-XML can be used as a human-readable, easy to use, general-purpose XML index. XML documents stored this way can be loaded into memory to be queried, updated, or edited without the overhead of parsing/re-serialization.
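The idea of storing the parsed form next to the text can be sketched with plain Java I/O. The layout below (a marker, the records, then the XML bytes) is a hypothetical format invented for this example and is not the VTD+XML file format; it only illustrates that reloading is a matter of reading integers and bytes, with no parsing step.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Stores 64-bit records together with the XML text and loads them back without parsing. */
final class IndexStoreSketch {
    /** Hypothetical marker identifying this example's layout. */
    static final int MAGIC = 0x49445831;

    static final class Loaded {
        final byte[] xml;
        final long[] records;

        Loaded(byte[] xml, long[] records) {
            this.xml = xml;
            this.records = records;
        }
    }

    static byte[] save(byte[] xml, long[] records) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(MAGIC);
            out.writeInt(records.length);
            for (long r : records) {
                out.writeLong(r);
            }
            out.writeInt(xml.length);
            out.write(xml); // the XML text is stored unchanged
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Loaded load(byte[] stored) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(stored));
            if (in.readInt() != MAGIC) {
                throw new IllegalArgumentException("unrecognized layout");
            }
            long[] records = new long[in.readInt()];
            for (int i = 0; i < records.length; i++) {
                records[i] = in.readLong();
            }
            byte[] xml = new byte[in.readInt()];
            in.readFully(xml);
            return new Loaded(xml, records);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same methods would work against file streams; in-memory streams are used here only to keep the example self-contained.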

Schemaless XML data binding

VTD-XML's combination of high performance, low memory usage, and efficient XPath evaluation makes possible a new XML data binding approach based entirely on XPath. The biggest benefits of this approach are that it no longer requires an XML schema, avoids needless object creation, and takes advantage of XML's inherent loose encoding. [22]

The data binding discussed in the cited article [22] needs to be implemented by the application: VTD-XML itself offers only accessors. In this regard, VTD-XML is not itself a data binding solution (unlike JiBX, JAXB, or XMLBeans), although, much like other XML parsers (DOM, SAX, StAX), it offers extraction functionality that data binding packages can use.
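The general shape of XPath-driven binding can be sketched with the JDK's own `javax.xml.xpath` API. This example deliberately uses the JDK's DOM-based XPath engine instead of VTD-XML, so it shows only the binding pattern (fields populated by XPath expressions, with no schema or generated classes), not VTD-XML's performance characteristics. The `Item` class and the purchase-order element names are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

/** Binds selected values of a purchase order to a plain object using XPath only. */
final class XPathBindingSketch {

    /** Hypothetical application object; only the fields the application needs are bound. */
    static final class Item {
        final String partNum;
        final double price;

        Item(String partNum, double price) {
            this.partNum = partNum;
            this.price = price;
        }
    }

    static Item bindFirstItem(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Each field is filled from one expression; unrelated elements are ignored.
            String partNum = xpath.evaluate("/purchaseOrder/items/item[1]/@partNum", doc);
            String price = xpath.evaluate("/purchaseOrder/items/item[1]/USPrice", doc);
            return new Item(partNum, Double.parseDouble(price));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot bind purchase order", e);
        }
    }
}
```

Elements that the expressions do not mention can be added to or removed from the document without breaking the binding, which is the loose coupling the approach relies on.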

Essential classes

As of Version 2.11, the Java and C# versions of VTD-XML consist of the following classes:

The extended VTD-XML consists of the following classes:

Code sample

/*
 * This Java program demonstrates how to use XMLModifier to incrementally
 * update a simple XML purchase order. It uses VTDGen's parseFile to
 * simplify programming.
 */
import com.ximpleware.*;

public class Update {
    // "throws Exception" covers the checked exceptions declared by the
    // XPath, navigation, modification and output calls below.
    public static void main(String[] argv) throws Exception {
        // Open the file, read its content into a byte array and parse it
        // with namespace awareness turned on.
        VTDGen vg = new VTDGen();
        if (vg.parseFile("oldpo.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            XMLModifier xm = new XMLModifier(vn);

            // Replace every item with part number 872-AA by a placeholder element.
            ap.selectXPath("/purchaseOrder/items/item[@partNum='872-AA']");
            int i = -1;
            while ((i = ap.evalXPath()) != -1) {
                xm.remove();
                xm.insertBeforeElement("<something/>\n");
            }

            // Set every US price below 40 to 200.
            ap.selectXPath("/purchaseOrder/items/item/USPrice[.<40]/text()");
            while ((i = ap.evalXPath()) != -1) {
                xm.updateToken(i, "200");
            }

            // Write the updated document to a new file.
            xm.output("newpo.xml");
        }
    }
}

References

  1. Zhang, Jimmy (May 19, 2004). "Non-Extractive Parsing for XML". XML.com. Retrieved 2020-07-24.
  2. XML Processing for the Future
  3. Zhang, Jimmy (January 9, 2008). "Manipulate XML Content the Ximple Way". DevX. Archived from the original on 2017-07-30. Retrieved 2020-07-24.
  4. Zhang, Jimmy (June 24, 2008). "VTD-XML: XML Processing for the Future (Part II)". Code Project. Retrieved 2020-07-24.
  5. Zhang, Jimmy (March 27, 2006). "Simplify XML processing with VTD-XML". JavaWorld. Retrieved 2020-07-24.
  6. Zhang, Jimmy (October 21, 2004). "Better, Faster XML Processing with VTD-XML". DevX. Retrieved 2020-07-24.
  7. Zhang, Jimmy (April 17, 2008). "VTD-XML: XML Processing for the Future (Part I)". Code Project. Retrieved 2020-07-24.
  8. Zhang, Jimmy (November 2, 2007). "Index XML Documents with VTD-XML". SYS-CON Publications. Archived from the original on 2007-11-05.
  9. Zhang, Jimmy (July 24, 2006). "Cut, paste, split, and assemble XML documents with VTD-XML". JavaWorld. Retrieved 2020-07-24.
  10. XML on a chip?
  11. Zhang, Jimmy (March 9, 2005). "XML on a Chip". XML.com. Retrieved 2020-07-24.
  12. XimpleWare's W3C binary XML workshop Position Paper
  13. Zhang, Jimmy (March 19, 2007). "Improve XPath Efficiency with VTD-XML". DevX. Retrieved 2020-07-24.
  14. Volkman, Victor (December 3, 2007). "VTD-XML: A New Vision of XML". Developer.com. Retrieved 2020-07-24.
  15. Virtual Token Descriptor introduction at SourceForge
  16. Zhang, Jimmy (July 31, 2006). "The Performance Woe of Binary XML". SYS-CON Publications. Archived from the original on 2006-08-08.
  17. VTD-XML Parsing/Navigation Performance Report
  18. Zhang, Jimmy (February 8, 2006). "A Step in the Right Direction: VTD-XML Improves XML Processing". DevX. Retrieved 2020-07-24.
  19. Zhang, Jimmy (January 9, 2007). "Accelerate WSS applications with VTD-XML". JavaWorld. Retrieved 2020-07-24.
  20. W3C workshop presentation on XML security
  21. Position Paper for W3C Workshop on Next Steps for XML Signature and XML Encryption
  22. Zhang, Jimmy (September 10, 2007). "Schemaless Java-XML Data Binding with VTD-XML". ONJava. Archived from the original on 2017-09-27.