PureXML

Last updated

pureXML is the native XML storage feature in the IBM Db2 data server. pureXML provides query languages, storage technologies, indexing technologies, and other features to support XML data. The word pure in pureXML was chosen to indicate that Db2 natively stores and natively processes XML data in its inherent hierarchical structure, as opposed to treating XML data as plain text or converting it into a relational format. [1]

Contents

Technical information

Db2 includes two distinct storage mechanisms: one for efficiently managing traditional SQL data types, and another for managing XML data. The underlying storage mechanism is transparent to users and applications; they simply use SQL (including SQL with XML extensions or SQL/XML) or XQuery to work with the data.

XML data is stored in columns of Db2 tables that have the XML data type. XML data is stored in a parsed format that reflects the hierarchical nature of the original XML data. As such, pureXML uses trees and nodes as its model for storing and processing XML data. If you instruct Db2 to validate XML data against an XML schema prior to storage, Db2 annotates all nodes in the XML hierarchy with information about the schema types; otherwise, it will annotate the nodes with default type information. Upon storage, Db2 preserves the internal structure of XML data, converting its tag names and other information into integer values. Doing so helps conserve disk space and also improves the performance of queries that use navigational expressions. However, users aren't aware of this internal representation. Finally, Db2 automatically splits XML nodes across multiple database pages, as needed.

XML schemas specify which XML elements are valid, in what order these elements should appear in XML data, which XML data types are associated with each element, and so on. pureXML allows you to validate the cells in a column of XML data against no schema, one schema, or multiple schemas. pureXML also provides tools to support evolving XML schemas.

IBM has enhanced its programming language interfaces to support access to its XML data. These enhancements span Java (JDBC), C (embedded SQL and call-level interface), COBOL (embedded SQL), PHP, and Microsoft's .NET Framework (through the DB2.NET provider).

History

pureXML was first included in the DB2 9 for Linux, Unix, and Microsoft Windows release, which was codenamed Viper, in June 2006. [2] It was available on DB2 9 for z/OS in March 2007. [3] In October 2007, IBM released DB2 9.5 with improved XML data transaction performance and improved storage savings. [4] In June 2009, IBM released DB2 9.7 with XML supported for database-partitioned, range-partitioned, and multi-dimensionally clustered tables as well as compression of XML data and indices. [5]

Competition

Db2 is a hybrid data server—it offers data management for traditional relational data, as well as providing native XML data management. Other vendors that offer data management for both relational data and native XML storage include Oracle with its 11g product and Microsoft with its SQL Server product.

pureXML also competes with native XML databases like BaseX, eXist, MarkLogic or Sedna.

Books

IBM International Technical Support Organization (ITSO) has published the following books, which are available in print or as free e-books:

The following books are also available for purchase:

Education and training

The following pureXML classroom and online courses are available from IBM Education:

See also

Related Research Articles

Database Organized collection of data in computing

In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spans formal techniques and practical considerations including data modeling, efficient data representation and storage, query languages, security and privacy of sensitive data, and distributed computing issues including supporting concurrent access and fault tolerance.

A relational database is a digital database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems have an option of using the SQL for querying and maintaining the database.

SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data, i.e. data incorporating relations among entities and variables. SQL offers two main advantages over older read–write APIs such as ISAM or VSAM. Firstly, it introduced the concept of accessing many records with one single command. Secondly, it eliminates the need to specify how to reach a record, e.g. with or without an index.

IBM Db2 Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. They initially supported the relational model, but were extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

A hierarchical database model is a data model in which the data are organized into a tree-like structure. The data are stored as records which are connected to one another through links. A record is a collection of fields, with each field containing only one value. The type of a record defines which fields the record contains.

Query languages, data query languages or database query languages (DQLs) are computer languages used to make queries in databases and information systems. A well known example is the Structured Query Language (SQL).

Physical schema Representation of a data design

A physical data model is a representation of a data design as implemented, or intended to be implemented, in a database management system. In the lifecycle of a project it typically derives from a logical data model, though it may be reverse-engineered from a given database implementation. A complete physical data model will include all the database artifacts required to create relationships between tables or to achieve performance goals, such as indexes, constraint definitions, linking tables, partitioned tables or clusters. Analysts can usually use a physical data model to calculate storage estimates; it may include specific storage allocation details for a given database system.

ADO.NET is a data access technology from the Microsoft .NET Framework that provides communication between relational and non-relational systems through a common set of components. ADO.NET is a set of computer software components that programmers can use to access data and data services from a database. It is a part of the base class library that is included with the Microsoft .NET Framework. It is commonly used by programmers to access and modify data stored in relational database systems, though it can also access data in non-relational data sources. ADO.NET is sometimes considered an evolution of ActiveX Data Objects (ADO) technology, but was changed so extensively that it can be considered an entirely new product.

An XML database is a data persistence software system that allows data to be specified, and sometimes stored, in XML format. This data can be queried, transformed, exported and returned to a calling system. XML databases are a flavor of document-oriented databases which are in turn a category of NoSQL database.

The following tables compare general and technical information for a number of relational database management systems. Please see the individual products' articles for further information. Unless otherwise specified in footnotes, comparisons are based on the stable versions without any add-ons, extensions or external programs.

Essbase is a multidimensional database management system (MDBMS) that provides a platform upon which to build analytic applications. Essbase began as a product from Arbor Software, which merged with Hyperion Software in 1998. Oracle Corporation acquired Hyperion Solutions Corporation in 2007, as of 2009 Oracle markets Essbase as "Oracle Essbase", both on-premises and in Oracle's Cloud Infrastructure (OCI). Until late 2005 IBM also marketed an OEM version of Essbase as DB2 OLAP Server.

Entity–attribute–value model (EAV) is a data model to encode, in a space-efficient manner, entities where the number of attributes that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest. Such entities correspond to the mathematical notion of a sparse matrix.

A database model is a type of data model that determines the logical structure of a database. It fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relational model, which uses a table-based format.

Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which may run either on the same computer or on another computer across a network. Microsoft markets at least a dozen different editions of Microsoft SQL Server, aimed at different audiences and for workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.

Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.

DatabaseSpy SQL database profiling tool and GUI

DatabaseSpy is a multi-database query, design, and database comparison tool from Altova, the creator of XMLSpy. DatabaseSpy connects to many major relational databases, facilitating SQL querying, database structure design, database content editing, and database comparison and conversion.

Versant Object Database (VOD) is an object database software product developed by Versant Corporation.

The following is provided as an overview of and topical guide to databases:

Apache Drill Open-source software framework

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system, also productized as BigQuery. Drill is an Apache top-level project.

The history of Microsoft SQL Server begins with the first Microsoft SQL Server database product - SQL Server v1.0, a 16-bit Relational Database for the OS/2 operating system, released in 1989.

References

  1. "IBM Developer". IBM .
  2. "IBM News room - 2006-06-08 IBM Transforms Database Market With Introduction of DB2 - United States". Archived from the original on 2012-10-11.
  3. "IBM News room - 2007-03-06 IBM Unveils DB2 Viper for the Mainframe - United States". Archived from the original on 2012-10-11.
  4. "IBM News room - 2007-10-15 IBM Extends Data Server Technology Lead With Introduction of DB2 "Viper 2" - United States". Archived from the original on 2012-10-11.
  5. "IBM News room - 2009-04-22 IBM Database Software Improves Operational Efficiency and Cuts Storage Costs by Up to 75% - United States". Archived from the original on 2012-11-21.

Online communities

Online communities allow pureXML users to network with fellow professionals.