Document-oriented database

A document-oriented database, or document store, is a computer program and data storage system designed for storing, retrieving and managing document-oriented information, also known as semi-structured data. [1]

Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown [2] with the use of the term NoSQL itself. XML databases are a subclass of document-oriented databases that are optimized to work with XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.

Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed: in a key-value store, the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization. Although the difference is often negligible due to tools in the systems, [note 1] conceptually the document store is designed to offer a richer experience with modern programming techniques.

Document databases [note 2] contrast strongly with the traditional relational database (RDB). Relational databases generally store data in separate tables that are defined by the programmer, and a single object may be spread across several tables. Document databases store all information for a given object in a single instance in the database, and every stored object can be different from every other. This eliminates the need for object-relational mapping while loading data into the database.

Documents

The central concept of a document-oriented database is the notion of a document. While each document-oriented database implementation differs on the details of this definition, in general, they all assume documents encapsulate and encode data (or information) in some standard format or encoding. Encodings in use include XML, YAML, JSON, as well as binary forms like BSON.

Documents in a document store are roughly equivalent to the programming concept of an object. They are not required to adhere to a standard schema, nor will they have all the same sections, slots, parts or keys. Generally, programs using objects have many different types of objects, and those objects often have many optional fields. Every object, even those of the same class, can look very different. Document stores are similar in that they allow different types of documents in a single store, allow the fields within them to be optional, and often allow them to be encoded using different encoding systems. For example, the following is a document, encoded in JSON:

{"FirstName":"Bob","Address":"5 Oak St.","Hobby":"sailing"}

A second document might be encoded in XML as:

<contact>
  <firstname>Bob</firstname>
  <lastname>Smith</lastname>
  <phone type="Cell">(123) 555-0178</phone>
  <phone type="Work">(890) 555-0133</phone>
  <address>
    <type>Home</type>
    <street1>123 Back St.</street1>
    <city>Boys</city>
    <state>AR</state>
    <zip>32225</zip>
    <country>US</country>
  </address>
</contact>

These two documents share some structural elements with one another, but each also has unique elements. The structure, text and other data inside the document are usually referred to as the document's content and may be referenced via retrieval or editing methods (see below). Unlike a relational database, where every record contains the same fields and unused fields are left empty, there are no empty 'fields' in either document (record) in the above example. This approach allows new information to be added to some records without requiring that every other record in the database share the same structure.
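As a minimal sketch of this flexibility, the following Python snippet keeps an abbreviated form of the two example documents in one in-memory collection, parsing one from JSON and the other from XML. The `store` dictionary and the keys `bob-1` and `bob-2` are hypothetical names chosen for illustration and do not correspond to any particular database product.

```python
import json
import xml.etree.ElementTree as ET

# Two documents in different encodings, abbreviated from the examples above.
json_doc = '{"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"}'
xml_doc = (
    "<contact><firstname>Bob</firstname><lastname>Smith</lastname>"
    '<phone type="Cell">(123) 555-0178</phone>'
    '<phone type="Work">(890) 555-0133</phone></contact>'
)

# Hypothetical in-memory "store": key -> parsed document.
store = {
    "bob-1": json.loads(json_doc),
    "bob-2": ET.fromstring(xml_doc),
}

# Each document exposes only the fields it actually contains;
# neither one carries empty placeholders for the other's fields.
json_fields = sorted(store["bob-1"].keys())
xml_fields = sorted({child.tag for child in store["bob-2"]})
print(json_fields)
print(xml_fields)
```

Neither document needs a placeholder for `Hobby` or `lastname`; the set of fields is a property of each document rather than of the store.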

Document databases typically provide for additional metadata to be associated with and stored along with the document content. That metadata may be related to facilities the datastore provides for organizing documents, providing security, or other implementation specific features.

CRUD operations

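The core operations that a document-oriented database supports for documents are similar to those of other databases, and while the terminology is not perfectly standardized, most practitioners will recognize them as CRUD: creation (or insertion), retrieval (or query, search, read), update (or edit), and deletion (or removal).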

Keys

Documents are addressed in the database via a unique key that represents that document. This key is a simple identifier (or ID), typically a string, a URI, or a path. The key can be used to retrieve the document from the database. Typically the database retains an index on the key to speed up document retrieval, and in some cases the key is required to create or insert the document into the database.
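Key-based addressing can be sketched with a small in-memory class. This is an illustrative toy, not the API of any real database: the class name `DocumentStore`, its method names, and the key `contacts/bob` are all assumptions made for the example. A Python dictionary stands in for the index on the key.

```python
import uuid


class DocumentStore:
    """Hypothetical in-memory store mapping unique keys to documents."""

    def __init__(self):
        self._docs = {}  # the dict plays the role of the index on the key

    def create(self, doc, key=None):
        # Generate a key when the caller does not supply one.
        key = key or str(uuid.uuid4())
        if key in self._docs:
            raise KeyError(f"duplicate key: {key}")
        self._docs[key] = dict(doc)
        return key

    def read(self, key):
        # Returns None when no document is stored under the key.
        return self._docs.get(key)

    def update(self, key, doc):
        if key not in self._docs:
            raise KeyError(key)
        self._docs[key] = dict(doc)

    def delete(self, key):
        self._docs.pop(key, None)


store = DocumentStore()
key = store.create({"FirstName": "Bob", "Hobby": "sailing"}, key="contacts/bob")
store.update(key, {"FirstName": "Bob", "Hobby": "rowing"})
fetched = store.read(key)
store.delete(key)
```

Here the key is a path-like string supplied by the caller; omitting it makes the sketch fall back to a generated identifier, mirroring the two styles described above.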

Retrieval

Another defining characteristic of a document-oriented database is that, beyond the simple key-to-document lookup that can be used to retrieve a document, the database offers an API or query language that allows the user to retrieve documents based on content (or metadata). For example, you may want a query that retrieves all the documents with a certain field set to a certain value. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to another. Likewise, the specific set of indexing options and configuration that are available vary greatly by implementation.

It is here that the document store varies most from the key-value store. In theory, the values in a key-value store are opaque to the store; they are essentially black boxes. Key-value stores may offer search systems similar to those of a document store, but may have less understanding of the organization of the content. Document stores use the metadata in the document to classify the content, allowing them, for instance, to understand that one series of digits is a phone number and another is a postal code. This allows them to search on those types of data, for instance, for all phone numbers containing 555, which would ignore the zip code 55555.
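The phone-number example can be sketched as follows. The helper `find` and the sample documents are hypothetical; real document databases expose this capability through their own query languages and indexes rather than a linear scan.

```python
documents = {
    "c1": {"name": "Bob", "phone": "(123) 555-0178", "zip": "32225"},
    "c2": {"name": "Ann", "phone": "(890) 444-0133", "zip": "55555"},
    "c3": {"name": "Eve", "zip": "10001"},  # no phone field at all
}


def find(docs, field, predicate):
    """Return keys of documents whose `field` exists and satisfies `predicate`."""
    return [
        key for key, doc in docs.items()
        if field in doc and predicate(doc[field])
    ]


# Field-aware search: only phone numbers are examined, so Ann's
# zip code 55555 is not a match.
phone_matches = find(documents, "phone", lambda value: "555" in value)

# Treating each value as an opaque blob (as a plain key-value store would)
# cannot distinguish a phone number from a zip code.
opaque_matches = [key for key, doc in documents.items() if "555" in str(doc)]

print(phone_matches)
print(opaque_matches)
```

The field-aware query returns only `c1`, while the opaque scan also returns `c2` because of its zip code.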

Editing

Document databases typically provide some mechanism for updating or editing the content (or metadata) of a document, either by allowing for replacement of the entire document, or individual structural pieces of the document.
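The two editing styles can be contrasted in a short sketch. The function names `replace_document` and `update_field`, and the use of a list of keys as a path into the document, are assumptions made for illustration; actual systems define their own update syntax.

```python
import copy


def replace_document(store, key, new_doc):
    """Replace the entire document stored under `key`."""
    store[key] = copy.deepcopy(new_doc)


def update_field(store, key, path, value):
    """Set one nested field, addressed by a list of keys, leaving the rest intact."""
    node = store[key]
    for part in path[:-1]:
        # Create intermediate objects on demand.
        node = node.setdefault(part, {})
    node[path[-1]] = value


store = {"bob": {"name": "Bob", "address": {"city": "Boys", "state": "AR"}}}

# Partial update: only address.zip changes.
update_field(store, "bob", ["address", "zip"], "32225")
after_partial = copy.deepcopy(store["bob"])

# Full replacement: everything not in the new document is gone.
replace_document(store, "bob", {"name": "Robert"})
after_replace = store["bob"]
```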

Organization

Document database implementations offer a variety of ways of organizing documents, including notions of collections, tags, non-visible metadata, and directory hierarchies.

Sometimes these organizational notions vary in how much they are logical rather than physical (e.g. on disk or in memory) representations.

Relationship to other databases

Relationship to key-value stores

A document-oriented database is a specialized key-value store, which itself is another NoSQL database category. In a simple key-value store, the document content is opaque. A document-oriented database provides APIs or a query/update language that exposes the ability to query or update based on the internal structure in the document. This difference may be minor for users who do not need the richer query, retrieval, or editing APIs that are typically provided by document databases. Modern key-value stores often include features for working with metadata, blurring the line between key-value stores and document stores.

Relationship to search engines

Some search engine (aka information retrieval) systems like Apache Solr and Elasticsearch provide enough of the core operations on documents to fit the definition of a document-oriented database.

Relationship to relational databases

In a relational database, data is first categorized into a number of predefined types, and tables are created to hold individual entries, or records, of each type. The tables define the data within each record's fields, meaning that every record in the table has the same overall form. The administrator also defines the relationships between the tables, selects certain fields that they believe will be most commonly used for searching, and defines indexes on them. A key concept in the relational design is that any data that may be repeated is normally placed in its own table, and if these instances are related to each other, a column is selected to group them together: the foreign key. This design is known as database normalization. [3]

For example, an address book application will generally need to store the contact name, an optional image, one or more phone numbers, one or more mailing addresses, and one or more email addresses. In a canonical relational database, tables would be created for each of these items with predefined fields for each bit of data: the CONTACT table might include FIRST_NAME, LAST_NAME and IMAGE columns, while the PHONE_NUMBER table might include COUNTRY_CODE, AREA_CODE, PHONE_NUMBER and TYPE (home, work, etc.). The PHONE_NUMBER table also contains a foreign key column, "CONTACT_ID", which holds the unique ID number assigned to the contact when it was created. In order to recreate the original contact, the database engine uses the foreign keys to look for the related items across the group of tables and reconstruct the original data.
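The normalized layout can be sketched with Python's built-in `sqlite3` module. The schema below is a simplified assumption: it keeps only a few of the columns named above (omitting IMAGE, COUNTRY_CODE and AREA_CODE) and uses lower-case table names, so it illustrates the foreign-key join rather than a complete address book design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (
        id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT
    );
    CREATE TABLE phone_number (
        id INTEGER PRIMARY KEY,
        contact_id INTEGER REFERENCES contact(id),  -- foreign key
        phone_number TEXT,
        type TEXT
    );
""")
conn.execute("INSERT INTO contact VALUES (1, 'Bob', 'Smith')")
conn.executemany(
    "INSERT INTO phone_number (contact_id, phone_number, type) VALUES (?, ?, ?)",
    [(1, "(123) 555-0178", "Cell"), (1, "(890) 555-0133", "Work")],
)

# Reassemble the contact by following the foreign key across tables.
rows = conn.execute("""
    SELECT c.first_name, c.last_name, p.type, p.phone_number
    FROM contact AS c
    JOIN phone_number AS p ON p.contact_id = c.id
    WHERE c.id = ?
    ORDER BY p.type
""", (1,)).fetchall()
conn.close()

for row in rows:
    print(row)
```

The contact's data comes back as one row per phone number, and the application is left to fold those rows back into a single object.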

In contrast, in a document-oriented database there may be no internal structure that maps directly onto the concept of a table, and the fields and relationships generally don't exist as predefined concepts. Instead, all of the data for an object is placed in a single document, and stored in the database as a single entry. In the address book example, the document would contain the contact's name, image, and any contact info, all in a single record. That entry is accessed through its key, which allows the database to retrieve and return the document to the application. No additional work is needed to retrieve the related data; all of this is returned in a single object.
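A rough sketch of the document-oriented equivalent follows. The field names, the key `contact:1`, and the dictionary used as a store are hypothetical; the point is only that one lookup by key yields the whole object.

```python
import json

contact_document = {
    "first_name": "Bob",
    "last_name": "Smith",
    "phones": [
        {"type": "Cell", "number": "(123) 555-0178"},
        {"type": "Work", "number": "(890) 555-0133"},
    ],
    "addresses": [{"type": "Home", "city": "Boys", "state": "AR"}],
}

# Hypothetical store: key -> serialized document.
store = {"contact:1": json.dumps(contact_document)}

# A single lookup by key returns the whole object, with phone numbers
# and addresses already nested inside it; no join is required.
contact = json.loads(store["contact:1"])
phone_types = [phone["type"] for phone in contact["phones"]]
print(phone_types)
```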

A key difference between the document-oriented and relational models is that the data formats are not predefined in the document case. In most cases, any sort of document can be stored in any database, and those documents can change in type and form at any time. If one wishes to add a COUNTRY_FLAG to a CONTACT, this field can be added to new documents as they are inserted; this will have no effect on the database or the existing documents already stored. To aid retrieval of information from the database, document-oriented systems generally allow the administrator to provide hints to the database to look for certain types of information. These work in a similar fashion to indexes in the relational case. Most also offer the ability to add additional metadata outside of the content of the document itself, for instance, tagging entries as being part of an address book, which allows the programmer to retrieve related types of information, like "all the address book entries". This provides functionality similar to a table, but separates the concept (categories of data) from its physical implementation (tables).
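The following sketch combines the three ideas in this paragraph: adding a field to new documents only, declaring an index as a retrieval hint, and tagging documents with metadata kept outside their content. The class `TaggedStore` and its methods are invented for illustration and are not modeled on a specific product.

```python
from collections import defaultdict


class TaggedStore:
    """Hypothetical store with tag metadata and optional field indexes."""

    def __init__(self):
        self.docs = {}
        self.tags = defaultdict(set)  # tag -> keys; kept outside the documents
        self.indexes = {}             # field -> {value -> set of keys}

    def create_index(self, field):
        # The "hint": ask the store to track one field for fast lookup.
        index = defaultdict(set)
        for key, doc in self.docs.items():
            if field in doc:
                index[doc[field]].add(key)
        self.indexes[field] = index

    def insert(self, key, doc, tags=()):
        self.docs[key] = doc
        for tag in tags:
            self.tags[tag].add(key)
        for field, index in self.indexes.items():
            if field in doc:
                index[doc[field]].add(key)

    def by_tag(self, tag):
        return sorted(self.tags.get(tag, set()))

    def by_field(self, field, value):
        return sorted(self.indexes[field].get(value, set()))


store = TaggedStore()
store.create_index("COUNTRY_FLAG")
store.insert("c1", {"FIRST_NAME": "Bob"}, tags=["address-book"])
# A newer document carries COUNTRY_FLAG; the older one is left untouched.
store.insert("c2", {"FIRST_NAME": "Ann", "COUNTRY_FLAG": "US"}, tags=["address-book"])
store.insert("n1", {"title": "Shopping list"}, tags=["notes"])

address_book = store.by_tag("address-book")
flagged = store.by_field("COUNTRY_FLAG", "US")
```

Retrieving by the `address-book` tag behaves much like selecting from a table, yet the tag is only metadata: documents of any shape can carry it.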

In the classic normalized relational model, objects in the database are represented as separate rows of data with no inherent structure beyond that given to them as they are retrieved. This leads to problems when trying to translate programming objects to and from their associated database rows, a problem known as object-relational impedance mismatch. [4] Document stores more closely, or in some cases directly, map programming objects into the store. These are often marketed using the term NoSQL.
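To illustrate how directly a programming object can map to a document, the sketch below round-trips a small object graph through JSON using Python's standard `dataclasses` module. The `Contact` and `Phone` classes and the helper functions are assumptions made for the example; they stand in for whatever object model an application uses.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Phone:
    type: str
    number: str


@dataclass
class Contact:
    first_name: str
    last_name: str
    phones: list = field(default_factory=list)


def to_document(contact):
    """Serialize the whole object graph to one JSON document."""
    return json.dumps(asdict(contact))


def from_document(text):
    """Rebuild the object graph from a JSON document."""
    data = json.loads(text)
    phones = [Phone(**phone) for phone in data.pop("phones")]
    return Contact(phones=phones, **data)


original = Contact("Bob", "Smith", [Phone("Cell", "(123) 555-0178")])
document = to_document(original)
restored = from_document(document)
```

The nested list of phones travels inside the document, so no separate table or join is involved in saving or restoring the object.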

Implementations

Name | Publisher | License | Languages supported | Notes | RESTful API
Aerospike | Aerospike | AGPL and Proprietary | C, C#, Java, Scala, Python, Node.js, PHP, Go, Rust, Spring Framework | Aerospike is a flash-optimized and in-memory distributed key value NoSQL database which also supports a document store model. [5] | Yes [6]
AllegroGraph | Franz, Inc. | Proprietary | Java, Python, Common Lisp, Ruby, Scala, C#, Perl | The database platform supports document store and graph data models in a single database. Supports JSON, JSON-LD, RDF, full-text search, ACID, two-phase commit, Multi-Master Replication, Prolog and SPARQL. | Yes [7]
ArangoDB | ArangoDB | Apache License | C, C#, Java, Python, Node.js, PHP, Scala, Go, Ruby, Elixir | The database system supports document store as well as key/value and graph data models with one database core and a unified query language AQL (ArangoDB Query Language). | Yes [8]
BaseX | BaseX Team | BSD License | Java, XQuery | Support for XML, JSON and binary formats; client-/server based architecture; concurrent structural and full-text searches and updates. | Yes
Caché | InterSystems Corporation | Proprietary | Java, C#, Node.js | Commonly used in Health, Business and Government applications. | Yes
Cloudant | Cloudant, Inc. | Proprietary | Erlang, Java, Scala, and C | Distributed database service based on BigCouch, the company's open source fork of the Apache-backed CouchDB project. Uses JSON model. | Yes
Clusterpoint Database | Clusterpoint Ltd. | Proprietary with free download | JavaScript, SQL, PHP, C#, Java, Python, Node.js, C, C++ | Distributed document-oriented XML / JSON database platform with ACID-compliant transactions; high-availability data replication and sharding; built-in full-text search engine with relevance ranking; JS/SQL query language; GIS; Available as pay-per-use database as a service or as an on-premise free software download. | Yes
Couchbase Server | Couchbase, Inc. | Apache License | C, C#, Java, Python, Node.js, PHP, SQL, Go, Spring Framework, LINQ | Distributed NoSQL Document Database, JSON model and SQL based Query Language. | Yes [9]
CouchDB | Apache Software Foundation | Apache License | Any language that can make HTTP requests | JSON over REST/HTTP with Multi-Version Concurrency Control and limited ACID properties. Uses map and reduce for views and queries. [10] | Yes [11]
CrateIO | CRATE Technology GmbH | Apache License | Java | Use familiar SQL syntax for real time distributed queries across a cluster. Based on Lucene / Elasticsearch ecosystem with built-in support for binary objects (BLOBs). | Yes [12]
Cosmos DB | Microsoft | Proprietary | C#, Java, Python, Node.js, JavaScript, SQL | Platform-as-a-Service offering, part of the Microsoft Azure platform. Builds upon and extends the earlier Azure DocumentDB. | Yes
DocumentDB | Amazon Web Services | Proprietary online service | various, REST | fully managed MongoDB v3.6-compatible database service | Yes
DynamoDB | Amazon Web Services | Proprietary | Java, JavaScript, Node.js, Go, C# .NET, Perl, PHP, Python, Ruby, Rust, Haskell, Erlang, Django, and Grails | fully managed proprietary NoSQL database service that supports key–value and document data structures | Yes
Elasticsearch | Shay Banon | Dual-licensed under Server Side Public License and Elastic license. | Java | JSON, Search engine. | Yes
eXist | eXist | LGPL | XQuery, Java | XML over REST/HTTP, WebDAV, Lucene Fulltext search, binary data support, validation, versioning, clustering, triggers, URL rewriting, collections, ACLS, XQuery Update | Yes [13]
Informix | IBM | Proprietary, with no-cost editions [14] | Various (Compatible with MongoDB API) | RDBMS with JSON, replication, sharding and ACID compliance. | Yes
Jackrabbit | Apache Foundation | Apache License | Java | Java Content Repository implementation | ?
HCL Notes (HCL Domino) | HCL | Proprietary | LotusScript, Java, Notes Formula Language | MultiValue | Yes
MarkLogic | MarkLogic Corporation | Free Developer license or Commercial [15] | Java, JavaScript, Node.js, XQuery, SPARQL, XSLT, C++ | Distributed document-oriented database for JSON, XML, and RDF triples. Built-in full-text search, ACID transactions, high availability and disaster recovery, certified security. | Yes
MongoDB | MongoDB, Inc | Server Side Public License for the DBMS, Apache 2 License for the client drivers [16] | C, C++, C#, Java, Perl, PHP, Python, Go, Node.js, Ruby, Rust, [17] Scala [18] | Document database with replication and sharding, BSON store (binary format JSON). | Yes [19] [20]
MUMPS Database | ? | Proprietary and AGPL [21] | MUMPS | Commonly used in health applications. | ?
ObjectDatabase++ | Ekky Software | Proprietary | C++, C#, TScript | Binary Native C++ class structures | ?
OpenLink Virtuoso | OpenLink Software | GPLv2[1] and proprietary | C++, C#, Java, SPARQL | Middleware and database engine hybrid | Yes
OrientDB | Orient Technologies | Apache License | Java | JSON over HTTP, SQL support, ACID transactions | Yes
Oracle NoSQL Database | Oracle Corp | Apache and proprietary | C, C#, Java, Python, node.js, Go | Shared nothing, horizontally scalable database with support for schema-less JSON, fixed schema tables, and key/value pairs. Also supports ACID transactions. | Yes
Qizx | Qualcomm | Proprietary | REST, Java, XQuery, XSLT, C, C++, Python | Distributed document-oriented XML database with integrated full-text search; support for JSON, text, and binaries. | Yes
RedisJSON | Redis | Redis Source Available License (RSAL) | Python | JSON with integrated full-text search. [22] | Yes
RethinkDB | ? | Apache License [23] | C++, Python, JavaScript, Ruby, Java | Distributed document-oriented JSON database with replication and sharding. | No
SAP HANA | SAP | Proprietary | SQL-like language | ACID transaction supported, JSON only | Yes
Sedna | sedna.org | Apache License | C++, XQuery | XML database | No
SimpleDB | Amazon Web Services | Proprietary online service | Erlang | | ?
SurrealDB | SurrealDB | Business Source License and to Apache License after 4 years | Rust | multi-modal graph, relational, document & vector database [24] | Yes
Apache Solr | Apache Software Foundation | Apache License [25] | Java | JSON, CSV, XML, and a few other formats. [26] Search engine. | Yes [27]
TerminusDB | TerminusDB | Apache License | Python, Node.js, JavaScript | The database system supports document store as well as graph data models with one database core and a unified, datalog based query language WOQL (Web Object Query Language). [28] | Yes

XML database implementations

Most XML databases are document-oriented databases.

Notes

  1. To the point that document-oriented and key-value systems can often be interchanged in operation.
  2. And key-value stores in general.

References

  1. Drake, Mark (9 August 2019). "A Comparison of NoSQL Database Management Systems and Models". DigitalOcean. Archived from the original on 13 August 2019. Retrieved 23 August 2019. Document-oriented databases, or document stores, are NoSQL databases that store data in the form of documents. Document stores are a type of key-value store: each document has a unique identifier (its key) and the document itself serves as the value.
  2. "DB-Engines Ranking per database model category".
  3. "Description of the database normalization basics". Microsoft. 14 July 2023.
  4. Ambler, Scott (22 March 2023). "The Object-Relational Impedance Mismatch". Agile Data.
  5. "Documentation | Aerospike - Key-Value Store". docs.aerospike.com. Retrieved 3 May 2021.
  6. "Documentation | Aerospike". docs.aerospike.com. Retrieved 3 May 2021.
  7. "HTTP Protocol for AllegroGraph".
  8. "Multi-model highly available NoSQL database". ArangoDB.
  9. Documentation. Archived 2012-08-20 at the Wayback Machine. Couchbase. Retrieved on 2013-09-18.
  10. "Apache CouchDB". Apache Couchdb. Archived from the original on October 20, 2011.
  11. "HTTP_Document_API - Couchdb Wiki". Archived from the original on 2013-03-01. Retrieved 2011-10-14.
  12. "Crate SQL HTTP Endpoint (Archived copy)". Archived from the original on 2015-06-22. Retrieved 2015-06-22.
  13. eXist-db Open Source Native XML Database. Exist-db.org. Retrieved on 2013-09-18.
  14. "Compare the Informix Version 12 editions". IBM. 22 July 2016.
  15. "MarkLogic Licensing". Archived from the original on 2012-01-12. Retrieved 2011-12-28.
  16. "MongoDB Licensing".
  17. "The New MongoDB Rust Driver". MongoDB. Retrieved 2018-02-01.
  18. "Community Supported Drivers Reference".
  19. "HTTP Interface — MongoDB Ecosystem". MongoDB Docs.
  20. "MongoDB Ecosystem Documentation". GitHub. June 27, 2019.
  21. "GT.M High end TP database engine". 26 September 2023.
  22. "RedisJSON - a JSON data type for Redis".
  23. "Transferring copyright to The Linux Foundation, relicensing RethinkDB under ASLv2". github.com. Retrieved 27 January 2020.
  24. Wiggers, Kyle (2023-01-04). "SurrealDB raises $6M for its database-as-a-service offering". TechCrunch. Retrieved 2024-01-19.
  25. "solr/LICENSE.txt at main · apache/solr · GitHub". github.com. Retrieved 24 December 2022.
  26. "Response Writers :: Apache Solr Reference Guide". solr.apache.org. Retrieved 24 December 2022.
  27. "Managed Resources :: Apache Solr Reference Guide". solr.apache.org. Retrieved 24 December 2022.
  28. "TerminusDB and open-source in-memory document-oriented graph database". terminusdb.com. Retrieved 2023-08-09.