Cosmos DB

Azure Cosmos DB
Developer(s): Microsoft
Initial release: 2017
Available in: English
Type: Multi-model database
Website: learn.microsoft.com/en-us/azure/cosmos-db/introduction

Azure Cosmos DB is a globally distributed, multi-model database service offered by Microsoft. It is designed to provide high availability, scalability, and low-latency access to data for modern applications. Unlike traditional relational databases, Cosmos DB is a NoSQL (meaning "Not only SQL", rather than "zero SQL") and vector database, [1] which means it can handle unstructured, semi-structured, structured, and vector data types. [2]

Data model

Internally, Cosmos DB stores "items" in "containers", [3] with these two concepts being surfaced differently depending on the API used (these would be "documents" in "collections" when using the MongoDB-compatible API, for example). Containers are grouped in "databases", which are analogous to namespaces above containers. Containers are schema-agnostic, which means that no schema is enforced when adding items.

By default, every field in each item is automatically indexed, generally providing good performance without tuning to specific query patterns. These defaults can be modified by setting an indexing policy which can specify, for each field, the index type and precision desired. Cosmos DB offers two types of indexes:

Containers can also enforce unique key constraints to ensure data integrity. [4]
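As a rough illustration of the two policies described above, the sketch below writes an indexing policy and a unique key policy as Python dictionaries. The property names are believed to match the JSON policy format used by the SQL API and should be checked against current documentation; the paths `/rawPayload/*` and `/email` are hypothetical.

```python
import json

# Indexing policy: keep the default "index everything" behaviour,
# but exclude one (hypothetical) bulky field from the index.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/rawPayload/*"}],
}

# Unique key policy: no two items may share the same (hypothetical) /email value.
unique_key_policy = {
    "uniqueKeys": [{"paths": ["/email"]}],
}

# Both policies are plain JSON documents, so they serialize directly.
print(json.dumps(indexing_policy, indent=2))
print(json.dumps(unique_key_policy, indent=2))
```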

Each Cosmos DB container exposes a change feed, which clients can subscribe to in order to be notified when items are added to or updated in the container. [5] As of 7 June 2021, item deletions were not exposed by the change feed. Changes are persisted by Cosmos DB, which makes it possible to request changes from any point in time since the creation of the container.
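The behaviour described above can be pictured with a small in-memory model. This is a toy sketch, not the Cosmos DB API: the class `ToyContainer`, its methods, and the integer continuation token are all hypothetical, and the model records every version of an item, which a real change feed may not do.

```python
class ToyContainer:
    """Toy model of a container with a persisted change feed."""

    def __init__(self):
        self.items = {}   # current state, keyed by item id
        self.log = []     # append-only list of (sequence number, item snapshot)
        self._seq = 0

    def upsert(self, item):
        # Inserts and updates are both recorded in the feed.
        self._seq += 1
        self.items[item["id"]] = dict(item)
        self.log.append((self._seq, dict(item)))

    def delete(self, item_id):
        # Deletions change the state but are not recorded in the feed.
        self.items.pop(item_id, None)

    def read_changes(self, after=0):
        # Return all changes after the given token, plus a new token.
        changes = [item for seq, item in self.log if seq > after]
        return changes, self._seq


container = ToyContainer()
container.upsert({"id": "a", "v": 1})
container.upsert({"id": "b", "v": 1})
first_batch, token = container.read_changes()       # read from the beginning

container.upsert({"id": "a", "v": 2})
container.delete("b")
second_batch, token = container.read_changes(token)  # resume from the token
print(second_batch)
```

Because the log is kept, a reader can start from the beginning or resume from any saved token; the deletion of item "b" never shows up in either batch.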

A "Time to Live" (or TTL) can be specified at the container level to let Cosmos DB automatically delete items after a certain amount of time, expressed in seconds. The countdown starts from the last update of the item. If needed, the TTL can also be overridden at the item level.
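The interaction between the container-level and item-level TTL can be sketched as a small function. It is a simplified model under stated assumptions: items carry a `_ts` property holding the last-update time in epoch seconds and an optional `ttl` property, a value of -1 means "never expire", and item-level values are ignored when no TTL is set on the container. The helper name `expires_at` is hypothetical.

```python
def expires_at(item, container_default_ttl):
    """Return the epoch second at which the item expires, or None if it never does.

    container_default_ttl: None (TTL not enabled), -1 (enabled, no default
    expiry), or a positive number of seconds.
    """
    if container_default_ttl is None:
        return None  # TTL is off for the container; item-level values are ignored
    ttl = item.get("ttl", container_default_ttl)  # item-level value overrides the default
    if ttl == -1:
        return None  # explicitly never expires
    return item["_ts"] + ttl  # countdown starts at the last update


print(expires_at({"id": "1", "_ts": 1000}, 60))             # container default applies
print(expires_at({"id": "2", "_ts": 1000, "ttl": 10}, 60))  # item-level override
```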

Multi-model APIs

The internal data model described in the previous section is exposed through:

| API | Containers map to | Items map to | Compatibility status and remarks |
| --- | --- | --- | --- |
| MongoDB | Collections | Documents | Compatible with wire protocol version 6 and server version 3.6 of MongoDB. [6] |
| Gremlin | Graphs | Nodes and edges | Compatible with version 3.2 of the Gremlin specification. |
| Cassandra | Table | Row | Compatible with version 4 of the Cassandra Query Language (CQL) wire protocol. |
| Azure Table Storage | Table | Item | |
| etcd | Key | Value | Compatible with version 3 of etcd. [7] |

SQL API

The SQL API lets clients create, update and delete containers and items. Items can be queried with a read-only, JSON-friendly SQL dialect. [8] As Cosmos DB embeds a JavaScript engine, the SQL API also enables:

The SQL API is exposed as a REST API, which itself is implemented in various SDKs that are officially supported by Microsoft and available for .NET Framework, .NET, [10] Node.js (JavaScript), Java and Python.
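To give a feel for the JSON-friendly SQL dialect, the sketch below builds a parameterized query as a JSON document of the kind sent to the query endpoint. The `query`/`parameters` layout is believed to match the SQL API's query format; the alias `c`, the fields `city` and `age`, and the helper `build_query` are hypothetical.

```python
import json


def build_query(city, min_age):
    """Build a parameterized query document (hypothetical fields and alias)."""
    return {
        "query": (
            "SELECT c.id, c.name FROM c "
            "WHERE c.city = @city AND c.age >= @minAge"
        ),
        "parameters": [
            {"name": "@city", "value": city},
            {"name": "@minAge", "value": min_age},
        ],
    }


# The body is plain JSON; values travel separately from the query text,
# so they never need to be spliced into the SQL string.
body = json.dumps(build_query("Seattle", 21))
print(body)
```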

Partitioning

Cosmos DB added automatic partitioning capability in 2016 with the introduction of partitioned containers. Behind the scenes, partitioned containers span multiple physical partitions with items distributed by a client-supplied partition key. Cosmos DB automatically decides how many partitions to spread data across depending on the size and throughput needs. When partitions are added or removed, the operation is performed without any downtime so data remains available while it is re-balanced across the new or remaining partitions.
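The idea of distributing items by a client-supplied partition key can be sketched with a simple hash. This is an illustration only: the hash function, the modulo assignment, and the helper name `physical_partition` are assumptions, and the service's actual hashing and rebalancing scheme is not described here.

```python
import hashlib
from collections import Counter


def physical_partition(partition_key: str, partition_count: int) -> int:
    """Map a partition key to a partition index (illustrative scheme only)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % partition_count


# Items with the same partition key always land on the same partition.
keys = [f"customer-{n}" for n in range(1000)]
spread = Counter(physical_partition(k, 4) for k in keys)
print(sorted(spread.items()))
```

A plain modulo scheme like this one would move most keys whenever the partition count changes, which is one reason real systems use more elaborate assignments.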

Before partitioned containers were available, it was common to write custom code to partition data and some of the Cosmos DB SDKs explicitly supported several different partitioning schemes. That mode is still available but only recommended when storage and throughput requirements do not exceed the capacity of one container, or when the built-in partitioning capability does not otherwise meet the application's needs.

Tunable throughput

Developers can specify the desired throughput to match the application's expected load. Cosmos DB reserves resources (memory, CPU and IOPS) to guarantee the requested throughput while keeping request latency below 10 ms for both reads and writes at the 99th percentile. Throughput is specified in Request Units (RUs) per second. The cost to read a 1 KB item is 1 Request Unit (or 1 RU). Select by 'id' operations consume fewer RUs than Delete, Update, and Insert operations on the same document. Large queries (e.g. aggregations like count) and stored procedure executions can consume hundreds to thousands of RUs, depending on the complexity of the operations needed. [11] The minimum billing is per hour.
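A back-of-the-envelope throughput estimate follows directly from the per-operation costs. The sketch below uses the 1 RU cost of a 1 KB read stated above; the write cost of 5 RUs and the 20% headroom are placeholder assumptions, and the helper name `required_throughput` is hypothetical. Real per-operation costs should be measured.

```python
import math


def required_throughput(reads_per_sec, writes_per_sec,
                        ru_per_read=1.0, ru_per_write=5.0, headroom_pct=20):
    """Estimate the RU/s to provision for a steady workload of 1 KB items.

    ru_per_write and headroom_pct are assumed placeholder values.
    """
    raw = reads_per_sec * ru_per_read + writes_per_sec * ru_per_write
    return math.ceil(raw * (100 + headroom_pct) / 100)


# 500 reads/s and 100 writes/s: 500*1 + 100*5 = 1000 RU/s, plus 20% headroom.
print(required_throughput(500, 100))  # prints 1200
```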

Throughput can be provisioned at either the container or the database level. When provisioned at the database level, the throughput is shared across all the containers within that database, with the additional ability to have dedicated throughput for some containers. The throughput provisioned on an Azure Cosmos container is exclusively reserved for that container. [12] The default maximum RUs that can be provisioned per database and per container are 1,000,000 RUs, but customers can get this limit increased by contacting customer support.

As a costing example for a single-region instance, counting 1,000,000 records of 1 KB each in 5 s requires 1,000,000 RUs at $0.008/h, which would equal $800. Adding a second region doubles the cost.

Global distribution

Cosmos DB databases can be configured to be available in any of the Microsoft Azure regions (54 regions as of December 2018), letting application developers place their data closer to where their users are. [13] Each container's data gets transparently replicated across all configured regions. Adding or removing regions is performed without any downtime or impact on performance. By leveraging Cosmos DB's multi-homing API, applications don't have to be updated or redeployed when regions are added or removed, as Cosmos DB will automatically route their requests to the regions that are available and closest to their location.
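The routing behaviour described above can be pictured as picking the first region, in order of preference, that is currently available. This is a conceptual sketch rather than the service's routing logic; the helper `pick_region` and the region lists are hypothetical.

```python
def pick_region(preferred, available):
    """Return the first preferred region that is currently available."""
    for region in preferred:
        if region in available:
            return region
    raise RuntimeError("no preferred region is currently available")


# Preference order reflects proximity to the application (hypothetical).
preferred = ["West Europe", "North Europe", "East US"]

print(pick_region(preferred, {"West Europe", "North Europe", "East US"}))
# If the closest region is removed or unavailable, requests fall through
# to the next one without any change to the application.
print(pick_region(preferred, {"North Europe", "East US"}))
```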

Consistency levels

Data consistency is configurable on Cosmos DB, letting application developers choose among five different levels, from strongest to weakest: strong, bounded staleness, session, consistent prefix, and eventual. [14]

The desired consistency level is defined at the account level but can be overridden on a per-request basis by using a specific HTTP header or the corresponding feature exposed by the SDKs. All five consistency levels have been specified and verified using the TLA+ specification language, with the TLA+ model being open-sourced on GitHub. [16]
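A per-request override can be sketched as a helper that produces the HTTP header. Several details here are assumptions rather than statements from the text: the header name `x-ms-consistency-level`, the exact spelling of the level values, and the rule that a request may only ask for the same or a weaker level than the account default. The helper `consistency_header` is hypothetical.

```python
# Levels ordered from strongest to weakest (value spellings are assumed).
LEVELS = ["Strong", "BoundedStaleness", "Session", "ConsistentPrefix", "Eventual"]


def consistency_header(account_default, requested):
    """Build the override header, refusing requests stronger than the default."""
    if LEVELS.index(requested) < LEVELS.index(account_default):
        raise ValueError(
            f"{requested} is stronger than the account default {account_default}"
        )
    return {"x-ms-consistency-level": requested}


# An account configured for session consistency relaxing one read to eventual.
print(consistency_header("Session", "Eventual"))
```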

Multi-master

Cosmos DB's original distribution model involves one single write region, with all other regions being read-only replicas. In March 2018, a new multi-master capability was announced, enabling multiple regions to be write replicas within a global deployment. Potential merge conflicts that may arise when different write regions issue concurrent, conflicting writes can be resolved by either the default Last Write Wins policy, or a custom JavaScript function.
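The Last Write Wins policy can be sketched as choosing, among conflicting versions of an item, the one with the highest value of a designated numeric property. The sketch assumes that property is the `_ts` last-update timestamp; the helper `last_write_wins` and the sample items are hypothetical.

```python
def last_write_wins(versions, path="_ts"):
    """Return the conflicting version with the highest value at `path`."""
    return max(versions, key=lambda version: version[path])


# Two regions accepted concurrent writes to the same item.
conflicting = [
    {"id": "cart-1", "items": 2, "region": "West US", "_ts": 1700000100},
    {"id": "cart-1", "items": 3, "region": "North Europe", "_ts": 1700000105},
]

winner = last_write_wins(conflicting)
print(winner["region"])  # prints North Europe
```

A custom resolution function would replace the `max` call with application-specific merge logic, for example combining both carts instead of discarding one.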

Analytical Store

Announced in May 2020, [17] the analytical store is a fully isolated column store that enables large-scale analytics against operational data in Azure Cosmos DB without affecting transactional workloads. It addresses the complexity and latency of the traditional ETL pipelines otherwise needed to maintain a data repository optimized for online analytical processing: operational data is automatically synced into a separate column store suited to large-scale analytical queries, which reduces the latency of such queries.

Using Microsoft Azure Synapse Link [18] for Cosmos DB, it is possible to build no-ETL hybrid transactional/analytical processing solutions by linking Synapse Analytics directly to the Azure Cosmos DB analytical store. This enables near-real-time, large-scale analytics to run directly on the operational data.

Reception

Gartner Research positioned Microsoft as a leader in its Magic Quadrant for Operational Database Management Systems in 2016 [19] and called out the unique capabilities of Cosmos DB in its write-up.

Real-world use cases

Microsoft utilizes Cosmos DB in many of its own apps, [20] including Microsoft Office, Skype, Active Directory, Xbox, and MSN.

To build more globally resilient applications and systems, Cosmos DB can be combined with other Azure services, such as Azure App Service and Azure Traffic Manager. [21]

Cosmos DB Profiler

The Cosmos DB Profiler cloud cost optimization tool detects inefficient data queries in the interactions between an application and its Cosmos DB database. The profiler alerts users to wasted performance and excessive cloud expenditures. It also recommends how to resolve them by isolating and analyzing the code and directing its users to the exact location. [22]

Limitations

References

  1. "Vector Database". learn.microsoft.com. Retrieved 2024-03-30.
  2. Kumar, Chandan (7 March 2023). "Azure Cosmos DB and NoSQL databases". skillzcafe. Retrieved 2023-04-11.
  3. "Working with Azure Cosmos DB databases, containers and items". docs.microsoft.com. Retrieved 2018-12-13.
  4. "Unique keys in Azure Cosmos DB". Dibran's Blog. 3 July 2018. Retrieved 2018-12-13.
  5. "Working with the change feed support in Azure Cosmos DB". docs.microsoft.com. Retrieved 2021-07-03.
  6. "Azure Cosmos DB API now supports MongoDB version 3.6". azure.microsoft.com. Retrieved 2020-02-11.
  7. "Introduction to the Azure Cosmos DB etcd API". docs.microsoft.com. Retrieved 2020-06-10.
  8. "SQL language syntax in Azure Cosmos DB". docs.microsoft.com. Retrieved 2018-12-13.
  9. Maccherone, Larry. "Announcing documentdb-lumenize". blog.lumenize.com. Retrieved 2016-12-11.
  10. "Using Azure DocumentDB and ASP.NET Core for extreme NoSQL performance". auth0.com.
  11. "Provisioned Throughput: Request Units in Azure Cosmos DB". docs.microsoft.com. Retrieved 2019-07-21.
  12. "Provision throughput on containers and databases". docs.microsoft.com. Retrieved 2019-07-21.
  13. "How to distribute data globally with Azure Cosmos DB". docs.microsoft.com. Retrieved 2017-08-22.
  14. "Diving Deep Into Different Consistency Levels Of Azure Cosmos DB". www.c-sharpcorner.com. Retrieved 2018-12-13.
  15. "Tunable data consistency levels in Azure Cosmos DB". docs.microsoft.com. Microsoft. Retrieved 2017-08-22.
  16. "Azure/azure-cosmos-tla: Azure Cosmos TLA+ specifications". GitHub. Microsoft Azure. 2018-12-09. Retrieved 2018-12-13.
  17. "Microsoft Announces a New Pricing Model Option for Azure Cosmos DB and More Capabilities". www.infoq.com. Retrieved 2020-06-20.
  18. "A closer look at Azure Synapse Link". ZDNet. Retrieved 2017-04-15.
  19. "Magic Quadrant for Operational Database Management Systems". www.gartner.com. Retrieved 2016-12-11.
  20. http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf
  21. Pietschmann, Chris (28 June 2017). "Building Globally Resilient Apps with Azure App Service and Cosmos DB". Build5Nines.com. Opsgility. Retrieved 2018-01-30.
  22. "Cosmos DB Profiler". hibernatingrhinos.com. Hibernating Rhinos. Retrieved 2020-05-20.
  23. "Add Group By support for Aggregate Functions". feedback.azure.com. Retrieved 2019-03-31.