Azure Cosmos DB
   HOME

TheInfoList



OR:

Azure Cosmos DB is Microsoft's proprietary globally distributed, multi-model database service "for managing data at planet-scale" launched in May 2017. It is schema-agnostic, horizontally scalable, and generally classified as a
NoSQL A NoSQL (originally referring to "non- SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed ...
database.


Data model

Internally, Cosmos DB stores "items" in "containers", with these two concepts being surfaced differently depending on the
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software Interface (computing), interface, offering a service to other pieces of software. A document or standa ...
used (these would be "documents" in "collections" when using the MongoDB-compatible API, for example). Containers are grouped in "databases", which are analogous to namespaces above containers. Containers are schema-agnostic, which means that no schema is enforced when adding items. By default, every field in each item is automatically indexed, generally providing good performance without tuning to specific query patterns. These defaults can be modified by setting an indexing policy which can specify, for each field, the index type and precision desired. Cosmos DB offers two types of indexes: * range, supporting
range Range may refer to: Geography * Range (geographic), a chain of hills or mountains; a somewhat linear, complex mountainous or hilly area (cordillera, sierra) ** Mountain range, a group of mountains bordered by lowlands * Range, a term used to i ...
and ORDER BY queries, * spatial, supporting spatial queries from points, polygons and line strings encoded in standard
GeoJSON GeoJSON is an open standard format designed for representing simple geographical features, along with their non-spatial attributes. It is based on the JSON format. The features include points (therefore addresses and locations), line strings ( ...
fragments. Containers can also enforce unique key constraints to ensure data integrity. Each Cosmos DB container exposes a change feed, which clients can subscribe to in order to get notified of new items being added or updated in the container. As of , item deletions are currently not exposed by the change feed. Changes are persisted by Cosmos DB, which makes it possible to request changes from any point in time since the creation of the container. A "
Time to Live Time to live (TTL) or hop limit is a mechanism which limits the lifespan or lifetime of data in a computer or network. TTL may be implemented as a counter or timestamp attached to or embedded in the data. Once the prescribed event count or timesp ...
" (or TTL) can be specified at the container level to let Cosmos DB automatically delete items after a certain amount of time expressed in seconds. This countdown starts after the last update of the item. If needed, the TTL can also be overloaded at the item level.


Multi-model APIs

The internal data model described in the previous section is exposed through: * a proprietary SQL API * five different compatibility APIs, exposing endpoints that are partially compatible with the wire protocols of
MongoDB MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Serve ...
,
Gremlin A gremlin is a mischievous folkloric creature invented at the beginning of the 20th century to originally explain malfunctions in aircraft and later in other machinery and processes and their operators. Depictions of these creatures vary widely ...
,
Cassandra Cassandra or Kassandra (; Ancient Greek: Κασσάνδρα, , also , and sometimes referred to as Alexandra) in Greek mythology was a Trojan priestess dedicated to the god Apollo and fated by him to utter true prophecies but never to be believe ...
, Azure Table Storage, and
etcd Container Linux (formerly CoreOS Linux) is a discontinued open-source lightweight operating system based on the Linux kernel and designed for providing infrastructure to clustered deployments, while focusing on automation, ease of application ...
; these compatibility APIs make it possible for any compatible application to connect to and use Cosmos DB through standard drivers or SDKs, while also benefiting from Cosmos DB's core features like partitioning and global distribution.


SQL API

The SQL API lets clients create, update and delete containers and items. Items can be queried with a read-only,
JSON JSON (JavaScript Object Notation, pronounced ; also ) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other ser ...
-friendly SQL dialect. As Cosmos DB embeds a
JavaScript engine A JavaScript engine is a software component that executes JavaScript code. The first JavaScript engines were mere interpreters, but all relevant modern engines use just-in-time compilation for improved performance. JavaScript engines are typical ...
, the SQL API also enables: *
Stored procedure A stored procedure (also termed proc, storp, sproc, StoPro, StoredProc, StoreProc, sp, or SP) is a subroutine available to applications that access a relational database management system (RDBMS). Such procedures are stored in the database data dic ...
s. Functions that bundle an arbitrarily complex set of operations and logic into an
ACID In computer science, ACID ( atomicity, consistency, isolation, durability) is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. In the context of databases, a sequ ...
-compliant transaction. They are isolated from changes made while the stored procedure is executing and either all write operations succeed or they all fail, leaving the database in a consistent state. Stored procedures are executed in a single partition. Therefore, the caller must provide a partition key when calling into a partitioned collection. Stored procedures can be used to make up for the lack of certain functionality. For instance, the lack of aggregation capability is made up for by the implementation of an
OLAP cube An OLAP cube is a multi-dimensional array of data. Online analytical processing (OLAP) is a computer-based technique of analyzing data to look for insights. The term ''cube'' here refers to a multi-dimensional dataset, which is also sometimes ca ...
as a stored procedure in the open sourced documentdb-lumenize project. *
Trigger Trigger may refer to: Notable animals and people ;Mononym * Trigger (horse), owned by cowboy star Roy Rogers ;Nickname * Trigger Alpert (1916–2013), American jazz bassist * "Trigger Mike" Coppola (1900–1966), American gangster ;Surname * Bru ...
s. Functions that get executed before or after specific operations (like on a document insertion for example) that can either alter the operation or cancel it. Triggers are only executed on request. *
User-defined function A user-defined function (UDF) is a function provided by the user of a program or environment, in a context where the usual assumption is that functions are built into the program or environment. UDFs are usually written for the requirement of its cr ...
s (UDF). Functions that can be called from and augment the SQL query language making up for limited SQL features. The SQL API is exposed as a
REST Rest or REST may refer to: Relief from activity * Sleep ** Bed rest * Kneeling * Lying (position) * Sitting * Squatting position Structural support * Structural support ** Rest (cue sports) ** Armrest ** Headrest ** Footrest Arts and entert ...
API, which itself is implemented in various SDKs that are officially supported by Microsoft and available for
.NET Framework The .NET Framework (pronounced as "''dot net"'') is a proprietary software framework developed by Microsoft that runs primarily on Microsoft Windows. It was the predominant implementation of the Common Language Infrastructure (CLI) until bein ...
, .NET,
Node.js Node.js is an open-source server environment. Node.js is cross-platform and runs on Windows, Linux, Unix, and macOS. Node.js is a back-end JavaScript runtime environment. Node.js runs on the V8 JavaScript Engine and executes JavaScript code ou ...
(
JavaScript JavaScript (), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of Website, websites use JavaScript on the Client (computing), client side ...
),
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
and
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
.


Partitioning

Cosmos DB added automatic partitioning capability in 2016 with the introduction of partitioned containers. Behind the scenes, partitioned containers span multiple physical partitions with items distributed by a client-supplied partition key. Cosmos DB automatically decides how many partitions to spread data across depending on the size and throughput needs. When partitions are added or removed, the operation is performed without any downtime so data remains available while it is re-balanced across the new or remaining partitions. Before partitioned containers were available, it was common to write custom code to partition data and some of the Cosmos DB SDKs explicitly supported several different partitioning schemes. That mode is still available but only recommended when storage and throughput requirements do not exceed the capacity of one container, or when the built-in partitioning capability does not otherwise meet the application's needs.


Tunable throughput

Developers can specify desired throughput to match the application's expected load. Cosmos DB reserves resources (memory, CPU and
IOPS Input/output operations per second (IOPS, pronounced ''eye-ops'') is an input/output performance measurement used to characterize computer storage devices like hard disk drives (HDD), solid state drives (SSD), and storage area networks (SAN). Like ...
) to guarantee the requested throughput while maintaining request latency below 10ms for both reads and writes at the 99th
percentile In statistics, a ''k''-th percentile (percentile score or centile) is a score ''below which'' a given percentage ''k'' of scores in its frequency distribution falls (exclusive definition) or a score ''at or below which'' a given percentage falls ...
. Throughput is specified in Request Units (RUs) per second. The cost to read a 1 KB item is 1 Request Unit (or 1 RU). Select by 'id' operations consume lower number of RUs compared to Delete, Update, and Insert operations for the same document. Large queries (e.g. aggregations like count) and stored procedure executions can consume hundreds to thousands of RUs depending on the complexity of the operations needed. The minimum billing is per hour. Throughput can be provisioned at either the container or the database level. When provisioned at the database level, the throughput is shared across all the containers within that database, with the additional ability to have dedicated throughput for some containers. The throughput provisioned on an Azure Cosmos container is exclusively reserved for that container. The default maximum RUs that can be provisioned per database and per container are 1,000,000 RUs, but customers can get this limit increased by contacting customer support. As an example of costing, using a single region instance, a count of 1,000,000 records of 1k each in 5s requires 1,000,000 RUs At , which would equal . Two regions double the cost.


Global distribution

Cosmos DB databases can be configured to be available in any of the Microsoft Azure regions (54 regions as of December 2018), letting application developers place their data closer to where their users are. Each container's data gets transparently replicated across all configured regions. Adding or removing regions is performed without any downtime or impact on performance. By leveraging Cosmos DB's multi-homing API, applications don't have to be updated or redeployed when regions are added or removed, as Cosmos DB will automatically route their requests to the regions that are available and closest to their location.


Consistency levels

Data consistency Data consistency refers to whether the same data kept at different places do or do not match. Point-in-time consistency Point-in-time consistency is an important property of backup files and a critical objective of software that creates backups. ...
is configurable on Cosmos DB, letting application developers choose among five different levels: * Eventual does not guarantee any ordering and only ensures that replicas will eventually converge * Consistent prefix adds ordering guarantees on top of eventual * Session is scoped to a single client connection and basically ensures a read-your-own-writes consistency for each client; it is the default consistency level * Bounded staleness augments consistent prefix by ensuring that reads won't lag beyond x versions of an item or some specified time window * Strong consistency (or linearizable) ensures that clients always read the latest globally committed write The desired consistency level is defined at the account level but can be overridden on a per request basis by using a specific HTTP header or the corresponding feature exposed by the SDKs. All five consistency levels have been specified and verified using the TLA+ specification language, with the TLA+ model being open-sourced on GitHub.


Multi-master

Cosmos DB's original distribution model involves one single write region, with all other regions being read-only replicas. In March 2018, a new multi-master capability was announced, enabling multiple regions to be write replicas within a global deployment. Potential merge conflicts that may arise when different write regions issue concurrent, conflicting writes can be resolved by either the default Last Write Wins policy, or a custom JavaScript function.


Analytical Store

This feature, announced in May 2020, is a fully isolated column store for enabling large scale analytics against operational data in the Azure Cosmos DB, without any impact to its transactional workloads. This feature addresses the complexity and latency challenges that occur with the traditional ETL pipelines required to have a data repository optimized to execute
Online analytical processing Online analytical processing, or OLAP (), is an approach to answer multi-dimensional analytical (MDA) queries swiftly in computing. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, repo ...
by automatically syncing the operational data into a separate column store suitable for large scale analytical queries to be performed in an optimized manner, resulting in improving the latency of such queries. Using
Microsoft Azure Microsoft Azure, often referred to as Azure ( , ), is a cloud computing platform operated by Microsoft for application management via around the world-distributed data centers. Microsoft Azure has multiple capabilities such as software as a ...
Synapse Link for Cosmos DB, it is possible to build no-ETL
Hybrid transactional/analytical processing Hybrid transaction/analytical processing (HTAP) is a term created by Gartner Inc., an information technology research and advisory company, in its early 2014 research report ''Hybrid Transaction/Analytical Processing Will Foster Opportunities for ...
solutions by directly linking to Azure Cosmos DB analytical store from Synapse Analytics. It enables to run near real-time large-scale analytics directly on the operational data.


Reception

Gartner Research positions Microsoft as the leader in the Magic Quadrant Operational Database Management Systems in 2016 and calls out the unique capabilities of Cosmos DB in their write-up.


Real-world use cases

Microsoft utilizes Cosmos DB in many of its own apps, including
Microsoft Office Microsoft Office, or simply Office, is the former name of a family of client software, server software, and services developed by Microsoft. It was first announced by Bill Gates on August 1, 1988, at COMDEX in Las Vegas. Initially a marketin ...
,
Skype Skype () is a proprietary telecommunications application operated by Skype Technologies, a division of Microsoft, best known for VoIP-based videotelephony, videoconferencing and voice calls. It also has instant messaging, file transfer, deb ...
,
Active Directory Active Directory (AD) is a directory service developed by Microsoft for Windows domain networks. It is included in most Windows Server operating systems as a set of processes and services. Initially, Active Directory was used only for centralize ...
,
Xbox Xbox is a video gaming brand created and owned by Microsoft. The brand consists of five video game consoles, as well as applications (games), streaming services, an online service by the name of Xbox network, and the development arm by the na ...
, and
MSN MSN (meaning Microsoft Network) is a web portal and related collection of Internet services and apps for Windows and mobile devices, provided by Microsoft and launched on August 24, 1995, alongside the release of Windows 95. The Microsoft Net ...
. In building a more globally-resilient application / system, Cosmos DB combines with other Azure services, such as Azure App Services and Azure Traffic Manager.


Cosmos DB Profiler

The Cosmos DB Profiler cloud cost optimization tool detects inefficient data queries in the interactions between an application and its Cosmos DB database. The profiler alerts users to wasted performance and excessive cloud expenditures. It also recommends how to resolve them by isolating and analyzing the code and directing its users to the exact location.


Limitations

* SQL is limited. Aggregations limited to COUNT, SUM, MIN, MAX, AVG functions but no support for GROUP BY or other aggregation functionality found in database systems. However, stored procedures can be used to implement in-the-database aggregation capability. * SQL joins between "tables" are not possible, * Support only for pure JSON data types. Most notably, Cosmos DB lacks support for date-time data requiring that you store this data using the available data types. For instance, it can be stored as an ISO-8601 string or epoch integer.
MongoDB MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Serve ...
, the database to which Cosmos DB is most often compared, extended JSON in their BSON binary serialization specification to cover date-time data as well as traditional number types, regular expressions, and Undefined. However, many argue that Cosmos DB's choice of pure JSON is actually an advantage as it's a better fit for JSON-based REST APIs and the JavaScript engine built into the database.


References


External links

* {{Microsoft Azure Services Platform, state=expanded Document-oriented databases Microsoft cloud services NoSQL