HOME

TheInfoList



OR:

A document-oriented database, or document store, is a
computer program A computer program is a sequence or set of instructions in a programming language for a computer to execute. Computer programs are one component of software, which also includes documentation and other intangible components. A computer program ...
and data storage system designed for storing, retrieving and managing document-oriented information, also known as
semi-structured data Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic eleme ...
. Document-oriented databases are one of the main categories of
NoSQL A NoSQL (originally referring to "non- SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed ...
databases, and the popularity of the term "document-oriented database" has grown with the use of the term NoSQL itself.
XML database An XML database is a data persistence software system that allows data to be specified, and sometimes stored, in XML format. This data can be queried, transformed, exported and returned to a calling system. XML databases are a flavor of document- ...
s are a subclass of document-oriented databases that are optimized to work with
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
documents.
Graph databases A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the ''graph'' (or ''edge'' or ''relationship''). The graph relat ...
are similar, but add another layer, the ''relationship'', which allows them to link documents for rapid traversal. Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store, the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the ''document'' in order to extract
metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
that the database engine uses for further optimization. Although the difference is often negligible due to tools in the systems, conceptually the document-store is designed to offer a richer experience with modern programming techniques. Document databases contrast strongly with the traditional
relational database A relational database is a (most commonly digital) database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relatio ...
(RDB). Relational databases generally store data in separate ''tables'' that are defined by the programmer, and a single object may be spread across several tables. Document databases store all information for a given object in a single instance in the database, and every stored object can be different from every other. This eliminates the need for object-relational mapping while loading data into the database.


Documents

The central concept of a document-oriented database is the notion of a ''document''. While each document-oriented database implementation differs on the details of this definition, in general, they all assume documents encapsulate and encode data (or information) in some standard format or encoding. Encodings in use include
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
,
YAML YAML ( and ) (''see '') is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Exte ...
,
JSON JSON (JavaScript Object Notation, pronounced ; also ) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other ser ...
, as well as binary forms like
BSON BSON () is a computer data interchange format. The name "BSON" is based on the term JSON and stands for "Binary JSON". It is a binary form for representing simple or complex data structures including associative arrays (also known as name-value ...
. Documents in a document store are roughly equivalent to the programming concept of an object. They are not required to adhere to a standard schema, nor will they have all the same sections, slots, parts or keys. Generally, programs using objects have many different types of objects, and those objects often have many optional fields. Every object, even those of the same class, can look very different. Document stores are similar in that they allow different types of documents in a single store, allow the fields within them to be optional, and often allow them to be encoded using different encoding systems. For example, the following is a document, encoded in JSON: A second document might be encoded in XML as: Bob Smith (123) 555-0178 (890) 555-0133
Home 123 Back St. Boys AR 32225 US
These two documents share some structural elements with one another, but each also has unique elements. The structure and text and other data inside the document are usually referred to as the document's ''content'' and may be referenced via retrieval or editing methods, (see below). Unlike a relational database where every record contains the same fields, leaving unused fields empty; there are no empty 'fields' in either document (record) in the above example. This approach allows new information to be added to some records without requiring that every other record in the database share the same structure. Document databases typically provide for additional
metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
to be associated with and stored along with the document content. That metadata may be related to facilities the datastore provides for organizing documents, providing security, or other implementation specific features.


CRUD operations

The core operations that a document-oriented database supports for documents are similar to other databases, and while the terminology is not perfectly standardized, most practitioners will recognize them as
CRUD In computer programming, create, read, update, and delete (CRUD) are the four basic operations of persistent storage. CRUD is also sometimes used to describe user interface conventions that facilitate viewing, searching, and changing information u ...
: * Creation (or insertion) * Retrieval (or query, search, read or find) * Update (or edit) * Deletion (or removal)


Keys

Documents are addressed in the database via a unique ''key'' that represents that document. This key is a simple
identifier An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique ''class'' of objects, where the "object" or class may be an idea, physical countable object (or class thereof), or physical noncountable ...
(or ID), typically a string, a
URI Uri may refer to: Places * Canton of Uri, a canton in Switzerland * Úri, a village and commune in Hungary * Uri, Iran, a village in East Azerbaijan Province * Uri, Jammu and Kashmir, a town in India * Uri (island), an island off Malakula Islan ...
, or a
path A path is a route for physical travel – see Trail. Path or PATH may also refer to: Physical paths of different types * Bicycle path * Bridle path, used by people on horseback * Course (navigation), the intended path of a vehicle * Desire p ...
. The key can be used to retrieve the document from the database. Typically the database retains an
index Index (or its plural form indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on a Halo megastru ...
on the key to speed up document retrieval, and in some cases the key is required to create or insert the document into the database.


Retrieval

Another defining characteristic of a document-oriented database is that, beyond the simple key-to-document lookup that can be used to retrieve a document, the database offers an API or query language that allows the user to retrieve documents based on content (or metadata). For example, you may want a query that retrieves all the documents with a certain field set to a certain value. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to another. Likewise, the specific set of indexing options and configuration that are available vary greatly by implementation. It is here that the document store varies most from the key-value store. In theory, the values in a key-value store are opaque to the store, they are essentially black boxes. They may offer search systems similar to those of a document store, but may have less understanding about the organization of the content. Document stores use the metadata in the document to classify the content, allowing them, for instance, to understand that one series of digits is a phone number, and another is a postal code. This allows them to search on those types of data, for instance, all phone numbers containing 555, which would ignore the zip code 55555.


Editing

Document databases typically provide some mechanism for updating or editing the content (or metadata) of a document, either by allowing for replacement of the entire document, or individual structural pieces of the document.


Organization

Document database implementations offer a variety of ways of organizing documents, including notions of * Collections: groups of documents, where depending on implementation, a document may be enforced to live inside one collection, or may be allowed to live in multiple collections * Tags and non-visible metadata: additional data outside the document content * Directory hierarchies: groups of documents organized in a tree-like structure, typically based on path or URI Sometimes these organizational notions vary in how much they are logical vs physical, (e.g. on disk or in memory), representations.


Relationship to other databases


Relationship to key-value stores

A document-oriented database is a specialized key-value store, which itself is another NoSQL database category. In a simple key-value store, the document content is opaque. A document-oriented database provides APIs or a query/update language that exposes the ability to query or update based on the internal structure in the ''document''. This difference may be minor for users that do not need richer query, retrieval, or editing APIs that are typically provided by document databases. Modern key-value stores often include features for working with metadata, blurring the lines between document stores.


Relationship to search engines

Some search engine (aka
information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...
) systems like
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features a ...
and
Elasticsearch Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-l ...
provide enough of the core operations on documents to fit the definition of a document-oriented database.


Relationship to relational databases

In a relational database, data is first categorized into a number of predefined types, and ''tables'' are created to hold individual entries, or ''records'', of each type. The tables define the data within each record's ''fields'', meaning that every record in the table has the same overall form. The administrator also defines the ''relationships'' between the tables, and selects certain fields that they believe will be most commonly used for searching and defines ''indexes'' on them. A key concept in the relational design is that any data that may be repeated is normally placed in its own table, and if these instances are related to each other, a column is selected to group them together, the ''foreign key''. This design is known as ''
database normalization Database normalization or database normalisation (see spelling differences) is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. ...
''. For example, an address book application will generally need to store the contact name, an optional image, one or more phone numbers, one or more mailing addresses, and one or more email addresses. In a canonical relational database, tables would be created for each of these rows with predefined fields for each bit of data: the CONTACT table might include FIRST_NAME, LAST_NAME and IMAGE columns, while the PHONE_NUMBER table might include COUNTRY_CODE, AREA_CODE, PHONE_NUMBER and TYPE (home, work, etc.). The PHONE_NUMBER table also contains a foreign key column, "CONTACT_ID", which holds the unique ID number assigned to the contact when it was created. In order to recreate the original contact, the database engine uses the foreign keys to look for the related items across the group of tables and reconstruct the original data. In contrast, in a document-oriented database there may be no internal structure that maps directly onto the concept of a table, and the fields and relationships generally don't exist as predefined concepts. Instead, all of the data for an object is placed in a single document, and stored in the database as a single entry. In the address book example, the document would contain the contact's name, image, and any contact info, all in a single record. That entry is accessed through its key, which allows the database to retrieve and return the document to the application. No additional work is needed to retrieve the related data; all of this is returned in a single object. A key difference between the document-oriented and relational models is that the data formats are not predefined in the document case. In most cases, any sort of document can be stored in any database, and those documents can change in type and form at any time. If one wishes to add a COUNTRY_FLAG to a CONTACT, this field can be added to new documents as they are inserted, this will have no effect on the database or the existing documents already stored. To aid retrieval of information from the database, document-oriented systems generally allow the administrator to provide ''hints'' to the database to look for certain types of information. These work in a similar fashion to indexes in the relational case. Most also offer the ability to add additional metadata outside of the content of the document itself, for instance, tagging entries as being part of an address book, which allows the programmer to retrieve related types of information, like "all the address book entries". This provides functionality similar to a table, but separates the concept (categories of data) from its physical implementation (tables). In the classic normalized relational model, objects in the database are represented as separate rows of data with no inherent structure beyond that given to them as they are retrieved. This leads to problems when trying to translate programming objects to and from their associated database rows, a problem known as object-relational impedance mismatch. Document stores more closely, or in some cases directly, map programming objects into the store. These are often marketed using the term
NoSQL A NoSQL (originally referring to "non- SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed ...
.


Implementations


XML database implementations

Most XML databases are document-oriented databases.


See also

*
Database theory Database theory encapsulates a broad range of topics related to the study and research of the theoretical realm of databases and database management systems. Theoretical aspects of data management include, among other areas, the foundations of q ...
*
Data hierarchy Data hierarchy refers to the systematic organization of data, often in a hierarchical form. Data organization involves characters, fields, records, files and so on. This concept is a starting point when trying to see what makes up data and whether d ...
*
Data analysis Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, enco ...
*
Full-text search In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts ...
*
In-memory database An in-memory database (IMDB, or main memory database system (MMDB) or memory resident database) is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that e ...
*
Internet Message Access Protocol In computing, the Internet Message Access Protocol (IMAP) is an Internet standard protocol used by email clients to retrieve email messages from a mail server over a TCP/IP connection. IMAP is defined by . IMAP was designed with the goal of per ...
(IMAP) *
Machine-readable document A machine-readable document is a document whose content can be readily processed by computers. Such documents are distinguished from machine-readable data by virtue of having sufficient structure to provide the necessary context to support the bu ...
*
Multi-model database In the field of database design, a multi-model database is a database management system designed to support multiple data models against a single, integrated backend. In contrast, most database management systems are organized around a single dat ...
*
NoSQL A NoSQL (originally referring to "non- SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed ...
*
Object database An object database or object-oriented database is a database management system in which information is represented in the form of objects as used in object-oriented programming. Object databases are different from relational databases which a ...
*
Online database An online database is a database accessible from a local network or the Internet, as opposed to one that is stored locally on an individual computer or its attached storage (such as a CD). Online databases are hosted on websites, made available as s ...
*
Real-time database A real-time database is a database system which uses real-time processing to handle workloads whose state is constantly changing. This differs from traditional databases containing persistent data, mostly unaffected by time. For example, a stock ma ...
*
Relational database A relational database is a (most commonly digital) database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relatio ...
*
Content management system A content management system (CMS) is computer software used to manage the creation and modification of digital content (content management).''Managing Enterprise Content: A Unified Content Strategy''. Ann Rockley, Pamela Kostur, Steve Manning. New ...


Notes


References


Further reading

* Assaf Arkin. (2007, September 20)
Read Consistency: Dumb Databases, Smart Services.


External links


DB-Engines Ranking of Document Stores
by popularity, updated monthly {{Databases Data management Database management systems Types of databases Data analysis Databases