Database Shard
   HOME

TheInfoList



OR:

A database shard, or simply a shard, is a horizontal
partition Partition may refer to: Computing Hardware * Disk partitioning, the division of a hard disk drive * Memory partition, a subdivision of a computer's memory, usually for use by a single job Software * Partition (database), the division of a ...
of data in a
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...
or
search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
. Each shard is held on a separate
database server A database server is a server which uses a database application that provides database services to other computer programs or to computers, as defined by the client–server model. Database management systems (DBMSs) frequently provide database- ...
instance, to spread load. Some data within a database remains present in all shards, but some appear only in a single shard. Each shard (or server) acts as the ''single'' source for this subset of data.


Database architecture

Horizontal partitioning is a database design principle whereby '' rows'' of a database table are held separately, rather than being split into
columns A column or pillar in architecture and structural engineering is a structural element that transmits, through compression, the weight of the structure above to other structural elements below. In other words, a column is a compression member. ...
(which is what
normalization Normalization or normalisation refers to a process that makes something more normal or regular. Most commonly it refers to: * Normalization (sociology) or social normalization, the process through which ideas and behaviors that may fall outside of ...
and vertical partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location. There are numerous advantages to the horizontal partitioning approach. Since the tables are divided and distributed into multiple servers, the total number of rows in each table in each database is reduced. This reduces
index Index (or its plural form indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on a Halo megastru ...
size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g., European customers v. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard. In practice, sharding is complex. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately.
Consistent hashing In computer science, consistent hashing is a special kind of hashing technique such that when a hash table is resized, only n/m keys need to be remapped on average where n is the number of keys and m is the number of slots. In contrast, in most tra ...
is a technique used in sharding to spread large loads across multiple smaller services and servers. Where
distributed computing A distributed system is a system whose components are located on different computer network, networked computers, which communicate and coordinate their actions by message passing, passing messages to one another from any system. Distributed com ...
is used to separate load between multiple servers (either for performance or reliability reasons), a shard approach may also be useful. In the 2010s, sharding of
execution Capital punishment, also known as the death penalty, is the State (polity), state-sanctioned practice of deliberately killing a person as a punishment for an actual or supposed crime, usually following an authorized, rule-governed process to ...
capacity, as well as the more traditional sharding of
data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted ...
, has emerged as a potential approach to overcome performance and scalability problems in
blockchain A blockchain is a type of distributed ledger technology (DLT) that consists of growing lists of records, called ''blocks'', that are securely linked together using cryptography. Each block contains a cryptographic hash of the previous block, a ...
s.


Compared to horizontal partitioning

Horizontal partitioning splits one or more tables by row, usually within a ''single'' instance of a
schema The word schema comes from the Greek word ('), which means ''shape'', or more generally, ''plan''. The plural is ('). In English, both ''schemas'' and ''schemata'' are used as plural forms. Schema may refer to: Science and technology * SCHEMA ...
and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which partition a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found. Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially ''multiple'' instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server. Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost, if querying the database required ''multiple'' instances to be queried, just to retrieve a simple
dimension table A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time. (Note: People and time sometimes are not modeled as dimensions.) ...
. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated as complete units. This is also why sharding is related to a
shared-nothing architecture A shared-nothing architecture (SN) is a distributed computing architecture in which each update request is satisfied by a single node (processor/memory/storage unit) in a computer cluster. The intent is to eliminate contention among nodes. Nodes do ...
—once sharded, each shard can live in a totally separate logical schema instance / physical database server /
data center A data center (American English) or data centre (British English)See spelling differences. is a building, a dedicated space within a building, or a group of buildings used to house computer systems and associated components, such as telecommunic ...
/
continent A continent is any of several large landmasses. Generally identified by convention rather than any strict criteria, up to seven geographical regions are commonly regarded as continents. Ordered from largest in area to smallest, these seven ...
. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards. This makes replication across multiple servers easy (simple horizontal partitioning does not). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck. There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of sharding) and many options in between.


Implementations

*
Altibase ALTIBASE is a hybrid database, relational open source database management system manufactured by The Altibase Corporation. The software comes with a hybrid architecture which allows it to access both memory-resident and disk-resident tables usi ...
provides combined (client-side and server-side) sharding architecture transparent to client applications. * Apache
HBase HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File Sys ...
can shard automatically. * Azure SQL Database Elastic Database tools shards to scale out and in the data-tier of an application. *
ClickHouse ClickHouse is an open-source column-oriented DBMS (columnar database management system) for online analytical processing (OLAP) that allows users to generate analytical reports using SQL queries in real-time. ClickHouse Inc. is headquartered in ...
, a fast open-source OLAP database management system, shards. *
Couchbase Couchbase Server, originally known as Membase, is an open-source, distributed (shared-nothing architecture) multi-model NoSQL document-oriented database software package optimized for interactive applications. These applications may serve many ...
shards automatically and transparently. *
CUBRID CUBRID ( "cube-rid") is an open-source SQL-based relational database management system (RDBMS) with object extensions developed by CUBRID Corp. for OLTP. The name CUBRID is a combination of the two words ''cube'' and ''bridge'', ''cube'' standing ...
shards since version 9.0 * Db2 Data Partitioning Feature (MPP) which is a shared-nothing database partitions running on separate nodes. * DRDS (Distributed Relational Database Service) of
Alibaba Cloud Alibaba Cloud, also known as Aliyun (), is a cloud computing company, a subsidiary of Alibaba Group. Alibaba Cloud provides cloud computing services to online businesses and Alibaba's own e-commerce ecosystem. Its international operations are re ...
does database/table sharding, and supports
Singles' Day The Singles' Day () or Double 11 (), originally called Bachelors' Day, is a Chinese unofficial holiday and shopping season that celebrates people who are not in a relationship. The date, 11 November (11/11), was chosen because the numeral 1 res ...
. *
Elasticsearch Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-l ...
enterprise search server shards. * eXtreme Scale is a cross-process in-memory key/value data store (a
NoSQL A NoSQL (originally referring to "non- SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed ...
data store). It uses sharding to achieve scalability across processes for both data and
MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filtering ...
-style parallel processing. *
Hibernate Hibernation is a state of minimal activity and metabolic depression undergone by some animal species. Hibernation is a seasonal heterothermy characterized by low body-temperature, slow breathing and heart-rate, and low metabolic rate. It most ...
shards, but has had little development since 2007. * IBM
Informix IBM Informix is a product family within IBM's Information Management division that is centered on several relational database management system (RDBMS) offerings. The Informix products were originally developed by Informix Corporation, whose I ...
shards since version 12.1 xC1 as part of the MACH11 technology. Informix 12.10 xC2 added full compatibility with MongoDB drivers, allowing the mix of regular relational tables with NoSQL collections, while still allowing sharding, fail-over and ACID properties. *
Kdb+ kdb+ is a column-based relational time series database (TSDB) with in-memory (IMDB) abilities, developed and marketed by KX. The database is commonly used in high-frequency trading (HFT) to store, analyze, process, and retrieve large data set ...
shards since version 2.0. *
MariaDB MariaDB is a community-developed, commercially supported fork of the MySQL relational database management system (RDBMS), intended to remain free and open-source software under the GNU General Public License. Development is led by some of the ori ...
Spider, an storage engine that supports table federation, table sharding, XA transactions, and ODBC data sources. The MariaDB Spider engine is bundled in MariaDB server since version 10.0.4. *
MonetDB MonetDB is an open-source column-oriented relational database management system (RDBMS) originally developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is designed to provide high performance on complex queries against lar ...
, an open-source column-store, does read-only sharding in its July 2015 release. *
MongoDB MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Serve ...
shards since version 1.6. *
MySQL Cluster MySQL Cluster is a technology providing shared-nothing clustering and auto-sharding for the MySQL database management system. It is designed to provide high availability and high throughput with low latency, while allowing for near linear sca ...
automatically and transparently shards across low-cost commodity nodes, allowing scale-out of read and write queries, without requiring changes to the application. *
MySQL MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database o ...
Fabric (part of MySQL utilities) shards. * Oracle Database shards since 12c Release 2 and in one liner: Combination of sharding advantages with well-known capabilities of enterprise ready multi-model Oracle Database. *
Oracle NoSQL Database Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring. Oracle NoSQL Database Cl ...
has automatic sharding and elastic, online expansion of the cluster (adding more shards). *
OrientDB OrientDB is an open source NoSQL database management system written in Java (programming language), Java. It is a Multi-model database, supporting Graph database, graph, Document-oriented database, document, Key-value database, key/value, and Obj ...
shards since version 1.7 *
Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features an ...
enterprise search server shards. *
Spanner A wrench or spanner is a tool used to provide grip and mechanical advantage in applying torque to turn objects—usually rotary fasteners, such as nuts and bolts—or keep them from turning. In the UK, Ireland, Australia, and New Zeala ...
, Google's global-scale distributed database, shards across multiple
Paxos Paxos ( gr, Παξός) is a Greek island in the Ionian Sea, lying just south of Corfu. As a group with the nearby island of Antipaxos and adjoining islets, it is also called by the plural form Paxi or Paxoi ( gr, Παξοί, pronounced in Engli ...
state machines to scale to "millions of machines across hundreds of data centers and trillions of database rows". * SQLAlchemy ORM, a data-mapper for the
Python programming language Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically-typed and garbage-collected. It supports multiple programming p ...
shards. * SQL Server, since SQL Server 2005 shards with help of 3rd party tools. *
Teradata Teradata Corporation is an American software company that provides cloud database and analytics-related software, products, and services. The company was formed in 1979 in Brentwood, California, as a collaboration between researchers at Caltech a ...
markets a massive parallel database management system as a " data warehouse" *Vault, a
cryptocurrency A cryptocurrency, crypto-currency, or crypto is a digital currency designed to work as a medium of exchange through a computer network that is not reliant on any central authority, such as a government or bank, to uphold or maintain it. It i ...
, shards to drastically reduce the data that users need to join the network and verify transactions. This allows the network to scale much more. * Vitess open-source database clustering system shards MySQL. It is a
Cloud Native Computing Foundation The Cloud Native Computing Foundation (CNCF) is a Linux Foundation project that was founded in 2015 to help advance container technology and align the tech industry around its evolution. It was announced alongside Kubernetes 1.0, an open sourc ...
project. * ShardingSphere related to a database clustering system providing data sharding, distributed transactions, and distributed database management. It is an
Apache Software Foundation The Apache Software Foundation (ASF) is an American nonprofit corporation (classified as a 501(c)(3) organization in the United States) to support a number of open source software projects. The ASF was formed from a group of developers of the A ...
(ASF) project.


Disadvantages

Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate. The introduced complexity of database sharding causes the following potential problems: * ''SQL complexity'' - Increased bugs because the developers have to write more complicated SQL to handle sharding logic * ''Additional software'' - that partitions, balances, coordinates, and ensures integrity can fail * ''
Single point of failure A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software appl ...
'' - Corruption of one shard due to network/hardware/systems problems causes failure of the entire table. * ''
Fail-over Failover is switching to a redundant or standby computer server, system, hardware component or network upon the failure or abnormal termination of the previously active application, server, system, hardware component, or network in a computer net ...
server complexity'' - Fail-over servers must have copies of the fleets of database shards. * ''
Backup In information technology, a backup, or data backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event. The verb form, referring to the process of doing so, is "back up", w ...
s complexity'' - Database backups of the individual shards must be coordinated with the backups of the other shards. * ''Operational complexity'' - Adding/removing indexes, adding/deleting columns, modifying the schema becomes much more difficult. These historical complications of do-it-yourself sharding were addressed by independent software vendors who provided automatic sharding.


Etymology

In a database context, most recognize the term "shard" is most likely derived from either one of two sources:
Computer Corporation of America Computer Corporation of America (CCA) was a computer software and database systems company founded in 1965. It was best known for its Model 204 (M204) database system for IBM and compatible mainframes. It was acquired by Rocket Software in 2010. ...
's "A System for Highly Available Replicated Data", which utilized redundant hardware to facilitate data ''replication'' (as opposed to horizontal partitioning); or the critically acclaimed 1997
MMORPG A massively multiplayer online role-playing game (MMORPG) is a video game that combines aspects of a role-playing video game and a massively multiplayer online game. As in role-playing games (RPGs), the player assumes the role of a Player charac ...
video game ''
Ultima Online ''Ultima Online'' (''UO'') is a fantasy massively multiplayer online role-playing game (MMORPG) released on September 24, 1997 by Origin Systems. Set in the '' Ultima'' universe, it is known for its extensive player versus player combat system. ...
'' which set 8
Guinness World Records ''Guinness World Records'', known from its inception in 1955 until 1999 as ''The Guinness Book of Records'' and in previous United States editions as ''The Guinness Book of World Records'', is a reference book published annually, listing world ...
and was designated by ''Time'' as one of the 100 greatest video games produced of all time.
Richard Garriott Richard Allen Garriott de Cayeux (''né'' Garriott; born July 4, 1961) is an American video game developer, entrepreneur and private astronaut. Although both his parents were American, he maintains dual British and American citizenship by birth. ...
, creator of ''Ultima Online'', recollects the term being coined during production phase when they attempted to create a self-regulating virtual ecology system, whereby players may leverage new internet access (a revolutionary technology at the time) to interact and harvest in-game resources. Although the virtual ecology functioned as intended during in-house testing, its natural balance failed "almost instantaneously" due to players killing off every living wildlife across the playable area faster than the spawning system could operate. Garriott's production team attempted to mitigate this issue by separating the global player base into separate sessions, and rewriting part of ''Ultima Online'' fictional connection to the end of '' Ultima I: The First Age of Darkness'', where the defeat of its antagonist
Mondain This is a list of significant or recurring characters in the ''Ultima'' series of computer games, indicating the games in which they appeared. The Avatar and Companions * Yes : The companion is in that game. * No : The companion is not in that ...
also led to the creation of
multiverse The multiverse is a hypothetical group of multiple universes. Together, these universes comprise everything that exists: the entirety of space, time, matter, energy, information, and the physical laws and constants that describe them. The di ...
"shards". This modification provided Garriott's team with the fictional basis needed to justify creating copies of the virtual environment. However, the game's sharp rise to critical acclaim also meant that the new multiverse virtual ecology system was quickly overwhelmed as well. After several months of testing, Garriott's team decided to abandon the feature altogether, and stripped the game of its functionality. Today, the term "shard" refers to the deployment and use of redundant hardware across database systems.


See also

*
Block Range Index A Block Range Index or BRIN is a database indexing technique. They are intended to improve performance with extremely large tables. BRIN indexes provide similar benefits to horizontal partitioning or sharding but without needing to explicitly decla ...
*
Shared-nothing architecture A shared-nothing architecture (SN) is a distributed computing architecture in which each update request is satisfied by a single node (processor/memory/storage unit) in a computer cluster. The intent is to eliminate contention among nodes. Nodes do ...


Notes


References


External links


Informix JSON data sharding
{{Design Patterns patterns Data partitioning Database management systems Software design patterns de:Denormalisierung#Fragmentierung