ClickHouse

ClickHouse is an open-source column-oriented DBMS (columnar database management system) for online analytical processing (OLAP) that allows users to generate analytical reports using SQL queries in real time. ClickHouse Inc. is headquartered in the Bay Area of California, United States, with a subsidiary, ClickHouse B.V., based in Amsterdam, Netherlands. In September 2021, ClickHouse was incorporated in San Francisco, California, to house the open-source technology, with an initial $50 million investment from Index Ventures and Benchmark Capital and participation by Yandex N.V. and others. On October 28, 2021, the company received Series B funding totaling $250 million at a valuation of $2 billion from Coatue Management, Altimeter Capital, and other investors. The company continues to build the open-source project and to engineer cloud technology.


History

ClickHouse's technology was first developed over 10 years ago at Yandex, Russia's largest technology company. In 2009, Alexey Milovidov and other developers started an experimental project to test the hypothesis that analytical reports could be generated in real time from non-aggregated data that is itself constantly being added in real time. The developers spent three years proving this hypothesis, and in 2012 ClickHouse launched in production for the first time to power Yandex.Metrica, the second-largest web analytics platform in the world after Google Analytics. Unlike the custom data structures used before, ClickHouse was general enough to work as a database management system. As a true column-oriented DBMS, it allowed systems to generate reports from petabytes of raw data with sub-second latencies. ClickHouse was widely adopted at Yandex, including for the Yandex.Tank load testing tool and for Yandex.Market, where it was used to monitor site accessibility and KPIs.

In June 2016, the ClickHouse project was released as open-source software under the Apache 2 license to power analytical use cases around the globe. Systems available at the time offered a server throughput of a hundred thousand rows per second; ClickHouse outperformed that with a throughput of hundreds of millions of rows per second. Since becoming available as open source, ClickHouse has grown in popularity, as evidenced by its adoption at companies such as Uber, Comcast, eBay, and Cisco. ClickHouse was also implemented at CERN's LHCb experiment to store and process metadata on 10 billion events with over 1000 attributes per event.


Features

The main features of the ClickHouse DBMS are:
* ''True column-oriented DBMS.'' Nothing is stored alongside the values. For example, constant-length values are supported, so a length does not have to be stored next to each value.
* ''Linear scalability.'' A cluster can be extended by adding servers.
* ''Fault tolerance.'' The system is a cluster of shards, where each shard is a group of replicas. ClickHouse uses asynchronous multi-master replication: data is written to any available replica and then distributed to all the remaining replicas. ZooKeeper is used for coordinating processes, but it is not involved in query processing and execution.
* ''Capability to store and process petabytes of data.''
* ''SQL support.'' ClickHouse supports an extended SQL-like language that includes arrays and nested data structures, approximate and URI functions, and the ability to connect an external key-value store.
* ''High performance.''
** Vector calculations are used. Data is not only stored by columns but is also processed by vectors (parts of columns). This approach achieves high CPU performance.
** Sampling and approximate calculations are supported.
** Parallel and distributed query processing is available (including JOINs).
* ''Data compression.''
* ''Hard disk drive (HDD) optimization.'' The system can process data that doesn't fit in random-access memory (RAM).
* ''Clients for database (DB) connectivity.'' Database connection options include the console client, the HTTP API, or one of the wrappers (wrappers are available for Python, PHP, NodeJS, Perl, Ruby and R). An ODBC driver and a JDBC driver are also available for ClickHouse.
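
The column-oriented layout and vector-at-a-time processing described above can be illustrated with a small sketch. This is not ClickHouse code or its storage format; the table, the column names, and the `CHUNK_ROWS` value are assumptions chosen for illustration. Each column is a fixed-width array with no per-value length stored, and the query reads only the columns it needs, one chunk at a time.

```python
from array import array

# Hypothetical toy table stored by column. Each column is one contiguous
# fixed-width array, so no per-value length or row header is stored.
columns = {
    "user_id": array("Q", [1, 2, 3, 4, 5, 6]),                # 8-byte unsigned
    "duration_ms": array("I", [120, 80, 250, 40, 300, 95]),   # typically 4-byte unsigned
    "status": array("H", [200, 200, 500, 200, 404, 200]),     # 2-byte unsigned
}

CHUNK_ROWS = 4  # illustrative chunk size, not a ClickHouse setting


def chunks(column, size):
    """Yield successive slices (vectors) of one column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def total_duration_for_status(cols, wanted_status):
    """Sum duration_ms over rows whose status matches, chunk by chunk.

    Only the two columns the query needs are read; user_id is never touched.
    """
    total = 0
    for status_vec, duration_vec in zip(
        chunks(cols["status"], CHUNK_ROWS),
        chunks(cols["duration_ms"], CHUNK_ROWS),
    ):
        # Process one vector (part of a column) at a time.
        total += sum(d for s, d in zip(status_vec, duration_vec) if s == wanted_status)
    return total


result = total_duration_for_status(columns, 200)
```

Here `result` is 335, the sum of the four durations whose status is 200. A real engine would apply compiled, SIMD-friendly loops to each chunk; the Python loop only shows the access pattern.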


Limitations

ClickHouse has some features that can be considered disadvantages:
* There is no support for transactions.
* Lack of a full-fledged UPDATE/DELETE implementation.


Use cases

ClickHouse was designed for OLAP queries with the following characteristics:
* It works with a small number of tables that contain a large number of columns.
* Queries can use a large number of rows extracted from the DB, but only a small subset of columns.
* Queries are relatively rare (usually around 100 RPS per server).
* For simple queries, latencies of about 50 ms are allowed.
* Column values are fairly small, usually consisting of numbers and short strings (for example, 60 bytes per URL).
* High throughput is required when processing a single query (up to billions of rows per second per server).
* A query result is mostly filtered or aggregated.
* Data update uses a simple scenario (usually batch-only, without complicated transactions).

One of the common cases for ClickHouse is server log analysis. After setting up regular data uploads to ClickHouse (it's recommended to insert data in fairly large batches of more than 1000 rows), it's possible to analyze incidents with instant queries or monitor a service's metrics, such as error rates, response times, and so on.

ClickHouse can also be used as an internal data warehouse for in-house analysts. ClickHouse can store data from different systems (such as Hadoop or certain logs), and analysts can build internal dashboards with the data or perform real-time analysis for business purposes.
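
The batch-insert recommendation for log analysis can be sketched as a small buffer that collects rows and passes them on only in large groups. This is a minimal illustration under stated assumptions: the `BatchBuffer` class, its `sink` callable, and the row fields are hypothetical, and the sink stands in for whatever client call actually sends the rows to ClickHouse.

```python
class BatchBuffer:
    """Collect rows and hand them to `sink` in large batches."""

    def __init__(self, sink, min_batch_rows=1000):
        self.sink = sink                      # hypothetical callable taking a list of rows
        self.min_batch_rows = min_batch_rows  # flush once a batch exceeds this size
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) > self.min_batch_rows:
            self.flush()

    def flush(self):
        """Send whatever is buffered and start a new, empty batch."""
        if self.rows:
            self.sink(self.rows)
            self.rows = []


# Stand-in sink: record each batch instead of sending it to a server.
sent_batches = []
buffer = BatchBuffer(sink=sent_batches.append, min_batch_rows=1000)

for i in range(2500):
    buffer.add({
        "ts": i,
        "status": 500 if i % 100 == 0 else 200,
        "response_ms": 20 + i % 50,
    })
buffer.flush()  # send the remainder

batch_sizes = [len(batch) for batch in sent_batches]
# The kind of metric the text mentions, computed locally over the same rows.
error_rate = sum(row["status"] >= 500 for batch in sent_batches for row in batch) / 2500
```

With 2500 rows the buffer emits batches of 1001, 1001, and 498 rows, and the error rate over the sample data is 0.01. In practice a time-based flush is usually added as well so that a slow trickle of rows is not held back indefinitely.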


Benchmark results

According to benchmark tests conducted by its developers, for OLAP queries ClickHouse is more than 100 times faster than Hive (a DBMS based on the Hadoop technology stack) or MySQL (a common RDBMS).


See also

* List of column-oriented DBMSes


External links


ClickHouse official website