In September 2021, ClickHouse, Inc. was incorporated in San Francisco, CA, to house the open-source technology, with an initial $50 million investment from Index Ventures and Benchmark Capital, with participation by Yandex N.V.[2] and others. On October 28, 2021, the company raised $250 million in Series B funding at a $2 billion valuation from Coatue Management, Altimeter Capital, and other investors. The company continues to develop the open-source project and to build its cloud technology.
History
ClickHouse’s technology was first developed at Yandex, Russia's largest technology company.[3] In 2009, Alexey Milovidov and a team of developers started an experimental project to test whether it was viable to generate analytical reports in real time from non-aggregated data that is also continuously added in real time. The developers spent three years proving this hypothesis, and in 2012 ClickHouse launched in production for the first time to power Yandex.Metrica.
In 2016, the ClickHouse project was released as open-source software under the Apache License 2.0 to power analytical use cases around the globe. Comparable systems at the time offered per-server throughput of around a hundred thousand rows per second; ClickHouse outperformed them with throughput of hundreds of millions of rows per second.[citation needed]
Since ClickHouse became available as open source in 2016, its popularity has grown rapidly, as evidenced by adoption at industry-leading companies such as Uber, Comcast, eBay, and Cisco.[4] ClickHouse was also implemented at CERN's LHCb experiment to store and process metadata on 10 billion events with over 1,000 attributes per event.[5]
Features
True column-oriented DBMS. No extra data is stored with the values. For example, constant-length values are supported so that a separate "length" number does not have to be stored next to the values.
Linear scalability. It's possible to extend a cluster by adding servers.
Fault tolerance. The system is a cluster of shards, where each shard is a group of replicas. ClickHouse uses asynchronous multi-master replication: data is written to any available replica and then distributed to all the remaining replicas. ClickHouse Keeper (a C++ replacement for ZooKeeper) coordinates processes such as data replication, but is not involved in query processing and execution.
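For illustration, a minimal sketch of a replicated table in ClickHouse SQL; the table name, columns, and Keeper path are hypothetical, and `{shard}`/`{replica}` are the usual server-side macros:

```sql
-- Each replica of this table registers under the given Keeper/ZooKeeper path
-- and asynchronously pulls parts written to any other replica.
CREATE TABLE hits_replicated
(
    event_time DateTime,
    url String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/hits', '{replica}')
ORDER BY event_time;
```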
Capability to store and process petabytes of data.
SQL support. ClickHouse supports an extended SQL-like dialect that includes arrays and nested data structures, approximate and URI functions, and the ability to connect an external key-value store.
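A brief sketch of the array and nested-structure support; the table and data are hypothetical:

```sql
-- Arrays and nested structures are first-class column types.
CREATE TABLE page_visits
(
    visit_date Date,
    user_id UInt64,
    tags Array(String),
    events Nested(name String, ts DateTime)
)
ENGINE = MergeTree
ORDER BY (visit_date, user_id);

-- Higher-order functions operate on array columns directly.
SELECT user_id, arrayFilter(t -> startsWith(t, 'promo'), tags) AS promo_tags
FROM page_visits
WHERE has(tags, 'checkout');
```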
A vectorized query engine that parallelizes execution to maximize hardware utilization, selecting the most optimized SIMD variant based on the host CPU.
Data is written as independent table parts without global coordination, enabling fast, parallel inserts. Background merge operations combine parts asynchronously to optimize query performance and storage efficiency.
Inserts are fully isolated from SELECT queries, and merging inserted data parts happens in the background so as to minimize the impact on concurrent queries.
Primary key indexes to define the sort order of table data to enable efficient binary search during query execution, reducing scan time from linear to logarithmic.
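As a sketch of how the sort order doubles as a sparse primary index (hypothetical table):

```sql
-- ORDER BY defines both the on-disk sort order and the sparse primary index.
CREATE TABLE events
(
    site_id UInt32,
    event_time DateTime,
    url String
)
ENGINE = MergeTree
ORDER BY (site_id, event_time);

-- A filter on the leading key columns is resolved by binary search over
-- the index marks rather than a full scan.
SELECT count()
FROM events
WHERE site_id = 42 AND event_time >= '2024-01-01 00:00:00';
```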
Table projections for alternative sort orders, storing internal copies of data sorted by different keys to optimize performance for multiple common filter patterns.
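A minimal projection sketch, reusing the hypothetical `events` table above:

```sql
-- The projection stores a hidden copy of the data sorted by url,
-- which the optimizer can choose for url-filtered queries.
ALTER TABLE events ADD PROJECTION by_url
(
    SELECT site_id, event_time, url ORDER BY url
);
ALTER TABLE events MATERIALIZE PROJECTION by_url;
```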
Skipping indexes to avoid unnecessary reads by adding lightweight column-level statistics (e.g. min/max, unique values) to accelerate filter evaluation by skipping irrelevant data blocks.
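A sketch of two common skipping-index types, again on the hypothetical `events` table:

```sql
-- minmax stores per-block minimum/maximum values; blocks whose range
-- cannot match the filter are skipped entirely.
ALTER TABLE events ADD INDEX time_mm event_time TYPE minmax GRANULARITY 4;

-- A bloom filter accelerates equality/IN filters on string columns.
ALTER TABLE events ADD INDEX url_bf url TYPE bloom_filter GRANULARITY 4;
ALTER TABLE events MATERIALIZE INDEX url_bf;
```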
Sampling and approximate calculations are supported.
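For illustration (hypothetical table; `SAMPLE` requires a `SAMPLE BY` key in the table definition):

```sql
-- uniq() estimates distinct counts with bounded error, far cheaper than uniqExact().
SELECT uniq(user_id) AS approx_visitors FROM visits;

-- SAMPLE reads roughly 10% of the data and extrapolates.
SELECT avg(duration_ms) FROM visits SAMPLE 0.1;
```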
Parallel and distributed query processing is available (including JOINs).
Data compression. A column-oriented design, in which values are sorted by an explicit ordering, allows data to be efficiently compressed due to similar values being adjacent on disk. Configurable compression algorithms, such as Zstandard (Zstd), which combines high speed with effective compression, and LZ4 (compression algorithm), known for its rapid (de)compression, as well as configurable codecs such as Delta encoding, allow for high compression rates to be achieved.
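A sketch of per-column codec configuration (hypothetical table):

```sql
CREATE TABLE metrics
(
    -- Delta stores successive differences, which ZSTD then compresses well.
    ts DateTime CODEC(Delta, ZSTD),
    -- Gorilla is a float-oriented codec; ZSTD compresses the residue.
    value Float64 CODEC(Gorilla, ZSTD),
    -- LZ4 favors (de)compression speed over ratio.
    label String CODEC(LZ4)
)
ENGINE = MergeTree
ORDER BY ts;
```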
Complex type support. Including semi-structured data such as JSON, where the schema is determined at write time based on the fields present.
Vector search support. Available through distance functions, with both exact matching and approximate nearest neighbor (ANN) search indices.[8]
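A sketch of exact nearest-neighbor search via a distance function; the table, the `embedding` column (an `Array(Float32)`), and the query vector are hypothetical:

```sql
-- Exact search: compute the distance to every row and keep the closest ones.
SELECT id, cosineDistance(embedding, [0.1, 0.2, 0.3, 0.4]) AS dist
FROM documents
ORDER BY dist ASC
LIMIT 5;
```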
Change Data Capture (CDC). Through its acquisition of PeerDB, an open-source CDC solution, ClickHouse can mirror inserts, updates, and deletes from external databases such as PostgreSQL in near real time.[9]
Dictionaries. ClickHouse provides in-memory key-value stores known as Dictionaries, which enable efficient enrichment and accelerate `LEFT ANY JOIN` queries.[10]
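A minimal dictionary sketch; the source table and attributes are hypothetical:

```sql
-- An in-memory hashed dictionary periodically refreshed from a ClickHouse table.
CREATE DICTIONARY country_dict
(
    country_id UInt64,
    name String
)
PRIMARY KEY country_id
SOURCE(CLICKHOUSE(TABLE 'countries'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(HASHED());

-- dictGet performs a point lookup instead of a join.
SELECT dictGet('country_dict', 'name', toUInt64(42));
```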
Open file formats. ClickHouse natively supports reading and writing open formats such as Parquet and Avro.[12]
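For illustration, reading and writing Parquet via the `file()` table function; the file paths are illustrative:

```sql
-- Query a Parquet file in place, without loading it into a table first.
SELECT count() FROM file('events.parquet', 'Parquet');

-- Export query results back out as Parquet.
INSERT INTO FUNCTION file('export.parquet', 'Parquet')
SELECT * FROM events LIMIT 1000;
```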
Open table formats. The system can query modern open table formats including Apache Iceberg and Delta Lake, enabling interoperability with data lakehouse ecosystems.[13]
Row deduplication engines. Multiple table engines in the MergeTree family, such as `ReplacingMergeTree`, support asynchronous row merging logic to remove duplicates efficiently.[14]
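A `ReplacingMergeTree` sketch; the table and the `ver` version column are hypothetical:

```sql
-- Rows sharing the same sorting key are collapsed during background merges;
-- the row with the highest ver value survives.
CREATE TABLE user_profiles
(
    user_id UInt64,
    email String,
    ver UInt64
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY user_id;

-- FINAL forces deduplication at query time, at some extra cost.
SELECT * FROM user_profiles FINAL WHERE user_id = 1;
```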
Incremental materialized views. ClickHouse supports incremental updates to materialized views, where partial aggregation states can be stored and refreshed without recomputing the full dataset.[15]
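A sketch of an incremental materialized view storing partial aggregation states; the target table and the source table (assumed to have `event_time` and `user_id` columns) are hypothetical:

```sql
-- Target table holds -State aggregates, merged incrementally on insert.
CREATE TABLE daily_uniques
(
    day Date,
    users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY day;

-- The view converts each inserted block into partial states.
CREATE MATERIALIZED VIEW daily_uniques_mv TO daily_uniques AS
SELECT toDate(event_time) AS day, uniqState(user_id) AS users
FROM events
GROUP BY day;

-- -Merge finalizes the partial states at read time.
SELECT day, uniqMerge(users) AS unique_users
FROM daily_uniques
GROUP BY day;
```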
Refreshable materialized views. In addition to incremental updates, ClickHouse supports refreshable materialized views that periodically execute queries and persist their results into target tables.[16]
Historically, UPDATE and DELETE operations in ClickHouse were implemented as background mutations that rewrote table parts, a reliable but expensive process unsuited to frequent row-level changes. Lightweight deletes later reduced this cost by rewriting only a deletion mask column. More recently, ClickHouse introduced patch parts, which record only the modified values and their row positions. These patches are applied immediately at query time and merged in the background, enabling efficient updates and "featherweight deletes", where deletions are expressed as compact patches that remove rows with minimal overhead.[26]
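For illustration, the two row-level modification paths in ClickHouse SQL (table and columns are illustrative):

```sql
-- Lightweight delete: rows are masked out rather than whole parts rewritten.
DELETE FROM events WHERE url LIKE '%spam%';

-- Mutation-based update: affected parts are rewritten in the background.
ALTER TABLE events UPDATE url = '' WHERE site_id = 0;
```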
Use cases
ClickHouse was designed for OLAP queries.[6] ClickHouse performs well when:
It works with wide tables that contain a large number of columns (up to 1,000 recommended).
Queries read a large number of rows from the database, but only a small subset of columns.
Queries are relatively rare (usually around 100 requests per second per server).
High throughput is required when processing a single query (up to billions of rows per second per server).
Query results are significantly smaller than the source data; that is, the data is filtered or aggregated.
Data update uses a simple scenario (usually batch-only, without complicated transactions).
For simple queries, latencies of 50 ms are typical.
One of the common cases for ClickHouse is server log analysis. After setting up regular data uploads to ClickHouse (inserting in fairly large batches of more than 1,000 rows is recommended), it is possible to analyze incidents with instant queries or monitor service metrics such as error rates and response times.
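As an illustration of the log-analysis use case, a monitoring-style query over a hypothetical `service_logs` table:

```sql
-- Per-minute error rate and p99 latency over the last hour.
SELECT
    toStartOfMinute(ts) AS minute,
    countIf(status >= 500) / count() AS error_rate,
    quantile(0.99)(response_ms) AS p99_ms
FROM service_logs
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;
```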
ClickHouse can also be used as an internal data warehouse for in-house analysts. ClickHouse can store data from different systems (such as Hadoop or certain logs) and analysts can build internal dashboards with the data or perform real-time analysis for business purposes.
Benchmark results
According to benchmark tests conducted by its developers,[7] for OLAP queries ClickHouse is more than 100 times faster than Hive (a DBMS based on the Hadoop technology stack) or MySQL (a common RDBMS).
ClickHouse Inc. maintains ClickBench, an open and reproducible benchmark for analytical database systems, based on real-world web analytics data and designed to evaluate performance across diverse OLAP, OLTP, and cloud-native databases using standardized SQL queries and realistic workloads.[27] A related benchmark, JSONBench, evaluates the JSON analytics capabilities of modern database systems using a real-world dataset of one billion Bluesky events.[28]