Apache Druid

Apache Druid^[1]
Original author(s)	Metamarkets
Developer(s)	Apache Software Foundation
Stable release	29.0.1 / 3 April 2024
Repository	github.com/apache/druid
Written in	Java
Operating system	Cross-platform
Type	Distributed, real-time, time-series, column-oriented data store
License	Apache License 2.0
Website	druid.apache.org

Last updated April 27, 2024

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.^[3] The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

Druid is commonly used in business intelligence (OLAP) applications to analyze high volumes of real-time and historical data.^[4] Druid is used in production by technology companies such as Alibaba,^[4] Airbnb,^[4] Cisco,^[5]^[4] eBay,^[6] Lyft,^[7] Netflix,^[8] PayPal,^[4] Pinterest,^[9] Reddit,^[10] Twitter,^[11] Walmart,^[12] the Wikimedia Foundation,^[13] and Yahoo.^[14]

History

Druid was started in 2011 by Eric Tschetter, Fangjin Yang, Gian Merlino and Vadim Ogievetsky^[15] to power the analytics product of Metamarkets. The project was open-sourced under the GPL in October 2012,^[16]^[17]^[18] and moved to the Apache License in February 2015.^[19]^[20]

Architecture

Fully deployed, Druid runs as a cluster of specialized processes (called nodes in Druid) to support a fault-tolerant architecture^[21] in which data is stored redundantly and there is no single point of failure.^[22] The cluster relies on external dependencies for coordination (Apache ZooKeeper), metadata storage (e.g. MySQL, PostgreSQL, or Derby), and a deep storage facility (e.g. HDFS or Amazon S3) for permanent data backup.

Query management

Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
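
The routing and merging described above can be illustrated with a small scatter-gather sketch. This is a toy model, not Druid's implementation: the segment layout, the node names, and the functions `query_node` and `broker_query` are all hypothetical, and the only aggregation shown is a per-key sum.

```python
from collections import defaultdict

# Hypothetical layout: which data node serves each segment (day, shard).
SEGMENT_LOCATIONS = {
    ("2024-04-01", 0): "historical-1",
    ("2024-04-01", 1): "historical-2",
    ("2024-04-02", 0): "realtime-1",
}

# Hypothetical rows held in each segment: (page, views).
SEGMENT_ROWS = {
    ("2024-04-01", 0): [("home", 3), ("docs", 1)],
    ("2024-04-01", 1): [("home", 2)],
    ("2024-04-02", 0): [("docs", 4), ("home", 1)],
}


def query_node(node, segments):
    """Data-node side: sum views per page over the node's local segments."""
    partial = defaultdict(int)
    for seg in segments:
        for page, views in SEGMENT_ROWS[seg]:
            partial[page] += views
    return dict(partial)


def broker_query(days):
    """Broker side: route by segment location, then merge partial results."""
    # 1. Find the segments covering the requested days and group them by node.
    plan = defaultdict(list)
    for seg, node in SEGMENT_LOCATIONS.items():
        if seg[0] in days:
            plan[node].append(seg)
    # 2. Scatter: each node returns a partial aggregate for its segments.
    partials = [query_node(node, segs) for node, segs in plan.items()]
    # 3. Gather: merge the partial sums into the final result.
    merged = defaultdict(int)
    for partial in partials:
        for page, views in partial.items():
            merged[page] += views
    return dict(merged)
```

Calling `broker_query({"2024-04-01", "2024-04-02"})` touches all three hypothetical nodes and merges their partial sums into one result per page. Because sums can be combined in any order, the broker never needs the raw rows.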

Cluster management

Operations relating to data management in historical nodes are overseen by coordinator nodes. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
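
As a rough illustration of how redundant placement avoids a single point of failure, the sketch below assigns each segment to several historical nodes. It is an assumption-laden toy: the least-loaded placement rule, the default of two replicas, and the names `assign_segments` and `surviving_segments` are invented for this example and do not describe the coordinator's actual balancing logic.

```python
def assign_segments(segments, historicals, replicas=2):
    """Place each segment on `replicas` distinct nodes, least-loaded first."""
    if replicas > len(historicals):
        raise ValueError("not enough historical nodes for the replica count")
    load = {node: 0 for node in historicals}
    assignment = {}
    for seg in segments:
        # Pick the nodes currently holding the fewest segments (ties by name).
        chosen = sorted(load, key=lambda n: (load[n], n))[:replicas]
        assignment[seg] = chosen
        for node in chosen:
            load[node] += 1
    return assignment


def surviving_segments(assignment, failed_node):
    """Segments still served by at least one node after `failed_node` is lost."""
    return {seg for seg, nodes in assignment.items()
            if any(n != failed_node for n in nodes)}
```

With three segments, three nodes, and two replicas each, losing any one node in this model still leaves every segment available on another node.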

Features

Low latency (streaming) data ingestion.
Arbitrary slice and dice data exploration.
Sub-second analytic queries.
Approximate and exact computations.

Performance

In 2019, researchers compared the performance of Hive, Presto, and Druid using a denormalized Star Schema Benchmark based on the TPC-H standard. Druid was tested using both a “Druid Best” configuration using tables with hashed partitions and a “Druid Suboptimal” configuration which does not use hashed partitions.^[23]

Tests were conducted by running the 13 TPC-H queries using TPC-H Scale Factor 30 (a 30 GB database), Scale Factor 100 (a 100 GB database), and Scale Factor 300 (a 300 GB database).


Scale Factor	Hive	Presto	Druid Best	Druid Suboptimal
30	256s	33s	2.09s	3.21s
100	424s	90s	6.12s	8.08s
300	982s	452s	7.60s	20.02s

In each scenario, Druid's measured query time was roughly 98% or more below Hive's and at least 90% below Presto's, even when using the “Druid Suboptimal” configuration.
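
The percentages can be checked directly against the table. The sketch below reads "faster" as the relative reduction in query time, (baseline − Druid) / baseline. That reading is an assumption, and the helper name `reduction_pct` is made up for this example.

```python
# Query times in seconds, copied from the table above.
TIMES = {
    30:  {"Hive": 256, "Presto": 33,  "Druid Best": 2.09, "Druid Suboptimal": 3.21},
    100: {"Hive": 424, "Presto": 90,  "Druid Best": 6.12, "Druid Suboptimal": 8.08},
    300: {"Hive": 982, "Presto": 452, "Druid Best": 7.60, "Druid Suboptimal": 20.02},
}


def reduction_pct(baseline, druid):
    """Percentage by which Druid's time is lower than the baseline's."""
    return 100.0 * (baseline - druid) / baseline


for sf, row in TIMES.items():
    for engine in ("Hive", "Presto"):
        pct = reduction_pct(row[engine], row["Druid Suboptimal"])
        print(f"SF{sf}: Druid Suboptimal vs {engine}: {pct:.1f}% less time")
```

Under this reading, the smallest reductions for the suboptimal configuration are about 98.0% against Hive (Scale Factor 300) and about 90.3% against Presto (Scale Factor 30).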

References

[1] "Apache Druid at GitHub". github.com. Retrieved 4 May 2021.
[3] Hemsoth, Nicole (8 November 2012). "Druid Summons Strength in Real-Time". Datanami. Archived from the original on 2013-02-27. Retrieved 2014-02-07.
[4] "Druid | Powered by Druid". druid.apache.org. Retrieved 2016-06-29.
[5] Butler, Brandon (20 June 2016). "Under the hood of Cisco's Tetration Analytics platform". Archived from the original on 2024-04-26. Retrieved 2016-06-23.
[6] "Druid at Pulsar - ebay的专栏 - 博客频道 - CSDN.NET". blog.csdn.net. Retrieved 2016-06-23.
[7] Streaming SQL and Druid by Arup Malakar. Retrieved 2020-01-29.
[8] "The Netflix Tech Blog: Announcing Suro: Backbone of Netflix's Data Pipeline". techblog.netflix.com. Retrieved 2016-06-23.
[9] Pinterest: Powering Ad Analytics with Apache Druid. Retrieved 2020-01-29.
[10] "Scaling Reporting at Reddit - Upvoted". www.redditinc.com. 26 February 2021. Retrieved 2022-09-13.
[11] "Interactive Analytics at MoPub: Querying Terabytes of Data in Seconds". blog.twitter.com. Retrieved 2020-01-29.
[12] Nayak, Amaresh (2018-02-23). "Event Stream Analytics at Walmart with Druid". Medium. Retrieved 2020-01-29.
[13] "Conferences - O'Reilly Media".
[14] "Complementing Hadoop at Yahoo: Interactive Analytics with Druid". Retrieved 2016-06-23.
[15] "Druid: A Real-time Analytical Data Store" (PDF).
[16] Tschetter, Eric (24 October 2012). "Introducing Druid". druid.apache.org. Archived from the original on 2022-02-08. Retrieved 2019-06-12.
[17] Higginbotham, Stacey (24 October 2012). "Metamarkets open sources Druid, its in-memory database". GigaOM. Archived from the original on 2021-09-18. Retrieved 2014-02-07.
[18] "Metamarkets Open Sources Druid, Streaming Real-Time Data Store". Yahoo News. 2012-10-24. Retrieved 2023-07-24.
[19] Harris, Derrick (2015-02-20). "The Druid real-time database moves to an Apache license". Archived from the original on 2015-08-22. Retrieved 2015-08-04.
[20] "Druid Gets Open Source-ier Under the Apache License". Retrieved 2015-08-04.
[21] "Druid Project Documentation".
[22] Yang, Fangjin; Tschetter, Eric; Léauté, Xavier; Ray, Nelson; Merlino, Gian; Ganguli, Deep. "Druid: A Real-time Analytical Data Store" (PDF). Metamarkets. Retrieved 6 February 2014.
[23] Correia, José; Costa, Carlos; Santos, Maribel Yasmina (2019). "Challenging SQL-on-Hadoop Performance with Apache Druid". In Abramowicz, Witold; Corchuelo, Rafael (eds.). Business Information Systems. Lecture Notes in Business Information Processing. Vol. 353. Cham: Springer International Publishing. pp. 149–161. doi:10.1007/978-3-030-20485-3_12. hdl:1822/66785. ISBN 978-3-030-20485-3. S2CID 190005302.

External links

Official website: druid.apache.org

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
