Apache Druid

Apache Druid [1]
Original author(s) Metamarkets
Developer(s) Apache Software Foundation
Stable release 29.0.1 [2] / 3 April 2024
Repository github.com/apache/druid
Written in Java
Operating system Cross-platform
License Apache License 2.0
Website druid.apache.org

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to ingest massive quantities of event data quickly and to serve low-latency queries over that data. [3] The name Druid comes from the shapeshifting Druid class in many role-playing games and reflects the idea that the architecture of the system can shift to solve different types of data problems.

Druid is commonly used in business intelligence and OLAP applications to analyze high volumes of real-time and historical data. [4] Druid is used in production by technology companies such as Alibaba, [4] Airbnb, [4] Cisco, [5] [4] eBay, [6] Lyft, [7] Netflix, [8] PayPal, [4] Pinterest, [9] Reddit, [10] Twitter, [11] Walmart, [12] the Wikimedia Foundation, [13] and Yahoo. [14]

History

Druid was started in 2011 by Eric Tschetter, Fangjin Yang, Gian Merlino and Vadim Ogievetsky [15] to power the analytics product of Metamarkets. The project was open-sourced under the GPL in October 2012, [16] [17] [18] and moved to the Apache License in February 2015. [19] [20]

Architecture

[Figure: Apache Druid architecture]

Fully deployed, Druid runs as a cluster of specialized processes (called nodes in Druid) to support a fault-tolerant architecture [21] in which data is stored redundantly and there is no single point of failure. [22] The cluster includes external dependencies for coordination (Apache ZooKeeper), metadata storage (e.g. MySQL, PostgreSQL, or Derby), and a deep storage facility (e.g. HDFS or Amazon S3) for permanent data backup.

Query management

Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers learn which nodes hold the required data and merge the partial results before returning the aggregated result.
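
The scatter-gather pattern described above can be illustrated with a small toy model. The following Python sketch is not Druid code: the `DataNode` and `Broker` classes, the segment identifiers, and the `count_by` method are hypothetical names chosen for illustration, and it assumes each segment is served by exactly one node.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical data node (historical or real-time) holding rows per segment."""
    name: str
    segments: dict = field(default_factory=dict)  # segment id -> list of row dicts

    def query(self, segment_id, dimension):
        """Return a partial result: row counts per dimension value for one segment."""
        return Counter(row[dimension] for row in self.segments[segment_id])


class Broker:
    """Hypothetical broker that routes a query to data nodes and merges partial results."""

    def __init__(self, nodes):
        # The broker's view of the cluster: which node serves which segment.
        self.routing = {seg: node for node in nodes for seg in node.segments}

    def count_by(self, segment_ids, dimension):
        merged = Counter()
        for seg in segment_ids:
            node = self.routing[seg]              # find the node holding the segment
            merged += node.query(seg, dimension)  # merge that node's partial result
        return dict(merged)


historical = DataNode("historical-1", {"day1_p0": [{"country": "US"}, {"country": "DE"}]})
realtime = DataNode("realtime-1", {"day1_p1": [{"country": "US"}]})
broker = Broker([historical, realtime])
result = broker.count_by(["day1_p0", "day1_p1"], "country")
print(result)
```

In this sketch the client only talks to the broker; the two partitions of the same day live on different nodes, and the counts are combined before the result is returned.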

Cluster management

Operations relating to data management in historical nodes are overseen by coordinator nodes. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
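
As a rough illustration of what overseeing data management with redundant storage can mean, the following Python sketch assigns each segment to several distinct historical nodes. It is a simplified assumption, not Druid's actual balancing algorithm: the function name `assign_segments`, the `replicas` parameter, and the least-loaded placement rule are all hypothetical.

```python
def assign_segments(segments, nodes, replicas=2):
    """Assign each segment to `replicas` distinct nodes, preferring the least-loaded ones.

    Illustrative only: a real coordinator takes many more factors into account.
    """
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")

    load = {node: 0 for node in nodes}  # number of segments currently on each node
    assignment = {}
    for segment in segments:
        # Sort by (load, name) so that ties are broken deterministically.
        chosen = sorted(nodes, key=lambda n: (load[n], n))[:replicas]
        assignment[segment] = chosen
        for node in chosen:
            load[node] += 1
    return assignment


plan = assign_segments(["seg-a", "seg-b", "seg-c"], ["hist-1", "hist-2", "hist-3"])
for segment, holders in plan.items():
    print(segment, "->", holders)
```

With two replicas on distinct nodes, the loss of any single historical node in this toy model still leaves at least one copy of every segment available.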

Features

Performance

In 2019, researchers compared the performance of Hive, Presto, and Druid using a denormalized Star Schema Benchmark based on the TPC-H standard. Druid was tested in two configurations: “Druid Best”, which uses tables with hashed partitions, and “Druid Suboptimal”, which does not. [23]

Tests were conducted by running the benchmark's 13 queries at Scale Factor 30 (a 30 GB database), Scale Factor 100 (a 100 GB database), and Scale Factor 300 (a 300 GB database).

Scale Factor   Hive   Presto   Druid Best   Druid Suboptimal
30             256s   33s      2.09s        3.21s
100            424s   90s      6.12s        8.08s
300            982s   452s     7.60s        20.02s

In every scenario, Druid's measured time was roughly 98% or more below Hive's and at least 90% below Presto's, even when using the Druid Suboptimal configuration.
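
These percentages can be checked against the reported times. The Python sketch below computes the relative reduction in time, (1 - druid / baseline) × 100, for each scale factor; the numbers are copied from the table above, while the names `TIMES`, `reduction_pct`, and `worst_case_reduction` are illustrative.

```python
# Times in seconds, copied from the benchmark table.
TIMES = {
    30: {"Hive": 256.0, "Presto": 33.0, "Druid Best": 2.09, "Druid Suboptimal": 3.21},
    100: {"Hive": 424.0, "Presto": 90.0, "Druid Best": 6.12, "Druid Suboptimal": 8.08},
    300: {"Hive": 982.0, "Presto": 452.0, "Druid Best": 7.60, "Druid Suboptimal": 20.02},
}


def reduction_pct(baseline, druid):
    """Percentage by which the Druid time is lower than the baseline time."""
    return (1.0 - druid / baseline) * 100.0


def worst_case_reduction(baseline_name, druid_name="Druid Suboptimal"):
    """Smallest reduction across all scale factors for one baseline system."""
    return min(
        reduction_pct(row[baseline_name], row[druid_name]) for row in TIMES.values()
    )


for scale_factor, row in TIMES.items():
    vs_hive = reduction_pct(row["Hive"], row["Druid Suboptimal"])
    vs_presto = reduction_pct(row["Presto"], row["Druid Suboptimal"])
    print(f"SF {scale_factor}: {vs_hive:.2f}% below Hive, {vs_presto:.2f}% below Presto")
```

The smallest reduction against Hive occurs at Scale Factor 300 and the smallest against Presto at Scale Factor 30, so those two rows determine the "at least" figures.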

References

  1. "Apache Druid at GitHub". github.com. Retrieved 4 May 2021.
  2. Error: Unable to display the reference properly. See the documentation for details.
  3. Hemsoth, Nicole (8 November 2012). "Druid Summons Strength in Real-Time". Datanami. Archived from the original on 2013-02-27. Retrieved 2014-02-07.
  4. "Druid | Powered by Druid". druid.apache.org. Retrieved 2016-06-29.
  5. Butler, Brandon (20 June 2016). "Under the hood of Cisco's Tetration Analytics platform". Archived from the original on 2024-04-26. Retrieved 2016-06-23.
  6. "Druid at Pulsar - ebay的专栏 - 博客频道 - CSDN.NET". blog.csdn.net. Retrieved 2016-06-23.
  7. Streaming SQL and Druid by Arup Malakar, retrieved 2020-01-29
  8. "The Netflix Tech Blog: Announcing Suro: Backbone of Netflix's Data Pipeline". techblog.netflix.com. Retrieved 2016-06-23.
  9. Pinterest: Powering Ad Analytics with Apache Druid, retrieved 2020-01-29
  10. "Scaling Reporting at Reddit - Upvoted". www.redditinc.com. 26 February 2021. Retrieved 2022-09-13.
  11. "Interactive Analytics at MoPub: Querying Terabytes of Data in Seconds". blog.twitter.com. Retrieved 2020-01-29.
  12. Nayak, Amaresh (2018-02-23). "Event Stream Analytics at Walmart with Druid". Medium. Retrieved 2020-01-29.
  13. "Conferences - O'Reilly Media".
  14. "Complementing Hadoop at Yahoo: Interactive Analytics with Druid". Retrieved 2016-06-23.
  15. "Druid: A Real-time Analytical Data Store" (PDF).
  16. Tschetter, Eric (24 October 2012). "Introducing Druid". druid.apache.org. Archived from the original on 2022-02-08. Retrieved 2019-06-12.
  17. Higginbotham, Stacey (24 October 2012). "Metamarkets open sources Druid, its in-memory database". GigaOM. Archived from the original on 2021-09-18. Retrieved 2014-02-07.
  18. "Metamarkets Open Sources Druid, Streaming Real-Time Data Store". Yahoo News. 2012-10-24. Retrieved 2023-07-24.
  19. Harris, Derrick (2015-02-20). "The Druid real-time database moves to an Apache license". Archived from the original on 2015-08-22. Retrieved 2015-08-04.
  20. "Druid Gets Open Source-ier Under the Apache License". Retrieved 2015-08-04.
  21. "Druid Project Documentation".
  22. Yang, Fangjin; Tschetter, Eric; Léauté, Xavier; Ray, Nelson; Merlino, Gian; Ganguli, Deep. "Druid: A Real-time Analytical Data Store" (PDF). Metamarkets. Retrieved 6 February 2014.
  23. Correia, José; Costa, Carlos; Santos, Maribel Yasmina (2019). "Challenging SQL-on-Hadoop Performance with Apache Druid". In Abramowicz, Witold; Corchuelo, Rafael (eds.). Business Information Systems. Lecture Notes in Business Information Processing. Vol. 353. Cham: Springer International Publishing. pp. 149–161. doi:10.1007/978-3-030-20485-3_12. hdl:1822/66785. ISBN 978-3-030-20485-3. S2CID 190005302.