Apache Drill

Apache Drill
Developer(s)	Apache Software Foundation
Initial release	May 19, 2015;8 years ago
Stable release	1.20.3 / January 7, 2023;6 months ago
Repository	Drill Repository
Written in	Java
Operating system	Cross-platform
License	Apache License 2.0
Website	drill.apache.org

Last updated July 08, 2023

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR,^[1]^[2] Drill is inspired by Google's Dremel system.^[3] Drill is an Apache top-level project.^[4] Tom Shiran is the founder of the Apache Drill Project.^[5] It was designated an Apache Software Foundation top-level project in December 2016.^[6]

Features

One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.^[8]

Schema-free JSON document model similar to MongoDB and Elasticsearch, without requiring a formal schema to be declared
Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
Extremely user and developer friendly
Pluggable architecture enables connectivity to multiple datastores
Version 1.9 added dynamic user-defined functions
Version 1.11 added cryptographic-related functions and PCAP file format support

Back-end Support

Drill is primarily focused on non-relational datastores, including Apache Hadoop text files, NoSQL, and cloud storage. A notable feature also includes in situ querying of local JSON and Apache Parquet files. Some additional datastores that it supports include:

All Hadoop distributions (HDFS API 2.3+), including Apache Hadoop, MapR, CDH and Amazon EMR
NoSQL: MongoDB, Apache HBase, Apache Cassandra
Online Analytical Processing: Apache Kudu, Apache Druid, OpenTSDB
Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage, Swift, IBM Cloud Object Storage
Diverse data formats, including Apache Avro, Apache Parquet and JSON
RDBMs storage plugins (Using JDBC to connect to MySQL, PostgreSQL, and others)

A new datastore can be added by developing a storage plugin. Drill's "schema-free" JSON data model enables it to query non-relational datastores in-situ .^[9]

Front-end Support

Drill itself can be queried via JDBC, ODBC, or REST through a variety of methods and languages including Python and Java. The default install includes a web interface allowing end-users to execute ANSI SQL directly and export data tables as CSV files without any programming.

The dashboard library, Apache Superset,^[10] is particularly well suited for visualization of data queried with Drill.

Related Research Articles

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Apache CouchDB is an open-source document-oriented NoSQL database, implemented in Erlang.

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard is held on a separate database server instance, to spread load.

DataNucleus is an open source project which provides software products around data management in Java. The DataNucleus project started in 2008.

Structured storage is computer storage for structured data, often in the form of a distributed database. Computer software formally known as structured storage systems include Apache Cassandra, Google's Bigtable and Apache HBase.

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

A cloud database is a database that typically runs on a cloud computing platform and access to the database is provided as-a-service. There are two common deployment models: users can run databases on the cloud independently, using a virtual machine image, or they can purchase access to a database service, maintained by a cloud database provider. Of the databases available on the cloud, some are SQL-based and some use a NoSQL data model.

<span class="mw-page-title-main">Oracle NoSQL Database</span>

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Apache Trafodion is an open-source Top-Level Project at the Apache Software Foundation. It was originally developed by the information technology division of Hewlett-Packard Company and HP Labs to provide the SQL query language on Apache HBase targeting big data transactional or operational workloads. The project was named after the Welsh word for transactions. As of April 2021, it is no longer actively developed.

Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data through SQL. Phoenix compiles queries and other statements into native NoSQL store APIs rather than using MapReduce enabling the building of low latency applications on top of NoSQL stores.

Presto is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata, and allows use of multiple data sources within a query. Presto is community-driven open-source software released under the Apache License.

Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

DBeaver is a SQL client software application and a database administration tool. For relational databases it uses the JDBC application programming interface (API) to interact with databases via a JDBC driver. For other databases (NoSQL) it uses proprietary database drivers. It provides an editor that supports code completion and syntax highlighting. It provides a plug-in architecture that allows users to modify much of the application's behavior to provide database-specific functionality or features that are database-independent. This is a desktop application written in Java and based on Eclipse platform.

Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino can query datalakes that contain open column-oriented data file formats like ORC or Parquet residing on different storage systems like HDFS, AWS S3, Google Cloud Storage, or Azure Blob Storage using the Hive and Iceberg table formats. Trino also has the ability to run federated queries that query tables in different data sources such as MySQL, PostgreSQL, Cassandra, Kafka, MongoDB and Elasticsearch. Trino is released under the Apache License.

References

↑ Friedman, Ellen (21 Sep 2015). "Apache Drill: Tracking its history as an open source community". Archived from the original on 18 March 2016.
↑ "Brief About The Differences between Apache Drill Vs Presto". HitechNectar. Retrieved 2023-04-13.
↑ "Spark SQL vs. Apache Drill-War of the SQL-on-Hadoop Tools". ProjectPro. Retrieved 2022-11-15.
↑ "The Apache Software Foundation Announces Apache Drill as a Top-Level Project" . Retrieved 2014-12-02.
↑ Vizard, Michael (2021-09-01). "Apache Software Foundation updates Drill for broader SQL queries". VentureBeat. Retrieved 2022-10-20.
↑ "Apache Drill Eliminates ETL, Data Transformation for MapR Database". The New Stack. 2016-04-11. Retrieved 2022-11-15.
↑ "Apache Drill - Schema-free SQL for Hadoop, NoSQL and Cloud Storage". drill.apache.org. Retrieved 2015-12-29.
↑ "DrillProposal - INCUBATOR - Apache Software Foundation".
↑ "Frequently Asked Questions - Apache Drill". drill.apache.org. Retrieved 2015-12-29.
↑ Wayner, James R. Borck, Martin Heller, Steven Nuñez, Andrew C. Oliver, Ian Pointer and Peter (2020-10-05). "The best open source software of 2020". InfoWorld. Retrieved 2022-11-26.

Papers

Some papers influenced the birth and design. Here is a partial list:

2005 From Databases to Dataspaces: A New Abstraction for Information Management, the authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system's understanding of the data.
2010 Dremel: Interactive Analysis of Web-Scale Datasets

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Friedman, Ellen (21 Sep 2015). "Apache Drill: Tracking its history as an open source community". Archived from the original on 18 March 2016.

[2] "Brief About The Differences between Apache Drill Vs Presto". HitechNectar. Retrieved 2023-04-13.

[3] "Spark SQL vs. Apache Drill-War of the SQL-on-Hadoop Tools". ProjectPro. Retrieved 2022-11-15.

[announce-4] "The Apache Software Foundation Announces Apache Drill as a Top-Level Project" . Retrieved 2014-12-02.

[5] Vizard, Michael (2021-09-01). "Apache Software Foundation updates Drill for broader SQL queries". VentureBeat. Retrieved 2022-10-20.

[6] "Apache Drill Eliminates ETL, Data Transformation for MapR Database". The New Stack. 2016-04-11. Retrieved 2022-11-15.

[7] "Apache Drill - Schema-free SQL for Hadoop, NoSQL and Cloud Storage". drill.apache.org. Retrieved 2015-12-29.

[8] "DrillProposal - INCUBATOR - Apache Software Foundation".

[9] "Frequently Asked Questions - Apache Drill". drill.apache.org. Retrieved 2015-12-29.

[10] Wayner, James R. Borck, Martin Heller, Steven Nuñez, Andrew C. Oliver, Ian Pointer and Peter (2020-10-05). "The best open source software of 2020". InfoWorld. Retrieved 2022-11-26.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

v t e The Apache Software Foundation
Top-level projects	Accumulo ActiveMQ AGE Airflow Ambari Ant Aries Arrow Apache HTTP Server APR Avro Axis Axis2 Beam Bloodhound Brooklyn Buildr Calcite Camel CarbonData Cassandra Cayenne Chemistry CloudStack Cocoon Cordova CouchDB cTAKES CXF Derby Directory Drill Druid Empire-db Felix Flex Flink Flume FreeMarker Geronimo Giraph Gump Hadoop HBase Helix Hive Impala Jackrabbit James Jena Jini JMeter Kafka Kudu Kylin Lucene Mahout Maven MINA mod_perl MyFaces NiFi NetBeans Nutch OFBiz Oozie OpenEJB OpenJPA OpenNLP OрenOffice ORC PDFBox Parquet Phoenix POI Pig Pinot Pivot Qpid Roller RocketMQ Samza ServiceMix Shiro SINGA Sling Solr Spark Storm SpamAssassin Struts 1 Struts 2 Subversion Superset SystemDS Tapestry Thrift Tika Tomcat Trafodion Traffic Server UIMA Velocity Wicket Xalan Xerces XMLBeans Yetus ZooKeeper
Commons	BCEL BSF Daemon Jelly Logging
Incubator	MXNet NuttX Taverna
Other projects	Batik Chainsaw FOP Ivy Log4j
Attic	Abdera Apex AxKit Beehive Bluesky iBATIS C++ Standard Library Cactus Click Continuum Deltacloud Etch Excalibur Forrest Hama Harmony HiveMind Jakarta Lenya Marmotta ODE Shale Shindig Slide Sqoop Stanbol Tuscany Wave Wink XML
Licenses	Apache License
Category