Dremel (software)

Last updated May 26, 2023

Dremel is a distributed system developed at Google for interactively querying large datasets.

Dremel is the query engine used in Google's BigQuery service.^[1]

Dremel is the inspiration for Apache Drill,^[2] Apache Impala,^[3] and Dremio,^[4] an Apache-licensed platform that includes a distributed SQL execution engine.

In 2020, the Dremel paper received the Test of Time Award^[5] at the VLDB conference, recognizing the innovations it pioneered.

Related Research Articles

A time series database (TSDB) is a software system that is optimized for storing and serving time series through associated pairs of time(s) and value(s). In some fields, time series may be called profiles, curves, traces or trends. Several early time series databases are associated with industrial applications which could efficiently store measured values from sensory equipment (also referred to as data historians), but now are used in support of a much wider range of applications.

Mosha Pasumansky is one of the inventors of the MultiDimensional eXpressions (MDX) language, a query language for online analytical processing (OLAP) databases. Pasumansky is also one of the architects of Microsoft Analysis Services and an OLAP expert.

Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications. It is a project of the Apache Software Foundation.

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

BigQuery is Google's fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform as a Service (PaaS) that supports querying using ANSI SQL. It also has built-in machine learning capabilities. BigQuery was announced in May 2010 and made generally available in November 2011.

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly through contributions from MapR developers, Drill is inspired by Google's Dremel system, also productized as BigQuery. Tomer Shiran is the founder of the Apache Drill project. Drill was designated an Apache Software Foundation top-level project in December 2014.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Cloud analytics is a marketing term for businesses to carry out analysis using cloud computing. It uses a range of analytical tools and techniques to help companies extract information from massive data and present it in a way that is easily categorised and readily available via a web browser.

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Databricks is an American enterprise software company founded by the creators of Apache Spark. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks. The company develops Delta Lake, an open-source project to bring reliability to data lakes for machine learning and other data science use cases.

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.

Presto is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata, and allows use of multiple data sources within a query. Presto is community-driven open-source software released under the Apache License.

Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability. It is the third-generation distributed messaging middleware, open-sourced by Alibaba in 2012. On November 21, 2016, Alibaba donated RocketMQ to the Apache Software Foundation. The following year, on February 20, the Apache Software Foundation announced Apache RocketMQ as a Top-Level Project.

Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop, namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Daniel Abadi is the Darnell-Kanal Professor of Computer Science at the University of Maryland, College Park and was named an ACM Fellow in 2020. His primary area of research is database systems. He is known for his contributions to distributed databases, column-store databases, deterministic databases, graph databases, and stream databases. Specifically, he developed the storage and query execution engines of the C-Store prototype, which was commercialized as Vertica, a company acquired by Hewlett-Packard in 2011. His HadoopDB research on fault-tolerant scalable analytical database systems was commercialized by Hadapt, which was acquired by Teradata in 2014.

Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino can query datalakes that contain open column-oriented data file formats like ORC or Parquet residing on different storage systems like HDFS, AWS S3, Google Cloud Storage, or Azure Blob Storage using the Hive and Iceberg table formats. Trino also has the ability to run federated queries that query tables in different data sources such as MySQL, PostgreSQL, Cassandra, Kafka, MongoDB and Elasticsearch. Trino is released under the Apache License.

References

[1] "BigQuery under the hood". Retrieved 2023-05-25.
[2] "Apache Drill - Architecture Introduction". Retrieved 2017-10-08.
[3] "Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real". Retrieved 2017-10-08.
[4] "Recognizing A New Tier". Retrieved 2018-05-01.
[5] "VLDB Test of Time Award". www.vldb.org. Retrieved 2021-01-20.

Melnik, Sergey; Gubarev, Andrey; Long, Jing Jing; Romer, Geoffrey; Shivakumar, Shiva; Tolton, Matt; Vassilakis, Theo (2010). "Dremel: Interactive Analysis of Web-Scale Datasets". Proc. of the 36th Int'l Conf on Very Large Data Bases: 330–339.
