Apache Impala

Developer(s): Apache Software Foundation
Initial release: April 28, 2013
Stable release: 4.1.0 / June 28, 2022 [1]
Repository: Impala Repository
Written in: C++, Java
Operating system: Cross-platform
Type: Relational Hadoop analytics
License: Apache License 2.0
Website: impala.apache.org

Apache Impala is an open-source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. [2] Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012. [3]

Description

Apache Impala is a query engine that runs on Apache Hadoop. The project was announced in October 2012 with a public beta test distribution [4] [5] and became generally available in May 2013. [6]

Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.

Impala is promoted as a tool for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools. The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata, removing the need to migrate data sets into specialized systems or proprietary formats simply to perform analysis.

Features include:

- Support for HDFS and Apache HBase storage
- Reading Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet
- Hadoop security (Kerberos authentication)
- Fine-grained, role-based authorization with Apache Sentry
- Use of metadata, ODBC driver, and SQL syntax from Apache Hive

In early 2013, a column-oriented file format called Parquet was announced for architectures including Impala. [7] In December 2013, Amazon Web Services announced support for Impala. [8] In early 2014, MapR added support for Impala. [9] In 2015, another format called Kudu was announced, which Cloudera proposed to donate to the Apache Software Foundation along with Impala. [10] Impala graduated to an Apache Top-Level Project (TLP) on 28 November 2017. [11]

Related Research Articles

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

Apache Hive: Database engine

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Without such an interface, SQL queries must be implemented in the MapReduce Java API to execute over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

Within database management systems, RCFile is a data placement structure that determines how to store relational tables on computer clusters. It is designed for systems using the MapReduce framework. The RCFile structure includes a data storage format, a data compression approach, and optimization techniques for data reading. It meets all four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to dynamic data access patterns.

Hortonworks: American software company

Hortonworks was a data software company based in Santa Clara, California that developed and supported open-source software designed to manage big data and associated processing.

Apache Drill: Open-source software framework

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system, also productized as BigQuery. Drill is an Apache top-level project.

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

Dremel is a distributed system developed at Google for interactively querying large datasets.

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

Apache Spark: Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Apache Trafodion: Relational database management system for Apache Hadoop

Apache Trafodion is an open-source Top-Level Project at the Apache Software Foundation. It was originally developed by the information technology division of Hewlett-Packard Company and HP Labs to provide the SQL query language on Apache HBase targeting big data transactional or operational workloads. The project was named after the Welsh word for transactions. As of April 2021, it is no longer actively developed.

Apache Phoenix is an open-source, massively parallel relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data through SQL. Phoenix compiles queries and other statements into native NoSQL store APIs rather than using MapReduce, enabling low-latency applications on top of NoSQL stores.

Presto is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata, and allows use of multiple data sources within a query. Presto is community-driven open-source software released under the Apache License.

Apache Kylin: Open-source distributed analytics engine

Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Apache Kudu: Open-source column-oriented data store

Apache Kudu is a free and open-source column-oriented data store of the Apache Hadoop ecosystem. It is compatible with most of the data processing frameworks in the Hadoop environment. It completes Hadoop's storage layer, enabling fast analytics on fast data.

Apache ORC: Column-oriented data storage format

Apache ORC is a free and open-source column-oriented data storage format. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem such as RCFile and Parquet. It is used by most of the data processing frameworks in the Hadoop ecosystem, including Apache Spark, Apache Hive, Apache Flink, and Apache Hadoop.

Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Trino (SQL query engine)

Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino can query data lakes that contain open column-oriented data file formats like ORC or Parquet residing on different storage systems like HDFS, AWS S3, Google Cloud Storage, or Azure Blob Storage, using the Hive and Iceberg table formats. Trino can also run federated queries that reach tables in different data sources such as MySQL, PostgreSQL, Cassandra, Kafka, MongoDB, and Elasticsearch. Trino is released under the Apache License.

References

  1. @ApacheImpala (June 27, 2022). "The Apache Impala team is pleased to announce the release of Impala 4.1.0" (Tweet) via Twitter.
  2. "Apache Impala" . Retrieved 15 September 2017.
  3. Cade Metz (October 24, 2012). "Man Busts Out of Google, Rebuilds Top-Secret Query Machine". Wired Magazine. Retrieved October 10, 2016.
  4. Larry Dignan (October 24, 2012). "Cloudera aims to bring real-time queries to Hadoop, big data". Between the Lines blog. ZDNet. Retrieved January 20, 2014.
  5. Andrew Brust (October 25, 2012). "Cloudera's Impala brings Hadoop to SQL and BI". ZDNet. Retrieved January 20, 2014.
  6. Marcel Kornacker, Justin Erickson (May 1, 2013). "Cloudera Impala 1.0: It's Here, It's Real, It's Already the Standard for SQL on Hadoop". Archived from the original on April 13, 2014. Retrieved April 10, 2014.
  7. "Parquet: Columnar Storage for Hadoop". Project web site. 2013. Retrieved January 20, 2014.
  8. "Announcing Support for Impala with Amazon Elastic MapReduce". Amazon.com. December 12, 2013. Retrieved January 20, 2014.
  9. "Impala for MapR". MapR.com. February 2, 2014. Retrieved April 10, 2014.
  10. David Ramel (November 18, 2015). "Cloudera to Donate Impala and Kudu Big Data Projects to Apache". Application Development Trends. Retrieved October 10, 2016.
  11. "The Apache Software Foundation Announces Apache Impala as a Top-Level Project". November 28, 2017. Retrieved November 30, 2017.