Presto (SQL query engine)

Last updated
Presto
Original author(s) Martin Traverso, Dain Sundstrom, David Phillips, Eric Hwang
Initial release10 November 2013;9 years ago (10 November 2013)
Written in Java
Operating system Cross-platform
Standard(s) SQL
Type Data warehouse
License Apache License 2.0
Website

Presto (including PrestoDB, and PrestoSQL which was re-branded to Trino ) is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata, [1] and allows use of multiple data sources within a query. Presto is community-driven open-source software released under the Apache License.

Contents

History

Presto was originally designed and developed at Facebook, Inc. (later renamed Meta) for their data analysts to run interactive queries on its large data warehouse in Apache Hadoop. The first four developers were Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang. Before Presto, the data analysts at Facebook relied on Apache Hive for running SQL analytics on their multi-petabyte data warehouse. [2] Hive was deemed too slow for Facebook's scale and Presto was invented to fill the gap to run fast queries. [3] Original development started in 2012 and deployed at Facebook later that year. In November 2013, Facebook announced its open source release. [3] [4]

In 2014, Netflix disclosed they used Presto on 10 petabytes of data stored in the Amazon Simple Storage Service (S3). [5] In November, 2016, Amazon announced a service called Athena that was based on Presto. [6] In 2017, Teradata spun out a company called Starburst Data to commercially support Presto, which included staff acquired from Hadapt in 2014. [7] Teradata's QueryGrid software allowed Presto to access a Teradata relational database. [8]

In January 2019, the Presto Software Foundation was announced. The foundation is a not-for-profit organization for the advancement of the Presto open source distributed SQL query engine. [9] [10] At the same time, Presto development forked: PrestoDB maintained by Facebook, and PrestoSQL maintained by the Presto Software Foundation, with some cross pollination of code.

In September 2019, Facebook donated PrestoDB to the Linux Foundation, establishing the Presto Foundation. [11] Neither the creators of Presto, nor the top contributors and committers, were invited to join this foundation. [12]

By 2020, all four of the original Presto developers had joined Starburst. [13] In December 2020, PrestoSQL was rebranded as Trino, since Facebook had obtained a trademark on the name "Presto" (also donated to the Linux Foundation). [14]

Another company called Ahana was announced in 2020 to commercialize the PrestoDB fork as a cloud service and was acquired by IBM in 2023. [15]

Architecture

Presto's architecture is very similar to other database management systems using cluster computing, sometimes called massively parallel processing (MPP). One coordinator works in sync with multiple workers. Clients submit SQL statements that are parsed and planned, following which parallel tasks are scheduled to workers. Workers jointly process rows from the data sources and produce results that are returned to the client. Compared to the original Apache Hive execution model which used the Hadoop MapReduce mechanism on each query, Presto does not write intermediate results to disk, resulting in a significant speed improvement. Presto is written in Java.

A Presto query can combine data from multiple sources. Presto offers connectors to data sources including files in Alluxio, Hadoop Distributed File System (often called a data lake), Amazon S3, MySQL, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Apache Kudu, Apache Phoenix, Apache Kafka, Apache Cassandra, Apache Accumulo, MongoDB and Redis. Unlike other Hadoop distribution-specific tools, such as Apache Impala, Presto can work with any variant of Hadoop or without it. Presto supports separation of compute and storage and may be deployed on-premises or using cloud computing.

See also

Related Research Articles

<span class="mw-page-title-main">IBM Db2</span> Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

<span class="mw-page-title-main">Greenplum</span>

Greenplum is a big data technology based on MPP architecture and the Postgres open source database technology. The technology was created by a company of the same name headquartered in San Mateo, California around 2005. Greenplum was acquired by EMC Corporation in July 2010.

<span class="mw-page-title-main">Apache Hive</span> Database engine

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids the portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

HPCC, also known as DAS, is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.

A cloud database is a database that typically runs on a cloud computing platform and access to the database is provided as-a-service. There are two common deployment models: users can run databases on the cloud independently, using a virtual machine image, or they can purchase access to a database service, maintained by a cloud database provider. Of the databases available on the cloud, some are SQL-based and some use a NoSQL data model.

<span class="mw-page-title-main">Apache Drill</span> Open-source software framework

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Drill is an Apache top-level project. Tom Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Cloud analytics is a marketing term for businesses to carry out analysis using cloud computing. It uses a range of analytical tools and techniques to help companies extract information from massive data and present it in a way that is easily categorised and readily available via a web browser.

<span class="mw-page-title-main">Hue (software)</span> Open-source SQL Cloud Editor

Hue is an open-source SQL Cloud Editor, licensed under the Apache License 2.0.

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

<span class="mw-page-title-main">Apache Kylin</span> Open-source distributed analytics engine

Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

<span class="mw-page-title-main">Apache ORC</span> Column-oriented data storage format

Apache ORC is a free and open-source column-oriented data storage format. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem such as RCFile and Parquet. It is used by most of the data processing frameworks Apache Spark, Apache Hive, Apache Flink and Apache Hadoop.

Alluxio is an open-source virtual distributed file system (VDFS). Initially as research project "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. Thesis, advised by Professor Scott Shenker & Professor Ion Stoica. Alluxio sits between computation and storage in the big data analytics stack. It provides a data abstraction layer for computation frameworks, enabling applications to connect to numerous storage systems through a common interface. The software is published under the Apache License.

Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

<span class="mw-page-title-main">Apache Pinot</span> Open-source distributed data store

Apache Pinot is a column-oriented, open-source, distributed data store written in Java. Pinot is designed to execute OLAP queries with low latency. It is suited in contexts where fast analytics, such as aggregations, are needed on immutable data, possibly, with real-time data ingestion. The name Pinot comes from the Pinot grape vines that are pressed into liquid that is used to produce a variety of different wines. The founders of the database chose the name as a metaphor for analyzing vast quantities of data from a variety of different file formats or streaming data sources.

<span class="mw-page-title-main">Trino (SQL query engine)</span>

Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino can query datalakes that contain open column-oriented data file formats like ORC or Parquet residing on different storage systems like HDFS, AWS S3, Google Cloud Storage, or Azure Blob Storage using the Hive and Iceberg table formats. Trino also has the ability to run federated queries that query tables in different data sources such as MySQL, PostgreSQL, Cassandra, Kafka, MongoDB and Elasticsearch. Trino is released under the Apache License.

Apache Iceberg is an open source high-performance format for huge analytic tables. Iceberg enables the use of SQL tables for big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables, at the same time. Iceberg is released under the Apache License. Iceberg addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments. Vendors currently supporting Apache Iceberg tables in their products include CelerData, Cloudera, Dremio, IOMETE, Snowflake, Starburst, Tabular, and AWS.

References

  1. 1.1. Teradata Distribution of Presto — Teradata Distribution of Presto 0.167-t.0.2 Documentation
  2. Mike Volpi (November 20, 2019). "Starburst and Presto: with Stellar Velocity". Index Ventures Blog. Retrieved January 27, 2022.
  3. 1 2 Joab Jackson (November 6, 2013). "Facebook goes open source with query engine for big data". Computer World. Retrieved April 26, 2017.
  4. Jordan Novet (June 6, 2013). "Facebook unveils Presto engine for querying 250 PB data warehouse". Giga Om. Retrieved April 26, 2017.
  5. Eva Tse; Zhenxiao Luo; Nezih Yigitbasi (October 7, 2014). "Using Presto in our Big Data Platform on AWS". Netflix technical blog. Retrieved April 26, 2017.
  6. Jeff Barr (November 30, 2016). "Amazon Athena – Interactive SQL Queries for Data in Amazon S3". AWS News Blog. Retrieved January 27, 2022.
  7. Philip Howard (December 21, 2017). "Teradata spins off Starburst". Bloor. Retrieved January 26, 2022.
  8. Lindsay Clark (December 17, 2020). "Hey Presto! Teradata admits its vision is dead by hooking QueryGrid analytics platform up to rival data warehouses". The Register. Retrieved January 26, 2022.
  9. "Presto Software Foundation Launches to Advance Presto Open Source Community". Press release. January 31, 2019. Retrieved January 2, 2022.
  10. "Presto's New Foundation Signals Growth for the Big Data SQL Engine". The New Stack. 2019-01-31. Retrieved 2019-02-01.
  11. "Facebook, Uber, Twitter and Alibaba form Presto Foundation to Tackle Distributed Data Processing at Scale". 23 September 2019. Retrieved 2019-11-12.
  12. Piotr Findeisen (November 22, 2019). "What's the relationship between prestosql and prestodb?". Comment on issue #38 of Trino Github. Retrieved January 27, 2022.
  13. "Original Presto Co-Creators Reunite on the Starburst Technical Leadership Team". Press release. September 22, 2020. Retrieved January 26, 2022.
  14. Martin Traverso, Dain Sundstrom, David Phillips (December 27, 2020). "We're rebranding PrestoSQL as Trino". Trino blog. Retrieved January 26, 2022.{{cite web}}: CS1 maint: multiple names: authors list (link)
  15. Gillin, Paul (14 April 2023). "IBM acquires Ahana, joins the Presto Foundation". SiliconANGLE. Retrieved 20 April 2023.