Apache Iceberg

Original author(s) Ryan Blue, Daniel Weeks
Initial release 10 August 2017
Written in Java, Python
Operating system Cross-platform
Type Data warehouse, Data lake
License Apache License 2.0
Website iceberg.apache.org

Apache Iceberg is a high-performance open-source format for large analytic tables. Iceberg enables the use of SQL tables for big data while making it possible for engines such as Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables at the same time. [1] Iceberg is released under the Apache License. [2] It addresses the performance and usability challenges of Apache Hive tables in large and demanding data lake environments. [3] Vendors supporting Apache Iceberg tables include Buster, [4] CelerData, Cloudera, Crunchy Data, [5] Dremio, IOMETE, Snowflake, Starburst, Tabular, [6] AWS, [7] and Google Cloud. [8]

History

Iceberg was started at Netflix by Ryan Blue and Dan Weeks. Hive was used by many different services and engines in the Netflix infrastructure, but it could not guarantee correctness and did not provide stable atomic transactions. [3] To avert unintended consequences from the Hive format, many at Netflix avoided using these services or making changes to the data. [3] Ryan Blue set out to address three problems with Hive tables by creating Iceberg: [3] [9]

  1. Ensure the correctness of the data and support ACID transactions.
  2. Improve performance by enabling finer-grained operations at file granularity, allowing more efficient writes.
  3. Simplify and abstract general operation and maintenance of tables.

Iceberg development started in 2017. [10] The project was open-sourced and donated to the Apache Software Foundation in November 2018. [11] In May 2020, the Iceberg project graduated to become a top-level Apache project. [11]

Iceberg is used by companies including Airbnb, [12] Apple, [3] Expedia, [13] LinkedIn, [14] Adobe, [15] and Lyft, among others. [16]

References

  1. "Apache Iceberg". iceberg.apache.org. Retrieved 5 October 2022.
  2. "apache/iceberg GitHub License". The Apache Software Foundation. 5 October 2022. Retrieved 5 October 2022.
  3. Woodie, Alex (8 February 2021). "Apache Iceberg: The Hub of an Emerging Data Service Ecosystem?". Datanami. Archived from the original on 4 September 2024. Retrieved 5 October 2022.
  4. "Buster". Archived from the original on 2024-09-09. Retrieved 2024-09-09.
  5. Woodie, Alex (24 July 2024). "Crunchy Data Goes All-in With Postgres". The Big Data Wire. Archived from the original on 13 September 2024. Retrieved 9 November 2024.
  6. "Vendors". iceberg.apache.org. Retrieved 2023-05-05.
  7. "Using Apache Iceberg tables – Amazon Athena". Amazon Web Services, Inc. Archived from the original on 2024-09-04. Retrieved 2023-06-16.
  8. "Google Cloud BigQuery tables for Apache Iceberg". Google Cloud, Inc. Archived from the original on 2024-11-22. Retrieved 2024-11-21.
  9. "Iceberg at Netflix and Beyond with Ryan Blue, EPISODE 1654 Transcript". Software Engineering Daily. 7 March 2024. Archived from the original on 10 November 2024. Retrieved 10 November 2024.
  10. "Initial public release in apache/iceberg". GitHub. Archived from the original on 4 September 2024. Retrieved 5 October 2022.
  11. "Incubation Status Template - Apache Incubator". incubator.apache.org. Archived from the original on 2022-10-05. Retrieved 2022-10-05.
  12. Zhu, Ronnie (26 September 2022). "Upgrading Data Warehouse Infrastructure at Airbnb". The Airbnb Tech Blog.
  13. Mathiesen, Christine (26 January 2021). "A Short Introduction to Apache Iceberg". Expedia Group Technology. Archived from the original on 5 October 2022. Retrieved 5 October 2022.
  14. "FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format". engineering.linkedin.com. Archived from the original on 2024-09-04. Retrieved 2022-10-05.
  15. Bremner, Jaemi (3 December 2020). "Iceberg at Adobe". Medium. Archived from the original on 4 September 2024. Retrieved 5 October 2022.
  16. Data Council. "Open Source Highlight: Apache Iceberg". www.datacouncil.ai. Archived from the original on 5 October 2022. Retrieved 5 October 2022.