Apache Iceberg

Original author(s) Ryan Blue, Daniel Weeks
Initial release 10 August 2017
Written in Java, Python
Operating system Cross-platform
Type Data warehouse, Data lake
License Apache License 2.0

Apache Iceberg is a high-performance open-source format for large analytic tables. Iceberg enables the use of SQL tables for big data while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables at the same time. [1] Iceberg is released under the Apache License. [2] Iceberg addresses the performance and usability challenges of Apache Hive tables in large and demanding data lake environments. [3] Vendors currently supporting Apache Iceberg tables include Buster, [4] CelerData, Cloudera, Crunchy Data, [5] Dremio, IOMETE, Snowflake, Starburst, Tabular, [6] AWS, [7] and Google Cloud. [8]

History

Iceberg was started at Netflix by Ryan Blue and Dan Weeks. Hive was used by many different services and engines in the Netflix infrastructure, but it could not guarantee correctness and did not provide stable atomic transactions. [3] Many at Netflix avoided using these services, or avoided changing the data at all, to prevent unintended consequences from the Hive format. [3] In creating Iceberg, Ryan Blue set out to address three problems with Hive tables: [3] [9]

  1. Ensure the correctness of the data and support ACID transactions.
  2. Improve performance by enabling finer-grained operations at file granularity for optimal writes.
  3. Simplify and abstract general operation and maintenance of tables.

Iceberg development started in 2017. [10] The project was open-sourced and donated to the Apache Software Foundation in November 2018. [11] In May 2020, the Iceberg project graduated to become a top-level Apache project. [11]

Iceberg is used by multiple companies including Airbnb, [12] Apple, [3] Expedia, [13] LinkedIn, [14] Adobe, [15] Lyft, and many more. [16]

Technical details

Apache Iceberg operates by abstracting table metadata from the underlying data storage. It maintains metadata files that track snapshots, schema information, partition layouts, and data file locations, enabling efficient and atomic table operations. [17]

At a high level, Iceberg organizes table data into snapshots. Each snapshot represents the state of the table at a particular point in time, allowing Iceberg to provide ACID-compliant transactional capabilities, including snapshot isolation, concurrent writes, and rollback functionality. The snapshot metadata is managed as a tree structure of manifest files and metadata files stored within the file system. [18]
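The snapshot model can be illustrated with a small in-memory sketch. This is a toy model, not the Iceberg API: the `ToyTable` class, its method names, and the file names are hypothetical, and it assumes that a commit is an optimistic check-and-swap of a single current-snapshot pointer, which is one common way to obtain atomic commits with concurrent writers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    data_files: frozenset  # immutable set of data file paths visible in this snapshot


class ToyTable:
    """Toy in-memory stand-in for a table (hypothetical, not the Iceberg API)."""

    def __init__(self):
        self.snapshots = {}      # snapshot_id -> Snapshot; old snapshots are kept
        self.current_id = None   # pointer to the current snapshot
        self._next_id = 1

    def current_files(self):
        if self.current_id is None:
            return frozenset()
        return self.snapshots[self.current_id].data_files

    def commit_append(self, new_files, expected_current_id):
        """Optimistic commit: succeeds only if no other writer committed first."""
        if expected_current_id != self.current_id:
            return False  # stale view; the writer should re-read and retry
        snap = Snapshot(self._next_id, self.current_id,
                        self.current_files() | frozenset(new_files))
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id  # the single pointer swap
        self._next_id += 1
        return True

    def files_as_of(self, snapshot_id):
        """Read the table as it was at an earlier snapshot."""
        return self.snapshots[snapshot_id].data_files

    def rollback_to(self, snapshot_id):
        """Make an earlier snapshot current again without deleting history."""
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id


table = ToyTable()
base = table.current_id  # both writers start from the same (empty) state

ok_a = table.commit_append(["a.parquet"], expected_current_id=base)
ok_b = table.commit_append(["b.parquet"], expected_current_id=base)  # stale, rejected
ok_b_retry = table.commit_append(["b.parquet"],
                                 expected_current_id=table.current_id)

files_now = table.current_files()      # both files are visible
files_then = table.files_as_of(1)      # only the first commit is visible
table.rollback_to(1)
files_after_rollback = table.current_files()
```

Because snapshots are immutable and readers always resolve one snapshot id, a reader in this sketch never sees a half-applied write; the rejected second commit shows how a conflicting writer is forced to retry against the new state.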

Iceberg commonly stores table data in the Apache Parquet file format, whose columnar layout is optimized for analytical queries; the Iceberg specification also allows data files in Apache ORC and Apache Avro. Parquet files store table rows in a compressed, column-oriented format, which reduces storage costs and improves read performance through techniques such as predicate pushdown and column pruning. Iceberg's manifest files reference the data files, so engines can quickly identify and access the relevant data during query execution. [19]
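Column pruning and predicate pushdown can be sketched with plain Python data structures. The example below is a simplified model of a columnar file, not the Parquet format or any Parquet library: `make_row_group`, `scan`, and the sample rows are hypothetical, and the only statistics assumed are a per-column minimum and maximum for each row group.

```python
def make_row_group(rows, columns):
    """Store a batch of rows column by column, with min/max stats per column."""
    data = {c: [r[c] for r in rows] for c in columns}
    stats = {c: (min(data[c]), max(data[c])) for c in columns}
    return {"data": data, "stats": stats, "num_rows": len(rows)}


def scan(row_groups, select, column, lower_bound):
    """Return the selected columns for rows where row[column] >= lower_bound."""
    out = []
    groups_read = 0
    for rg in row_groups:
        _, col_max = rg["stats"][column]
        if col_max < lower_bound:
            continue  # predicate pushdown: stats prove no row can match
        groups_read += 1
        needed = {c: rg["data"][c] for c in select}  # column pruning
        predicate_values = rg["data"][column]
        for i in range(rg["num_rows"]):
            if predicate_values[i] >= lower_bound:
                out.append(tuple(needed[c][i] for c in select))
    return out, groups_read


columns = ["id", "country", "amount"]
rows = [
    {"id": 1, "country": "US", "amount": 10},
    {"id": 2, "country": "DE", "amount": 25},
    {"id": 3, "country": "US", "amount": 70},
    {"id": 4, "country": "FR", "amount": 95},
]
row_groups = [make_row_group(rows[:2], columns),
              make_row_group(rows[2:], columns)]

# Roughly: SELECT id, amount WHERE amount >= 50
result, groups_read = scan(row_groups, select=["id", "amount"],
                           column="amount", lower_bound=50)
```

In this sketch the first row group is skipped from its statistics alone, and the `country` column is never touched, which is the effect the two techniques aim for on real columnar files.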

Apache Iceberg employs a multi-level metadata hierarchy for tracking table contents. [20] At the top, a table metadata file (often metadata.json) stores table-level information, such as the schema, partition specifications, the list of snapshots, and a pointer to the current "root" snapshot. [21] Each snapshot represents a consistent view of the table and is associated with a manifest list (an Avro file) that enumerates all manifest files for that snapshot. A manifest file is an index that lists a set of data files (e.g., Parquet files) along with metadata about each file, including row count, partition values, and column statistics such as minimum and maximum values. Manifests are small metadata files (often in Avro format) that split the table's metadata into pieces, so a query that filters by partition can skip entire manifests instead of reading a single, giant listing of all data files.

The metadata tree also provides a historical record of table changes: old snapshots and manifests are retained until they expire, which enables time travel. Query planning reads only the relevant manifest files rather than scanning all data files or directories, which avoids expensive operations such as directory listing and keeps metadata access efficient even for huge tables.
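The hierarchy described above (table metadata, snapshots, manifest lists, manifests, data files) can be modeled in a few lines to show how a scan is planned from metadata alone. This is an illustrative sketch, not Iceberg's file layout or API: the class names, the `plan_scan` function, the `event_date`-style partition strings, and the `user_id` column are all hypothetical, and the per-manifest partition summary is assumed to be available without opening the manifest.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class DataFile:
    path: str
    partition: str                       # partition value, e.g. an event date
    row_count: int
    bounds: Dict[str, Tuple[int, int]]   # column -> (min, max)


@dataclass
class Manifest:
    path: str
    files: List[DataFile]

    def partition_summary(self):
        # Assumption of this toy: the summary is known before the manifest
        # itself is read, so non-matching manifests are never opened.
        return {f.partition for f in self.files}


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: List[Manifest]        # all manifests for this snapshot


@dataclass
class TableMetadata:
    snapshots: Dict[int, Snapshot]
    current_snapshot_id: int


def plan_scan(meta, partition, column, value, snapshot_id: Optional[int] = None):
    """Pick data files that may hold rows with column == value in a partition."""
    if snapshot_id is None:
        snapshot_id = meta.current_snapshot_id
    snap = meta.snapshots[snapshot_id]
    selected, manifests_opened = [], 0
    for manifest in snap.manifest_list:
        if partition not in manifest.partition_summary():
            continue                      # prune the whole manifest
        manifests_opened += 1
        for f in manifest.files:
            lo, hi = f.bounds[column]
            if f.partition == partition and lo <= value <= hi:
                selected.append(f.path)   # min/max stats cannot rule it out
    return selected, manifests_opened


m1 = Manifest("m1.avro", [
    DataFile("d1.parquet", "2024-01-01", 100, {"user_id": (1, 50)}),
    DataFile("d2.parquet", "2024-01-01", 100, {"user_id": (51, 99)}),
])
m2 = Manifest("m2.avro", [
    DataFile("d3.parquet", "2024-01-02", 80, {"user_id": (1, 99)}),
])
meta = TableMetadata(
    snapshots={1: Snapshot(1, [m1]), 2: Snapshot(2, [m1, m2])},
    current_snapshot_id=2,
)

# Current snapshot: only m1 is opened, and only d2 can contain user_id 60.
files, opened = plan_scan(meta, "2024-01-01", "user_id", 60)

# Time travel to snapshot 1, where the 2024-01-02 partition did not exist yet.
old_files, old_opened = plan_scan(meta, "2024-01-02", "user_id", 60,
                                  snapshot_id=1)
```

No directory is listed and no data file is read during planning in this sketch; the list of candidate files comes entirely from the metadata objects, and older snapshots remain queryable because their manifest lists are kept.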

References

  1. "Apache Iceberg". iceberg.apache.org. Retrieved 5 October 2022.
  2. "apache/iceberg GitHub License". The Apache Software Foundation. 5 October 2022. Retrieved 5 October 2022.
  3. Woodie, Alex (8 February 2021). "Apache Iceberg: The Hub of an Emerging Data Service Ecosystem?". Datanami. Archived from the original on 4 September 2024. Retrieved 5 October 2022.
  4. "Buster". Archived from the original on 9 September 2024. Retrieved 9 September 2024.
  5. Woodie, Alex (24 July 2024). "Crunchy Data Goes All-in With Postgres". The Big Data Wire. Archived from the original on 13 September 2024. Retrieved 9 November 2024.
  6. "Vendors". iceberg.apache.org. Retrieved 5 May 2023.
  7. "Using Apache Iceberg tables – Amazon Athena". Amazon Web Services, Inc. Archived from the original on 4 September 2024. Retrieved 16 June 2023.
  8. "Google Cloud BigQuery tables for Apache Iceberg". Google Cloud, Inc. Archived from the original on 22 November 2024. Retrieved 21 November 2024.
  9. "Iceberg at Netflix and Beyond with Ryan Blue, EPISODE 1654 Transcript". Software Engineering Daily. 7 March 2024. Archived from the original on 10 November 2024. Retrieved 10 November 2024.
  10. "Initial public release in apache/iceberg". GitHub. Archived from the original on 4 September 2024. Retrieved 5 October 2022.
  11. "Incubation Status Template - Apache Incubator". incubator.apache.org. Archived from the original on 5 October 2022. Retrieved 5 October 2022.
  12. Zhu, Ronnie (26 September 2022). "Upgrading Data Warehouse Infrastructure at Airbnb". The Airbnb Tech Blog.
  13. Mathiesen, Christine (26 January 2021). "A Short Introduction to Apache Iceberg". Expedia Group Technology. Archived from the original on 5 October 2022. Retrieved 5 October 2022.
  14. "FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format". engineering.linkedin.com. Archived from the original on 4 September 2024. Retrieved 5 October 2022.
  15. Bremner, Jaemi (3 December 2020). "Iceberg at Adobe". Medium. Archived from the original on 4 September 2024. Retrieved 5 October 2022.
  16. Data Council. "Open Source Highlight: Apache Iceberg". www.datacouncil.ai. Archived from the original on 5 October 2022. Retrieved 5 October 2022.
  17. "Apache Iceberg Documentation". iceberg.apache.org. Retrieved 3 March 2025.
  18. "Apache Iceberg Specification". iceberg.apache.org. Retrieved 3 March 2025.
  19. "Apache Iceberg vs Parquet: File vs. Table Formats for Modern Data Lakes". decube.io. Retrieved 3 March 2025.
  20. "Apache Iceberg Specification". Apache Iceberg. Retrieved 17 March 2025.
  21. "A Hands-On Look at the Structure of an Apache Iceberg Table". Dremio. Retrieved 17 March 2025.