Apache Iceberg

Last updated
Apache Iceberg
Original author(s) Ryan Blue, Daniel Weeks
Initial release10 August 2017;6 years ago (10 August 2017)
Written in Java, Python
Operating system Cross-platform
Type Data warehouse, Data lake
License Apache License 2.0
Website

Apache Iceberg is an open-source high-performance format for huge analytic tables. Iceberg enables the use of SQL tables for big data while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables, at the same time. [1] Iceberg is released under the Apache License. [2] Iceberg addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments. [3] Vendors currently supporting Apache Iceberg tables in their products include CelerData, Cloudera, Dremio, IOMETE, Snowflake, Starburst, Tabular, [4] and AWS. [5]

Contents

History

Iceberg was started at Netflix by Ryan Blue and Dan Weeks. Hive was used by many different services and engines in the Netflix infrastructure. Hive was never able to guarantee correctness and did not provide stable atomic transactions. [3] Many at Netflix avoided using these services and making changes to the data to avert unintended consequences from the Hive format. [3] Ryan Blue set out to address three issues that faced the Hive table by creating Iceberg: [3]

  1. Ensure the correctness of the data and support ACID transactions.
  2. Improve performance by enabling finer-grained operations to be done at the file granularity for optimal writes.
  3. Simplify and obfuscate[ citation needed ] general operation and maintenance of tables.

Iceberg development started in 2017. [6] The project was open-sourced and donated to the Apache Software Foundation in November 2018. [7] In May 2020, the Iceberg project graduated to become a top-level Apache project. [7]

Iceberg is used by multiple companies including Airbnb, [8] Apple, [3] Expedia, [9] LinkedIn, [10] Adobe, [11] Lyft, and many more. [12]

See also

Related Research Articles

<span class="mw-page-title-main">OpenSSL</span> Open-source implementation of the SSL and TLS protocols

OpenSSL is a software library for applications that provide secure communications over computer networks against eavesdropping, and identify the party at the other end. It is widely used by Internet servers, including the majority of HTTPS websites.

<span class="mw-page-title-main">Apache Flex</span> Software development kit (SDK) for the development and deployment of rich web applications

Apache Flex, formerly Adobe Flex, is a software development kit (SDK) for the development and deployment of cross-platform rich web applications based on the Adobe Flash platform. Initially developed by Macromedia and then acquired by Adobe Systems, Adobe donated Flex to the Apache Software Foundation in 2011 and it was promoted to a top-level project in December 2012.

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

<span class="mw-page-title-main">Apache Hive</span> Database engine

Apache Hive is a data warehouse software project, built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids the portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

Within database management systems, the RCFile is a data placement structure that determines how to store relational tables on computer clusters. It is designed for systems using the MapReduce framework. The RCFile structure includes a data storage format, data compression approach, and optimization techniques for data reading. It is able to meet all the four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) a strong adaptivity to dynamic data access patterns.

<span class="mw-page-title-main">ERPNext</span> Enterprise resource planning software

ERPNext is a free and open-source integrated Enterprise resource planning (ERP) software developed by an Indian software company Frappe Technologies Pvt. Ltd. It is built on the MariaDB database system using Frappe, a Python based server-side framework.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Apache Allura is an open-source forge software for managing source code repositories, bug reports, discussions, wiki pages, blogs and more for any number of individual projects. Allura graduated from incubation with the Apache Software Foundation in March 2013.

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

<span class="mw-page-title-main">Apache Mesos</span> Software to manage computer clusters

Apache Mesos is an open-source project to manage computer clusters. It was developed at the University of California, Berkeley.

Presto is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata, and allows use of multiple data sources within a query. Presto is community-driven open-source software released under the Apache License.

gRPC is a cross-platform open source high performance remote procedure call (RPC) framework. gRPC was initially created by Google, which used a single general-purpose RPC infrastructure called Stubby to connect the large number of microservices running within and across its data centers from about 2001. In March 2015, Google decided to build the next version of Stubby and make it open source. The result was gRPC, which is now used in many organizations aside from Google to power use cases from microservices to the "last mile" of computing. It uses HTTP/2 for transport, Protocol Buffers as the interface description language, and provides features such as authentication, bidirectional streaming and flow control, blocking or nonblocking bindings, and cancellation and timeouts. It generates cross-platform client and server bindings for many languages. Most common usage scenarios include connecting services in a microservices style architecture, or connecting mobile device clients to backend services.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

<span class="mw-page-title-main">Apache Celix</span>

Apache Celix is an open-source implementation of the OSGi specification adapted to C and C++ developed by the Apache Software Foundation. The project aims to provide a framework to develop (dynamic) modular software applications using component and/or service-oriented programming.

Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

<span class="mw-page-title-main">Apache Airflow</span> Open-source workflow management platform

Apache Airflow is an open-source workflow management platform for data engineering pipelines. It started at Airbnb in October 2014 as a solution to manage the company's increasingly complex workflows. Creating Airflow allowed Airbnb to programmatically author and schedule their workflows and monitor them via the built-in Airflow user interface. From the beginning, the project was made open source, becoming an Apache Incubator project in March 2016 and a top-level Apache Software Foundation project in January 2019.

Apache Superset is an open-source software application for data exploration and data visualization able to handle data at petabyte scale. The application started as a hack-a-thon project by Maxime Beauchemin while working at Airbnb and entered the Apache Incubator program in 2017. In addition to Airbnb, the project has seen significant contributions from other leading technology companies, including Lyft and Dropbox. Superset graduated from the incubator program and became a top-level project at the Apache Software Foundation in 2021.

<span class="mw-page-title-main">SONiC (operating system)</span> Open-source network operating system

The Software for Open Networking in the Cloud or alternatively abbreviated and stylized as SONiC, is a free and open source network operating system based on Linux. It was originally developed by Microsoft and the Open Compute Project. In 2022, Microsoft ceded oversight of the project to the Linux Foundation, who will continue to work with the Open Compute Project for continued ecosystem and developer growth. SONiC includes the networking software components necessary for a fully functional L3 device and was designed to meet the requirements of a cloud data center. It allows cloud operators to share the same software stack across hardware from different switch vendors and works on over 100 different platforms. There are multiple companies offering enterprise service and support for SONiC.

<span class="mw-page-title-main">Trino (SQL query engine)</span> Open-source distributed SQL query engine

Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino can query data lakes that contain open column-oriented data file formats like ORC or Parquet residing on different storage systems like HDFS, AWS S3, Google Cloud Storage, or Azure Blob Storage using the Hive and Iceberg table formats. Trino also has the ability to run federated queries that query tables in different data sources such as MySQL, PostgreSQL, Cassandra, Kafka, MongoDB and Elasticsearch. Trino is released under the Apache License.

References

  1. "Apache Iceberg". iceberg.apache.org. Retrieved 5 October 2022.
  2. "apache/iceberg GitHub License". The Apache Software Foundation. 5 October 2022. Retrieved 5 October 2022.
  3. 1 2 3 4 5 Woodie, Alex (8 February 2021). "Apache Iceberg: The Hub of an Emerging Data Service Ecosystem?". Datanami.
  4. "Vendors". iceberg.apache.org. Retrieved 2023-05-05.
  5. "Using Apache Iceberg tables – Amazon Athena". Amazon Web Services, Inc.
  6. "Initial public release in apache/iceberg". GitHub. Retrieved 5 October 2022.
  7. 1 2 "Incubation Status Template - Apache Incubator". incubator.apache.org.
  8. Zhu, Ronnie (26 September 2022). "Upgrading Data Warehouse Infrastructure at Airbnb". The Airbnb Tech Blog.
  9. Mathiesen, Christine (26 January 2021). "A Short Introduction to Apache Iceberg". Expedia Group Technology. Retrieved 5 October 2022.
  10. "FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format". engineering.linkedin.com.
  11. Bremner, Jaemi (3 December 2020). "Iceberg at Adobe". Medium.
  12. Council, Data. "Open Source Highlight: Apache Iceberg". www.datacouncil.ai. Retrieved 5 October 2022.