Apache Arrow

Developer(s): Apache Software Foundation
Initial release: October 10, 2016
Stable release: 7.0.0 / 8 February 2022 [1]
Repository: https://github.com/apache/arrow
Written in: C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, Rust
Type: Data format, algorithms
License: Apache License 2.0
Website: arrow.apache.org

Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It specifies a standardized column-oriented memory format able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. [2] [3] [4] [5] [6] This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory. [7]
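As a minimal illustration of the format (a sketch using the project's PyArrow bindings; the column names and values here are invented for the example), a single Arrow table can hold flat and hierarchical columns side by side in a columnar layout:

    import pyarrow as pa

    # One flat column (int64) and one nested column (list<string>),
    # both stored column-by-column in memory.
    table = pa.table({
        "id": [1, 2, 3],
        "tags": [["fast"], ["fast", "columnar"], []],
    })

    print(table.schema)
    # id: int64
    # tags: list<item: string>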

Interoperability

Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas, and other data processing libraries. The project includes native software libraries written in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Arrow allows zero-copy reads, so data can be accessed and interchanged between these languages and systems without serialization overhead. [2]
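A brief sketch of what that interchange looks like in practice (using PyArrow's pandas and NumPy integration; the data values are invented, and whether a given conversion is actually zero-copy depends on the column types and presence of nulls):

    import numpy as np
    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"x": np.arange(5, dtype=np.int64)})

    # pandas -> Arrow: build an Arrow table from the DataFrame.
    table = pa.Table.from_pandas(df)

    # Arrow -> pandas: for simple numeric columns without nulls,
    # PyArrow can often reuse the underlying buffers rather than copy.
    df2 = table.to_pandas()

    # NumPy -> Arrow: wrap a NumPy array as an Arrow array.
    arr = pa.array(np.arange(5, dtype=np.int64))
    print(arr.type)  # int64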

Applications

Arrow has been used in diverse domains, including analytics, [8] genomics, [9] [7] and cloud computing. [10]

Comparison to Apache Parquet and ORC

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed to complement these formats for processing data in memory. [11] The hardware resource trade-offs for in-memory processing differ from those associated with on-disk storage. [12] The Arrow and Parquet projects include libraries for reading and writing data between the two formats. [13]
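For instance (a minimal sketch using the PyArrow Parquet bindings; the file name and data are arbitrary), an in-memory Arrow table can be written out as an on-disk Parquet file and read back:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Columnar table in Arrow memory, ready for processing.
    table = pa.table({"city": ["Oslo", "Lima"], "pop": [709000, 9750000]})

    # Write to Parquet (columnar, compressed, optimized for storage) ...
    pq.write_table(table, "cities.parquet")

    # ... and read it back into Arrow memory.
    table2 = pq.read_table("cities.parquet")
    assert table.equals(table2)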

Governance

Apache Arrow was announced by The Apache Software Foundation on February 17, 2016, [14] with development led by a coalition of developers from other open source data analytics projects. [15] [16] [6] [17] [18] The initial codebase and Java library were seeded by code from Apache Drill. [14]

Related Research Articles

<span class="mw-page-title-main">Apache Flex</span> Software development kit (SDK) for the development and deployment of rich web applications

Apache Flex, formerly Adobe Flex, is a software development kit (SDK) for the development and deployment of cross-platform rich web applications based on the Adobe Flash platform. Initially developed by Macromedia and then acquired by Adobe Systems, Adobe donated Flex to the Apache Software Foundation in 2011 and it was promoted to a top-level project in December 2012.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Thrift is an interface definition language and binary communication protocol used for defining and creating services for numerous programming languages. It was developed at Facebook for "scalable cross-language services development" and as of 2020 is an open source project in the Apache Software Foundation.

KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks of Analytics" concept. A graphical user interface and use of JDBC allow assembly of nodes that blend different data sources, including preprocessing, for modeling, data analysis and visualization with little or no programming.

<span class="mw-page-title-main">Apache Hive</span> Database engine

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

Within computing database management systems, the RCFile is a data placement structure that determines how to store relational tables on computer clusters. It is designed for systems using the MapReduce framework. The RCFile structure includes a data storage format, data compression approach, and optimization techniques for data reading. It is able to meet all four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to dynamic data access patterns.

<span class="mw-page-title-main">SingleStore</span>

SingleStore is a cloud-native database designed for data-intensive applications. A distributed, relational, SQL database management system (RDBMS) that features ANSI SQL support, it is known for speed in data ingest, transaction processing, and query processing.

<span class="mw-page-title-main">Apache Drill</span> Open-source software framework

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system, also productized as BigQuery. Drill is an Apache top-level project.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."

<span class="mw-page-title-main">Apache Spark</span> Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

<span class="mw-page-title-main">Apache Flink</span> Framework and distributed processing engine

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.

Apache SystemDS is an open source ML system for the end-to-end data science lifecycle.

<span class="mw-page-title-main">Apache Kylin</span> Open-source distributed analytics engine

Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

<span class="mw-page-title-main">Apache Kudu</span> Open-source column-oriented data store

Apache Kudu is a free and open source column-oriented data store of the Apache Hadoop ecosystem. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides completeness to Hadoop's storage layer to enable fast analytics on fast data.

<span class="mw-page-title-main">Apache ORC</span> Column-oriented data storage format

Apache ORC is a free and open-source column-oriented data storage format. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem, such as RCFile and Parquet. It is used by most of the data processing frameworks in the Hadoop environment, including Apache Spark, Apache Hive, Apache Flink and Apache Hadoop.

Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages. The library is built on top of Apache Spark and its Spark ML library.

Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop, namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

References

  1. "Apache Arrow 7.0.0 Release". 8 February 2022. Retrieved 15 April 2022.
  2. 1 2 "Apache Arrow and Distributed Compute with Kubernetes". 13 Dec 2018.
  3. Baer, Tony (17 February 2016). "Apache Arrow: Lining Up The Ducks In A Row... Or Column". Seeking Alpha .
  4. Baer, Tony (25 February 2019). "Apache Arrow: The little data accelerator that could". ZDNet .
  5. Hall, Susan (23 February 2016). "Apache Arrow's Columnar Layouts of Data Could Accelerate Hadoop, Spark". The New Stack .
  6. 1 2 Yegulalp, Serdar (27 February 2016). "Apache Arrow aims to speed access to big data". InfoWorld .
  7. 1 2 Tanveer Ahmad (2019). "ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework". bioRxiv : 741843. doi: 10.1101/741843 .
  8. Dinsmore T.W. (2016). "In-Memory Analytics". In-Memory Analytics. In: Disruptive Analytics. Apress, Berkeley, CA. pp. 97–116. doi:10.1007/978-1-4842-1311-7_5. ISBN   978-1-4842-1312-4.
  9. Versaci F, Pireddu L, Zanetti G (2016). "Scalable genomics: from raw data to aligned reads on Apache YARN" (PDF). IEEE International Conference on Big Data: 1232–1241.
  10. Maas M, Asanović K, Kubiatowicz J (2017). "Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era". Proceedings of the 16th Workshop on Hot Topics in Operating Systems (ACM): 138–143. doi: 10.1145/3102980.3103003 .
  11. Le Dem, Julien. "Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory". KDnuggets .
  12. "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?". 2017-10-31.
  13. "PyArrow:Reading and Writing the Apache Parquet Format".
  14. 1 2 "The Apache® Software Foundation Announces Apache Arrow™ as a Top-Level Project". The Apache Software Foundation Blog. Archived from the original on 2016-03-13.
  15. Martin, Alexander J. (17 February 2016). "Apache Foundation rushes out Apache Arrow as top-level project". The Register .
  16. "Big data gets a new open-source project, Apache Arrow: It offers performance improvements of more than 100x on analytical workloads, the foundation says". 2016-02-17.
  17. Le Dem, Julien (28 November 2016). "The first release of Apache Arrow". SD Times .
  18. "Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow".