Apache Avro

Developer(s): Apache Software Foundation
Initial release: 2 November 2009 [1]
Stable release: 1.11.3 / September 23, 2023 [2]
Repository: Avro Repository
Written in: Java, C, C++, C#, Perl, Python, PHP, Ruby
Type: Remote procedure call framework
License: Apache License 2.0
Website: avro.apache.org

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data being encoded. It has two schema languages: one intended for human editing (Avro IDL) and a more machine-readable one based on JSON. [3]


It is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically typed languages).

Apache Spark SQL can access Avro as a data source. [4]
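A minimal PySpark sketch of this (assuming Spark 2.4+ with the external spark-avro module available to the session; the file and view names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-example").getOrCreate()

# Load an Avro file as a DataFrame and query it through Spark SQL.
df = spark.read.format("avro").load("users.avro")
df.createOrReplaceTempView("users")
spark.sql("SELECT name, favorite_color FROM users").show()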

Avro Object Container File

An Avro Object Container File consists of: [5]

 - a file header, followed by
 - one or more file data blocks.

A file header consists of:

 - four bytes, ASCII 'O', 'b', 'j', followed by 1 (the Avro version number);
 - file metadata, including the schema definition;
 - the 16-byte, randomly-generated sync marker for this file.
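A quick way to see this layout is to check the magic bytes at the start of a container file. A minimal Python sketch (assuming the users.avro file produced by the serialization example later in this article):

with open("users.avro", "rb") as f:
    magic = f.read(4)

# The magic is ASCII 'O', 'b', 'j' followed by the format version byte 1.
assert magic == b"Obj\x01"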

For data blocks Avro specifies two serialization encodings: binary and JSON. [6] Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
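For illustration, a minimal sketch of the binary encoding used outside a container file, written against the same Python avro library as the examples in the next section (user.avsc is assumed to contain the schema shown below). Because no container file is written, the schema is not embedded and the reader must obtain it out of band:

import io

import avro.io
import avro.schema

schema = avro.schema.parse(open("user.avsc", "rb").read())

# Binary-encode a single datum into a raw byte buffer (no container file,
# so the schema is not stored alongside the data).
buffer = io.BytesIO()
encoder = avro.io.BinaryEncoder(buffer)
avro.io.DatumWriter(schema).write({"name": "Alyssa", "favorite_number": 256}, encoder)
raw_bytes = buffer.getvalue()

# Decoding the raw bytes requires knowing the writer's schema ahead of time.
decoder = avro.io.BinaryDecoder(io.BytesIO(raw_bytes))
datum = avro.io.DatumReader(schema).read(decoder)
print(datum)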

Schema definition

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed). [7]

Simple schema example:

{"namespace":"example.avro","type":"record","name":"User","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["null","int"]},{"name":"favorite_color","type":["null","string"]}]}

Serializing and deserializing

Avro data may be stored together with its corresponding schema, as in the object container format, meaning a serialized item can be read without knowing the schema ahead of time.

Example serialization and deserialization code in Python

Serialization: [8]

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Need to know the schema to write (API as of Avro 1.8.2).
schema = avro.schema.parse(open("user.avsc", "rb").read())

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 8, "favorite_color": "red"})
writer.close()

File "users.avro" will contain the schema in JSON and a compact binary representation [9] of the data:

$ od -v -t x1z users.avro
0000000 4f 62 6a 01 04 14 61 76 72 6f 2e 63 6f 64 65 63  >Obj...avro.codec<
0000020 08 6e 75 6c 6c 16 61 76 72 6f 2e 73 63 68 65 6d  >.null.avro.schem<
0000040 61 ba 03 7b 22 74 79 70 65 22 3a 20 22 72 65 63  >a..{"type": "rec<
0000060 6f 72 64 22 2c 20 22 6e 61 6d 65 22 3a 20 22 55  >ord", "name": "U<
0000100 73 65 72 22 2c 20 22 6e 61 6d 65 73 70 61 63 65  >ser", "namespace<
0000120 22 3a 20 22 65 78 61 6d 70 6c 65 2e 61 76 72 6f  >": "example.avro<
0000140 22 2c 20 22 66 69 65 6c 64 73 22 3a 20 5b 7b 22  >", "fields": [{"<
0000160 74 79 70 65 22 3a 20 22 73 74 72 69 6e 67 22 2c  >type": "string",<
0000200 20 22 6e 61 6d 65 22 3a 20 22 6e 61 6d 65 22 7d  > "name": "name"}<
0000220 2c 20 7b 22 74 79 70 65 22 3a 20 5b 22 69 6e 74  >, {"type": ["int<
0000240 22 2c 20 22 6e 75 6c 6c 22 5d 2c 20 22 6e 61 6d  >", "null"], "nam<
0000260 65 22 3a 20 22 66 61 76 6f 72 69 74 65 5f 6e 75  >e": "favorite_nu<
0000300 6d 62 65 72 22 7d 2c 20 7b 22 74 79 70 65 22 3a  >mber"}, {"type":<
0000320 20 5b 22 73 74 72 69 6e 67 22 2c 20 22 6e 75 6c  > ["string", "nul<
0000340 6c 22 5d 2c 20 22 6e 61 6d 65 22 3a 20 22 66 61  >l"], "name": "fa<
0000360 76 6f 72 69 74 65 5f 63 6f 6c 6f 72 22 7d 5d 7d  >vorite_color"}]}<
0000400 00 05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef  >......GTb.h...B.<
0000420 24 04 2c 0c 41 6c 79 73 73 61 00 80 04 02 06 42  >$.,.Alyssa.....B<
0000440 65 6e 00 10 00 06 72 65 64 05 f9 a3 80 98 47 54  >en....red.....GT<
0000460 62 bf 68 95 a2 ab 42 ef 24                       >b.h...B.$<
0000471

Deserialization:

# The schema is embedded in the data file
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()

This outputs:

{u'favorite_color': None, u'favorite_number': 256, u'name': u'Alyssa'}
{u'favorite_color': u'red', u'favorite_number': 8, u'name': u'Ben'}

Languages with APIs

Though theoretically any language could use Avro, the following languages have APIs written for them: [10] [11]

 - C
 - C++
 - C# [12] [13]
 - Elixir [15] [16]
 - Go [17] [18]
 - Haskell [19]
 - Java [14]
 - JavaScript [20]
 - Perl [14]
 - PHP [14]
 - Python [21] [22]
 - Ruby [14]
 - Rust [23]

Avro IDL

In addition to supporting JSON for type and protocol definitions, Avro includes experimental [24] support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++, Protocol Buffers and others.
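For illustration, the User record from the JSON schema above might be written in Avro IDL roughly as follows (a sketch; the protocol name is arbitrary):

@namespace("example.avro")
protocol UserProtocol {
  record User {
    string name;
    union { null, int } favorite_number;
    union { null, string } favorite_color;
  }
}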

Logo

The original Apache Avro logo was from the defunct British aircraft manufacturer Avro (originally A.V. Roe and Company). [25]

The Apache Avro logo was updated to an original design in late 2023. [26]


Related Research Articles

In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of object-oriented objects does not include any of their associated methods with which they were previously linked.

Abstract Syntax Notation One (ASN.1) is a standard interface description language (IDL) for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.


An interface description language or interface definition language (IDL) is a generic term for a language that lets a program or object written in one language communicate with another program written in an unknown language. IDLs are usually used to describe data types and interfaces in a language-independent way, for example, between those written in C++ and those written in Java.

In computing, a hex dump is a textual hexadecimal view of computer data, from memory or from a computer file or storage device. Looking at a hex dump of data is usually done in the context of debugging, reverse engineering, or digital forensics. Interactive editors that provide a similar view but can also manipulate the data in question are called hex editors.


JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers.

Snappy is a fast data compression and decompression library written in C++ by Google based on ideas from LZ77 and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single core of a circa 2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The compression ratio is 20–100% lower than gzip.


The Curtiss P-6 Hawk is an American single-engine biplane fighter introduced into service in the late 1920s with the United States Army Air Corps and operated until the late 1930s prior to the outbreak of World War II.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Thrift is an interface definition language and binary communication protocol used for defining and creating services for numerous programming languages. It was developed by Facebook. Since 2010, it has been an open source project in the Apache Software Foundation.

In computer programming, a netstring is a formatting method for byte strings that uses a declarative notation to indicate the size of the string.

This is a comparison of data serialization formats, various ways to convert complex objects to sequences of bits. It does not include markup languages used exclusively as document file formats.


Apache Hive is a data warehouse software project, built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids the portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

MessagePack is a computer data interchange format. It is a binary form for representing simple data structures like arrays and associative arrays. MessagePack aims to be as compact and simple as possible. The official implementation is available in a variety of languages such as C, C++, C#, D, Erlang, Go, Haskell, Java, JavaScript (NodeJS), Lua, OCaml, Perl, PHP, Python, Ruby, Rust, Scala, Smalltalk, and Swift.


Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Tom Shiran is the founder of the Apache Drill Project; it was designated an Apache Software Foundation top-level project in December 2016.


Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.


A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases, semi-structured data, unstructured data and binary data. A data lake can be established "on premises" or "in the cloud".

Ion is a data serialization language developed by Amazon. It may be represented by either a human-readable text form or a compact binary form. The text form is a superset of JSON; thus, any valid JSON document is also a valid Ion document.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

FlatBuffers is a free software library implementing a serialization format similar to Protocol Buffers, Thrift, Apache Avro, SBE, and Cap'n Proto, primarily written by Wouter van Oortmerssen and open-sourced by Google. It supports "zero-copy" deserialization, so that accessing the serialized data does not require first copying it into a separate part of memory. This makes accessing data in these formats much faster than data in formats requiring more extensive processing, such as JSON, CSV, and in many cases Protocol Buffers. Compared to other serialization formats, however, handling FlatBuffers usually requires more code, and some operations are not possible.

References

  1. "Apache Avro: a New Format for Data Interchange". blog.cloudera.com. Retrieved March 10, 2019.
  2. "Apache Avro Releases". avro.apache.org. Retrieved September 23, 2023.
  3. Kleppmann, Martin (2017). Designing Data-Intensive Applications (First ed.). O'Reilly. p. 122.
  4. "3 Reasons Why In-Hadoop Analytics are a Big Deal - Dataconomy". dataconomy.com. April 21, 2016.
  5. "Apache Avro Specification: Object Container Files". avro.apache.org. Retrieved March 10, 2019.
  6. "Apache Avro Specification: Encodings". avro.apache.org. Retrieved March 11, 2019.
  7. "Apache Avro Getting Started (Python)". avro.apache.org. Archived from the original on June 5, 2016. Retrieved March 11, 2019.
  8. "Apache Avro Getting Started (Python)". avro.apache.org. Archived from the original on June 5, 2016. Retrieved March 11, 2019.
  9. "Apache Avro Specification: Data Serialization". avro.apache.org. Retrieved March 11, 2019.
  10. phunt. "GitHub - phunt/avro-rpc-quickstart: Apache Avro RPC Quick Start. Avro is a subproject of Apache Hadoop". GitHub. Retrieved April 13, 2016.
  11. "Supported Languages - Apache Avro - Apache Software Foundation" . Retrieved April 21, 2016.
  12. "Avro: 1.5.1 - ASF JIRA" . Retrieved April 13, 2016.
  13. "[AVRO-533] .NET implementation of Avro - ASF JIRA" . Retrieved April 13, 2016.
  14. "Supported Languages" . Retrieved April 13, 2016.
  15. "AvroEx". hexdocs.pm. Retrieved October 18, 2017.
  16. "Avrora — avrora v0.21.1". hexdocs.pm. Retrieved June 11, 2021.
  17. "avro package - github.com/hamba/avro - Go Packages". pkg.go.dev. Retrieved July 4, 2023.
  18. "goavro". LinkedIn. June 30, 2023. Retrieved July 4, 2023.
  19. "Native Haskell implementation of Avro". Thomas M. DuBuisson, Galois, Inc. Retrieved August 8, 2016.
  20. "Pure JavaScript implementation of the Avro specification". GitHub . Retrieved May 4, 2020.
  21. "Getting Started (Python)". Apache Avro. Retrieved July 4, 2023.
  22. "avro: Avro is a serialization and RPC framework". Apache Avro. Retrieved July 4, 2023.
  23. "Apache Avro client library implementation in Rust". Retrieved December 17, 2018.
  24. "Apache Avro 1.8.2 IDL". Archived from the original on September 20, 2010. Retrieved March 11, 2019.
  25. "The Avro Logo". avroheritagemuseum.co.uk. Retrieved December 31, 2018.
  26. "[AVRO-3908] Update project logo everywhere - ASF JIRA". apache.org. Retrieved February 6, 2024.
