Apache Avro

Developer(s): Apache Software Foundation
Initial release: 2 November 2009 [1]
Stable release: 1.11.3 / September 23, 2023 [2]
Repository: Avro Repository
Written in: Java, C, C++, C#, Perl, Python, PHP, Ruby
Type: Remote procedure call framework
License: Apache License 2.0
Website: avro.apache.org

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data being encoded. It has two schema languages: one intended for human editing (Avro IDL) and a more machine-readable one based on JSON. [3]


It is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically typed languages).

Apache Spark SQL can access Avro as a data source. [4]
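A minimal PySpark sketch of this (assuming Spark 2.4+ with the external spark-avro module available to the session; the file and view names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-example").getOrCreate()

# Load an Avro file as a DataFrame and query it through Spark SQL.
df = spark.read.format("avro").load("users.avro")
df.createOrReplaceTempView("users")
spark.sql("SELECT name, favorite_color FROM users").show()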

Avro Object Container File

An Avro Object Container File consists of: [5]

 - a file header, followed by
 - one or more file data blocks.

A file header consists of:

 - four bytes, ASCII 'O', 'b', 'j', followed by 1 (the Avro version number);
 - file metadata, including the schema definition;
 - the 16-byte, randomly-generated sync marker for this file.
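A quick way to see this layout is to check the magic bytes at the start of a container file. A minimal Python sketch (assuming the users.avro file produced by the serialization example later in this article):

with open("users.avro", "rb") as f:
    magic = f.read(4)

# The magic is ASCII 'O', 'b', 'j' followed by the format version byte 1.
assert magic == b"Obj\x01"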

For data blocks Avro specifies two serialization encodings: binary and JSON. [6] Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
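For illustration, a minimal sketch of the binary encoding used outside a container file, written against the same Python avro library as the examples in the next section (user.avsc is assumed to contain the schema shown below). Because no container file is written, the schema is not embedded and the reader must obtain it out of band:

import io

import avro.io
import avro.schema

schema = avro.schema.parse(open("user.avsc", "rb").read())

# Binary-encode a single datum into a raw byte buffer (no container file,
# so the schema is not stored alongside the data).
buffer = io.BytesIO()
encoder = avro.io.BinaryEncoder(buffer)
avro.io.DatumWriter(schema).write({"name": "Alyssa", "favorite_number": 256}, encoder)
raw_bytes = buffer.getvalue()

# Decoding the raw bytes requires knowing the writer's schema ahead of time.
decoder = avro.io.BinaryDecoder(io.BytesIO(raw_bytes))
datum = avro.io.DatumReader(schema).read(decoder)
print(datum)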

Schema definition

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed). [7]

Simple schema example:

{"namespace":"example.avro","type":"record","name":"User","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["null","int"]},{"name":"favorite_color","type":["null","string"]}]}

Serializing and deserializing

Avro data may be stored together with its corresponding schema, as in the object container format, meaning a serialized item can be read without knowing the schema ahead of time.

Example serialization and deserialization code in Python

Serialization: [8]

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Need to know the schema to write (API as of Avro 1.8.2).
schema = avro.schema.parse(open("user.avsc", "rb").read())

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 8, "favorite_color": "red"})
writer.close()

File "users.avro" will contain the schema in JSON and a compact binary representation [9] of the data:

$ od -v -t x1z users.avro
0000000 4f 62 6a 01 04 14 61 76 72 6f 2e 63 6f 64 65 63  >Obj...avro.codec<
0000020 08 6e 75 6c 6c 16 61 76 72 6f 2e 73 63 68 65 6d  >.null.avro.schem<
0000040 61 ba 03 7b 22 74 79 70 65 22 3a 20 22 72 65 63  >a..{"type": "rec<
0000060 6f 72 64 22 2c 20 22 6e 61 6d 65 22 3a 20 22 55  >ord", "name": "U<
0000100 73 65 72 22 2c 20 22 6e 61 6d 65 73 70 61 63 65  >ser", "namespace<
0000120 22 3a 20 22 65 78 61 6d 70 6c 65 2e 61 76 72 6f  >": "example.avro<
0000140 22 2c 20 22 66 69 65 6c 64 73 22 3a 20 5b 7b 22  >", "fields": [{"<
0000160 74 79 70 65 22 3a 20 22 73 74 72 69 6e 67 22 2c  >type": "string",<
0000200 20 22 6e 61 6d 65 22 3a 20 22 6e 61 6d 65 22 7d  > "name": "name"}<
0000220 2c 20 7b 22 74 79 70 65 22 3a 20 5b 22 69 6e 74  >, {"type": ["int<
0000240 22 2c 20 22 6e 75 6c 6c 22 5d 2c 20 22 6e 61 6d  >", "null"], "nam<
0000260 65 22 3a 20 22 66 61 76 6f 72 69 74 65 5f 6e 75  >e": "favorite_nu<
0000300 6d 62 65 72 22 7d 2c 20 7b 22 74 79 70 65 22 3a  >mber"}, {"type":<
0000320 20 5b 22 73 74 72 69 6e 67 22 2c 20 22 6e 75 6c  > ["string", "nul<
0000340 6c 22 5d 2c 20 22 6e 61 6d 65 22 3a 20 22 66 61  >l"], "name": "fa<
0000360 76 6f 72 69 74 65 5f 63 6f 6c 6f 72 22 7d 5d 7d  >vorite_color"}]}<
0000400 00 05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef  >......GTb.h...B.<
0000420 24 04 2c 0c 41 6c 79 73 73 61 00 80 04 02 06 42  >$.,.Alyssa.....B<
0000440 65 6e 00 10 00 06 72 65 64 05 f9 a3 80 98 47 54  >en....red.....GT<
0000460 62 bf 68 95 a2 ab 42 ef 24                       >b.h...B.$<
0000471

Deserialization:

# The schema is embedded in the data file
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()

This outputs:

{u'favorite_color': None, u'favorite_number': 256, u'name': u'Alyssa'}
{u'favorite_color': u'red', u'favorite_number': 8, u'name': u'Ben'}

Languages with APIs

Though theoretically any language could use Avro, the following languages have APIs written for them: [10] [11]

 - C
 - C++
 - C# [12] [13]
 - Elixir [15] [16]
 - Go [17] [18]
 - Haskell [19]
 - Java [14]
 - JavaScript [20]
 - Perl [14]
 - PHP [14]
 - Python [21] [22]
 - Ruby [14]
 - Rust [23]

Avro IDL

In addition to supporting JSON for type and protocol definitions, Avro includes experimental [24] support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++, Protocol Buffers and others.
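For illustration, the User record from the JSON schema above might be written in Avro IDL roughly as follows (a sketch; the protocol name is arbitrary):

@namespace("example.avro")
protocol UserProtocol {
  record User {
    string name;
    union { null, int } favorite_number;
    union { null, string } favorite_color;
  }
}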

Logo

The original Apache Avro logo was from the defunct British aircraft manufacturer Avro (originally A.V. Roe and Company). [25]

The Apache Avro logo was updated to an original design in late 2023. [26]


Related Research Articles

In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of object-oriented objects does not include any of their associated methods with which they were previously linked.

Abstract Syntax Notation One (ASN.1) is a standard interface description language (IDL) for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.


An interface description language or interface definition language (IDL) is a generic term for a language that lets a program or object written in one language communicate with another program written in an unknown language. IDLs are usually used to describe data types and interfaces in a language-independent way, for example, between those written in C++ and those written in Java.

In computing, a hex dump is a textual hexadecimal view of computer data, from memory or from a computer file or storage device. Looking at a hex dump of data is usually done in the context of debugging, reverse engineering, or digital forensics. Interactive editors that provide a similar view but can also manipulate the data in question are called hex editors.


JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers.

Snappy is a fast data compression and decompression library written in C++ by Google based on ideas from LZ77 and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single core of a circa 2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The compression ratio is 20–100% lower than gzip.


The Curtiss P-6 Hawk is an American single-engine biplane fighter introduced into service in the late 1920s with the United States Army Air Corps and operated until the late 1930s prior to the outbreak of World War II.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Thrift is an interface definition language and binary communication protocol used for defining and creating services for numerous programming languages. It was developed by Facebook. Since 2010, it has been an open source project in the Apache Software Foundation.

In computer programming, a netstring is a formatting method for byte strings that uses a declarative notation to indicate the size of the string.

This is a comparison of data serialization formats, various ways to convert complex objects to sequences of bits. It does not include markup languages used exclusively as document file formats.


Apache Hive is a data warehouse software project, built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids the portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

MessagePack is a computer data interchange format. It is a binary form for representing simple data structures like arrays and associative arrays. MessagePack aims to be as compact and simple as possible. The official implementation is available in a variety of languages such as C, C++, C#, D, Erlang, Go, Haskell, Java, JavaScript (NodeJS), Lua, OCaml, Perl, PHP, Python, Ruby, Rust, Scala, Smalltalk, and Swift.


Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Tom Shiran is the founder of the Apache Drill Project; it was designated an Apache Software Foundation top-level project in December 2016.


Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.


A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases, semi-structured data, unstructured data and binary data. A data lake can be established "on premises" or "in the cloud".

Ion is a data serialization language developed by Amazon. It may be represented by either a human-readable text form or a compact binary form. The text form is a superset of JSON; thus, any valid JSON document is also a valid Ion document.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

FlatBuffers is a free software library implementing a serialization format similar to Protocol Buffers, Thrift, Apache Avro, SBE, and Cap'n Proto, primarily written by Wouter van Oortmerssen and open-sourced by Google. It supports "zero-copy" deserialization, so that accessing the serialized data does not require first copying it into a separate part of memory. This makes accessing data in these formats much faster than data in formats requiring more extensive processing, such as JSON, CSV, and in many cases Protocol Buffers. Compared to other serialization formats, however, handling FlatBuffers usually requires more code, and some operations are not possible.

References

  1. "Apache Avro: a New Format for Data Interchange". blog.cloudera.com. Retrieved March 10, 2019.
  2. "Apache Avro Releases". avro.apache.org. Retrieved September 23, 2023.
  3. Kleppmann, Martin (2017). Designing Data-Intensive Applications (First ed.). O'Reilly. p. 122.
  4. "3 Reasons Why In-Hadoop Analytics are a Big Deal - Dataconomy". dataconomy.com. April 21, 2016.
  5. "Apache Avro Specification: Object Container Files". avro.apache.org. Retrieved March 10, 2019.
  6. "Apache Avro Specification: Encodings". avro.apache.org. Retrieved March 11, 2019.
  7. "Apache Avro Getting Started (Python)". avro.apache.org. Archived from the original on June 5, 2016. Retrieved March 11, 2019.
  8. "Apache Avro Getting Started (Python)". avro.apache.org. Archived from the original on June 5, 2016. Retrieved March 11, 2019.
  9. "Apache Avro Specification: Data Serialization". avro.apache.org. Retrieved March 11, 2019.
  10. phunt. "GitHub - phunt/avro-rpc-quickstart: Apache Avro RPC Quick Start. Avro is a subproject of Apache Hadoop". GitHub. Retrieved April 13, 2016.
  11. "Supported Languages - Apache Avro - Apache Software Foundation" . Retrieved April 21, 2016.
  12. "Avro: 1.5.1 - ASF JIRA" . Retrieved April 13, 2016.
  13. "[AVRO-533] .NET implementation of Avro - ASF JIRA" . Retrieved April 13, 2016.
  14. "Supported Languages" . Retrieved April 13, 2016.
  15. "AvroEx". hexdocs.pm. Retrieved October 18, 2017.
  16. "Avrora — avrora v0.21.1". hexdocs.pm. Retrieved June 11, 2021.
  17. "avro package - github.com/hamba/avro - Go Packages". pkg.go.dev. Retrieved July 4, 2023.
  18. "goavro". LinkedIn. June 30, 2023. Retrieved July 4, 2023.
  19. "Native Haskell implementation of Avro". Thomas M. DuBuisson, Galois, Inc. Retrieved August 8, 2016.
  20. "Pure JavaScript implementation of the Avro specification". GitHub . Retrieved May 4, 2020.
  21. "Getting Started (Python)". Apache Avro. Retrieved July 4, 2023.
  22. "avro: Avro is a serialization and RPC framework". Apache Avro. Retrieved July 4, 2023.
  23. "Apache Avro client library implementation in Rust". Retrieved December 17, 2018.
  24. "Apache Avro 1.8.2 IDL". Archived from the original on September 20, 2010. Retrieved March 11, 2019.
  25. "The Avro Logo". avroheritagemuseum.co.uk. Retrieved December 31, 2018.
  26. "[AVRO-3908] Update project logo everywhere - ASF JIRA". apache.org. Retrieved February 6, 2024.
