Developer(s) | Apache Software Foundation |
---|---|
Initial release | 2 November 2009 [1] |
Stable release | 1.11.3 / September 23, 2023 [2] |
Repository | Avro Repository |
Written in | Java, C, C++, C#, Perl, Python, PHP, Ruby |
Type | Remote procedure call framework |
License | Apache License 2.0 |
Website | avro.apache.org |
Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. It has two schema languages: Avro IDL, intended for human editing, and a more machine-readable one based on JSON. [3]
It is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).
Apache Spark SQL can access Avro as a data source. [4]
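As a hedged illustration, the following sketch reads and writes Avro through Spark SQL; it assumes PySpark is installed, that the external spark-avro module (org.apache.spark:spark-avro) is on the Spark classpath, and it reuses the users.avro file produced in the serialization example below:

from pyspark.sql import SparkSession

# Assumes the external org.apache.spark:spark-avro package is available to Spark.
spark = SparkSession.builder.appName("avro-example").getOrCreate()

# Read an Avro container file as a DataFrame and inspect it.
df = spark.read.format("avro").load("users.avro")
df.show()

# Write the DataFrame back out in Avro format.
df.write.format("avro").mode("overwrite").save("users_copy.avro")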
An Avro Object Container File consists of: [5]
- a file header, followed by
- one or more file data blocks.

A file header consists of:
- four bytes, ASCII 'O', 'b', 'j', followed by the Avro version number 1 (0x01), as the magic number;
- file metadata, including the schema definition;
- the 16-byte, randomly-generated sync marker for this file.
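As a minimal sketch (standard library only, not part of the Avro API), the magic number can be checked against a container file such as the users.avro file produced in the serialization example below:

# Check the four magic bytes 'O', 'b', 'j', 0x01 at the start of the file.
with open("users.avro", "rb") as f:
    magic = f.read(4)

print("Avro object container file" if magic == b"Obj\x01" else "not an Avro container file")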
For data blocks Avro specifies two serialization encodings: [6] binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
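To illustrate the compactness of the binary encoding, the following sketch uses the same Python avro library as the examples below; the inline schema mirrors the User example, and the size comparison against plain JSON text is only a rough one:

import io
import json

import avro.schema
from avro.io import BinaryEncoder, DatumWriter

# Inline copy of the User schema from the example below.
schema = avro.schema.parse(json.dumps({
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["null", "int"]},
        {"name": "favorite_color", "type": ["null", "string"]},
    ],
}))

datum = {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}

# The binary encoding stores field values only (no field names), so it is compact.
buffer = io.BytesIO()
DatumWriter(schema).write(datum, BinaryEncoder(buffer))
print(len(buffer.getvalue()), "bytes in the Avro binary encoding")
print(len(json.dumps(datum)), "bytes as plain JSON text, for a rough comparison")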
Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed). [7]
Simple schema example:
{"namespace":"example.avro","type":"record","name":"User","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["null","int"]},{"name":"favorite_color","type":["null","string"]}]}
Data in Avro might be stored with its corresponding schema, meaning a serialized item can be read without knowing the schema ahead of time.
Serialization: [8]
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Need to know the schema to write. According to 1.8.2 of Apache Avro
schema = avro.schema.parse(open("user.avsc", "rb").read())

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 8, "favorite_color": "red"})
writer.close()
File "users.avro" will contain the schema in JSON and a compact binary representation [9] of the data:
$ od -v -t x1z users.avro
0000000 4f 62 6a 01 04 14 61 76 72 6f 2e 63 6f 64 65 63  >Obj...avro.codec<
0000020 08 6e 75 6c 6c 16 61 76 72 6f 2e 73 63 68 65 6d  >.null.avro.schem<
0000040 61 ba 03 7b 22 74 79 70 65 22 3a 20 22 72 65 63  >a..{"type": "rec<
0000060 6f 72 64 22 2c 20 22 6e 61 6d 65 22 3a 20 22 55  >ord", "name": "U<
0000100 73 65 72 22 2c 20 22 6e 61 6d 65 73 70 61 63 65  >ser", "namespace<
0000120 22 3a 20 22 65 78 61 6d 70 6c 65 2e 61 76 72 6f  >": "example.avro<
0000140 22 2c 20 22 66 69 65 6c 64 73 22 3a 20 5b 7b 22  >", "fields": [{"<
0000160 74 79 70 65 22 3a 20 22 73 74 72 69 6e 67 22 2c  >type": "string",<
0000200 20 22 6e 61 6d 65 22 3a 20 22 6e 61 6d 65 22 7d  > "name": "name"}<
0000220 2c 20 7b 22 74 79 70 65 22 3a 20 5b 22 69 6e 74  >, {"type": ["int<
0000240 22 2c 20 22 6e 75 6c 6c 22 5d 2c 20 22 6e 61 6d  >", "null"], "nam<
0000260 65 22 3a 20 22 66 61 76 6f 72 69 74 65 5f 6e 75  >e": "favorite_nu<
0000300 6d 62 65 72 22 7d 2c 20 7b 22 74 79 70 65 22 3a  >mber"}, {"type":<
0000320 20 5b 22 73 74 72 69 6e 67 22 2c 20 22 6e 75 6c  > ["string", "nul<
0000340 6c 22 5d 2c 20 22 6e 61 6d 65 22 3a 20 22 66 61  >l"], "name": "fa<
0000360 76 6f 72 69 74 65 5f 63 6f 6c 6f 72 22 7d 5d 7d  >vorite_color"}]}<
0000400 00 05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef  >......GTb.h...B.<
0000420 24 04 2c 0c 41 6c 79 73 73 61 00 80 04 02 06 42  >$.,.Alyssa.....B<
0000440 65 6e 00 10 00 06 72 65 64 05 f9 a3 80 98 47 54  >en....red.....GT<
0000460 62 bf 68 95 a2 ab 42 ef 24                       >b.h...B.$<
0000471
Deserialization:
# The schema is embedded in the data file
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
This outputs:
{u'favorite_color': None, u'favorite_number': 256, u'name': u'Alyssa'}
{u'favorite_color': u'red', u'favorite_number': 8, u'name': u'Ben'}
Though theoretically any language could use Avro, the following languages have APIs written for them: [10] [11]
In addition to supporting JSON for type and protocol definitions, Avro includes experimental [24] support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++, Protocol Buffers and others.
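As a rough sketch, the User record from the JSON schema above might be written in Avro IDL along the following lines (the protocol name UserProtocol is illustrative):

@namespace("example.avro")
protocol UserProtocol {
  record User {
    string name;
    union { null, int } favorite_number = null;
    union { null, string } favorite_color = null;
  }
}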
The original Apache Avro logo was from the defunct British aircraft manufacturer Avro (originally A.V. Roe and Company). [25]
The Apache Avro logo was updated to an original design in late 2023. [26]
In distributed computing, a remote procedure call (RPC) is when a computer program causes a procedure (subroutine) to execute in a different address space, which is written as if it were a normal (local) procedure call, without the programmer explicitly writing the details for the remote interaction. That is, the programmer writes essentially the same code whether the subroutine is local to the executing program, or remote. This is a form of client–server interaction, typically implemented via a request–response message passing system. In the object-oriented programming paradigm, RPCs are represented by remote method invocation (RMI). The RPC model implies a level of location transparency, namely that calling procedures are largely the same whether they are local or remote, but usually, they are not identical, so local calls can be distinguished from remote calls. Remote calls are usually orders of magnitude slower and less reliable than local calls, so distinguishing them is important.
Abstract Syntax Notation One (ASN.1) is a standard interface description language (IDL) for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.
An interface description language or interface definition language (IDL) is a generic term for a language that lets a program or object written in one language communicate with another program written in an unknown language. IDLs are usually used to describe data types and interfaces in a language-independent way, for example, between those written in C++ and those written in Java.
In the context of SQL, data definition or data description language (DDL) is a syntax for creating and modifying database objects such as tables, indices, and users. DDL statements are similar to a computer programming language for defining data structures, especially database schemas. Common examples of DDL statements include CREATE, ALTER, and DROP. A file with a .ddl extension contains statements for creating tables; Oracle SQL Developer can export an ERD generated with Data Modeler to either a .sql file or a .ddl file.
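For illustration, the following sketch uses Python's built-in sqlite3 module to run the three common DDL statements; the table and column names are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")  # CREATE
cur.execute("ALTER TABLE users ADD COLUMN favorite_color TEXT")        # ALTER
cur.execute("DROP TABLE users")                                        # DROP

conn.close()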
In computing, a hex dump is a textual hexadecimal view of computer data, from memory or from a computer file or storage device. Looking at a hex dump of data is usually done in the context of debugging, reverse engineering, or digital forensics. Interactive editors that provide a similar view but can also manipulate the data in question are called hex editors.
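A hex dump similar to the od output shown above can be produced with a few lines of standard-library Python, for example:

# Minimal sketch: print offset (octal), hex bytes, and printable characters.
def hex_dump(data: bytes, width: int = 16) -> None:
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_bytes = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        print(f"{offset:07o} {hex_bytes:<{width * 3}} >{text}<")

with open("users.avro", "rb") as f:
    hex_dump(f.read())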
JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers.
Snappy is a fast data compression and decompression library written in C++ by Google based on ideas from LZ77 and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single core of a circa 2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The compression ratio is 20–100% lower than gzip.
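Snappy is also one of the optional block codecs for Avro object container files (the avro.codec metadata key visible in the hex dump above names the codec in use). As a sketch, assuming the third-party python-snappy package is installed:

import snappy

data = b"example.avro " * 1000
compressed = snappy.compress(data)
print(len(data), "bytes raw,", len(compressed), "bytes after Snappy compression")
assert snappy.decompress(compressed) == data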
Thrift is an IDL and binary communication protocol used for defining and creating services for programming languages. It was developed by Facebook. Since 2020, it is an open source project in the Apache Software Foundation.
In computer programming, a netstring is a formatting method for byte strings that uses a declarative notation to indicate the size of the string.
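For example, the netstring encoding of the 12-byte string "hello world!" is "12:hello world!,". A minimal Python sketch of encoding and decoding:

def encode_netstring(payload: bytes) -> bytes:
    # Length in ASCII decimal, a colon, the payload, and a trailing comma.
    return str(len(payload)).encode("ascii") + b":" + payload + b","

def decode_netstring(data: bytes) -> bytes:
    length, _, rest = data.partition(b":")
    n = int(length)
    if rest[n:n + 1] != b",":
        raise ValueError("malformed netstring")
    return rest[:n]

encoded = encode_netstring(b"hello world!")
print(encoded)                    # b'12:hello world!,'
print(decode_netstring(encoded))  # b'hello world!'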
This is a comparison of data serialization formats, various ways to convert complex objects to sequences of bits. It does not include markup languages used exclusively as document file formats.
Apache Hive is a data warehouse software project. It is built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.
MessagePack is a computer data interchange format. It is a binary form for representing simple data structures like arrays and associative arrays. MessagePack aims to be as compact and simple as possible. The official implementation is available in a variety of languages, some official libraries and others community created, such as C, C++, C#, D, Erlang, Go, Haskell, Java, JavaScript (NodeJS), Lua, OCaml, Perl, PHP, Python, Ruby, Rust, Scala, Smalltalk, and Swift.
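As a sketch, assuming the third-party msgpack package for Python is installed, a record similar to the Avro example above can be packed and unpacked as follows:

import msgpack

record = {"name": "Alyssa", "favorite_number": 256}
packed = msgpack.packb(record)
print(len(packed), "bytes packed")
print(msgpack.unpackb(packed))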
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Tomer Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016.
Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases, semi-structured data, unstructured data, and binary data. A data lake can be established "on premises" or "in the cloud".
Ion is a data serialization language developed by Amazon. It may be represented by either a human-readable text form or a compact binary form. The text form is a superset of JSON; thus, any valid JSON document is also a valid Ion document.
Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
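As a hedged sketch, assuming the pyarrow package is installed, the User records from the Avro example above could be stored in Parquet's columnar format like this:

import pyarrow as pa
import pyarrow.parquet as pq

# Columnar layout: one array per field rather than one record per row.
table = pa.table({
    "name": ["Alyssa", "Ben"],
    "favorite_number": [256, 8],
    "favorite_color": [None, "red"],
})
pq.write_table(table, "users.parquet")
print(pq.read_table("users.parquet"))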
FlatBuffers is a free software library implementing a serialization format similar to Protocol Buffers, Thrift, Apache Avro, SBE, and Cap'n Proto, primarily written by Wouter van Oortmerssen and open-sourced by Google. It supports “zero-copy” deserialization, so that accessing the serialized data does not require first copying it into a separate part of memory. This makes accessing data in these formats much faster than data in formats requiring more extensive processing, such as JSON, CSV, and in many cases Protocol Buffers. Compared to other serialization formats, however, handling FlatBuffers usually requires more code, and some operations are not possible.