HOME

TheInfoList



OR:

Avro is a row-oriented
remote procedure call In distributed computing, a remote procedure call (RPC) is when a computer program causes a procedure ( subroutine) to execute in a different address space (commonly on another computer on a shared network), which is coded as if it were a normal ...
and data
serialization In computing, serialization (or serialisation) is the process of translating a data structure or object state into a format that can be stored (e.g. files in secondary storage devices, data buffers in primary storage devices) or transmitted (e ...
framework A framework is a generic term commonly referring to an essential supporting structure which other things are built on top of. Framework may refer to: Computing * Application framework, used to implement the structure of an application for an op ...
developed within Apache's Hadoop project. It uses
JSON JSON (JavaScript Object Notation, pronounced ; also ) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other se ...
for defining
data type In computer science and computer programming, a data type (or simply type) is a set of possible values and a set of allowed operations on it. A data type tells the compiler or interpreter how the programmer intends to use the data. Most progra ...
s and
protocol Protocol may refer to: Sociology and politics * Protocol (politics), a formal agreement between nation states * Protocol (diplomacy), the etiquette of diplomacy and affairs of state * Etiquette, a code of personal behavior Science and technology ...
s, and serializes data in a compact binary format. Its primary use is in
Apache Hadoop Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage a ...
, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. It has two different types of schema languages; one for human editing (Avro IDL) and another which is more machine-readable based on JSON. It is similar to Thrift and
Protocol Buffers Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs to communicate with each other over a network or for storing data. The method involves an i ...
, but does not require running a code-generation program when a
schema The word schema comes from the Greek word ('), which means ''shape'', or more generally, ''plan''. The plural is ('). In English, both ''schemas'' and ''schemata'' are used as plural forms. Schema may refer to: Science and technology * SCHEMA ...
changes (unless desired for statically-typed languages). Apache Spark SQL can access Avro as a data source.


Avro Object Container File

An Avro Object Container File consists of: * A
file header File or filing may refer to: Mechanical tools and processes * File (tool), a tool used to ''remove'' fine amounts of material from a workpiece **Filing (metalworking), a material removal process in manufacturing ** Nail file, a tool used to gent ...
, followed by * one or more file data blocks. A file header consists of: * Four bytes,
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
'O', 'b', 'j', followed by the Avro version number which is 1 (0x01) (Binary values 0x4F 0x62 0x6A 0x01). * File metadata, including the schema definition. * The 16-byte, randomly-generated sync marker for this file. For data blocks Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.


Schema definition

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed). Simple schema example:


Serializing and deserializing

Data in Avro might be stored with its corresponding schema, meaning a serialized item can be read without knowing the schema ahead of time.


Example serialization and deserialization code in Python

Serialization: import avro.schema from avro.datafile import DataFileReader, DataFileWriter from avro.io import DatumReader, DatumWriter schema = avro.schema.parse(open("user.avsc", "rb").read()) # need to know the schema to write. According to 1.8.2 of Apache Avro writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema) writer.append() writer.append() writer.close() File "users.avro" will contain the schema in JSON and a compact binary representation of the data: $ od -v -t x1z users.avro 0000000 4f 62 6a 01 04 14 61 76 72 6f 2e 63 6f 64 65 63 >Obj...avro.codec< 0000020 08 6e 75 6c 6c 16 61 76 72 6f 2e 73 63 68 65 6d >.null.avro.schem< 0000040 61 ba 03 7b 22 74 79 70 65 22 3a 20 22 72 65 63 >a..< 0000400 00 05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef >......GTb.h...B.< 0000420 24 04 2c 0c 41 6c 79 73 73 61 00 80 04 02 06 42 >$.,.Alyssa.....B< 0000440 65 6e 00 10 00 06 72 65 64 05 f9 a3 80 98 47 54 >en....red.....GT< 0000460 62 bf 68 95 a2 ab 42 ef 24 >b.h...B.$< 0000471 Deserialization: reader = DataFileReader(open("users.avro", "rb"), DatumReader()) # the schema is embedded in the data file for user in reader: print(user) reader.close() This outputs:


Languages with APIs

Though theoretically any language could use Avro, the following languages have APIs written for them: * C * C++ * C# *
Elixir ELIXIR (the European life-sciences Infrastructure for biological Information) is an initiative that will allow life science laboratories across Europe to share and store their research data as part of an organised network. Its goal is to bring t ...
* Go * Haskell *
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
*
Javascript JavaScript (), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of websites use JavaScript on the client side for webpage behavior, of ...
*
Perl Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was offic ...
* PHP * Python *
Ruby A ruby is a pinkish red to blood-red colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called ...
*
Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO( ...
* Scala


Avro IDL

In addition to supporting JSON for type and protocol definitions, Avro includes experimental support for an alternative
interface description language interface description language or interface definition language (IDL), is a generic term for a language that lets a program or object written in one language communicate with another program written in an unknown language. IDLs describe an inter ...
(IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++,
Protocol Buffers Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs to communicate with each other over a network or for storing data. The method involves an i ...
and others.


Logo

The Apache Avro logo is from the defunct British aircraft manufacturer
Avro AVRO, short for Algemene Vereniging Radio Omroep ("General Association of Radio Broadcasting"), was a Dutch public broadcasting association operating within the framework of the Nederlandse Publieke Omroep system. It was the first public broa ...
(originally A.V. Roe and Company). Football team Avro F.C. uses the same logo.


See also

* Comparison of data serialization formats *
Apache Thrift Thrift is an interface definition language and binary communication protocol used for defining and creating services for numerous programming languages. It was developed at Facebook for "scalable cross-language services development" and as of 2 ...
*
Protocol Buffers Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs to communicate with each other over a network or for storing data. The method involves an i ...
*
Etch (protocol) Etch was an open-source, cross-platform framework for building network services, first announced in May 2008 by Cisco Systems. Etch encompasses a service description language, a compiler, and a number of language bindings. It is intended to sup ...
*
Internet Communications Engine The Internet Communications Engine, or Ice, is an open-source RPC framework developed by ZeroC. It provides SDKs for C++, C#, Java, JavaScript, MATLAB, Objective-C, PHP, Python, Ruby and Swift, and can run on various operating systems, inc ...
* MessagePack *
CBOR Concise Binary Object Representation (CBOR) is a binary data serialization format loosely based on JSON authored by C. Bormann. Like JSON it allows the transmission of data objects that contain name–value pairs, but in a more concise manner. ...


References


Further reading

* {{DEFAULTSORT:Avro Apache Software Foundation projects Inter-process communication Application layer protocols Remote procedure call Data serialization formats Articles with example Python (programming language) code