Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. It has two schema languages: one intended for human editing (Avro IDL) and a more machine-readable one based on JSON. It is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages). Apache Spark SQL can access Avro as a data source.


Avro Object Container File

An Avro Object Container File consists of:
* A file header, followed by
* one or more file data blocks.

A file header consists of:
* Four bytes, ASCII 'O', 'b', 'j', followed by the Avro version number which is 1 (0x01) (binary values 0x4F 0x62 0x6A 0x01).
* File metadata, including the schema definition.
* The 16-byte, randomly-generated sync marker for this file.

For data blocks Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
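
The header layout described above can be read with a few lines of standard-library Python. The following is a minimal sketch, not the Avro library's own reader: the function names (`read_long`, `read_bytes`, `read_header`) and the hand-built `sample` header are illustrative assumptions. It relies on the binary encoding storing integers as zigzag variable-length values and storing the metadata as a map of string keys to byte values that ends with a zero count.

```python
import io

MAGIC = b"Obj\x01"  # ASCII 'O', 'b', 'j', then the version byte 1


def read_long(stream):
    """Read one zigzag-encoded, variable-length integer."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data while reading an integer")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear: this was the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo the zigzag mapping


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))


def read_header(stream):
    """Return (metadata, sync_marker) from the start of a container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a byte size follows, skipped here
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)
    return metadata, stream.read(16)


# Hypothetical header built by hand: one metadata entry holding the
# schema "string", followed by a made-up 16-byte sync marker.
sample = (MAGIC + b"\x02" + b"\x16avro.schema" + b"\x10" + b'"string"'
          + b"\x00" + bytes(range(16)))
metadata, sync = read_header(io.BytesIO(sample))
print(metadata["avro.schema"].decode("utf-8"), sync.hex())
```

A real file would normally carry an `avro.codec` entry as well, and its sync marker would be random rather than the fixed bytes used here.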


Schema definition

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed). Simple schema example:
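
As an illustration of how primitive and complex types combine, the sketch below builds a record schema as a Python dictionary and prints it as JSON. The `User` record, its namespace and its field names are assumptions chosen for illustration, and `classify` is a hypothetical helper, not part of any Avro library. A union is written as a JSON array of its branch types.

```python
import json

PRIMITIVE_TYPES = {"null", "boolean", "int", "long",
                   "float", "double", "bytes", "string"}

# Hypothetical record schema: one required string field and two
# optional fields expressed as unions with "null".
user_schema = {
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}


def classify(schema):
    """Name the kind of an Avro schema given as parsed JSON."""
    if isinstance(schema, list):  # a JSON array denotes a union
        return "union"
    if isinstance(schema, dict):  # complex types carry a "type" key
        return schema["type"]
    if schema in PRIMITIVE_TYPES:
        return schema
    return "named type reference"


kinds = {field["name"]: classify(field["type"])
         for field in user_schema["fields"]}

print(json.dumps(user_schema, indent=2))  # the text a .avsc file would hold
print(classify(user_schema), kinds)
```

Saved to a file such as `user.avsc`, the printed JSON is the form an Avro implementation would parse.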


Serializing and deserializing

Data in Avro might be stored with its corresponding schema, meaning a serialized item can be read without knowing the schema ahead of time.


Example serialization and deserialization code in Python

Serialization:

    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    # The schema must be known in order to write (as of Apache Avro 1.8.2).
    schema = avro.schema.parse(open("user.avsc", "rb").read())

    writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
    # The values match the bytes in the dump below; the field names
    # assume the record schema stored in user.avsc.
    writer.append({"name": "Alyssa", "favorite_number": 256})
    writer.append({"name": "Ben", "favorite_number": 8, "favorite_color": "red"})
    writer.close()

File "users.avro" will contain the schema in JSON and a compact binary representation of the data:

    $ od -v -t x1z users.avro
    0000000 4f 62 6a 01 04 14 61 76 72 6f 2e 63 6f 64 65 63  >Obj...avro.codec<
    0000020 08 6e 75 6c 6c 16 61 76 72 6f 2e 73 63 68 65 6d  >.null.avro.schem<
    0000040 61 ba 03 7b 22 74 79 70 65 22 3a 20 22 72 65 63  >a..{"type": "rec<
    ...
    0000400 00 05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef  >......GTb.h...B.<
    0000420 24 04 2c 0c 41 6c 79 73 73 61 00 80 04 02 06 42  >$.,.Alyssa.....B<
    0000440 65 6e 00 10 00 06 72 65 64 05 f9 a3 80 98 47 54  >en....red.....GT<
    0000460 62 bf 68 95 a2 ab 42 ef 24                       >b.h...B.$<
    0000471

Deserialization:

    # The schema is embedded in the data file.
    reader = DataFileReader(open("users.avro", "rb"), DatumReader())
    for user in reader:
        print(user)
    reader.close()

This prints each stored record as a Python dictionary, one per line.
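
The tail of the dump can also be decoded by hand, which shows how compact the binary encoding is. The sketch below uses only the standard library and copies the bytes that follow the header's sync marker. The field names and the union branch order (`int` or `string` first, `null` second) are assumptions inferred from the bytes themselves, and the helper names are hypothetical.

```python
import io

# Data block copied from the dump: object count, byte size,
# two encoded records, then the 16-byte sync marker.
BLOCK = bytes.fromhex(
    "04 2c "
    "0c 41 6c 79 73 73 61 00 80 04 02 "
    "06 42 65 6e 00 10 00 06 72 65 64 "
    "05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef 24"
)


def read_long(stream):
    """Read one zigzag-encoded, variable-length integer."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_string(stream):
    """Read a length-prefixed UTF-8 string."""
    return stream.read(read_long(stream)).decode("utf-8")


def read_null(stream):
    """A null value occupies zero bytes."""
    return None


def read_union(stream, branches):
    """A union is written as a branch index followed by that branch's value."""
    return branches[read_long(stream)](stream)


def read_user(stream):
    # Assumed field order and union branches; see the note above.
    return {
        "name": read_string(stream),
        "favorite_number": read_union(stream, [read_long, read_null]),
        "favorite_color": read_union(stream, [read_string, read_null]),
    }


stream = io.BytesIO(BLOCK)
count = read_long(stream)  # number of records in the block
size = read_long(stream)   # size in bytes of the encoded records
users = [read_user(stream) for _ in range(count)]
sync = stream.read(16)     # repeats the marker from the file header

for user in users:
    print(user)
```

Under these assumptions the block holds two records in 22 bytes: Alyssa with favorite number 256 and no favorite color, and Ben with favorite number 8 and favorite color red.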


Languages with APIs

Though theoretically any language could use Avro, the following languages have APIs written for them:
* C
* C++
* C#
* Elixir
* Go
* Haskell
* Java
* JavaScript
* Perl
* PHP
* Python
* Ruby
* Rust
* Scala


Avro IDL

In addition to supporting JSON for type and protocol definitions, Avro includes experimental support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++, Protocol Buffers and others.


Logo

The Apache Avro logo is from the defunct British aircraft manufacturer Avro (originally A.V. Roe and Company). The football team Avro F.C. uses the same logo.


See also

* Comparison of data serialization formats
* Apache Thrift
* Protocol Buffers
* Etch (protocol)
* Internet Communications Engine
* MessagePack
* CBOR

