Apache Avro
   HOME

TheInfoList



OR:

Avro is a row-oriented
remote procedure call In distributed computing, a remote procedure call (RPC) is when a computer program causes a procedure (subroutine) to execute in a different address space (commonly on another computer on a shared computer network), which is written as if it were a ...
and data
serialization In computing, serialization (or serialisation, also referred to as pickling in Python (programming language), Python) is the process of translating a data structure or object (computer science), object state into a format that can be stored (e. ...
framework developed within Apache's Hadoop project. It uses
JSON JSON (JavaScript Object Notation, pronounced or ) is an open standard file format and electronic data interchange, data interchange format that uses Human-readable medium and data, human-readable text to store and transmit data objects consi ...
for defining
data type In computer science and computer programming, a data type (or simply type) is a collection or grouping of data values, usually specified by a set of possible values, a set of allowed operations on these values, and/or a representation of these ...
s and
protocol Protocol may refer to: Sociology and politics * Protocol (politics) Protocol originally (in Late Middle English, c. 15th century) meant the minutes or logbook taken at a meeting, upon which an agreement was based. The term now commonly refers to ...
s, and serializes data in a compact binary format. Its primary use is in
Apache Hadoop Apache Hadoop () is a collection of open-source software utilities for reliable, scalable, distributed computing. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop wa ...
, where it can provide both a serialization format for
persistent data Persistent data in the field of data processing denotes information that is infrequently accessed and unlikely to be modified. Static data is information, for example a record, that does not change and may be intended to be permanent. It may hav ...
, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop
service Service may refer to: Activities * Administrative service, a required part of the workload of university faculty * Civil service, the body of employees of a government * Community service, volunteer service for the benefit of a community or a ...
s. Avro uses a schema to structure the data that is being encoded. It has two different types of schema languages: one for human editing (Avro IDL) and another which is more
machine-readable In communications and computing, a machine-readable medium (or computer-readable medium) is a medium capable of storing data in a format easily readable by a digital computer or a sensor. It contrasts with ''human-readable'' medium and data. T ...
based on JSON. It is similar to Thrift and
Protocol Buffers Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs that communicate with each other over a network or for storing data. The method involves an ...
, but does not require running a code-generation program when a
schema Schema may refer to: Science and technology * SCHEMA (bioinformatics), an algorithm used in protein engineering * Schema (genetic algorithms), a set of programs or bit strings that have some genotypic similarity * Schema.org, a web markup vocab ...
changes (unless desired for
statically-typed In computer programming, a type system is a logical system comprising a set of rules that assigns a property called a ''type'' (for example, integer, floating point, string) to every '' term'' (a word, phrase, or other set of symbols). Usuall ...
languages). Apache Spark SQL can access Avro as a data source.


Avro Object Container File

An Avro Object Container File consists of: * A
file header In information technology, header is supplemental data placed at the beginning of a block of data being stored or transmitted. In data transmission, the data following the header is sometimes called the '' payload'' or '' body''. It is vital that ...
, followed by * one or more file data blocks. A file header consists of: * Four bytes,
ASCII ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
'O', 'b', 'j', followed by the Avro version number which is 1 (0x01) (Binary values 0x4F 0x62 0x6A 0x01). * File metadata, including the schema definition. * The 16-byte, randomly-generated sync marker for this file. For data blocks Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.


Schema definition

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed). Simple schema example:


Serializing and deserializing

Data in Avro might be stored with its corresponding schema, meaning a serialized item can be read without knowing the schema ahead of time.


Example serialization and deserialization code in Python

Serialization: import avro.schema from avro.datafile import DataFileReader, DataFileWriter from avro.io import DatumReader, DatumWriter # Need to know the schema to write. According to 1.8.2 of Apache Avro schema = avro.schema.parse(open("user.avsc", "rb").read()) writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema) writer.append() writer.append() writer.close() File "users.avro" will contain the schema in JSON and a compact binary representation of the data: $ od -v -t x1z users.avro 0000000 4f 62 6a 01 04 14 61 76 72 6f 2e 63 6f 64 65 63 >Obj...avro.codec< 0000020 08 6e 75 6c 6c 16 61 76 72 6f 2e 73 63 68 65 6d >.null.avro.schem< 0000040 61 ba 03 7b 22 74 79 70 65 22 3a 20 22 72 65 63 >a..< 0000400 00 05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef >......GTb.h...B.< 0000420 24 04 2c 0c 41 6c 79 73 73 61 00 80 04 02 06 42 >$.,.Alyssa.....B< 0000440 65 6e 00 10 00 06 72 65 64 05 f9 a3 80 98 47 54 >en....red.....GT< 0000460 62 bf 68 95 a2 ab 42 ef 24 >b.h...B.$< 0000471 Deserialization: # The schema is embedded in the data file reader = DataFileReader(open("users.avro", "rb"), DatumReader()) for user in reader: print(user) reader.close() This outputs:


Languages with APIs

Though theoretically any language could use Avro, the following languages have APIs written for them: * C * C++ * C# *
Elixir An elixir is a sweet liquid used for medical purposes, to be taken orally and intended to cure one's illness. When used as a dosage form, pharmaceutical preparation, an elixir contains at least one active ingredient designed to be taken orall ...
* Go *
Haskell Haskell () is a general-purpose, statically typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research, and industrial applications, Haskell pioneered several programming language ...
*
Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
*
JavaScript JavaScript (), often abbreviated as JS, is a programming language and core technology of the World Wide Web, alongside HTML and CSS. Ninety-nine percent of websites use JavaScript on the client side for webpage behavior. Web browsers have ...
*
Perl Perl is a high-level, general-purpose, interpreted, dynamic programming language. Though Perl is not officially an acronym, there are various backronyms in use, including "Practical Extraction and Reporting Language". Perl was developed ...
*
PHP PHP is a general-purpose scripting language geared towards web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by the PHP Group. ...
*
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...
*
Ruby Ruby is a pinkish-red-to-blood-red-colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sapph ...
*
Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO(OH) ...
* Scala


Avro IDL

In addition to supporting JSON for type and protocol definitions, Avro includes experimental support for an alternative
interface description language An interface description language or interface definition language (IDL) is a generic term for a language that lets a program or object written in one language communicate with another program written in an unknown language. IDLs are usually use ...
(IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++,
Protocol Buffers Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs that communicate with each other over a network or for storing data. The method involves an ...
and others.


Logo

The original Apache Avro logo was from the defunct British aircraft manufacturer
Avro Avro (an initialism of the founder's name) was a British aircraft manufacturer. Its designs include the Avro 504, used as a trainer in the First World War, the Avro Lancaster, one of the pre-eminent bombers of the Second World War, and the d ...
(originally A.V. Roe and Company). The Apache Avro logo was updated to an original design in late 2023.


See also

*
Comparison of data serialization formats This is a comparison of data serialization formats, various ways to convert complex objects to sequences of bits. It does not include markup languages used exclusively as document file format A document file format is a Text file, text or bi ...
*
Apache Thrift Thrift is an IDL (interface definition language, Interface Definition Language) and Binary protocol, binary communication protocol used for defining and creating service (systems architecture), services for programming languages. It was developed ...
*
Protocol Buffers Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs that communicate with each other over a network or for storing data. The method involves an ...
*
Etch (protocol) Etch was an open-source, cross-platform framework for building network services, first announced in May 2008 by Cisco Systems. Etch encompasses a service description language, a compiler, and a number of language bindings. It is intended to suppl ...
*
Internet Communications Engine The Internet Communications Engine, or Ice, is an open-source RPC framework developed by ZeroC. It provides SDKs for C++, C#, Java, JavaScript, MATLAB, Objective-C, PHP, Python, Ruby and Swift, and can run on various operating systems, incl ...
* MessagePack *
CBOR Concise Binary Object Representation (CBOR) is a binary data serialization format loosely based on JSON authored by Carsten Bormann and Paul Hoffman. Like JSON it allows the transmission of data objects that contain attribute–value pair, name ...


References


Further reading

* {{DEFAULTSORT:Avro Apache Software Foundation projects Inter-process communication Application layer protocols Remote procedure call Data serialization formats Articles with example Python (programming language) code