HOME

TheInfoList



OR:

Snappy (previously known as Zippy) is a fast
data compression In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressio ...
and decompression library written in
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
by Google based on ideas from
LZ77 LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively. These two algorithms form the basis for many variations includin ...
and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250
MB/s In telecommunications, data-transfer rate is the average number of bits ( bitrate), characters or symbols ( baudrate), or data blocks per unit time passing through a communication link in a data-transmission system. Common data rate units are mu ...
and decompression speed is 500 MB/s using a single core of a circa 2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The compression ratio is 20–100% lower than
gzip gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and i ...
. Snappy is widely used in Google projects like
Bigtable Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio. History Bigtable development began in 2004.. It is now used by a number of Googl ...
,
MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filtering ...
and in compressing data for Google's internal RPC systems. It can be used in open-source projects like MariaDB ColumnStore,
Cassandra Cassandra or Kassandra (; Ancient Greek: Κασσάνδρα, , also , and sometimes referred to as Alexandra) in Greek mythology was a Trojan priestess dedicated to the god Apollo and fated by him to utter true prophecies but never to be belie ...
,
Couchbase Couchbase Server, originally known as Membase, is an open-source, distributed ( shared-nothing architecture) multi-model NoSQL document-oriented database software package optimized for interactive applications. These applications may serve many ...
,
Hadoop Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage an ...
,
LevelDB LevelDB is an open-source on-disk key-value store written by Google fellows Jeffrey Dean and Sanjay Ghemawat. Inspired by Bigtable, LevelDB is hosted on GitHub under the New BSD License and has been ported to a variety of Unix-based systems, ma ...
,
MongoDB MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Ser ...
, RocksDB,
Lucene Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as ...
,
Spark Spark commonly refers to: * Spark (fire), a small glowing particle or ember * Electric spark, a form of electrical discharge Spark may also refer to: Places * Spark Point, a rocky point in the South Shetland Islands People * Spark (surname) * ...
, and InfluxDB. Decompression is tested to detect any errors in the compressed stream. Snappy does not use
inline assembler In computer programming, an inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a higher-level language such as C ...
(except some optimizations) and is portable.


Stream format

Snappy encoding is not bit-oriented, but byte-oriented (only whole bytes are emitted or consumed from a stream). The format uses no entropy encoder, like
Huffman tree In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algor ...
or arithmetic encoder. The first bytes of the stream are the length of uncompressed data, stored as a
little-endian In computing, endianness, also known as byte sex, is the order or sequence of bytes of a word of digital data in computer memory. Endianness is primarily expressed as big-endian (BE) or little-endian (LE). A big-endian system stores the most si ...
varint, which allows for use of a
variable-length code In coding theory a variable-length code is a code which maps source symbols to a ''variable'' number of bits. Variable-length codes can allow sources to be compressed and decompressed with ''zero'' error (lossless data compression) and still b ...
. The lower seven bits of each byte are used for data and the high bit is a flag to indicate the end of the length field. The remaining bytes in the stream are encoded using one of four element types. The element type is encoded in the lower two bits of the first byte (''tag byte'') of the element: * 00 – ''Literal'' – uncompressed data; upper 6 bits are used to store length (len-1) of data. Lengths larger than 60 are stored in a 1-4 byte integer indicated by a 6 bit length of 60 (1 byte) to 63 (4 bytes). * 01 – Copy with length stored as 3 bits and offset stored as 11 bits; one byte after tag byte is used for part of offset; * 10 – Copy with length stored as 6 bits of tag byte and offset stored as two-byte integer after the tag byte; * 11 – Copy with length stored as 6 bits of tag byte and offset stored as four-byte little-endian integer after the tag byte; The copy refers to the dictionary (just-decompressed data). The offset is the shift from the current position back to the already decompressed stream. The length is the number of bytes to copy from the dictionary. The size of the dictionary was limited by the 1.0 Snappy compressor to 32,768 bytes, and updated to 65,536 in version 1.1. The complete official description of the snappy format can be found in the google GitHub repository.


Example of a compressed stream

The text may be compressed to this, shown as hex data with explanations: 0000000: ''ca02'' ''f042'' 5769 6b69 7065 6469 6120 6973 ...BWikipedia is The first 2 bytes, ''ca02'' are the length, as a little-endian varint (see Protocol Buffers for the varint specification). Thus the most-significant byte is '02' . 0x02ca(varint) = 0x014a = 330 bytes. The next two bytes, 0xf042, indicate that a literal of 66+1 bytes follows 0000010: 2061 2066 7265 652c 2077 6562 2d62 6173 a free, web-bas 0000020: 6564 2c20 636f 6c6c 6162 6f72 6174 6976 ed, collaborativ 0000030: 652c 206d 756c 7469 6c69 6e67 7561 6c20 e, multilingual 0000040: 656e 6379 636c 6f''09'' 3f''f0 14''70 726f 6a65 encyclo.?..proje 0x09 is tag-byte of type 01 with length - 4 = 0102 = 210 and offset = 0x03f = 63 or "pedia ";
0xf014 is a literal with length of 20+1 bytes 0000050: 6374 2e00 0000 0000 0000 0000 0000 0000 ct. In this example, all common substrings with four or more characters were eliminated by the compression process. More common compressors can compress this better. Unlike compression methods such as gzip and bzip2, there is no
entropy encoding In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method ...
used to pack alphabet into the bit stream.


Interfaces

Snappy distributions include C++ and C bindings. Third party-provided bindings and ports include C#, Common Lisp,
Crystal (programming language) Crystal is a general-purpose, object-oriented programming language, designed and developed by Ary Borenszweig, Juan Wajnerman, Brian Cardiff and more than 300 contributors. With syntax inspired by the language Ruby, it is a compiled language w ...
, Erlang, Go,
Haskell Haskell () is a general-purpose, statically-typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research and industrial applications, Haskell has pioneered a number of programming lan ...
,
Lua Lua or LUA may refer to: Science and technology * Lua (programming language) * Latvia University of Agriculture * Last universal ancestor, in evolution Ethnicity and language * Lua people, of Laos * Lawa people, of Thailand sometimes referred t ...
,
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...
, Nim, Node.js,
Perl Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was offic ...
,
PHP PHP is a general-purpose scripting language geared toward web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by The PHP Group. ...
,
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
, R,
Ruby A ruby is a pinkish red to blood-red colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called ...
,
Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO( ...
, Smalltalk, and
OpenCL OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-progra ...
.


See also

*
Zstandard Zstandard, commonly known by the name of its reference implementation zstd, is a lossless data compression algorithm developed by Yann Collet at Facebook. ''Zstd'' is the reference implementation in C. Version 1 of this implementation was r ...


References


External links


Snappy mailing list
{{Archive formats Archive formats Cross-platform free software Free data compression software Free software programmed in C++ C++ libraries