Snappy (previously known as Zippy) is a fast
data compression
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression ...
and
decompression library written in
C++
C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
by Google based on ideas from
LZ77
LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978.
They are also known as LZ1 and LZ2 respectively. These two algorithms form the basis for many variations includin ...
and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250
MB/s
In telecommunications, data-transfer rate is the average number of bits (bitrate), characters or symbols (baudrate), or data blocks per unit time passing through a communication link in a data-transmission system. Common data rate units are multi ...
and decompression speed is 500 MB/s using a single core of a circa 2011
"Westmere" 2.26 GHz Core i7 processor running in
64-bit mode. The
compression ratio
The compression ratio is the ratio between the volume of the cylinder and combustion chamber in an internal combustion engine at their maximum and minimum values.
A fundamental specification for such engines, it is measured two ways: the stati ...
is 20–100% lower than
gzip
gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and in ...
.
Snappy is widely used in Google projects like
Bigtable
Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio.
History
Bigtable development began in 2004.. It is now used by a number of Googl ...
,
MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a ''map'' procedure, which performs filtering ...
and in compressing data for Google's internal
RPC systems. It can be used in open-source projects like
MariaDB ColumnStore,
Cassandra
Cassandra or Kassandra (; Ancient Greek: Κασσάνδρα, , also , and sometimes referred to as Alexandra) in Greek mythology was a Trojan priestess dedicated to the god Apollo and fated by him to utter true prophecies but never to be believe ...
,
Couchbase
Couchbase Server, originally known as Membase, is an open-source, distributed ( shared-nothing architecture) multi-model NoSQL document-oriented database software package optimized for interactive applications. These applications may serve many ...
,
Hadoop
Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage an ...
,
LevelDB
LevelDB is an open-source on-disk key-value store written by Google fellows Jeffrey Dean and Sanjay Ghemawat. Inspired by Bigtable, LevelDB is hosted on GitHub under the New BSD License and has been ported to a variety of Unix-based systems, ma ...
,
MongoDB
MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Serve ...
,
RocksDB
RocksDB is a high performance embedded database for Key-value database, key-value data. It is a fork of Google's LevelDB optimized to exploit many multi-core processor, CPU cores, and make efficient use of fast storage, such as solid-state drives ...
,
Lucene
Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a ...
,
Spark
Spark commonly refers to:
* Spark (fire), a small glowing particle or ember
* Electric spark, a form of electrical discharge
Spark may also refer to:
Places
* Spark Point, a rocky point in the South Shetland Islands
People
* Spark (surname)
* ...
, and
InfluxDB
InfluxDB is an open-source time series database (TSDB) developed by the company InfluxData. It is written in the Go programming language for storage and retrieval of time series data in fields such as operations monitoring, application metr ...
. Decompression is tested to detect any errors in the compressed stream. Snappy does not use
inline assembler In computer programming, an inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a higher-level language such as C ...
(except some optimizations) and is portable.
Stream format
Snappy encoding is not bit-oriented, but byte-oriented (only whole bytes are emitted or consumed from a stream). The format uses no
entropy encoder, like
Huffman tree
In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algor ...
or
arithmetic encoder.
The first bytes of the stream are the length of uncompressed data, stored as a
little-endian
In computing, endianness, also known as byte sex, is the order or sequence of bytes of a word of digital data in computer memory. Endianness is primarily expressed as big-endian (BE) or little-endian (LE). A big-endian system stores the most si ...
varint,
which allows for use of a
variable-length code
In coding theory a variable-length code is a code which maps source symbols to a ''variable'' number of bits.
Variable-length codes can allow sources to be compressed and decompressed with ''zero'' error (lossless data compression) and still be ...
. The lower seven bits of each byte are used for data and the high bit is a flag to indicate the end of the length field.
The remaining bytes in the stream are encoded using one of four element types. The element type is encoded in the lower two bits of the first byte (''tag byte'') of the element:
* 00 – ''Literal'' – uncompressed data; upper 6 bits are used to store length (len-1) of data. Lengths larger than 60 are stored in a 1-4 byte integer indicated by a 6 bit length of 60 (1 byte) to 63 (4 bytes).
* 01 – Copy with length stored as 3 bits and offset stored as 11 bits; one byte after tag byte is used for part of offset;
* 10 – Copy with length stored as 6 bits of tag byte and offset stored as two-byte integer after the tag byte;
* 11 – Copy with length stored as 6 bits of tag byte and offset stored as four-byte little-endian integer after the tag byte;
The copy refers to the dictionary (just-decompressed data). The offset is the shift from the current position back to the already decompressed stream. The length is the number of bytes to copy from the dictionary. The size of the dictionary was limited by the 1.0 Snappy compressor to 32,768 bytes, and updated to 65,536 in version 1.1.
The complete official description of the snappy format can be found in the google GitHub repository.
Example of a compressed stream
The text
may be compressed to this, shown as hex data with explanations:
0000000: ''ca02'' ''f042'' 5769 6b69 7065 6469 6120 6973 ...BWikipedia is
The first 2 bytes, ''ca02'' are the length, as a little-endian varint (see
Protocol Buffers
Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs to communicate with each other over a network or for storing data. The method involves an i ...
for the varint specification).
Thus the most-significant byte is '02' . 0x02ca(varint) = 0x014a = 330 bytes.
The next two bytes, 0xf042, indicate that a literal of 66+1 bytes follows
0000010: 2061 2066 7265 652c 2077 6562 2d62 6173 a free, web-bas
0000020: 6564 2c20 636f 6c6c 6162 6f72 6174 6976 ed, collaborativ
0000030: 652c 206d 756c 7469 6c69 6e67 7561 6c20 e, multilingual
0000040: 656e 6379 636c 6f''09'' 3f''f0 14''70 726f 6a65 encyclo.?..proje
0x09 is tag-byte of type 01 with length - 4 = 010
2 = 2
10 and offset = 0x03f = 63 or "pedia ";
0xf014 is a literal with length of 20+1 bytes
0000050: 6374 2e00 0000 0000 0000 0000 0000 0000 ct.
In this example, all common substrings with four or more characters were eliminated by the compression process. More common compressors can compress this better. Unlike compression methods such as gzip and bzip2, there is no
entropy encoding
In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method ...
used to pack alphabet into the bit stream.
Interfaces
Snappy distributions include C++ and C bindings. Third party-provided bindings and ports include
C#,
Common Lisp
Common Lisp (CL) is a dialect of the Lisp programming language, published in ANSI standard document ''ANSI INCITS 226-1994 (S20018)'' (formerly ''X3.226-1994 (R1999)''). The Common Lisp HyperSpec, a hyperlinked HTML version, has been derived fro ...
,
Crystal (programming language)
Crystal is a general-purpose, object-oriented programming language, designed and developed by Ary Borenszweig, Juan Wajnerman, Brian Cardiff and more than 300 contributors. With syntax inspired by the language Ruby, it is a compiled language ...
,
Erlang,
Go,
Haskell
Haskell () is a general-purpose, statically-typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research and industrial applications, Haskell has pioneered a number of programming lan ...
,
Lua
Lua or LUA may refer to:
Science and technology
* Lua (programming language)
* Latvia University of Agriculture
* Last universal ancestor, in evolution
Ethnicity and language
* Lua people, of Laos
* Lawa people, of Thailand sometimes referred t ...
,
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
,
Nim,
Node.js
Node.js is an open-source server environment. Node.js is cross-platform and runs on Windows, Linux, Unix, and macOS. Node.js is a back-end JavaScript runtime environment. Node.js runs on the V8 JavaScript Engine and executes JavaScript code o ...
,
Perl
Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was offici ...
,
PHP
PHP is a general-purpose scripting language geared toward web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by The PHP Group ...
,
Python
Python may refer to:
Snakes
* Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia
* Python (mythology), a mythical serpent
Computing
* Python (pro ...
,
R,
Ruby
A ruby is a pinkish red to blood-red colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sa ...
,
Rust
Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO(OH ...
,
Smalltalk
Smalltalk is an object-oriented, dynamically typed reflective programming language. It was designed and created in part for educational use, specifically for constructionist learning, at the Learning Research Group (LRG) of Xerox PARC by Alan Ka ...
, and
OpenCL
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-progra ...
.
See also
*
Zstandard
Zstandard, commonly known by the name of its reference implementation zstd, is a lossless data compression algorithm developed by Yann Collet at Facebook.
''Zstd'' is the reference implementation in C. Version 1 of this implementation was r ...
References
External links
Snappy mailing list
{{Archive formats
Archive formats
Cross-platform free software
Free data compression software
Free software programmed in C++
C++ libraries