7z is a compressed archive file format that supports several different data compression, encryption and pre-processing algorithms. The 7z format initially appeared as implemented by the 7-Zip archiver. The 7-Zip program is publicly available under the terms of the GNU Lesser General Public License. The LZMA SDK 4.62 was placed in the public domain in December 2008. The latest stable version of 7-Zip and LZMA SDK is version 24.09.
The 7z file format specification has been distributed with 7-Zip's source code since 2015. The specification can be found in plain text format in the "doc" sub-directory of the source code distribution.
Features and enhancements
The 7z format provides the following main features:
* Open, modular architecture that allows any compression, conversion, or encryption method to be stacked.
* High compression ratios (depending on the compression method used).
* AES-256 encryption.
* Zip 2.0 (Legacy) Encryption
* Large file support (up to approximately 16 exbibytes, or 2^64 bytes).
* Unicode file names.
* Support for solid compression, where multiple files of similar type are compressed within a single stream, in order to exploit the combined redundancy inherent in similar files.
* Compression and encryption of archive headers.
* Support for multi-part archives, e.g. xxx.7z.001, xxx.7z.002, ... (see the context menu items "Split File..." to create them and "Combine Files..." to re-assemble an archive from a set of multi-part component files).
* Support for custom codec plugin DLLs.
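The multi-part naming scheme in the list above can be illustrated with a short sketch. It assumes that each volume is simply a consecutive byte range of the complete archive, so that concatenating the volumes in numeric order restores the original file. The helper names `split_volumes` and `combine_volumes` are hypothetical and are not part of 7-Zip.

```python
def split_volumes(archive: bytes, base_name: str, volume_size: int) -> dict[str, bytes]:
    """Cut an archive into numbered volumes: base_name.001, base_name.002, ..."""
    if volume_size <= 0:
        raise ValueError("volume_size must be positive")
    offsets = range(0, len(archive), volume_size)
    return {
        f"{base_name}.{number:03d}": archive[offset:offset + volume_size]
        for number, offset in enumerate(offsets, start=1)
    }


def combine_volumes(volumes: dict[str, bytes]) -> bytes:
    """Re-assemble the archive by joining the volumes in numeric suffix order."""
    ordered = sorted(volumes, key=lambda name: int(name.rsplit(".", 1)[1]))
    return b"".join(volumes[name] for name in ordered)


# Stand-in bytes; a real input would be the contents of a .7z file.
archive = bytes(range(256)) * 10
volumes = split_volumes(archive, "xxx.7z", volume_size=1000)
print(list(volumes))
```

Sorting on the numeric suffix rather than on the whole name keeps the order correct even if more than 999 volumes are produced.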
The format's open architecture allows additional future compression methods to be added to the standard.
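The benefit of solid compression mentioned in the feature list can be approximated with Python's standard `lzma` module. This is a minimal sketch, not a 7z archive: it compares compressing several similar files one by one against compressing their concatenation as a single stream. The file names and contents are invented for the illustration.

```python
import lzma

# Twenty hypothetical log files with very similar content.
files = {
    f"log{i}.txt": "".join(
        f"2024-01-{day:02d} INFO request {i}-{day} served in {day * 7 % 50} ms\n"
        for day in range(1, 29)
    ).encode()
    for i in range(20)
}

# Non-solid: every file becomes its own compressed stream.
separate = sum(len(lzma.compress(data)) for data in files.values())

# Solid: all files are concatenated and compressed as one stream.
solid = len(lzma.compress(b"".join(files.values())))

print(f"separate streams: {separate} bytes, single solid stream: {solid} bytes")
```

The trade-off is that extracting one file from a solid block generally requires decompressing the data that precedes it.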
Compression methods
The following compression methods are currently defined:
* LZMA – A variation of the LZ77 algorithm, using a sliding dictionary up to 4 GB in length for duplicate string elimination. The LZ stage is followed by entropy coding using a Markov chain-based range coder and binary trees.
* LZMA2 – modified version of LZMA providing better multithreading support and less expansion of incompressible data.
* Bzip2 – The standard Burrows–Wheeler transform algorithm. Bzip2 uses two reversible transformations: BWT, then move-to-front, followed by Huffman coding for symbol reduction (the actual compression element).
* PPMd – Dmitry Shkarin's 2002 PPMdH (PPMII (Prediction by Partial matching with Information Inheritance) and cPPMII (complicated PPMII)) with small changes: PPMII is an improved version of the 1984 PPM compression algorithm (prediction by partial matching).
* DEFLATE – Standard algorithm based on 32 kB LZ77 and Huffman coding. Deflate is found in several file formats including ZIP, gzip, PNG and PDF. 7-Zip contains a from-scratch DEFLATE encoder that frequently beats the de facto standard zlib version in compression size, but at the expense of CPU usage.
A suite of recompression tools called AdvanceCOMP contains a copy of the DEFLATE encoder from the 7-Zip implementation; these utilities can often be used to further reduce the size of existing gzip, ZIP, PNG, or MNG files.
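Several of the methods listed above have counterparts in Python's standard library, which allows a rough side-by-side comparison. The sketch below is illustrative only: it uses the `lzma`, `bz2` and `zlib` modules rather than 7-Zip's own encoders, it wraps each stream in that module's container format (.lzma, .xz, .bz2, zlib) rather than in a 7z archive, and it omits PPMd, which has no standard-library implementation. Sizes are printed rather than stated here because they depend on the library versions in use.

```python
import bz2
import lzma
import os
import zlib

# Each entry maps a method name to a (compress, decompress) pair.
CODECS = {
    "LZMA": (lambda d: lzma.compress(d, format=lzma.FORMAT_ALONE), lzma.decompress),
    "LZMA2": (lzma.compress, lzma.decompress),  # default .xz container uses LZMA2
    "Bzip2": (bz2.compress, bz2.decompress),
    "DEFLATE": (lambda d: zlib.compress(d, 9), zlib.decompress),
}

text = b"The quick brown fox jumps over the lazy dog.\n" * 2000  # highly redundant
noise = os.urandom(len(text))  # effectively incompressible

sizes = {}
for name, (compress, decompress) in CODECS.items():
    for label, data in (("text", text), ("noise", noise)):
        packed = compress(data)
        assert decompress(packed) == data  # every method is lossless
        sizes[name, label] = len(packed)

for (name, label), size in sizes.items():
    print(f"{name:8s} {label:6s} {len(text)} -> {size} bytes")
```

Comparing the "noise" rows gives a feel for how much each container expands data that cannot be compressed.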
Pre-processing filters
The LZMA SDK comes with the BCJ and BCJ2 preprocessors included, so that later stages are able to achieve greater compression. For x86, ARM, PowerPC (PPC), IA-64 Itanium, and ARM Thumb processors, jump targets are "normalized" before compression by changing relative positions into absolute values. For x86, this means that near jumps, calls and conditional jumps (but not short jumps and short conditional jumps) are converted from the machine language "jump 1655 bytes backwards" style notation to normalized "jump to address 5554" style notation; all jumps to 5554, perhaps a common subroutine, are thus encoded identically, making them more compressible.
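The normalization described above can be sketched with a toy filter. This is a simplified illustration under stated assumptions, not the BCJ implementation from the LZMA SDK: it rewrites the 32-bit operand of every byte that looks like an x86 CALL (0xE8) or near JMP (0xE9) opcode, without the heuristics a production filter would use to avoid touching bytes that are not instructions. The function names are hypothetical.

```python
import struct


def _convert(code: bytes, start: int, encode: bool) -> bytes:
    """Scan for E8/E9 opcodes and convert their rel32 operands to or from absolute."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):
            value = struct.unpack_from("<I", out, i + 1)[0]
            next_instruction = start + i + 5  # relative offsets count from here
            if encode:
                value = (value + next_instruction) & 0xFFFFFFFF  # relative -> absolute
            else:
                value = (value - next_instruction) & 0xFFFFFFFF  # absolute -> relative
            struct.pack_into("<I", out, i + 1, value)
            i += 5  # skip the operand so both directions scan the same positions
        else:
            i += 1
    return bytes(out)


def bcj_encode(code: bytes, start: int = 0) -> bytes:
    return _convert(code, start, encode=True)


def bcj_decode(code: bytes, start: int = 0) -> bytes:
    return _convert(code, start, encode=False)


def call_to(target: int, at: int) -> bytes:
    """Assemble 'CALL target' for an instruction located at offset 'at'."""
    return b"\xE8" + struct.pack("<i", target - (at + 5))


# Two calls to the same subroutine from different places, separated by NOPs.
code = call_to(0x1000, 0) + b"\x90" * 27 + call_to(0x1000, 32)
filtered = bcj_encode(code)
print(code[1:5].hex(), code[33:37].hex())          # operands differ before filtering
print(filtered[1:5].hex(), filtered[33:37].hex())  # operands match after filtering
```

Because the opcode bytes themselves are never modified, the decoder visits exactly the same positions as the encoder, which is what makes the transformation reversible.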
* BCJ – Converter for 32-bit x86 executables. Normalises target addresses of near jumps and calls from relative distances to absolute destinations.
* BCJ2 – Pre-processor for x86 executables. BCJ2 is an improvement on BCJ, adding additional x86 jump/call instruction processing. Near jump, near call and conditional near jump targets are split out and compressed separately in another stream.
* Delta encoding – delta filter, basic preprocessor for multimedia data.
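The idea behind the delta filter in the list above can be shown in a few lines. The following is a minimal sketch, assuming a byte-wise delta in which each byte is replaced by its difference from the byte `dist` positions earlier; the function names and the synthetic "samples" are made up for the illustration and are not taken from the LZMA SDK.

```python
import lzma


def delta_encode(data: bytes, dist: int = 1) -> bytes:
    """Replace each byte with its difference (mod 256) from the byte 'dist' earlier."""
    out = bytearray(len(data))
    for i, value in enumerate(data):
        previous = data[i - dist] if i >= dist else 0
        out[i] = (value - previous) & 0xFF
    return bytes(out)


def delta_decode(data: bytes, dist: int = 1) -> bytes:
    """Invert delta_encode by accumulating the differences."""
    out = bytearray(len(data))
    for i, diff in enumerate(data):
        previous = out[i - dist] if i >= dist else 0
        out[i] = (diff + previous) & 0xFF
    return bytes(out)


# A smoothly changing synthetic signal, standing in for multimedia samples.
samples = bytes((i * i // 997) % 256 for i in range(20000))
filtered = delta_encode(samples)

print("raw      ->", len(lzma.compress(samples)), "bytes")
print("filtered ->", len(lzma.compress(filtered)), "bytes")
```

The filter does not compress anything by itself; it only reshapes slowly varying data into small, repetitive differences that a later compression stage can model more easily.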
Similar executable pre-processing technology is included in other software; the RAR compressor features displacement compression for 32-bit x86 executables and IA-64 executables, and the UPX runtime executable file compressor includes support for working with 16-bit values within DOS binary files.
Encryption
The 7z format supports encryption with the AES algorithm with a 256-bit key. The key is generated from a user-supplied passphrase using an algorithm based on the SHA-256 hash function. SHA-256 is executed 2^19 (524288) times, which causes a significant delay on slow PCs before compression or extraction starts. This technique is called key stretching and is used to make a brute-force search for the passphrase more difficult. Current GPU-based and custom hardware attacks limit the effectiveness of this particular method of key stretching (Colin Percival, "Stronger Key Derivation via Sequential Memory-Hard Functions", presented at BSDCan'09, May 2009), so it is still important to choose a strong password.
The 7z format provides the option to encrypt the filenames of a 7z archive.
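The key-stretching step can be sketched as follows. This is an illustration under stated assumptions rather than a verified re-implementation of 7-Zip: it assumes that a single SHA-256 state is fed the salt, the UTF-16LE encoded passphrase and an 8-byte little-endian round counter on every round, and the function name `derive_key_sketch` is hypothetical. The point of the example is the cost structure: every passphrase guess requires 2^cycles_power hash updates.

```python
import hashlib


def derive_key_sketch(passphrase: str, salt: bytes = b"", cycles_power: int = 19) -> bytes:
    """Stretch a passphrase into a 256-bit key with 2**cycles_power hashing rounds.

    Assumed input layout per round: salt + UTF-16LE passphrase + 8-byte counter.
    """
    state = hashlib.sha256()
    encoded = passphrase.encode("utf-16-le")
    for counter in range(1 << cycles_power):
        state.update(salt)
        state.update(encoded)
        state.update(counter.to_bytes(8, "little"))
    return state.digest()  # 32 bytes, suitable as an AES-256 key


# A reduced round count keeps the demonstration fast; the text above gives 2^19.
key = derive_key_sketch("correct horse battery staple", salt=b"demo", cycles_power=12)
print(len(key) * 8, "bit key:", key.hex())
```

Raising `cycles_power` by one doubles the work for both the legitimate user and an attacker, which is why the scheme slows down brute-force searches but cannot compensate for a weak passphrase.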
Limitations
The 7z format does not store filesystem permissions (such as UNIX owner/group permissions or NTFS ACLs), and hence can be inappropriate for backup/archival purposes. A workaround on UNIX-like systems is to convert the data to a tar bitstream before compressing with 7z. However, GNU tar (common in many UNIX environments) can also compress with the LZMA2 algorithm ("xz") natively, without the use of 7z, using the "-J" switch. The resulting file extension is ".tar.xz" or ".txz" and not ".tar.7z". This method of compression has been adopted by many distributions for packaging, such as Arch, Debian (deb), Fedora (rpm) and Slackware. (The older "lzma" format is less efficient.)
On the other hand, tar does not save the filesystem encoding, which means that file names stored in a tar archive can become unreadable if the archive is extracted on a different computer.
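The tar-based workaround can be tried with Python's standard `tarfile` module, which supports xz (LZMA2) compression through the "w:xz" mode. The sketch below builds a small .tar.xz archive in memory and reads the stored permission bits and owner names back; the file name, owner and group are invented for the example.

```python
import io
import tarfile

payload = b"#!/bin/sh\necho hello\n"

# Describe a file together with the metadata that a 7z archive would not keep.
info = tarfile.TarInfo(name="bin/hello.sh")
info.size = len(payload)
info.mode = 0o750                            # rwxr-x---
info.uname, info.gname = "alice", "staff"    # hypothetical owner and group

# Write a .tar.xz archive into memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:xz") as archive:
    archive.addfile(info, io.BytesIO(payload))

# Read it back and inspect the stored metadata.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:xz") as archive:
    member = archive.getmember("bin/hello.sh")
    restored = (member.mode, member.uname, member.gname)
    data = archive.extractfile(member).read()

print(oct(restored[0]), restored[1], restored[2])
```

The same tar stream could instead be handed to a 7z compressor, which is the ".tar.7z" variant of the workaround described above.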
The 7z format does not allow extraction of some "broken files": for example, if one has only the first segment of a series of 7z files, 7z cannot give the start of the files within the archive, and must wait until all segments are downloaded. The 7z format also lacks recovery records, making it vulnerable to data degradation unless used in conjunction with external solutions, like parchives, or within filesystems with robust error-correction. By way of comparison, zip files also lack a recovery feature while the rar format has one.
See also
* 7-Zip
* Comparison of archive formats
* List of archive formats
* Open file format