In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amount of storage media required to meet storage capacity needs. It can also be applied to network data transfers to reduce the number of bytes that must be sent.

The deduplication process requires comparison of data 'chunks' (also known as 'byte patterns'), which are unique, contiguous blocks of data. These chunks are identified and stored during a process of analysis and compared to other chunks within existing data. Whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency depends on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.

A related technique is single-instance (data) storage, which replaces multiple copies of content at the whole-file level with a single shared copy. While it is possible to combine single-instance storage with other forms of data compression and deduplication, it is distinct from newer approaches to data deduplication, which can operate at the segment or sub-block level.

Deduplication is also different from data compression algorithms, such as LZ77 and LZ78. Whereas compression algorithms identify redundant data inside individual files and encode this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections, such as entire files or large sections of files, that are identical, and replace them with a shared copy.
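A minimal sketch of this chunk-and-reference process, assuming fixed-size chunks and a SHA-256 content index; the `ChunkStore` class and its methods are hypothetical illustrations, not any particular product's API:

```python
import hashlib

class ChunkStore:
    """Toy chunk-level deduplicating store: a sketch, not a real implementation."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}  # hash -> unique chunk payload

    def write(self, data: bytes) -> list[str]:
        """Split data into fixed-size chunks and store each unique chunk once.
        Returns the list of chunk references (hashes) describing the data."""
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            ref = hashlib.sha256(chunk).hexdigest()
            # A redundant chunk is never stored twice; it is replaced by
            # a small reference to the already-stored copy.
            self.chunks.setdefault(ref, chunk)
            refs.append(ref)
        return refs

    def read(self, refs: list[str]) -> bytes:
        """Reassemble the original data by following the references."""
        return b"".join(self.chunks[ref] for ref in refs)
```

Writing the same byte pattern repeatedly costs one stored chunk plus a list of small references, which is where the storage savings come from.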


Functioning principle

For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1. Deduplication is often paired with data compression for additional storage savings: deduplication is first used to eliminate large chunks of repetitive data, and compression is then used to efficiently encode each of the stored chunks.

In computer code, deduplication is done by, for example, storing information in variables so that it does not have to be written out individually but can be changed all at once at a central, referenced location. Examples are CSS classes and named references in MediaWiki.
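The attachment example can be reproduced in a few lines. A toy illustration, assuming whole-file (single-instance) matching by SHA-256 and a simulated 1 MB attachment:

```python
import hashlib

# 100 backups of the same (simulated) 1 MB attachment.
attachment = b"\xab" * (1 << 20)
backups = [attachment] * 100

stored: dict[str, bytes] = {}  # hash -> single stored instance
for copy in backups:
    stored.setdefault(hashlib.sha256(copy).hexdigest(), copy)

logical = sum(len(c) for c in backups)           # 100 MB presented to users
physical = sum(len(c) for c in stored.values())  # 1 MB actually stored
print(f"deduplication ratio: {logical // physical} to 1")  # -> 100 to 1
```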


Benefits

Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk. In the case of data backups, which routinely are performed to protect against data loss, most data in a given backup remains unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard-linking) files that haven't changed, or by storing differences between files. Neither approach captures all redundancies, however. Hard-linking does not help with large files that have only changed in small ways, such as an email database; differences only find redundancies in adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents).

In-line network data deduplication is used to reduce the number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information.

Virtual servers and virtual desktops benefit from deduplication because it allows nominally separate system files for each virtual machine to be coalesced into a single storage space. At the same time, if a given virtual machine customizes a file, deduplication will not change the files on the other virtual machines; this is something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved.


Classification


Post-process versus in-line deduplication

Deduplication may occur "in-line", as data is flowing, or "post-process", after it has been written.

With post-process deduplication, new data is first stored on the storage device, and a process at a later time analyzes the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that storage performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity.

Alternatively, deduplication hash calculations can be done in-line: synchronized as data enters the target device. If the storage system identifies a block which it has already stored, only a reference to the existing block is stored, rather than the whole new block. The advantage of in-line deduplication over post-process deduplication is that it requires less storage and network traffic, since duplicate data is never stored or transferred. On the negative side, hash calculations may be computationally expensive, thereby reducing the storage throughput. However, certain vendors with in-line deduplication have demonstrated equipment which performs in-line deduplication at high rates. Post-process and in-line deduplication methods are often heavily debated.
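The two approaches can be contrasted in a short sketch. This assumes the same hash-indexed block layout as the earlier example; both function names are hypothetical:

```python
import hashlib

def inline_write(block: bytes, index: dict[str, bytes]) -> str:
    """In-line: hash the block before it is stored; a duplicate is never written."""
    ref = hashlib.sha256(block).hexdigest()
    index.setdefault(ref, block)  # store the payload only if unseen
    return ref                    # the caller records just the reference

def post_process(blocks: list[bytes]) -> tuple[list[str], dict[str, bytes]]:
    """Post-process: all blocks were already written at full size; a later
    pass applies the same matching step, after which duplicate payloads
    can be reclaimed."""
    index: dict[str, bytes] = {}
    refs = [inline_write(b, index) for b in blocks]
    return refs, index
```

The matching logic is identical; the difference is only when it runs, which is why the trade-off is between write-path latency (in-line) and temporary extra capacity (post-process).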


Data formats

The SNIA Dictionary identifies two methods:
* Content-agnostic data deduplication - a data deduplication method that does not require awareness of specific application data formats.
* Content-aware data deduplication - a data deduplication method that leverages knowledge of specific application data formats.
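As an illustration of the distinction, a content-aware chunker can cut at boundaries the format defines, while an agnostic one sees only bytes. The newline-delimited "record" format below is a hypothetical stand-in for real application formats:

```python
def content_agnostic_chunks(data: bytes, size: int = 4096) -> list[bytes]:
    """No format knowledge: cut every `size` bytes, wherever that falls."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_aware_chunks(data: bytes) -> list[bytes]:
    """Format knowledge: cut at record boundaries, so an unchanged record
    always yields an identical, deduplicable chunk even if records around
    it are inserted or deleted."""
    return [rec + b"\n" for rec in data.split(b"\n") if rec]
```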


Source versus target deduplication

Another way to classify data deduplication methods is according to where they occur. Deduplication occurring close to where data is created is referred to as "source deduplication". When it occurs near where the data is stored, it is called "target deduplication".

Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system. The file system will periodically scan new files, creating hashes, and compare them to the hashes of existing files. When files with the same hashes are found, the file copy is removed and the new file points to the old file. Unlike hard links, however, duplicated files are considered to be separate entities, and if one of the duplicated files is later modified, then using a system called copy-on-write a copy of that changed file or block is created. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur, resulting in the backups being bigger than the source data.

Source deduplication can be declared explicitly for copying operations, as no calculation is needed to know that the copied data is in need of deduplication. This leads to a new form of "linking" on file systems called the reflink (Linux) or clonefile (macOS), where one or more inodes (file information entries) are made to share some or all of their data. It is named analogously to hard links, which work at the inode level, and symbolic links, which work at the filename level. The individual entries have a copy-on-write behavior that is non-aliasing, i.e. changing one copy afterwards will not affect other copies. Microsoft's ReFS also supports this operation.

Target deduplication is the process of removing duplicates when the data was not generated at that location. An example of this would be a server connected to a SAN/NAS; the SAN/NAS would be a target for the server (target deduplication). The server is not aware of any deduplication; the server is also the point of data generation. A second example would be backup. Generally, this will be a backup store such as a data repository or a virtual tape library.
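A minimal sketch of the periodic scan described above, assuming whole-file hashing; a real file system works at the block level and replaces duplicates with copy-on-write references rather than merely reporting them, and the function name here is hypothetical:

```python
import hashlib
from pathlib import Path

def find_duplicate_files(root: str) -> dict[str, list[Path]]:
    """Group regular files under `root` by the SHA-256 of their contents.
    Any group with more than one path is a candidate for deduplication."""
    by_hash: dict[str, list[Path]] = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            # read_bytes() loads the whole file; fine for a sketch only.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash.setdefault(digest, []).append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```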


Deduplication methods

One of the most common forms of data deduplication implementations works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identification, calculated by the software, typically using cryptographic hash functions. In many implementations, the assumption is made that if the identification is identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle; other implementations do not assume that two blocks of data with the same identifier are identical, but actually verify that data with the same identification is identical. If the software either assumes that a given identification already exists in the deduplication namespace or actually verifies the identity of the two blocks of data, depending on the implementation, then it will replace that duplicate chunk with a link.

Once the data has been deduplicated, upon read back of the file, wherever a link is found, the system simply replaces that link with the referenced data chunk. The deduplication process is intended to be transparent to end users and applications.

Commercial deduplication implementations differ by their chunking methods and architectures:
* Chunking: In some systems, chunks are defined by physical layer constraints (e.g. the 4 KB block size in WAFL). In some systems, only complete files are compared, which is called single-instance storage, or SIS. The most intelligent (but CPU-intensive) method of chunking is generally considered to be sliding-block, also called content-defined chunking. In sliding-block, a window is passed along the file stream to seek out more naturally occurring internal file boundaries (see the sketch after this list).
* Client backup deduplication: This is the process where the deduplication hash calculations are initially created on the source (client) machines. Files that have identical hashes to files already in the target device are not sent; the target device just creates appropriate internal links to reference the duplicated data. The benefit of this is that it avoids data being unnecessarily sent across the network, thereby reducing traffic load.
* Primary storage and secondary storage: By definition, primary storage systems are designed for optimal performance, rather than lowest possible cost. The design criterion for these systems is to increase performance, at the expense of other considerations. Moreover, primary storage systems are much less tolerant of any operation that can negatively impact performance. Also by definition, secondary storage systems contain primarily duplicate, or secondary, copies of data. These copies of data are typically not used for actual production operations and as a result are more tolerant of some performance degradation, in exchange for increased efficiency. To date, data deduplication has predominantly been used with secondary storage systems. The reasons for this are twofold: first, data deduplication requires overhead to discover and remove the duplicate data, and in primary storage systems this overhead may impact performance. The second reason why deduplication is applied to secondary data is that secondary data tends to have more duplicate data; backup applications in particular commonly generate significant portions of duplicate data over time. Data deduplication has been deployed successfully with primary storage in some cases where the system design does not require significant overhead or does not impact performance.
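A minimal sketch of sliding-block (content-defined) chunking, using a Rabin-Karp style rolling hash. The window size, hash base, boundary mask, and chunk bounds are illustrative choices, not a standard:

```python
WINDOW = 48               # bytes covered by the rolling hash window
PRIME = 31                # base of the polynomial hash (illustrative)
MASK32 = 0xFFFFFFFF       # keep the hash in 32 bits
BOUNDARY = (1 << 12) - 1  # cut when (h & BOUNDARY) == BOUNDARY: ~4 KiB average
MIN_CHUNK, MAX_CHUNK = 1024, 65536

def content_defined_chunks(data: bytes):
    """Slide a window along the stream and cut wherever the hash hits the
    boundary pattern, so an insertion early in a file only reshapes nearby
    chunks instead of shifting every later chunk (as fixed-size chunking would)."""
    pow_w = pow(PRIME, WINDOW - 1, 1 << 32)  # factor for removing the oldest byte
    start, h = 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:  # drop the byte leaving the window, then add the new one
            h = (h - data[i - WINDOW] * pow_w) & MASK32
        h = (h * PRIME + byte) & MASK32
        size = i - start + 1
        if size >= MIN_CHUNK and ((h & BOUNDARY) == BOUNDARY or size >= MAX_CHUNK):
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]  # final partial chunk
```

Each resulting chunk would then be hashed and looked up in the deduplication index exactly as in the fixed-size sketch earlier.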


Single-instance storage

Single-instance storage (SIS) is a system's ability to take multiple copies of content objects and replace them with a single shared copy. It is a means to eliminate data duplication and to increase efficiency. SIS is frequently implemented in file systems, email server software, data backup, and other storage-related computer software. Single-instance storage is a simple variant of data deduplication. While data deduplication may work at a segment or sub-block level, single-instance storage works at the object level, eliminating redundant copies of objects such as entire files or email messages.

Single-instance storage can be used alongside (or layered upon) other data deduplication or data compression methods to improve performance, in exchange for an increase in complexity and (in some cases) a minor increase in storage space requirements.


Drawbacks and concerns

One method for deduplicating data relies on the use of cryptographic hash functions to identify duplicate segments of data. If two different pieces of information generate the same hash value, this is known as a collision. The probability of a collision depends mainly on the hash length (see birthday attack). Thus, the concern arises that data corruption can occur if a hash collision occurs and additional means of verification are not used to check whether there is a difference in the data or not. Both in-line and post-process architectures may offer bit-for-bit validation of original data for guaranteed data integrity. The hash functions used include standards such as SHA-1, SHA-256, and others.

The computational resource intensity of the process can be a drawback of data deduplication. To improve performance, some systems utilize both weak and strong hashes. Weak hashes are much faster to calculate, but there is a greater risk of a hash collision. Systems that utilize weak hashes will subsequently calculate a strong hash and use it as the determining factor for whether the data is actually the same or not. Note that the system overhead associated with calculating and looking up hash values is primarily a function of the deduplication workflow. The reconstitution of files does not require this processing, and any incremental performance penalty associated with the re-assembly of data chunks is unlikely to impact application performance.

Another concern is the interaction of compression and encryption. The goal of encryption is to eliminate any discernible patterns in the data. Thus encrypted data cannot be deduplicated, even though the underlying data may be redundant.

Although not a shortcoming of data deduplication, there have been data breaches when insufficient security and access validation procedures are used with large repositories of deduplicated data. In some systems, as is typical with cloud storage, an attacker can retrieve data owned by others by knowing or guessing the hash value of the desired data.
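A sketch of the weak-plus-strong scheme described above, using zlib.adler32 as the fast weak hash and SHA-256 as the strong confirmation; the final byte comparison stands in for the optional bit-for-bit validation. In a real system the weak hashes of stored blocks live in an index rather than being recomputed, and the function name is hypothetical:

```python
import hashlib
import zlib

def is_duplicate(candidate: bytes, stored: bytes, verify_bytes: bool = True) -> bool:
    """Cheap filter first; run the expensive checks only on a weak-hash match."""
    if zlib.adler32(candidate) != zlib.adler32(stored):
        return False  # weak hashes differ: the blocks are definitely not equal
    if hashlib.sha256(candidate).digest() != hashlib.sha256(stored).digest():
        return False  # a weak-hash collision, caught by the strong hash
    # Optional bit-for-bit validation for guaranteed data integrity.
    return candidate == stored if verify_bytes else True
```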


Implementations

Deduplication is implemented in some file systems, such as ZFS and Write Anywhere File Layout, and in various disk array models. It is a service available on both NTFS and ReFS on Windows servers.


See also

* Capacity optimization
* Cloud storage
* Content-addressable storage
* Convergent encryption
* Delta encoding
* Identity resolution
* Information integration
* Linked data
* Pointer (computer programming)
* Record linkage
* Single-instance storage


External links

* Biggar, Heidi (2007-12-11). WebCast: The Data Deduplication Effect
* Using Latent Semantic Indexing for Data Deduplication
* What Is the Difference Between Data Deduplication, File Deduplication, and Data Compression? - Database from eWeek
* Understanding Data Deduplication Ratios (SNIA DDSR SIG)
* DeDuplication Demo