Data synchronization is the process of establishing consistency between source and target data stores, and the continuous harmonization of the data over time. It is fundamental to a wide variety of applications, including file synchronization and mobile device synchronization. Data synchronization can also be useful in encryption for synchronizing public key servers.
File-based solutions
There are tools available for file synchronization, version control (CVS, Subversion, etc.), distributed filesystems (Coda, etc.), and mirroring (rsync, etc.), all of which attempt to keep sets of files synchronized. However, only version control and file synchronization tools can deal with modifications to more than one copy of the files.
* File synchronization is commonly used for home backups on external hard drives or for updating files for transport on USB flash drives. The automatic process skips files that are already identical, so it can save considerable time relative to a manual copy and is less error-prone.
* Version control tools are intended to deal with situations where more than one user attempts to simultaneously modify the same file, while file synchronizers are optimized for situations where only one copy of the file will be edited at a time. For this reason, although version control tools can be used for file synchronization, dedicated programs require less overhead.
* Distributed filesystems may also be seen as ensuring multiple versions of a file are synchronized. This normally requires that the devices storing the files are always connected, but some distributed file systems like Coda allow disconnected operation followed by reconciliation. The merging facilities of a distributed file system are typically more limited than those of a version control system because most file systems do not keep a version graph.
* Mirror (computing): A mirror is an exact copy of a data set. On the Internet, a mirror site is an exact copy of another Internet site. Mirror sites are most commonly used to provide multiple sources of the same information, and are of particular value as a way of providing reliable access to large downloads.
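The time saving described for file synchronization comes from skipping files that are already identical. A minimal sketch of one-way mirroring in Python follows; the function name `mirror` and the choice to compare full file contents are assumptions made for illustration, and real tools such as rsync use more elaborate comparison and transfer strategies.

```python
import filecmp
import shutil
from pathlib import Path


def mirror(source: Path, target: Path) -> list[str]:
    """Copy new or changed files from source to target.

    Returns the relative paths that were actually copied. Files that exist
    only in the target are left alone (this sketch never deletes anything).
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = target / rel
        # shallow=False compares contents instead of trusting size and
        # modification time alone.
        if dst.is_file() and filecmp.cmp(src, dst, shallow=False):
            continue  # already identical, nothing to transfer
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(rel.as_posix())
    return copied
```

Running `mirror` a second time on unchanged directories returns an empty list, which is the behaviour that distinguishes synchronization from a plain manual copy.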
Theoretical models
Several theoretical models of data synchronization exist in the research literature, and the problem is also related to the problem of Slepian–Wolf coding in information theory. The models are classified based on how they consider the data to be synchronized.
Unordered data
The problem of synchronizing unordered data (also known as the set reconciliation problem) is modeled as an attempt to compute the symmetric difference between two remote sets of b-bit numbers. Some solutions to this problem are typified by:
;Wholesale transfer: In this case all data is transferred to one host for a local comparison.
;Timestamp synchronization: In this case all changes to the data are marked with timestamps. Synchronization proceeds by transferring all data with a timestamp later than the previous synchronization.
;Mathematical synchronization: In this case data are treated as mathematical objects and synchronization corresponds to a mathematical process.
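The first two approaches can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the `(timestamp, value)` record layout, and the rule that the later timestamp wins are assumptions, not part of any particular protocol.

```python
def wholesale_difference(local: set[int], remote: set[int]) -> set[int]:
    """Wholesale transfer: once both sets are on one host, the symmetric
    difference is a purely local computation."""
    return local ^ remote


def changes_since(records: dict, last_sync: int) -> dict:
    """Timestamp synchronization: select only records whose timestamp is
    later than the previous synchronization.

    Each record is assumed to be a (timestamp, value) tuple.
    """
    return {key: rec for key, rec in records.items() if rec[0] > last_sync}


def apply_changes(replica: dict, changes: dict) -> None:
    """Merge incoming records, keeping whichever version is newer."""
    for key, (stamp, value) in changes.items():
        if key not in replica or replica[key][0] < stamp:
            replica[key] = (stamp, value)
```

Wholesale transfer costs bandwidth proportional to the whole data set, whereas the timestamp approach transfers only what changed, at the price of relying on timestamps that are comparable across hosts.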
Ordered data
In this case, two remote strings need to be reconciled. Typically, it is assumed that these strings differ by up to a fixed number of edits (i.e. character insertions, deletions, or modifications). Data synchronization is then the process of reducing the edit distance between the two strings, down to the ideal distance of zero. This is applied in all filesystem-based synchronizations (where the data is ordered). Many practical applications of this are discussed or referenced above.
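For illustration, the edit distance that counts single-character insertions, deletions, and modifications (often called the Levenshtein distance) can be computed with a standard dynamic-programming recurrence. The sketch below assumes both strings are available locally, which a real synchronization protocol tries to avoid; the function name `edit_distance` is chosen here for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    modifications needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # modify ca into cb, or keep it
            ))
        previous = current
    return previous[-1]
```

Two strings are fully synchronized exactly when this function returns zero.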
It is sometimes possible to transform the problem to one of unordered data through a process known as shingling (splitting the strings into ''shingles'').
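A minimal sketch of the idea, assuming overlapping character shingles of a fixed length `k` (the function name and the default value of `k` are choices made for this example): each string becomes a set of substrings, and the two sets can then be compared with set reconciliation techniques.

```python
def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of all overlapping substrings of length k."""
    if len(text) < k:
        return {text}  # too short to split; treat the whole string as one shingle
    return {text[i:i + k] for i in range(len(text) - k + 1)}


# The ordered strings differ by one character; their shingle sets differ
# only in the shingles that overlap that character.
left = shingles("synchronize")
right = shingles("synchronise")
difference = left ^ right
```

A plain set discards the position and multiplicity of each shingle, so reconstructing the original string from its shingles needs additional information that this sketch does not track.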
Error handling
In fault-tolerant systems, distributed databases must be able to cope with the loss or corruption of (part of) their data. The first step is usually
replication, which involves making multiple copies of the data and keeping them all up to date as changes are made. However, it is then necessary to decide which copy to rely on when loss or corruption of an instance occurs.
The simplest approach is to have a single master instance that is the sole source of truth. Changes to it are replicated to other instances, and one of those instances becomes the new master when the old master fails.
Paxos and Raft are more complex protocols that exist to solve problems with transient effects during failover, such as two instances thinking they are the master at the same time.
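The single-master approach can be illustrated with a toy, in-process model. This sketch is not Paxos or Raft: the class names are hypothetical, replication is synchronous, and there is no network, so it cannot exhibit (or prevent) the situation where two instances both believe they are the master.

```python
class Instance:
    """One copy of the data, which may be marked as failed."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.alive = True


class Cluster:
    """A single master plus replicas, with naive failover."""

    def __init__(self, names: list[str]):
        self.instances = [Instance(n) for n in names]
        self.master = self.instances[0]

    def fail_over(self) -> None:
        # Promote the first live instance; a real system needs agreement
        # among nodes here, which is what consensus protocols provide.
        for inst in self.instances:
            if inst.alive:
                self.master = inst
                return
        raise RuntimeError("no live instance left to promote")

    def write(self, key, value) -> None:
        if not self.master.alive:
            self.fail_over()
        # Every change is replicated to all live instances (the master
        # included), so any of them can take over later.
        for inst in self.instances:
            if inst.alive:
                inst.data[key] = value
```

Because every live instance receives every write, the promoted instance already holds the full data set when the old master fails.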
Secret sharing is useful if failures of whole nodes are very common. This moves synchronization from an explicit recovery process to being part of each read, where a read of some data requires retrieving encoded data from several different nodes. If corrupt or out-of-date data may be present on some nodes, this approach may also benefit from the use of an error correction code.
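The text does not name a particular scheme. As one well-known example, the sketch below assumes Shamir's threshold scheme over a prime field: the data is encoded as points on a random polynomial, one point is stored per node, and any `threshold` points are enough to reconstruct it. The prime and the function names are choices made for this illustration.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Encode secret as `shares` points; any `threshold` of them recover it."""
    # Random polynomial of degree threshold - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    points = []
    for x in range(1, shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        points.append((x, y))
    return points


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8 or later).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With `split_secret(value, 3, 5)`, a read succeeds as long as any three of the five nodes respond, so the loss of two whole nodes needs no separate recovery step.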
Distributed hash tables (DHTs) and blockchains try to solve the problem of synchronization between many nodes (hundreds to billions).
See also
* SyncML, a standard mainly for calendar, contact and email synchronization
* Synchronization (computer science)