Data consistency

Data consistency refers to whether the same data kept at different places do or do not match.


Point-in-time consistency

Point-in-time consistency is an important property of backup files and a critical objective of software that creates backups. It is also relevant to the design of disk memory systems, specifically to what happens when they are unexpectedly shut down. As a relevant backup example, consider a website with a database, such as the online encyclopedia Wikipedia, which needs to be operational around the clock but must also be backed up regularly to protect against disaster. Portions of Wikipedia are updated every minute of every day. Meanwhile, Wikipedia's database is stored on servers in the form of one or several very large files, which require minutes or hours to back up. These large files, as with any database, contain numerous data structures that reference each other by location. For example, some structures are indexes that permit the database subsystem to quickly find search results. If the data structures cease to reference each other properly, the database can be said to be corrupted.


Counterexample

The importance of point-in-time consistency can be illustrated by what would happen if a backup were made without it. Assume Wikipedia's database is a huge file that has an important index located 20% of the way through and stores article data at the 75% mark. Consider a scenario in which an editor creates a new article while a backup is being performed. The backup is made as a simple "file copy" that copies from the beginning to the end of the large file(s) without considering data consistency, and at the time of the article edit it is 50% complete. The new article is added to the article space (at the 75% mark) and a corresponding index entry is added (at the 20% mark). Because the backup is already halfway done and the index has already been copied, the backup will be written with the article data present but with the index reference missing. As a result of this inconsistency, the file is considered corrupted.

In real life, a database such as Wikipedia's may be edited thousands of times per hour, and references are virtually always spread throughout the file and can number in the millions, billions, or more. A sequential "copy" backup would contain so many small corruptions that it would be unusable without a lengthy repair process, which could provide no guarantee as to the completeness of what was recovered.

A backup process that properly accounts for data consistency ensures that the backup is a snapshot of how the entire database looked at a single moment. In the Wikipedia example, it would ensure that the backup was written without the added article at the 75% mark, so that the article data would be consistent with the index data previously written.

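The backup scenario in this section can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real database or backup tool works: the database is a list of 100 blocks, the index entry lives in block 20, the article in block 75, and the function names (sequential_copy, snapshot_copy, add_article, is_consistent) are hypothetical.

```python
def add_article(db):
    """Simulated edit: write article data at the 75% mark and its index entry at the 20% mark."""
    db[75] = "article:New"
    db[20] = "index:New->75"


def sequential_copy(db, edit_at, edit):
    """Copy the live database block by block; the edit lands once `edit_at` blocks are copied."""
    backup = []
    for i in range(len(db)):
        if i == edit_at:
            edit(db)  # the live database changes in the middle of the copy
        backup.append(db[i])
    return backup


def snapshot_copy(db, edit_at, edit):
    """Freeze a point-in-time image first, then copy from the image instead of the live data."""
    frozen = list(db)
    backup = []
    for i in range(len(frozen)):
        if i == edit_at:
            edit(db)  # the live database still changes, but the image does not
        backup.append(frozen[i])
    return backup


def is_consistent(backup):
    """The article and its index entry must be both present or both absent."""
    return (backup[75] is None) == (backup[20] is None)


naive = sequential_copy([None] * 100, 50, add_article)
print(is_consistent(naive))   # False: article copied, index entry missed

snap = snapshot_copy([None] * 100, 50, add_article)
print(is_consistent(snap))    # True: neither article nor index entry in the backup
```

Copying the whole list up front stands in for whatever snapshot mechanism a real system provides; the point is only that the backup reads from a frozen image rather than from data that is still changing.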

Disk caching systems

Point-in-time consistency is also relevant to computer disk subsystems. Specifically, operating systems and file systems are designed with the expectation that the computer system they are running on could lose power, crash, fail, or otherwise cease operating at any time. When properly designed, they ensure that data will not be unrecoverably corrupted if the power is lost. Operating systems and file systems do this by ensuring that data is written to a hard disk in a certain order, and they rely on that order to detect and recover from unexpected shutdowns.

On the other hand, rigorously writing data to disk in the order that maximizes data integrity also reduces performance. A process of write caching is used to consolidate and re-sequence write operations so that they can be done faster by minimizing the time spent moving disk heads. Data consistency concerns arise when write caching changes the sequence in which writes are carried out, because an unexpected shutdown could then violate the operating system's expectation that all writes are committed sequentially.

For example, in order to save a typical document or picture file, an operating system might write the following records to a disk in the following order:

1. Journal entry saying file XYZ is about to be saved into sector 123.
2. The actual contents of the file XYZ are written into sector 123.
3. Sector 123 is now flagged as occupied in the record of free/used space.
4. Journal entry noting the file is completely saved, its name is XYZ, and it is located in sector 123.

The operating system relies on the assumption that if item #1 is present (saying the file is about to be saved) but item #4 is missing (confirming success), then the save operation was unsuccessful, and so it should undo any incomplete steps already taken to save the file (e.g. marking sector 123 free since it was never properly filled, and removing any record of XYZ from the file directory). It relies on these items being committed to disk in sequential order.

Suppose a caching algorithm determines it would be fastest to write these items to disk in the order 4-3-1-2 and starts doing so, but the power is shut down after item 4 is written, before items 3, 1 and 2, so those writes never occur. When the computer is turned back on, the file system would show that it contains a file named XYZ located in sector 123, but this sector really does not contain the file. (Instead, the sector will contain garbage, or zeroes, or a random portion of some old file, and that is what will show if the file is opened.)
Further, the file system's free space map will not contain any entry showing that sector 123 is occupied, so later it will likely assign that sector to the next file to be saved, believing it is available. The file system will then have two files both unexpectedly claiming the same sector (known as a cross-linked file). As a result, a write to one of the files will overwrite part of the other file, invisibly damaging it.

A disk caching subsystem that ensures point-in-time consistency guarantees that in the event of an unexpected shutdown, the four elements would be written in one of only five possible ways: completely (1-2-3-4), partially (1, 1-2, 1-2-3), or not at all.

High-end hardware disk controllers of the type found in servers include a small battery back-up unit on their cache memory so that they can offer the performance gains of write caching while mitigating the risk of unintended shutdowns. The battery back-up unit keeps the memory powered even during a shutdown, so that when the computer is powered back up, it can quickly complete any writes it has previously committed. With such a controller, the operating system may request four writes (1-2-3-4) in that order, but the controller may decide the quickest way to write them is 4-3-1-2. The controller essentially lies to the operating system and reports that the writes have been completed in order (a lie that improves performance at the risk of data corruption if power is lost), and the battery backup hedges against that risk by giving the controller a way to silently fix any damage that could occur as a result. If the power is shut off after element 4 has been written, the battery-backed memory contains the record of commitment for the other three items and ensures that they are written ("flushed") to the disk at the next available opportunity.
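
The four-record save and its recovery rule can be modeled as a small simulation. This is an illustrative sketch only: the dictionary standing in for the disk, the record keys, and the functions crash_after, recover and file_is_intact are all hypothetical, and rolling back by discarding everything is a simplification of what a real journaling file system does.

```python
# The four records from the example, keyed by their item number.
WRITES = {
    1: ("journal_begin", "XYZ->123"),        # file XYZ is about to be saved into sector 123
    2: ("sector_123", "contents of XYZ"),    # the actual file contents
    3: ("used_sectors", {123}),              # sector 123 flagged as occupied
    4: ("journal_commit", "XYZ->123"),       # file completely saved
}


def crash_after(order, n_written):
    """Return the disk state if power is lost after the first `n_written` writes in `order`."""
    disk = {}
    for item in order[:n_written]:
        key, value = WRITES[item]
        disk[key] = value
    return disk


def recover(disk):
    """Undo an incomplete save: item 1 is present but item 4 is missing."""
    if "journal_begin" in disk and "journal_commit" not in disk:
        return {}  # simplification: roll back every partial step
    return disk


def file_is_intact(disk):
    """If the disk claims XYZ was saved, its data and its used-space flag must both exist."""
    if "journal_commit" not in disk:
        return True  # no file is claimed, so nothing can be wrong
    return (disk.get("sector_123") == "contents of XYZ"
            and 123 in disk.get("used_sectors", set()))


# In-order writes recover cleanly no matter where the crash happens.
print(all(file_is_intact(recover(crash_after([1, 2, 3, 4], n))) for n in range(5)))  # True

# Reordered writes: a crash after only item 4 leaves a file that points at garbage.
print(file_is_intact(recover(crash_after([4, 3, 1, 2], 1))))  # False
```

With in-order writes, every crash point yields either a complete file or a state the recovery rule can detect and undo. With the 4-3-1-2 order, the commit record reaches the disk first, so the recovery rule never fires and the damage goes unnoticed.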


Transaction consistency

Consistency in the realm of distributed database systems refers to the property of many ACID databases that ensures the results of a database transaction are visible to all nodes simultaneously. That is, once the transaction has been committed, all parties attempting to access the database can see the results of that transaction simultaneously.

A good example of the importance of transaction consistency is a database that handles the transfer of money. Suppose a money transfer requires two operations: writing a debit in one place and a credit in another. If the system crashes or shuts down when one operation has completed but the other has not, and there is nothing in place to correct this, the system can be said to lack transaction consistency. With a money transfer, it is desirable that either the entire transaction completes or none of it completes. Both of these outcomes keep the balance in check. Transaction consistency ensures just that: the system is programmed to detect incomplete transactions when powered on and to undo (or "roll back") the portion of any incomplete transactions that are found.
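
The money-transfer example can be tried out with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: it uses a single in-memory SQLite database rather than a distributed system, it simulates the crash with an exception rather than a real power loss, and the table, account names, and helper functions (open_bank, transfer, balances) are hypothetical.

```python
import sqlite3


def open_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount, fail_midway=False):
    """Debit `src` and credit `dst` as one transaction; return True if it committed."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            if fail_midway:
                raise RuntimeError("simulated failure between debit and credit")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except RuntimeError:
        return False
    return True


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


bank = open_bank()
transfer(bank, "alice", "bob", 30, fail_midway=True)
print(balances(bank))  # {'alice': 100, 'bob': 50}: the lone debit was rolled back
transfer(bank, "alice", "bob", 30)
print(balances(bank))  # {'alice': 70, 'bob': 80}: debit and credit applied together
```

In both runs the total across the two accounts stays at 150, which is the "balance in check" property described above.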


Application consistency

Application consistency is similar to transaction consistency but is applied on a grander scale. Instead of having the scope of a single transaction, data must be consistent within the confines of many different transaction streams from one or more applications. An application may be made up of many different types of data, various types of files, and data feeds from other applications. Application consistency is the state in which all related files and databases are synchronized and represent the true status of the application.