Extended ECC

Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One simple scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes, such as a BCH code, that can correct multiple bits with less overhead.

Chipkill is frequently combined with dynamic bit-steering, so that if a chip fails (or has exceeded a threshold of bit errors), a spare memory chip is used to replace it. The concept is similar to that of RAID, which protects against disk failure, except that it is applied to individual memory chips. The technology was developed by the IBM Corporation in the early and middle 1990s. An important RAS (reliability, availability and serviceability) feature, Chipkill technology is deployed primarily on SSDs, mainframes and midrange servers.

An equivalent system from Sun Microsystems is called ''Extended ECC'', while equivalent systems from HP are called ''Advanced ECC'' and ''Chipspare''. A similar system from Intel, called ''Lockstep memory'', provides double-device data correction (DDDC) functionality. Similar systems from Micron, called ''redundant array of independent NAND'' (RAIN), and from SandForce, called ''RAISE level 2'', protect data stored on SSDs from any single NAND flash chip going bad.

A 2009 paper using data from Google's datacentres provided evidence that, in the observed Google systems, DRAM errors recurred at the same location and that 8% of DIMMs were affected each year. Specifically, "In more than 85% of the cases a correctable error is followed by at least one more correctable error in the same month". DIMMs with chipkill error correction showed a lower fraction of DIMMs reporting uncorrectable errors compared to DIMMs with error-correcting codes that can only correct single-bit errors. A 2010 paper from the University of Rochester also showed that Chipkill memory gave substantially fewer memory errors, using both real-world memory traces and simulations.
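The simple bit-scattering scheme described above can be illustrated with a small simulation. The sketch below is only a toy model, not IBM's implementation: it assumes a Hamming(7,4) code and seven hypothetical one-bit-wide "chips", and every function name in it (`hamming74_encode`, `write_words`, `fail_chip`, and so on) is invented for illustration. Each chip stores exactly one bit position of every codeword, so a completely failed chip corrupts at most one bit per word, which the code can correct.

```python
from typing import List

N_CHIPS = 7  # assumption: one chip per bit of a Hamming(7,4) codeword


def hamming74_encode(nibble: int) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0 unused so list indices match bit positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(bits: List[int]) -> int:
    """Correct up to one flipped bit and return the 4 data bits as an int."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if syndrome:
        code[syndrome] ^= 1  # the syndrome names the position of the bad bit
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d))


def write_words(nibbles: List[int]) -> List[List[int]]:
    """Scatter each codeword so that chip c holds bit c of every word."""
    chips: List[List[int]] = [[] for _ in range(N_CHIPS)]
    for nibble in nibbles:
        for chip, bit in zip(chips, hamming74_encode(nibble)):
            chip.append(bit)
    return chips


def fail_chip(chips: List[List[int]], index: int, stuck_value: int = 0) -> None:
    """Model a dead chip: every bit it stores reads back as stuck_value."""
    chips[index] = [stuck_value] * len(chips[index])


def read_words(chips: List[List[int]]) -> List[int]:
    """Gather one bit per chip for each address and decode the word."""
    n_words = len(chips[0])
    return [hamming74_decode([chip[a] for chip in chips]) for a in range(n_words)]


data = [0x3, 0xA, 0xF, 0x0]
chips = write_words(data)
fail_chip(chips, 2, stuck_value=1)  # chip 2 fails completely
recovered = read_words(chips)
assert recovered == data  # every word is still read back correctly
```

In this toy layout a second failed chip would put two bad bits in the same word, which a single-error-correcting Hamming code cannot repair. That limitation is one reason the text notes that typical implementations use stronger codes and combine them with spare-chip bit-steering.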


See also

* ECC memory
* Lockstep (computing)
* Memory ProteXion
* Redundant array of independent memory
* Single-error correction and double-error detection (SECDED)




External links


* Intel E7500 Chipset MCH Intel x4 Single Device Data Correction (x4 SDDC) Implementation and Validation, Intel Application note AP-726, August 2002
* DRAM study turns assumptions about errors upside down, ''Ars Technica'', October 7, 2009
* Enabling Memory Reliability, Availability, and Serviceability Features on Dell PowerEdge Servers, 2005
* Chipkill correct memory architecture, August 2000, by David Locklear
* The Mathematics of Chipkill ECC, October 2015, by Bob Day