Predictive Failure Analysis (PFA) refers to methods intended to predict imminent failure of systems or components (software or hardware), and potentially enable mechanisms to avoid or counteract failure issues, or recommend maintenance of systems prior to failure.
For example, a computer may analyze trends in corrected errors to predict future failures of hardware or memory components and proactively enable mechanisms to avoid them. Predictive Failure Analysis was originally used as a term for a proprietary IBM technology for monitoring the likelihood of hard disk drives failing, although the term is now used generically for a variety of technologies for judging the imminent failure of CPUs, memory and I/O devices. See also first failure data capture.
Disks
IBM introduced the term PFA and its technology in 1992 with reference to its 0662-S1x drive (a 1052 MB Fast-Wide SCSI-2 disk which operated at 5400 rpm).
The technology relies on measuring several key (mainly mechanical) parameters of the drive unit, for example the flying height of the heads. The drive firmware compares the measured parameters against predefined thresholds and evaluates the health status of the drive. If the drive appears likely to fail soon, the system sends a notification to the disk controller.
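The threshold comparison described above can be illustrated with a minimal sketch. It is not IBM's firmware logic: the parameter names, the numeric limits and the `failure_predicted` function are all hypothetical, chosen only to show the idea of checking measurements against predefined ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Acceptable range for one monitored drive parameter (hypothetical)."""
    low: float
    high: float


# Hypothetical parameters and limits, chosen only for illustration.
THRESHOLDS = {
    "head_flying_height_nm": Threshold(low=40.0, high=80.0),
    "spin_up_time_ms": Threshold(low=0.0, high=9000.0),
}


def failure_predicted(measurements: dict[str, float]) -> bool:
    """Return True if any measured parameter lies outside its allowed range."""
    for name, value in measurements.items():
        limit = THRESHOLDS.get(name)
        if limit is not None and not (limit.low <= value <= limit.high):
            return True
    return False


healthy = {"head_flying_height_nm": 55.0, "spin_up_time_ms": 6200.0}
degraded = {"head_flying_height_nm": 31.0, "spin_up_time_ms": 6400.0}

print(failure_predicted(healthy))
print(failure_predicted(degraded))
```

The function deliberately returns a single boolean: in this sketch, as in the scheme described, the host learns only whether a failure is predicted, not which parameter triggered the prediction.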
The major drawbacks of the technology included:
* the binary result - the only status visible to the host was the presence or absence of a notification
* the unidirectional communication - notifications flowed only from the drive firmware to the host
The technology merged with IntelliSafe to form the Self-Monitoring, Analysis, and Reporting Technology (SMART).
Processor and Memory
High counts of intermittent RAM errors corrected by ECC can be predictive of future DIMM failures, so automatic offlining of memory and CPU caches can be used to avoid future errors. For example, under the Linux operating system the mcelog daemon will automatically remove from use memory pages showing excessive corrections, and will remove from use processor cores showing excessive correctable cache errors.
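The idea of offlining a page after "excessive corrections" can be sketched as a per-page counter over a sliding time window. This is an illustrative assumption, not mcelog's actual algorithm or configuration: the class `PageErrorTracker`, the limit of 10 errors and the 24-hour window are hypothetical, and the sketch only records the decision rather than asking the operating system to offline anything.

```python
from collections import defaultdict, deque

# Hypothetical policy: offline a page after more than 10 corrected errors in 24 hours.
MAX_CORRECTED_ERRORS = 10
WINDOW_SECONDS = 24 * 60 * 60


class PageErrorTracker:
    """Tracks corrected-error timestamps per memory page (illustrative only)."""

    def __init__(self, max_errors=MAX_CORRECTED_ERRORS, window=WINDOW_SECONDS):
        self.max_errors = max_errors
        self.window = window
        self.events = defaultdict(deque)  # page address -> recent error timestamps
        self.offlined = set()             # pages already marked for offlining

    def record_corrected_error(self, page: int, timestamp: float) -> bool:
        """Record one corrected error; return True if the page should now be offlined."""
        if page in self.offlined:
            return False
        history = self.events[page]
        history.append(timestamp)
        # Discard events that have fallen out of the sliding window.
        while history and timestamp - history[0] > self.window:
            history.popleft()
        if len(history) > self.max_errors:
            self.offlined.add(page)
            del self.events[page]
            return True
        return False


# A burst of errors on one page exceeds the (lowered) limit on the fourth event.
tracker = PageErrorTracker(max_errors=3, window=60)
burst = [tracker.record_corrected_error(0x1000, t) for t in (0, 10, 20, 30)]

# The same number of errors spread far apart never exceeds the limit.
sparse_tracker = PageErrorTracker(max_errors=3, window=60)
sparse = [sparse_tracker.record_corrected_error(0x2000, t) for t in (0, 100, 200, 300)]
```

Counting within a window rather than over the lifetime of the system keeps occasional, isolated corrections from eventually retiring a healthy page.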
Optical media
On optical media (CD, DVD and Blu-ray), failures caused by degradation of the media can be predicted, and media of low manufacturing quality can be detected before data loss occurs, by measuring the rate of correctable data errors using software such as QPxTool or Nero DiscSpeed. However, not all vendors and models of optical drives allow error scanning.
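One way to turn repeated error scans into a prediction is to fit a trend to the measured error rates and estimate when it would reach a limit. The sketch below assumes a linear trend, which real media degradation need not follow. The function names, the sample figures and the limit are hypothetical, and the code does not use any interface of QPxTool or Nero DiscSpeed.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def day_limit_reached(scan_days, error_rates, limit):
    """Estimate the day on which the fitted trend reaches `limit`.

    Returns None when the error rate is not rising, since no crossing
    can then be extrapolated from the trend.
    """
    slope, intercept = fit_line(scan_days, error_rates)
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical scans of one disc: day of scan and measured correctable-error rate.
scan_days = [0, 100, 200, 300]
error_rates = [10, 20, 30, 40]

estimate = day_limit_reached(scan_days, error_rates, limit=100)
stable = day_limit_reached(scan_days, [12, 12, 12, 12], limit=100)
```

With these made-up figures the fitted rate rises by about 0.1 per day, so the estimate falls near day 900; a disc whose rate is flat yields no estimate.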
References
* List of supported devices by disc quality scanning software QPxTool
See also
* mcelog - Linux daemon for processing of x86 machine checks for predictive failure analysis