Reliability, Availability, And Maintainability

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers. Computers designed with higher levels of RAS have many features that protect data integrity and help them stay available for long periods of time without failure. This data integrity and uptime is a particular selling point for mainframes and fault-tolerant systems.


Definitions

While RAS originated as a hardware-oriented term, systems thinking has extended the concept of reliability-availability-serviceability to systems in general, including software.
* ''Reliability'' can be defined as the probability that a system will produce correct outputs up to some given time ''t''. Reliability is enhanced by features that help to avoid, detect and repair hardware faults. A reliable system does not silently continue and deliver results that include uncorrected corrupted data. Instead, it detects and, if possible, corrects the corruption, for example by retrying an operation for transient (soft) or intermittent errors. For uncorrectable errors, it either isolates the fault and reports it to higher-level recovery mechanisms (which may fail over to redundant replacement hardware, etc.), or halts the affected program or the entire system and reports the corruption. Reliability can be characterized in terms of mean time between failures (MTBF), with reliability = exp(-t/MTBF) under the assumption of a constant failure rate.
* ''Availability'' means the probability that a system is operational at a given time, i.e. the amount of time a device is actually operating as the percentage of total time it should be operating. High-availability systems may report availability in terms of minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur. A highly available system would disable the malfunctioning portion and continue operating at a reduced capacity. In contrast, a less capable system might crash and become totally nonoperational. Availability is typically given as a percentage of the time a system is expected to be available, e.g., 99.999 percent ("five nines").
* ''Serviceability'' or ''maintainability'' is the simplicity and speed with which a system can be repaired or maintained; if the time to repair a failed system increases, then availability will decrease. Serviceability includes various methods of easily diagnosing the system when problems arise. Early detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center (without human intervention) when the system experiences a system fault. The traditional focus has been on making the correct repairs with as little disruption to normal operations as possible.
Note the distinction between reliability and availability: reliability measures the ability of a system to function correctly, including avoiding data corruption, whereas availability measures how often the system is available for use, even though it may not be functioning correctly. For example, a server may run forever and so have ideal availability, but may be unreliable, with frequent data corruption.
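The relationships in these definitions can be made concrete with a few lines of arithmetic. The following is a minimal sketch, not a standard tool: the function names are hypothetical, and the steady-state formula availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair, is a commonly used model that is assumed here and not stated in the definitions above.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of running without failure for t_hours: exp(-t/MTBF)."""
    return math.exp(-t_hours / mtbf_hours)


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Assumed model: long-run fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(availability: float) -> float:
    """Convert an availability fraction into expected downtime per year."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    mtbf = 50_000.0  # illustrative value only
    print("1-year reliability:", reliability(HOURS_PER_YEAR, mtbf))
    print("availability, 4 h repair:", steady_state_availability(mtbf, 4.0))
    print("availability, 48 h repair:", steady_state_availability(mtbf, 48.0))
    print("five nines, min/year:", downtime_minutes_per_year(0.99999))
```

Under this model, lengthening the repair time from 4 to 48 hours lowers availability while leaving reliability unchanged, which matches the serviceability point above. "Five nines" works out to a little over five minutes of downtime per year.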


Failure types

Physical faults can be temporary or permanent.
* Permanent faults lead to a continuing error and are typically due to some physical failure such as metal electromigration or dielectric breakdown.
* Temporary faults include ''transient'' and ''intermittent'' faults.
** Transient (a.k.a. ''soft'') faults lead to independent one-time errors and are not due to permanent hardware faults: examples include alpha particles flipping a memory bit, electromagnetic noise, or power-supply fluctuations.
** Intermittent faults occur due to a weak system component, e.g. circuit parameters degrading, leading to errors that are likely to recur.
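In practice a system observes only errors, not the underlying fault, so the distinction between transient and intermittent faults has to be inferred from recurrence. The sketch below shows one crude heuristic, assumed here purely for illustration: count corrected errors per component inside a sliding time window and treat a component as a suspected intermittent fault once the count reaches a threshold. The class name, labels, window and threshold are all hypothetical.

```python
from collections import defaultdict, deque


class ErrorTracker:
    """Label corrected errors by how often they recur on one component."""

    def __init__(self, window_s: float = 3600.0, threshold: int = 3):
        self.window_s = window_s      # how far back recurrences are counted
        self.threshold = threshold    # recurrences needed to raise suspicion
        self._events = defaultdict(deque)

    def record(self, component: str, timestamp_s: float) -> str:
        events = self._events[component]
        events.append(timestamp_s)
        # Forget errors that fall outside the sliding window.
        while events and timestamp_s - events[0] > self.window_s:
            events.popleft()
        if len(events) >= self.threshold:
            return "suspect-intermittent"
        return "presumed-transient"
```

An isolated error on a component is presumed transient, while repeated errors on the same component within the window suggest a weak part that is likely to fail again.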


Failure responses

Transient and intermittent faults can typically be handled by detection and correction, e.g. by error-correcting codes (ECC) or instruction replay (see below). Permanent faults lead to uncorrectable errors, which can be handled by switching to duplicate hardware, e.g. processor sparing, or by passing the uncorrectable error to higher-level recovery mechanisms. A successfully corrected intermittent fault can also be reported to the operating system (OS) to provide information for predictive failure analysis.
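The response policy described above (retry what can be retried, report what was corrected, escalate what cannot be corrected) can be sketched in software terms. This is an illustration only: real processors implement instruction replay in hardware, and every name below (`DetectedError`, `UncorrectableError`, `FaultLog`, `run_with_retry`, `make_flaky`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class DetectedError(Exception):
    """Stands in for a checker (parity, residue, ECC) flagging a bad result."""


class UncorrectableError(Exception):
    """Retries exhausted; the caller represents higher-level recovery."""


@dataclass
class FaultLog:
    """Corrected errors reported upward, e.g. for predictive failure analysis."""
    corrected: List[str] = field(default_factory=list)


def run_with_retry(op: Callable[[], int], name: str, log: FaultLog,
                   max_retries: int = 3) -> int:
    for attempt in range(max_retries + 1):
        try:
            result = op()
        except DetectedError:
            continue  # transient or intermittent: replay the operation
        if attempt > 0:
            log.corrected.append(f"{name}: corrected after {attempt} retries")
        return result
    # Every attempt failed: treat the fault as permanent and escalate.
    raise UncorrectableError(name)


def make_flaky(failures: int, value: int) -> Callable[[], int]:
    """Test helper: an operation that fails `failures` times, then succeeds."""
    remaining = [failures]

    def op() -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise DetectedError()
        return value

    return op
```

For example, `run_with_retry(make_flaky(2, 42), "add", FaultLog())` returns 42 after two replays and leaves one entry in the log, whereas an operation that never succeeds raises `UncorrectableError` for the caller to handle, for instance by failing over to spare hardware.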


Hardware features

Example hardware features for improving RAS include the following, listed by subsystem:
* Processor:
** Processor instruction error detection (e.g. residue checking of results) with instruction retry, e.g. alternative processor recovery in IBM mainframes, or "Instruction replay technology" in Itanium systems.
** Processors running in lock-step to perform master-checker or voting schemes.
** Machine Check Architecture and ACPI Platform Error Interface to report errors to the OS.
* Memory:
** Parity or ECC (including single device correction) protection of memory components (cache and main memory); bad cache line disabling; memory scrubbing; memory sparing, memory mirroring; bad page offlining; redundant bit steering; redundant array of independent memory (RAIM).
* I/O:
** Cyclic redundancy check checksums for data transmission/retry and data storage, e.g. PCI Express (PCIe) Advanced Error Reporting (AER), redundant I/O paths.
* Storage:
** RAID configurations for hard disk drive and solid-state drive storage.
** Journaling file systems for file repair after crashes.
** Checksums on both data and metadata, and background scrubbing.
** Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) for hard disk drives and solid-state drives.
* Power/cooling:
** Duplicating components to avoid single points of failure, e.g., power supplies.
** Over-designing the system for the specified operating ranges of clock frequency, temperature, voltage, vibration.
** Temperature sensors to throttle operating frequency when temperature goes out of specification.
** Surge protector, uninterruptible power supply, auxiliary power.
* System:
** Hot swapping of components: CPUs, RAM modules, hard disk drives and solid-state drives.
** Predictive failure analysis to predict which intermittent correctable errors will eventually lead to hard non-correctable errors.
** Partitioning/domaining of computer components to allow one large system to act as several smaller systems.
** Virtual machines to decrease the severity of operating system software faults.
** Redundant I/O domains or I/O partitions for providing virtual I/O to guest virtual machines.
** Computer clustering with failover capability, for complete redundancy of hardware and software.
** Dynamic software updating to avoid the need to reboot the system for a kernel software update, for example Ksplice under Linux.
** Independent management processor for serviceability: remote monitoring, alerting and control.
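Two of the memory features listed above, ECC protection and memory scrubbing, can be illustrated together with a toy model. The sketch below uses a Hamming(7,4) code, chosen only because it is small enough to read; server memory typically uses wider codes that also detect double-bit errors, and a Hamming(7,4) code will miscorrect if two bits in the same word flip. All function names are hypothetical.

```python
from typing import List, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def syndrome(word: List[int]) -> int:
    """XOR of the positions of all set bits: 0 if valid, else the bad position."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def correct(word: List[int]) -> Tuple[List[int], bool]:
    """Return the word with a single-bit error fixed, and whether a fix was made."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1
    return fixed, s != 0


def decode(word: List[int]) -> int:
    """Extract the 4 data bits from a (corrected) codeword."""
    _, _, d1, _, d2, d3, d4 = word
    return d1 | (d2 << 1) | (d3 << 2) | (d4 << 3)


def scrub(memory: List[List[int]]) -> int:
    """Read every word, write back corrected data; return the number of fixes."""
    fixes = 0
    for addr, word in enumerate(memory):
        fixed, was_bad = correct(word)
        if was_bad:
            memory[addr] = fixed
            fixes += 1
    return fixes
```

Scrubbing periodically in the background repairs single-bit errors before a second flip in the same word can accumulate and make the word uncorrectable. The count returned by `scrub` is the kind of corrected-error information that could feed predictive failure analysis.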
Fault-tolerant designs extended the idea by making ''RAS'' the defining feature of their computers for applications like stock market exchanges or air traffic control, where system crashes would be catastrophic. Fault-tolerant computers (e.g., see Tandem Computers and Stratus Technologies), which tend to have duplicate components running in lock-step for reliability, have become less popular due to their high cost. High-availability systems, using distributed computing techniques like computer clusters, are often used as cheaper alternatives.
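The difference between the master-checker and voting schemes used by lock-step designs can be sketched as follows. This is a software analogy under assumed names (`master_checker`, `vote`), not a description of any vendor's hardware: two replicas can only detect a disagreement, while three or more can outvote a faulty replica and identify it.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def master_checker(master_out: int, checker_out: int) -> int:
    """Two replicas: a mismatch is detectable, but the faulty side is unknown."""
    if master_out != checker_out:
        raise RuntimeError("lock-step mismatch detected")
    return master_out


def vote(outputs: Sequence[int]) -> Tuple[int, List[int]]:
    """Return the strict-majority output and the indices of dissenting replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: uncorrectable disagreement")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

With three replicas, `vote([7, 7, 9])` yields the result 7 and flags replica 2 for isolation or replacement, so the system keeps running through a single fault. The cost is that every computation is performed two or three times, which is the duplication the paragraph above cites as the reason such systems are expensive.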


See also

* Machine check architecture
* Redundancy (engineering)
* Integrated logistics support
* RAMS (reliability, availability, maintainability and safety)