High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities.
Availability refers to the ability of the user community to obtain a service or good or to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is, from the user's point of view, ''unavailable''. Generally, the term ''downtime'' is used to refer to periods when a system is unavailable.


Principles

There are three principles of systems design in reliability engineering which can help achieve high availability.
# Elimination of single points of failure. This means adding or building redundancy into the system so that failure of a component does not mean failure of the entire system.
# Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover.
# Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure, but the maintenance activity must.
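
The three principles can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the replica names, the `is_healthy` callback, and the `select_active` helper are assumptions made for illustration, not part of any particular product.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def select_active(
    replicas: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Pick the first healthy replica and report every replica found failed.

    Redundancy (principle 1) is the list of replicas, crossover (principle 2)
    is the choice of the next healthy one, and detection (principle 3) is the
    returned list of failed replicas that maintenance must act on.
    """
    active = None
    failed = []
    for name in replicas:
        if is_healthy(name):
            if active is None:
                active = name  # first healthy replica takes over
        else:
            failed.append(name)  # failure is recorded, not hidden
    return active, failed


# Hypothetical health states: the primary is down, the standby is up.
status = {"primary": False, "standby": True}
active, failed = select_active(["primary", "standby"], lambda n: status[n])
print(active, failed)
```

In this sketch the user is served by `standby` and never sees the failure, while `failed` still tells the maintenance activity that `primary` needs attention. The selection function is itself a crossover point, so in a real design it would need to be made reliable as well.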


Scheduled and unscheduled downtime

A distinction can be made between scheduled and unscheduled downtime. Typically, scheduled downtime is a result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to system software that require a reboot or system configuration changes that only take effect upon a reboot. Scheduled downtime is usually the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, security breaches, or various application, middleware, and operating system failures.

If users can be warned away from scheduled downtimes, then the distinction is useful. But if the requirement is for true high availability, then downtime is downtime whether or not it is scheduled. Many computing sites exclude scheduled downtime from availability calculations, assuming that it has little or no impact upon the computing user community. By doing this, they can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example system downtime at an office building after everybody has gone home for the night.
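
The effect of excluding scheduled downtime can be seen in a small calculation. The sketch below is only an illustration: the downtime figures are invented, and it assumes that "excluding" means removing the scheduled window from both the downtime and the measured period.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def availability_percent(period_hours, downtime_hours):
    """Availability as the percentage of the period the system was usable."""
    return 100.0 * (period_hours - downtime_hours) / period_hours


# Hypothetical figures for one year.
scheduled_hours = 40.0    # planned maintenance windows
unscheduled_hours = 2.0   # failures

# Scheduled downtime excluded: the maintenance windows are not counted at all.
claimed = availability_percent(HOURS_PER_YEAR - scheduled_hours, unscheduled_hours)

# All downtime counted against the full year.
actual = availability_percent(HOURS_PER_YEAR, scheduled_hours + unscheduled_hours)

print(f"claimed: {claimed:.3f}%  actual: {actual:.3f}%")
```

With these made-up numbers the claimed figure is above 99.9% while the figure that counts all downtime is closer to 99.5%, which is the gap the paragraph above warns about.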


Percentage calculation

Availability is usually expressed as a percentage of uptime in a given year. Each availability percentage corresponds to a maximum allowed downtime, presuming that the system is required to operate continuously. Service level agreements often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles.

Uptime and availability can be used synonymously as long as the items being discussed are kept consistent. A system can be up while its services are not available, as in the case of a network outage. Conversely, a system can be available to be worked on while its services are not up from a functional perspective (as opposed to a software service or process perspective). The perspective is important here: the item being discussed may be the server hardware, the server OS, a functional service, a software service or process, and so on. If the perspective is kept consistent throughout a discussion, uptime and availability can be used synonymously.
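
The translation from an availability percentage to allowed downtime is simple arithmetic, sketched below in Python. The function names are made up for this example, the year is taken as 365 days, and the 30-day month is an assumption; an actual service level agreement defines its own measurement period.

```python
HOURS_PER_YEAR = 365 * 24   # non-leap year
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day billing month


def allowed_downtime_hours(availability_percent, period_hours):
    """Maximum downtime in hours compatible with the availability target."""
    return (100.0 - availability_percent) / 100.0 * period_hours


def format_duration(hours):
    """Render a duration in minutes, hours, or days, whichever reads best."""
    if hours * 60 < 60:
        return f"{hours * 60:.2f} min"
    if hours < 24:
        return f"{hours:.2f} h"
    return f"{hours / 24:.2f} d"


print(f"{'availability':>12}  {'per year':>10}  {'per month':>10}")
for pct in (90.0, 99.0, 99.9, 99.99, 99.999):
    yearly = format_duration(allowed_downtime_hours(pct, HOURS_PER_YEAR))
    monthly = format_duration(allowed_downtime_hours(pct, HOURS_PER_MONTH))
    print(f"{pct:>11}%  {yearly:>10}  {monthly:>10}")
```

Under these assumptions, 99.9% availability allows about 8.76 hours of downtime per year, and 99.999% allows about 5.26 minutes per year.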


"Nines"

Percentages of a particular order of magnitude are sometimes referred to by the number of nines or "class of nines" in the digits. For example, electricity that is delivered without interruptions (blackouts, brownouts or surges) 99.999% of the time would have 5 nines reliability, or class five. In particular, the term is used in connection with mainframes or enterprise computing, often as part of a service-level agreement.

Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then "five", so 99.95% is "three nines five", abbreviated 3N5. This is casually referred to as "three and a half nines", but this is incorrect: a 5 is only a factor of 2, while a 9 is a factor of 10, so a 5 is 0.3 nines (per the formula below, $\log_{10} 2 \approx 0.3$): 99.95% availability is 3.3 nines, not 3.5 nines. More simply, going from 99.9% availability to 99.95% availability is a factor of 2 (0.1% to 0.05% unavailability), but going from 99.95% to 99.99% availability is a factor of 5 (0.05% to 0.01% unavailability), over twice as much.

A formulation of the ''class of 9s'' $c$ based on a system's unavailability $x$ would be
: $c := \lfloor -\log_{10} x \rfloor$
(cf. floor and ceiling functions). A similar measurement is sometimes used to describe the purity of substances.

In general, the number of nines is not often used by a network engineer when modeling and measuring availability because it is hard to apply in formulas. More often, the unavailability expressed as a probability (like 0.00001), or a downtime per year, is quoted. Availability specified as a number of nines is often seen in marketing documents. The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence. For large numbers of 9s, the "unavailability" index (a measure of downtime rather than uptime) is easier to handle. For example, this is why an "unavailability" rather than availability metric is used in hard disk or data link bit error rates. Sometimes the humorous term "nine fives" (55.5555555%) is used to contrast with "five nines" (99.999%), though this is not an actual goal, but rather a sarcastic reference to totally failing to meet any reasonable target.
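
The class-of-nines formula can be checked with a short Python sketch. The function names are illustrative, and the percentage is passed as a string and handled with `decimal.Decimal` as a design choice of this example: with binary floating point, `1 - 0.999` is not exactly `0.001`, so a floor applied to the logarithm could be off by one.

```python
import math
from decimal import Decimal


def nines(availability_percent: str) -> Decimal:
    """Unfloored number of nines: -log10 of the unavailability x."""
    x = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return -x.log10()


def class_of_nines(availability_percent: str) -> int:
    """Class of 9s: c = floor(-log10(x))."""
    return math.floor(nines(availability_percent))


for pct in ("99.9", "99.95", "99.99", "99.999"):
    print(pct, class_of_nines(pct), round(float(nines(pct)), 2))
```

The unfloored value for 99.95% is about 3.3, matching the argument that "three nines five" is not three and a half nines. The sketch does not handle 100% availability, where the unavailability is zero and the logarithm is undefined.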


Measurement and interpretation

Availability measurement is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have suffered a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% uptime. However, given the true definition of availability, the system will be approximately 99.9% available, or three nines (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when the systems are continuing to function. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users; a true availability measure is holistic.

Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. If there is a lack of instrumentation, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems which experience periodic lulls in demand. An alternative metric is mean time between failures (MTBF).


Closely related concepts

Recovery time (or estimated time of repair (ETR)), also known as recovery time objective (RTO), is closely related to availability: it is the total time required for a planned outage or the time required to fully recover from an unplanned outage. Another metric is mean time to recovery (MTTR). Recovery time could be infinite with certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery data center.

Another related concept is data availability, that is, the degree to which databases and other information storage systems faithfully record and report system transactions. Information management often focuses separately on data availability, or recovery point objective, in order to determine acceptable (or actual) data loss with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.

A service level agreement ("SLA") formalizes an organization's availability objectives and requirements.
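
One common way to connect MTBF and MTTR to availability is the steady-state ratio MTBF / (MTBF + MTTR). This relation is not stated in the text above; it is brought in here as an assumption, and the figures in the sketch are invented.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time in service: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: fails on average every 999 hours, takes 1 hour to recover.
print(steady_state_availability(999.0, 1.0))

# An unrecoverable failure (infinite recovery time) drives availability to zero.
print(steady_state_availability(999.0, float("inf")))
```

The second call mirrors the destroyed data center example: when recovery never completes, no MTBF figure can compensate.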


Military control systems

High availability is one of the primary requirements of the control systems in unmanned vehicles and autonomous maritime vessels. If the controlling system becomes unavailable, the Ground Combat Vehicle (GCV) or ASW Continuous Trail Unmanned Vessel (ACTUV) would be lost.


System design

Adding more components to an overall system design can undermine efforts to achieve high availability because complex systems inherently have more potential failure points and are more difficult to implement correctly. While some analysts would put forth the theory that the most highly available systems adhere to a simple architecture (a single, high quality, multi-purpose physical system with comprehensive internal hardware redundancy), this architecture suffers from the requirement that the entire system must be brought down for patching and operating system upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see load balancing and failover). High availability requires less human intervention to restore operation in complex systems, because the most common cause of outages is human error.

Redundancy is used to create systems with high levels of availability (e.g. aircraft flight computers). In this case it is required to have high levels of failure detectability and avoidance of common cause failures. Two kinds of redundancy are passive redundancy and active redundancy.

Passive redundancy is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers. The boat continues toward its destination despite failure of a single engine or propeller. A more complex example is multiple redundant power generation facilities within a large system involving electric power transmission. Malfunction of single components is not considered to be a failure unless the resulting performance decline exceeds the specification limits for the entire system.

Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using a voting scheme. This is used with complex computing systems that are linked. Internet routing is derived from early work by Birman and Joseph in this area. Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic.

Zero downtime system design means that modeling and simulation indicates that mean time between failures significantly exceeds the period of time between planned maintenance, upgrade events, or system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of communications satellites. The Global Positioning System is an example of a zero downtime system.

Fault instrumentation can be used in systems with limited redundancy to achieve high availability. Maintenance actions occur during brief periods of downtime only after a fault indicator activates. Failure is only significant if it occurs during a mission critical period.
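
The voting scheme mentioned for active redundancy can be sketched as a simple majority vote. This is an illustrative assumption about how such a scheme might look, not a description of any specific flight computer or router; the replica names and the `majority_vote` helper are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (agreed value, suspect replicas) for a dict of replica outputs.

    If no value is reported by a strict majority, the agreed value is None
    and every replica is treated as suspect.
    """
    if not outputs:
        return None, []
    value, count = Counter(outputs.values()).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, sorted(outputs)
    suspects = sorted(name for name, v in outputs.items() if v != value)
    return value, suspects


# Three redundant units compute the same quantity; unit "c" disagrees.
value, suspects = majority_vote({"a": 42, "b": 42, "c": 7})
print(value, suspects)
```

The system keeps using the agreed value and bypasses the suspect unit. As noted above, faulty voting logic is itself a failure mode, so a voter like this one would need the same scrutiny as the components it supervises.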
Modeling and simulation is used to evaluate the theoretical reliability of large systems. The outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves the N-x criterion, where N represents the total number of components in the system and x is the number of components removed to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations where one component is faulted. N-2 means the model is stressed by evaluating performance with all possible combinations where two components are faulted simultaneously.
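
An N-x evaluation can be sketched by enumerating every combination of x faulted components and asking a model whether the system still works. The model below is a toy assumption based on the two-engine boat described earlier; the component names and the `boat_ok` rule are hypothetical.

```python
from itertools import combinations


def n_minus_x_failures(components, x, system_ok):
    """Return each combination of x faulted components that breaks the system."""
    failing = []
    for faulted in combinations(components, x):
        surviving = set(components) - set(faulted)
        if not system_ok(surviving):
            failing.append(faulted)
    return failing


COMPONENTS = ["engine_a", "prop_a", "engine_b", "prop_b"]


def boat_ok(surviving):
    """Toy model: the boat moves if at least one engine/propeller pair is intact."""
    pair_a = {"engine_a", "prop_a"} <= surviving
    pair_b = {"engine_b", "prop_b"} <= surviving
    return pair_a or pair_b


print("N-1 failures:", n_minus_x_failures(COMPONENTS, 1, boat_ok))
print("N-2 failures:", n_minus_x_failures(COMPONENTS, 2, boat_ok))
```

In this toy model no single fault stops the boat, so it passes N-1, but it does not pass N-2: any pair of faults that touches both drive trains stops it. The number of combinations grows quickly with N and x, which is one reason such studies are run on models rather than on real systems.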


Reasons for unavailability

A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems. All reasons refer to not following best practice in each of the following areas (in order of importance):
# Monitoring of the relevant components
# Requirements and procurement
# Operations
# Avoidance of network failures
# Avoidance of internal application failures
# Avoidance of external services that fail
# Physical environment
# Network redundancy
# Technical solution of backup
# Process solution of backup
# Physical location
# Infrastructure redundancy
# Storage architecture redundancy
A book on the factors themselves was published in 2003.


Costs of unavailability

In a 1998 report from IBM Global Services, unavailable systems were estimated to have cost American businesses $4.54 billion in 1996, due to lost productivity and revenues (IBM Global Services, ''Improving systems availability'', 1998).


See also

* Disaster recovery
* Fault tolerance
* High-availability cluster
* Overall equipment effectiveness
* Reliability, availability and serviceability (computing)
* Reliability engineering
* Resilience (network)
* Ubiquitous computing


External links


Lecture Notes on Enterprise Computing
University of Tübingen

by Prof. Phil Koopman
Uptime Calculator (SLA)