IT disaster recovery (also, simply disaster recovery (DR)) is the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or battle. DR employs policies, tools, and procedures with a focus on IT systems supporting critical business functions. Business continuity (BC), by contrast, involves keeping all essential aspects of a business functioning despite significant disruptive events; DR can therefore be considered a subset of BC. DR assumes that the primary site is not immediately recoverable and restores data and services to a secondary site.
IT service continuity
IT service continuity (ITSC) is a subset of business continuity planning (BCP), which relies on the metrics (frequently used as key risk indicators) of recovery point/time objectives. It encompasses IT disaster recovery planning and the wider IT resilience planning. It also incorporates IT infrastructure and services related to communications, such as telephony and data communications.
Principles of backup sites
Planning includes arranging for backup sites, whether they are "hot" (operating prior to a disaster), "warm" (ready to begin operating), or "cold" (requires substantial work to begin operating), and standby sites with hardware as needed for continuity.
In 2008, the British Standards Institution launched a specific standard supporting Business Continuity Standard BS 25999, titled BS25777, specifically to align computer continuity with business continuity. This was withdrawn following the publication in March 2011 of ISO/IEC 27031, "Security techniques — Guidelines for information and communication technology readiness for business continuity."
ITIL has defined some of these terms.
Recovery Time Objective
The Recovery Time Objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disruption in order to avoid a break in business continuity. According to business continuity planning methodology, the RTO is established during the business impact analysis (BIA) by the owner(s) of the process, including identifying time frames for alternate or manual workarounds.

RTO is a complement of RPO. The limits of acceptable or "tolerable" ITSC performance are measured by RTO and RPO in terms of time lost from normal business process functioning and data lost or not backed up during that period.
Recovery Time Actual
Recovery Time Actual (RTA) is the critical metric for business continuity and disaster recovery. The business continuity group conducts timed rehearsals (or actuals), during which RTA gets determined and refined as needed.
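As a minimal sketch of how timed rehearsals might be checked against an objective, the Python snippet below compares measured recovery times with an assumed RTO. The `Rehearsal` record, the function names, and every figure are hypothetical illustrations, not part of any standard or tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Rehearsal:
    """One timed DR rehearsal (hypothetical record layout)."""
    label: str
    recovery_time_actual: timedelta  # measured RTA for this run


def meets_rto(rehearsal: Rehearsal, rto: timedelta) -> bool:
    """Return True when the measured RTA is within the RTO."""
    return rehearsal.recovery_time_actual <= rto


def worst_rta(rehearsals: list[Rehearsal]) -> timedelta:
    """Return the slowest measured recovery, a conservative RTA estimate."""
    return max(r.recovery_time_actual for r in rehearsals)


# Illustrative figures only: an assumed 4-hour RTO and three rehearsals.
RTO = timedelta(hours=4)
RUNS = [
    Rehearsal("drill 1", timedelta(hours=5, minutes=10)),
    Rehearsal("drill 2", timedelta(hours=4, minutes=20)),
    Rehearsal("drill 3", timedelta(hours=3, minutes=45)),
]

for run in RUNS:
    status = "within RTO" if meets_rto(run, RTO) else "exceeds RTO"
    print(f"{run.label}: RTA {run.recovery_time_actual} ({status})")
print("Worst observed RTA:", worst_rta(RUNS))
```

Taking the slowest rehearsal as the working RTA is one conservative choice; an organization might equally track the most recent run or a trend across runs.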
Recovery Point Objective
A Recovery Point Objective (RPO) is the maximum acceptable interval during which transactional data is lost from an IT service. For example, if RPO is measured in minutes, then in practice, off-site mirrored backups must be continuously maintained, as a daily off-site backup will not suffice.
Relationship to RTO
A recovery that is not instantaneous restores transactional data over some interval without incurring significant risks or losses.
RPO measures the maximum time span over which recent data might have been permanently lost; it is not a direct measure of the quantity lost. For instance, if the BC plan is to restore up to the last available backup, then the RPO is the interval between such backups.
RPO is not determined by the existing backup regime. Instead, the BIA determines the RPO for each service. When off-site data is required, the period during which data might be lost may start when backups are prepared, not when the backups are secured off-site.
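The relationship between a backup schedule and an RPO can be illustrated with a short Python sketch. It assumes the plan is to restore from the last available backup, so the worst-case exposure is the longest gap between consecutive backups; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def max_backup_gap(backup_times: list[datetime]) -> timedelta:
    """Return the longest interval between consecutive backups."""
    if len(backup_times) < 2:
        raise ValueError("need at least two backup timestamps")
    ordered = sorted(backup_times)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def regime_meets_rpo(backup_times: list[datetime], rpo: timedelta) -> bool:
    """Check whether the observed backup gaps stay within the RPO."""
    return max_backup_gap(backup_times) <= rpo


# Illustrative timestamps only: the third backup ran six hours late.
BACKUPS = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 2, 0, 0),
    datetime(2024, 1, 3, 6, 0),
]

print("Longest gap:", max_backup_gap(BACKUPS))
print("Meets 24-hour RPO:", regime_meets_rpo(BACKUPS, timedelta(hours=24)))
```

The check runs in one direction only: the RPO comes from the BIA, and the backup regime is tested against it, not the other way round.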
Mean times
The recovery metrics can be converted to/used alongside failure metrics. Common measurements include mean time between failures (MTBF), mean time to first failure (MTFF), mean time to repair (MTTR), and mean down time (MDT).
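As a rough illustration, the sketch below computes MTBF and MTTR from aggregate figures and derives steady-state availability as MTBF / (MTBF + MTTR). This is one common convention; exact definitions vary between organizations, and the function names and numbers are assumptions made for the example.

```python
def mtbf(total_uptime_hours: float, failures: int) -> float:
    """Mean time between failures: operating time divided by failure count."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTBF")
    return total_uptime_hours / failures


def mttr(total_repair_hours: float, failures: int) -> float:
    """Mean time to repair: total repair time divided by failure count."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTTR")
    return total_repair_hours / failures


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability under the MTBF / (MTBF + MTTR) convention."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures: 1,000 observed hours, 4 failures, 8 hours of repair.
uptime_hours = 1000 - 8
between = mtbf(uptime_hours, 4)
repair = mttr(8, 4)
print(f"MTBF {between} h, MTTR {repair} h, "
      f"availability {availability(between, repair):.3%}")
```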
Data synchronization points
A data synchronization point is the point in time at which a backup is completed. Update processing is halted while a disk-to-disk copy is made. The backup copy reflects the state of the data at the time of that copy operation, not at the time the data is later copied to tape or transmitted elsewhere.
System design
RTO and the RPO must be balanced, taking business risk into account, along with other system design criteria.
RPO is tied to the times backups are secured offsite. Sending synchronous copies to an offsite mirror allows for most unforeseen events. The use of physical transportation for tapes (or other transportable media) is common. Recovery can be activated at a predetermined site. Shared offsite space and hardware complete the package.
For high volumes of high-value transaction data, hardware can be split across multiple sites.
History
Planning for disaster recovery and information technology (IT) developed in the mid to late 1970s as computer center managers began to recognize the dependence of their organizations on their computer systems.
At that time, most systems were batch-oriented mainframes. An offsite mainframe could be loaded from backup tapes pending recovery of the primary site; downtime was relatively less critical.
The disaster recovery industry developed to provide backup computer centers. Sungard Availability Services was one of the earliest such centers, located in Sri Lanka (1978).
During the 1980s and 90s, computing grew exponentially, including internal corporate timesharing, online data entry and real-time processing. Availability of IT systems became more important.
Regulatory agencies became involved; availability objectives of 2, 3, 4 or 5 nines (99% up to 99.999%) were often mandated, and high-availability solutions for hot-site facilities were sought.
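To make the "nines" concrete, the following Python sketch converts a number of nines into an availability fraction and the downtime it permits per year. It assumes a 365-day year; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def availability_from_nines(nines: int) -> float:
    """Convert a count of nines to an availability fraction (3 -> 0.999)."""
    return 1 - 10 ** (-nines)


def allowed_downtime_minutes(nines: int) -> float:
    """Return the downtime per year permitted by the availability objective."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


for n in (2, 3, 4, 5):
    print(f"{n} nines: {availability_from_nines(n):.5%} available, "
          f"about {allowed_downtime_minutes(n):.1f} minutes of downtime per year")
```

Under these assumptions, two nines allow roughly 3.65 days of downtime per year, while five nines allow a little over five minutes.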
IT service continuity became essential as part of Business Continuity Management (BCM) and Information Security Management (ISM) as specified in ISO 22301 and ISO/IEC 27001 respectively.
The rise of cloud computing since 2010 created new opportunities for system resiliency. Service providers absorbed the responsibility for maintaining high service levels, including availability and reliability. They offered highly resilient network designs. Recovery as a Service (RaaS) is widely available and promoted by the Cloud Security Alliance.
Classification
Disasters can be the result of three broad categories of threats and hazards.
* Natural hazards include acts of nature such as floods, hurricanes, tornadoes, earthquakes, and epidemics.
* Technological hazards include accidents or the failures of systems and structures such as pipeline explosions, transportation accidents, utility disruptions, dam failures, and accidental hazardous material releases.
* Human-caused threats include intentional acts such as active assailant attacks, chemical or biological attacks, cyber attacks against data or infrastructure, sabotage, and war.
Preparedness measures for all categories and types of disasters fall into the five mission areas of prevention, protection, mitigation, response, and recovery.
Planning
Research supports the idea that implementing a more holistic pre-disaster planning approach is more cost-effective. Every $1 spent on hazard mitigation (such as a disaster recovery plan) saves society $4 in response and recovery costs.
2015 disaster recovery statistics suggest that downtime lasting for one hour can cost
* small companies $8,000,
* mid-size organizations $74,000, and
* large enterprises $700,000 or more.
As IT systems have become increasingly critical to the smooth operation of a company, and arguably the economy as a whole, the importance of ensuring the continued operation of those systems, and their rapid recovery, has increased.
Control measures
Control measures are steps or mechanisms that can reduce or eliminate threats. The choice of mechanisms is reflected in a disaster recovery plan (DRP).
Control measures can be classified as controls aimed at preventing an event from occurring, controls aimed at detecting or discovering unwanted events, and controls aimed at correcting or restoring the system after a disaster or an event.
These controls are documented and exercised regularly using so-called "DR tests".
Strategies
The disaster recovery strategy derives from the business continuity plan. Metrics for business processes are then mapped to systems and infrastructure. A cost-benefit analysis highlights which disaster recovery measures are appropriate. Different strategies make sense based on the cost of downtime compared to the cost of implementing a particular strategy.
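The trade-off between the cost of downtime and the cost of a strategy can be sketched in a few lines of Python. The model below is a deliberately simple assumption: expected annual cost equals the strategy's annual cost plus expected incidents per year times recovery hours times hourly downtime cost. The strategy names, prices, recovery times, and incident rate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """A candidate DR strategy with illustrative cost figures."""
    name: str
    annual_cost: float      # yearly cost of running the strategy
    recovery_hours: float   # expected downtime per incident


def expected_annual_cost(strategy: Strategy,
                         hourly_downtime_cost: float,
                         incidents_per_year: float) -> float:
    """Strategy cost plus the expected cost of downtime over one year."""
    downtime = incidents_per_year * strategy.recovery_hours * hourly_downtime_cost
    return strategy.annual_cost + downtime


def cheapest(strategies: list[Strategy],
             hourly_downtime_cost: float,
             incidents_per_year: float) -> Strategy:
    """Pick the strategy with the lowest expected annual cost."""
    return min(
        strategies,
        key=lambda s: expected_annual_cost(s, hourly_downtime_cost, incidents_per_year),
    )


# Hypothetical options and an assumed rate of one incident every two years.
OPTIONS = [
    Strategy("tape off-site", annual_cost=10_000, recovery_hours=48),
    Strategy("disk replication", annual_cost=60_000, recovery_hours=4),
    Strategy("hot site", annual_cost=250_000, recovery_hours=0.5),
]

for hourly_cost in (8_000, 700_000):
    best = cheapest(OPTIONS, hourly_cost, incidents_per_year=0.5)
    print(f"Downtime at ${hourly_cost:,}/hour favours: {best.name}")
```

With these made-up figures, a low hourly downtime cost favours replication, while a high one justifies the expense of a hot site. A real analysis would also weigh data loss, regulatory exposure, and uncertainty in the incident rate.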
Common strategies include:
* backups made to tape and sent off-site
* backups to disk on-site (copied to off-site disk) or off-site
* replication of data off-site, so that the data is already available once the systems are restored or synchronized, possibly via storage area network technology
* private cloud solutions that replicate metadata (VMs, templates and disks) into the private cloud. Metadata are configured as an XML representation called Open Virtualization Format, and can be easily restored
* hybrid cloud solutions that replicate both on-site and to off-site data centers. This provides instant fail-over to on-site hardware or to cloud data centers.
* high availability systems which keep both the data and system replicated off-site, enabling continuous access to systems and data, even after a disaster (often associated with cloud storage).
Precautionary strategies may include:
* local mirrors of systems and/or data and use of disk protection technology such as RAID
* surge protectors, to minimize the effect of power surges on delicate electronic equipment
* use of an uninterruptible power supply (UPS) and/or backup generator to keep systems going in the event of a power failure
* fire prevention/mitigation systems such as alarms and fire extinguishers
* anti-virus software and other security measures.
Disaster recovery as a service
Disaster recovery as a service (DRaaS) is an arrangement with a third party vendor to perform some or all DR functions for scenarios such as power outages, equipment failures, cyber attacks, and natural disasters.
Disaster recovery for cloud systems
Following best practices can enhance disaster recovery strategy for cloud-hosted systems:
# Flexibility: The disaster recovery strategy should be adaptable to support both partial failures (such as recovering specific files) and full environment failures.
# Regular testing: Regular testing of the disaster recovery plan can verify its effectiveness and identify any weaknesses or gaps.
# Clear roles and permissions: It should be clearly defined who is authorized to execute the disaster recovery plan, with separate access and permissions for these individuals. Implementing a clear separation of permissions between those who can execute the recovery and those who have access to backup data helps minimize the risk of unauthorized actions.
# Documentation: The plan should be well-documented and easy to follow, so that operators can use it effectively during stressful situations.
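The separation of permissions described in the list above could be audited with a check along the following lines. This is a minimal sketch under stated assumptions: the permission names, the `grants` mapping, and the principals are hypothetical and do not correspond to any particular cloud provider's access model.

```python
# Hypothetical permission names, chosen for this example only.
EXECUTE_RECOVERY = "execute_recovery"
ACCESS_BACKUP_DATA = "access_backup_data"


def separation_violations(grants: dict[str, set[str]]) -> list[str]:
    """Return principals that hold both recovery and backup-data permissions."""
    conflicting = {EXECUTE_RECOVERY, ACCESS_BACKUP_DATA}
    return sorted(
        principal
        for principal, permissions in grants.items()
        if conflicting <= permissions  # subset test: both permissions present
    )


# Illustrative grants: "carol" holds both permissions and should be flagged.
GRANTS = {
    "alice": {EXECUTE_RECOVERY},
    "bob": {ACCESS_BACKUP_DATA},
    "carol": {EXECUTE_RECOVERY, ACCESS_BACKUP_DATA},
}

print("Principals violating separation:", separation_violations(GRANTS))
```

Running such a check as part of regular DR testing is one way to notice when role assignments have drifted over time.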