Disaster recovery is the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or battle. It employs policies, tools, and procedures. Disaster recovery focuses on the information technology (IT) or technology systems supporting critical business functions, as opposed to business continuity, which involves keeping all essential aspects of a business functioning despite significant disruptive events; disaster recovery can therefore be considered a subset of business continuity. Disaster recovery assumes that the primary site is not immediately recoverable and restores data and services to a secondary site.

IT service continuity

IT Service Continuity (ITSC) is a subset of business continuity planning (BCP) that focuses on Recovery Point Objective (RPO) and Recovery Time Objective (RTO). It encompasses IT disaster recovery planning and wider IT resilience planning. It also incorporates IT infrastructure and services related to communications, such as telephony and data communications.

Principles of backup sites

Planning includes arranging for backup sites, whether they are "hot" (operating prior to a disaster), "warm" (ready to begin operating), or "cold" (requiring substantial work to begin operating), and standby sites with hardware as needed for continuity. In 2008, the British Standards Institution launched BS 25777, a standard supporting the business continuity standard BS 25999, specifically to align computer continuity with business continuity. It was withdrawn following the publication in March 2011 of ISO/IEC 27031, "Security techniques — Guidelines for information and communication technology readiness for business continuity." ITIL has defined some of these terms.

Recovery Time Objective

The Recovery Time Objective (RTO) is the targeted duration of time and service level within which a business process must be restored after a disruption in order to avoid a break in business continuity. According to business continuity planning methodology, the RTO is established during the Business Impact Analysis (BIA) by the owner(s) of the process, including identifying time frames for alternate or manual workarounds.

RTO is a complement of RPO. The limits of acceptable or "tolerable" ITSC performance are measured by RTO and RPO in terms of time lost from normal business process functioning and data lost or not backed up during that period.

Recovery Time Actual

Recovery Time Actual (RTA) is the critical metric for business continuity and disaster recovery. The business continuity group conducts timed rehearsals (or actuals), during which RTA gets determined and refined as needed.
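As an illustration of how a timed rehearsal relates RTA to RTO, the following is a minimal sketch. The function names and the rehearsal timestamps are hypothetical and chosen only for this example; they do not come from any standard.

```python
from datetime import datetime, timedelta


def recovery_time_actual(disruption_start: datetime, service_restored: datetime) -> timedelta:
    """Elapsed time measured during a timed rehearsal (or a real incident)."""
    if service_restored < disruption_start:
        raise ValueError("service cannot be restored before the disruption starts")
    return service_restored - disruption_start


def meets_rto(rta: timedelta, rto: timedelta) -> bool:
    """True when the measured recovery time is within the targeted RTO."""
    return rta <= rto


# Hypothetical rehearsal: disruption declared at 09:00, service back at 12:30.
rta = recovery_time_actual(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 12, 30))
print(rta, meets_rto(rta, rto=timedelta(hours=4)))
```

A rehearsal whose RTA exceeds the RTO suggests that either the recovery procedure or the objective itself needs to be revisited.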

Recovery Point Objective

A Recovery Point Objective (RPO) is the maximum acceptable interval during which transactional data is lost from an IT service. For example, if RPO is measured in minutes, then in practice off-site mirrored backups must be continuously maintained, because a daily off-site backup will not suffice.

Relationship to Recovery Time Objective

A recovery that is not instantaneous restores transactional data over some interval without incurring significant risks or losses. RPO measures the maximum time in which recent data might have been permanently lost; it is not a direct measure of loss quantity. For instance, if the BC plan is to restore up to the last available backup, then the RPO is the interval between such backups. RPO is not determined by the existing backup regime. Instead, business impact analysis determines RPO for each service. When off-site data is required, the period during which data might be lost may start when backups are prepared, not when the backups are secured off-site.
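The point about off-site data can be made concrete with a small calculation. The sketch below assumes a simple model that is not stated in the text: backups are prepared at a fixed interval and each one takes a fixed delay to be secured off-site. Under that assumption, a disaster just before the newest backup is secured forces a restore from the previous one, so the worst-case loss window is the interval plus the delay. All names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, offsite_delay: timedelta) -> timedelta:
    """Worst-case age of the newest backup that is already secured off-site.

    Assumed model: fixed backup interval and fixed off-site delay. The loss
    window is counted from when the usable backup was prepared, not from
    when it arrived off-site.
    """
    return backup_interval + offsite_delay


def satisfies_rpo(backup_interval: timedelta, offsite_delay: timedelta, rpo: timedelta) -> bool:
    """Check a backup regime against an RPO set by business impact analysis."""
    return worst_case_data_loss(backup_interval, offsite_delay) <= rpo


# Hypothetical regime: daily backups that reach the off-site vault 6 hours later.
daily = worst_case_data_loss(timedelta(days=1), timedelta(hours=6))
print(daily, satisfies_rpo(timedelta(days=1), timedelta(hours=6), rpo=timedelta(hours=24)))
```

In this model a daily backup cannot satisfy a 24-hour RPO once the off-site delay is counted, which is why the RPO should drive the backup regime rather than the reverse.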

Data synchronization points

A data synchronization point is the point in time at which a backup is completed. Update processing is halted while a disk-to-disk copy is completed. The backup copy reflects the state of the data at the time of that copy operation, not at the later time when the data is copied to tape or transmitted elsewhere.

System design

RTO and the RPO must be balanced, taking business risk into account, along with other system design criteria. RPO is tied to the times backups are secured offsite. Sending synchronous copies to an offsite mirror allows for most unforeseen events. The use of physical transportation for tapes (or other transportable media) is common. Recovery can be activated at a predetermined site. Shared offsite space and hardware complete the package. For high volumes of high-value transaction data, hardware can be split across multiple sites.

History

Planning for disaster recovery and information technology (IT) developed in the mid to late 1970s as computer center managers began to recognize the dependence of their organizations on their computer systems. At that time, most systems were batch-oriented mainframes. An offsite mainframe could be loaded from backup tapes pending recovery of the primary site; downtime was relatively less critical. The disaster recovery industry developed to provide backup computer centers. Sungard Availability Services was one of the earliest such centers, located in Sri Lanka (1978). During the 1980s and 90s, computing grew exponentially, including internal corporate timesharing, online data entry and real-time processing.

Availability of IT systems became more important. Regulatory agencies became involved; availability objectives of 2, 3, 4 or 5 nines (99% up to 99.999%) were often mandated, and high-availability solutions for hot-site facilities were sought. IT service continuity became essential as part of Business Continuity Management (BCM) and Information Security Management (ISM), as specified in ISO 22301 and ISO/IEC 27001 respectively. The rise of cloud computing since 2010 created new opportunities for system resiliency. Service providers absorbed the responsibility for maintaining high service levels, including availability and reliability. They offered highly resilient network designs.

Recovery as a Service (RaaS) is widely available and is promoted by the Cloud Security Alliance.
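To show what availability objectives expressed in "nines" mean in practice, the sketch below converts a number of nines into an availability fraction and an allowed downtime per year. It assumes a 365-day year and that n nines means an availability of 1 - 10^-n; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def availability_from_nines(nines: int) -> float:
    """Availability fraction for n nines, e.g. 3 nines is roughly 0.999."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    return 1 - 10 ** -nines


def allowed_downtime_minutes_per_year(nines: int) -> float:
    """Maximum yearly downtime that still meets an n-nines objective."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    return MINUTES_PER_YEAR * 10 ** -nines


for n in (2, 3, 4, 5):
    print(n, availability_from_nines(n), allowed_downtime_minutes_per_year(n))
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why five nines leaves only a few minutes of outage per year.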

Classification

Disasters can be the result of three broad categories of threats and hazards.

* Natural hazards include acts of nature such as floods, hurricanes, tornadoes, earthquakes, and epidemics.
* Technological hazards include accidents or the failures of systems and structures such as pipeline explosions, transportation accidents, utility disruptions, dam failures, and accidental hazardous material releases.
* Human-caused threats include intentional acts such as active assailant attacks, chemical or biological attacks, cyber attacks against data or infrastructure, sabotage, and war.

Preparedness measures for all categories and types of disasters fall into the five mission areas of prevention, protection, mitigation, response, and recovery.

Planning

Research supports the idea that implementing a more holistic pre-disaster planning approach is more cost-effective. Every $1 spent on hazard mitigation (such as a disaster recovery plan) saves society $4 in response and recovery costs. 2015 disaster recovery statistics suggest that downtime lasting for one hour can cost

* small companies $8,000,
* mid-size organizations $74,000, and
* large enterprises $700,000 or more.

As IT systems have become increasingly critical to the smooth operation of a company, and arguably the economy as a whole, the importance of ensuring the continued operation of those systems, and their rapid recovery, has increased.
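The hourly figures quoted above can be turned into a rough estimate for a longer outage. The sketch below assumes that cost scales linearly with outage duration, which the cited statistics do not claim; the dictionary keys and function name are hypothetical.

```python
# Hourly downtime costs in USD, taken from the 2015 figures quoted above.
HOURLY_DOWNTIME_COST_USD = {
    "small": 8_000,
    "mid-size": 74_000,
    "large": 700_000,  # quoted as "$700,000 or more", so treat as a lower bound
}


def estimated_downtime_cost(company_size: str, outage_hours: float) -> float:
    """Rough outage cost, assuming cost grows linearly with duration."""
    if outage_hours < 0:
        raise ValueError("outage_hours must be non-negative")
    return HOURLY_DOWNTIME_COST_USD[company_size] * outage_hours


# Hypothetical three-hour outage at a mid-size organization.
print(estimated_downtime_cost("mid-size", 3))
```

Real outage costs are rarely linear, so an estimate like this is best read as an order-of-magnitude figure for planning discussions.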

Control measures

Control measures are steps or mechanisms that can reduce or eliminate threats. The choice of mechanisms is reflected in a disaster recovery plan (DRP). Control measures can be classified as controls aimed at preventing an event from occurring, controls aimed at detecting or discovering unwanted events, and controls aimed at correcting or restoring the system after a disaster or an event. These controls are documented and exercised regularly using so-called "DR tests".
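One way to record this classification in a disaster recovery plan is a simple register that tags each control with its type. The sketch below is illustrative only: the enum, the register contents, and the assignment of each example control to a type are assumptions made for this example.

```python
from enum import Enum


class ControlType(Enum):
    PREVENTIVE = "preventive"   # aims to stop an event from occurring
    DETECTIVE = "detective"     # aims to detect or discover unwanted events
    CORRECTIVE = "corrective"   # aims to restore the system after an event


# Hypothetical register; the type assigned to each control is an assumption.
CONTROLS = {
    "surge protector": ControlType.PREVENTIVE,
    "fire alarm": ControlType.DETECTIVE,
    "restore from off-site backup": ControlType.CORRECTIVE,
}


def controls_of_type(control_type: ControlType) -> list[str]:
    """Names of registered controls of the given type, sorted alphabetically."""
    return sorted(name for name, kind in CONTROLS.items() if kind is control_type)


def uncovered_types() -> set[ControlType]:
    """Control types that have no registered control, i.e. gaps in the plan."""
    return set(ControlType) - set(CONTROLS.values())


print(controls_of_type(ControlType.DETECTIVE), uncovered_types())
```

Checking for uncovered types is a quick way to spot a plan that, for example, detects incidents but has no documented way to recover from them.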

Strategies

The disaster recovery strategy derives from the business continuity plan. Metrics for business processes are then mapped to systems and infrastructure. A cost-benefit analysis highlights which disaster recovery measures are appropriate. Different strategies make sense based on the cost of downtime compared to the cost of implementing a particular strategy. Common strategies include:

* backups to tape that are sent off-site
* backups to disk on-site (copied to off-site disk) or off-site
* replication off-site, so that only the systems need to be restored or synchronized, possibly via storage area network technology
* private cloud solutions that replicate metadata (VMs, templates and disks) into the private cloud. Metadata are configured as an XML representation called Open Virtualization Format, and can be easily restored
* hybrid cloud solutions that replicate both on-site and to off-site data centers. This provides instant fail-over to on-site hardware or to cloud data centers.
* high availability systems which keep both the data and system replicated off-site, enabling continuous access to systems and data, even after a disaster (often associated with cloud storage).

Precautionary strategies may include:

* local mirrors of systems and/or data and use of disk protection technology such as RAID
* surge protectors, to minimize the effect of power surges on delicate electronic equipment
* use of an uninterruptible power supply (UPS) and/or backup generator to keep systems going in the event of a power failure
* fire prevention/mitigation systems such as alarms and fire extinguishers
* anti-virus software and other security measures.
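The cost-benefit comparison described at the start of this section can be sketched as follows. Every number here is hypothetical, as are the class and function names; the model simply adds each strategy's yearly cost to the expected yearly downtime it leaves, priced at the organization's hourly cost of downtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    annual_cost: float              # hypothetical yearly cost of the strategy (USD)
    expected_downtime_hours: float  # hypothetical yearly downtime it still leaves


def total_annual_cost(strategy: Strategy, downtime_cost_per_hour: float) -> float:
    """Strategy cost plus the expected cost of the downtime that remains."""
    return strategy.annual_cost + strategy.expected_downtime_hours * downtime_cost_per_hour


def cheapest_strategy(strategies: list[Strategy], downtime_cost_per_hour: float) -> Strategy:
    """Pick the strategy with the lowest combined yearly cost."""
    return min(strategies, key=lambda s: total_annual_cost(s, downtime_cost_per_hour))


# Illustrative candidates only; none of these figures come from the text.
CANDIDATES = [
    Strategy("tape backups sent off-site", 10_000, 24.0),
    Strategy("off-site replication", 60_000, 4.0),
    Strategy("high availability", 250_000, 0.1),
]

for hourly_cost in (500, 8_000, 700_000):
    print(hourly_cost, cheapest_strategy(CANDIDATES, hourly_cost).name)
```

With these made-up figures, the cheapest option shifts from tape to replication to high availability as the hourly cost of downtime rises, which is the trade-off the cost-benefit analysis is meant to expose.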

Disaster recovery as a service

Disaster recovery as a service (DRaaS) is an arrangement with a third-party vendor to perform some or all DR functions.
