IT disaster recovery

IT disaster recovery (also, simply disaster recovery (DR)) is the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or battle. DR employs policies, tools, and procedures with a focus on IT systems supporting critical business functions. [1] This involves keeping all essential aspects of a business functioning despite significant disruptive events; it can therefore be considered a subset of business continuity (BC). [2] [3] DR assumes that the primary site is not immediately recoverable and restores data and services to a secondary site.

IT service continuity

IT service continuity (ITSC) is a subset of business continuity planning (BCP), [4] which relies on the metrics (frequently used as key risk indicators) of recovery point/time objectives. It encompasses IT disaster recovery planning and the wider IT resilience planning. It also incorporates IT infrastructure and services related to communications, such as telephony and data communications. [5] [6]

Principles of backup sites

Planning includes arranging for backup sites, whether they are "hot" (operating prior to a disaster), "warm" (ready to begin operating), or "cold" (requiring substantial work to begin operating), and standby sites with hardware as needed for continuity.

In 2008, the British Standards Institution launched BS 25777, a standard supporting the business continuity standard BS 25999, specifically to align computer continuity with business continuity. It was withdrawn following the publication in March 2011 of ISO/IEC 27031, "Security techniques – Guidelines for information and communication technology readiness for business continuity." [7]

ITIL has defined some of these terms. [8]

Recovery Time Objective

The Recovery Time Objective (RTO) [9] [10] is the targeted duration of time and a service level within which a business process must be restored after a disruption in order to avoid a break in business continuity. [11]

According to business continuity planning methodology, the RTO is established during the business impact analysis (BIA) by the owner(s) of the process, including identifying time frames for alternate or manual workarounds.

Figure: Schematic representation of the terms RPO and RTO. The example shows longer 'actual' times that do NOT meet either the RPO or the RTO ('objectives').

RTO is a complement of RPO. The limits of acceptable or "tolerable" ITSC performance are measured by RTO and RPO in terms of time lost from normal business process functioning and data lost or not backed up during that period. [11] [12]
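As a rough illustration of how actual times can be compared against the objectives, the sketch below checks one incident or rehearsal against an RTO and an RPO. The names `RecoveryObjectives` and `meets_objectives`, and all timestamps and thresholds, are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable time to restore the service
    rpo: timedelta  # maximum tolerable window of lost data


def meets_objectives(objectives, last_good_copy, disruption, restored):
    """Return (rpo_met, rto_met) for one incident or rehearsal."""
    data_loss_window = disruption - last_good_copy  # compared with the RPO
    recovery_time = restored - disruption           # compared with the RTO
    return (data_loss_window <= objectives.rpo,
            recovery_time <= objectives.rto)


# Made-up scenario: 2 hours of data lost, 5 hours to restore.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
result = meets_objectives(
    objectives,
    last_good_copy=datetime(2024, 1, 1, 8, 0),
    disruption=datetime(2024, 1, 1, 10, 0),
    restored=datetime(2024, 1, 1, 15, 0),
)
print(result)  # (False, False): both objectives are missed
```

Keeping the two checks separate reflects that a recovery can meet one objective while missing the other.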

Recovery Time Actual

Recovery Time Actual (RTA) is the critical metric for business continuity and disaster recovery. [9]

The business continuity group conducts timed rehearsals (or actuals), during which RTA is determined and refined as needed. [9]

Recovery Point Objective

A Recovery Point Objective (RPO) is the maximum acceptable interval during which transactional data is lost from an IT service. [11]

For example, if RPO is measured in minutes, then in practice, off-site mirrored backups must be continuously maintained as a daily off-site backup will not suffice. [13]

Relationship to RTO

A recovery that is not instantaneous restores transactional data over some interval without incurring significant risks or losses. [11]

RPO measures the maximum time span over which recent data might have been permanently lost; it is not a direct measure of the quantity of data lost. For instance, if the BC plan is to restore up to the last available backup, then the RPO is the interval between such backups.

RPO is not determined by the existing backup regime. Instead BIA determines RPO for each service. When off-site data is required, the period during which data might be lost may start when backups are prepared, not when the backups are secured off-site. [12]
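One way to test a backup regime against an RPO set by the BIA is to estimate its worst-case data-loss window. The sketch below assumes a simplified model that is not taken from the cited sources: backups are prepared at a fixed interval and each one becomes usable off-site after a fixed delay, so the worst case is the interval plus the delay. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_loss_window(backup_interval, offsite_delay=timedelta(0)):
    """Longest span of data that could be lost under the assumed model.

    Just before a new backup is secured off-site, the newest usable copy
    was prepared one full interval plus one transfer delay earlier.
    """
    return backup_interval + offsite_delay


def regime_satisfies_rpo(rpo, backup_interval, offsite_delay=timedelta(0)):
    """True if the backup regime keeps the worst-case loss within the RPO."""
    return worst_case_loss_window(backup_interval, offsite_delay) <= rpo


rpo = timedelta(minutes=15)

# Daily backups with a 6-hour courier run cannot meet a 15-minute RPO.
print(regime_satisfies_rpo(rpo, timedelta(hours=24), timedelta(hours=6)))  # False

# Near-continuous mirroring (assumed 1-minute lag) can.
print(regime_satisfies_rpo(rpo, timedelta(minutes=1)))  # True
```

The check runs from the RPO to the regime, not the other way round, matching the point that the RPO is an input to backup design.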

Mean times

The recovery metrics can be converted to, or used alongside, failure metrics. Common measurements include mean time between failures (MTBF), mean time to first failure (MTFF), mean time to repair (MTTR), and mean down time (MDT).
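A common way to combine these failure metrics is the steady-state availability ratio MTBF / (MTBF + MTTR). This is a widely used convention rather than something stated in the sources cited here, and the function name and sample figures below are illustrative assumptions.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as the fraction of time a system is up.

    Uses the conventional ratio MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# A system that fails on average every 999 hours and takes 1 hour to repair
# is available about 99.9% of the time.
print(f"{availability(999, 1):.3%}")
```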

Data synchronization points

A data synchronization point [14] is the point in time at which a backup is completed. Update processing is halted while a disk-to-disk copy is made. The backup [15] copy reflects the time of that copy operation, not the later time when the data is copied to tape or transmitted elsewhere.

System design

RTO and the RPO must be balanced, taking business risk into account, along with other system design criteria. [16]

RPO is tied to the times backups are secured offsite. Sending synchronous copies to an offsite mirror allows for most unforeseen events. The use of physical transportation for tapes (or other transportable media) is common. Recovery can be activated at a predetermined site. Shared offsite space and hardware complete the package. [17]

For high volumes of high-value transaction data, hardware can be split across multiple sites.

History

Planning for disaster recovery and information technology (IT) developed in the mid to late 1970s as computer center managers began to recognize the dependence of their organizations on their computer systems.

At that time, most systems were batch-oriented mainframes. An offsite mainframe could be loaded from backup tapes pending recovery of the primary site; downtime was relatively less critical.

The disaster recovery industry [18] [19] developed to provide backup computer centers. Sungard Availability Services was one of the earliest such centers, located in Sri Lanka (1978). [20] [21]

During the 1980s and 90s, computing grew exponentially, including internal corporate timesharing, online data entry and real-time processing. Availability of IT systems became more important.

Regulatory agencies became involved; availability objectives of 2, 3, 4 or 5 nines (99.999%) were often mandated, and high-availability solutions for hot-site facilities were sought. [citation needed]
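To give a sense of what these objectives imply, the sketch below converts a number of "nines" into the downtime allowed per year. It assumes a 365-day year and counts all downtime equally; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes_per_year(nines):
    """Downtime budget implied by an availability target of N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR


for nines in (2, 3, 4, 5):
    minutes = allowed_downtime_minutes_per_year(nines)
    print(f"{nines} nines: about {minutes:.1f} minutes per year")
```

Under these assumptions, five nines leaves a budget of a little over five minutes of downtime per year, while two nines allows more than three and a half days.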

IT service continuity became essential as part of Business Continuity Management (BCM) and Information Security Management (ISM), as specified in ISO 22301 and ISO/IEC 27001 respectively.

The rise of cloud computing since 2010 created new opportunities for system resiliency. Service providers absorbed the responsibility for maintaining high service levels, including availability and reliability. They offered highly resilient network designs. Recovery as a Service (RaaS) is widely available and promoted by the Cloud Security Alliance. [22]

Classification

Disasters can be the result of three broad categories of threats and hazards.

Preparedness measures for all categories and types of disasters fall into the five mission areas of prevention, protection, mitigation, response, and recovery. [23]

Planning

Research supports the idea that implementing a more holistic pre-disaster planning approach is more cost-effective. Every $1 spent on hazard mitigation (such as a disaster recovery plan) saves society $4 in response and recovery costs. [24]

2015 disaster recovery statistics suggest that downtime lasting for one hour can cost [25]

As IT systems have become increasingly critical to the smooth operation of a company, and arguably the economy as a whole, the importance of ensuring the continued operation of those systems, and their rapid recovery, has increased. [26]

Control measures

Control measures are steps or mechanisms that can reduce or eliminate threats. The choice of mechanisms is reflected in a disaster recovery plan (DRP).

Control measures can be classified as controls aimed at preventing an event from occurring, controls aimed at detecting or discovering unwanted events, and controls aimed at correcting or restoring the system after a disaster or an event.

These controls are documented and exercised regularly using so-called "DR tests".

Strategies

The disaster recovery strategy derives from the business continuity plan. [27] Metrics for business processes are then mapped to systems and infrastructure. [28] A cost-benefit analysis highlights which disaster recovery measures are appropriate. Different strategies make sense based on the cost of downtime compared to the cost of implementing a particular strategy.
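The trade-off between the cost of downtime and the cost of a strategy can be sketched as a simple expected-cost comparison. Everything in the example is a hypothetical assumption made for illustration: the `Strategy` type, the cost figures, the recovery times, and the simplification that each outage lasts exactly as long as the strategy's recovery time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    annual_cost: float     # yearly cost of maintaining the strategy
    recovery_hours: float  # assumed downtime per outage with this strategy


def total_annual_cost(strategy, downtime_cost_per_hour, outages_per_year):
    """Strategy cost plus expected yearly downtime cost."""
    expected_downtime_cost = (
        outages_per_year * strategy.recovery_hours * downtime_cost_per_hour
    )
    return strategy.annual_cost + expected_downtime_cost


def cheapest_strategy(strategies, downtime_cost_per_hour, outages_per_year):
    """Pick the strategy with the lowest total expected annual cost."""
    return min(
        strategies,
        key=lambda s: total_annual_cost(s, downtime_cost_per_hour, outages_per_year),
    )


strategies = [
    Strategy("cold site", annual_cost=10_000, recovery_hours=72),
    Strategy("warm site", annual_cost=50_000, recovery_hours=12),
    Strategy("hot site", annual_cost=200_000, recovery_hours=1),
]

# The preferred strategy shifts as an hour of downtime becomes more expensive.
for hourly_cost in (1_000, 10_000, 50_000):
    best = cheapest_strategy(strategies, hourly_cost, outages_per_year=0.5)
    print(f"downtime at {hourly_cost}/hour -> {best.name}")
```

With these made-up figures the choice moves from the cold site to the warm site to the hot site as the hourly cost of downtime rises. A real analysis would also weigh factors this sketch ignores, such as data loss, regulatory penalties, and reputational damage.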

Common strategies include:

Precautionary strategies may include:

Disaster recovery as a service

Figure: A modular data center connected to the power grid at a utility substation.

Disaster recovery as a service (DRaaS) is an arrangement with a third-party vendor to perform some or all DR functions for scenarios such as power outages, equipment failures, cyber attacks, and natural disasters. [30]

References

  1. "Systems and Operations Continuity: Disaster Recovery". Georgetown University - University Information Services. Archived from the original on 26 Feb 2012. Retrieved 20 July 2024.
  2. "Disaster Recovery and Business Continuity". IBM. Archived from the original on January 11, 2013. Retrieved 20 July 2024.
  3. "What is Business Continuity Management?". Disaster Recovery Institute International. Retrieved 20 July 2024.
  4. "Defending The Data Strata". ForbesMiddleEast.com. December 24, 2013. [permanent dead link]
  5. M. Niemimaa; Steven Buchanan (March 2017). "Information systems continuity process". ACM.com (ACM Digital Library).
  6. "2017 IT Service Continuity Directory" (PDF). Disaster Recovery Journal. Archived from the original (PDF) on 2018-11-30. Retrieved 2018-11-30.
  7. "ISO 22301 to be published Mid May - BS 25999-2 to be withdrawn". Business Continuity Forum. 2012-05-03. Retrieved 2021-11-20.
  8. "Browse the Resource Hub for all the latest content | Axelos". www.axelos.com.
  9. "Like The NFL Draft, Is The Clock The Enemy Of Your Recovery Time". Forbes. April 30, 2015.
  10. "Three Reasons You Can't Meet Your Disaster Recovery Time". Forbes. October 10, 2013.
  11. "Understanding RPO and RTO". DRUVA. 2008. Retrieved February 13, 2013.
  12. "How to fit RPO and RTO into your backup and recovery plans". SearchStorage. Retrieved 2019-05-20.
  13. Richard May. "Finding RPO and RTO". Archived from the original on 2016-03-03.
  14. "Data transfer and synchronization between mobile systems". May 14, 2013.
  15. "Amendment #5 to S-1". SEC.gov. real-time ... provide redundancy and back-up to ...
  16. Peter H. Gregory (2011-03-03). "Setting the Maximum Tolerable Downtime -- setting recovery objectives". IT Disaster Recovery Planning For Dummies. Wiley. pp. 19–22. ISBN 978-1118050637.
  17. William Caelli; Denis Longley (1989). Information Security for Managers. Springer. p. 177. ISBN 1349101370.
  18. "Catastrophe? It Can't Possibly Happen Here". The New York Times. January 29, 1995. .. patient records
  19. "Commercial Property/Disaster Recovery". The New York Times. October 9, 1994. ...the disaster-recovery industry has grown to
  20. Charlie Taylor (June 30, 2015). "US tech firm Sungard announces 50 jobs for Dublin". The Irish Times. Sungard .. founded 1978
  21. Cassandra Mascarenhas (November 12, 2010). "SunGard to be a vital presence in the banking industry". Wijeya Newspapers Ltd. SunGard ... Sri Lanka's future.
  22. SecaaS Category 9 // BCDR Implementation Guidance CSA, retrieved 14 July 2014.
  23. "Threat and Hazard Identification and Risk Assessment (THIRA) and Stakeholder Preparedness Review (SPR): Guide Comprehensive Preparedness Guide (CPG) 201, 3rd Edition" (PDF). US Department of Homeland Security. May 2018.
  24. "Post-Disaster Recovery Planning Forum: How-To Guide, Prepared by Partnership for Disaster Resilience". University of Oregon's Community Service Center, (C) 2007, www.OregonShowcase.org. Retrieved October 29, 2018. [permanent dead link]
  25. "The Importance of Disaster Recovery". Retrieved October 29, 2018.
  26. "IT Disaster Recovery Plan". FEMA. 25 October 2012. Retrieved 11 May 2013.
  27. "Use of the Professional Practices framework to develop, implement, maintain a business continuity program can reduce the likelihood of significant gaps". DRI International. 2021-08-16. Retrieved 2021-09-02.
  28. Gregory, Peter. CISA Certified Information Systems Auditor All-in-One Exam Guide, 2009. ISBN 978-0-07-148755-9. Page 480.
  29. Brandon, John (23 June 2011). "How to Use the Cloud as a Disaster Recovery Strategy". Inc. Retrieved 11 May 2013.
  30. "What Is Disaster Recovery as a Service (DRaaS)? | Definition from TechTarget". Disaster Recovery.
