Downtime

The term downtime (also (system) outage) is used to refer to periods when a system is unavailable. The unavailability is the proportion of a time-span that a system is unavailable or offline. This is usually a result of the system failing to function because of an unplanned event, or because of routine maintenance (a planned event).

The terms are commonly applied to networks and servers. The common reasons for unplanned outages are system failures (such as a crash) or communications failures (commonly known as network outage). For outages due to issues with general computer systems, the term computer outage (also IT outage) can be used.

The term is also commonly applied in industrial environments in relation to failures in industrial production equipment. Some facilities measure the downtime incurred during a work shift, or during a 12- or 24-hour period. Another common practice is to identify each downtime event as having an operational, electrical or mechanical origin.
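The per-shift bookkeeping described above can be sketched as a simple tally. This is a minimal illustration only: the event log, the minutes lost and the function name `downtime_by_origin` are hypothetical, and only the three origin categories come from the text.

```python
from collections import defaultdict

# Hypothetical downtime log for one work shift: (origin, minutes lost).
shift_events = [
    ("mechanical", 25),
    ("electrical", 10),
    ("operational", 5),
    ("mechanical", 15),
]

def downtime_by_origin(events):
    """Sum the minutes of downtime recorded for each origin category."""
    totals = defaultdict(int)
    for origin, minutes in events:
        totals[origin] += minutes
    return dict(totals)

totals = downtime_by_origin(shift_events)
print(totals)                 # minutes lost per origin
print(sum(totals.values()))   # total minutes lost in the shift
```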

The opposite of downtime is uptime.

Types

Industry standards for the terms "Outage Duration" and "Maintenance Duration" can define different points of initiation and completion, so the following clarifications should be used to avoid conflicts in contract execution:

  1. "Turnkey": This is the most encompassing of all outage types. The outage or maintenance starts when the operator of the plant or equipment presses the shutdown or stop button to halt operation. Unless otherwise noted, it is considered complete when the plant or equipment is back in normal operation: ready to begin manufacturing, ready to be synchronized with the system or grid, or ready to perform its duties as a pump or compressor.
  2. "Breaker to Breaker": This outage or maintenance starts when the operator of the plant or equipment removes the power circuit from operation (main power breaker at "off", "disengaged" or "On-Cooldown"), but not the control circuit. This still allows the equipment to be cooled down or brought to ambient temperature so that outage or maintenance work can be prepared or initiated. Depending on the equipment type, a "Breaker to Breaker" outage can be advantageous when contracting out controls-related maintenance, as this type of work can be performed while the main equipment is still on cool-down or on stand-by. Unless otherwise noted, this type of outage is considered complete when the power circuit is re-energized by engaging the power breaker.
  3. "Completion of Lock-out/Tag-out": This outage or maintenance (sometimes mistaken for "Off-Cooldown", which is not the same) starts when the operator of the plant or equipment has removed the power circuit, disengaged the control circuit and neutralized other potential power and hazard sources (typically called Lock-Out, Tag-Out, or "LOTO"). This point is typically the last phase of the outage initiation stage before actual work starts on the facility, plant or equipment. A safety briefing should always follow the LOTO activity, before any work is conducted. Unless otherwise noted, this type of outage is considered complete when the equipment has reached mechanical completion and is ready for the next step (slow-roll for many types of heavy rotating equipment, a bump test or rotation check for motors, etc.), but this must follow the return of the work permit per LOTO procedures.

Any on-line testing, performance testing and tuning required should not count towards the outage duration, as these activities are typically conducted after the completion of the outage or maintenance event and are outside the control of most maintenance contractors.
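To see how much the three definitions above can differ for the same event, the following sketch computes the outage duration under each one. The milestone names and timestamps are hypothetical, chosen only for illustration, and post-outage testing and tuning are deliberately left out of every definition.

```python
from datetime import datetime

# Hypothetical milestones for one maintenance outage (illustrative values).
events = {
    "stop_button_pressed": datetime(2024, 3, 1, 6, 0),
    "power_breaker_opened": datetime(2024, 3, 1, 8, 0),
    "loto_completed": datetime(2024, 3, 1, 14, 0),
    "mechanical_completion": datetime(2024, 3, 5, 18, 0),
    "power_breaker_closed": datetime(2024, 3, 6, 2, 0),
    "normal_operation": datetime(2024, 3, 6, 10, 0),
}

# Start and end milestone for each contractual definition.
DEFINITIONS = {
    "turnkey": ("stop_button_pressed", "normal_operation"),
    "breaker_to_breaker": ("power_breaker_opened", "power_breaker_closed"),
    "completion_of_loto": ("loto_completed", "mechanical_completion"),
}

def outage_hours(events, definition):
    """Return the outage duration in hours under the given definition."""
    start_key, end_key = DEFINITIONS[definition]
    delta = events[end_key] - events[start_key]
    return delta.total_seconds() / 3600

for name in DEFINITIONS:
    print(f"{name}: {outage_hours(events, name):.1f} h")
```

With these made-up timestamps the same outage lasts 124, 114 or 100 hours depending on the definition, which is why a contract should state which one applies.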

Characteristics

Unplanned downtime may result from an equipment malfunction, among other causes.

Telecommunication outage classifications

Downtime can be caused by failure in hardware (physical equipment), software (logic controlling equipment), interconnecting equipment (such as cables, facilities and routers), transmission (wireless, microwave, satellite), and/or capacity (system limits).

The failures can occur because of damage, failure, design, procedural (improper use by humans), engineering (how the system is used and deployed), overload (traffic or system resources stressed beyond designed limits), environment (support systems like power and HVAC), scheduled (outages designed into the system for a purpose such as software upgrades and equipment growth), other (none of the above but known), or unknown.

The failures can be the responsibility of customer/service provider, vendor/supplier, utility, government, contractor, end customer, public individual, act of nature, other (none of the above but known), or unknown.

Impact

Outages caused by system failures can have a serious impact on the users of computer/network systems, in particular those industries that rely on a nearly 24-hour service:

Also affected can be the users of an ISP and other customers of a telecommunication network.

Corporations can lose business due to a network outage, or they may default on a contract, resulting in financial losses. According to Veeam's 2019 cloud data management report, organizations encounter unplanned downtime, on average, 5-10 times per year, with the average cost of one hour of downtime being $102,450. [1]

Those people or organizations that are affected by downtime can be more sensitive to particular aspects:

The most demanding users are those that require high availability.

Famous outages

On Mother's Day, Sunday, May 8, 1988, a fire broke out in the main switching room of the Hinsdale Central Office of the Illinois Bell telephone company. One of the largest switching systems in the state, the facility processed more than 3.5 million calls each day while serving 38,000 customers, including numerous businesses, hospitals, and Chicago's O'Hare and Midway Airports. [2]

Virtually the entire AT&T network of 4ESS toll tandem switches went in and out of service over and over again on January 15, 1990, disrupting long-distance service for the entire United States. The problem dissipated by itself when traffic slowed down. A software bug was found. [3]

AT&T lost its Frame Relay network for 26 hours on April 13, 1998. [4] This affected many thousands of customers, and bank transactions were one casualty. AT&T failed to meet the service level agreement on their contracts with customers and had to refund [5] 6,600 customer accounts, costing millions of dollars.

Xbox Live had intermittent downtime during the 2007–2008 holiday season which lasted thirteen days. [6] Increased demand from Xbox 360 purchasers (the largest number of new user sign-ups in the history of Xbox Live) was given as the reason for the downtime; in order to make amends for the service issues, Microsoft offered their users the opportunity to receive a free game. [7]

Sony's PlayStation Network outage began on April 20, 2011, and service was gradually restored from May 14, 2011, starting in the United States. It was the longest period the PSN had been offline since its inception in 2006. Sony stated that the problem was caused by an external intrusion which resulted in the compromise of personal information. Sony reported on April 26, 2011, that a large amount of user data had been obtained in the same hack that caused the downtime. [8]

Telstra's Ryde switch failed in late 2011 after water entered the electrical switchboard during continuing wet weather. The Ryde switch is one of the largest switches by area in Australia, and the failure affected more than 720,000 services. [citation needed]

The Miami datacenter of ServerAxis went offline unannounced on February 29, 2016, and was never restored. This impacted multiple providers and hundreds of websites. The outage impacted coverage of the 2016 NCAA Division I women's basketball tournament as WBBState, one of the affected sites, was by far the most comprehensive provider of women's basketball statistics available. [9]

The game platform Roblox had an outage in late October 2021, during its Chipotle event. Many users attributed the outage to the event, which drew heavy interest because users could get a free Chipotle burrito during it. The outage was Roblox's longest downtime, lasting 3 days. [10] [11] [12]

On July 8, 2022, Rogers suffered a major nationwide outage in Canada. This simultaneously affected cell phone and internet access, causing 911 calls and interbank transactions to fail and disrupting government services.

On July 19, 2024, CrowdStrike issued a faulty device driver update for its Falcon software, causing Windows PCs, servers, and virtual machines to crash and enter boot loops. The incident affected approximately 8.5 million Windows machines worldwide, including critical infrastructure such as 911 services in various states. It is considered to be the largest outage in the history of information technology. [13] [14]

Service levels

In service level agreements, it is common to state a percentage value (per month or per year) that is calculated by dividing the sum of all downtime timespans by the total time of a reference timespan (e.g. a month). 0% downtime means that the server was available all the time.

For Internet servers, downtime above 1% per year can be regarded as unacceptable, as this means more than 3 days of downtime per year. For e-commerce and other industrial use, any value above 0.1% is usually considered unacceptable. [15]
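The arithmetic behind these percentages can be checked with a short sketch. The function names and the sample outage durations are illustrative assumptions, and a year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

def downtime_percent(downtimes_hours, period_hours):
    """Sum of all downtime spans divided by the reference period, in percent."""
    return 100.0 * sum(downtimes_hours) / period_hours

def allowed_downtime_hours(max_downtime_percent, period_hours):
    """Hours of downtime permitted by a given downtime percentage."""
    return period_hours * max_downtime_percent / 100.0

# Downtime budget per year at the two thresholds mentioned above.
print(allowed_downtime_hours(1.0, HOURS_PER_YEAR) / 24)  # days per year at 1%
print(allowed_downtime_hours(0.1, HOURS_PER_YEAR))       # hours per year at 0.1%

# Hypothetical 30-day month with three outages of 1.5 h, 0.5 h and 2 h.
print(downtime_percent([1.5, 0.5, 2.0], 30 * 24))
```

A 1% budget works out to about 3.65 days per year, and a 0.1% budget to about 8.76 hours per year.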

Response and reduction of impact

It is the duty of the network designer to make sure that a network outage does not happen. When it does happen, a well-designed system will further reduce the effects of an outage by having localized outages which can be detected and fixed as soon as possible.

A process needs to be in place to detect a malfunction (network monitoring) and to restore the network to a working condition. Restoration generally involves a help desk team of trained engineers that can troubleshoot the problem; a separate help desk team is usually necessary to field user input, which can be particularly demanding during a downtime.

A network management system can be used to detect faulty or degrading components prior to customer complaints, with proactive fault rectification.

Risk management techniques can be used to determine the impact of network outages on an organisation and what actions may be required to minimise risk. Risk may be minimised by using reliable components, by performing maintenance such as upgrades, by using redundant systems, or by having a contingency plan or business continuity plan. Technical means can reduce errors with error-correcting codes, retransmission, checksums, or diversity schemes.
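As a rough illustration of two of these technical means, the sketch below combines a checksum with retransmission. It uses Python's standard `zlib.crc32`; the frame layout, the function names and the simulated `flaky_channel` are assumptions made for the example, not a description of any particular protocol.

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None

def send_with_retransmission(payload, channel, max_attempts=3):
    """Resend the frame until the receiver-side check passes."""
    for attempt in range(1, max_attempts + 1):
        received = check_frame(channel(make_frame(payload)))
        if received is not None:
            return received, attempt
    raise IOError("frame could not be delivered intact")

def flaky_channel():
    """Simulated link that flips one bit in the first transmission only."""
    calls = {"n": 0}
    def channel(frame: bytes) -> bytes:
        calls["n"] += 1
        if calls["n"] == 1:
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame
    return channel

data, attempts = send_with_retransmission(b"status: ok", flaky_channel())
print(data, attempts)
```

The corrupted first frame fails the checksum and is sent again, so the data arrives intact on the second attempt instead of the error propagating.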

One of the biggest causes of downtime is misconfiguration, where a planned change goes wrong. Typically organisations rely on manual effort to manage the process of configuration backups, but this requires highly skilled engineers with the time to manage the process across a multi-vendor network. Automation tools are available to manage backups, but there are very few solutions that handle configuration recovery which is needed to minimize the overall impact of the outage. [16]

Planning

A planned outage is the result of a planned activity by the system owner and/or by a service provider. These outages, often scheduled during the maintenance window, can be used to perform tasks including the following:

Outages can also be planned as a result of a predictable natural event, such as Sun outage.

Maintenance downtimes have to be carefully scheduled in industries that rely on computer systems. In many cases, system-wide downtimes can be averted using what is called a "rolling upgrade": the process of incrementally taking down parts of the system for upgrade, without affecting the overall functionality.
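A rolling upgrade can be sketched as a loop that takes one node out of service at a time. The node names and the `upgrade`, `is_healthy` and `min_in_service` parameters are hypothetical; a real deployment would also need load-balancer draining and rollback handling, which are omitted here.

```python
def rolling_upgrade(nodes, upgrade, is_healthy, min_in_service):
    """Upgrade nodes one at a time, keeping at least `min_in_service` serving."""
    in_service = set(nodes)
    if len(in_service) - 1 < min_in_service:
        raise ValueError("not enough nodes to upgrade without an outage")
    for node in nodes:
        in_service.discard(node)   # take one node out of rotation
        upgrade(node)
        if not is_healthy(node):
            # Stop here: the remaining nodes keep serving on the old version.
            raise RuntimeError(f"{node} failed its health check; upgrade halted")
        in_service.add(node)       # return it before touching the next one
    return sorted(in_service)

# Hypothetical three-node cluster, tracked as node name -> software version.
versions = {"web-1": "1.0", "web-2": "1.0", "web-3": "1.0"}
result = rolling_upgrade(
    list(versions),
    upgrade=lambda n: versions.__setitem__(n, "1.1"),
    is_healthy=lambda n: versions[n] == "1.1",
    min_in_service=2,
)
print(result, versions)
```

Because only one node is out of rotation at any moment, the cluster as a whole keeps serving throughout the upgrade.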

Avoidance

For most websites, website monitoring is available. Website monitoring (synthetic or passive) is a service that "monitors" downtime and users on the site.

Other usage

Downtime can also refer to time when human capital or other assets are unavailable. For instance, if employees are in meetings or unable to perform their work due to another constraint, they are down. This can be equally expensive, and can be the result of another asset (e.g. a computer or other system) being down. This is also commonly known as "dead time".

Downtime is also generalized in a personal sense, being used to refer to a period of sleep or recreation. [17] [18] [19]

The term is also used in factories and other industrial settings. See total productive maintenance (TPM).

Measuring downtime

There are many external services which can be used to monitor the uptime and downtime as well as availability of a service or a host.
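What such a service computes from its periodic checks can be sketched as follows. This assumes a time-ordered log of (timestamp, is_up) probe results, with each result taken to hold until the next probe; the probing itself (for example an HTTP request) is left out, and the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta

def summarize_checks(checks):
    """Return (total downtime, availability in percent) for a probe log."""
    up = down = timedelta(0)
    # Pair each probe with the next one; the interval takes the first probe's state.
    for (t0, ok), (t1, _) in zip(checks, checks[1:]):
        if ok:
            up += t1 - t0
        else:
            down += t1 - t0
    total = up + down
    availability = 100.0 * (up / total) if total else None
    return down, availability

# Hypothetical probes taken once a minute; two consecutive probes fail.
start = datetime(2024, 1, 1, 0, 0)
results = [True, True, False, False, True, True, True, True, True, True, True]
checks = [(start + timedelta(minutes=i), ok) for i, ok in enumerate(results)]

down, availability = summarize_checks(checks)
print(down, availability)
```

In this made-up log, two of the ten one-minute intervals are down, giving 2 minutes of downtime and 80% availability. The probe interval limits the resolution: an outage shorter than the interval can go unnoticed.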

References

  1. "2021 Data Protection Trends Executive Brief". Veeam Software.
  2. Neumann, Peter G.; Weinstock, Chuck; Townson, Patrick (May 11, 1988). "Risks of Single Point Failures: The Hinsdale Fire". The RISKS Digest. 6 (82). Archived from the original on October 6, 2022 via The Catless Web Server. Excerpted from TELECOM Digest. 8 (76).
  3. Neumann, Peter G. (February 26, 1990). "The Crash of the AT&T Network in 1990". Telephone World. The Risks Digest. Archived from the original on Dec 19, 2022.
  4. "Preventing IP Network Service Outages" (PDF). Agilent Technologies. March 15, 2002. Archived from the original (PDF) on Sep 28, 2018.
  5. Neumann, Peter G.; Bellovin, Steve; Byrnes, Jim; Newell, Ruthlyn (May 7, 1998). "AT&T Announces Cause of Frame Relay Network Outage". The RISKS Digest. 19 (72) via The Catless Web Server.
  6. Block, Ryan (2008-01-03). "Xbox Live outage, day 13: still up and down, still preventing fun from being had". Engadget. Archived from the original on Jan 27, 2012. Retrieved 2011-04-27.
  7. Cohen, Peter (January 4, 2008). "Microsoft offers free game for Xbox Live holiday problems". PC World. Macworld. Archived from the original on 2011-12-01.
  8. "Restoration of PlayStation®Network and Qriocity Services begins". Sony Group Portal - Sony Global Headquarters. May 15, 2011. Retrieved 2021-10-22.
  9. Levy, Ian (2016-03-16). "A Website Went Offline And Took Most Of Women's College Basketball Analytics With It". FiveThirtyEight. Archived from the original on Sep 30, 2023.
  10. Plant, Logan (29 October 2021). "Roblox's Servers Are Back Online [Update]". IGN. Archived from the original on Oct 17, 2023.
  11. Finnis, Alex. "Is Roblox down? Why the gaming platform isn't working today with thousands of users reporting login problems". MSN. Archived from the original on Nov 15, 2021.
  12. "Roblox was down all weekend, and not because of Chipotle". 30 October 2021.
  13. Milmo, Dan; Kollewe, Julia; Quinn, Ben; Taylor, Josh; Ibrahim, Mimi (2024-07-20). "Slow recovery from IT outage begins as experts warn of future risks". The Guardian. ISSN 0261-3077. Retrieved 2024-07-21.
  14. Weston, David (2024-07-20). "Helping our customers through the CrowdStrike outage". The Official Microsoft Blog. Retrieved 2024-07-21.
  15. Cohen, Gad. "Downtime, Outages and Failures - Understanding Their True Costs". www.evolven.com. Retrieved 2021-10-22.
  16. "Why Machine Downtime Tracking Matters?". Evocon. 10 September 2018. Retrieved 2021-10-22.
  17. "Rest & Relaxation: Why "Downtime" Is Important For Kids". 19 September 2016.
  18. "The Importance of Scheduling Downtime". 25 August 2008.
  19. "What Lack of Sleep Does to Your Mind". Many people think of sleep simply as a luxury -- a little downtime.