Incident management

An incident is an event that could lead to loss of, or disruption to, an organization's operations, services or functions. Incident management (IcM) is a term describing the activities of an organization to identify, analyze, and correct hazards to prevent a future re-occurrence. Within a structured organization, incidents are normally dealt with by an incident response team (IRT), an incident management team (IMT), or an Incident Command System (ICS). Without effective incident management, an incident can disrupt business operations, information security, IT systems, employees, customers, or other vital business functions. [1]

Description

An incident is an event that could lead to the loss of, or disruption to, an organization's operations, services or functions. [2] Incident management (IcM) is a term describing the activities of an organization to identify, analyze, and correct hazards to prevent a future re-occurrence. If not managed, an incident can escalate into an emergency, crisis or disaster. Incident management is therefore the process of limiting the potential disruption caused by such an event, followed by a return to business as usual. Without effective incident management, an incident can disrupt business operations, information security, IT systems, employees, customers, or other vital business functions. [1]

Physical incident management

The National Fire Protection Association describes an incident management system (IMS) as "the combination of facilities, equipment, personnel, procedures and communications operating within a common organizational structure, designed to aid in the management of resources during incidents". [3] [4]

Physical incident management is the real-time response that may last for hours, days, or longer. The United Kingdom Cabinet Office has produced the National Recovery Guidance (NRG), which is aimed at local responders as part of the implementation of the Civil Contingencies Act 2004 (CCA). It describes the response as the following: "Response encompasses the actions taken to deal with the immediate effects of an emergency. In many scenarios, it is likely to be relatively short and to last for a matter of hours or days – rapid implementation of arrangements for collaboration, coordination and communication is, therefore, vital. Response encompasses the effort to deal not only with the direct effects of the emergency itself (eg fighting fires, rescuing individuals) but also the indirect effects (eg disruption, media interest)". [5] [6]

The International Organization for Standardization (ISO), the world's largest developer of international standards, also makes the point in its description of ISO 31000:2009, its risk management principles and guidelines document, that "Using ISO 31000 can help organizations increase the likelihood of achieving objectives, improve the identification of opportunities and threats and effectively allocate and use resources for risk treatment". [7] This again shows the importance not just of good planning but also of the effective allocation of resources to treat the risk.

Computer security incident management

Because of the rise of internet crime, a computer security incident response team (CSIRT) now plays an important role in many organizations, and computer intrusions are a common example of an incident faced by companies in developed nations across the world. For example, if an organization discovers that an intruder has gained unauthorized access to a computer system, the CSIRT would analyze the situation, determine the breadth of the compromise, and take corrective action.

According to data on hacking incidents in 2009, over half of the world's hacking attempts on transnational corporations (TNCs) took place in North America (57%), and 23% took place in Europe. [8] A well-rounded computer security incident response team is integral to providing a secure environment for any organization, and is becoming a critical part of the overall design of many modern networking teams.

Roles

Incidents within a structured organization are normally dealt with by either an incident response team (IRT), or an incident management team (IMT). These are often designated beforehand or during the event and are placed in control of the organization whilst the incident is dealt with, to restore normal functions. The incident commander manages the response to a security incident and leads the members of the incident response team(s) through the process, as defined by the Incident Command System (ICS). [9]

Usually, as part of the wider management process in private organizations, incident management is followed by post-incident analysis where it is determined why the incident happened despite precautions and controls. This analysis is normally overseen by the leaders of the organization, with the view of preventing a repetition of the incident through precautionary measures and often changes in policy. This information is then used as feedback to further develop the security policy and/or its practical implementation. In the United States, the National Incident Management System, developed by the Department of Homeland Security, integrates effective practices in emergency management into a comprehensive national framework. This often results in a higher level of contingency planning, exercise and training, as well as an evaluation of the management of the incident. [10]

Root cause analysis

Human factors

During root cause analysis, human factors should be assessed. James Reason conducted a study into the understanding of adverse effects of human factors. [11] The study found that major incident investigations, such as those into Piper Alpha and the King's Cross Underground fire, made it clear that the causes of the accidents were distributed widely within and outside the organization. There are two types of failure: active failures, which are actions that have immediate effects and are likely to cause an accident, and latent (delayed) failures, whose effects can take years to appear and which usually combine with triggering events to cause the accident.

Latent failures are created as the result of decisions taken at the higher echelons of an organisation. Their damaging consequences may lie dormant for a long time, only becoming evident when they combine with local triggering factors (e.g., the spring tide, the loading difficulties at Zeebrugge harbour, etc.) to breach the system's defences. Decisions on planning, scheduling, forecasting, designing, policymaking and similar matters can therefore have a slow-burning effect that makes an accident more likely. The actual unsafe act that triggers an accident can be traced back through the organization to expose the subsequent failures, showing the accumulation of latent failures within the system as a whole that made the accident more likely and ultimately led to it. Better improvement actions can then be applied to reduce the likelihood of the event happening again. [12]

Field-specific implementation

IT service management

Incident management is an important process area within IT service management (ITSM). [13] The first goal of the incident management process is to restore normal service operation as quickly as possible and to minimize the impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. 'Normal service operation' is defined here as service operation within the service-level agreement (SLA). Incident management is one process area within the broader ITIL and ISO 20000 environment.
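As a rough illustration of "service operation within the SLA", the following Python sketch checks whether the time taken to restore service exceeds an agreed target. The priority labels, the target durations, and the names `Incident`, `time_to_restore` and `breaches_sla` are all hypothetical and are not taken from ITIL or ISO 20000; real targets come from the organization's own agreements.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical restoration targets per priority (assumed values for illustration).
SLA_TARGETS = {
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
}


@dataclass
class Incident:
    incident_id: str
    priority: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None  # None while the incident is still open


def time_to_restore(incident: Incident, now: datetime) -> timedelta:
    """Elapsed time from opening to resolution, or to `now` if still open."""
    end = incident.resolved_at if incident.resolved_at is not None else now
    return end - incident.opened_at


def breaches_sla(incident: Incident, now: datetime) -> bool:
    """True if the elapsed time exceeds the target for the incident's priority."""
    return time_to_restore(incident, now) > SLA_TARGETS[incident.priority]


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 9, 0)
    still_open = Incident("INC-2", "P2", opened)
    print(breaches_sla(still_open, opened + timedelta(hours=5)))
```

In this sketch an open incident is measured against the current time, so a breach can be detected while the incident is still being worked on rather than only after it is closed.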

ISO 20000 defines the objective of incident management (part 1, 8.2) as: "To restore agreed service to the business as soon as possible or to respond to service requests." [14]

ITIL 2011 defines an incident as:

an unplanned interruption to an IT service or reduction in the quality of an IT service or a failure of a Configuration Item that has not yet impacted an IT service (for example failure of one disk from a mirror set). [15]

The ITIL incident management process ensures that normal service operation is restored as quickly as possible and the business impact is minimized. [ITIL Service Operation. AXELOS. 30 May 2007. ISBN 978-0113310463.]

The main challenges and causes of problems in incident management are:

  1. Constantly increasing alert and event noise
  2. A complex and lengthy IT problem resolution process
  3. An inability to effectively predict and prevent IT service degradations or outages [16]
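One common way to reduce alert noise is to group repeated alerts so that responders see one item per underlying problem. The Python sketch below is a minimal illustration under assumed conventions: the `Alert` fields, the grouping key of source and check name, and the 300-second window are hypothetical choices, not something prescribed by ITIL or by the source cited above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    source: str       # hypothetical: host or service that raised the alert
    check: str        # hypothetical: name of the failing check
    timestamp: float  # seconds since an arbitrary epoch


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts with the same (source, check) that occur within
    `window_seconds` of the first alert in their group."""
    groups: List[List[Alert]] = []
    latest_group: Dict[Tuple[str, str], int] = {}  # key -> index of its newest group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.check)
        idx = latest_group.get(key)
        if idx is not None and alert.timestamp - groups[idx][0].timestamp <= window_seconds:
            groups[idx].append(alert)  # repeat of a known problem
        else:
            latest_group[key] = len(groups)  # start a new group for this key
            groups.append([alert])
    return groups


if __name__ == "__main__":
    sample = [
        Alert("db1", "cpu", 0.0),
        Alert("db1", "cpu", 60.0),
        Alert("web1", "cpu", 30.0),
        Alert("db1", "cpu", 400.0),
    ]
    for group in group_alerts(sample):
        print(group[0].source, group[0].check, len(group))
```

A fixed window measured from the first alert in a group is only one possible design; a sliding window or correlation across related sources would group alerts differently.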

References

  1. "What qualifies as an 'incident'?". Business Link. Archived from the original on 2011-06-15. Retrieved 2018-01-04.
  2. "Dictionary of business continuity management terms" (PDF). Business Continuity Institute. Archived from the original (PDF) on 2015-04-30. Retrieved 2015-09-03.
  3. "List of NFPA Codes and Standards". National Fire Protection Association. 2013. Retrieved 10 April 2013.
  4. "Incident Management". Ready.gov. 2012. Archived from the original on 12 April 2013. Retrieved 10 April 2013.
  5. "National Recovery Guidance". GOV.UK. 2007. Retrieved 10 April 2013.
  6. "Civil Contingencies Act 2004". legislation.gov.uk. 2012. Retrieved 10 April 2013.
  7. "ISO 31000 Risk management". International Organization for Standardization. 2009. Retrieved 13 April 2013.
  8. "Hacking Incidents 2009 – Interesting Data". Roger's Security Blog. TechNet Blogs. 12 Mar 2010. Archived from the original on Sep 24, 2012. Retrieved 2012-11-17.
  9. FEMA. "Incident Command System" (PDF). Retrieved 2024-01-30.
  10. "About the Contingency Planning and Incident Management Division". Homeland Security. Archived from the original on April 2, 2012. Retrieved 2012-11-17.
  11. Reason J (June 1995). "Understanding adverse events: human factors". Quality in Health Care. 4 (2): 80–9. doi:10.1136/qshc.4.2.80. PMC 1055294. PMID 10151618.
  12. O'Callaghan, Katherine Mary. Incident Management: Human Factors and Minimising Mean Time to Restore. Ph.D. thesis, Australian Catholic University, 2010. Archived 2011-09-17 at the Wayback Machine.
  13. "Incident management is now a necessity for the enterprise".
  14. "The BPM-D Application". Gov.UK Digital Marketplace.
  15. ITIL Service Operation. United Kingdom: The Stationery Office. 2011. ISBN 9780113313075.
  16. "Why automatic context enrichment for alert and incident management is critical for operations?". 3 December 2019.
