A resilient control system is one that maintains state awareness and an accepted level of operational normalcy in response to disturbances, including threats of an unexpected and malicious nature. [1]
Computerized or digital control systems are used to reliably automate many industrial operations such as power plants or automobiles. The complexity of these systems and how the designers integrate them, the roles and responsibilities of the humans that interact with the systems, and the cyber security of these highly networked systems have led to a new paradigm in research philosophy for next-generation control systems. Resilient Control Systems consider all of these elements and those disciplines that contribute to a more effective design, such as cognitive psychology, computer science, and control engineering to develop interdisciplinary solutions. These solutions consider things such as how to tailor the control system operating displays to best enable the user to make an accurate and reproducible response, how to design in cybersecurity protections such that the system defends itself from attack by changing its behaviors, and how to better integrate widely distributed computer control systems to prevent cascading failures that result in disruptions to critical industrial operations.
In the context of cyber-physical systems, resilient control systems focus on the unique interdependencies of a control system, as compared to information technology computer systems and networks, given its importance in operating critical industrial operations.
Originally intended to provide a more efficient mechanism for controlling industrial operations, the development of digital control systems allowed for flexibility in integrating distributed sensors and operating logic while maintaining a centralized interface for human monitoring and interaction. [2] This ease of readily adding sensors and logic through software, which was once done with relays and isolated analog instruments, has led to wide acceptance and integration of these systems in all industries. However, these digital control systems have often been integrated in phases to cover different aspects of an industrial operation, connected over a network, leading to a complex interconnected and interdependent system. [3] While the control theory applied is often nothing more than a digital version of its analog counterpart, the dependence of digital control systems upon communications networks has precipitated the need for cybersecurity due to potential effects on the confidentiality, integrity and availability of information. [4] To achieve resilience in the next generation of control systems, therefore, addressing the complex control system interdependencies, including human systems interaction and cybersecurity, is a recognized challenge. [1]
From a philosophical standpoint, advancing the area of resilient control systems requires a definition, metrics, and consideration of the challenges and the associated disciplinary fusion needed to address them. From these follows the value proposition for investment and adoption. Each of these topics is discussed in what follows, but for perspective consider Fig. 1. [5]
Research in resilience engineering over the last decade has focused on two areas, organizational and information technology. Organizational resilience considers the ability of an organization to adapt and survive in the face of threats, including the prevention or mitigation of unsafe, hazardous or compromising conditions that threaten its very existence. [6] Information technology resilience has been considered from a number of standpoints. [7] Networking resilience has been considered as quality of service. [8] Computing has considered such issues as dependability and performance in the face of unanticipated changes. [9] However, based upon the application of control dynamics to industrial processes, functionality and determinism are primary considerations that are not captured by the traditional objectives of information technology. [10]
Considering the paradigm of control systems, one suggested definition is that "resilient control systems are those that tolerate fluctuations via their structure, design parameters, control structure and control parameters". [11] However, this definition is taken from the perspective of control theory application to a control system. It does not directly consider the malicious actor and cyber security, which motivated an alternative proposed definition: "an effective reconstitution of control under attack from intelligent adversaries". [12] However, this definition focuses only on resilience in response to a malicious actor. To consider the cyber-physical aspects of control systems, a definition for resilience considers both benign and malicious human interaction, in addition to the complex interdependencies of the control system application. [13]
The term "recovery" has been used in the context of resilience, paralleling the response of a rubber ball that stays intact when a force is exerted on it and recovers its original dimensions after the force is removed. [14] Considering the rubber ball as a system, resilience could then be defined as its ability to maintain a desired level of performance or normalcy without irrecoverable consequences. While resilience in this context is based upon the yield strength of the ball, control systems require an interaction with the environment, namely the sensors, valves and pumps that make up the industrial operation. To be reactive to this environment, control systems require an awareness of their state in order to make corrective changes to the industrial process and maintain normalcy. [15] With this in mind, in consideration of the discussed cyber-physical aspects of human systems integration and cyber security, as well as other definitions for resilience at a broader critical infrastructure level, [16] [17] the following can be deduced as a definition of a resilient control system:
A resilient control system is one that maintains state awareness and an accepted level of operational normalcy in response to disturbances, including threats of an unexpected and malicious nature [1]
Considering the flow of a digital control system as a basis, a resilient control system framework can be designed. Referring to the left side of Fig. 2, a resilient control system holistically considers the measures of performance or normalcy for the state space. At the center, an understanding of performance and priority provide the basis for an appropriate response by a combination of human and automation, embedded within a multi-agent, semi-autonomous framework. Finally, to the right, information must be tailored to the consumer to address the need and position a desirable response. Several examples or scenarios of how resilience differs and provides benefit to control system design are available in the literature. [18] [13]
Some primary tenets of resilience, as contrasted with traditional reliability, have presented themselves in considering an integrated approach to resilient control systems. [19] [20] [21] These cyber-physical tenets complement the fundamental concept of dependable or reliable computing by characterizing resilience in regard to control system concerns, including design considerations that provide a level of understanding and assurance in the safe and secure operation of an industrial facility. These tenets are discussed individually below to summarize some of the challenges to address in order to achieve resilience.
The benign human has an ability to quickly understand novel solutions and to adapt to unexpected conditions. This behavior can provide additional resilience to a control system, [22] but reproducibly predicting human behavior is a continuing challenge. The ability to capture historic human preferences can be applied to Bayesian inference and Bayesian belief networks, but ideally a solution would consider direct understanding of human state using sensors such as an EEG. [23] [24] Considering control system design and interaction, the goal would be to tailor the amount of automation necessary to achieve some level of optimal resilience for this mixed-initiative response. [25] The human would be presented with the actionable information that provides the basis for a targeted, reproducible response. [26]
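To make the Bayesian inference idea concrete, the following is a minimal sketch of updating a belief about operator state from observed response times. The two states, the prior, and the likelihood values are illustrative assumptions, not measured data.

```python
# Hypothetical sketch: Bayes-rule update of an operator-state estimate
# ("attentive" vs "overloaded") from discrete response-time observations.
# All probabilities below are illustrative assumptions.

PRIOR = {"attentive": 0.8, "overloaded": 0.2}

# Assumed likelihood of a fast/slow acknowledgement given each state.
LIKELIHOOD = {
    "attentive":  {"fast": 0.7, "slow": 0.3},
    "overloaded": {"fast": 0.2, "slow": 0.8},
}

def update(belief, observation):
    """One Bayes-rule update over the two hypothesized operator states."""
    unnormalized = {s: belief[s] * LIKELIHOOD[s][observation] for s in belief}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

belief = PRIOR
for obs in ["slow", "slow"]:      # two slow acknowledgements in a row
    belief = update(belief, obs)

# With growing evidence of overload, a mixed-initiative design could
# shift more of the response burden to automation.
overloaded_now = belief["overloaded"] > 0.5
```

In a mixed-initiative framework, such a belief could drive how much automation is engaged; richer designs would replace the two-state model with a belief network over multiple operator and plant variables.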
In contrast to the challenges of predicting and integrating the benign human with control systems, the abilities of the malicious actor (or hacker) to undermine desired control system behavior also create a significant challenge to control system resilience. [27] Application of dynamic probabilistic risk analysis used in human reliability can provide some basis for the benign actor. [28] However, the decidedly malicious intentions of an adversarial individual, organization or nation make the human difficult to model, variable in both objectives and motives. [29] In defining a control system response to such intentions, however, the malicious actor depends upon some level of recognizable behavior to gain an advantage and provide a pathway to undermining the system. Whether performed separately in preparation for a cyber attack, or on the system itself, these behaviors can provide opportunity for a successful attack without detection. Therefore, in considering resilient control system architecture, atypical designs that embed actively and passively implemented randomization of attributes would be suggested to reduce this advantage. [30] [31]
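One way to randomize an observable attribute is to hop a logical endpoint across ports on a keyed pseudo-random schedule, so reconnaissance done in one interval is stale in the next. The sketch below is a hypothetical illustration; the shared key, port pool, and epoch scheme are assumptions, not a specific deployed scheme from the literature.

```python
# Minimal sketch of attribute randomization ("moving-target" style):
# a control-network endpoint is re-mapped each epoch to a port derived
# from a shared secret. Defenders holding the key can predict the hop;
# an observer without it cannot. Key, pool, and epoch are illustrative.
import hashlib

PORT_POOL = range(49152, 65536)   # dynamic/private port range

def endpoint_port(secret_key: bytes, epoch: int) -> int:
    """Derive this epoch's port from the shared secret and epoch number."""
    digest = hashlib.sha256(secret_key + epoch.to_bytes(8, "big")).digest()
    index = int.from_bytes(digest[:4], "big") % len(PORT_POOL)
    return PORT_POOL[index]

key = b"example-shared-secret"    # hypothetical key material
ports = [endpoint_port(key, e) for e in range(4)]
```

The same principle extends to other attributes an adversary might fingerprint, such as addresses or protocol timing, at the cost of added coordination among legitimate components.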
While much of the current critical infrastructure is controlled by a web of interconnected control systems, under architectures termed either distributed control systems (DCS) or supervisory control and data acquisition (SCADA), the application of control is moving toward a more decentralized state. In moving to a smart grid, the complex interconnected nature of individual homes, commercial facilities and diverse power generation and storage creates both an opportunity and a challenge to ensuring that the resulting system is more resilient to threats. [32] [33] The ability to operate these systems to achieve a global optimum for multiple considerations, such as overall efficiency, stability and security, will require mechanisms to holistically design complex networked control systems. [34] [35] Multi-agent methods suggest a mechanism to tie a global objective to distributed assets, allowing for management and coordination of assets for optimal benefit, and semi-autonomous but constrained controllers that can react rapidly to maintain resilience under rapidly changing conditions. [36] [37]
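The multi-agent idea of tying a global objective to constrained local assets can be sketched as follows. This is a deliberately simplified greedy dispatch, assuming illustrative asset names and capacities; real coordination schemes would negotiate iteratively and optimize against multiple objectives.

```python
# Hypothetical multi-agent sketch: a coordinator allocates a global
# power setpoint across distributed assets, while each agent clips the
# request to its own local constraint and reports its shortfall.

class AssetAgent:
    def __init__(self, name, capacity_kw):
        self.name = name
        self.capacity_kw = capacity_kw
        self.output_kw = 0.0

    def request(self, kw):
        """Accept as much of the request as the local constraint allows."""
        self.output_kw = min(kw, self.capacity_kw)
        return kw - self.output_kw        # unmet remainder

def dispatch(agents, setpoint_kw):
    """Greedy coordination: pass the unmet remainder down the fleet."""
    remainder = setpoint_kw
    for agent in agents:
        remainder = agent.request(remainder)
    return setpoint_kw - remainder        # total delivered

fleet = [AssetAgent("battery", 50.0),     # illustrative capacities
         AssetAgent("tie-line", 120.0),
         AssetAgent("dstatcom", 30.0)]
delivered = dispatch(fleet, 150.0)
```

Because each agent enforces its own constraint, a compromised or failed asset degrades delivery gracefully rather than invalidating the global plan, which is one motivation for semi-autonomous but constrained controllers.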
Establishing a metric that can capture the resilience attributes can be complex, at least if considered based upon differences between the interactions or interdependencies. Evaluating the control, cyber and cognitive disturbances, especially from a disciplinary standpoint, leads to measures that have already been established. However, if the metric were instead based upon a normalizing dynamic attribute, such as a performance characteristic that can be impacted by degradation, an alternative is suggested. Specifically, applications of base metrics to resilience characteristics are given as follows for each type of disturbance:
Such performance characteristics exist with both time and data integrity. Time, both in terms of delay of mission and communications latency, and data, in terms of corruption or modification, are normalizing factors. In general, the idea is to base the metric on "what is expected" and not necessarily the actual initiator of the degradation. Considering time as a metric basis, resilient and unresilient systems can be observed in Fig. 3. [38]
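A time-normalized metric of this kind can be illustrated with a short numerical sketch: given a sampled performance trace where 1.0 represents expected normalcy, integrate the shortfall from expectation and measure how long performance stays below an acceptance threshold. The traces and threshold here are made-up numbers for demonstration only.

```python
# Illustrative resilience metric: accumulated shortfall from expected
# performance, plus time spent below an acceptance threshold.
# Traces are synthetic; 1.0 represents expected normalcy.

def degradation_area(trace, expected=1.0, dt=1.0):
    """Accumulated shortfall from expected performance (simple
    rectangle sum over uniformly sampled points)."""
    return sum(max(expected - p, 0.0) * dt for p in trace)

def recovery_time(trace, threshold=0.9, dt=1.0):
    """Time from the first dip below threshold until the last sample
    below it, i.e., the duration of degraded operation."""
    below = [i for i, p in enumerate(trace) if p < threshold]
    return (below[-1] - below[0] + 1) * dt if below else 0.0

# A resilient trace dips briefly; a brittle one stays degraded.
resilient = [1.0, 1.0, 0.6, 0.8, 1.0, 1.0]
brittle   = [1.0, 1.0, 0.6, 0.6, 0.6, 0.6]

res_area  = degradation_area(resilient)
brit_area = degradation_area(brittle)
```

The same calculation applies with data integrity as the normalizing factor by replacing the performance trace with a measure of data fidelity, which reflects the "what is expected" framing above rather than the specific initiator of the degradation.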
Depending upon the abscissa metric chosen, Fig. 3 reflects a generalization of the resiliency of a system. Several common terms are represented on this graphic, including robustness, agility, adaptive capacity, adaptive insufficiency, resiliency and brittleness. To overview these terms, the following explanations of each are provided below:
On the abscissa of Fig. 3, it can be recognized that cyber and cognitive influences can affect both the data and the time, which underscores the relative importance of recognizing these forms of degradation in resilient control designs. For cybersecurity, a single cyberattack can degrade a control system in multiple ways. Additionally, control impacts can be characterized as indicated. While these terms are fundamental and may seem of little value to those correlating impact in terms like cost, the development of use cases provides a means by which this relevance can be codified. For example, given the impact to system dynamics or data, the performance of the control loop can be directly ascertained to show the approach to instability and the operational impact.
The very nature of control systems implies a starting point for the development of resilience metrics. That is, the control of a physical process is based upon quantifiable performance and measures, including first-principles and stochastic ones. The ability to provide this measurement, which is the basis for correlating operational performance and adaptation, then also becomes the starting point for correlating the data and time variations that can come from the cognitive and cyber-physical sources. Effective understanding is based upon developing a manifold of adaptive capacity that correlates the design (and operational) buffer. For a power system, this manifold is based upon the real and reactive power assets, the controllable assets having the latitude to maneuver, and the impact of disturbances over time. For a modern distribution system (MDS), these assets can be aggregated from the individual contributions as shown in Fig. 4. [39] For this figure, these assets include: a) a battery, b) an alternate tie line source, c) an asymmetric P/Q-conjectured source, d) a distribution static synchronous compensator (DSTATCOM), and e) a low-latency, four-quadrant source with no energy limit.
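A minimal sketch of aggregating such an adaptive-capacity buffer follows: each asset contributes its remaining real (P) and reactive (Q) maneuvering room, and the fleet buffer is their sum. The asset names, ratings, and operating points are illustrative assumptions loosely echoing the asset types listed above, not values from Fig. 4.

```python
# Hypothetical adaptive-capacity aggregation: remaining P/Q headroom
# per asset, summed over the fleet. All numbers are illustrative.

ASSETS = [
    # name, P rating (kW), Q rating (kvar), current P, current Q
    ("battery",   50.0,  0.0, 20.0,  0.0),
    ("tie-line", 120.0, 60.0, 90.0, 10.0),
    ("dstatcom",   0.0, 40.0,  0.0,  5.0),
]

def adaptive_capacity(assets):
    """Remaining headroom in each axis, summed over the fleet."""
    p_room = sum(p_max - p for _, p_max, _, p, _ in assets)
    q_room = sum(q_max - q for _, _, q_max, _, q in assets)
    return p_room, q_room

p_buffer, q_buffer = adaptive_capacity(ASSETS)
```

Tracking how this buffer shrinks as disturbances consume headroom over time is one simple way to render the "manifold of adaptive capacity" as an operational quantity.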
1) When considering current digital control system designs, the cyber security of these systems is dependent upon what are considered border protections, i.e., firewalls, passwords, etc. If a malicious actor compromised the digital control system for an industrial operation by a man-in-the-middle attack, data can be corrupted within the control system. The industrial facility operator would have no way of knowing the data had been compromised until someone such as a security engineer recognized the attack was occurring. As operators are trained to provide a prompt, appropriate response to stabilize the industrial facility, there is a likelihood that the corrupted data would lead the operator to react to the apparent situation and cause a plant upset. In a resilient control system, as per Fig. 2, cyber and physical data are fused to recognize anomalous situations and warn the operator. [40]
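The fusion idea in this scenario can be sketched as a simple check: a reported measurement is compared against a physics-based prediction, and a mismatch that coincides with anomalous network activity warns the operator rather than feeding a control action. The thresholds, readings, and alarm wording below are illustrative assumptions.

```python
# Sketch of cyber-physical data fusion: a physical residual (reported
# vs. model-predicted value) is combined with a network-anomaly flag
# to distinguish suspected manipulation from an ordinary sensor fault.

def fused_alarm(reported, predicted, net_anomaly, tol=0.05):
    """Classify a reading using both physical and cyber indicators."""
    residual = abs(reported - predicted) / max(abs(predicted), 1e-9)
    physical_suspect = residual > tol
    if physical_suspect and net_anomaly:
        return "suspected data manipulation - do not act on reading"
    if physical_suspect:
        return "possible sensor fault"
    return "normal"

# A man-in-the-middle corrupts a level reading while the plant model
# and the network monitor both see trouble.
status = fused_alarm(reported=7.4, predicted=5.0, net_anomaly=True)
```

The key design point is that neither indicator alone is trusted: the physical residual without the cyber flag suggests equipment degradation, while their coincidence flags possible manipulation before the operator acts on corrupted data.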
2) As our society becomes more automated for a variety of drivers, including energy efficiency, the need to implement ever more effective control algorithms naturally follows. However, advanced control algorithms are dependent upon data from multiple sensors to predict the behaviors of the industrial operation and make corrective responses. This type of system can become very brittle, insofar as any unrecognized degradation in a sensor can lead to incorrect responses by the control algorithm and potentially a worsened condition relative to the desired operation of the industrial facility. Therefore, implementation of advanced control algorithms in a resilient control system also requires the implementation of diagnostic and prognostic architectures to recognize sensor degradation, as well as failures of the industrial process equipment associated with the control algorithms. [41] [42] [43]
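One common diagnostic building block for this scenario is a residual-based drift check: a slowly degrading sensor is caught by comparing it against a redundant reading and accumulating the residual beyond an allowed slack (a CUSUM-style statistic). The readings, slack, and alarm threshold below are synthetic, and this is only one of many possible diagnostic architectures.

```python
# Minimal diagnostic sketch: detect a slowly drifting sensor by
# accumulating the residual against a redundant reading beyond an
# allowed slack. Large accumulated values indicate systematic
# degradation rather than measurement noise. Readings are synthetic.

def cusum_drift(primary, redundant, slack=0.1):
    """CUSUM-style drift statistic over paired sensor readings."""
    stat, history = 0.0, []
    for a, b in zip(primary, redundant):
        stat = max(0.0, stat + abs(a - b) - slack)
        history.append(stat)
    return history

healthy  = cusum_drift([5.0, 5.1, 4.9, 5.0], [5.0, 5.0, 5.0, 5.0])
drifting = cusum_drift([5.0, 5.3, 5.6, 5.9], [5.0, 5.0, 5.0, 5.0])
degraded = drifting[-1] > 0.5   # hypothetical alarm threshold
```

In a resilient design, an alarm like this would route the advanced control algorithm to a fallback data source or a conservative mode before the degraded sensor can drive an incorrect response.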
In our world of advancing automation, our dependence upon these technologies will require educated skill sets from multiple disciplines. The challenges may appear simply rooted in better design of control systems for greater safety and efficiency. However, the evolution of the technologies in the current design of automation has created a complex environment in which a cyber attack, human error (whether in design or operation), or a damaging storm can wreak havoc on the basic infrastructure. The next generation of systems will need to consider the broader picture to ensure a path forward where failures do not lead to ever greater catastrophic events. One critical resource is students, who are expected to develop the skills necessary to advance these designs and require both a perspective on the challenges and an appreciation of the contributions of others to fulfill the need. Addressing this need, a semester course in resilient control systems was established over a decade ago at Idaho and other universities as a catalogue or special-topics focus for undergraduate and graduate students. The lessons in this course were codified in a text that provides the basis for the interdisciplinary studies. [44] In addition, other courses have been developed to provide the perspectives and relevant examples to overview the critical infrastructure issues and provide opportunity to create resilient solutions at such universities as George Mason University and Northeastern.
Through the development of technologies designed to set the stage for next-generation automation, it has become evident that effective teams comprise several disciplines. [45] However, developing a level of effectiveness can be time-consuming, and when done in a professional environment can expend much energy and time that provides little obvious benefit to the desired outcome. It is clear that the earlier these STEM disciplines can be successfully integrated, the more effective they are at recognizing each other's contributions and working together to achieve a common set of goals in the professional world. Team competition at venues such as Resilience Week would be a natural outcome of developing such an environment, allowing interdisciplinary participation and providing an exciting challenge to motivate students to pursue a STEM education.
Standards and policy that define resilience nomenclature and metrics are needed to establish a value proposition for investment, which includes government, academia and industry. The IEEE Industrial Electronics Society has taken the lead in forming a technical committee toward this end. The purpose of this committee is to establish metrics and standards associated with codifying promising technologies that promote resilience in automation. This effort is distinct from the supply chain community's focus on resilience and security, such as the efforts of ISO and NIST.