When a digital system is developed or purchased, a primary consideration is how successfully the system accomplishes its desired function. Measuring that success, however, should encompass the functional lifespan of the system, including its long-term efficiency. Efficiency is defined as the ability to do things well, successfully, and without waste. A short-term view of efficiency might not account for threats with a low probability of occurring in the next days or weeks. Achieving long-term efficiency, with the system accomplishing its function even in the presence of disruptions, requires consideration of several factors, including short-term efficiency and resilience. Resilience is defined as the ability of a system to absorb, respond to, recover from, and adapt to disruptive events.9,14,20,22
Enhancing cyber resilience often requires investing in qualified labor, redundant equipment, and software. Such investment increases the cost per byte or user and therefore is detrimental to short-term efficiency. However, enhanced resilience reduces the impact of disruptions and speeds up recovery from them. While investing in resilience improves long-term efficiency, its optimization requires perfect knowledge of the system's short-term efficiency, the exact nature and impact of all future failures, and knowledge of the system's response to those failures (that is, resilience). As it is impossible to quantify the exact nature and impact of all future failures, it is also impossible to deterministically optimize a system's long-term efficiency. While risks can't be understood deterministically, the impact and nature of those risks can be estimated probabilistically. Ideally, the uncertainty introduced by using a probabilistic approach demands prioritizing the most effective solution based on the long-term goals. We need a framework that compares the tradeoffs between short-term efficiency and resilience to optimize the long-term efficiency goals and therefore the effectiveness of a proposed solution.
Efficiency has been examined extensively, both in operations research and general management4 and in specific disciplines such as computer science,1 communications,15 and supply chain management.5,11 Likewise, resilience is the theme of a substantial body of knowledge18,20 with studies focusing on particular areas such as cyber resilience,3,17 aviation, climate change and epidemiology;24,25 or specific types of disruptions, such as the ripple effect in supply chains.11
Recent work has examined the relationship between efficiency and resilience in a system from two different perspectives. One body of knowledge has characterized efficiency and resilience as conflicting objectives, proposing a trade-off between them in a system's design or optimization.11,24,25 The second perspective asserts that systems optimized exclusively for efficiency often sacrifice redundancy, reliability, robustness, and other attributes related to resilience.1,12
In this Viewpoint, we explore the relationship between short-term efficiency and resilience of digital or cyber-physical systems through the lens of functionality, resources, and cost over time. The objective is to present a framework for evaluating the long-term efficiency of a system as a combination of short-term efficiency and resilience. After one or more disruptions, systems that were optimized for both short-term efficiency and resilience as a combined objective are more efficient in the long term than systems that were optimized exclusively for short-term efficiency. We discuss the challenges of estimating resilience against future disruptions and possible approaches to overcome them, and then use the WannaCry ransomware attack on the National Health Service (NHS) of England to illustrate the impact of disruptions and of investment in resilience.
As an illustration of the relationship between efficiency and resilience, consider a scenario in which two systems initially have identical configurations and levels of functionality (or output). In the case of a healthcare system, for example, functionality and output over a given period refer to the volume of patient appointments, response time to emergency calls, and preventive care exams, among others. At an instant denoted as Tpre, one of the systems is modified to improve resilience while the other is kept unchanged. Because of the modification, these systems respond differently to a disruption. Conceptually, the dynamics of the two systems are as follows.
Initial state: Identical systems. Consider two systems S1 and S2 that initially have identical configurations and levels of functionality. This initial state is represented in the leftmost part of the timeline illustrated in Figure 1, as the period before Tpre.
Time Tpre: Modification to improve resilience. At time Tpre, managers of system S2 decide to improve its resilience against future attacks or disasters. For example, in a communications network, this might mean adding extra routers or links to improve network resilience, and in a distributed computing platform it might mean adding redundant servers, storage, and so forth. System S1, on the other hand, does not receive any investment for resilience at time Tpre. Note that the functionality or output F2 of system S2 is unaffected by the extra investment since its purpose is to improve resilience rather than functionality. This is illustrated in Figure 1(A) in the period after Tpre. However, the equipment, services, and so forth added at time Tpre increase the total cost of system S2 from C0 to C2. This is illustrated in Figure 1(B).
Immediately after Tpre, it becomes apparent that system S2 is less efficient than S1, at least in the short term. This is because S2 incurs a cost C2>C1 to produce the same functionality as S1. We call this a short-term loss of efficiency.
Time Tdis: Disruption. At time Tdis, a natural disaster, operations error, or cyber-attack disrupts both systems, affecting their functionality as illustrated in Figure 1(A). System S1 suffers a sharp reduction in functionality, followed by a relatively slow recovery toward the pre-disruption level F0.
On the other hand, the additional investment in resilience in system S2 pays off. As illustrated in Figure 1(A), the functionality of system S2 is more stable (that is, has less loss) and recovers (that is, goes back to pre-disruption levels) faster than system S1 in the time following the disruption. Figure 1(B) illustrates the increase in total costs caused by the disruption. This increase includes direct costs of foregone output and costs to remedy the disruption. The increase in total costs may also include indirect or future costs related to lawsuits, loss of users or customers that no longer trust the system, and so forth. Since system S1 suffers a more severe and longer impact, it is reasonable to expect that the increase in total cost is relatively larger than that of system S2, which suffers fewer losses.
Figure 1(C) is one way to illustrate the difference in efficiency (short and long term) between the systems over time. Figure 1(C) represents the ratio of the total cost to functionality or output. In other words, the curves represent the unit cost of each system. Under "normal" conditions (before a disruption, illustrated as t<Tdis), system S1 has the best short-term efficiency because of its cheaper unit cost. But when a disruption occurs at Tdis, functionality declines (see Figure 1(A) for Tdis<t<Trec) while cost increases simultaneously (Figure 1(B) for the same period), and these two effects compound to raise the cost per unit illustrated in Figure 1(C). System S2, on the other hand, has both a smaller decrease in output and a smaller increase in total cost, so its unit cost rises less.
The key takeaway of the scenario illustrated in Figure 1 is that resilience may be detrimental to short-term efficiency but can be necessary for long-term efficiency. System S2 initially has worse short-term efficiency because of the costly resources added to enhance resilience. However, S2 has higher long-term efficiency than S1 after failures, attacks, or disasters because S2 turns out to have a lower cost per unit in the longer term.
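The scenario in Figure 1 can be sketched numerically. The following minimal Python example uses made-up cost and functionality levels for two phases (normal operation and disruption). The `Phase` class, the helper `unit_cost_area`, and all numbers are illustrative assumptions, not values taken from this Viewpoint or its figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One phase of the timeline with constant cost and functionality (hypothetical)."""
    name: str
    duration: float       # length of the phase, arbitrary time units
    cost: float           # total cost per time unit, C
    functionality: float  # output per time unit, F

    @property
    def unit_cost(self) -> float:
        # Cost per unit of output, C / F
        return self.cost / self.functionality


def unit_cost_area(phases):
    """Area under the piecewise-constant unit-cost curve over all phases."""
    return sum(p.duration * p.unit_cost for p in phases)


# Assumed numbers: S2 pays more in normal operation but degrades less when disrupted.
s1 = [Phase("normal", 10, 100, 100), Phase("disrupted", 4, 150, 50)]
s2 = [Phase("normal", 10, 125, 100), Phase("disrupted", 4, 120, 80)]

for label, phases in (("S1", s1), ("S2", s2)):
    per_phase = {p.name: round(p.unit_cost, 2) for p in phases}
    print(label, per_phase, "area:", unit_cost_area(phases))
```

With these assumed numbers, S1 is cheaper per unit in normal operation (1.0 versus 1.25) but more expensive during the disruption (3.0 versus 1.5), so its unit-cost area over the whole timeline is larger (22.0 versus 18.5).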
Efficiency. To formalize the idea illustrated in Figure 1, let the cost per unit CU=C/F be the ratio of total cost C to functionality F of a system, as illustrated in Figure 1(C). Note that CU is inversely related to efficiency: the lower CU is in a period of time, the more efficient a system is during that period. Therefore, the difference in efficiency between two systems S1 and S2 can be expressed as ΔCU, the area between CU1 and CU2 over time.
This difference in efficiency is illustrated in Figure 2.
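As a sketch, the area between the two unit-cost curves can be written as an integral. The evaluation horizon $[T_a, T_b]$ and the sign convention (positive when S2 has the lower unit cost) are assumptions made here for illustration; the text does not fix them.

```latex
\Delta CU = \int_{T_a}^{T_b} \bigl( CU_1(t) - CU_2(t) \bigr)\, dt,
\qquad CU_i(t) = \frac{C_i(t)}{F_i(t)}, \quad i \in \{1, 2\}
```

Under this convention the integrand is negative while S1 has the lower unit cost (before the disruption) and positive while S2 does, so ΔCU becomes positive once the savings of S2 during and after a disruption outweigh its earlier extra cost.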
Resilience. Let R be a measure of the resilience of a system. One way to define R is as the area under the curve of the system's functionality F over the period Tdis≤t<Trec following a disruption at Tdis (for example, a cyberattack). Two resilience properties are directly associated with the change in F: (i) the decrease in output F from the baseline F0; and (ii) the time the system takes to recover (that is, to return to the pre-disruption output level F0).
Therefore, the area under the curve (that is, the integral of F over time) can be understood as a measure of resilience that combines the impact of the disruption and the recovery to the previous output level. The larger the integral, the more resilient a system is. When a disruption occurs, system S2 is more resilient than S1 (a larger integral) when either the degradation in F2 is less than the degradation in F1, or the recovery of F2 to the pre-disruption level is faster than the recovery of F1. In the example illustrated in Figure 1(A), both conditions hold: F2 degrades less than F1 and returns to the pre-disruption level faster. As with efficiency, we can define a difference in resilience between two systems, ΔR.
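The area-under-the-curve definition of R can be approximated from sampled functionality values. The sketch below uses a hand-written trapezoidal rule. The sample times, the functionality values, and the convention `delta_r = r2 - r1` are assumptions chosen for illustration.

```python
def trapezoid_area(times, values):
    """Approximate the integral of `values` over `times` with the trapezoidal rule."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two samples and equal-length sequences")
    return sum(
        (t1 - t0) * (v0 + v1) / 2.0
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )


# Hypothetical samples between T_dis (t = 0) and T_rec (t = 4), with F0 = 100.
times = [0, 1, 2, 3, 4]
f1 = [100, 40, 60, 80, 100]    # S1: deep drop, slow recovery
f2 = [100, 80, 100, 100, 100]  # S2: shallow drop, fast recovery

r1 = trapezoid_area(times, f1)  # resilience measure R for S1
r2 = trapezoid_area(times, f2)  # resilience measure R for S2
delta_r = r2 - r1               # assumed convention: positive when S2 is more resilient
print(r1, r2, delta_r)
```

For these assumed samples the areas are 280 for S1 and 380 for S2, so the difference is 100 in favor of S2.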
Note that resilience R and long-term efficiency (as represented by the inverse of CU) are positively correlated in the period Tdis≤t<Trec. A system with higher resilience R2>R1 will also be more efficient in the long term because it has a lower unit cost CU2<CU1 in the period Tdis≤t<Trec. For example, if S2 experiences a degradation in F that is 10% less than that of S1, then both R and 1/CU for S2 will be relatively higher than R and 1/CU for S1. The takeaway argument is that a resilient system is also efficient in the long term; there is no conflict. For example, a healthcare system that is prepared for a relatively small degradation in patient appointments following a disruption will have both high resilience and high efficiency in volume of appointments per dollar.
The formulation here implies that the optimal investment in resilience resources is the one that maximizes the difference in long-term efficiency after a disruption, ΔCU. However, finding such an optimal investment is challenging. Even though there is a clear goal, possible barriers to performing the optimization include uncertainty regarding the frequency/probability and impact of disruptions and uncertainty regarding the effect of each dollar invested in resilience with respect to reduction of impact and/or speed of recovery from disruptions.
The uncertainty regarding the frequency/probability and impact of disruptions has several implications that make the optimization for long-term efficiency hard. One is to judge threats to be outside the horizon of interest. For example, a manager might prioritize short-term efficiency as share prices and budgetary/time constraints often mean that short-term efficiency is the metric by which their, and the project's, performance is gauged. Another implication is a refusal to invest in resilience against unknown threats. For example, a manager could refuse to prepare the system against attacks that decrypt information based on quantum computing, arguing that nobody knows what specific attacks and their impact will look like. If these implications are widespread across organizations and sectors, then there is yet another incentive to focus on short-term efficiency: "no one else is investing in resilience—if we do, we will not be competitive."
The uncertainty regarding the effect of each dollar invested in resilience for reduction of impact and/or speed of recovery from disruptions also has implications that make the optimization challenging. Managers may assess that they do not know if anything will happen, and if it does, benefit-cost analysis of resilience cannot be performed because the resilience gain per dollar invested cannot be determined a priori. For these reasons, managers may struggle to choose between investment alternatives. Should one invest in thicker data center walls, anti-aircraft missiles, or anti-virus software? Moreover, the uncertainty regarding the return on investment may prevent managers from acting if they perceive that they are paying for other people or organizations to benefit.
Another challenge to optimizing long-term efficiency is that estimates of impacts need to be done for at least two scenarios: one without any investment in resilience (that is, S1) and at least one where resources are added to improve resilience (that is, S2). Moreover, if there are N>2 scenarios, then ΔR and ΔCU must be estimated for multiple pairs (S1,Si), i=2,…,N, to choose the scenario with the highest improvement in ΔR and ΔCU.
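The pairwise comparison against the baseline S1 can be sketched as follows. The `Scenario` tuple, the numeric estimates, and the ranking rule (sort by the efficiency difference first, then by the resilience difference) are assumptions made for illustration. The text does not prescribe how to combine the two criteria when they disagree.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    name: str
    resilience: float      # estimated R_i (area under functionality after a disruption)
    unit_cost_area: float  # estimated area under the unit-cost curve CU_i


def rank_against_baseline(baseline, candidates):
    """Return (name, delta_r, delta_cu) rows, best candidate first.

    delta_r  = R_i - R_1        (positive: candidate is more resilient)
    delta_cu = CU_1 - CU_i area (positive: candidate has lower long-term unit cost)
    """
    rows = [
        (s.name, s.resilience - baseline.resilience,
         baseline.unit_cost_area - s.unit_cost_area)
        for s in candidates
    ]
    # Assumed rule: rank by delta_cu, break ties with delta_r.
    return sorted(rows, key=lambda row: (row[2], row[1]), reverse=True)


# Hypothetical estimates for a baseline and three investment scenarios.
s1 = Scenario("S1", resilience=280.0, unit_cost_area=22.0)
candidates = [
    Scenario("S2", 380.0, 18.5),
    Scenario("S3", 390.0, 21.0),
    Scenario("S4", 300.0, 20.0),
]

ranking = rank_against_baseline(s1, candidates)
for name, delta_r, delta_cu in ranking:
    print(name, delta_r, delta_cu)
```

In this made-up case S3 is slightly more resilient than S2 but much less efficient, which shows why a rule for weighing ΔR against ΔCU has to be chosen explicitly.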
The uncertainty about disruptions and resilience benefit from investment affects the optimization of long-term efficiency. Therefore, approaches that help reduce this uncertainty will result in investments in resilience that are close to optimal. One approach is to use historical data about past disruptions and past resilience investments to forecast future disruptions and investments. This data should ideally include past disruptions with information about impact and frequency on given systems and data about past investments in resilience that enable the assessment of resilience benefit per dollar (reduction and recovery from disruptions). One shortcoming of this approach is that it does not help optimize the investment for new threats (for example, malware that was never seen before).
Simulation is another approach to reduce the uncertainty about disruptions and resilience benefit from investment, informing the optimization of long-term efficiency. Simulations can be as simple as tabletop red teaming in cyber-security, where subject matter experts estimate probabilities and impact of unforeseen events, but the assumptions can be subjective and the results ineffective.10 The other extreme would be high-fidelity digital twins of cyber-physical systems with varied capabilities of functionality emulation. For example, NREL's CEEP is considered a sophisticated platform for the simulation of cyber events.7 Either way, simulation of events with high impact and low probability (for example, quantum computer attacks), as well as simulation of distinct resilience measures (for example, placement of redundant servers versus cyber defense software or teams), may facilitate the optimization of long-term efficiency with less uncertainty.
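At the simple end of that spectrum, a Monte Carlo sketch can compare total cost with and without an investment in resilience when disruptions are only known probabilistically. Everything in the example below is an assumption for illustration: the yearly disruption probability, the cost figures, the rule of at most one disruption per year, and the function names. It is not a model of any platform mentioned above.

```python
import random


def simulate_total_cost(rng, years, p_disruption, annual_cost, loss_per_disruption):
    """Total cost of one simulated timeline with at most one disruption per year."""
    total = 0.0
    for _ in range(years):
        total += annual_cost
        if rng.random() < p_disruption:
            total += loss_per_disruption
    return total


def mean_total_cost(trials, seed, **system):
    """Average total cost over many simulated timelines (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(simulate_total_cost(rng, **system) for _ in range(trials)) / trials


# Hypothetical parameters: the resilient system costs more every year,
# but each disruption is cheaper for it.
baseline = dict(years=10, p_disruption=0.2, annual_cost=100.0, loss_per_disruption=500.0)
resilient = dict(years=10, p_disruption=0.2, annual_cost=110.0, loss_per_disruption=100.0)

cost_baseline = mean_total_cost(trials=20_000, seed=1, **baseline)
cost_resilient = mean_total_cost(trials=20_000, seed=1, **resilient)
print(f"baseline: {cost_baseline:.0f}  resilient: {cost_resilient:.0f}")
```

Under these assumptions the expected totals are 2,000 for the baseline and 1,300 for the resilient system (10 × (annual cost + 0.2 × loss)), so the simulated means should land close to those values. Changing the disruption probability shows how quickly the comparison flips when disruptions become rare.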
As future work, the approach proposed in this Viewpoint could benefit from insights from Pareto optimality methods. Previous work has discussed the Pareto front when optimizing criteria such as reliability and affordability (for example, Bhattacharya et al.2) and a similar approach could be used to explore equilibrium levels of resilience and efficiency.
In May 2017, malware of large-scale impact was released on the Internet. Using the EternalBlue vulnerability on Microsoft systems (made public earlier the same year), the worm, WannaCry, would infect a computer and encrypt the files, which were then held for ransom. The victim was offered the encryption key in exchange for a payment of the equivalent of 600 USD in Bitcoin (per infected device). After three days, if the ransom was not paid, the files were permanently deleted. WannaCry spread to 230,000 computers in over 150 countries, costing the global economy approximately 4 billion USD,6 which represents total cost C in our formulation. While the NHS was not the intended target, it was among the hardest hit. The malware spread to 80 out of the 236 public hospitals in England, costing the Department of Health an estimated 120 million USD.18 This financial loss and the loss of output in terms of healthcare that could not be provided justify investments in resilience resources (as discussed previously). The global and U.K. losses are a tangible portion of the aggregate increase in cost that follows a disruption in our model. For example, an investment in resilience R that reduced impact and sped recovery for the NHS would prevent part or all of the loss of 120 million USD, or result in minimal impact on the cost per patient CU, preserving long-term efficiency. If the investment does not exceed the prevented loss, it is justifiable.
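The break-even rule in the last sentence can be written as a small calculation. In the sketch below, only the 120 million USD loss comes from the text. The residual loss and the size of the investment are hypothetical placeholders, and the function names are illustrative.

```python
def net_benefit(investment, loss_without, loss_with):
    """Prevented loss minus the investment that prevented it (same currency units)."""
    prevented_loss = loss_without - loss_with
    return prevented_loss - investment


def is_justifiable(investment, loss_without, loss_with):
    """True when the investment does not exceed the loss it prevents."""
    return net_benefit(investment, loss_without, loss_with) >= 0


NHS_LOSS_MUSD = 120.0         # estimated loss cited in the text, million USD
assumed_residual_loss = 30.0  # hypothetical loss remaining despite the investment
assumed_investment = 50.0     # hypothetical cost of the resilience resources

print(net_benefit(assumed_investment, NHS_LOSS_MUSD, assumed_residual_loss))
print(is_justifiable(assumed_investment, NHS_LOSS_MUSD, assumed_residual_loss))
```

With these placeholder values the investment prevents 90 million USD of loss at a cost of 50 million USD, leaving a net benefit of 40 million USD. The same investment would not be justifiable if it prevented less than 50 million USD of loss.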
While we have discussed that unpredictable disruptions make it difficult to optimize investments, this attack was not unprecedented. Two hospitals in the NHS had fallen victim to ransomware attacks much earlier, in October 2016,21 and again in January 2017.8 As a result of the earlier attacks, the NHS was in the process of conducting on-site cyber security assessments, and all of the 88 hospitals that had been audited at the time had failed the assessment. Moreover, NHS Digital, the Health and Social Care entity in charge of IT, had issued guidance to hospitals to install the updates that would have prevented the success of WannaCry.
However, the resources required to mitigate the effect of the malware did not add value in terms of short-term efficiency. They do streamline monitoring and provide clarity in a moment of emergency, but they contribute little, if anything, to the day-to-day operation of a hospital while increasing costs in the short term. As a result, prior investments for resilience were possibly below the amount that would optimize long-term efficiency.
In this Viewpoint, we showed that resilience pays off. Adding resources for resilience to a system likely increases its costs initially without expanding functionality, causing an initial decline in short-term efficiency. However, cyber disruptions are increasingly likely (if not certain), and they decrease the system's functionality while simultaneously increasing its costs due to lost customers or users, lawsuits, and other damage. A system that has invested in resilience has smaller declines in functionality and fewer cost overruns, and this advantage can more than compensate for the initial costs of adding resources for resilience. This not only improves resilience but improves long-term efficiency as well.
This is especially true for cyber systems and cyber-physical systems. For example, the electric grid is increasingly dependent on information and supervisory systems that control power generation, transmission, and distribution. Disruptions that impair the functionality of control systems may cause electric outages that put lives at risk and incur economic losses in businesses or missions that are beyond the boundaries of the grid. Resilience resources that mitigate those losses, therefore, result in a benefit to energy users that exceeds the upfront cost of the grid. Therefore, adding certain resources should improve both long-term efficiency and resilience.
However, there probably is a limit on such long-term efficiency and resilience gains. In the grid example, adding a redundant server to a single-server control system will probably result in enhanced long-term efficiency and resilience. Adding a third server, a fourth, and so forth will increase cost and probably will not enhance resilience as much as the second server. At some point, adding resources may not improve resilience at all while still adding to the overall cost and therefore becoming detrimental to long-term efficiency. Finding the level of investment in resilience that is optimal for a given system is challenging, if not impossible. While the cost of equipment, software, or personnel dedicated to cybersecurity and resilience is generally known, its benefit depends on the intensity and frequency of future threats.
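The diminishing return from extra redundant servers can be illustrated with a standard reliability approximation. The sketch below assumes that servers fail independently, that the control system is down only when every server is down, and that each server has the same availability. These assumptions, the cost figures, and the function names are illustrative and do not come from the grid example itself.

```python
def unavailability(n_servers, availability):
    """Probability that all n servers are down, assuming independent failures."""
    return (1.0 - availability) ** n_servers


def expected_cost(n_servers, availability, server_cost, outage_loss):
    """Cost of the servers plus the expected loss from a total outage."""
    return n_servers * server_cost + unavailability(n_servers, availability) * outage_loss


# Hypothetical parameters (arbitrary cost units).
AVAILABILITY = 0.9    # availability of a single server
SERVER_COST = 10.0    # cost of each server
OUTAGE_LOSS = 1000.0  # loss if the whole control system is down

costs = {n: expected_cost(n, AVAILABILITY, SERVER_COST, OUTAGE_LOSS) for n in range(1, 6)}
best_n = min(costs, key=costs.get)
for n, cost in costs.items():
    print(n, round(cost, 2))
print("lowest expected cost at n =", best_n)
```

With these assumed values the second server lowers expected cost sharply (from about 110 to about 30), while a third and a fourth raise it again (to about 31 and 40.1) because their resilience gain no longer covers their cost. In practice the optimum depends on failure probabilities and losses that are themselves uncertain.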
3. Bodeau, D. et al. Cyber Resiliency Metrics, Measures of Effectiveness, and Scoring. (2018); https://bit.ly/3XR4fip
5. Chopra, S. et al. Achieving supply chain efficiency and resilience by using multi-level commons. Decision Sciences 52, 4 (2021), 817–832; https://bit.ly/3XQSGYk
6. Ghafur, S. et al. A retrospective impact analysis of the WannaCry cyberattack on the NHS. Npj Digital Medicine 2, 1 (2019); https://bit.ly/3IJjMfW
7. Hasandka, A. et al. NREL's Cyber-Energy Emulation Platform for Research and System Visualization (May 2020); https://bit.ly/3lWPJIJ
11. Ivanov, D. et al. The Ripple effect in supply chains: Trade-off efficiency-flexibility-resilience in disruption management. International Journal of Production Research 52, 7 (July 2014); https://bit.ly/3ErAbms
12. Jin, A.S. et al. Building resilience will require compromise on efficiency. Nature Energy 6 (Nov. 2021), 997–999; https://doi.org/10.1038/s41560-021-00913-7
13. Kott, A. et al. Cyber resilience: By design or by Intervention? Computer 54, 8 (Aug. 2021), 112–117; https://bit.ly/3Slnob5
15. Ligo, A.K. et al. How to measure cyber-resilience of a system with autonomous agents: Approaches and challenges. IEEE Engineering Management Review 49, 2 (Feb. 2021), 89–97; https://bit.ly/3IINUId
16. Ligo, A.K. et al. Throughput and economics of DSRC-based Internet of vehicles. IEEE Access 6 (2017); https://bit.ly/417EuNw
17. Ligo, A. et al. How to measure cyber-resilience of a system with autonomous agents: Approaches and challenges. IEEE Engineering Management Review 49, 2 (Feb. 2021); https://bit.ly/3xHU6d3
19. National Health Executive. WannaCry cyber-attack cost the NHS £94m after 19,000 appointments were cancelled. (2018); National Health Executive. https://bit.ly/3YNQAtC
20. National Research Council. Disaster resilience: A national imperative. In Disaster Resilience: A National Imperative. The National Academies Press, Washington, D.C. (2012); https://bit.ly/3Is7PtM
21. Stevens, L. Aggressive ransomware blamed for NHS cyber-attack. Digital Health (2016); https://bit.ly/3ZcGiTx
22. The Resilience Shift. What is critical infrastructure? Why is resilience important? (Mar. 2022); https://bit.ly/3ErFGS2
24. Vardi, M. Efficiency vs. resilience: What COVID-19 teaches computing. Commun. ACM 63, 5 (May 2020); https://bit.ly/41sTVjN
25. Vardi, M. Engineers and economists prize efficiency, but nature favors resilience—Lessons from Texas, COVID-19, and the 737 Max. The Conversation. (2021); https://bit.ly/3XYqlzG
This research project was supported by the U.S. Army Engineers Research and Development Center Funding Laboratory Enhancements Across Four Categories (FLEX-4) Program. The views and opinions expressed in this Viewpoint are those of the individual authors and not those of the U.S. Army or other sponsor organizations.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2023 ACM, Inc.