Four articles, published across the March through May issues of Communications, highlight how people are the unique source of the adaptive capacity essential to incident response in modern Internet-facing software systems. While it's reasonable for software engineering and operations communities to focus on the intricacies of technology, there is not much attention given to the intricacies of how people do their work. Ultimately, it is human performance that makes modern business-critical systems robust and resilient.
As business-critical software systems become more successful, they necessarily increase in complexity. Ironically, this complexity makes these systems inherently messy so that surprising incidents are part and parcel of the capability to provide services at larger scales and speeds.13 Studies in resilience engineering2,12 reveal that people produce resilient performance in messy systems by doing the cognitive work of anomaly response; coordinating joint activity during events that threaten service outages; and revising their models of how the system actually works and malfunctions using lessons learned from incidents. People's resilient performance compensates for the messiness of systems, despite constant change.
No entries found