Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Despite ever-accelerating technology advances over the past decade, achieving application stability is still fraught with complexity and prohibitively high costs. Resilient applications provide much more than reliability: they help to build a continuous innovation process by predicting the failure paths and design for failure to move fast and stay ahead of availability goals to build always-available systems and provide continuous service to customers. To get there, IT organizations must assess the current cultural situation, and change behavior and mindsets to enable their business, process, tools and technology to deliver enhanced quality of service to users inside and outside the enterprise.

Root Cause Analysis

Root Cause Analysis (RCA), a common practice throughout the software industry, does not provide any value in preventing future incidents in complex software systems. Instead, it reinforces hierarchical structures that confer blame, and this inhibits learning, creativity, and psychological safety. In short, RCA is an inhumane practice.

A common misconception that encourages the embrace of RCA is that without understanding the ‘root cause,’ you can’t fix what is wrong. Let’s differentiate between the ‘root cause’ and Least Effort to Remediate (LER). As long as an incident is ongoing, LER is absolutely the right thing to pursue. When the building is on fire, put the fire out as quickly as possible. Unfortunately it is all-too-common to assume that whatever undoes the damage must also point to what caused it. One branch of research in particular, Resilience Engineering, has made its way into the software industry.

Complex software systems will not lose their complexity. RCA will only hinder our ability to navigate them with fairness and grace. Resilience Engineering may hold the key to take back our dignity as knowledge workers, overcome naive notions like ‘root cause,’ and learn to navigate our complex systems better.

Customer Benefits

Post implementation of resilience features in their IT organization, the following benefits were measured:
• 20% reduction in number of non-functional incidents year to date (YTD).
• 32% reduction in incident duration (MTTR), YTD.

...