Resilience of an application, in simple language, is the capability of the application to spring back to an acceptable operational condition after one or more of its components face an event affecting its operating condition.
It should not be confused with System Resilience, which focuses on Cybersecurity and Business Continuity.
The Software Resiliency Index reflects the programming best practices that make software bullet-proof, more robust and secure. This index is derived through technology-specific code analysis that searches for code patterns and bad programming practices that may compromise the reliability of the software in the short term. The higher the Software Resiliency Index, the lower the likelihood of defects occurring in production.
To apply the thresholds, percentages are calculated from the aggregated reliability violation results with respect to the total, yielding a Software Resiliency Index (SRI). The thresholds used for the Software Resiliency Index are:
High (green): value > 84%
Medium (orange): value >= 62% and <= 84%
Low (red): value < 62%
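The thresholds above can be expressed as a small classification function. This is a minimal sketch: the name `classify_sri` is hypothetical, and the input is assumed to be an SRI already expressed as a percentage between 0 and 100. It does not reproduce how the index itself is computed.

```python
def classify_sri(value: float) -> str:
    """Map an SRI percentage (0-100) to the band named in the thresholds."""
    if not 0.0 <= value <= 100.0:
        raise ValueError("SRI must be a percentage between 0 and 100")
    if value > 84.0:
        return "High (green)"
    if value >= 62.0:
        # Values from 62% up to and including 84% fall in the middle band.
        return "Medium (orange)"
    return "Low (red)"
```

Note that a value of exactly 84% falls in the Medium band, because the High band requires a value strictly greater than 84%.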
Recent advances in measuring the structural quality of software involve detecting violations of good architecture and design practice by statically analyzing source code. These practices can be stated as rules for engineering software products. Violations of these rules are called Weaknesses here, to be consistent with the terms used in the Common Weakness Enumeration, which lists the weaknesses used in these measures. The Automated Source Code Quality Measures are calculated from counts of what industry experts have determined to be the most severe weaknesses. Consequently, they provide strong indicators of the quality of an application and of the probability of operational or cost problems related to each measure’s domain.
Security Reviewer SRA is a static analysis solution that can be used to perform application resiliency testing on large, complex, and multi-technology applications, regardless of the language, to provide development process improvement opportunities. SRA analyzes source and binary code and architecture to identify vulnerabilities and verify adherence to architecture or coding standards. This creates a bottom-up view of software risks and real-time information for remediation or software quality improvement.
Software Quality Characteristics from ISO/IEC 25010 with CISQ focal areas highlighted.
This analysis certifies the level of quality of this application when measured against the CISQ Quality Characteristic Measures developed by the Consortium for IT Software Quality and adopted as standards by the Object Management Group (OMG). These measures are developed by counting the number of times critical rules of good architectural and coding practice for each characteristic have been violated. Since structural quality analysis tools differ in the violations of good architectural and coding practices they can detect, the analysis only includes results for practices that were evaluated and that form the basis for this certification. For each architectural or coding practice within each quality characteristic, the results present both the number of times the practice was violated and the number of opportunities for the practice to have been violated within the application. When aggregated over all violations, these numbers provide the basis for a 6-sigma ranking for each quality characteristic and for the aggregated characteristics, that is, the σ level representing the number of violations per million opportunities. This analysis provides an evidence-based assessment of the risk this application poses to the business operations it supports or to its cost of ownership.
ISO/IEC 25010 certification is presented in the Sigma format that many companies are familiar with from Six Sigma quality improvement programs. The use of Sigma levels provides a common representation supported by a rigorous, statistically-based method for benchmarking quality results.
The total number of occurrences detected for the weaknesses included in a CISQ measure is then transformed into weaknesses per million opportunities to determine the Sigma level for that measure.
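The transformation from weakness counts to a Sigma level can be sketched as follows. This is an illustration under stated assumptions, not the certified calculation: the function names are hypothetical, and the conversion uses the conventional Six Sigma 1.5σ long-term shift, which the text above does not specify.

```python
import math
from statistics import NormalDist

# Assumption: the conventional Six Sigma 1.5-sigma long-term shift is applied.
SIGMA_SHIFT = 1.5


def weaknesses_per_million(occurrences: int, opportunities: int) -> float:
    """Scale detected weaknesses to a rate per million opportunities."""
    if opportunities <= 0:
        raise ValueError("opportunities must be positive")
    if not 0 <= occurrences <= opportunities:
        raise ValueError("occurrences must be between 0 and opportunities")
    return occurrences / opportunities * 1_000_000


def sigma_level(occurrences: int, opportunities: int) -> float:
    """Convert a weakness rate into a Sigma level via the normal quantile."""
    rate = weaknesses_per_million(occurrences, opportunities) / 1_000_000
    defect_free = 1.0 - rate
    if defect_free >= 1.0:
        return math.inf   # no weaknesses detected
    if defect_free <= 0.0:
        return -math.inf  # every opportunity was violated
    return NormalDist().inv_cdf(defect_free) + SIGMA_SHIFT
```

Under this assumption, 66,807 weaknesses per million opportunities corresponds to roughly 3σ and 3.4 per million to roughly 6σ, which matches the figures commonly quoted in Six Sigma programs.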
The following is an example of a Reliability Checklist:
In general, applications below the 3σ level should be considered unacceptable and of high risk. In practice, it will be difficult for applications to achieve 5 or 6 Sigma scores, and this level of quality may be beyond the requirements for many applications, or beyond the cost-benefits of striving for this level of quality. However, there are violations such as SQL injection for which the tolerance level should be ‘0 occurrences’ since the Security risks posed by this weakness can be disastrous. The appropriate quality range for most business applications will be a certification between 3 and 4 Sigma.
Security Reviewer SRA follows the MITRE 2019 System Engineering for Resilience recommendations:
Added resilience is part and parcel of DevOps. You can still benefit from deployment automation processes even if you do not introduce any functional change whatsoever. Consider one of your software applications, for example. The source code is at the top of the stack, but when you unpack the stack there is likely a Web Server and other dependent services required to make the application work. If one of those internal services has a vulnerability, your DevOps process can flag it and automatically upgrade to a non-vulnerable version.
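The flag-and-upgrade step described above might look like the following sketch. Everything here is hypothetical: the `Advisory` record, the component names, and the dotted numeric version format are assumptions made for illustration, and no real vulnerability feed or deployment tool is being modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    component: str             # hypothetical component name
    fixed_in: tuple[int, ...]  # first version that is not vulnerable


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a dotted numeric version such as '2.4.1' into (2, 4, 1)."""
    return tuple(int(part) for part in text.split("."))


def plan_upgrades(deployed: dict[str, str], advisories: list[Advisory]) -> dict[str, str]:
    """Return {component: target_version} for every vulnerable deployed component."""
    plan: dict[str, tuple[int, ...]] = {}
    for advisory in advisories:
        current = deployed.get(advisory.component)
        if current is None:
            continue  # component is not part of this stack
        if parse_version(current) < advisory.fixed_in:
            # If several advisories apply, keep the highest fixed version.
            previous = plan.get(advisory.component, advisory.fixed_in)
            plan[advisory.component] = max(previous, advisory.fixed_in)
    return {name: ".".join(map(str, version)) for name, version in plan.items()}
```

A pipeline could run such a check on every deployment, even when the application code has not changed, and feed the resulting plan into its upgrade step.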
Understanding the attributes and dimensions of resilience provides guidance to measure the adoption and effectiveness of resilience engineering implementation. As illustrated in the following Figure, it is recommended to shift focus from maximizing mean time to failure (MTTF) to minimizing mean time to recover (MTTR) to build highly resilient systems.
Resilience can be quantified as the time a system takes to return to functional mode after a disturbance from normal operations. For example:
Resilience = f (Robustness, Rapidity)
• Robustness is the measure of impact to system function in fault mode.
• Rapidity is the time taken to recover to normal mode of operation (time to discover, time to isolate, time to fix and time to recover).
The system function, or quality of a software system, is defined by performance and functional availability. Functional availability covers the accuracy of all functional and data components that could be measured at every business outcome level. Criticality, the volume of users accessing the functionality, end-to-end completeness and performance are critical factors in measuring availability.
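The relation Resilience = f (Robustness, Rapidity) leaves the function f unspecified. The sketch below shows one possible choice, purely as an assumption: rapidity is the sum of the four recovery phases named above, robustness is modeled as the fraction of system function retained in fault mode, and the combined score is the lost function multiplied by the time spent in fault mode. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disturbance:
    function_retained: float  # fraction of normal function kept in fault mode (0..1)
    time_to_discover: float   # all durations in minutes
    time_to_isolate: float
    time_to_fix: float
    time_to_recover: float

    @property
    def robustness(self) -> float:
        """Assumed proxy: the share of system function that survives the fault."""
        return self.function_retained

    @property
    def rapidity(self) -> float:
        """Total time to return to normal mode, summed over the four phases."""
        return (self.time_to_discover + self.time_to_isolate
                + self.time_to_fix + self.time_to_recover)


def resilience_loss(d: Disturbance) -> float:
    """Illustrative f: lost function times time in fault mode (lower is better)."""
    return (1.0 - d.robustness) * d.rapidity
```

With this choice, a system improves its score either by retaining more function during a fault or by shortening any of the four recovery phases.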
Resilience improvement can also be measured by an increase in MTTF and a reduction in the following KPI metrics over time:
• Mean Time to Discover (MTTD)
• Mean Time to Recovery (MTTR)
• Recovery Point Objective (RPO)
• Recovery Time Objective (RTO)
• Number of Failures/Bugs
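Two of these metrics, MTTD and MTTR, can be derived from incident timestamps. The sketch below is a minimal illustration: the `Incident` record and its field names are hypothetical, and both metrics are assumed to be measured from the start of the incident (some teams measure MTTR from the moment of detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when it was discovered
    recovered: datetime  # when normal operation resumed


def mean_time_to_discover(incidents: list[Incident]) -> timedelta:
    """Average delay between the start of a fault and its detection."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average delay between the start of a fault and full recovery."""
    seconds = mean((i.recovered - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)
```

Both functions raise `statistics.StatisticsError` on an empty list. Computing them per quarter or per release makes the trend over time visible.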
A number of massive failures occur because crucial design flaws are discovered too late. Only after programmers begin building the code do they discover the inadequacy of their designs. Sometimes a fatal inconsistency or omission is at fault, but more often the overall design is vague and poorly thought out. As the code grows with the addition of piecemeal fixes, a detailed design structure does emerge, but it is a design full of special cases and loopholes, without coherent principles. As in a building, when the software’s foundation is unsound, the resulting structure is unstable.
Security Reviewer Software Resiliency Analysis indicates programming best practices that make software bullet-proof, more robust and secure.
From Fragile to Anti-Fragile
An application’s resilience tends to fall along a spectrum from ‘fragile’ to ‘antifragile’, with everything in between:
Fragile applications are not prepared for change. They are brittle: they may break if an API returns an unexpected response or if load increases. It may be very difficult to modify such software without changing or breaking many features, and deploying new changes may be difficult or slow. Fragile applications are tightly coupled to their dependencies.
On the other end of the spectrum, antifragile applications thrive under chaos. They become stronger with uncertainty, randomness, errors, and volatility. They know how to make autonomous repairs and decisions when production is affected. These applications welcome change and leverage continuous deployment to allow teams to react on the spot. These systems have a human element as well: antifragile software is built in a culture that welcomes change and experimentation, and that uses failures as lessons for how to improve.
For those reasons a dedicated Software Resiliency Index was created. Refer to the thresholds listed earlier for the possible values.
Root Cause Analysis
Root Cause Analysis (RCA), a common practice throughout the software industry, does not provide any value in preventing future incidents in complex software systems. Instead, it reinforces hierarchical structures that confer blame, and this inhibits learning, creativity, and psychological safety. In short, RCA is an inhumane practice.
A common misconception that encourages the embrace of RCA is that without understanding the ‘root cause,’ you can’t fix what is wrong. Let’s differentiate between the ‘root cause’ and Least Effort to Remediate (LER). As long as an incident is ongoing, LER is absolutely the right thing to pursue. When the building is on fire, put the fire out as quickly as possible. Unfortunately, it is all too common to assume that whatever undoes the damage must also point to what caused it. One branch of research in particular, Resilience Engineering, has made its way into the software industry.
Complex software systems will not lose their complexity. RCA will only hinder our ability to navigate them with fairness and grace. Resilience Engineering may hold the key to taking back our dignity as knowledge workers, overcoming naive notions like ‘root cause,’ and learning to navigate our complex systems better.
Despite ever-accelerating technology advances over the past decade, achieving application stability is still fraught with complexity and prohibitively high costs. Resilient applications provide much more than reliability: by predicting failure paths and designing for failure, they help build a continuous innovation process, letting teams move fast, stay ahead of availability goals, build always-available systems and provide continuous service to customers. To get there, IT organizations must assess the current cultural situation and change behavior and mindsets to enable their business, processes, tools and technology to deliver enhanced quality of service to users inside and outside the enterprise.
After the implementation of resilience features in one IT organization, the following benefits were measured:
• 20% reduction in the number of non-functional incidents, year to date (YTD).
• 32% reduction in incident duration (MTTR), YTD.
COPYRIGHT (C) 2014-2022 SECURITY REVIEWER SRL. ALL RIGHTS RESERVED.