
Risk-First Analysis Framework





Operational Risk

“The risk of loss resulting from inadequate or failed internal processes, people and systems or from external events.” - Operational Risk, Wikipedia

In this section we’re going to start considering the realities of running software systems in the real world.

There is a lot to this subject, so this section really offers just a taster: we’re going to set the scene by looking at what constitutes an Operational Risk, and then look at the related discipline of Operations Management. Following this background, we’ll apply the Risk-First model and have a high-level look at the various mitigations for Operational Risk.

Operational Risks

When building software, it’s tempting to take a very narrow view of the dependencies of a system, but Operational Risks are often caused by dependencies we don’t consider - i.e. the Operational Context within which the system is operating. Here are some examples:

This is a long laundry-list of everything that can go wrong due to operating in “The Real World”. Although we’ve spent a lot of time looking at the varieties of Dependency Risk on a software project, with Operational Risk we have to consider that these dependencies will fail in any number of unusual ways, and we can’t be ready for all of them. Nevertheless, preparing for this comes under the umbrella of Operations Management.

Operations Management

If we are designing a software system to “live” in the real world, we have to be mindful of the Operational Context we’re working in, and craft our software and processes accordingly. This view of the “wider” system is the discipline of Operations Management.

“Operations management is an area of management concerned with designing and controlling the process of production and redesigning business operations in the production of goods or services. It involves the responsibility of ensuring that business operations are efficient in terms of using as few resources as needed and effective in terms of meeting customer requirements.” - Operations Management, Wikipedia

Model of Operations Management, inspired by the work of Slack et al.

The diagram above is a Risk-First interpretation of Slack et al.’s model of Operations Management. This model breaks down some of the key abstractions of the discipline:

The Operational Context supplies the Transform Process with three key dependencies:

Risk-First Operations Management: Taking Action, inspired by the work of Slack et al.

We have looked at processes like the Transform Process in the section on Process Risk. The healthy functioning of this process is the domain of Operations Management. As the above diagram shows (again, modified from Slack et al.), this involves the following types of actions:

Let’s look at each of these actions in turn.

Control

Control, Monitoring And Detection

Since humans and machines have different areas of expertise, and because Operational Risks are often novel, it’s often not optimal to try and automate everything. A good operation will consist of a mix of human and machine actors, each playing to their strengths (see the table below).

The aim is to build a human-machine operational system that is Homeostatic. Homeostasis is the tendency of living things to maintain an equilibrium (for example, body temperature or blood glucose levels), but the idea also applies to systems at any scale. The key to homeostasis is to build systems with feedback loops, even though this leads to more complex systems overall. The diagram above shows some of the actions involved in these kinds of feedback loops.

| Humans Are… | Machines Are… |
|---|---|
| Good at novel situations | Good at repetitive situations |
| Good at adaptation | Good at consistency |
| Expensive at scale | Cheap at scale |
| Reacting and Anticipating | Recording |
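
As a rough illustration of a homeostatic feedback loop with a human in it, the sketch below adjusts a pool of workers to keep a queue backlog near a target. The machine makes the routine, repetitive correction, and the loop flags a human when demand goes beyond what the automation is allowed to handle. Everything here (the `Decision` type, `control_step`, the worker and queue model, and all the default numbers) is a hypothetical example for this section, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    workers: int    # worker count the controller wants for the next period
    escalate: bool  # True when a human should be asked to look at the situation


def control_step(queue_depth: int, workers: int,
                 target_per_worker: int = 10, max_step: int = 2,
                 min_workers: int = 1, max_workers: int = 20) -> Decision:
    """One pass of a feedback loop: measure, compare with the target, correct.

    All names and default values are illustrative assumptions.
    """
    # Workers needed to bring the backlog per worker back to the target
    # (ceiling division, so a partial batch still gets a worker).
    needed = -(-queue_depth // target_per_worker)

    # Move towards the needed level by at most max_step per pass,
    # which damps the loop and avoids wild swings.
    step = max(-max_step, min(max_step, needed - workers))
    desired = max(min_workers, min(max_workers, workers + step))

    # Routine corrections are automated. Demand beyond the configured
    # ceiling is a novel situation, so it is handed to a human.
    return Decision(workers=desired, escalate=needed > max_workers)
```

With these assumed defaults, `control_step(120, 4)` asks for 6 workers without escalating, while `control_step(500, 20)` stays at the ceiling of 20 workers and sets `escalate`, leaving the unusual case to a person.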

As we saw in Map and Territory Risk, it’s very easy to fool yourself, especially around Key Performance Indicators (KPIs) and metrics. Large organisations have Audit functions precisely to guard against their own failing internal processes and Agency Risk. Audits could be around software tools, processes, practices, quality and so on. Practices such as Continuous Improvement and Total Quality Management also figure here.

Scanning The Operational Context

There are plenty of Hidden Risks within the environment the operation exists within, and these change all the time in response to economic, legal or political change. In order to manage a risk, you have to uncover it, so part of Operations Management is to look for trouble:

Planning

Forecasting and Planning Actions

In order to control an operation, we need targets and plans to control against. For a system to run well, it needs to carefully manage unreliable dependencies, and ensure their safety and availability. Taking humans as an example, it’s the difference between Hunter-Gathering (picking up food where we find it) and Agriculture (controlling the environment and the resources to grow crops).

As the diagram above shows, we can bring Planning to bear on dependency management, and this usually falls to the more human end of the operation.
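
To make the idea of forecasting a dependency a little more concrete, here is a minimal sketch that fits a straight-line trend to daily disk usage and estimates how many days remain before a capacity limit is reached. The function name `days_until_full`, the choice of disk space as the dependency, and the linear trend itself are all assumptions made for illustration; a real operation would pick a model suited to its own data.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage_gb: Sequence[float],
                    capacity_gb: float) -> Optional[float]:
    """Estimate days left before usage reaches capacity.

    Uses an ordinary least-squares line through one observation per day.
    Returns None when usage is flat or shrinking (no exhaustion forecast).
    Hypothetical helper, for illustration only.
    """
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two daily observations")

    mean_x = (n - 1) / 2               # mean of day indices 0..n-1
    mean_y = sum(daily_usage_gb) / n
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in enumerate(daily_usage_gb))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance      # growth in GB per day
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    # Day index where the fitted line crosses capacity, measured
    # relative to the most recent observation (day n - 1).
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - (n - 1))
```

A forecast like this gives the planning loop a target to control against: if the estimate drops below, say, the lead time for buying more storage, that is a prompt for a human to act.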

Design

Design and Change Activities

Since our operation exists in a world of risks like Red Queen Risk and Feature Drift Risk, we would expect our Planning actions to result in changes to our operation.

While planning is a day-to-day operational feedback loop, design is a longer feedback loop changing not just the parameters of the operation, but the operation itself.

You might think that for an IT operation, tasks like Design belong within a separate “Development” function within an organisation. Traditionally, this might have been the case. However, separating Development from Operations implies Boundary Risk between these two functions. For example, the developers might employ different tools, equipment and processes to the operations team, resulting in a mismatch when software is delivered.

In recent years, the DevOps movement has brought this Boundary Risk into sharper focus. This specifically means:

Improvement

No system can be perfect, and after it meets the real world, we will want to improve it over time. But Operational Risk includes an element of Trust & Belief Risk: we have a reputation and the good will of our customers to consider when we make improvements. Because reputation is very hard to rebuild once lost, we should consider it before releasing software that might not live up to expectations.

So there is a tension between “you only get one chance to make a first impression” and “gilding the lily” (perfectionism). In the past I’ve seen this stated as:

“Pressure to ship vs pressure to improve”

Balance of Risks from Delivering Software

A Risk-First re-framing of this (as shown in the diagram above) might be the balance between:

The “should we ship?” decision is therefore a complex one. In Meeting Reality, we discussed that it’s better to do this “sooner, more frequently, in smaller chunks and with feedback”. We can meet Operational Risk on our own terms by doing so:

| Meet Reality… | Techniques |
|---|---|
| Sooner | Quality Control Processes, Limited Early-Access Programs, Beta Programs, Soft Launches, Business Continuity Testing |
| More Frequently | Continuous Delivery, Sprints |
| In Smaller Chunks | Modular Releases, Microservices, Feature Toggles, Trial Populations |
| With Feedback | User Communities, Support Groups, Monitoring, Logging, Analytics |
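
Two of the techniques in the table, Feature Toggles and Trial Populations, are often combined as a percentage rollout. The sketch below shows one possible way to do this, assuming string user IDs and a hash-based bucketing scheme; the function name `in_trial_population` and the bucketing approach are illustrative choices, not a description of any particular toggle library.

```python
import hashlib


def in_trial_population(user_id: str, feature: str,
                        rollout_percent: int) -> bool:
    """Return True if this user falls inside the feature's trial population.

    Hypothetical helper: each (feature, user) pair is hashed into one of
    100 buckets, so a given user always gets the same answer, and raising
    rollout_percent only ever adds users to the trial.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")

    # Hash feature and user together so different features pick
    # different trial populations.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Starting a feature at a small percentage and widening it as monitoring and user feedback come in is one way of meeting reality “in smaller chunks and with feedback”, while limiting how many customers are exposed if the release disappoints.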

End Of The Road

In a way, actions like Design and Improvement bring us right back to where we started from: identifying Dependency Risks, Feature Risks and Complexity Risks that hinder our operation, and mitigating them through actions like software development.

Our safari of risk is finally complete. It’s time to look back at what we’ve seen in Staging and Classifying.