Risk-First Analysis Framework
As in Agency Risk, we are going to use the term agent, which refers to anything with agency in a system to decide its own fate. That is, an agent has an Internal Model, and can take actions based on it. Here, we work on the assumption that the agents are working towards a common Goal, even though in reality it’s not always the case, as we saw in the section on Agency Risk.
Coordination Risk is the risk that a group of people or processes (maybe with a common Goal In Mind) can fail to coordinate to meet this goal and end up making things worse. Coordination Risk is embodied in the phrase “Too Many Cooks Spoil The Broth”: more people, opinions or agents often make results worse.
In this section, we’ll first build up a model of Coordination Risk, describing exactly what coordination means and why we do it. Then, we’ll look at some classic problems of coordination. Then, we’re going to consider agency at several different levels (because of Scale Invariance). We’ll look at:
… and we’ll consider how Coordination Risk is a problem at each scale.
But for now, let’s crack on and examine where Coordination Risk comes from.
Earlier, in Dependency Risk, we looked at various resources (time, money, people, events etc) and showed how we could depend on them, taking on risk. Here, however, we’re looking at the situation where there is competition for those dependencies, that is, Scarcity Risk: other parties want to use them in a different way.
One argument for coordination could come from Diminishing Returns, which says that the earlier units of a resource (say, chocolate bars) give you more benefit than later ones.
We can see this in the chart above. Let’s say A and B compete over a resource, of which there are 5 units available. For every extra A takes, B loses one. The X axis shows A’s consumption of the resource, so the biggest benefit to A is in the consumption of the first unit.
As you can see, by sharing, it’s possible that the total benefit is greater than it can be for either individual. But sharing requires coordination. Further, the more competitors involved, the worse a winner-take-all outcome is for total benefit.
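To put some numbers on this, here is a minimal sketch. It assumes (and this is our assumption, not something read off the chart) that the n-th unit of the resource is worth 1/n to whoever consumes it. Any curve that diminishes would give the same shape; the names `benefit` and `total_benefit` are made up for illustration.

```python
from fractions import Fraction

TOTAL_UNITS = 5  # units of the contested resource, as in the example above


def benefit(units: int) -> Fraction:
    """Benefit of consuming `units` units, assuming the n-th unit is worth 1/n.

    The 1/n curve is an illustrative assumption, chosen only because it
    diminishes.
    """
    return sum((Fraction(1, n) for n in range(1, units + 1)), Fraction(0))


def total_benefit(a_units: int) -> Fraction:
    """Combined benefit when A takes `a_units` and B takes the rest."""
    return benefit(a_units) + benefit(TOTAL_UNITS - a_units)


# Combined benefit for every possible split between A and B.
splits = {a: total_benefit(a) for a in range(TOTAL_UNITS + 1)}
best_split = max(splits, key=splits.get)

for a, value in splits.items():
    print(f"A={a} B={TOTAL_UNITS - a} total={float(value):.2f}")
```

Under this assumed curve, the winner-take-all splits (A=0 or A=5) give the lowest combined benefit, and the near-even splits give the highest.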
Just two things are needed for competition to occur:
The only way that the agents can move away from competition towards coordination is via Communication, and this is where their coordination problems begin.
Coordination Risk commonly occurs where people have different ideas about how to achieve a goal, and they have different ideas because they have different Internal Models. As we saw in the section on Communication Risk, we can only hope to synchronise Internal Models if there are high-bandwidth Channels available for communication.
You might think, therefore, that this is just another type of Communication Risk problem, and that’s often a part of it, but even with synchronized Internal Models, coordination risk can occur. Imagine the example of people all trying to madly leave a burning building. They all have the same information (the building is on fire). If they coordinate, and leave in an orderly fashion, they might all get out. If they don’t, and there’s a scramble for the door, more people might die.
Let’s unpack this idea, and review some classic problems of coordination, none of which can be addressed without good communication. Here are some examples:
Merging Data: if you are familiar with the source code control system, Git, you will know that this is a distributed version control system. That means that two or more people can propose changes to the same files without knowing about each other. This means that at some later time, Git then has to merge (or reconcile) these changes together. Git is very good at doing this automatically, but sometimes, different people can independently change the same lines of code and these will have to be merged manually. In this case, a human arbitrator “resolves” the difference, either by combining the two changes or picking a winner.
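The idea of an automatic merge that sometimes needs a human can be sketched in a few lines. This is a toy, not Git's algorithm: it assumes all three versions have the same number of lines, so it only handles in-place edits, and the function name `three_way_merge` and the conflict marker format are invented for illustration.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edits of `base` line by line.

    Simplifying assumption: all three versions have the same number of
    lines, so only in-place changes are handled. Real merge tools also
    align inserted and deleted lines.
    Returns (merged_lines, conflicting_line_numbers).
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles same-length versions")
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:      # both sides agree (same change, or no change)
            merged.append(o)
        elif o == b:    # only theirs changed this line
            merged.append(t)
        elif t == b:    # only ours changed this line
            merged.append(o)
        else:           # both changed it differently: needs a human
            merged.append(f"<<< {o} ||| {t} >>>")
            conflicts.append(number)
    return merged, conflicts


base = ["a = 1", "b = 2", "c = 3"]
ours = ["a = 10", "b = 2", "c = 30"]
theirs = ["a = 1", "b = 20", "c = 33"]
merged, conflicts = three_way_merge(base, ours, theirs)
print(merged, conflicts)
```

Lines 1 and 2 merge automatically because only one side touched each. Line 3 was changed by both sides, so it is flagged for the human arbitrator.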
Factions: sometimes, it’s hard to coordinate large groups at the same time, and “factions” can occur. That the world isn’t a single big country is probably partly a testament to this: countries are frequently separated by geographic features that prevent the easy flow of communication (and force). We can also see this in distributed systems, with the “split brain” problem. This is where a subset of the total system becomes disconnected (usually due to a network failure), and you end up with two smaller networks with different knowledge. We’ll address this in more depth later.
Resource Allocation: ensuring that the right people are doing the right work, or the right resources are given to the right people is a coordination issue. On a grand scale, we have Logistics, and Economic Systems. On a small scale, the office’s room booking system solves the coordination issue of who gets a meeting room using a first-come-first-served booking algorithm.
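As a small illustration of the room-booking case, here is a sketch of first-come-first-served allocation. The class `RoomBooking` and its whole-hour slots are hypothetical, chosen only to keep the example short.

```python
class RoomBooking:
    """First-come-first-served booking for one room, in whole-hour slots."""

    def __init__(self):
        self._slots = {}  # hour -> name of whoever booked it first

    def book(self, hour: int, who: str) -> bool:
        """Return True if `who` got the slot, False if it was already taken."""
        if hour in self._slots:
            return False
        self._slots[hour] = who
        return True

    def holder(self, hour: int):
        """Return who holds the slot, or None if it is free."""
        return self._slots.get(hour)


room = RoomBooking()
print(room.book(10, "alice"))  # alice asks first
print(room.book(10, "bob"))    # bob asks second for the same hour
print(room.holder(10))
```

The rule is crude (it ignores who needs the room most), but everyone can predict its outcome, which is what makes it work as a coordination protocol.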
Deadlock refers to a situation where, in an environment where multiple parallel processes are running, the processing stops and no-one can make progress because the resources each process needs are being reserved by another process. This is a specific issue in Resource Allocation, but it’s one we’re familiar with in the computer science industry. Compare with Gridlock, where traffic can’t move because other traffic is occupying the space it wants to move to already.
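One common way to picture deadlock is as a cycle in a "wait-for" graph: P1 waits for something P2 holds, while P2 waits for something P1 holds. The sketch below assumes each process waits on at most one other process; the function name `find_deadlock` is ours.

```python
def find_deadlock(waits_for):
    """Return the processes forming a cycle in the wait-for graph, or None.

    `waits_for` maps each blocked process to the single process holding
    the resource it needs (a simplifying assumption of this sketch).
    """
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a process.
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            return seen[seen.index(current):]
    return None


print(find_deadlock({"P1": "P2", "P2": "P1"}))  # each waits on the other
print(find_deadlock({"P1": "P2", "P2": "P3"}))  # P3 is free to finish
```

In the second case, P3 is not waiting on anyone, so it can finish and release what P2 needs: a queue rather than a deadlock.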
Race Conditions are where we can’t be sure of the result of a calculation, because it is dependent on the ordering of events within a system. For example, two separate threads writing the same memory at the same time (one ignoring and over-writing the work of the other) is a race.
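Real races are hard to reproduce on demand, so this sketch simulates one deterministically. Each "thread" is a Python generator that pauses between reading and writing a shared counter, and a hypothetical `run` function steps the threads in whatever order we give it.

```python
def increment(shared):
    """One 'thread': read the counter, then write back the value plus one.

    Written as a generator so a scheduler can pause it between the read
    and the write, which is where a real thread could be pre-empted.
    """
    value = shared["counter"]      # step 1: read
    yield
    shared["counter"] = value + 1  # step 2: write
    yield


def run(schedule):
    """Run two increment 'threads', stepping them in the given order."""
    shared = {"counter": 0}
    threads = {"T1": increment(shared), "T2": increment(shared)}
    for name in schedule:
        next(threads[name])
    return shared["counter"]


print(run(["T1", "T1", "T2", "T2"]))  # one thread after the other
print(run(["T1", "T2", "T1", "T2"]))  # reads interleaved before writes
```

Both schedules run exactly the same steps. Only the ordering differs, and in the second one T2 over-writes T1's work, so one increment is lost.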
Within a team, Coordination Risk is at its core about resolving Internal Model conflicts in order that everyone can agree on a Goal In Mind and cooperate on getting it done. Therefore, Coordination Risk is worse on projects with more members, and worse in organizations with more staff.
As an individual, do you suffer from Coordination Risk at all? Maybe: sometimes, you can feel “conflicted” about the best way to solve a problem. And weirdly, usually not thinking about it helps. Sleeping too. (Rich Hickey calls this “Hammock Driven Development”). This is probably because, unbeknownst to you, your subconscious is furiously communicating internally, trying to resolve these conflicts itself, and will let you know when it has come to a resolution.
Vroom and Yetton introduced a model of group decision making which delineated five different styles of decision making within a team. These are summarised in the table below (AI, AII, CI, CII, GII). To this, I have added a sixth (UI), which is the uncoordinated option, where everyone competes. The diagram above illustrates these, with the following conventions:
| Type | Description | Decision Makers | Opinions | Channels | Risk |
|------|-------------|-----------------|----------|----------|------|
| UI | Uncoordinated | 1 | 1 | 0 | Competition |
| AI | Autocratic | 1 | 1 | s | Maximum Coordination Risk |
| AII | Autocratic (with upward information flow) | 1 | 1 | s | |
| CI | Consultative (Individual) | 1 | 1 + s | 2s | |
| CII | Consultative (Group) | 1 | 1 + s | s² | |
| GII | Group Consultation and Voting | 1 + s | 1 + s | s² | Maximum Communication Risk, Schedule Risk |

s = subordinate
At the top, you have the least consultative styles, and at the bottom, the most. At the top, decisions are made with just the leader’s Internal Model but moving down, the Internal Models of the subordinates are increasingly brought into play.
The decisions at the top are faster, but don’t do much for mitigating Coordination Risk. The ones below take longer (incurring Schedule Risk), but mitigate more Coordination Risk. Group decision-making inevitably involves everyone learning, and improving their Internal Models.
The trick is to be able to tell which approach is suitable at which time. Everyone is expected to make decisions within their realm of expertise: you can’t have developers continually calling meetings to discuss whether they should be using an Abstract Factory or a Factory Method, as this would waste time. The critical question is therefore, “what’s the biggest risk?”
So organisation can reduce Coordination Risk but to make this work we need more communication, and this has attendant complexity and time costs.
Staff in a team have a dual nature: they are Agents and Resources at the same time. The team depends on staff for their resource of labour, but they’re also part of the decision making process of the team, because they have agency over their own actions.
Part of Coordination Risk is about trying to mitigate differences in Internal Models. So it’s worth considering how varied people’s models can be:
The job of harmonising this on a project would seem to fall to the team leader, but actually people are self-organising to some extent. This process is called Team Development:
“The forming–storming–norming–performing model of group development was first proposed by Bruce Tuckman in 1965, who said that these phases are all necessary and inevitable in order for the team to grow, face up to challenges, tackle problems, find solutions, plan work, and deliver results.” - Tuckman’s Stages Of Group Development, Wikipedia
Specifically, this describes a process whereby a new group will form and then be required to work together. In the process, they will have many disputes. Ideally, the group will resolve these disputes internally and emerge as a team, with a common Goal In Mind.
Since Coordination is about Resource Allocation, the skills of staff can potentially be looked at as resources to allocate. This means handling Coordination Risk issues like:
“As a rough rule, three programmers organised into a team can do only twice the work of a single programmer of the same ability - because of time spent on coordination problems.” - Gerald Weinberg, The Psychology of Computer Programming
Vroom and Yetton’s organisational model isn’t relevant to just teams of people. We can see it in the natural world too. Although the majority of cellular life on earth (by weight) is single celled organisms, the existence of humans (to pick a single example) demonstrates that sometimes it’s better to try to mitigate Coordination Risk and work as a team, accepting the Complexity Risk and Communication Risk this entails. For example, in the human body, we have:
There is huge attendant Coordination Risk to running a complex multi-cellular system like the human body, but given the success of humanity as a species, you must conclude that these steps on the evolutionary Risk Landscape have benefited us in our ecological niche.
The key observation from looking at biology is this: most of the cells in the human body don’t get a vote. Muscles in the motor system have an AI or AII relationship with the brain - they do what they are told, but there are often nerves to report pain back. The only place where CII or GII could occur is in our brains, when we try to make a decision and weigh up the pros and cons.
This means that there is a deal: most of the cells in our body cede control of their destiny to “the system”. Living within the system of the human body is a better option than going it alone as a single-celled organism. Occasionally, due to mutation, we can end up with Cancer, which is where one cell genetically “forgets” its purpose in the whole system and goes back to selfish individual self-replication (UI). We have White Blood Cells in the body to shut down this kind of behaviour and try to kill the rogue cells. In the same way, societies have police forces to stop undesirable behaviour amongst their citizens.
Working in a large organisation often feels like being a cell in a larger organism. Cells live and die and the organism goes on. Workers come and go from a large company but the organisation goes on. By working in an organisation, we give up self-control and competition and accept AI and AII power structures above us, but we trust that there is symbiotic value creation on both sides of the employment deal.
Less consultative decision making styles are more appropriate, then, when we don’t have the luxury of high-bandwidth channels for discussion. When the number of parties rises above a room-full of people it’s not possible to hear everyone’s voice. As you can see from the table above, for CII and GII decision-making styles, the amount of communication increases non-linearly with the number of participants, so we need something simpler.
As we saw in the Complexity Risk section, hierarchies are an excellent way of economising on the number of different communication channels, and we use these frequently when there are lots of parties to coordinate.
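To see how quickly this grows, here is a small sketch that evaluates the Channels column of the table for a few team sizes. The formulas are copied from the table; the function name `channels` is ours, for illustration only.

```python
def channels(style: str, s: int) -> int:
    """Communication channels used by a decision style, for s subordinates.

    The formulas are copied from the Channels column of the table.
    """
    formulas = {
        "UI": lambda s: 0,
        "AI": lambda s: s,
        "AII": lambda s: s,
        "CI": lambda s: 2 * s,
        "CII": lambda s: s ** 2,
        "GII": lambda s: s ** 2,
    }
    return formulas[style](s)


for s in (3, 10, 50):
    print(s, {style: channels(style, s) for style in ("AI", "CI", "GII")})
```

With 50 subordinates, the autocratic styles need 50 channels and CI needs 100, but CII and GII need 2,500.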
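As a rough comparison, suppose n parties either all talk directly to each other, or are arranged in a tree where each party talks only to its single superior. The two counts below are standard graph facts; the function names are hypothetical.

```python
def mesh_channels(n: int) -> int:
    """Channels needed if every one of n parties talks to every other."""
    return n * (n - 1) // 2


def hierarchy_channels(n: int) -> int:
    """Channels needed if n parties form a tree (one link to a superior)."""
    return max(n - 1, 0)


for n in (5, 20, 100):
    print(n, mesh_channels(n), hierarchy_channels(n))
```

The saving is not free: in the tree, a message between two leaves has to be relayed through intermediaries, each with their own Internal Model.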
In large organisations, teams are created and leaders chosen for those teams precisely to mitigate this Communication Risk. We’re all familiar with this: control of the team is ceded to the leader, who takes on the role of ‘handing down’ direction from above, but also ‘reporting up’ issues that cannot be resolved within the team. In Vroom and Yetton’s model, this is moving from a GII or CII to an AI or AII style of leadership.
Clearly, this is just a model: it’s not set in stone, and decision making styles usually change from day-to-day and decision to decision. The same is not true in our software - rules are rules.
It should be pretty clear that we are applying our Scale Invariance rule to Coordination Risk: all of the problems we’ve described as affecting teams and organisations also affect software, although the scale and terrain are different. Software processes have limited agency - in most cases they follow fixed rules set down by the programmers, rather than self-organising like people can (so far).
As before, in order to face Coordination Risk in software, we need multiple agents all working together. Coordination Risks (such as race conditions or deadlock) only really occur where more than one agent is working at the same time. This means we are considering at least multi-threaded software, and anything above that (multiple CPUs, servers, data-centres and so on).
Imagine talking to a distributed database, where your request (read or write) can be handled by one of many agents.
In the diagram above, we have just two agents, 1 and 2, in order to keep things simple. User A writes something to the database, then User B reads it back afterwards.
According to the CAP Theorem, there are three properties we could desire in such a system:
The CAP Theorem states that this is a Trilemma. That is, you can only have two out of the three properties.
There are plenty of resources on the Internet that discuss this in depth, but let’s just illustrate with some diagrams to show how this plays out. In the diagram above, we can see a 2-agent distributed database. Either agent can receive a read or write. So this might be a GII decision making system, because all the agents are going to need to coordinate to figure out what the right value is to return for a read, and what the last value written was.
In the above diagram, you can already see that there is a race condition: if A and B both make their requests at the same time, what will B get back? The original value of X, or the new value?
Here, we are going to consider what happens when communication breaks down between Agents 1 and 2. That is, they are isolated from communicating with each other. As shown in the above diagram, in an AP system, we have a database that is able to survive partitioning, and always returns a response, but may not be consistent. The value B will get back will depend on whether they talk with Agent 1 or Agent 2.
To be consistent, Agent 2 needs to check with Agent 1 to make sure it has the latest value for X. Where Agent 2 is left waiting for Agent 1 to re-appear, we are blocked. So CP systems will block when partitioned.
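The difference between the AP and CP behaviours can be sketched with a toy two-agent store. Everything here is hypothetical (the `Agent` class, the `connected` flag, the `Partitioned` exception), and where a real CP system would block, this sketch raises an exception so that the example terminates.

```python
class Partitioned(Exception):
    """Raised by a CP agent that cannot confirm it has the latest value."""


class Agent:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode       # "AP" or "CP"
        self.value = None
        self.peer = None
        self.connected = True  # False while the link to the peer is down

    def write(self, value):
        if self.connected:
            self.value = value
            self.peer.value = value  # replicate to the other agent
        elif self.mode == "AP":
            self.value = value       # accept locally; the agents diverge
        else:
            raise Partitioned(f"{self.name} cannot replicate the write")

    def read(self):
        if self.connected or self.mode == "AP":
            return self.value        # AP: may be stale during a partition
        raise Partitioned(f"{self.name} cannot check with its peer")


def make_pair(mode):
    """Create two agents of the same mode that replicate to each other."""
    one, two = Agent("Agent 1", mode), Agent("Agent 2", mode)
    one.peer, two.peer = two, one
    return one, two


def partition(one, two):
    """Cut the link between the two agents."""
    one.connected = two.connected = False


# AP: A writes via Agent 1 during a partition, B reads via Agent 2.
ap1, ap2 = make_pair("AP")
ap1.write("old")
partition(ap1, ap2)
ap1.write("new")
print(ap1.read(), ap2.read())  # the two agents now disagree

# CP: the same read is refused rather than answered with stale data.
cp1, cp2 = make_pair("CP")
cp1.write("old")
partition(cp1, cp2)
try:
    cp2.read()
except Partitioned as error:
    print("blocked:", error)
```

The AP pair always answers but the two agents disagree about X. The CP pair never disagrees but stops answering until the link returns.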
Finally, if we have a CA system, we are essentially saying that only one agent is doing the work. (You can’t partition a single agent, after all). But this leads to Resource Allocation and Contention around use of the scarce resource of Agent 2’s attention. (Both Coordination Risk issues we met earlier.)
This sets a lower bound on Coordination Risk: we can’t get rid of it completely in a software system, or a system on any other scale. Fundamentally, coordination problems are inescapable at some level. The best we can do is mitigate it by agreeing on protocols and doing lots of communication.
Let’s look at some real-life examples of how this manifests in software.
First, ZooKeeper is an Open-Source datastore, which is used in building distributed systems (like the one above) and ensuring things like configuration information are consistent across all agents.
This seems trivial, but it quickly gets out-of-hand: what happens if only some of the agents receive the new information? What happens if a datacentre gets disconnected while the update is happening? There are lots of edge-cases.
ZooKeeper handles this by communicating inter-agent with its own protocol. It elects a master agent (via GII-style voting), turning it into an AI-style team. If the master is lost for some reason, a new leader is elected. Writes are then coordinated via the master agent who makes sure that a majority of agents have received and stored the configuration change before telling the user that the transaction is complete. Therefore, ZooKeeper is a CP system.
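The "majority before success" rule can be sketched as follows. This is not ZooKeeper's actual protocol or API: the `Master` and `Follower` classes are invented for illustration, there is no leader election, and a failed write is not rolled back on the followers that did store it.

```python
class Follower:
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.config = {}

    def store(self, key, value) -> bool:
        """Store the change and acknowledge it, unless unreachable."""
        if not self.reachable:
            return False
        self.config[key] = value
        return True


class Master:
    def __init__(self, followers):
        self.followers = followers
        self.config = {}

    def write(self, key, value) -> bool:
        """Report success only if a majority of all agents hold the change."""
        cluster_size = len(self.followers) + 1  # followers plus the master
        # The master counts as one acknowledgement; True sums as 1.
        acks = 1 + sum(f.store(key, value) for f in self.followers)
        if acks * 2 > cluster_size:
            self.config[key] = value
            return True
        return False


# Five agents in total; two followers are cut off by a network failure.
followers = [Follower(), Follower(), Follower(False), Follower(False)]
print(Master(followers).write("timeout", "30s"))  # 3 of 5 is a majority
```

If a third follower were unreachable, only 2 of 5 agents could store the change, and the write would be refused: the system gives up availability to stay consistent.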
Second, Git is a (mainly) write-only ledger of source changes. However, as we already discussed above, where different agents make incompatible changes, someone has to decide how to resolve the conflicts so that we have a single source of truth.
The Coordination Risk just doesn’t go away.
Since multiple users can make all the changes they like locally, and merge them later, Git is an AP system where everyone’s opinion counts (GII): individual users may have wildly different ideas about what the source looks like until the merge is complete.
Finally, Bitcoin (BTC) is a write-only distributed ledger, where agents compete to mine BTC (a UI style organisation), but also at the same time record transactions on the ledger. BTC is also AP, in a similar way to Git. But new changes can only be appended if you have the latest version of the ledger. If you append to an out-of-date ledger, your work will be lost.
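The rule that you can only append to the latest version of the ledger can be sketched with a simple hash chain. This is a deliberately stripped-down illustration, not Bitcoin's protocol: there is no proof-of-work, no networking and no fork resolution, and the `Ledger` class and `block_hash` helper are hypothetical.

```python
import hashlib


def block_hash(previous_hash: str, payload: str) -> str:
    """Hash a block together with the hash of the block before it."""
    return hashlib.sha256(f"{previous_hash}:{payload}".encode()).hexdigest()


class Ledger:
    def __init__(self):
        # Each entry is (payload, hash); the chain starts from a fixed block.
        self.blocks = [("genesis", block_hash("", "genesis"))]

    @property
    def head(self) -> str:
        """Hash of the latest block: what a new block must build on."""
        return self.blocks[-1][1]

    def append(self, payload: str, based_on: str) -> bool:
        """Append only if the writer built on the current head."""
        if based_on != self.head:
            return False  # writer was out of date; its work is discarded
        self.blocks.append((payload, block_hash(based_on, payload)))
        return True


ledger = Ledger()
seen_by_both = ledger.head  # two agents start from the same version
print(ledger.append("tx-from-alice", seen_by_both))  # first to finish wins
print(ledger.append("tx-from-bob", seen_by_both))    # now out of date
```

The second agent did the same amount of work as the first, but because the head moved in the meantime, that work is thrown away.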
Because it’s based on outright competition, if someone beats you to completing a mining task, then your work is wasted. So, there is huge Coordination Risk.
For this reason, BTC agents coordinate into mining consortia, so they can avoid working on the same tasks at the same time, turning it into a CI-type organisation.
This in itself is a problem, because the whole point of BTC is that it’s competitive, and no one entity has control. So, mining pools tend to stop growing before they reach 50% of the BTC network’s processing power. Taking control would be politically disastrous and confidence in the currency (such as there is) would likely be lost.
CAP theory gives us a fundamental limit on how much Coordination Risk we can mitigate. We’ve looked at different organisational structures used to manage Coordination Risk within teams of people, organisations or living organisms, and the same structures turn up in software.
At the start of this section, we questioned whether Coordination Risk was just another type of Communication Risk. However, it should be clear after looking at the examples of competition, cellular life and Vroom and Yetton’s Model that this is exactly backwards:
In the next section, Map And Territory Risk, we’re going to look at some new ways in which systems can fail, despite their attempts to coordinate.