Oops! Why accidental technology failure is a greater threat than cyber security.

You'd think digital technology is thriving.

After all, we're living in an era of groundbreaking artificial intelligence, global-scale cloud computing and the seemingly unstoppable spread of digital products and services.

But the recent CrowdStrike outages reveal that below the surface, the technology that runs our global digital ecosystem is also fragile.

And that fragility comes in many forms. Here are four of the most significant:

1. Legacy technology is abundant.

Inside most large organisations, the majority of digital technology has uncomfortably aged. It’s old, outdated, and no longer fully understood. It has weaknesses and fragilities.

The task of modernising it all has CEOs, CTOs and CFOs scratching their heads: it looks never-ending, it is hard to treat as a priority, and the spend is even harder to justify.

2. We all rely on the same technologies.

Products such as Microsoft 365 (Microsoft Word, Teams etc.) and Google Workspace (formerly G Suite; Gmail etc.) are ubiquitous. Behind the scenes of most large organisations are products like SAP, Oracle ERP or Workday running the core finances and human resources. But the commonality runs even deeper.

These consumer-facing products are themselves built from parts - millions of small software components, pieced together to make more complex products. Some small fragments of software will be present on virtually every computer in the world. It’s very efficient for the world to share common solutions to common problems like this, but it amplifies the impact of a single technology failing.

3. Everything is interconnected.

It’s now common that we use our software over the internet (software-as-a-service) and the work we do involves not just the computer in front of us, but also a vast ecosystem of computers across the world (cloud computing).

As we sleep, computers communicate, exchanging information, undertaking instructions and getting things done. This interconnectedness is powerful and brings extraordinary value, but can also turn isolated problems into problems which cascade through the networks.

4. Automation means software changes rapidly and at an incredible scale.

This has become necessary to stay cyber secure in the arms race against malicious actors. When someone discovers a vulnerability in a software system, it becomes a race against time to ‘patch’ the software, repair the vulnerability, and ensure the weak point can’t be exploited by malicious attackers.

If the software is used by millions of people, it needs to be updated for millions of people. This was the noble intent behind the CrowdStrike update, which backfired.

Put all of this together, and you have the ability to change the same software, on millions or even billions of computers across the world, almost instantaneously. And this software change will take place within, or adjacent to, fragile legacy technology.

That means that you don’t need a malicious actor to break a fragile system like this - a simple accident will do it.

It’s an engineering marvel that incidents like the CrowdStrike outages don’t happen with disruptive regularity. 

The reason they don’t is that a myriad of techniques exist to account for the inherent fragility - but these techniques need continuous care and attention from people, not from computers.

It will never be possible to automate everything, to account for every change in context. What’s more, every automation itself becomes another moving part in the fragile ecosystem. You can’t sidestep fragility by building more technology.

Teams are at the heart of the rethink we need

A rethink is needed around our relationship to digital technology. It’s never been good enough for software to simply and momentarily ‘work’ - yet the way we fund and build teams around technology so often ignores the long term. 

Technology needs to keep working. It needs people to anticipate how it could fail, and to keep anticipating for as long as it is in use.

Technology also needs accountable people, because every working piece of software or hardware is a service upon which, ultimately, people rely.

That’s why we still think that every tool, every component, every product, needs a team. This isn’t about countless inefficient teams with time on their hands. Instead, it’s about distributing the accountability for all the technology you make, buy and use across the teams of your organisation - balancing and finding trade-offs, so that each team is ultimately accountable for a sustainable amount of technology. It’s about ensuring there is no critical technology that you use and come to depend upon, only to realise when it fails that it was never any team’s job to think about it.
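For readers who think in code, the idea can be sketched as a simple inventory check. This is a minimal illustration, not a prescribed tool: the `Team` record, the `capacity` figure and all of the technology and team names below are hypothetical, invented only to show the two questions worth asking. Which technologies have no accountable team, and which teams are accountable for more than they can sustain?

```python
from dataclasses import dataclass, field


@dataclass
class Team:
    name: str
    # Hypothetical measure: how many technologies this team can sustainably own.
    capacity: int
    owns: list[str] = field(default_factory=list)


def unowned(technologies: set[str], teams: list[Team]) -> set[str]:
    """Technologies in use that no team is accountable for."""
    owned = {tech for team in teams for tech in team.owns}
    return technologies - owned


def overloaded(teams: list[Team]) -> list[str]:
    """Names of teams accountable for more than their stated capacity."""
    return [team.name for team in teams if len(team.owns) > team.capacity]


# Illustrative inventory; every name here is made up.
technologies = {"payroll-system", "endpoint-security-agent", "email", "legacy-billing"}
teams = [
    Team("finance-platform", capacity=2, owns=["payroll-system"]),
    Team("workplace-it", capacity=1, owns=["email", "endpoint-security-agent"]),
]

print(sorted(unowned(technologies, teams)))  # ['legacy-billing']
print(overloaded(teams))                     # ['workplace-it']
```

In this made-up inventory, the legacy billing system is nobody’s job, and one team is already carrying more than it can sustain. Those are the gaps that tend to surface only after a failure.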

A focus on teams and accountability changes the conversation. When technology failure can be a life or death matter, we need to understand what caused it, and which teams can or should be accountable.

We need to understand if the legal and regulatory environment is suited to holding organisations to account and incentivising the sustained investments and modernisation needed.

We need this to work across complex supply chains. This team-centred way of thinking helps organisations to treat technology not as something static, but as a living part of the global ecosystem, playing a role that people must oversee.

We have all the other parts of the puzzle. Within software and hardware engineering, product management, user-centred design and a myriad of other professional practices, there are known ways to make technology more resilient and reliable. What’s missing is the rigour provided by matching enduring teams to enduring technology, to ensure these skills can be applied continuously.

We know that outages are a “when” not an “if” problem, and organisations can and must take responsibility for protecting themselves from inevitable tech failures.

However, many events, like the CrowdStrike outage, are plainly avoidable. It is within our power to improve the ecosystem for everyone by preventing these things from happening, as much as we can.
