CrowdStrike: a global wake-up call

Information is still coming to light about the circumstances in which vast swathes of the world’s most critical IT systems went down on Friday.

At the time of writing, it looks likely that the cause was a mistake rather than a malicious attack: cyber security supplier CrowdStrike has said that a defect in one of its software updates affected Windows operating systems around the world.

We won’t know the full story until after the dust has settled.

What we do know is that flights around the world have been grounded. Clinicians have lost access to their systems, and patients have had their cancer treatments delayed. Shops can’t take payment. Train companies haven’t been able to get their drivers to the right places. Ticket machines have broken. Whole TV channels have gone down.

The knock-on disruption from an outage like this will last for days, if not weeks. Full recovery will take longer still. The economic impact will be huge.

Meanwhile, spare a thought for all the IT teams facing a very tough and stressful situation: they are often under-celebrated, but their work is vital to keeping the world working.

Here are a few immediate reflections on this event which, while startling in scale, is unfortunately neither unique nor unexpected.

1. Mistakes happen. Attacks happen.

As cyber threats evolve ever more quickly, the world’s organisations need to update their protection against them more and more quickly.

That in itself increases the risk of mistakes like this happening.

It is an incredibly difficult challenge, but there is a constant need to balance the risk of cyber attacks against the risk of making mistakes in protecting against them.

2. Even when there are mistakes, the risk of catastrophic failure on this scale is a function of the world’s interconnected IT infrastructure.

The systems and suppliers that now provide the operating model for the world are heavily interconnected and rely on each other. When something breaks, it can have waves of impact beyond the initial blast zone.

To take another recent example: when the pathology laboratory Synnovis was hit with a ransomware attack in June, the impact spread across the NHS, from A&E appointments to the supplies of donated blood. Ultimately 7,000 outpatient appointments and 1,500 elective procedures were postponed.

An outage which hits operating systems across multiple industries and sectors globally has an astronomical potential impact. There are obvious parallels here with systemic risk in the global banking system: in 2008-2009, systemic exposure to risky but misunderstood assets triggered a global crisis which is now estimated to have cost the US economy alone $4.6 trillion in missed growth.

3. The risk is greatest where there are highly consolidated IT supply chains.

In many parts of the global economy, a few dominant suppliers have captured their markets. Highly oligopolistic supply chains mean that when something goes wrong, it can wipe out an entire set of critical services with no viable fallback option.

We’ve seen that today in GP surgeries across the UK. One supplier, EMIS, provides software to 60% of surgeries. That means 60% of surgeries risk being floored if a third-party issue causes a software outage.

4. Technology risk is a life and death issue.

The risk is non-trivial. Today, GPs will not have had medical notes when making decisions about their patients, and at least one hospital has declared a critical incident.

After the Synnovis attack, NHS Blood and Transplant had to make urgent appeals for blood donors to replenish lifesaving stocks. Beyond inconvenience and economic impact, the world’s healthcare, transportation and other critical infrastructure all rely on their IT systems working.

Whether through malicious actors or innocent mistakes, people’s safety is at risk when IT risks are not managed effectively.

5. This risk isn’t taken seriously enough by business leaders or global leaders.

Wherever possible, we need to ensure there are diversified technology supply chains, avoiding single points of failure and spreading risk.

This should be a top priority both within organisations and across national and international systems.

  • At an organisational level, CEOs should be asking how well their organisation is set up to respond when things go wrong. Are they set up to recover quickly? How have they designed their systems to minimise the impact of a major outage?
  • At a national level, Ministers and senior civil servants should be looking at this risk at a system level across the whole of the public realm and the private sector. For example, what is the aggregate potential impact of consolidated supply chains on the NHS, on the transport network, or on banking?
  • And, in the same way as central banks and treasury ministers reviewed risk after the Global Financial Crisis, countries need to work together to reduce the systemic risk in the global technology market. This won’t be popular with most of the world’s largest technology firms, but in the internet era, our digital infrastructure is fundamental to both national security and public safety.

This week, the first reports from the UK’s Covid-19 public inquiry revealed a lack of preparedness for a health pandemic. Unfortunately, just like a pandemic, major global IT outages are a “when” not an “if” problem. Something WILL go wrong - and it will be worse than today.

Will we be ready for it?
