24 July 2024

What the Microsoft Outage Reveals

Samuel Arbesman

It happened again: yet another cascading failure of technology. In recent years we’ve had internet blackouts, aviation-system debacles, and now a widespread outage due to an issue affecting Microsoft systems, which has grounded flights and disrupted a range of other businesses, including health-care providers, banks, and broadcasters.

Why are we so bad at preventing these? Fundamentally, because our technological systems are too complicated for anyone to fully understand. These are not computer programs built by a single individual; they are the work of many hands over the span of many years. They are the interaction of countless components that might have been designed in a specific way for reasons that no one remembers. Many of our systems involve massive numbers of computers, any one of which might malfunction and bring down all the rest. And many have millions of lines of computer code that no one entirely grasps.

We don’t appreciate any of this until things go wrong. We discover the fragility of our technological infrastructure only when it’s too late.

So how can we make our systems fail less often?

We need to get to know them better. The best way to do this, ironically, is to break them. Much as biologists irradiate bacteria to cause mutations that show us how the bacteria function, we can introduce errors into technologies to understand how they’re liable to fail.


No comments: