ZEYNEP TUFEKCI
Computers become increasingly capable and powerful by the year, and new hardware is often the most visible cue for technological progress. However, even with the shiniest hardware, the software that plays a critical role inside many systems is too often antiquated and, in some cases, decades old.
This failing appears to be a key factor in why Southwest Airlines couldn’t return to business as usual the way other airlines did after last week’s major winter storm. More than 15,000 of its flights were canceled starting on Dec. 22, including more than 2,300 canceled this past Thursday — almost a week after the storm had passed.
It’s been an open secret within Southwest for some time, and a shameful one, that the company desperately needed to modernize its scheduling systems. Software shortcomings contributed to previous, smaller-scale meltdowns, and Southwest unions had repeatedly warned about the software. Without more government regulation and oversight and greater accountability, we may see more fiascos like this one, which most likely stranded hundreds of thousands of Southwest passengers — perhaps more than a million — over Christmas week. And not just for a single company, as the problem is widespread across many industries.
This problem — relying on older or deficient software that needs updating — is known as incurring technical debt, meaning there is a gap between what the software needs to be and what it is. While aging code is a common cause of technical debt in older companies — such as with airlines, which started automating early — it can also be found in newer systems, because software can be written in a rapid and shoddy way, rather than in a more resilient manner that makes it more dependable and easier to fix or expand. As you might expect, quicker is cheaper.
It’s a bit like constructing a building. If you had the option of not adhering to strict earthquake or fire codes — i.e., if there was little or no regulation or oversight — it would almost inevitably be cheaper and quicker to skip such niceties. The building might look and feel the same to its inhabitants — as long as there was no earthquake or fire. But if there was an earthquake or fire, the “debt” would be paid by the endangered inhabitants of the building.
Which brings us back to Southwest. Throughout the past year, the flight attendants’ union picketed in front of various airports as part of its contract negotiations. One protest sign the demonstrators carried? A placard declaring, “Another victim of SWA’s outdated technology,” with a graphic showing a stuck software progress bar. In September, they put the same sign lamenting the company’s outdated technology on the side of a truck and drove it in circles around Love Field (Southwest’s core airport) in Dallas, as well as the nearby Southwest headquarters. In March, in an open letter to the company, the union even placed updating the creaking scheduling technology above its demands for increased pay.
Likewise, in October of 2021, when Southwest experienced another cancellation crisis, the president of the pilots’ union pointed out that the antiquated crew-scheduling technology was leading to cascading disruptions. Even as Southwest’s chief executive at the time, Gary Kelly, objected to the pilots’ claims, saying Southwest had “wonderful technology,” he conceded that the company’s tools could use improvement.
That improvement seems not to have occurred.
Lyn Montgomery, the president of Southwest’s flight attendants’ union, told me that currently, when hiccups or weather events happen, the employees have to go through a burdensome, arduous process to get things sorted, because Southwest hadn’t sufficiently modernized its crew-scheduling systems.
For example, if crew members from Buffalo don’t arrive in Baltimore because their flight was canceled, the employees have had to manually call in to let the company know where they are and get hotels arranged for them.
Montgomery told me that employees had sent in screenshots that showed them being left on hold on the phone for three, six, seven, eight and 12 hours and even one of 17 hours just to let the company know their whereabouts and get hotel rooms arranged. During such waits, they could time out, a phrase relating to a Federal Aviation Administration safety requirement that mandates a certain amount of rest between flights. The result is that once the employees managed to reach the company, they weren’t allowed to fly, even if they were at an airport with a flight that needed them. Online forums are full of employee accounts of such misery.
Meanwhile — extending our example from above — Southwest would have to find a new crew in Baltimore to replace the one that never arrived from Buffalo. But the candidates currently in Baltimore might also be on hold for hours, trying to let the company know of their whereabouts.
You can see how this can easily cascade to a systemwide halt, as happened this week.
You might be wondering how Southwest can lose track of where the crews are and why anyone has to call in at all, since the company presumably should know exactly which flights got canceled and who flew where, based on passenger lists. Southwest had an old system, and Montgomery said it broke down during even mild hiccups, forcing the employees to call in.
And why can’t the crews simply notify the company of their whereabouts via an app or a website and get their hotel assignments that way? John Brant, the vice president for product strategy at Arcos, a company that sells work force management software to airlines and other companies, told me that that’s how it works for many other airlines. But that’s yet another layer of software that has to be written and integrated into whatever software the airline uses for scheduling personnel.
Southwest concedes that technology played a role in the fiasco but does not acknowledge how past decisions contributed to the breakdown happening now. “Our systems were overwhelmed by the scale of the disruption,” Chris Perry, a Southwest spokesman, told me. “We had available crews and aircraft, but our technology struggled to align our resources due to the magnitude and scale of the disruptions. As a result, our crew schedulers tackled the issue manually, which is a tedious, long process that takes time and trained resources to accomplish.”
Such breakdowns resulting from technical debt are often triggered by external events, like weather, and can be worsened by other dynamics, such as the fact that Southwest has more point-to-point flights than most airlines, which use a hub-and-spoke model, in which passengers are ferried to major hubs like Atlanta and Dallas from their origin and then put on different planes to their final destinations. But the point-to-point flight model — which has its advantages — doesn’t fully explain how Southwest still couldn’t start flying its regular schedule until a week after the storm had passed.
So why didn’t Southwest simply update its software and systems?
Well, if you are a corporate executive whose compensation is tied to stock prices and earnings statements released every three months, there are strong incentives to address any immediate problem by essentially adding a bit of duct tape and wire to what you already have, rather than spending a large amount of money — updating software is costly and difficult — to address the root problem. Then you can cross your fingers and hope that whatever catastrophe may be in the making, it erupts under someone else’s future tenure. Such bets often pay off, since increasingly, the plight of a company’s customers and employees is divorced from the immediate fortunes of its current top executives.
In 2020, for instance, Kelly’s compensation was a record $9.2 million, despite the fact that the company lost more than $3 billion that year because of the Covid pandemic and the compensation for the median employee fell by $35,000, to about $66,000. (The company said his compensation was set before the pandemic.) In the years before the pandemic, while the company’s aging scheduling technology groaned, the company spent $8.5 billion of its excess cash on purchasing its own stock — a common practice among airlines, which helps increase the value of the stock, the main form of compensation for many executives. Then when the pandemic hit, like other airlines, Southwest received billions from the government in grants and low-interest loans. Kelly, an accountant who became the C.E.O. of Southwest in 2004, retired this year, with an estimated net worth in the tens of millions of dollars, so the crisis did indeed occur under someone else’s tenure.
When I talk about technical debt, many people point to the Y2K scare, which seems to offer a perfect example. Before the year 2000, when computer memory was really expensive, software programs used two digits, instead of four, to indicate the year: e.g., 81 instead of 1981. Obviously, that wasn’t going to work in the new millennium, when confusions between 1905 and 2005 could have caused programs to glitch or crash on an epic scale.
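The two-digit problem can be made concrete with a minimal sketch. The function names and the pivot value of 50 below are illustrative assumptions, not taken from any real system; the "windowing" rule shown is one general style of fix, not a description of how any particular company repaired its code.

```python
# Illustrative sketch of the two-digit-year problem (all names are hypothetical).

def years_elapsed_two_digit(start_yy: int, end_yy: int) -> int:
    """Naive arithmetic on two-digit years, as older programs stored them."""
    return end_yy - start_yy

def years_elapsed_four_digit(start_year: int, end_year: int) -> int:
    """The same arithmetic with full four-digit years."""
    return end_year - start_year

def expand_with_pivot(yy: int, pivot: int = 50) -> int:
    """A 'windowing' guess: treat yy >= pivot as 19yy, otherwise as 20yy.

    The pivot of 50 is an assumption chosen for illustration.
    """
    return 1900 + yy if yy >= pivot else 2000 + yy

# An account opened in 1981, checked in 2000:
broken = years_elapsed_two_digit(81, 0)        # two digits: 00 - 81
correct = years_elapsed_four_digit(1981, 2000)

# 1905 and 2005 collapse to the same stored value:
ambiguous = (1905 % 100) == (2005 % 100)

print(broken, correct, ambiguous)
```

With two digits, the elapsed time comes out negative, and 1905 and 2005 become indistinguishable. A windowing rule only moves the ambiguity to a different year, which is part of why a real fix was so laborious.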
But that didn’t happen, and some people may believe that the implication is that technical debt is not a big deal. But the reason we made it through Y2K intact is that we didn’t ignore the problem. The U.S. government and businesses spent a staggering $100 billion to fix the underlying problem in a massive, multiyear effort. These are the kinds of efforts that may not be widely visible but make all the difference.
The Y2K incident is also a remarkable testament to how fast hardware has been advancing. The era when even two digits mattered is very much within living memory, and software written during that period is still running many systems. Meanwhile, my not-that-fancy smartphone’s memory holds what not that long ago would have been an unimaginable 128GB, or 137,438,953,472 bytes. That remarkable progress in hardware shouldn’t obscure the critically important fact that software doesn’t just come along by itself.
Ultimately, the problem is that we haven’t built a regulatory environment in which companies have incentives to address technical debt, rather than passing the burden on to customers, employees or the next management.
What would proper incentives look like? They would differ by industry. For airlines, they might mean holding them responsible for the problems their miserly approach causes to the flying public. To start with, they could be forced to compensate passengers for delays or cancellations that go beyond reasonable expectations because of weather or events outside their control. (Europe has such a rule, though the implementation has hit a lot of snags.)
Companies can also be substantively fined for major failures like this one. But if the fines are too small, companies will just see them as a cost of doing business and carry on.
For example, after the 2017 Equifax breach, which exposed sensitive information from 143 million Americans because the company failed to institute a routine security update to its software, it agreed to pay a penalty of at least $575 million to the Federal Trade Commission. That may sound like a lot, but it was just a few dollars per affected customer and a mere 15 percent of the company’s revenue in 2018, the year after the hack. I’m sure Equifax would have much preferred not to have been fined, but it was still a cost the company could endure, especially for those lucky enough to inhabit the executive suites. The Equifax C.E.O., Richard Smith, did resign. But despite the failure and the fine, he collected $18 million in pension money on his way out the door.
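As a rough check on the "few dollars per affected customer" figure, the arithmetic can be sketched using only the two numbers cited above: the minimum penalty and the number of people affected. The variable names are illustrative.

```python
# Back-of-the-envelope check using only the figures cited in the text.
penalty_dollars = 575_000_000    # minimum penalty agreed with the F.T.C.
people_affected = 143_000_000    # Americans whose information was exposed

# Penalty spread evenly across everyone affected.
per_person = penalty_dollars / people_affected

print(f"${per_person:.2f} per affected person")
```

That works out to roughly four dollars a person.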
This is why we can’t just keep leaving the operation of more and more of our infrastructure and our lives to antiquated software and self-interested executives. Technical debt is real debt. It will eventually be paid by someone. And unless we take steps to hold companies and executives accountable for preventable — and foreseeable — failures, it will be us, the public, who keep paying.