Monday, August 8, 2016

Wicked problems, systems failure and the Swiss cheese model of safety

Reddit had an interesting discussion thread today over Delta Air Lines' network-wide computer system failure, touching on such diverse topics as reliability analysis, computer science, database architecture, electrical engineering, failure mode analysis, business continuity planning, disaster recovery, and many more. Whatever the specific circumstances of Delta's failure, passenger aviation is one of the most complex systems on earth, and overall the thread reminded me that cybernetics is still the most underrated field of knowledge that exists.

Here are some of the best quotes:

Outages, failures, and degradations happen all the time in computer networks. It's the design of the systems in place, and the willingness to spend the $$$ to implement them, that makes the difference between recovering in milliseconds and recovering in hours. To be a little more fair to airlines, it's a pretty low-margin business.

[G]o wiki SABRE and SAGE. The airlines were among the first major businesses to place computer systems at the core of their business infrastructure, and much of the core of those systems to this day dates back to the very first generations of commercial computers.

Just because something is offsite, even far away, doesn't mean that it's guaranteed to fail over correctly.
Lots of reasons for this:
    • Improperly configured or set up routing to the failover site that has never had a real test.
    • Poorly configured load balancing overwhelming a single device and causing it to fail.
Even if/when the routing is correct, it's hard to tell how your infrastructure will fail over in a real scenario until it actually happens. None of these are excuses; it still shouldn't happen. But you would be surprised how easily, and how often, ANY failover fails to work, whether it's onsite or at a remote location.
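The failure mode described above can be sketched in a few lines of Python: a standby site whose routing was never validated looks perfectly fine on paper, and the misconfiguration only surfaces at the moment of cutover. All the site names and fields below are hypothetical, not anything from Delta's actual systems.

```python
# Minimal sketch: a standby site is only "proven" by a real cutover.
# A config error (wrong route, stale address) stays invisible until
# the primary actually dies. All names here are made up.

SITES = {
    "primary": {"address": "10.0.0.1", "routed": True},
    "standby": {"address": "10.9.0.1", "routed": False},  # never tested
}

def cut_over(active: str) -> str:
    """Promote the other site; raise if its routing was never set up."""
    target = "standby" if active == "primary" else "primary"
    if not SITES[target]["routed"]:
        raise RuntimeError(f"failover to {target} failed: no route")
    return target

# The paper DR plan says this works; the first real outage disagrees.
try:
    cut_over("primary")
    outcome = "recovered"
except RuntimeError:
    outcome = "outage"
```

The point of the sketch is that nothing in the `SITES` table looks wrong until `cut_over` actually runs, which is exactly why untested failover paths fail in production.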

In general terms, if you are planning for a cutover measured in hours, an offsite "failover" is not even something you should be considering.
I have yet to see a "DR [disaster recovery] site" actually come online 100% within the company's DR plan window. The exceptions are companies that cut over between two sites on a scheduled (usually monthly) basis.
If you need true HA [high availability], you should be designing and architecting your systems to be truly active/active. For something as atomic and important as an airline reservation system, this likely means a metro HA setup plus an off-site (e.g. cross-country) DR site. The metro setup allows instant zero-downtime failover to its twin, and the off-site location is your "the east coast got nuked" option, where recovery is measured in hours or days.
Failover is not easy. Untested failover is worthless. Almost every company that says it is going to test DR never does, because when they try it, things like today happen, and the dirty secret is that most companies get away without doing it just fine for an entire executive career.

Normal Accidents at its finest.

Do you guys have any idea how costly a "hot site" for something like this would be? The turnaround for a warm site is longer than it even took them to get the power running again. You'd have to have a full-blown, to-scale hot site to fail over to, and the people able to get there. This was not a DR scenario, but one where better BCP [business continuity planning] could have happened.
They would definitely have spent more time getting to a warm/cold site and getting it up and running than they did getting the power back on. A hot site is EXTREMELY expensive, and you are talking about one of the most penny-pinching industries the planet has ever fathomed.
I work on the financial/IT security side of an arm of the airline industry. You simply can't expect something that runs on margins this small to have a full-blown hot site set up and ready to go; it's just not feasible.

I think it's also worth noting that it's not exactly a trivial exercise (or even completely possible, really) to thoroughly test a failure scenario for a system that is supposed to have 100% uptime. You can't just kick the power off, because if the backup fails... now you have downtime. Which is bad. So you test incrementally. Except that won't necessarily catch everything that might go wrong in a real-world failure scenario.
It's a shitty position.
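The gap between incremental tests and real incidents can be made concrete with a small sketch: fail components one at a time and every test passes, but a simultaneous failure exercises a path no test ever covered. The components and backup chain below are invented for illustration.

```python
# Why incremental testing misses real outages: each single failure has a
# tested backup, but a combined failure hits an untested path.
# Components and the backup chain are purely illustrative.

def component_ok(component: str, failed: set) -> bool:
    """A failed component is survivable only if its backup is still up."""
    backups = {"power": "generator", "generator": "ups", "ups": None}
    if component not in failed:
        return True
    backup = backups.get(component)
    return backup is not None and backup not in failed

# Incremental testing: fail one component at a time. Everything passes.
single_tests_pass = all(
    component_ok(c, {c}) for c in ("power", "generator")
)

# Real incident: utility power AND the generator fail together, leaving
# only the UPS, and that combined path was never tested end to end.
real_incident_ok = component_ok("power", {"power", "generator"})
```

The single-failure tests all succeed while the double failure does not, which is the commenter's point: testing increments can't prove the system survives the combinations.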

Of course a major airline is going to have a plan in place for a power failure. 
If you were the spokesperson, which would you tell a BBC reporter?
  • A power outage at the Atlanta Hub resulted in widespread system failures
  • The KDO hyperserver hosting 3 of our 2,150 distributed OO9 database mirrors was knocked offline by electricians in our L549.22 data center. Normal rollover couldn't occur because the incident happened at exactly the time the MD8H controller was performing an FAA-required online incident simulation. When the KDO came back online, the resulting network load between L549.22 and L549.69 caused critical queries to time out, which propagated across all BBBK systems relying on the Q-FAK servers due to the HCC system being left in K6 mode.
[...] I've worked in many industries (I'm a consultant) and I can assure you that an airline's computer systems are leaps and bounds more complex than even something like Facebook's or Google's websites. It has very little to do with change control and much more to do with the number of interconnected systems.

I've seen blank-check-level redundancy fail so many times it's laughable. It really isn't that simple; I don't know what kind of fairy-tale world you live in.
Bringing up another datacenter is easy. Having your applications continue functioning as if nothing happened is a whole other game.

In terms of red tape, the airline industry is, in my opinion, third or fourth behind only the government, the military, and healthcare in its inability to get shit done, owing to the bureaucratic nature of the people and companies involved.

People who aren't engineers don't really appreciate that you can have backup systems for your backup systems and still get hosed when there's a threefold failure.
It doesn't make the news when the main system fails, because you've got a backup system to catch you. It's in that super-rare edge case no human predicted, where seven failsafes fail, that all hell breaks loose and suddenly every customer's an expert in the field.
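A back-of-envelope calculation shows both why stacked failsafes feel safe and why they still fail: independent failure probabilities multiply, but correlated failures (shared power, shared config, shared operators) set a much higher floor. Every number below is an illustrative assumption, not a measured rate.

```python
# Back-of-envelope on stacked failsafes. All probabilities are invented
# for illustration; real failure rates are neither known here nor independent.

p_fail = 0.01        # assume each layer fails 1% of the time
layers = 3           # primary plus two backups

# If failures were truly independent, all three failing at once would be
# vanishingly rare:
p_independent = p_fail ** layers  # 0.01^3 = 1e-6

# But a single common-cause event (a power incident, a bad config push)
# can take out every layer simultaneously, so the realistic floor is
# closer to the common-cause rate itself:
p_common_cause = 0.001

p_total = p_independent + p_common_cause  # dominated by the common cause
```

The arithmetic is the whole argument: the one-in-a-million independent case is what the redundancy buys you on paper, and the common-cause term is why "seven failsafes" still fail together in practice.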

A UPS [uninterruptible power supply] is not designed to power something for hours on end. It's just supposed to protect the equipment from dirty power, keep the equipment powered for short periods while everything kicks over to generator, or give you time to power the system down gracefully. This is a major fuckup by Delta; saying "a UPS failed" explains nothing.
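The bridge-power role of a UPS falls out of simple arithmetic: battery energy divided by load gives minutes, not hours. The sizing numbers and the fixed efficiency below are illustrative assumptions; real units derate further with battery age and temperature.

```python
# Rough UPS runtime estimate: battery energy over load, scaled to minutes.
# The battery size, load, and efficiency figure are illustrative only.

def ups_runtime_minutes(battery_wh: float, load_w: float,
                        efficiency: float = 0.9) -> float:
    """Minutes of runtime at a constant load, given inverter efficiency."""
    return battery_wh * efficiency / load_w * 60

# A hypothetical 5 kWh battery bank feeding a 30 kW rack row: about nine
# minutes, enough to bridge to a generator, nowhere near hours.
bridge_time = ups_runtime_minutes(5000, 30000)
```

Nine-odd minutes is exactly the design point the commenter describes: ride through dirty power, cover the generator start, or shut down cleanly, and nothing more.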

Oh boy, are we in for a treat if the Sun suddenly farts in our general direction, creating an EMP.

A lot of industries have systems like this. Airlines and financial institutions are two good examples. They computerized early on and still have legacy stuff that is decades old. It tends to be very reliable, but it isn't simply a matter of upgrading. In many cases they need to rewrite everything. There is a ton of interconnected stuff and business logic embedded in the software, often undocumented and written by people who are long since dead.
Visa makes a ton of money and can afford to be very redundant. I'm assuming your Comcast comment was a joke. 
