Delta Airlines last night suffered a major power outage at its data center in Atlanta that led to a systemwide shutdown of its computer network, stranding airliners and canceling flights all over the world. You already know that. What you may not know, however, is the likely role in the crisis of IT outsourcing and offshoring.
Whatever the cause of the Delta Airlines power outage, data center recovery pretty much follows the same routine I used 30 years ago when I had a PDP-8 minicomputer living in my basement and heating my house. First you crawl around and find the power shut-off and turn off the power. I know there is no power but the point is that when power returns we don’t want a surge damaging equipment. Then you crawl around some more and turn off power to every individual device. Wait in the dark for power to be restored, either by the utility or a generator. Once power has been restored, turn the main power switch back on, then crawl back to every device, turning them back on in a specific order that follows your emergency plan. You do have an emergency plan, right? In the case of the PDP-8, toggle in the code to launch the boot PROM loader (yes, I have done this in complete darkness). Reboot all equipment and check for errors. Once everything is working well together, reconnect data to the outside world.
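That "specific order" is the part people get wrong, so here is a minimal sketch of how an emergency plan might compute it. Everything in it is my assumption for illustration: the device names and the dependency map are hypothetical, and a real plan would list thousands of devices. The idea is simply that each device names what must already be running before it starts, and a topological sort (Python's standard graphlib module) produces a safe power-on sequence.

```python
from graphlib import TopologicalSorter

# Hypothetical devices and dependencies, for illustration only.
# Each key maps to the devices that must be running before it is powered on.
DEPENDS_ON = {
    "core_switch": [],
    "storage_array": ["core_switch"],
    "database_server": ["storage_array", "core_switch"],
    "app_server": ["database_server"],
    "external_link": ["app_server"],  # the outside world comes back last
}


def power_on_order(depends_on):
    """Return a start order in which every device follows its prerequisites."""
    # static_order() raises graphlib.CycleError if the plan contains a loop.
    return list(TopologicalSorter(depends_on).static_order())


order = power_on_order(DEPENDS_ON)
for step, device in enumerate(order, start=1):
    print(f"{step}. power on {device}")
```

With this particular map the order is forced: network first, then storage, then the database, then the application, and only then the link to the outside world. Print it on paper, because you will be reading it by flashlight.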
Notice the role in this process of crawling around in the dark? How do you do that when your network technicians are mainly in India?
Yes, every data center of a certain size has bodies on-site, but few have enough such bodies to do all the crawling required for a building with thousands of individual devices.
Modern data centers can switch to UPS power very quickly, usually less than 1/30th of a second. They will run on battery power for a few minutes while the generators come on line. Smarter data centers drop power to the HVAC system until the generators are on line and handling the full load. Smarter IT departments also monitor the quality of the electric power coming into the data center. They can see the effect of bad weather on the power grid. When there are storms approaching the area they proactively switch to generator power, which even if it isn’t needed is a good test. Better to fire up the generators, have them go into phase with the utility then take over the load gracefully rather than all at once. It is doubtful that happened last night at Delta.
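For the curious, the decision to go to generator power ahead of trouble might look something like the sketch below. The nominal voltage, the 5 percent tolerance band, the storm flag, and the function name are all my assumptions, not anything I know about a real monitoring product or about Delta's setup.

```python
from dataclasses import dataclass

# Assumed values for illustration; a real site would use its own limits.
NOMINAL_VOLTS = 480.0
TOLERANCE = 0.05  # readings more than 5 percent off nominal count as bad


@dataclass
class PowerReading:
    volts: float
    storm_warning: bool  # hypothetical flag fed by a weather service


def should_transfer_to_generator(readings, max_bad=2):
    """Decide whether to move the load to generators before the utility fails."""
    # A storm in the area is reason enough: the transfer doubles as a test.
    if any(r.storm_warning for r in readings):
        return True
    # Otherwise, transfer only if utility power quality keeps slipping.
    limit = NOMINAL_VOLTS * TOLERANCE
    bad = sum(1 for r in readings if abs(r.volts - NOMINAL_VOLTS) > limit)
    return bad > max_bad


calm = [PowerReading(480.0, False)] * 5
sagging = [PowerReading(450.0, False)] * 3 + [PowerReading(480.0, False)] * 2
stormy = [PowerReading(480.0, True)]

print(should_transfer_to_generator(calm))
print(should_transfer_to_generator(sagging))
print(should_transfer_to_generator(stormy))
```

The point of the sketch is the order of operations: you make the call while utility power is still up, so the generators can sync and take the load gracefully.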
Delta Airlines was an IBM outsourcing customer. They may still be today; I don’t know. They haven’t returned my call.
Loss of power in a data center usually triggers a disaster recovery plan. When that happens you have two basic choices: switch to your backup systems somewhere else or fix the outage and recover your primary systems. The problem with going to backup systems is those backups usually do not have capacity for 100 percent of the workload so only the most critical functions are moved. Then once everything is fixed you have to move your workload back to your production systems. That is often high risk, a major pain, and takes a lot of effort. So in a traditional disaster recovery setup, the preference will always be to recover the primary services.
Anything less than a 100 percent service backup isn’t disaster recovery, it is disaster coping.
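To see why a partial backup amounts to coping, consider a rough sketch of how the "most critical functions" get chosen. The workload names, priorities, and load figures are invented for illustration (loads are in percent of production capacity), and the greedy selection is just one plausible policy, not a description of anyone's actual plan.

```python
def plan_failover(workloads, backup_capacity):
    """Pick workloads for the backup site, most critical first.

    workloads: list of (name, priority, load) tuples; a lower priority
    number means more critical. Returns (moved, left_behind).
    """
    moved, left_behind = [], []
    remaining = backup_capacity
    for name, priority, load in sorted(workloads, key=lambda w: w[1]):
        if load <= remaining:
            moved.append(name)
            remaining -= load
        else:
            left_behind.append(name)  # stays dark until the primary recovers
    return moved, left_behind


# Hypothetical workloads: (name, priority, load as percent of production).
WORKLOADS = [
    ("reservations", 1, 40),
    ("crew_scheduling", 2, 30),
    ("loyalty_program", 3, 20),
    ("reporting", 4, 25),
]

# Assume the backup site can carry 60 percent of the production load.
moved, left_behind = plan_failover(WORKLOADS, backup_capacity=60)
print("moved:", moved)
print("left behind:", left_behind)
```

Notice that something important always ends up in the left-behind list. In this made-up case the second most critical workload does not fit, which is exactly the kind of surprise you do not want to discover during an outage.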
Now if the IT support team is thousands of miles away, offshore, the process for restarting hundreds, perhaps thousands, of systems can be slow and painful. If you lose the data link between your support team and the data center due to that same power outage, your support team can do nothing until the data link is fixed.
In the old days a smart IT department would put 50 people in the data center with pre-printed documentation on how to recover the systems. They’d go into a massive divide-and-conquer effort to restart everything. One person can work on several systems at the same time. While debugging the systems the IT team can spot and diagnose network and SAN (Storage Area Network) problems and work shoulder to shoulder with the network team until everything is fixed. Today, due to long-distance data connections, offshore IT folks can only work on a few systems at a time. All discussions with other teams are done via conference calls. Language issues make the challenges much harder.
A further problem with this scenario is that network, application, and storage support can be on completely separate contracts with vendors who may not play well together. Some vendors simply refuse to cooperate because cooperation isn’t a contract term.
Now I don’t know if any of this applies to Delta Airlines because they are too busy to answer a question from little old me, but I’m sure the answers will appear in coming days. Hopefully other IT departments will learn from Delta’s experience.