Delta Airlines last night suffered a major power outage at its data center in Atlanta that led to a systemwide shutdown of its computer network, stranding airliners and canceling flights all over the world. You already know that. What you may not know, however, is the likely role in the crisis of IT outsourcing and offshoring.
Whatever the cause of the Delta Airlines power outage, data center recovery pretty much follows the same routine I used 30 years ago when I had a PDP-8 minicomputer living in my basement and heating my house. First you crawl around and find the power shut-off and turn off the power. I know there is no power, but the point is that when power returns we don’t want a surge damaging equipment. Then you crawl around some more and turn off power to every individual device. Wait in the dark for power to be restored, either by the utility or a generator. Once power has been restored, turn the main power switch back on, then crawl back to every device, turning them back on in a specific order that follows your emergency plan. You do have an emergency plan, right? In the case of the PDP-8, toggle in the code to launch the boot PROM loader (yes, I have done this in complete darkness). Reboot all equipment and check for errors. Once everything is working well together, reconnect data to the outside world.
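For illustration only, here is what that "specific order that follows your emergency plan" might look like once it is scripted. The device names, the power_on()/health_check() helpers, and the ordering are invented placeholders, not anyone's actual runbook:

```python
# Hypothetical sketch of a scripted power-up sequence following an emergency plan.
# Device names, the ordering, and power_on()/health_check() are assumptions, not
# anyone's real tooling.
import time

# Emergency plan: bring devices up in dependency order, core infrastructure first.
POWER_UP_ORDER = [
    "core-switch-01",     # network fabric before anything that needs it
    "san-controller-01",  # storage before the servers that mount it
    "db-server-01",       # databases before the applications that query them
    "app-server-01",
]

def power_on(device: str) -> None:
    """Placeholder for whatever actually applies power (PDU API, smart strip, a person)."""
    print(f"powering on {device}")

def health_check(device: str) -> bool:
    """Placeholder for a ping/port/service check; assume True means healthy."""
    return True

for device in POWER_UP_ORDER:
    power_on(device)
    # Do not move on until this layer is confirmed healthy; a half-up SAN
    # under a fully-up database tier is how a small outage becomes a long one.
    while not health_check(device):
        time.sleep(10)
    print(f"{device} is up; proceeding to the next device in the plan")
```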
Notice the role in this process of crawling around in the dark? How do you do that when your network technicians are mainly in India?
Yes, every data center of a certain size has bodies on-site, but few have enough such bodies to do all the crawling required for a building with thousands of individual devices.
Modern data centers can switch to UPS power very quickly, usually in less than 1/30th of a second. They will run on battery power for a few minutes while the generators come on line. Smarter data centers drop power to the HVAC system until the generators are on line and handling the full load. Smarter IT departments also monitor the quality of the electric power coming into the data center. They can see the effect of bad weather on the power grid. When there are storms approaching the area they proactively switch to generator power, which even if it isn’t needed is a good test. Better to fire up the generators, have them go into phase with the utility, then take over the load gracefully rather than all at once. It is doubtful that happened last night at Delta.
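A hedged sketch of that "switch before the storm hits" logic, with made-up thresholds and placeholder functions standing in for the real power-quality monitoring and transfer-switch gear:

```python
# Sketch of proactive generator transfer. The voltage thresholds, the weather input,
# and transfer_to_generator() are all hypothetical placeholders.

NOMINAL_VOLTS = 480.0   # assumed service voltage
SAG_THRESHOLD = 0.95    # treat a sustained 5% sag as a warning sign

def utility_power_is_suspect(measured_volts: float, storm_warning_active: bool) -> bool:
    """Decide whether to transfer load to the generators before being forced to."""
    sagging = measured_volts < NOMINAL_VOLTS * SAG_THRESHOLD
    return sagging or storm_warning_active

def transfer_to_generator() -> None:
    """Placeholder for commanding the transfer switch.

    The point of a proactive transfer is that the generators are started,
    brought into phase with the utility, and loaded gradually -- a free test
    under controlled conditions instead of a forced one at 2:30 a.m.
    """
    print("starting generators, syncing phase, transferring load gracefully")

if utility_power_is_suspect(measured_volts=452.0, storm_warning_active=True):
    transfer_to_generator()
```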
Delta Airlines was an IBM outsourcing customer; they may still be today, I don’t know. They haven’t returned my call.
Loss of power in a data center usually triggers a disaster recovery plan. When that happens you have two basic choices: switch to your backup systems somewhere else or fix the outage and recover your primary systems. The problem with going to backup systems is those backups usually do not have capacity for 100 percent of the workload so only the most critical functions are moved. Then once everything is fixed you have to move your workload back to your production systems. That is often high risk, a major pain, and takes a lot of effort. So in a traditional disaster recovery setup, the preference will always be to recover the primary services.
Anything less than a 100 percent service backup isn’t disaster recovery, it is disaster coping.
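To make the capacity problem concrete, here is a minimal sketch with invented workload names, sizes, and DR capacity: move workloads in criticality order until the backup site is full, and everything else waits for the primary to come back.

```python
# Minimal sketch of capacity-constrained failover. All names and numbers are invented.

DR_CAPACITY_UNITS = 60  # assumed fraction of production capacity available at the DR site

workloads = [
    # (name, criticality: lower = more critical, size in arbitrary capacity units)
    ("reservations",       1, 30),
    ("flight-operations",  1, 25),
    ("crew-scheduling",    2, 15),
    ("loyalty-program",    3, 20),
    ("internal-reporting", 4, 10),
]

moved, deferred, used = [], [], 0
for name, criticality, size in sorted(workloads, key=lambda w: w[1]):
    if used + size <= DR_CAPACITY_UNITS:
        moved.append(name)
        used += size
    else:
        deferred.append(name)  # "disaster coping": these stay down until primary recovers

print("failed over:", moved)
print("left behind:", deferred)
```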
Now if the IT support team is thousands of miles away, offshore, the process for restarting hundreds — perhaps thousands — of systems can be slow and painful. If you lose the data link between your support team and the data center due to that same power outage your support team can do nothing until the data link is fixed.
In the old days a smart IT department would put 50 people in the data center with pre-printed documentation on how to recover the systems. They’d go into a massive divide-and-conquer effort to restart everything. One person can work on several systems at the same time. While debugging the systems the IT team can spot and diagnose network and SAN (Storage Area Network) problems and work shoulder to shoulder with the network team until everything is fixed. Today, due to long-distance data connections, offshore IT folks can only work on a few systems at a time. All discussions with other teams are done via conference calls. Language issues make the challenges much harder.
A further problem with this scenario is that network, application, and storage support can be on completely separate contracts with vendors who may not play well together. Some vendors simply refuse to cooperate because cooperation isn’t a contract term.
Now I don’t know if any of this applies to Delta Airlines because they are too busy to answer a question from little old me, but I’m sure the answers will appear in coming days. Hopefully other IT departments will learn from Delta’s experience.
I’m sure someone is already working on a spreadsheet that shows the amortised cost of the disaster is less than what they save by outsourcing.
Yes, and the spreadsheet won’t consider the cost of losing customers to competitors, since that can’t be easily measured.
CEOs and Directors of IT hide the most embarrassing disasters caused by their decision to outsource to India or Pakistan. I’ve worked for the largest corps in the US and too often they hire an H1B from India/Pakistan who claims he/she is an ETL expert. Being the ETL expert lets a person download corporate data without ever being questioned. Do you understand how hacking is taking place so easily?
Note to the next Delta CEO: While it may seem like a cool idea to film yourself apologizing with the operations center in the background showing “all hands on deck,” this is a really dumb idea and shows a total lack of understanding when it comes to computer security.
http://news.delta.com/ceo-apologizes-customers-flight-schedule-recovery-continues
I can’t believe this is a command center trying to solve a problem with the TV on the big screen. This looks more like a situation room for the PR folks.
What you see on the video is their operations center. It is not their IT command center. The big airlines have operations centers. There they track their flights, their fleet, their crews, the weather, etc. They operate 1000’s of planes, each flying 4 to 8 or more flights a day. The crews don’t necessarily stay on the same plane all day. The time they are awake and work each day is carefully managed. The daily logistics to get the maximum use of the fleet is daunting.
If a plane needs a new tire — do you delay its next flight, or pull a plane from another flight? If there are too many weather delays the airline may have to bring in a new crew. The crew they relieve may be out of position for their next day’s assignments. The airlines use operations research software to evaluate the options and pick the one that has the least total impact on the system that day.
When you consider what is needed to manage the real time operations of an airline, the room shown in the video is exactly what you would expect. This wasn’t a PR shot.
Southwest Airlines servers go off line and wreak havoc. Shortly after this Delta Airlines servers go off line and wreak (lesser?) havoc. Stuxnet beta testing against American infrastructure anyone?
No, not Stuxnet. That was pretty stealthy and not meant to cause damage you can see but, rather, to damage things in a way that looked like the error was in the operators’ and scientists’ calculations. Stuxnet was the nuclear bomb of viruses as it showed you can put viruses in firmware and have them lie undetected for months or years, but now that the secret is out everyone will try to write viruses in a similar way. All chips come from China; what’s keeping China from putting backdoors in all our electronics?
This most likely is a failure of a system to come online after the supposed power outage. I don’t doubt IBM is running the show here. They have been a disaster for my company’s IT.
Oh, I can think of a couple of other IT outsourcing companies that would do just as poor a job as IBM….
Having watched everyone I know lose their wonderful IT job to some outsourced location, I can’t help but feel a bit glum about this. It could devolve into an ‘I told you so’ moment, but they are learning. If this one doesn’t get them to rethink a few things, maybe the next one will.
Outsourcing your IT is tantamount to giving someone else the financial controls to your company.
No, they’re not learning… Delta has had problems like this before, with almost identical results. That time it was just a storm that forced them into delaying and re-routing a lot of flights, and the resulting number of ticket changes crashed their systems. They ended up shut down throughout most of the US for at least a day (I was supposed to be on a flight that got cancelled, and it took two hours just to find out that our flight had been cancelled, plus another six hours or so before they told us when our new booking was — and our flight wasn’t even affected by the storm).
“Outsourcing your IT is tantamount to giving someone else the financial controls to your company.”
True, but then so is going public.
The Delta subsidiary Comair (now defunct) had that issue with their ticketing system a decade or so ago, and they were down for a couple of days. So yeah, Delta hasn’t really learned.
AMEN; I agree 100 percent.
I worked in systems engineering and management for “the world’s largest IT company” for more than 30 years and spent the last six working with its biggest non-government customer. Bottom line: in this day and age a properly designed – AND TESTED – backup process would have allowed Delta to continue operating such that no one but its harried IT staff would have known it suffered any kind of a problem. Losing so much business and customer goodwill to something as plebeian as a power outage is so … twentieth century …
Well stated and witty! However, some truths are timeless and applicable no matter WHAT the century is.
Example: “By failing to prepare, you are preparing to fail.”
Benjamin Franklin, 1792.
True in 1792, true in 2016, and I suspect just as true in 2392.
true statement, but business must balance total readiness vs. cost.
the last 20% costs 80% of the budget.
Are you who I presume you to be??
Yeah, but a power outage isn’t the sort of oddball occurrence that exists in the last 10%. It’s a fairly common event that should be planned for in the first 80%.
My question is, what the hell happened to DR? Why was there no failover to a fully capable DR site? Oh, wait, that costs a LOT of money for something that will probably never happen…
Look on the bright side, until this point, we saved lots of money on our IT spend.
Anything that doesn’t add to the bottom line is minimized in the publicly traded company. This not only applies to disaster recovery, but also to network security, application security, and associated personnel costs.
Unless and until a clear monetary impact can be measured with these events – we’ll continue to see companies gambling with their customers’ data.
I just went through a dry run with the present client’s DR/HA infrastructure as an outside evaluation. Their DR site has disks that are not mirrored, and these hold the standby databases of the essential systems. As this is the primary failover, I’m actually stunned that the standby databases aren’t mirrored. That’s just the first finding; the SPOF list is long and may never be remediated.
While this specific situation is uncommon at other clients, similar DR gaffes are just as prevalent elsewhere.
Outsourcing IT isn’t the answer for essential systems when you can’t trust the outsourcer. But it’s not always the outsourcer, sometimes it’s the customer.
As for Delta, it will likely be a matter of lining up the innocent for punishment and rewarding the guilty.
SSDD.
You make a good point. I’ve seen that single point of failure disease in key systems before and more than once it was done knowingly– as you point out, sometimes it is the customer!
We don’t know what happened at Delta yet. Swinging a big business to a DR site, even a properly set up one, is a big undertaking, and a few weeks later you have to undo everything and swing it back. So I am guessing they were able to fix the initial cause of the problem and chose to restore their production systems. I probably would have made the same decision. Remember we’re probably talking about 1000’s of applications running on 1000’s of systems. This is not a trivial undertaking.
If your IT team lives in the same area as your data center then you can call everyone in, buy lots of coffee and donuts, and they can work through the challenge fairly efficiently. However if your IT team is on another continent, then things get ugly. When you lost the data center, you probably lost the network circuits between your systems and your support team. You have to get the network circuits back up, then the jump hosts and remote access (e.g. Citrix) systems. Working remotely in this way is not easy and one is limited in how much and what they can really do. It is common for the Windows admins to be in a different place than the Unix admins, and the DBAs could be somewhere else. Calling over to the person next to you to ask them to ping or test your system is not an option. Bridge calls and chat sessions are the norm. In a crisis it is terribly awkward.
I used to work in outsourcing. Every time we were in a crisis situation like this and had people 12 time zones away trying to do work over slow WAN lines I had to ask myself: what was IBM management thinking? What was the customer thinking? A business is shut down hard. A problem that should have taken 3-6 hours to fix can now take days. Delta makes $40.7B a year. That is $111M a day, or $4.6M an hour. If their IT decisions caused their outage to take 20 hours longer to fix, that’s roughly a $100M loss!!! Wow!
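For what it's worth, that arithmetic checks out within rounding. A quick sketch, using the revenue figure as quoted and treating the 20 extra hours as the hypothetical it is:

```python
# Back-of-the-envelope check of the numbers above (annual revenue figure as quoted).
annual_revenue = 40.7e9                 # dollars per year, as cited in the comment
per_day = annual_revenue / 365          # ~ $111.5M per day
per_hour = per_day / 24                 # ~ $4.6M per hour
extra_outage_hours = 20                 # hypothetical extra delay blamed on remote support

print(f"per day:  ${per_day / 1e6:,.1f}M")
print(f"per hour: ${per_hour / 1e6:,.1f}M")
print(f"cost of {extra_outage_hours} extra hours: "
      f"${per_hour * extra_outage_hours / 1e6:,.0f}M")   # roughly $93M, call it $100M
```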
Is the enormously expensive risk of outsourcing and offshoring worth it?
You make a good point John, about turning on the DR site and then moving back. Over the years I only remember ever doing it (for real, not talking tests) three times, and each time moving back was a terrific undertaking. I seem to recall a colleague once had a client that had to turn on their DR site and eventually decided it was so complicated to go back that they never did (an internally managed site, not a rented one).
There are two basic DR strategies in IT: 1) a warm site or 2) a cold site.
A warm site is usually much more expensive and the most reliable. Most companies do not like seeing their money spent on unused computer systems just in case something terrible happens.
The second option is usually what companies do: they attempt to leverage virtual systems to replicate their most important servers. The problem is that the data may not be up to date and there will be delays in bringing up all of the systems.
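A rough way to see that tradeoff, with numbers invented purely for illustration: the warm site buys a small RPO (how much data you can lose to replication lag) and a small RTO (how long until you are back), and you pay for it in idle hardware.

```python
# Rough sketch of the warm-vs-cold tradeoff described above. All figures are invented.
# RPO = how much data you can afford to lose; RTO = how long recovery takes.

sites = {
    #            (RPO in minutes, RTO in hours, cost profile)
    "warm site": (5,       1,  "high: idle hardware plus continuous replication"),
    "cold site": (24 * 60, 24, "low: restore from backups onto spare/virtual capacity"),
}

for name, (rpo_min, rto_hours, cost) in sites.items():
    print(f"{name}: may lose up to {rpo_min} min of data, "
          f"back in roughly {rto_hours} h, cost -> {cost}")
```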
Now add the outsourcing bug to the second DR option, and what you get is that the person who wrote the DR process has their job outsourced and the replacement is lost. This happened to a customer when IBM decided to offshore support to Argentina, and a couple of years later Argentina had to train their replacements in India. In that case, the people in Argentina did not like having to train their replacements and did not perform a complete “Knowledge Transfer” to them. Once the “Knowledge Transfer” was declared complete, the India team had its set responsibilities, stringent and non-conforming. So, in this case, the DR solution was never “Knowledge Transferred” and the India team was lost during the next quarter’s DR test, which IBM failed for the customer. I know some people who went to the Delta account at IBM; I hope they can show that keeping IT staff onshore is well worth it for companies in Delta’s position.
Management doesn’t care about those losses. They don’t care that it takes longer to fix things. They care that they get massive bonuses for reducing IT costs. That money never gets taken back when plans fail, or even when the company goes under.
That’s what you geeks never seem to understand. You are getting laid off because your company is being looted by its own management. I understood this early, lived frugally, invested, and retired young.
Is it really so hard to have mirrored duplicate servers in other locations, and have the logins on a sheet so everybody can swing over in a minute? Our outfit had that in 1998, with a disaster third site in one of the engineering centers. It’s not that hard. You just have to spend some cash.
Actually some big companies do this swap between datacenters very easily. One I know uses datacenter A on even days and datacenter B on odd days. Swapping is seamless at midnight. Why was it so hard for Delta to switch to backup? They wanted to save money and thought this wouldn’t happen.
I’m sure they have examined the cost of building a system that is 100% robust, but it is surely cheaper to deal with occasional outages like this than to build that 99.9999% uptime system.
DR, HA, SPOF, SSDD?
Disaster Recovery
High Availability
Single Point Of Failure
Same Sh1t, Different Day
also, too: BC = Business Continuity
Thanks! 🙂
I’ve seen this sort of thing before, where the IT company designs the proper solution, but the customer says “Oh, that’s too expensive,” and the IT company is forced to remove redundancy to keep the solution “cost competitive”.
Then a DR situation arises, and the IT company has to deal with the “WHY DIDN’T THIS SOLUTION WORK?” demands from the customer.
If only the IT company had a mirror to hand the customer….
First, I’m not sure Delta needs thousands of servers to run their operation. 1,000, maybe; 10,000, doubtful. Flight records don’t take that much room or processing power to manage. Second, with remote access cards, you don’t need to be present to restart servers. There are a handful of systems you might need to restart manually, but most will just have the RAC running at very low power and you can restart the full server whenever.
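For readers who haven't used them, this is roughly what driving those remote access cards looks like. The hostnames, credentials, and inventory below are invented, and it assumes generic IPMI-over-LAN BMCs reachable with ipmitool:

```python
# Hedged sketch of "restart via the remote access card": BMCs (iDRAC/iLO/generic IPMI)
# stay alive on standby power, so servers can be queried and powered on over the
# management network. Hosts and credentials here are placeholders; assumes ipmitool
# is installed and the BMCs speak IPMI over LAN.
import subprocess

BMC_HOSTS = ["bmc-app-01.example.net", "bmc-app-02.example.net"]  # hypothetical inventory
IPMI_USER = "oper"                          # placeholder credential
IPMI_PASS_FILE = "/run/secrets/ipmi_pass"   # password file, kept off the command line

def ipmi(host: str, *args: str) -> str:
    """Run one ipmitool command against a BMC and return its output."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", IPMI_USER, "-f", IPMI_PASS_FILE, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()

for host in BMC_HOSTS:
    status = ipmi(host, "chassis", "power", "status")   # e.g. "Chassis Power is off"
    if "off" in status.lower():
        ipmi(host, "chassis", "power", "on")
        print(f"{host}: powered on")
    else:
        print(f"{host}: already on ({status})")
```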
This is mostly a failure of architecture, where they don’t have multiple data centers, but just one with non-automated failover. Maybe someone will listen to the guy who rambles about distributed databases now.
imagine a beowulf cluster….
You seem to be assuming that they use separate applications for operations management and CRM. Snark? We wish. Supply chain/work management, anyhow. Different business rules per location ==> separate instances (we hope). One server leads to two and two servers multiply like rabbits.
Outsourcing can be a problem but even when you aren’t outsourcing there are rarely more than a handful of people at a datacenter. The number of servers managed per person has increased far beyond what most people can imagine. Combine that with the IT specialization that has occurred over the same time frame and it is almost inevitable that a truly major failure will be difficult to recover from. And then on top of that for the airlines there are a ton of interdependent legacy systems.
Delta bought its RES system back in house from Travelport 2 years ago:
https://www.wsj.com/articles/SB10001424052702303480304579575891541812918
Are these systems still running on bare metal servers or on VM clusters? I’m surprised they don’t have redundant nearby data centers (in separate towns with the speed of networks over fiber nowadays) where apps are clustered to run simultaneously in both DCs, so when one has an outage nothing stops. I work for a large company that does that now, with the DR DC 1500 miles away.
Great memories of running across town to pick up an armful of 9-track backup and recovery tape reels while the power was out in the computer room. Batteries? That was for the big boys and I worked a small shop.
For Cringely, blaming outsourcing for everything is like how others blame global warming.
He might be right, but he has published before finding out for sure.
They could of course have screwed this up all by themselves. But the point about outsourcing is that it is not done to save money; it is done first out of an emotional compulsion to take revenge upon employees for having attempted to assert themselves over the preceding decades, and secondarily to outsource blame and liability. If (as often) it winds up costing more, these things are deemed to be worth the cost. If (as often) it results in operational failures and disruptions of service, that just feeds into the mythos of modern capitalism as being a wild and edgy sort of thing. Continuity of service was so mid-20th-century.
I’d argue that outsourcing blame and liability is the primary motivator and revenge secondary, based on what I’ve seen. Reducing IT and other, “non essential” functions to a line item handled by someone else means that the company can engage in plausible deniability when the inevitable disaster happens.
I also love how he never misses a chance to throw IBM under the bus without even knowing if the lack of restoration was their fault, or if there were other contributing factors out of their hands that kept restoration from happening quickly, or even if Delta is still a customer of IBM.
I was waiting to see if this was an IBM related story…
And on that thought.
I wonder how other customers of IBM would feel knowing that their disaster recovery may be no better?
He says he doesn’t know if it is, but still never misses a chance to throw the company under the bus, never realizing that his fear mongering among the employees may actually be a small contributing cause to the overall issues.
I agree with you Bob, 100%. To add something more: the airport should be powered, at a minimum, from two independent power networks. If, let’s say, the northern network is down, the southern network usually works. I do not think the whole of Virginia was without power. If both sources are down, then comes your scenario: UPS, motor-generator.
Maybe they needed to save costs…
Starting such a system with only remote assistance is practically impossible. I am waiting to see if they answer your questions???
There was a time when NWA had redundant feeds all over hell and gone in Minneapolis. The only problem was, they didn’t engineer them with the vendors to have them in totally separate fiber networks. You can guess the rest when I say “backhoe.”
The costs rise exponentially and IT has to be eternally vigilant. But big outfits do this often enough that designers know the tricks.
Actually, a few years ago, a tornado system shut off ALL power to the north end of Alabama for about five days. EVERY transmission line into this end of the state was down.
I keep saying I’m going to build a small solar array, to charge batteries for critical stuff (CPAP, nebulizer, radios).
Looks like today’s disaster for IBM is their cloud screwing up Australia’s census: https://www.theregister.co.uk/2016/08/09/australian_census_slips_in_the_ibm_cloud/
Wow. Weren’t they involved in the census 100 years ago that had punch cards even before computers were invented?
IBM previously ran the Census for the ABS, in 2011.
And they also eagerly helped the Nazis to efficiently round up and exterminate Jews.
Oh man there’s therapy for being that angry at IBM….
IIRC they made the card-based census machines that the Germans used pre-WWII as well.
Not sure the conversation should be outsourcing related. I am sure we all have personal biases but if you do understand the technicalities of a large data center and today’s mirrored systems, you will know that with sophisticated redundancy routines, one failed data center should not bring a mission critical system down. Also if it does go down, bringing it up does not require hundreds of people crawling around.
Blaming this on outsourcing is like saying we should send someone up in space strapped to satellites because it’s easier to debug a satellite system if you are on the satellite in person! And thank god Delta is not taking calls from people like us claiming to be experts… I am sure they have enough experts in-house to fix and learn from this.
I have been flying Delta as a loyal customer for years, and while I was a little frustrated when I was delayed yesterday, Delta has always been a great airline.
LOOK OVER THERE!!!!!!!!!!!
A similar thing happened on Christmas 2005, when ComAir’s reservation system failed, which brought down most airports in the US. A total standstill.
The cause was traced to a 100% Indian IT outsourcing company which worked for ComAir. A single int value was used where a long should have been used. When the number of crew reschedules surpassed 65,323, the entire system came crashing down. Those highly skilled 100% Indian IT workers that America’s economy can’t do without didn’t know the difference between a short and a long and the implications for using either.
Yes, outsourcing is often the cause. Vast economic contributions to America’s economy.
https://www.foxnews.com/story/2004/12/25/comair-cancels-all-1100-flights.html
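The failure mode described here is classic fixed-width integer overflow. Accounts differ on Comair's exact limit, so the numbers below are a generic illustration of a 16-bit counter wrapping, not a claim about their actual code:

```python
# Generic illustration of a 16-bit counter overflowing, the failure mode described above.
import ctypes

def as_int16(value: int) -> int:
    """Truncate a Python int to a signed 16-bit value, the way a C 'short' would store it."""
    return ctypes.c_int16(value).value

for reassignments in (32_766, 32_767, 32_768, 65_536):
    print(f"{reassignments:>6} crew changes -> stored as {as_int16(reassignments)}")

# Output:
#  32766 crew changes -> stored as 32766
#  32767 crew changes -> stored as 32767
#  32768 crew changes -> stored as -32768   <- wraparound: the counter "goes negative"
#  65536 crew changes -> stored as 0
```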
And, it should be noted, Comair was a subsidiary of Delta at the time. It never recovered from that 2005 disaster, and is now defunct.
It was actually Christmas 2004, my dear boys. Just to prevent misreading for future visitors. 🙂 Love the info and discussion. BR Sonia
Reminds me of the wonderful job Satyam did supporting desktops and servers for the World Bank. The financial scandal quickly overshadowed this news, but it WAS reported at the time. The cheap third world folks think they know what they’re doing but they have no clue.
How do you fix a satellite with a power failure?
The following was the biggest IT disaster I personally witnessed (as a customer, although a small one) in my entire life, and perhaps the biggest to date.
I must say that in the end I didn’t lose a single byte though; it was just a matter of having my instances down for three days, waiting to be brought back.
Certainly having redundant resources mirrored in other “zones” (data centers) would have avoided the outage, but failover resources cost money and not everybody can afford them. In the end I can survive three days of downtime, but others can’t.
https://aws.amazon.com/message/65648/
A more believable explanation from one of the frequent flyer sites:
According to the flight captain of JFK-SLC this morning, a routine scheduled switch to the backup generator this morning at 2:30am caused a fire that destroyed both the backup and the primary. Firefighters took a while to extinguish the fire. Power is now back up and 400 out of the 500 servers rebooted, still waiting for the last 100 to have the whole system fully functional.
If the center was running on a telco-style 48-volt system, batteries backed by gensets backed by commercial power, the chances of that happening become minute. Not impossible, but minute.
I was working on consolidation projects when Delta merged with Northwest. Northwest had some pretty good technology, but they were told by Delta that Delta’s was the best and should take priority in the streamlining. When the Delta kit arrived, Northwest people couldn’t believe how archaic it was. Some people even talked of cancelling the merger because they had been misled. It wouldn’t surprise me if this “power outage” was the result of Delta’s older technology replacing superior Northwest technology.
Canceling the merger because of tech? Never happen.
Didn’t IBM just dump their U.S. people in favor of resources in India???
That sounds a bit like an implication that Indian IT staff aren’t people, since resources aren’t people.
Of course, most corporations refer to their staff as resources, and treat them as such… and then complain when those people don’t show them any loyalty.
most company HR departments now refer to people as “Human capital”.
BINGO. Americans are starting to smell the smoking gun everywhere. And it’s not just this one instance.
Vast economic contributions to America’s economy.
Gee! Ya think! I think there needs to be a new phrase for this: “Keep your enemies close, and your cloud/data servers closer.”
Just sayin!
Early in my career, I worked in a department at American Airlines (yes, even before we used the name SABRE for the company) called Realtime Coverage.
We had an Operations Department. They mounted and ran our 7 track backup tapes when I first started there in the 1980’s. We converted to cartridge tapes shortly after I arrived. We had close to 30 operations people on-staff full time day and night (more like 10-15 at night) to ensure that SABRE ran and took the recovery measures when the system crashed, we lost power, had database corruption, application memory corruption. Realtime Coverage led the technical part of recovery and reboot. We made the calls on taking the system down, bringing it back up, and were heavily trained in all aspects of the system, from the hardware, to the operating system, to the key applications running the airline, to the database. We busted dumps and looked at system problems. It was an intense, pressure filled job that I’m STILL proud to have on my resume many years later.
These systems still run an operating system TPF (Transaction Processing Facility) that was originally written as ACP (Airline Control Program) in the 1960’s on IBM 370 hardware. ACP came after Reservisor. TPF was the 1980’s revision of it, converting from 370 Assembly to the C language (Or a derivative called TPFC). It is long past time to retire TPF. IBM tried to kill ACP while I was at American Airlines, and the result was TPF.
As evidenced by the Southwest outage as well, airlines have failed to keep pace with computer technology, instead clinging to old operating systems and protocols, while the hardware continues to be updated around them. The whole industry needs a re-write and update, across the board. The innovation is occurring in the Hotel and Car Rental systems, who have needed both real-time reservations and “marketing images” for websites for a long time now. Let’s stop investing in buying back stock and start updating systems for the 21st century.
My advice is for ALL the major airlines to each put in about 10 million dollars (20-30 airlines would put together a fund of about 200-300 million) to modernize and work on the interfaces between them and the hotel and car rental systems, tours, and other functions that SABRE/Amadeus/Apollo/etc. interface to. This would fund a research consortium to look at the current technology and DEFINE THE INTERFACES for the next generation system. Maybe HP and IBM and Microsoft and whoever else wants to play could put in some money too. The key for this consortium is to have the INTERFACES defined. Give the specifications to the vendors (HP, IBM, Microsoft, Google, Priceline, Hilton, Hertz, whoever) that want to build the next generation reservations system. Then let them have 1 year, and all have to work to inter-operate on the specification (just like they do on the “old” specs today for things like teletype and last-seat availability). This has worked well in the healthcare space in getting payers and providers to work together. Each potential vendor needs to plan to spend 10-50 million dollars on their proposed solution. Then we have the inter-operability technology fair (I would make it 2 weeks to 1 month) and each vendor can pitch to each airline, car rental, hotel, tour company, Uber, etc. Let each vendor do what he wants as long as the requirements for the specifications are met. Let the best tech vendor win.
It’s far past time to update these systems. Otherwise, more heartache, pain, and probably government bailouts to come. Possibly even larger travel and freight interruptions. A longer-term blow-up could put an airline out of business. Remember Eastern? I do….
Sorry, they were 9 track tapes. You put the reel up, ran a couple of turns and pressed the button. And we had some autoloaders, too.
Man that brings back memories. Programming cargo reservation systems in assembler in ACP was the start of my career. C was tested and dumped as too slow…
It’s a long way from there to the Phoenix /Elixir /React stack I’m working on now….
Now here is someone who knows how to use acronyms. Define and then use. Nice writing.
I worked on CONFIRM back in the day…complete mess.
As someone who supported aspects of AA.com and eventually some aspects of flight operations from 2007-2010: they trusted us enough to let us bid on updating their flight operations systems from the legacy stuff. We built something around IBM MQ that met their performance requirements, and it ran (all their stuff did) active-active, or at least with a hot standby, in two data centers that were something like 100 miles from each other.
So it seems they learned a little. We were an outsource company, but we were a high-grade operations center that gave you smart 24/7 coverage; we only woke people for serious issues that we could not fix ourselves (we were based in San Francisco, so not cheap!). Unfortunately, the company that bought that company tried hard to make things cheaper and offshored much of it; everyone I knew there has moved on.
I started reading this and thought… how in the world is Bob going to tie IBM into this disaster… and… he did…
I do know that this wasn’t an IBM issue and all of the IBM systems came up and were operational. There was an issue with another very large IT vendor… but this was one where IBM actually is ok.
But who had responsibility for the power to the data center, the battery backups, the generators? And then, who owned the backup data center failover procedures? It might be better to have operations folks in Atlanta where they can do some good (or at least within 10-15 minutes of the data center). Also, I read in the late 1990’s about banks running on fuel cells where the systems were so reliable, “the grid” was their backup power. They sold power back to the grid, and had near perfect reliability. It seems to me that is what airlines need for power, not the power company! Whatever happened to fuel cells for data centers?
Are you a Donald Trump supporter?
Afraid much?
No, I’m either voting for Hillary or Gary Johnson. You may not have lived through Y2K. Some people did not believe that the power grid would stay up. My boss actually had me researching fuel cells as an alternative to grid power in 1998/1999. Once it became clear that “everything was going to be all right,” the research just got filed away. But my point is: if you have something as valuable as a flight system, reservations system, crew scheduling, weight and balance, flight planning, then you damn well better have reliable power, network, and redundant computer systems to support it. And systems exist which can provide more reliable power than the power company. If you promise five nines and then don’t deliver, in my opinion you are committing fraud by promising the customer 24/7 reliability/availability and delivering far, far less.
First I want everyone to give Cringely a pat on the back. He forecasted massive layoffs in 2016, and was pretty close to the real numbers. Don’t add back in the tens of thousands of technology school grads at $29k/yr, hired to replace the 50+ yr old crowd at IBM. Anyway, he saw it coming. It’s not finished yet. Credit where credit is due.
Secondly, yes. Delta is one of IBMs largest customers. When DL came out of bankruptcy around 2005ish, they pal’ed up with IBM using AMEX’s cash cow. Every year since, a press announcement such as this was published: https://www-03.ibm.com/press/us/en/pressrelease/20162.wss
Thirdly, IBM thoroughly corrupted DL’s infrastructure, staff, and managerial tiers. DL was so mixed up they began to fire local staff faster than IBM did. The major sites are Brazil, Pakistan, and India. They used China for a while, but the political environment over there wrecked that back in 2010. You’ve got to remember, last Sunday’s little DL burp was nothing. The entire commercial flight regime in the US is running at 109%, including the IBM-controlled FAA, security/passport systems, the AT&T-controlled data communications grid, and of course the thousands of operations systems in use at Delta also run by IBM. DL has lots of locals that are very sharp and keep their company running well. They fight with the India/IBM crowd hourly, all day, all week, all year. India couldn’t care less, making $5/hr and facing the same mass IBM RAs as the US. Talk about a bad attitude.
Wait till one of the crew scheduling systems fails. Especially the massive and complex flight attendant system. Pilots have so many union and FAA rules it’s surprising they even work the maximum of 80 hrs/month allocated. Then there are the other ops systems like weight/balance, loads, aircraft tracking, ticketing, financial, customer service/gate, and don’t forget the whopper – maintenance. All airlines use similar technology, which was really trail-blazed by UA in the ’60s. They’re still the largest carrier in the world, and of course another huge customer of IBM’s, providing constant monthly cash flow to IBM services. Talk about Ginni and crowd dozing for dollars. Just sleep, party, and rake it in. And they CANNOT get away from IBM without tanking the whole thing. Now that’s a mess indeed.
As I pointed out above, there has already been a massive crew scheduling system failure: ComAir in 2005. It was on Christmas Eve/Day and their failure brought down the entire US airport grid as it messed up other airlines’ crew schedules. Passengers were stranded at over 118 US airports.
The error was traced to a single short int being used where a long int should have been used. Under heavy travel load, when crew reassignments exceeded 65,323, the system crashed.
It turns out ComAir had contracted their entire IT dept, including programming, out to a 100% Indian outsourcing company with 100% Indian programmers – highly skilled geniuses who didn’t know the difference between a short and a long and when to use either.
Yes, outsourcing is often the cause.
https://www.foxnews.com/story/2004/12/25/comair-cancels-all-1100-flights.html
Remember the CEO who claimed he could buy a used Boeing 777 for under $10 million? (Fact check: it was only $7.7 million.) When you pay peanuts, you get “monkey” IT systems, outdated jets in your fleet, etc., etc. I may as well cut up my Delta SkyMiles card.
OK, let’s do some simple, stupid forensics here, quoting from the news: “Delta initially pointed to a loss of electricity from Georgia Power, which serves its Atlanta hub, when its worldwide computer network crashed at 2:30 a.m. Monday.” And then they said: “Delta Air Lines said Tuesday that an internal problem, not the loss of power from a local utility, was to blame for the disruption that caused hundreds of flight cancellations and delayed tens of thousands of travelers Monday.” So initially they didn’t know that it was their own electrical system failure. Presumably they must have been arguing for some time. And so far no mention of an overseas outsourced data center.
Before we go further in analysing “data backup,” RAID, redundancy, routers, and so on: if they say that a “power outage” is the cause of all this mess, then shouldn’t we get to the basics first, namely electricity? Aside from the UPS, which is just a temporary short-term solution, a “big” and critical data center such as Delta’s should have diesel generators installed, no?
The next question is how long were they “in the dark”? No matter how long, once the electricity was up again, how long does it take for the system to get back to “current status,” or, like in the “good old days” of Windows, the “last known good configuration”? Presumably, data were not destroyed, just interrupted, cut off, so there was no need for a systemwide data restore, perhaps just some synchronization, which for a sophisticated system like Delta’s I presume was designed in. Then, last but not least, is the entire system “modular” with a “module priority structure,” so that when the reservation system is up, for example, the other systems lower on the priority list could be synchronized later?
The problems lie not in the power outage, but in “bringing everything back up.” When we had this on eight inter-connected mainframes and maybe 24 front end processors, my team (Realtime Coverage) actually had the procedures to bring up everything in a specific order. There were literally 10 people in the room (and adjacent rooms) all on a war room call running through the procedure. We had planners who told us the sequence, and what checks and tests needed to be in place, to get the airline operations (flight systems) up first, then passenger service (on-boarding), seating, etc. Then finally (last), the travel agencies and Internet (well after I left). As we “decentralize” IT, it becomes harder to follow this kind of script because you literally “don’t know” what is running on 500 boxes. It becomes uncontrollable. You just “try to bring it up” and then it crashes on the load, or on intercommunication issues. Coordination problems turn something that should be a 1-2 hour outage into a full day (or two) of flight delays.
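That "specific order" is really a dependency graph, and at least the sequencing part can be written down and kept current. A minimal sketch, with invented systems and dependencies that loosely echo the order described above, not anyone's real topology:

```python
# Sketch of recovery sequencing as a dependency graph: bring each tier up only after
# everything it depends on has passed its checks. Names and edges are invented.
from graphlib import TopologicalSorter  # Python 3.9+

dependencies = {
    "network":            set(),
    "mainframe":          {"network"},
    "flight-operations":  {"mainframe", "network"},
    "passenger-service":  {"flight-operations"},
    "seating":            {"passenger-service"},
    "agency-and-web":     {"passenger-service"},   # deliberately last, as described above
}

for system in TopologicalSorter(dependencies).static_order():
    # In the real procedure each step had named checks and a person responsible;
    # here a print stands in for "run the checks, then give the go-ahead."
    print(f"bring up {system}, run its checks, then proceed")
```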
Back in the mid’90s a new internet NAP went online in California. Shortly afterward, it went down.
The NAP had both batteries and a generator; it was supposed to switch to internal power automatically if the grid went down. As the story had it, the system that controlled the switchover was on the grid side of the demarcation, so when the power went down… so did the system that was supposed to switch over.
One issue I didn’t see brought up was staffing for the recovery. Even if all of your systems had DR sites and were tested, it takes a fair amount of resources (network, OS, backups) to bring them up. When the DR site was tested, it was likely only one of several systems, applications, etc., at a time. Now you have a COMPLETE outage, you have to restore everything on 10k systems, and you fired all your excess network, backup, and OS admins, so you have people supporting 100s or 1000s of systems and they ALL went down…
In the 40 or so years I was in IT, I was only involved in one major DR test. Everybody had just the right tools to use during the test. This was before the company went to India for their primary computers. It seemed to go amazingly well, one or two minor glitches, but all in all a well-done job. Luckily I left before it was the far easterners’ time to try a DR test. Rumor was it was a fiasco.
None of this surprises me. I worked for the insurance group Aon and on September 11 we had a little thing happen called buildings going down. And I was there, a 101st floor survivor from the South Tower. Our data center in that building crashed big time. Oh, disasters can happen. As it turned out, my manager scheduled a discussion at his insistence with Chicago (management) to discuss the Trade Center as a focal point for an outage, and it was set for September 19. Oooops. Now disaster recovery and business continuity is a must for every business, and having generators and APC devices on the server racks is, well, common sense, particularly IF you are a global giant like Delta. Outsourcing is never successful, and having 50,000 eager young IT types in Bangalore does you no good IF you have to turn on a switch in Atlanta, or 10,000 switches as it turned out. And god forbid IBM, once a great firm, now a shadow. This is everything gone wrong. Good job Bob.
A 30-year-old UPS failed, taking down a core router, and that took down all the other core routers… the cascade proceeded from there.
Someone said that Delta (and all airlines for that matter) is an IT company that flies people to destinations using planes.
There are so many things incorrect in your article it is staggering. You admit that you do not even know what happened and then pontificate about IBM and Delta. You imply many things about both companies that are entirely false. The good news is that I realize that I could have a career writing articles with no facts and blowing hot air. BTW – it is Delta Air Lines. You could at least do a fact check on the name of the company about which you write.
The least you could do is provide counter data as opposed to saying “you’re wrong!” and leaving.
An old IT architect’s adage: “The bigger the system, the harder it is to keep running and the more painful the failure and recovery,” and “The faster the network, the faster the system goes to hell and crashes.” These adages were often used by wise old codgers to put the fear of failure into management, to get them to strive to build and maintain the best constantly tested failure-avoidance systems.
I guess the Gods of Accounting will start the chopping off of heads very soon.
Bob, here is another theory for you to explore. Delta has sold two data centers to Digital Realty in the past few years, one of them in Atlanta. Digital Realty’s website confirms that this property is leased to a triple net customer. If there truly was an electrical outage that led to the massive data system failures, who is really to blame? Does Delta have a service level agreement with Digital Realty to guarantee power? Does this expose the risk of losing control over your critical infrastructure? Does Delta have a case to recover losses from their landlord?
https://www.digitalrealty.com/data-centers/atlanta/
https://www.investopedia.com/terms/n/netnetnet.asp “requires the lessee to pay the net amount for three types of costs, including net real estate taxes on the leased asset, net building insurance and net common area maintenance” That definition still leaves me in the dark about who decides how much to spend on insurance or common area maintenance, or even what is the “common area”.
Cripes, this piece is built like a house of cards.
* Is Delta still outsourced to IBM? You don’t know.
* Was the system involved outdated? You don’t know.
* Was this outdated system part of the outsourcing contract? You don’t know.
* Did they further outsource to India? You don’t know.
* Did they pay their outsourcer for DR support? You don’t know.
* Did any of what you write about even happen? You don’t know.
Yet you somehow manage to write a whole piece about this.
Here are a couple of things I do know. If an outsourcer were involved, Delta would have *definitely* thrown them under the bus. They threw their damned power company under the bus. And DR contracts are (rightly) freakishly expensive, because of the extra hardware, personnel, training, and drills. Most companies don’t pay for this service, even if they outsource. If Delta had paid for the service, again, I have no doubt they would have thrown their DR partner under the bus.
There you go: six sentences that are more true than your entire piece.
Re: “more true than your entire piece” So what did Bob say that was wrong?
I have to wonder if Ginny is trolling Bob’s board at this point….
Why wasn’t this data center running on a UPS or, depending on the size of the data center, multiple UPS units that provide clean power 24/7? The UPS units should have at least 30 minutes of run time, which is more than required to bring generators online to restore power until the utility returns. If a single piece of switchgear took down this data center, which is one explanation I read, then the folks who designed the power distribution for the data center weren’t very bright, or someone cut corners somewhere to save a buck.
Just sayin’
In my personal experience, you want more than 30 minutes, because not all your generators will start the way you think they will, and you need time to go kick them until they do.
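The sizing arithmetic behind those run-time numbers is simple enough to sketch; every figure below is invented for illustration:

```python
# Quick sketch of the UPS sizing arithmetic behind "at least 30 minutes" -- numbers invented.
it_load_kw = 500.0          # hypothetical critical IT load
battery_kwh_usable = 300.0  # hypothetical usable energy after derating for age/temperature

runtime_minutes = battery_kwh_usable / it_load_kw * 60
print(f"runtime at full load: {runtime_minutes:.0f} minutes")   # 36 minutes

# The margin matters: if a generator refuses to start and someone has to go
# "kick it", every extra minute of battery is time you actually get to use.
generator_start_and_kick_minutes = 20
print("margin left:", round(runtime_minutes - generator_start_and_kick_minutes), "minutes")
```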
IBM manages the mainframes for this account and nothing else. The facilities people did not plug 2 servers into 2 different PDUs/panels, so when power was lost the servers went down due to lack of redundancy. Human error, and not IBM.