Delta Computer Glitches Force Flight Halts Third Year in a Row

(bloomberg.com)

72 points | by danso 2033 days ago

11 comments

  • danso 2032 days ago
    The end of the article provides this comparison for context:

    > Still, Delta isn’t the only U.S. carrier to have suffered from technical glitches. In December 2017, a fire at Atlanta’s airport, the world’s busiest hub, caused a major electrical disruption, crippling services and stranding thousands of passengers of Delta as well as rivals including Southwest Airlines Co.

    It seems to me that Southwest's systems being brought down by fire and electrical disruption is a different "technical glitch" than the kind of systems problem Delta just had. A better comparison would be July 2016, when a router failure created a "chokepoint that crippled hundreds of the company's software applications":

    https://www.dallasnews.com/business/southwest-airlines/2016/...

    • Twirrim 2032 days ago
      > It seems to me that Southwest's systems being brought down by fire and electrical disruption is a different "technical glitch" than the kind of systems problem Delta just had.

      To some degree, sure. But they really should be capable of a straight failover (forced, if necessary) to geographically separate components.
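
A minimal sketch of what a forced failover across geographically separate sites might look like at the call level. The names `call_with_failover`, `sites`, and the handler callables are hypothetical and not taken from any airline's system. A real deployment would also need health checks and replicated state, which this leaves out.

```python
def call_with_failover(sites, request):
    """Try each (name, handler) pair in order and return the first success.

    `sites` is an ordered list, primary first. A handler signals that its
    site is unreachable by raising ConnectionError.
    """
    errors = []
    for name, handler in sites:
        try:
            return name, handler(request)
        except ConnectionError as exc:
            # Remember why this site failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sites failed: " + "; ".join(errors))
```

The caller gets back the name of the site that answered, so an operator can see that traffic has moved off the primary.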

      • txcwpalpha 2032 days ago
        >To some degree, sure. But they really should be capable of a straight failover (forced, if necessary) to geographically separate components.

        AFAIK the failure at ATL had nothing to do with the ability of the IT network to fail over. It was because the power at the airport was off, and therefore the airport was unable to fly planes. And when the world's largest airport goes dark, it causes ripples in delays across the entire country because of misplaced planes/staff/passengers. And you can't just "failover" hundreds of planes and thousands of passengers to a different geographic location.

        We're really talking about two completely different types of failure here, and the other comment is right in that it's weird to compare the ATL fire to last night's failure.

        • AnimalMuppet 2032 days ago
          It's weird to those who know. But to those who are just looking at how bad it messes things up from a passenger's view (i.e., non-aviation-knowledgeable people reading the article), it's reasonable to compare things of different causes that have similar effects.
          • txcwpalpha 2032 days ago
            True, but if that's the case then why doesn't the article bring up all of the adverse weather events that have had the same impact?

            Even for non-aviation-or-tech-knowledgeable people, I think most still care who is at fault for such disruptions. The article frames the ATL fire as Delta/Southwest's fault. It puts the ATL fire in the same category as other, airline-specific issues that were caused by Delta's IT systems. But that just isn't the case. The ATL fire affected every airline at ATL (but for some reason the article only calls out SW and DL), and was the fault of the airport, not the airlines. It's more similar to bad weather shutting down an airport than it is to the IT issues that Delta faced last night.

            It just feels like bad reporting to include that comparison.

  • anon4lol 2032 days ago
    I just have to comment about this.

    I had the weirdest job interview for a contracting job at Delta. I passed two technical phone interviews and was scheduled for an in-person interview. The night before it dumped a ton of snow. I tried to cancel, but didn't have the manager's contact info. I left an hour early to get there, and still I was 5 minutes late.

    They let me sit in the lobby for another twenty minutes, then the Manager's "administrative assistant" came and fetched me. The Manager then asked me some of the most condescending questions, like "what operating system do you use? Have you ever used Linux?" He then finished with a snide comment and that our time was up and he had a meeting to go to.

    I called the recruiter from my car. They decided to pass, "because the Manager said I didn't apologize enough for being late." That was the last I heard from the recruiting company. They ghosted me. Wouldn't talk to me or respond to my email. It was the most unprofessional situation I've run into in 25 years of contract/consulting. Just plain weird.

    Note, this job was not to code, but write documentation about the code. They wanted a C++ programmer to document the existing system that schedules and reroutes flights during irregular ops, that was written by some guys from Bell Labs running on some fault-tolerant hardware. He had sold the management on converting everything into Java running on commodity/cloud servers.

    Every time I see their system meltdown... I'm reminded of this.

    • pavel_lishin 2032 days ago
      Your story reminds me of the LinkedIn article about employees ghosting employers and recruiters.

      https://www.linkedin.com/pulse/people-ghosting-work-its-driv...

      • mikestew 2032 days ago
        And the article waits until late to point out that it is the companies themselves that have explicitly demonstrated that that's how they want it to go: once you lose interest, quit answering the phone.
        • pavel_lishin 2032 days ago
          I wonder how many people reading the article thought to themselves, "Well, no shit" vs. "Oh, wow!"
    • purplezooey 2032 days ago
      I had an interview for a place in SF and was 5 minutes late and had a similar situation. I hardly ever ding people for coming a few mins. late to interviews. People love to latch on to that. It's kind of pointless.
  • AceyMan 2032 days ago
    Once again, TFA is unclear about exactly what DL systems became unavailable. As usual, it's implied that it was ticketing and reservations stuff ("Deltamatic" for DL) which is a nightmare, yes. But back in the day (of paper tickets) you could resort to manual processes and still get flights out. The backpressure would accumulate quite fast, and you couldn't operate in 'manual mode' for very long, but you could bust your butt and do it for a departure/arrival push or two.

    The real showstopper is if the Operations Center can't deliver the dispatch release; that's the legal document prepared by the airline dispatcher that grants the flight permission to initiate the flight. Without that, it's a no-go. Even if Deltamatic were down, Flight Ops could fax the releases over, as the dispatch planning system is merely tied into Deltamatic to make local printout easier; it doesn't actually run on Deltamatic.

    These days with e-ticketing and more stringent DOT/Customs standards on accurate pax manifests, etc, having Deltamatic (ticketing/res) down is a show-stopper, but it wasn't always that way and the way these articles are written it leaves, in my mind, the open question about what parts were actually fubared.

    (me: ex-aircraft dispatcher for a DL-owned regional carrier.)

  • rectang 2032 days ago
    Which airlines, if any, run outstanding software departments?

    The impression I get from these repeated fiascos is the whole industry still runs on mainframes and java applets held together with duct tape and bubblegum, and that their executives are dinosaurs with no appreciation for what it takes to build robust failure-tolerant systems. But maybe it's just a skew in the reporting.

    • UperSpaceGuru 2032 days ago
      Having worked on this personally, Deltamatic runs on HPUX & pretty much C++.

      My info is maybe 10 years old, but I doubt much has changed.

      These infrastructure type things are incredibly complicated. Not in the “Newfangled” framework way, but rather the business implications of real world impact of potentially bad code.

      I consider it one of the highlights of my career going thru the Delta Northwest Merger without shit hitting the fan. We were writing decrementer & while loops instead of for loops because there was a measurable performance & stability impact.

      There’s something to be said about software that has withstood the test of time.

      I’m impressed that the team is able to handle outages as well as they do. Having had some experience working with the folks at Delta & Northwest, I’d say they’re pretty impressive people. I’d be happy to steal some for my startup if they’d be willing to part with the flight benefits & tenure.

      Full Disclosure: I was not a Delta Employee, but worked for a startup that built the checkin systems (which got acquired by NCR). I got to work with Delta as a client.

      • madengr 2032 days ago
        What do they run HPUX on? HP has been out of the PA RISC Unix workstation for years. Do they still make mainframes?
        • secabeen 2032 days ago
          Probably HPUX x64.
    • vorpalhex 2032 days ago
      > with no appreciation for what it takes to build robust failure-tolerant systems

      It's a risk aversion issue. Large software projects are horrendously difficult to get right, and managing airlines is a horrifically complicated kind of project. Fault tolerance aside, just correctly modeling the problem is a mess and by the time you have a correct solution, the problem itself may well have changed. So you end up having to throw a lot of resources at a very complex problem and know that you may entirely fail at it, which is not the kind of gamble execs love.

      • mschuster91 2032 days ago
        > It's a risk aversion issue. Large software projects are horrendously difficult to get right, and managing airlines is a horrifically complicated kind of project.

        Banks for a long time thought the same. Then banking startups with code written on a clean slate in modern architectures emerged and are eating the banks' lunch, as the banks' infrastructure (more often than not a mix of mainframe, weird custom middleware and half-modern webstacks) cannot keep up in development pace. For example, in Germany Commerzbank (once a powerhouse) got kicked out of the DAX, with former fintech startup, now giant, Wirecard replacing them.

        Airlines are only "kept safe" from modernization because the cost of entry is so unbelievably huge - to start a bank one needs only a couple of employees plus a couple million euros founding money, but to start an airline one needs vastly bigger sums.

        • rectang 2032 days ago
          Thanks vorpalhex and mschuster91! To me, your two posts capture the takeaway and explain why there seem to be no airlines which run outstanding software departments.
        • pavel_lishin 2032 days ago
          > on a clean slate

          I think that's one of the keys to their success. Large banks can't just say, "well, we're not going to be supporting A, B and C so we can do a total rewrite."

        • PurpleBoxDragon 2032 days ago
          Are startups more likely to have successes writing these applications in modern architectures than large corporations, and if so, is it perhaps because they are startups and thus don't have corporate culture and all that it entails?
    • pc86 2032 days ago
      Running on mainframes and java applets is not inherently a negative thing. Not everything needs to use npm or be written in Rust.
      • rectang 2032 days ago
        Sure, so emphasize the "duct tape and bubblegum", then. There are a number of modern software development practices that make software systems more robust as they are evolved over time. These are the kinds of things that companies whose leadership lacks individuals with tech backgrounds do not seem to appreciate enough to budget for.

        * Sharding and clustering.

        * Queuing between services.

        * Geographic distribution across multiple datacenters.

        * Continuous restoration from backups.

        * Unit testing.

        * Source control.

        * ...

        Those practices are not language or system specific, but they may not be easy to follow when using older technologies.
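
As one illustration of the list above, here is a minimal sketch of "queuing between services" using only the Python standard library. The function name `run_pipeline`, the `max_pending` limit, and the upper-casing "work" are all made up for the example. The point is only that a bounded queue lets a slow consumer push back on a producer instead of either side falling over.

```python
import queue
import threading


def run_pipeline(events, max_pending=4):
    """Hand events from a producer to a consumer through a bounded queue."""
    pending = queue.Queue(maxsize=max_pending)  # bounded, so put() can block
    processed = []
    done = object()  # sentinel that marks the end of the stream

    def consumer():
        while True:
            item = pending.get()
            if item is done:
                break
            processed.append(item.upper())  # stand-in for real work

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in events:
        pending.put(event)  # blocks while the queue is full (backpressure)
    pending.put(done)
    worker.join()
    return processed
```

With a single consumer the events come out in the order they went in. Nothing here depends on a particular platform, which matches the point that these practices are not tied to one language or system.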

        • ogn3rd 2032 days ago
          Are you from the cloud? ;)
          • rectang 2032 days ago
            Heh. I'm not, and all of those practices are compatible with running your own data centers.
      • steveklabnik 2032 days ago
        I feel like, as a Delta customer (I was actually flying yesterday and narrowly missed the situation described in the article), I should make a joke here...

        You’re right though. That said, these systems are crufty. It’s true that they get the job done often, and updating them doesn’t inherently mean that they’re better, but they fail a lot. One recent example; I had a reservation cancelled 24 hours after booking three times, with no explanation why. “Normally there’s a note here explaining... but this time, nothing. This just happens and we don’t know why.”

        • sandworm101 2032 days ago
          >>"This just happens and we don’t know why.”

          The person telling you doesn't know why, and there is no way for them to directly check, but there are a variety of reasons reservations get canceled, reasons the airline doesn't want to pass on to you. Top of the list is being bumped for an airline employee who needs to be somewhere. Then come premium customers such as government agencies/police/military. Then rich people (premium travelers/club members etc). When such people get a flight at the last minute, they are bumping someone else. That someone isn't told why.

          • steveklabnik 2032 days ago
            This is true. I have quite high status, but that’s not the same as being an employee or government member.

            In this case, I don’t think that’s what happened; I was able to re-book the exact same flight all three times. In those cases, I should at least be notified my flight was cancelled, but was not. The third rep suggested it was a KLM/Delta integration issue. This is also only one of many issues I’ve run into over my years of pretty extensive travel.

            Regardless, you are right; it could be a white lie.

      • swarnie_ 2032 days ago
        I think you just alienated 2/3rds of the user base in one comment =)
        • pc86 2032 days ago
          I've worked with old COBOL, FORTRAN, and RPG systems and have argued to management for replacing every single one of them at one point or another, but the overall sentiment on HN that if it isn't written in JavaScript or OCaml it's probably shit does get tiring after a while.

          In the world outside Silicon Valley, and outside of Venture Capital-funded projects, the type of code and infrastructure everyone here idolizes is the exception by a wide margin.

    • txcwpalpha 2032 days ago
      Doing consulting for a big consulting co, I have many friends who have done work for all of the major airlines. From what they tell me, they all actually have very robust engineering orgs, and the systems that run things like the website, backend payment systems, security systems, finance apps etc are pretty advanced.

      The problem is that those systems are pretty segmented from the systems that schedule/dispatch planes and staff. Those systems are incredibly complicated due to the complex nature of managing the status/location of 1,000+ airplanes, millions of pieces of cargo, thousands of flight attendants and pilots, etc. And those systems also often have to interface with antiquated airport or FAA systems. As a result, making changes to the reservations or flight management system is a lot different than working on a web app.

    • waylandsmithers 2032 days ago
      From strictly a UX standpoint, I recently found Virgin's website to be fantastic to use-- very clear hierarchy, nice fonts and colors, and a unique approach to form error handling-- the text of the submit button itself would change to tell you what fields still needed to be filled in.

      No idea what their backend looks like though.

  • rb808 2032 days ago
    Incidentally Delta has the same market cap as Square at ~$40B. It also owns 881 airplanes. Plus it has to write critical software to keep them all running smoothly. Non-tech world is tough.
  • Sharlin 2032 days ago
    As an aside, the title is a garden path sentence if I’ve ever seen one. Had to backtrack at least twice to parse it correctly.
  • chisleu 2032 days ago
    At least the "glitch" let everyone get home from defcon before "happening" this year.
    • acct1771 2032 days ago
      Has there been strife between the community and Delta?
    • dylan604 2032 days ago
      timebombs are great ways to add plausible deniability
  • exabrial 2032 days ago
    I'm curious if they're still using their mainframe based systems? They've been workhorses over a number of years but show their age when it comes to making quick changes.
  • rbanffy 2032 days ago
    Do they publish detailed postmortems?
    • snaky 2032 days ago
      > The recent Delta Airlines system outage and the prior Southwest outage are pointing people to blame their antiquated technology and infrastructure. In particular, a lot of these so-called technology experts point their fingers at airlines’ IBM mainframes running z/TPF as part of the cause of their troubles. The problem is that none of these “experts” seem to have ever done any work on a mainframe and only have a passing understanding of z/TPF if they have any understanding of it at all.

      https://www.linkedin.com/pulse/mainframes-problem-solution-j...

      • rectang 2032 days ago
        This link is a gem. It is very revealing of the dynamics of arguments over software in the airline industry.

        > A number of financial institutions have discovered and acknowledged this fact and are now taking the first steps to address this situation by making the decision to go back to z/TPF, z/OS and UNIX because of their bulletproof reliability and transaction processing capabilities. This to the chagrin of the bulk of their application development staffs who are bolting out of these organizations for the more sexy technology environments. Yes, IBM Assembler, COBOL and other antique languages are making a resurgence at these brave organizations as they supplant and even replace Java, .NET, Python and other development environments.

        Back to the good old days of IBM Assembler?!

        Are they ditching source control and unit testing too?

        • snaky 2032 days ago
          > Back to the good old days of IBM Assembler?!

          High Level Assembler, to be correct. But kidding aside, slides 40-41 are interesting: http://www.ibm.com/software/htp/tpf/tpfug/tgf18/TPFUG_2018_M...

        • pc86 2032 days ago
          I used to work at a healthcare company where about 1/3 of the development staff worked in RPG and COBOL day-to-day. A consultant came in to pitch them on a COBOL source control system, and from what I heard they had to spend most of the time explaining to these 60+ year olds what source control was before pitching the management on a 7-figure implementation for what was basically timestamped folders with copies of code inside.

          When I left that company a few years after that, there was still no source control for the RPG or COBOL code.

          • Xixi 2032 days ago
            I have to chime in and display my ignorance: what makes RPG or COBOL so special that git or mercurial could not simply be used for source control?
          • _Codemonkeyism 2032 days ago
            I know a large organization which has a lot of code in an Oracle cluster. No version control, no tests, no development environment.
      • qaq 2032 days ago
        OK, not being a z/TPF expert: "the world's largest TPF-based systems are easily capable of processing tens of thousands of transactions per second, three billion per day. TPF is also designed for highly reliable, 24-7 operation." That is a totally reasonable workload for an RDBMS; it does not look like you specifically need z/TPF to handle the transaction volume.
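
For scale, the quoted "three billion per day" can be converted to an average rate with a few lines of arithmetic. This is a rough sketch: the helper names are made up, and the `peak_to_average` ratio is purely an assumption, since the quote gives no information about peak load.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_tps(transactions_per_day):
    """Average transactions per second, assuming load is spread evenly."""
    return transactions_per_day / SECONDS_PER_DAY


def peak_tps(transactions_per_day, peak_to_average=3.0):
    """Peak rate under an assumed (hypothetical) peak-to-average ratio."""
    return average_tps(transactions_per_day) * peak_to_average


print(round(average_tps(3_000_000_000)))  # about 34,722 per second
```

So three billion per day averages out to roughly 35 thousand transactions per second, consistent with the "tens of thousands" in the quote. Whether a given database can sustain that depends on what each transaction does, not just on the count.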
        • snaky 2032 days ago
          It depends on transactions obviously (among many other factors). MySQL would handle simple one-insert web-style ones pretty well. At the same time very simple (but slightly more realistic) read-write transactions barely got 2500 tps on PostgreSQL with Xeons and SSDs with as low as 32 clients and almost no contention at all (https://blog.pgaddict.com/posts/performance-since-postgresql...). And all transactions were the same, and there were no long-lived ones, no distributed ones, and no rollbacks, and so on.
          • qaq 2032 days ago
            • snaky 2032 days ago
              How exactly are TPC-C v5 synthetic benchmark transactions related to the Delta Airlines backend transactions? Yes, that's a somewhat toyish 'business application' sketch benchmark. The goal of any benchmark is to be the same for all participants so they can be compared, leaving aside the question of what exactly the results tell you about one particular system or another.
              • qaq 2032 days ago
                No company outside of exceptional circumstances will give you their actual data and workloads, so some generic data points are useful. It's also pretty obvious that the number of transactions Delta is handling is a few orders of magnitude less than, say, Walmart's. To me the point that a conventional RDBMS cannot handle the transactional workload of Delta sounds like total BS.
          • qaq 2032 days ago
            You do realise that it is a tiny box with a really crappy SSD, BTW? You can get an x86 box with 12TB RAM, 200+ cores, and RAID over a bunch of enterprise NVMe SSDs for 350K.
            • snaky 2032 days ago
              That's nice, and you can get 1,000,000 tps on an off-the-shelf POWER8 server if you wish (on simple select-only requests, after tuning and patching the PostgreSQL core by a couple of the best PostgreSQL hackers in the world - yes, using carefully crafted assembler, no less https://akorotkov.github.io/blog/2016/05/09/scalability-towa...).

              That wouldn't make that server a competitor for IBM Parallel Sysplex, or even for Oracle Exadata.

              • qaq 2032 days ago
                Oracle Exadata is not a mainframe which once more underscores the point of not having to use a mainframe.
      • the_duke 2032 days ago
        That post is a horrible mess of subjective misinformation by someone almost certainly involved in the mainframe business.
  • CaliforniaKarl 2033 days ago
    Suggest changing the title to "Delta Resumes U.S. Domestic Flights After Computer Glitches", as the headline has changed, since the issue has been resolved.
    • dang 2032 days ago
      We've changed the title from "Delta Grounds U.S. Domestic Flights to Fix Computer Glitches" to the article's current title.
  • senorsmile 2032 days ago
    I'm currently sitting in an airport in Salt Lake City. Was supposed to be in St. Louis, MO last night for the preconf day of Strange Loop. Instead I get to miss the first day because they didn't hold my connecting flight. And, my first flight didn't leave until almost 2 hours after we were supposed to, because of a computer glitch.
    • pc86 2032 days ago
      I thought it was standard practice not to hold connecting flights, especially for 2 hours? I certainly would not want to be stuck at my gate (and potentially miss my connections) waiting for another passenger for that plane. It seems pretty obvious that the least impactful route is that planes take off when they are ready and do not wait around.
      • gav 2032 days ago
        Almost no flights are held back for connections.

        There's probably the odd flight that is held back if there's a large number of passengers that will miss it due to a late connection--and there are no later flights with seats available. I would imagine the cost of holding the 11pm LAX-SYD flight because 20 people are 10 minutes late on the NYC-LAX connection is a lot less than paying for hotels, especially when the next flight may not have 20 seats available anyway. Though then again, most carriers have gate pressure at LAX and they might not be able to hold the flight anyway.

        It's an interesting optimization problem to solve!
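
A toy version of that trade-off, just to make the comparison concrete. Everything here is hypothetical: the function name `should_hold`, the per-minute delay cost, and the per-passenger hotel cost are invented placeholders, and a real model would also account for crew duty limits, downstream connections, and seat availability on later flights.

```python
def should_hold(late_passengers, delay_minutes, *, hotel_cost=150.0,
                delay_cost_per_minute=100.0, gate_available=True):
    """Return True if holding the departure looks cheaper than leaving.

    All cost figures are made-up defaults, not airline data.
    """
    if not gate_available:
        # Gate pressure: the flight cannot stay at the gate regardless of cost.
        return False
    cost_of_holding = delay_minutes * delay_cost_per_minute
    cost_of_leaving = late_passengers * hotel_cost  # overnight stays
    return cost_of_holding < cost_of_leaving
```

With these placeholder numbers, 20 passengers who are 10 minutes late justify a hold (1,000 against 3,000), while one passenger who is two hours late does not.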

      • senorsmile 2032 days ago
        From what I understand, all flights across the nation were equally impacted. My connecting flight was delayed, but ended up leaving right before my flight landed.

        Interestingly, there was a very, very long line at the ticketing counter of Delta this morning. I asked a few passersby if they were put up at a hotel like me. Every person I asked was in the same situation.

      • dylan604 2032 days ago
        It depends on what the destination is. I sat on a plane for over an hour at the gate waiting for a delayed flight with passengers connecting to my flight. This flight was the final leg for everyone, and the pilot was going to be able to "make it up in the air". So, I'm guessing they have ways to decide if it is worth delaying a flight to wait for another delayed connecting flight.