Challenges with distributed systems

(aws.amazon.com)

188 points | by fagnerbrack 1526 days ago

9 comments

  • yamrzou 1525 days ago
    Interesting read.

    Relatedly, FoundationDB has a distributed testing framework called “Simulation”, which can simulate distributed failures on a single machine (thread!). Quoting from https://pierrezemb.fr/posts/notes-about-foundationdb/ :

    > We wanted FoundationDB to survive failures of machines, networks, disks, clocks, racks, data centers, file systems, etc., so we created a simulation framework closely tied to Flow. By replacing physical interfaces with shims, replacing the main epoll-based run loop with a time-based simulation, and running multiple logical processes as concurrent Flow Actors, Simulation is able to conduct a deterministic simulation of an entire FoundationDB cluster within a single-thread! Even better, we are able to execute this simulation in a deterministic way, enabling us to reproduce problems and add instrumentation ex post facto. This incredible capability enabled us to build FoundationDB exclusively in simulation for the first 18 months and ensure exceptional fault tolerance long before it sent its first real network packet. For a database with as strong a contract as the FoundationDB, testing is crucial, and over the years we have run the equivalent of a trillion CPU-hours of simulated stress testing.

    • epdlxjmonad 1525 days ago
      Testing a distributed system on a single machine may look like an unorthodox approach. In our experience, however, anyone building a test framework for a distributed system is naturally led toward a single machine, for the many benefits it brings, especially the time it saves. It would be very inefficient to develop a distributed system using only system tests running on a cluster of machines, without also exploiting integration tests.

      Nevertheless using just a single thread to simulate everything seems like a great approach.

    • jffhn 1525 days ago
      >simulation of an entire FoundationDB cluster within a single-thread!

      Reminds me of an InfoQ post about simulating robot swarms in a single thread: https://www.infoq.com/articles/java-robot-swarms/ (links to Java libraries for that in the comments)

      Being able to run everything in a single thread allows you to:

      1) Run a lot of tests quickly (with virtual time flowing as fast as possible), or arbitrarily slowly if you prefer slow motion in some contexts.

      2) Have determinism. Great for debugging: you can reproduce bugs at will, even with full logging on (one trick is to enable it only shortly before the bug occurs, when the bug happens long after start time).

      3) More easily figure out whether a bug is in domain code or is a threading issue.

      4) Use single-threaded execution as a benchmark for domain code, and see how much multi-threading/distribution speeds it up, or whether it actually slows it down.

      One constraint is that it rules out some programming styles, since the code must never use waiting constructs, like futures (if the single thread starts to wait, it will wait forever since nothing happens outside of it).
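      To make the idea concrete, here is a minimal sketch of such a deterministic virtual-time scheduler (hypothetical names, Python used purely for illustration):

```python
import heapq

class VirtualScheduler:
    """Single-threaded event loop driven by a virtual clock: time never
    flows on its own, it jumps to the next scheduled event, so a run is
    fully deterministic and as fast (or as slow) as you drive it."""

    def __init__(self):
        self.now = 0.0
        self._queue = []  # (time, seq, callback) min-heap
        self._seq = 0     # tie-breaker keeps same-time ordering deterministic

    def call_later(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._queue:
            when, _, callback = heapq.heappop(self._queue)
            self.now = when  # jump the clock; no real sleeping involved
            callback()

# Two simulated "processes" exchanging a message in virtual time.
log = []
sched = VirtualScheduler()
sched.call_later(5.0, lambda: log.append(("send", sched.now)))
sched.call_later(7.5, lambda: log.append(("recv", sched.now)))
sched.run()
# log is [("send", 5.0), ("recv", 7.5)] on every run, at CPU speed
```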

      • sitkack 1525 days ago
        This is an excellent example of making time a first-class input to the evolution of a system. I believe this is fairly common in CEP [1] systems. Similar techniques can be used to handle randomness. I think that any time one writes code tied to the wall clock, a mistake has been made.

        [1] https://en.wikipedia.org/wiki/Complex_event_processing

      • zzzcpan 1525 days ago
        > One constraint is that it rules out some programming styles, since the code must never use waiting constructs

        The constraint here is for testing/simulation to be able to supply their own implementation of waiting constructs, not that waiting constructs cannot be used, of course they can.

        • jffhn 1525 days ago
          I meant it rules them out for the domain code that you want to be able to run in a single thread, which should be by far most of the code (unless you ensure that whatever your code wants to wait for has necessarily already happened by then, but that seems brittle to me).

          Inside of the technical layers you use to run it in multiple threads, there are of course wait/notify mechanisms (or similar).

          Maybe you thought about wait implementations that would not wait but that would "help", and go on with other computations while the condition is not yet met? If not then I would like you to expand on what you mean, ideally with a few lines of code as a sample to make it clear.

          • zzzcpan 1524 days ago
            I mean the wait implementation doesn't have to actually wait; it just has to register an event handler and store some context to continue with. In the simplest case, if we assume there is nothing else to process, the condition will be met right in the next iteration, which will call back that event handler and continue from that point, all without any waiting.
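            A minimal sketch of that idea, with hypothetical names (the "wait" registers a continuation instead of blocking):

```python
class Condition:
    """A 'wait' that never blocks: callers register a continuation,
    and it is called back once the condition is met."""

    def __init__(self):
        self._met = False
        self._waiters = []

    def wait(self, continuation):
        if self._met:
            continuation()                      # already satisfied: continue now
        else:
            self._waiters.append(continuation)  # park the context, return at once

    def fire(self):
        self._met = True
        waiters, self._waiters = self._waiters, []
        for continuation in waiters:
            continuation()

# "Waiting" code is written as a continuation instead of a blocking call.
steps = []
cond = Condition()
steps.append("before")
cond.wait(lambda: steps.append("after"))  # registers and returns immediately
steps.append("doing other work")
cond.fire()                               # condition met: continuation runs
# steps == ["before", "doing other work", "after"]
```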
            • jffhn 1524 days ago
              Ok, I was talking about actually waiting (as in "while(!condition)yield();"), with more code to execute at the _same_ (*) virtual time after the wait, not about arranging for some code to execute _later_, which is indeed a proper approach in our case, and what "waiting" could mean in some informal specification.

              (*) When doing deterministic virtual-time scheduling, computations are scheduled to take place at given times and are processed exactly at the time they are supposed to be. The time can only advance once everything that had to be computed at the current time has been computed (in "as fast as possible" mode, the scheduler then just jumps the clock to the next time something is scheduled to happen). If you want to read more about that, see "time advance request" and "time advance grant" in the HLA standard.

    • drej 1524 days ago
      There is a great talk about this testing, it goes all the way to deterministic seeds and other basics. The level of determinism is admirable. https://youtu.be/4fFDFbi3toc
  • yamrzou 1525 days ago
    > It’s almost impossible for a human to figure out how to handle UNKNOWN correctly. What does UNKNOWN really mean? Should the code retry? If so, how many times? How long should it wait between retries? It gets even worse when code has side-effects. Inside of a budgeting application running on a single machine, withdrawing money from an account is easy, as shown in the following example.

    > Figuring out how to handle the UNKNOWN error type is one reason why, in distributed engineering, things are not always as they seem.

    How are such UNKNOWN errors handled in practice? The article doesn’t talk much about it.

    • finaliteration 1525 days ago
      In my experience designing systems like these: there are no hard and fast rules, and you handle them in a way that's been decided on a case-by-case basis for each system. Some messages may need to keep retrying with an exponential backoff indefinitely. Others may need to retry only once and then send an email to someone, because two failures in five minutes is a critical failure. You also have to design things so that if one failure occurs, maybe it stops the entire process, or maybe other messages can still go through and you just log the failure.

      It all comes down to the rules of your business and how critical these systems are. Maybe unknown means “failure” or maybe unknown means “someone should get an alert about this and check it out”.

      I think it’s hard because there is no highly visible “crash” that occurs like in a non-distributed system when an unexpected exception occurs and the entire program shuts down. Failures often happen silently and it’s difficult to tell where or why something failed. So you have to design each system with that in mind and figure out how each piece needs to deal with uncertainty.
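      As a sketch of what such a case-by-case policy can look like (hypothetical names, illustrative only: capped exponential backoff with jitter, plus an alert after a fixed number of failures):

```python
import random

def backoff_delays(base=0.5, cap=60.0, factor=2.0):
    """Exponential backoff delays with an upper cap and full jitter."""
    delay = base
    while True:
        yield random.uniform(0, min(cap, delay))
        delay *= factor

def run_with_policy(operation, delays, max_attempts, alert):
    """Per-system policy: retry up to max_attempts, then alert a human."""
    for attempt, delay in enumerate(delays, start=1):
        try:
            return operation()
        except Exception as exc:
            if attempt >= max_attempts:
                alert(exc)  # e.g. send an email, page the on-call
                raise
            # real code would sleep(delay) here; omitted to keep this testable

# A flaky operation that fails twice, then succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("no reply")
    return "ok"

alerts = []
result = run_with_policy(flaky, backoff_delays(), max_attempts=5,
                         alert=alerts.append)
# result == "ok" after 3 attempts, and no alert was raised
```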

      • stingraycharles 1525 days ago
        For what it’s worth, in my experience it’s very effective to add this information to the error context: is it a permanent failure (e.g. validation) or a retryable error? If it is retryable, also add to the error context when it should be retried.

        This allows you to handle these errors appropriately without having to treat each one on a case-by-case basis.
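        A minimal sketch of what carrying those hints in the error context could look like (hypothetical names, illustrative only):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServiceError(Exception):
    """Error carrying its own handling hints in its context."""
    message: str
    retryable: bool                        # permanent (e.g. validation) vs transient
    retry_after_s: Optional[float] = None  # hint: when a retry is worth attempting

def handle(error, schedule_retry, dead_letter):
    """One generic handler instead of case-by-case logic at every call site."""
    if error.retryable:
        schedule_retry(error.retry_after_s if error.retry_after_s is not None else 1.0)
    else:
        dead_letter(error)  # permanent failures go to a human or a queue

retries, dead = [], []
handle(ServiceError("throttled", retryable=True, retry_after_s=5.0),
       retries.append, dead.append)
handle(ServiceError("invalid request", retryable=False),
       retries.append, dead.append)
# retries == [5.0]; the validation failure landed in the dead-letter list
```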

        • clarry 1525 days ago
          > it’s very effective to add this information to the error context: is it a permanent failure (eg validation) or a retryable error

          The discussion was specifically about UNKNOWN errors, i.e. you sent a message but never got a reply back. You don't know whether it was a validation failure or temporary hiccup. For all you know, it's possible the message was received and processed correctly but the response never made it back.

          How to handle these unknowns is always going to be case by case. Some combination of retry and give up works for most cases, but there is no silver bullet and usually you have to think hard about the consequences of 1) retrying 2) giving up thinking the request failed even though it actually (silently) succeeded.

    • kccqzy 1525 days ago
      There's not really an actual UNKNOWN. As the article illustrated, there are distinct steps in the client-side processing, so we do know with certainty which part of the client code emitted the error. We might not know anything about the remote server, but the point is that the error is known on the client side. It could be a failure to create the socket, a failure to send the message, a failure to receive any reply within the stipulated deadline, the remote server returning an error, etc.

      In practice the easiest thing is simply to propagate the error onwards, without affecting other independent requests (failure domains), and let something intelligent handle the error. It could very well be a human sitting at a computer seeing an internal error message, who can then decide to retry or not.

      Also, oftentimes it's acceptable to just log the error and carry on. I don't know of anything that can promise a 100% error-free SLA. With careful engineering, achieving even a 99.999% success rate is possible, but I don't think anyone would actually promise 100%.

      • azylman 1525 days ago
        > failure to receive back any reply within the stipulated deadline

        If you get this error back, the client doesn't know if the server actually processed it or not, so knowing where the client failed isn't actually useful for knowing the state of the request and what needs to happen next.

        To handle something like this, you need a resilient design around client-server communications (e.g. assuming retries on the client side and idempotent behavior on the server side).

        Immediately erroring out on the client is usually going to lead to a poor user experience and might lead to inconsistent behavior.
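        A minimal sketch of that pattern (hypothetical names; a real server would persist the keys, and a real client would retry only on timeout):

```python
import uuid

# Server side: remember each idempotency key and replay the cached result,
# so a retried request is processed at most once.
_processed = {}

def server_withdraw(key, account, amount, balances):
    if key in _processed:            # duplicate delivery: replay, don't re-run
        return _processed[key]
    balances[account] -= amount
    result = {"ok": True, "balance": balances[account]}
    _processed[key] = result
    return result

# Client side: generate the key once, reuse the SAME key on every retry.
def client_withdraw(account, amount, balances, attempts=3):
    key = str(uuid.uuid4())
    result = None
    for _ in range(attempts):        # here we resend unconditionally to show safety
        result = server_withdraw(key, account, amount, balances)
    return result

balances = {"acct": 100}
client_withdraw("acct", 30, balances)
# balances["acct"] == 70: three deliveries, but only one withdrawal
```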

        • kccqzy 1525 days ago
          > so knowing where the client failed isn't actually useful for knowing the state of the request and what needs to happen next.

          Correct. Which is why the client should just error out and stop processing and return the error to the user, who will have more context and knows whether or not a retry is necessary or desirable.

          My argument is that your "resilient design around client-server communications" isn't necessary in the majority of cases, and is often unwarranted over-engineering. A poor user experience is fine if it doesn't happen very often and goes away upon a retry. Even banks do that. It's fine. No one will be offended if your app shows an internal error message once a month (a five-minute outage in a month is still more than 99.9% availability).

          • azylman 1525 days ago
            > the client should just error out and stop processing and return the error to the user, who will have more context and knows whether or not a retry is necessary or desirable... Even banks do that. It's fine

            If a user at a bank tries to transfer $10,000, gets an internal server error and retries because their balance hasn't updated (banks don't process these in realtime), and checks the next day and finds $20,000 gone, that's a big problem. The user can't be responsible for this, you need something more.

            You're not wrong that this isn't necessary in the majority of cases (updating your email address?) - but the majority of cases don't need distributed systems. By the time you're talking about distributed systems (which is the focus of this article and discussion), you absolutely do need this, in the vast majority of cases.

            • foota 1525 days ago
              Eh, in that case you're asking for trouble by requiring the server to be successful. If I had to write an API like that I would definitely go with randomized tokens per logical request for dedupe on retries.

              If it's a user facing transaction it should be a transaction resource that's created and then the server does the retries imo, with the user able to view the status.

              • azylman 1525 days ago
                Exactly, given a timeout you can't assume the server is clearly successful (or unsuccessful!). That was my point, we're definitely agreeing :) Usually you'd generate some kind of idempotency key (randomized token as you said) and retry with that.

                You definitely can't just bubble that up to the user and assume things are fine

            • ubu7737 1525 days ago
              Avoiding duplicate processing is a very big part of compliance in banking transactions. Making mistakes in this area is somewhat common but very punitive for the institution that makes the mistake. As a result, design priorities are heavily skewed toward exactly-once processing of financial transactions.
          • ubu7737 1525 days ago
            This is correct.

            In an authorization with a user waiting at the POS, you fail-fast and let the user retry.

            During the clearing/settlement phases, you don't have a waiting user. You're settling accounts and moving money. Nothing is allowed to fail silently.

  • phtrivier 1525 days ago
    Is there an article (in this "Amazon Builder Library" or elsewhere) that offers practical advice about handling such problems? Or is there nothing else but "do it on a case-by-case basis"?
  • tjchear 1524 days ago
    I wonder what happens if we decide that 'fate sharing' is OK, i.e. the client request flow (not the entire process) fails when a network step fails. How long can a system stay operational after it starts before the whole thing grinds to a halt? Seconds? Hours? Days? Would there be a huge ding or a small ding on the SLIs?
  • tyingq 1525 days ago
    Designing your software to work interchangeably with Unix domain sockets and regular sockets can also make initial testing much easier. You then have a no-hassle way to simulate lots of IP/port pairs via file names, without port clashes or having to create virtual network interfaces.
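    A minimal sketch of such an interchangeable listener (Python, assuming a POSIX system for AF_UNIX; names are illustrative):

```python
import os
import socket
import tempfile

def listen_on(address):
    """Accept either a TCP (host, port) tuple or a filesystem path, so a
    test can stand up many endpoints as files, with no port clashes."""
    family = socket.AF_INET if isinstance(address, tuple) else socket.AF_UNIX
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.bind(address)
    sock.listen()
    return sock

# The same code path serves production TCP and test-time Unix domain sockets.
tcp = listen_on(("127.0.0.1", 0))  # port 0: let the OS pick a free port
uds = listen_on(os.path.join(tempfile.mkdtemp(), "node1.sock"))
tcp.close()
uds.close()
```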
  • AkshatM 1525 days ago
    I wonder if there are any advantages to using formal methods over thread-local simulation in this case.

    Eighteen months is a long time of development, and there are at least some benefits to catching bugs in the design phase before pursuing implementation.

  • fyp 1525 days ago
    Did anyone else find their "eight failure modes of the apocalypse" incredibly arbitrary, given the important-sounding name?
  • ex_amazon_sde 1525 days ago
    meh
  • marknadal 1525 days ago
    Hmm, how I handle UNKNOWN in our distributed system (10M+ monthly users running on $0-cost infrastructure, https://github.com/amark/gun ) is to say that the originator is responsible for retries until satisfactorily ACKed.

    This means all other nodes in the network (routers, relays, security, storage, etc.) can safely error or not recurse or not retry, without any loss in redundancy guarantees.

    Either the operation did succeed silently, and it is OK for originator to try again (idempotency keeps this safe), or things never finished processing, and it is OK to try again despite being out of order (CRDTs keep this safe).
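    A minimal sketch of that originator-retries-until-ACK rule (hypothetical names, not gun's actual API):

```python
def originator_send(message, transport, max_attempts=10):
    """The originator resends the same message (same id, so retries are
    idempotent) until some node ACKs it; every other node may freely drop."""
    for attempt in range(1, max_attempts + 1):
        ack = transport(message)  # None models a silent loss anywhere en route
        if ack:
            return attempt        # satisfactorily ACKed; stop retrying
    raise TimeoutError("no ACK after %d attempts" % max_attempts)

# A lossy network that drops the first two deliveries.
seen = []
def lossy_transport(message):
    seen.append(message["id"])
    return len(seen) >= 3         # only the third delivery is ACKed

attempts_needed = originator_send({"id": "msg-1", "op": "put"}, lossy_transport)
# attempts_needed == 3, and every retry carried the same message id
```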