How Antithesis finds bugs

(antithesis.com)

160 points | by wwilson 12 days ago

12 comments

  • infogulch 12 days ago
    Have you tried to 'rediscover' classic (in)famous bugs? E.g. take an old version of OpenSSL vulnerable to Heartbleed and run Antithesis on it to 'discover' Heartbleed via fuzzing. It would be interesting to see how much fine-tuning would be needed to discover it.
    • nlavezzo 12 days ago
      This is a cool idea. I’ll ask the team about it. Would make for a very interesting blog post or talk!
    • klysm 12 days ago
      Excellent idea and it would be a great way to pitch it. What about collaborations with Jepsen?
    • eichin 12 days ago
      Heh, yeah, "being able to predict the present" is an important benchmarking technique
  • mrkmarron 12 days ago
    FYI, playing Super Mario with fuzzing (AFL) was done in a fun 2020 S&P paper. It also finds bugs and security issues.

    "IJON: Exploring Deep State Spaces via Fuzzing" https://casa.rub.de/fileadmin/img/Publikationen_PDFs/2020_IJ...

    • chc4 12 days ago
      A lot of fuzzers use Mario or other simple games as an internal testcase. I'm aware of a hypervisor fuzzer from 2016 that did it, and I'm positive there are others (both before and since). Hell, tom7 built a fuzzer for exploring program states that uses Super Mario Bros as its example back in 2013 (https://www.cs.cmu.edu/~tom7/mario/mario.pdf, plus a youtube video https://youtu.be/xOCurBYI_gY), and he's definitely not the first either.
      • wwilson 12 days ago
        We are huge fans of tom7 and that paper was one of our inspirations for using NES as a domain for researching autonomous state space search! I think he does a very good job of explaining why the problem is hard.
        • chc4 12 days ago
          Right, in case it wasn't clear to readers this isn't a bad thing. Lots of people use games because they're good analogs for other programs and evocatively show fuzzing exploration progress. Not being the first to point a fuzzer at Mario doesn't matter.
      • mrkmarron 12 days ago
        Thanks for sharing, I felt like there were earlier examples, but the x,y trick jumped out at me and that was the one I remembered off the top of my head.
    • wwilson 12 days ago
      Thanks for flagging this! The work we're announcing today was completed in 2018, and we have since moved on to much more challenging problems both in the Nintendo domain and elsewhere. Totally not looking to pick a fight over priority though. This is such a hilariously understudied and under-explored area, we really value anybody else who's trying to work on these problems.
      • mrkmarron 12 days ago
        I think you underestimate the level to which this area has been studied. And I wish you would talk about those new results instead of announcing 5+ year old results.

        It would be great to see progress in this area (not my primary area of work BTW) but I am not seeing anything here, technically, that is going to make that happen -- maybe it is just getting all the parts in place and magic happens. It just makes me scratch my head a bit.

        • wwilson 12 days ago
          It's possible you did not make it to the end of the talk where I explain this, but the thing that excites me is that we can now apply fuzzing and related techniques to things which are neither Nintendo games nor tiny stateless libraries and parsers, because of this: https://antithesis.com/blog/deterministic_hypervisor/

          As for getting to the newer stuff, yeah, totally, just give us some time. There's a bit of a backlog. :-)

          • mrkmarron 12 days ago
            I just rewatched the end of the video to make sure I didn't miss anything. Deterministic execution and replay is very, very well known and understood. It is possible that your packaging and market fit is right on. There's a lot of cottage industry in DB testing and bug finding -- but it's not clear how this generalizes, or why something like Coyote [1] (to pick one) wouldn't work as well.

            So, fuzzing has been applied to very stateful and very large industrial systems for some time. And yes, it is very cool, but I feel like I am seeing more sizzle than steak, so to speak. Great engineering though, hypervisor work is very challenging.

            [1] https://www.microsoft.com/en-us/research/blog/coyote-making-...

            • wwilson 12 days ago
              It is absolutely possible to write a large stateful system from the ground up so that autonomous testing techniques can be applied to it. FoundationDB and TigerBeetle are both examples of this, I think Resonate might be another one, and Toby Bell's talk at Strange Loop last year is a great guide on how to do so.

              What's much harder is to take an arbitrary system, written in an arbitrary way, without these techniques in mind, and make it amenable to this sort of testing. From the start of our company, we believed that unless this was possible, the market would be too hard to crack, because most human beings are not very foresightful and not able to justify a bunch of extra work.

              Hypervisor-based snapshot fuzzing like Nyx-Net and deterministic userspaces like Facebook's now-discontinued Hermit project are the other ways I know of accomplishing that goal. We believe that both of them have some pretty serious practical limitations which our approach does not share.

              EDIT: Maybe the way to get to the crux of the disagreement is for me to turn the question around. Why do you believe that the vast majority of stateful and concurrent systems are not tested with fuzzing?

              • mrkmarron 12 days ago
                At the end of the day you have 2 problems: (1) how to make execution deterministic within some boundary, be it a process, hypervisor, or distributed system, and (2) how to handle non-determinism when data crosses this boundary. You can move the boundaries and effort around but the problems always exist. So, if you are claiming that you have a sweet spot on this tradeoff then I could certainly believe that; if you claim that you eliminated this boundary issue then I am highly incredulous.

                I'll agree with you on nondeterministic behaviors though, I suspect they will eventually be seen like the "billion dollar mistake" of null pointers.

              • mrkmarron 12 days ago
                Just saw the edit. I have 2 answers:

                1) Fuzzing is under-utilized even for simple code. AFL is dead easy to use and, even so, most projects don't have it in CI runs. So, despite how much I like it, in general it seems people do not see value in this type of testing.

                2) External state (say, a RESTful call to get stock ticker info) needs to be mocked -- which is deeply unpopular -- or handled by record/replay, which works OK-ish but eventually breaks down with divergences. Outside of well-chosen domains these divergences eventually pop up and add an additional pain point on top of item 1.
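
                A toy sketch of that record/replay failure mode (the class and names here are mine, purely illustrative):

                  class ReplayShim:
                      # Toy record/replay wrapper around an external call (say,
                      # the stock ticker lookup above). On record it logs
                      # request/response pairs; on replay it serves them back in
                      # order and fails loudly on divergence: the breakdown
                      # described in item 2.
                      def __init__(self, fetch, replay_log=None):
                          self.fetch = fetch
                          self.recording = replay_log is None
                          self.log = [] if self.recording else list(replay_log)

                      def __call__(self, request):
                          if self.recording:
                              response = self.fetch(request)
                              self.log.append((request, response))
                              return response
                          recorded_req, recorded_resp = self.log.pop(0)
                          if recorded_req != request:  # divergence: replay breaks
                              raise RuntimeError(f"divergence: recorded {recorded_req!r}, got {request!r}")
                          return recorded_resp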

            • amw-zero 12 days ago
              Deterministic execution might be well understood by its proponents, but it's a completely niche technique that practically no one uses. You have this jaded tone, as if this is something everyone is doing, and everyone knows that isn't true, so we're curious... why are you writing these things?
    • infogulch 12 days ago
      Have any of these methods found clips and speedrunning shortcuts? Examples: Clip the base of the flagpole to skip some animation time. Clip into and walk below the floor to run past obstacles. Etc.
      • wwilson 12 days ago
        We find a ton of glitches in Mario, and even more in other games. See, e.g.: https://vimeo.com/807601164
        • wwilson 12 days ago
          Hmm... looks like Vimeo is currently having some kind of outage. If anybody at Vimeo is reading this and wants some help with testing backend systems, email is in bio.
  • wwilson 12 days ago
    This is Will (I gave the talk linked in the post). Happy to answer any questions about this work, or how it generalizes to testing things that aren't Nintendo games.
    • cperciva 12 days ago
      Why is Mario so jumpy?
      • wwilson 12 days ago
        This is actually an incredibly deep and difficult question to answer. I would expect no less from cperciva. :-)

        As I mention in the talk, you get very bad tactical performance from taking a uniform random distribution and piping it into the emulator. The fuzzer is exponentially unlikely to hold the jump button for many successive frames without a break. In the fully general case, I think instead of maximum entropy, you want something more like Marcus Hutter's AIXI where you spend energy on inputs inversely proportional to their Kolmogorov complexity. Unfortunately, that's uncomputable, but it turns out that just switching to toggling bits with low probability does a lot better than pure randomness. The approach is analogous to swarm testing (https://users.cs.utah.edu/~regehr/papers/swarm12.pdf).
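
        For illustration, the difference between the two tactics looks something like this (a minimal sketch, not our production input generator; button names and the toggle probability are illustrative):

          import random

          BUTTONS = ["A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"]

          def uniform_inputs(n_frames):
              # Re-sample every button each frame: P(A held for k straight
              # frames) = 2**-k, so long jumps are exponentially unlikely.
              return [{b: random.random() < 0.5 for b in BUTTONS}
                      for _ in range(n_frames)]

          def toggling_inputs(n_frames, p_toggle=0.05):
              # Keep the previous frame's state and flip each button with low
              # probability: a held button now persists for ~1/p_toggle frames
              # on average (geometric run lengths), without losing randomness.
              state = {b: False for b in BUTTONS}
              frames = []
              for _ in range(n_frames):
                  state = {b: (not held) if random.random() < p_toggle else held
                           for b, held in state.items()}
                  frames.append(dict(state))
              return frames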

        All of which is to say, the result that we show here is vastly less jumpy than our first tries. The reason it's still more jumpy than a human player is that our platform has no idea where it is in the game, or even that it's playing a game. So if a jump doesn't harm it in the exploration process, there's some chance the first input getting somewhere new will involve a jump, and that will then get locked in.

        We do have the capability to do optimization on inputs (what conventional PBT calls "shrinking"), and indeed if you apply this to Mario you can get it to jump a lot less and complete levels a lot faster. That capability didn't exist yet when this video was recorded. We should totally do another post on this topic!

        • infogulch 12 days ago
          To reframe the problem with purely random "tactics": some input frequencies are an inefficient way to reach desired states of the system. That is, it is useful to consider the correlation between random inputs across time.

          This reminds me a lot of the problem space involving Perlin Noise. I'd hypothesize that by using a fractal noise pattern you can generate inputs with correlated values yet still remain random enough to get to any state. Have you tried using noise generators for random inputs?
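
          A quick sketch of what I mean (value noise as a simple stand-in for true Perlin noise; all parameters illustrative):

            import math, random

            def fractal_noise(n_frames, octaves=4, base_period=64):
                # Sum octaves of smoothly interpolated random lattice values,
                # normalized back into [0, 1]: correlated over short spans,
                # but still able to wander anywhere over long ones.
                out = [0.0] * n_frames
                amp, period, total = 1.0, base_period, 0.0
                for _ in range(octaves):
                    lattice = [random.random() for _ in range(n_frames // period + 2)]
                    for i in range(n_frames):
                        t = i / period
                        lo, frac = int(t), t % 1.0
                        s = (1 - math.cos(frac * math.pi)) / 2  # smooth blend
                        out[i] += amp * ((1 - s) * lattice[lo] + s * lattice[lo + 1])
                    total += amp
                    amp, period = amp / 2, max(1, period // 2)
                return [v / total for v in out]

            def button_stream(n_frames, threshold=0.5):
                # Threshold the noise into correlated hold/release runs for one button.
                return [v > threshold for v in fractal_noise(n_frames)]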

        • someplaceguy 12 days ago
          > Unfortunately, that's uncomputable

          Minor nitpick, but while Kolmogorov complexity as typically defined is uncomputable, I would argue that this result is only a theoretical curiosity and mostly irrelevant.

          That is, the "uncomputable" Kolmogorov complexity computation presupposes that you have a Turing machine, i.e. a machine with literally infinite memory, which is not possible to construct in our universe. Or alternatively, it presupposes that "computable function" is one that can be computed by a machine with an infinite amount of storage, which amounts to the same thing as having a Turing machine.

          You could probably define some version of Kolmogorov complexity that is parametrized by the memory size (e.g. of a linear bounded automaton or similar model that better represents a computer with finite resources), which should make it computable. That said, in practice it would probably take an unreasonable amount of time to perform this computation (but that is orthogonal to whether it's computable or not).
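
          Concretely (my own ad-hoc notation, not a standard definition), such a parametrized variant might look like:

            K_m(x) = \min \{\, |p| \;:\; U(p) = x \text{ using at most } m \text{ memory cells} \,\}

          Since a machine restricted to m cells has only finitely many configurations, each candidate program can be run until it halts or provably cycles, so the minimum is computable by brute force.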

          • anyfoo 12 days ago
            While reading your comment, I kept having the vague feeling that your idea of "theoretically computable" (within the given memory bounds) would still leave some exponential time and/or space complexity, and so would effectively still be "practically not computable".

            Your last paragraph then seemed to confirm that, i.e. there was no particular shortcut for this specific case that would make it any different from general Kolmogorov complexity.

            In that sense, isn't your comment itself also "only a theoretical curiosity"? As we went from "uncomputable", to "theoretically computable under the given constraints", to "practically uncomputable"?

            Though it suffers from a lack of rigor, I think a lot of the time--probably even the majority of the time outside purely CS-theoretical treatments--when we colloquially speak of "uncomputable", we are really talking about practical computers without an infinite Turing tape, and so actually mean "practically uncomputable".

            Because, yes, while everyone immediately understands that actual computers don't have infinite memory, at the same time everyone understands that "exponential time" is still "never" in practical terms.

            • someplaceguy 12 days ago
              I would somewhat agree with your comment, but I think you're missing some important points:

              > everyone understands that "exponential time" is still "never" in practical terms.

              It's important to note that this is not necessarily true:

              1. "exponential time" is somewhat ambiguous. An algorithm might be exponential time yet have a very low exponential base (e.g. 1.0001), so in practical terms it might be practically computable for reasonable sizes.

              2. Even if the exponential base is high, it still doesn't say anything about whether an algorithm can be used practically or not. The algorithm might still be very efficient for reasonable problem sizes even though it has a high exponential base (important question: exponential in terms of what, exactly?).

              3. Exponential time algorithms might only be so in the worst case but might not necessarily be exponential in the average case, or even the vast majority of interesting cases. As an example, it might be easy (or at least doable) to solve the halting problem for normal computer programs. Humans do this all the time with real-world programs when performing formal verification (as these programming languages, logics and tools force you to prove that loops and recursive functions always terminate, even when assuming a model equivalent to a Turing machine), and AFAIK there's no proof that computers can't efficiently do the same for the vast majority of real-world programs (cryptographic algorithms being the usual exception).

              4. Whenever someone mentions that the halting problem and Kolmogorov complexity are uncomputable, the discussion ends there. But notice that when I pointed out that it's in fact computable, the discussion turned into one about how efficient the computation might be (which I argue, is how all such discussions should be).

              5. As a side note, every single time I argued this point in the past, usually in the context of the halting problem, someone always argued that such an algorithm would necessarily have a time complexity of 2^N, where N = the number of bits of state in the machine. This is not true. It would only be true for the simplest and most naive solution to the halting problem, which is inevitably what that person has in mind. In fact, there's already a family of algorithms that solve the halting problem with less complexity for almost all programs: "Floyd's tortoise and hare" and similar ones (see [1] and the sketch below). Note that these algorithms don't even inspect the program, they just run it step by step. This leads me to think that there are undiscovered algorithms that are far more efficient by virtue of exploiting knowledge about the program being analyzed.

              [1] https://en.wikipedia.org/wiki/Cycle_detection
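
              To make that concrete, here is Floyd's algorithm as a halting check for a deterministic, memory-bounded program (a sketch; `step` and `halted` stand in for a concrete machine model):

                def halts(start, step, halted):
                    # A deterministic finite-state program either halts or
                    # revisits a state, entering a cycle. Run it step by step
                    # (without inspecting it) with two cursors, one moving
                    # twice as fast; if they meet, the program never halts.
                    tortoise = hare = start
                    while True:
                        if halted(hare):
                            return True
                        hare = step(hare)
                        if halted(hare):
                            return True
                        hare = step(hare)
                        tortoise = step(tortoise)
                        if tortoise == hare:  # cycle detected
                            return False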

              • anyfoo 12 days ago
                Oh but that was specifically what I was trying to get at, i.e. point #4 of this reply of yours: Are there any shortcuts that make this reasonably computable or not?

                In your original reply, you started with (paraphrasing) "Kolmogorov complexity is not technically uncomputable in the practical case of not having an infinite tape", which gave me hope that we would start talking about how there are some reasonable shortcuts in this particular case (as you mention in this answer now).

                But then you ended with (literally) "That said, in practice it would probably take an unreasonable amount of time to perform this computation (but that is orthogonal to whether it's computable or not)", which squashed my hopes, seemed to just have replaced "theoretically uncomputable" with "practically uncomputable", and made me wonder how it changed anything that OP already wrote in practical terms, namely: "Unfortunately, that's uncomputable"

                But now it seems we're back to (potentially) discussing how in this particular use case there might be tractable ways to (limited, but useful) computability, which is good again!

                • someplaceguy 12 days ago
                  I should note that I'm not a computer scientist and I'm currently a bit sleep-deprived, but to continue discussing what you're interested in:

                  I tend to think that an efficient computation of some finite-state version of Kolmogorov complexity would necessarily require an efficient computation of a finite-state version of the halting problem, but I'm not entirely sure of this.

                  Naively, it seems that this Kolmogorov calculation would require enumerating all programs (in increasing program size) and then running each of them until they either 1) enter an infinite loop, 2) produce the input string and halt, or 3) start producing a different string.
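
                  Sketched in code, with a hypothetical step-bounded interpreter `run` standing in for a concrete machine (the step bound substitutes for proper cycle detection):

                    from itertools import count, product

                    def kolmogorov(target, run, alphabet=(0, 1), max_steps=100_000):
                        # Enumerate programs in increasing length and return the
                        # length of the first whose bounded execution produces
                        # `target`. `run(program, max_steps)` is an assumed
                        # interpreter returning the program's output, or None if
                        # it diverges, exceeds the budget, or outputs otherwise.
                        for length in count(1):
                            for program in product(alphabet, repeat=length):
                                if run(program, max_steps) == target:
                                    return length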

                  However, I'm not sure this would be the most efficient algorithm. For example, it might be easy to inspect each program and discard almost all that would "obviously" not produce the input string before we even try to run them.

                  Or better yet, never even enumerate such programs that can be proven not to produce the input string. In other words, cut the search space significantly.

                  Perhaps there might even be a shortcut to directly construct the smallest program that produces a given string, or at least, a family of small candidate programs that would be a very small subset of all possible programs and yet would be guaranteed to contain the solution.

                  As you might have noticed, unfortunately I don't know if there are such extraordinarily efficient shortcuts, I'm only speculating that they might exist.

                  That said, I still suspect that this might be intractable due to having to account for the worst case, i.e. undecipherable random-looking programs. In the case of the halting problem in the context of formal verification, I'm more optimistic, since we usually only need to care about human-constructed programs rather than such random-looking ones, and those might be far easier to analyze algorithmically. I don't know if that makes sense...

          • wwilson 12 days ago
            Thank you. Comments like this are why I love HN.
        • infogulch 12 days ago
          What if you optimized inputs based on the strategy? E.g. when you're in the air, pressing down has no effect -- or, more measurably, the code paths between runs where you do or don't press down are very similar or completely converge. Similarly, jumping while running on a flat floor at full speed doesn't affect the game much, so prune it.

          Maybe call it "strategy-informed tactics"

      • gwern 12 days ago
        Most RL agents are 'jumpy' or jitter a lot, because it makes no difference to the reward, and where it does make a difference, that tends to only matter close to convergence where slight speedups are all that's left. If you want to reduce that, you have to reward-shape it to penalize excess movement. (Which is relevant in robotics, where 'jitter' can be very bad for the machinery in a way not reflected in the simple tasks' reward functions.)
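
        Concretely, the shaping might look like this (a toy sketch; the penalty constant is arbitrary):

          def shaped_reward(env_reward, prev_buttons, buttons, penalty=0.01):
              # Penalize excess movement: subtract a small cost for every
              # button whose state changed since the previous frame, so
              # jitter that earns no reward is selected against.
              changes = sum(a != b for a, b in zip(prev_buttons, buttons))
              return env_reward - penalty * changes
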
        • anyfoo 12 days ago
          Case in point: watch an actual Mario speedrunning world record video, and the human players there are jumping around a lot even while crossing flat sections with no obstacles.

          As I understand, the jumping (at least in this case) does nothing to Mario's horizontal speed at all, so they basically just do it for "fun".

          The difference is that the human players know that they're doing something inconsequential for variety, while the fuzzer has no "idea" (i.e. no concept) that the jumping does not matter, or even that it is "jumping". More generally, it does not know nor care that this particular part of the input is irrelevant. It just found an input that works, and sometimes that happens to include a useless jump.

          • gwern 10 days ago
            Or they might be doing it just to keep their fingers 'warm'. I recall during AlphaStar, there were a lot of questions about why the human pros' APMs were so high when it seemed like a lot of them were no-ops, and apparently the answer is that they do that just to keep their fingers going and reduce reflex reaction time. (As you can imagine, this would be very bad for imitation learning...)
    • mustknow2201 12 days ago
      How close is this to a genetic algorithm in practice? It seems like as soon as you start scaling out to multiple threads/nodes you'd benefit from crossover, selection techniques, and so on?
    • thsksbd 12 days ago
      How does this distinguish between a bug and a "feature"?

      Edit: I mean this in the spirit of Knuth's quip that when he dies, all the bugs in TeX will become features

      • wwilson 12 days ago
        There was a LOT of discussion of this in the Q&A after the talk. Currently we have 4 main approaches:

        (1) There's some stuff that's pretty much a bug for every program. If it segfaults, exits with a nonzero code, OOMs, triggers a TSAN error, fills the disk with fatal error messages, etc., etc., that's pretty easy to qualify.

        (2) You can use our SDK to define additional custom test properties. Think like a normal assertions library, but you can also do existential quantification ("this code is reachable/this situation can happen") and soon temporal assertions ("this should never happen without this other thing happening first, possibly on a different node"). See the sketch after this list.

        (3) We store all the output of your system in every timeline in a giant analytic database and support ad-hoc querying against it. Think "pre-observability", observability but for mirror universes. You can then do all the spelunking and analysis you would do with your production traces, but before your real customers are exposed to any issue.

        (4) We have some very cool ML approaches in the pipeline that I can't talk about quite yet.
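
        To give a feel for (2), here is roughly the shape of those property kinds (illustrative only; the names and signatures are hypothetical, not our actual SDK API):

          # Hypothetical property helpers, not the real SDK.
          reached = set()

          def assert_always(cond, name):
              # Safety property: a single violation is a bug.
              if not cond:
                  raise AssertionError(f"always-property violated: {name}")

          def assert_sometimes(cond, name):
              # Existential property: record that the situation happened; if no
              # run in the whole campaign ever records it, that absence is
              # itself a finding ("this code is reachable").
              if cond:
                  reached.add(name)

          def assert_never_before(first_seen, second_cond, name):
              # Temporal property: `second_cond` must never be observed unless
              # `first_seen` was already observed (possibly on another node).
              if second_cond and not first_seen:
                  raise AssertionError(f"ordering violated: {name}")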

        • ajb 12 days ago
          Can you define equivalence classes (mutations that shouldn't change the result), e.g. timing, order of events, idempotence, etc.? So that you can use (3) to define the correct result for all members of the class

          (Sorry if this is explained in the talk - I'll watch it but it's now too late in the day in my timezone)

    • iamcreasy 12 days ago
      Very cool presentation. Thanks for doing this.

      Choosing the next input based on the present state sounds a lot like a Markov chain - is that something you guys use when simulating user interaction with a distributed system?

  • infogulch 12 days ago
    In Will's talk he defines two terms related to optimizing fuzzers [2]: Strategy and Tactics.

    Strategy is the datum you choose to optimize for as the fuzzer randomly walks the states of the system. E.g. optimize to maximize Mario's X value, or optimize for reaching all tile positions, etc. This generalizes the concept of "coverage-guided" to include domain-specific details about your target program (e.g. that the program has the concept of a grid of possible positions).

    Tactics is the choice of input distribution. Sometimes the frequency of the randomness should be tuned for the application. For example, randomly changing the state of the A button every frame is not a good frequency to properly test long jumps; maybe a normal distribution of hold/release times averaging 1s would be better. Also, encoding the randomness within the program's valid domain can help avoid over-testing parsing/validation code at the expense of more interesting code further in the program. [1][2]
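
    A minimal sketch tying the two together (all names illustrative, not Antithesis's implementation):

      import random

      def explore(initial, step, signature, gen_input, budget=100_000):
          # "Strategy": `signature` maps a state to the datum being optimized
          # (e.g. Mario's level + x/y tile); states with a new signature join
          # the frontier, so the walk keeps expanding from novel territory.
          # "Tactics": `gen_input` is the tuned input distribution fed to
          # each step.
          seen = {signature(initial)}
          frontier = [initial]
          for _ in range(budget):
              state = random.choice(frontier)
              nxt = step(state, gen_input())
              sig = signature(nxt)
              if sig not in seen:
                  seen.add(sig)
                  frontier.append(nxt)
          return frontier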

    [0]: Barton P. Miller, Lars Fredriksen, and Bryan So. 1990. An empirical study of the reliability of UNIX utilities. Commun. ACM 33, 12 (Dec. 1990), 32–44. https://doi.org/10.1145/96267.96279

    [1]: This reference appears to be related: Rohan Padhye, Caroline Lemieux, Koushik Sen, Laurent Simon, and Hayawardh Vijayakumar. 2019. FuzzFactory: domain-specific fuzzing with waypoints. Proc. ACM Program. Lang. 3, OOPSLA, Article 174 (October 2019), 29 pages. https://doi.org/10.1145/3360600

    [2]: I introduce the concept of fuzzing in another comment: https://news.ycombinator.com/item?id=40068187#40071972

  • Taikonerd 12 days ago
    Super Mario is such a fun example. Well chosen.
  • yanniszark 12 days ago
    This is fascinating! I thought only reinforcement learning was doing things like this, but you're saying you can do this via fuzzing? What does this mean exactly? How is it able to learn to advance through all these levels? Is there an underlying learning mechanism at play?
    • infogulch 12 days ago
      It appears that you are not familiar with the concept of fuzzing.

      Fuzzing is a moderately advanced software testing technique popularized in the '90s that operates on a very simple idea: if you feed arbitrary/random data to a program's inputs, you can discover bugs in the program with little human effort.

      In the 90s they fed random data into the stdin of unix utilities and found that many programs crashed. [0] In this context printing an error message that says "I can't interpret the input" is a valid state, but reading past the end of a buffer because the input confused the program is a bug. Variants can be designed to test any API layer.
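
      The original experiment fits in a few lines (a sketch; `cmd` is whatever utility you want to test):

        import random, subprocess

        def fuzz_stdin(cmd, trials=1000, max_len=4096):
            # Miller-style fuzzing: pipe random bytes into a program's stdin
            # and flag runs that die from a signal (e.g. SIGSEGV).
            for _ in range(trials):
                data = bytes(random.randrange(256)
                             for _ in range(random.randrange(1, max_len)))
                proc = subprocess.run(cmd, input=data, capture_output=True)
                if proc.returncode < 0:
                    print(f"crash (signal {-proc.returncode}) on {len(data)} bytes")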

      More recently, coverage-guided fuzzers use information about which code paths are executed for each input as a way to reach a variety of program states more quickly. Starting with a prefix known to produce an interesting state can also speed up testing.

      I wrote a comment relating this to the article and talk in the OP here: https://news.ycombinator.com/item?id=40068187#40071950

    • vojev 12 days ago
      There's no learning exactly; as the post explains, the fuzzer is aware of various RAM addresses (as well as having a tactic for how it "presses" buttons in the game). It's just trying to explore the space of Mario's level plus his x and y coordinates.

      (I'm an Antithesis employee.)

      • nextaccountic 11 days ago
        This means that, without a learning procedure to direct Mario towards the end of the level, it can only reach the end by itself because the levels (and Mario's in-memory data structures in general) are pretty small, right?

        Or rather, if there were tons of irrelevant state, it could always end up trapped somewhere and never actually complete a level even after centuries of fuzzing.

        Something similar was tested in the Twitch Plays Pokemon [0] gaming experiment, though there the inputs only appeared random: there were "factions" that either tried to sabotage the run or tried to make it progress. Ultimately the majority of the players were cooperating to complete the game, and this was a deciding factor in making the run succeed. Maybe fuzzing Pokemon can't complete the game the way that TPP could (or reinforcement learning could).

        [0] https://en.wikipedia.org/wiki/Twitch_Plays_Pok%C3%A9mon

        • vojev 8 days ago
          The space is large; it just turns out that if you direct Mario to explore with a bit of bias (e.g., in general, some favoring of exploring from states where Mario's x coordinate is further to the right), it completes the levels.
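
          The bias can be as simple as a weighted choice over saved states (an illustration, not our actual weighting):

            import random

            def pick_state(frontier, x_of):
                # Favor expanding from states whose x coordinate is further
                # right; the linear weighting here is just an example.
                weights = [1 + x_of(s) for s in frontier]
                return random.choices(frontier, weights=weights, k=1)[0]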

          I think Pokemon could be beaten with our techniques. Final Fantasy on NES poses similar problems to Pokemon, and that is a game on which some progress has been made here in the past.

  • suprfnk 12 days ago
    @wwilson How do you define the X/Y "distance" of a non-Mario application? I.e. any (distributed or not) system that doesn't have a relatively trivial "higher x/y is better" fitness function?
  • t4ng0pwn3d 12 days ago
    I see a lot of fuzzing tools for CLI apps, but are there any good alternatives for web applications/APIs? I've used Hypothesis for generating random data in requests, but maybe there's something better out there.
  • m3kw9 12 days ago
    If you just read it, it sounds like a scam to some. Going through all states does not magically find you bugs. You need to know what a bug is first, or whether it's actually an intended feature. This article fails to explain that.
  • bbor 12 days ago
    At what point can we start suing companies on behalf of the commons for taking words from the lexicon? I miss the days when this would be called “Wilson & Co.’s automated testing solution” instead of such a beautiful, philosophically meaningful word. Same thoughts on that Devin.AI scam taking the name “Cognition” and Vercel somehow bribing their way into claiming the “ai” name on NPM.

    Technically awesome post tho! Love the heatmap especially. Maybe bring up changing your name to investors because some rando online doesn't like it, though, please.

  • m3kw9 12 days ago
    Doesn’t explain how it finds bugs it’s just had the AI play Mario bros
    • anyfoo 12 days ago
      I think everyone in the target audience of this blog post is immediately able to make the connection.
      • m3kw9 12 days ago
        The big question is: how does it find bugs without knowing whether a behaviour is intended or not?