Show HN: Hyperfine – a command-line benchmarking tool

(github.com)

96 points | by sharkdp 1628 days ago

5 comments

  • kqr 1627 days ago
    Most -- nearly all -- benchmarking tools like this work from a normality assumption, i.e. they assume that results follow the normal distribution, or something close to it. Some do this on blind faith, others argue from the CLT that "with infinite samples, the mean is normally distributed, so surely it must also hold with a finite number of samples, at least a little?"

    In fact, performance numbers (latencies) often follow a heavy-tailed distribution. For these, you need a literal shitload of samples to get even a slightly normal mean. For these, the sample mean, the sample variance, the sample centiles -- they all severely underestimate the true values.

    What's worse is when these tools start to remove "outliers". With a heavy-tailed distribution, the majority of samples don't contribute very much at all to the expectation. The strongest signal is found in the extreme values. The strongest signal is found in the stuff that is thrown out. The junk that's left is the noise, the stuff that doesn't tell you very much about what you're dealing with.

    I stand firm in my belief that unless you can prove how the CLT applies to your input distributions, you should not assume normality.

    And if you don't know what you are doing, stop reporting means. Stop reporting centiles. Report the maximum value. That's a really boring thing to hear, but it is nearly always statistically and analytically meaningful, so it is a good default.

    • gjm11 1627 days ago
      If your distribution is heavy-tailed, then in order to find the mean (or any other measure of "location") you typically need to pay _less_ attention to outliers, not _more_.

      So, e.g.: for a normal distribution, with very narrow tails -- probabilities like exp(-x^2) -- your sample mean is the maximum-likelihood estimator. For a double-exponential (Laplace) distribution, whose tails are like exp(-|x|) and therefore much fatter, the maximum-likelihood estimator is the _median_, which gives much less weight to outliers. Another way to look at it: the mean minimizes the sum of |x-m|^2 and the median minimizes the sum of |x-m|, and the former grows in size and hence importance much faster as x gets large.
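
      A minimal numpy sketch of that effect (the seed and the outlier value are arbitrary): the median barely moves when one extreme value is added, while the mean chases it.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      samples = rng.laplace(loc=0.0, scale=1.0, size=100)  # exp(-|x|)-style tails

      print("mean:", np.mean(samples), "median:", np.median(samples))

      # One extreme value drags the sample mean but hardly touches the median.
      perturbed = np.append(samples, 50.0)
      print("mean:", np.mean(perturbed), "median:", np.median(perturbed))
      ```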

      • kqr 1626 days ago
        This reads like a circular argument to me. I'm saying that the MLE is not a very useful measure of location for fat-tailed distributions, since extreme values contribute so much to the expectation.
        • gjm11 1626 days ago
          I think we could do with being a bit more explicit about some distinctions here. We have:

          1. the observations we make

          from which we can derive

          2. statistics like the sample mean, the sample median, etc.

          which may or may not give us useful information about

          3. the actual underlying distribution

          which has

          4. properties like its mean (= expectation), median, etc.

          which may or may not give us useful information about

          5. the _future_ behaviour of the system in question.

          We probably care a lot about extreme values in #5, but depending on what the actual distribution is and what we know about it a priori, the right way to predict extreme values in #5 may or may not involve paying a lot of attention to extreme values in #1.

          If, to take a very heavy-tailed example, we know that the actual distribution's PDF is proportional to 1/(1+(x-m)^2) for some m, then the sample mean is no more informative than any single sample; if we have a lot of samples and want to estimate m, then we have to pay much less attention to outliers than we do when calculating the mean or even the median.

          Note that here m is the expectation of both #3 and #5.

          For heavy-tailed distributions, it may well be that the "obvious" location parameter is not a good measure of the expectation of what we actually care about, e.g. because small changes make no difference to speak of but large ones can put us out of business. In that case, what we want under #4 might actually be something like "the 99th centile". But that still doesn't mean that outlying values in #1 are a good indicator of those.

          Here's an actual example. I generated 100 samples from a Cauchy distribution with PDF proportional to 1/(1+x^2). (So its actual mean is zero.) I computed the max-likelihood parameters of the distribution using those samples; they turned out to correspond to a PDF proportional to 1/(1+((x-a)/b)^2) where a=0.2240 and b=1.095. Here a is the mean of the distribution and b is a scale parameter.

          I then threw in an extra 1-in-10000 outlying point at x=3183 and computed the new max-likelihood parameters. They had a=0.2235 and b=1.121. So (1) an extra large positive outlier moves our idea of the mean further to the left (though only a little) but (2) it quite correctly increases our idea of how broad the distribution is.

          What's the net effect of that on, say, our prediction of the 99th centile? It goes from 35.08 with the "real" data to 35.91 with the extra outlier. That's an increase but a rather small one. (The "correct" figure -- i.e., the actual 99th centile of the distribution I really drew the samples from -- is 31.8, so we've overestimated it a bit even without the extra spurious outlier.)

          If instead we took the maximum of the samples (without my spurious extra outlier) as a guide to likely future events, we'd have got 40.5, a much worse approximation than we got by finding max-likelihood parameter estimates.

          For comparison, I did the same with a normal distribution with mean 0 and standard deviation 1. ML parameter estimates from my 100 samples were mean 0.078 and standard deviation 0.969; actual 99th centile is 2.326, estimated 99th centile is 2.332, max of 100 samples is 2.994, a much worse estimate.

          If we throw in a 1-in-10000 outlier (at x=3.719) then our estimated 99th centile moves to 2.508, a much larger change than for the Cauchy distribution relative either to the estimates themselves or to the value of the outlier we added in. It's also further out "in probability": that is, CDF(actual distribution, perturbed 99th centile) is bigger for the normal than for the Cauchy distribution.

          In other words, it looks as if even if what we care about is predicting outlying values, (1) we probably don't do well to use outliers in our data directly -- that gives us extremely noisy estimates for fat-tailed distributions; and (2) the impact on our predictions of any given outlying value will, if we do it right, typically be smaller for fat-tailed distributions than for thin-tailed ones.

          BUT a cautionary note: there is an important thing missing in the experiments I did. I found point estimates of the actual distribution and worked from those. It would be better to find probability distributions over actual distribution parameters and use those. You could e.g. do that with MCMC methods. Throwing in outliers that don't actually come from the true underlying distribution will tend to make your posterior distribution broader, which means that if you compute e.g. a posterior 99th centile value then it will be larger. I am fairly sure that conclusions #1 and #2 above will still hold, perhaps more strongly than given the simpler analysis I did.
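
          (If anyone wants to reproduce the gist of this, a minimal scipy sketch along the following lines will do; the seed is arbitrary, scipy's generic ML fit stands in for whatever fitting routine you prefer, and the exact numbers will differ from mine.)

          ```python
          import numpy as np
          from scipy import stats

          # 100 samples from a standard Cauchy distribution, PDF ~ 1/(1+x^2)
          samples = stats.cauchy.rvs(size=100, random_state=1)

          # Max-likelihood estimates of location a and scale b for 1/(1+((x-a)/b)^2)
          a, b = stats.cauchy.fit(samples)
          print("a =", a, "b =", b)
          print("estimated 99th centile:", stats.cauchy.ppf(0.99, loc=a, scale=b))
          print("true 99th centile:     ", stats.cauchy.ppf(0.99))   # ~31.8
          print("max of the samples:    ", samples.max())

          # Throw in one spurious extreme point and refit.
          a2, b2 = stats.cauchy.fit(np.append(samples, 3183.0))
          print("with outlier, estimated 99th centile:",
                stats.cauchy.ppf(0.99, loc=a2, scale=b2))
          ```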

          • kqr 1626 days ago
            Thank you! This is a very well-reasoned comment that I will find myself returning to in the future.
    • JoshTriplett 1627 days ago
      > Report the maximum value.

      In benchmarks, assuming you run the same workload each time, you often want the minimum value. Anything else just tells you how much system overhead you encountered.

      (Complete agreement that applying statistics without knowing anything about the distribution can mislead.)

      • dzamlo 1627 days ago
        [This article](https://tratt.net/laurie/blog/entries/minimum_times_tend_to_...) explains why using the minimum time may not be such a great idea.
        • virgilp 1626 days ago
          Just looking at the chart, your article makes a case that minimum is much better than maximum, and in fact if you report one number, the minimum is the best number to report (in that particular example).

          If we go into details - sure, 1 number is not ideal, but neither is the confidence interval, because people will assume normal distribution for them. If you report 2 numbers, perhaps instead of confidence interval (which falsely implies normal distribution) it's better to report the mode and the median.

      • kqr 1626 days ago
        I don't fully agree. Imagine there are two programs: Acme and Blob. Acme theoretically runs faster, but it also degrades much worse under interference from other system loads. Blob doesn't run as fast at the theoretical maximum, but it handles system load very well and doesn't degrade too badly.

        There's an argument for Blob being the better choice, if this will run in production on a system that might encounter unrelated loads. Predictable performance is frequently more useful than theoretical maximum performance.

        That said, I still agree a little bit. The minimum value is also a useful metric, and if you have the opportunity to report two numbers, the minimum-maximum pair is a great choice.

        • bradknowles 1626 days ago
          I disagree. I think you want to report as much as you reasonably can about the distribution, and then you have to let the reader make up their mind which they think is most important.

          You can have your own idea about which is most important and build your comparisons on that choice, but you should give the reader enough information that the work you put in is still valuable to them in case they should make a different choice.

          If we both have equal information about the distribution, then we can make our own choices about the data being presented.

      • virgilp 1627 days ago
        To be pedantic, you want the mode of the distribution. You are right though, the minimum is a much closer approximation of the mode in most situations.
    • skybrian 1626 days ago
      I agree that the maximum is useful, but in a situation where you're optimizing code on a machine where there is external interference from other processes, the minimum can be more relevant.

      This is when the code you write is deterministic and the interference is not. The minimum is closer to what you would get without interference.

      Just don't expect the time to represent typical results. (And why would you expect that if you're running benchmarks on your development machine?)

    • sharkdp 1627 days ago
      Thank you for the feedback. I agree with most of your points.

      > Most -- nearly all -- benchmarking tools like this work from a normality assumption

      I don't think that hyperfine makes any assumption about normality. Sure, we do report the sample mean and sample standard deviation by default, but we also report the sample minimum and maximum. You can also easily export all the benchmark results and inspect them in more detail with the supplied Python scripts.

      > In fact, performance numbers (latencies) often follow a heavy-tailed distribution

      So when is this really the case? In my understanding, if I am measuring the runtime of a deterministic program with the same input, the runtime should only be influenced by external factors that are out of my control (other programs being scheduled, caching effects, hardware-specific influences, ..). These are exactly the things that I want to "average out" by running the benchmark multiple times.

      > What's worse is when these tools start to remove "outliers".

      Hyperfine never removes outliers. What we do is to try and detect outliers. We do this by computing robust statistical estimates that specifically DO NOT assume a normal distribution (see https://github.com/sharkdp/hyperfine/blob/master/src/hyperfi... for details).

      We perform this outlier detection to warn users about potentially interfering processes or caching effects.
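
      Concretely, think of something along the lines of a modified Z-score built from the median and the median absolute deviation rather than from the mean and standard deviation (see the linked source for the exact details). A stripped-down Python sketch of that kind of check, with made-up timings, looks like this:

      ```python
      import numpy as np

      def modified_zscores(times):
          """Robust outlier scores based on median and MAD, not mean and stddev."""
          times = np.asarray(times, dtype=float)
          median = np.median(times)
          mad = np.median(np.abs(times - median))   # median absolute deviation
          if mad == 0.0:
              return np.zeros_like(times)
          return 0.6745 * (times - median) / mad

      runs = [0.285, 0.291, 0.288, 0.287, 0.290, 1.152]   # last value: cold cache
      scores = modified_zscores(runs)
      print([t for t, z in zip(runs, scores) if abs(z) > 3.5])   # -> [1.152]
      ```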

      Take a look at these results, for example: https://i.imgur.com/XRvE6Ys.png

      I benchmarked a file-searching program. The underlying distribution, while probably not normal, seems to be "well behaved", and I think that the sample mean and the sample standard deviation could be quantities with reasonable predictive power.

      What you do NOT see in the histogram is a single outlier at 1.15 seconds, far outside the plot to the right. This was the first benchmark run where the disk caches were still cold. In such a case, hyperfine warns the user:

      Warning: The first benchmarking run for this command was significantly slower than the rest (1.152 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

      In conclusion, I am not quite sure how your criticism applies to hyperfine, but I'd be happy to get further feedback.

      • kqr 1626 days ago
        Thank you for responding! I was curious to hear your thoughts on this.

        > Sure, we do report the sample mean and sample standard deviation by default, but we also report the sample minimum and maximum. You can also easily export all the benchmark results and inspect them in more detail with the supplied Python scripts.

        Hyperfine presents itself as a black-box tool where you plug in a program and it is supposed to output usable numbers. For that type of tool, "but you can script up a custom report" is not how you defend wildly speculative reporting!

        I know trying to black-box performance measurements is very, very annoying, because things we are used to taking for granted simply don't hold. But that is an inherent problem of the field, and not something one can wish away.

        > In my understanding, if I am measuring the runtime of a deterministic program with the same input, the runtime should only be influenced by external factors that are out of my control (other programs being scheduled, caching effects, hardware-specific influences, ..). These are exactly the things that I want to "average out" by running the benchmark multiple times.

        This is broadly correct. The common misconception regards how many runs are needed to successfully average out these external factors: probably more than you want your user to wait through. Sums of numbers drawn from heavy-tailed distributions converge very slowly to the normal distribution, to the point where a lot of people won't wait for it to become even close to normal.
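
        A toy simulation makes the point; the Pareto shape of 1.1 below is an arbitrary choice of something heavy-tailed (mean 10, infinite variance), with a normal distribution of the same mean for comparison:

        ```python
        import numpy as np

        rng = np.random.default_rng(0)
        for n in (10, 100, 1000, 10000):
            # 200 repetitions of "take n runs and average them"
            heavy = [rng.pareto(1.1, n).mean() for _ in range(200)]
            light = [rng.normal(10.0, 1.0, n).mean() for _ in range(200)]
            print(f"n={n:>5}  spread of heavy-tailed means: {np.std(heavy):8.2f}"
                  f"   normal means: {np.std(light):.4f}")
        ```

        Even at ten thousand runs per repetition, the heavy-tailed sample means still vary wildly between repetitions, while the normal ones settled long ago.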

        > Hyperfine never removes outliers. What we do is to try and detect outliers. We do this by computing robust statistical estimates that specifically DO NOT assume a normal distribution (see ... for details).

        Sorry, that was my misreading. I'm glad you don't remove outliers! I like that you're using robust estimations, but I'm still not convinced they work as well as we would want to. I'm sure someone else could formalise this, but just based on very simple experimentation[1], I get some worrying results: When the cutoff value is chosen so that D > 3.5 is labeled as an outlier, the samples labeled as "outliers" reliably contribute around 0.75 to the expectation of the sample.

        Despite using robust estimations, the outliers completely dominate any expectations about the sample.
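
        ([1] is an R script; a rough Python analogue of the same idea, with a lognormal standing in for the heavy-tailed distribution and parameters of my own choosing, would be something like this.)

        ```python
        import numpy as np

        rng = np.random.default_rng(0)
        samples = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)

        # Modified Z-scores D from the median and MAD (robust estimates).
        median = np.median(samples)
        mad = np.median(np.abs(samples - median))
        d = 0.6745 * np.abs(samples - median) / mad

        flagged = samples[d > 3.5]
        print("fraction of samples flagged as outliers:", len(flagged) / len(samples))
        print("share of the sample mean they carry:    ", flagged.sum() / samples.sum())
        ```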

        > Take a look at these results, for example: https://i.imgur.com/XRvE6Ys.png

        > I benchmarked a file-searching program. The underlying distribution, while probably not normal, seems to be "well behaved", and I think that the sample mean and the sample standard deviation could be quantities with reasonable predictive power.

        To me, that also looks like a heavy-tailed distribution where there simply aren't enough samples to reveal the extremal values that exist in the real population. If you still have the raw data, we could try a K-S test against the MLE fitting of some common heavy-tailed distributions to see if it's possible to rule them out, but I suspect we won't be able to do that.

        [1]: https://two-wrongs.com/pastes/outliers-529ad9.r.html

    • nerdponx 1627 days ago
      I'll take some sparkline histograms in the console, too.
  • sharkdp 1628 days ago
    I submitted "hyperfine" 1.5 years ago, when it had just come out. Since then, the program has gained functionality (statistical outlier detection, result export, parametrized benchmarks) and maturity.

    Old discussion: https://news.ycombinator.com/item?id=16193225

    Looking forward to your feedback!

    • dwohnitmok 1627 days ago
      Since you cite bench as an inspiration, have you ever thought about including the nice graphical HTML page with graphs that bench outputs? In a similar vein what are your thoughts on directly depending on and using criterion (the Rust port)?
      • sharkdp 1627 days ago
        Thank you for the feedback.

        No, we never thought about HTML output. However, there are multiple other export options and we also ship Python scripts that can be used to plot the benchmark results. The set of scripts is not very large so far, but we are happy to add new ones if the need arises. What kind of diagrams would you like to see?

        Also, I have never thought about using criterion.rs. My feeling was that it is suited for benchmarks with thousands of iterations, while we typically only have tens of iterations in hyperfine (as we typically benchmark programs with execution times > 10 ms). Do you have any specific criterion feature in mind that we could benefit from?

        • dwohnitmok 1627 days ago
          Ah those were earnest questions, not feature requests in disguise :), especially the second question. I haven't so far had any need for either of those things while using hyperfine (although admittedly I've only used hyperfine for about 3 small projects).

          I ask purely for personal edification, as a comparison with some of the design choices bench made.

          That being said, one of the hypothetical advantages of using criterion is that you get to piggy-back on its visualizations and statistical analyses, which are quite useful for showing friends and coworkers. I'm not sure of the specifics of how you're doing outlier detection, but Criterion's method is quite nice and I find the choice of linear regression for linearly increasing iterations to be an interesting sanity check.

          Criterion can work for lower iterations as well (although tens is definitely getting into the lower bound).

  • mplanchard 1628 days ago
    I started using hyperfine a few months ago now on a colleague’s recommendation and I really like it.

    In the past, I’ve cobbled together quick bash pipelines to run time in a loop, awk out timings, and compute averages, but it was always a pain. Hyperfine has a great interface and really useful reports. It actually reminds me quite a bit of Criterion, the benchmarking suite for Rust.

    I also use fd and bat extensively, so thanks for making such useful tools!

    • sharkdp 1627 days ago
      Thank you very much for your feedback!
  • breck 1626 days ago
    This is great! I was looking for something like this a year ago for benchmarking imputation scripts as part of a paper. This would have been awesome to use. Will keep it in mind for the future.
  • Myrmornis 1626 days ago
    hyperfine is really nice!

    FWIW I wrote a rough first version of a tool that runs a hyperfine benchmark over all commits in a repo and plots the results in order to see which commits cause performance changes: https://github.com/dandavison/chronologer

    • sharkdp 1626 days ago
      Very cool! I'd love to reference this in the hyperfine README
      • Myrmornis 1626 days ago
        Great, please do. I wanted to add zoom capabilities to the boxplot visualization, but it didn’t seem to be available in vega(-lite) yet.