We can do better than percentile latencies

(medium.com)

53 points | by kiyanwang 6 days ago

13 comments

  • imtringued 5 days ago

    Just plot the latency distribution as a histogram [1] and be done with this boring topic. The problem you're complaining about is that aggregating everything into a single number doesn't give you the information you want. That's impossible by design. If you want a different type of information, you will also need a different type of diagram or statistic.

    [1] https://rethinkdb.com/assets/images/docs/performance-report/...

    • jcrites 5 days ago

      I think you may be overlooking the challenges that the author is trying to tackle with their proposal.

      First, the author is looking for a characteristic that can be monitored automatically - for example, alarm if P99 latency is over 2 s. Visualizations, while useful, don't help with that.

      Second, the author is looking for a solution that can run in soft real time, so that it can be used for system monitoring.

      Third, they’re looking for a solution that does not have to aggregate the full raw data set from across the fleet. It is implied that they are working with reasonably large fleets such that full aggregation is impractical; or maybe just too costly or too slow.

      If you were able to aggregate the full raw data set in real time and compute the Nth percentile, then that statistic would meet the author’s needs. Their point is that actually computing the Nth percentile is expensive and not commonly done in real-time monitoring (hence the statistic is usually an average of host-level Nth percentiles).

      The challenge they’ve proposed is to define a statistic that is more useful for alarming while still avoiding the need to aggregate the entire raw data set.

      I thought this was a thoughtful article with a clever suggestion. “Percent of requests over threshold” meets these criteria. One criticism of this approach, however, is that the threshold needs to be known ahead of time, prior to aggregation.
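
      To make the aggregation property concrete, here's a minimal sketch (hypothetical numbers and counter names): each host ships two counters, and summing them gives the exact fleet-wide figure, which averaged host-level percentiles can't.

          # Hypothetical per-host reports: (total requests, requests over a 2 s threshold).
          host_reports = [(10_000, 12), (9_500, 40), (50, 5)]

          total = sum(n for n, _ in host_reports)
          over = sum(k for _, k in host_reports)

          pct_over = 100.0 * over / total  # exact fleet-wide value, no raw data shipped
          if pct_over > 0.5:               # e.g. alarm if >0.5% of requests breach the SLA
              print(f"ALARM: {pct_over:.2f}% of requests over threshold")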

      • heinrichhartman 4 days ago

        > Second, the author is looking for a solution that can run in soft real time, so that it can be used for system monitoring.

        Histograms can absolutely be used for alerting. We have done this at Circonus for ages: https://www.circonus.com/features/analytics/

        I wrote up the case of latency monitoring 6 weeks ago here: https://www.circonus.com/2018/08/30/latency-slos-done-right/

        • rixed 5 days ago

          But a histogram can be considered a generalised version of what the article proposes, so the OP is indeed right that histograms are already a well-known solution to that boring problem.

          • eutropia 5 days ago

            The threshold could be computed batchwise as you aggregate the stats out of band, so it would at least be dynamic, if not instantaneous.

            • londons_explore 5 days ago

              There is no reason to aggregate all raw data across the fleet. Use bucketed histograms. Have a few hundred latency buckets with exponentially spaced boundaries.

              Store counts of numbers of events in each bucket.

              A few hundred integers per server isn't hard to store and aggregate.

              Prometheus does this out of the box.

              Now you can recreate any of the charts you want!
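
              A minimal sketch of that scheme (the bucket count and boundaries here are illustrative, not what any particular system ships with):

                  import bisect

                  # ~200 exponentially spaced bucket boundaries, ~0.1 ms up to ~100 s.
                  BOUNDS = [0.0001 * 1.072 ** i for i in range(200)]

                  def record(hist, latency_s):
                      hist[bisect.bisect_left(BOUNDS, latency_s)] += 1

                  def merge(a, b):
                      # Fleet-wide aggregation is just element-wise addition of counts.
                      return [x + y for x, y in zip(a, b)]

                  hist = [0] * (len(BOUNDS) + 1)  # last slot catches overflow
                  record(hist, 0.042)             # a 42 ms request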

              • mping 4 days ago

                A probabilistic data structure comes to mind, such as Q-Digest. I wonder why it isn't used more for these kinds of problems.

              • bostik 5 days ago

                Bingo.

                As long as your histogram has the final long tail bucket (>99%) included, you'll be fine.

                After all, the 100th percentile latencies are what your users will experience as the worst case. That's what they will perceive and remember. That matters for usability. While there is no sane way to ever eliminate the most obscene outliers, you can target the worst-case behaviour and find ways to limit how badly it impacts your users.

                Anecdote from work: our exchange team (who routinely consider a 4 ms service response too slow) monitor p99 for general performance and p100 for the nastiest outliers. They want to know exactly how bad the performance is for the observed worst-case scenarios.

                • pvarangot 5 days ago

                  Technically, you could use something like Gödelization and the information would be there... I know saying this is kind of pedantic, but I'm just trying to show how percentiles vs. histograms is kind of like discussing aesthetics.

                • latch 5 days ago

                  Worth watching is How NOT to Measure Latency: https://www.youtube.com/watch?v=lJ8ydIuPFeU

                  The speaker, Gil Tene, is also the author of HdrHistogram, which addresses this article's point: https://hdrhistogram.github.io/HdrHistogram/
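
                  For the Python port on PyPI (the "hdrhistogram" package), usage looks roughly like this - treat the exact API as an assumption and check the package docs:

                      from hdrh.histogram import HdrHistogram  # pip install hdrhistogram

                      # Track 1 us to 1 hour (stored as microseconds), 3 significant digits.
                      hist = HdrHistogram(1, 60 * 60 * 1_000_000, 3)
                      hist.record_value(350)                  # e.g. a 350 us request
                      p99 = hist.get_value_at_percentile(99)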

                  • frankmcsherry 5 days ago

                    Absolutely worth watching. Cannot un-watch.

                    • romed 5 days ago

                      I don't get hdrhistogram at all. It's really just a histogram with a shitload of logarithmic buckets.

                      • sbanach 5 days ago

                        The compression format is really smart, there's a neat trick to make the logarithm calculation fast, and there's a concurrent thread-handoff mechanism so you can swap out a histogram without disturbing the thread you're measuring (though the last two are probably only in the Java version). Those three make it super useful for very low-impact performance measurements.
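
                        The logarithm trick, roughly: since the major buckets double in size, the bucket index is just the position of the value's highest set bit, so no real log is ever computed. A sketch loosely modeled on that idea (not HdrHistogram's exact layout):

                            def bucket_of(value: int) -> tuple[int, int]:
                                # Major bucket = floor(log2(value)), via bit_length()
                                # instead of an actual logarithm.
                                major = value.bit_length() - 1       # for value >= 1
                                # Minor index: a few bits below the leading bit give
                                # linear sub-buckets for extra resolution.
                                minor = (value >> max(major - 4, 0)) & 0xF
                                return major, minor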

                    • WorkLifeBalance 5 days ago

                      Stop trying to re-invent statistics. Use a box-and-whisker plot of latency. You quickly get to see the median, the quartiles, and all the outliers, and you get it in a format which is familiar and easy to understand. You can even plot box-and-whisker plots next to each other for quick, meaningful comparisons between different things.

                      • scott_s 5 days ago

                        I've recently grown to like violin plots for latency (https://en.wikipedia.org/wiki/Violin_plot). I've also added 99%ile tick marks, which, together with the already-present median mark, give a relatively full picture of latency that is easily digestible.
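
                        In matplotlib that combination is only a few lines (fake lognormal data; the tick placement is just one way to draw it):

                            import numpy as np
                            import matplotlib.pyplot as plt

                            latency_ms = np.random.lognormal(3.0, 0.6, 5000)  # fake long-tailed data

                            fig, ax = plt.subplots()
                            ax.violinplot([latency_ms], showmedians=True)
                            p99 = np.percentile(latency_ms, 99)
                            ax.hlines(p99, 0.9, 1.1, colors="red")  # 99%ile tick mark
                            ax.set_ylabel("latency (ms)")
                            plt.show()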

                        • vlovich123 5 days ago

                          Mathematics is not something handed down by the gods. It's possible to encounter not just completely new problems, but also limitations to existing methods for solving a known problem.

                          In this particular case the challenge is aggregating statistics from a very large fleet & having automated alarms. Visualization tools don't help with any of that. More specifically, the reporting tools out there apparently have a very common & persistent flaw of reporting an average of percentiles across agents, which is a statistically meaningless metric. It makes no difference how you visualize it - the data is bunk.

                          This article flips it so that agents simply report how many requests they got & how many exceeded the required threshold. This lets them report the percentage of users having a worse experience than the desired SLA. You can also build reliable tools on top of this metric. It's not a universal solution but it's a neat trick to maintain the performance properties of not needing to pull full logs from all agents & still have a meaningful representation of the latency of your users.
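
                          The agent side of that scheme can be as small as two counters per reporting interval (a sketch; the names are made up):

                              class SLACounter:
                                  """Per-agent counters; only two integers leave the host per interval."""
                                  def __init__(self, sla_seconds):
                                      self.sla, self.total, self.breached = sla_seconds, 0, 0

                                  def observe(self, latency_s):
                                      self.total += 1
                                      self.breached += latency_s > self.sla

                                  def flush(self):
                                      report = (self.total, self.breached)
                                      self.total = self.breached = 0
                                      return report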

                        • mkesper 5 days ago

                          The article states that almost everyone is doing percentile latencies wrong by averaging at the agent level (thus creating nonsensical data) and proposes using the percentage of requests that are over the threshold instead, a metric that can be averaged properly. He additionally suggests always using actionable dashboards tailored to their users (dev/ops/managers).

                          • vvern 5 days ago

                            I agree with the premise, but it seems that there are more solutions out there. As other commenters noted, you can collect histograms or hdrhistograms. Those have the problem of needing to be preconfigured, and of not being mergeable unless they are configured the same way.

                            Instead you can use the t-digest (https://github.com/tdunning/t-digest), a very cool online quantile-estimation data structure from Ted Dunning (which he has recently improved with the merging approach). There are a number of implementations out there. It is not unreasonable to serialize them and merge them. Unfortunately there's no easy way to set this up in Prometheus, but making that easy could be a fun project.
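
                            One Python implementation (the "tdigest" package on PyPI) looks roughly like this - the exact API is an assumption, so check its README:

                                from tdigest import TDigest  # pip install tdigest

                                d1, d2 = TDigest(), TDigest()
                                d1.batch_update([0.12, 0.30, 2.10])  # host A latencies (s)
                                d2.batch_update([0.08, 0.25, 0.40])  # host B latencies (s)

                                fleet = d1 + d2              # digests merge, unlike percentiles
                                p99 = fleet.percentile(99)   # estimated fleet-wide p99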

                                • henridf 5 days ago

                                  I'm not sure which tools the author has tried, but the Prometheus monitoring system supports both histograms and quantiles.

                                  There's a good discussion of the respective merits of each at https://prometheus.io/docs/practices/histograms/#quantiles
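
                                  With the official Python client, a latency histogram looks like this (the bucket boundaries and handler are placeholders):

                                      from prometheus_client import Histogram  # pip install prometheus-client

                                      REQUEST_LATENCY = Histogram(
                                          "request_latency_seconds",
                                          "Request latency in seconds",
                                          buckets=(0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
                                      )

                                      with REQUEST_LATENCY.time():  # observes elapsed time on exit
                                          handle_request()          # hypothetical request handler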

                                  • deathanatos 5 days ago

                                    Histograms require you to configure buckets into which your samples are allocated; to allocate the buckets appropriately, you need to know what your expected values are; that is, to measure latency, you need to know your latency. While this can work (I think most of us have a clear idea, or can obtain an idea, of what our typical latencies are, and configure buckets around that), it is inelegant. I feel like I would rather have X=percentile, Y=latency, but such a bucketing gives you X=latency, Y=request count. Still useful, but only as informative as you are good at choosing buckets. (There is the histogram_quantile function, but I am unclear that its assumption of a linear distribution within buckets really makes much sense, since most things would be long-tail distributions, and thus I would think that once you get past the main "hump" of typical latencies, most samples would cluster towards the lower end of any particular bucket.)
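
                                    For reference, the interpolation being questioned works roughly like this (a sketch of the idea behind histogram_quantile, not Prometheus's source):

                                        def bucket_quantile(q, bounds, counts):
                                            # bounds[i] = upper edge of bucket i;
                                            # counts are per-bucket, not cumulative.
                                            target = q * sum(counts)
                                            seen = 0
                                            for i, c in enumerate(counts):
                                                if c and seen + c >= target:
                                                    lo = bounds[i - 1] if i else 0.0
                                                    # Assumes samples spread linearly inside
                                                    # the bucket -- the dubious part for
                                                    # long-tailed latency distributions.
                                                    frac = (target - seen) / c
                                                    return lo + (bounds[i] - lo) * frac
                                                seen += c
                                            return bounds[-1]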

                                    I am not clear on how Summaries actually work; they appear to report count and sum of the thing they're monitoring; that is, if one were to use them for latencies (and the docs do indeed suggest this), it would report a value like "3" and "2000ms", indicating that 3 requests took a total of 2000ms together; how is one supposed to derive a latency histogram/profile from that?

                                    Prometheus's fatal flaw here, IMO, is that it requires sampling of metrics. That works for things like CPU, which are essentially a continuous function that you sample over time. But its collection method/format doesn't seem to work that well when you have an event-based metric, such as request latency, which only happens at discrete points. (If no requests are being served, what is the latency? It makes no sense to ask, unlike CPU usage or RAM usage.)

                                    To me, ideally, you want to collect up all the samples in a central location and then compute percentiles. Anything else seems to run afoul of the very "doing percentiles on the agents, then 'averaging' percentiles at the monitoring system" critique pointed out in the video posted in this sibling comment: https://news.ycombinator.com/item?id=18194507

                                    • tyldum 5 days ago

                                      Your points are largely valid, but Prometheus is a monitoring solution, not a scientific or financial tool. Certain tradeoffs are made since the monitoring aspect comes first and being scientifically correct comes second. Hence poll vs. push, for instance.

                                  • digikata 5 days ago

                                     For diagnosing, I like building up a cumulative distribution function (CDF) plot. If you're collecting data for either percentiles or thresholds, you likely have the data already. If you're setting thresholds, it's a useful plot for judging how likely a given threshold is to trigger an alarm.
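
                                     A minimal version with fake data (reading off where a candidate threshold crosses the curve shows the fraction of requests that would breach it):

                                         import numpy as np
                                         import matplotlib.pyplot as plt

                                         lat = np.sort(np.random.lognormal(3.0, 0.6, 5000))  # fake, ms
                                         cdf = np.arange(1, lat.size + 1) / lat.size

                                         plt.plot(lat, cdf)
                                         plt.axvline(100, linestyle="--")  # candidate 100 ms threshold
                                         plt.xlabel("latency (ms)")
                                         plt.ylabel("fraction of requests <= x")
                                         plt.show()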

                                    • amarant 5 days ago

                                       I really like the idea of displaying what percentage is over a certain threshold. At my work, we kinda sorta simulate this by having separate alarms for many percentiles (with increasing thresholds). The approach suggested by the article seems to be quite obviously better tbh.

                                      • jordanthoms 5 days ago

                                         The approach here sounds a bit like the Apdex metric New Relic has been doing for years? Is there something different I'm missing?

                                        • afpx 5 days ago

                                          Reservoir sampling with outliers?

                                          • krona 5 days ago

                                            Correct.

                                          • asplake 5 days ago

                                            Mean excess delay?