Just plot the latency distribution in a histogram and be done with this boring topic. The problem you're complaining about is that aggregating everything into a single number doesn't give you the information you want. That's impossible by design. If you want a different type of information, you will also need a different type of diagram or statistic.
I think you may be overlooking the challenges that the author is trying to tackle with their proposal.
First, the author is looking for a characteristic that can be monitored automatically, for example, alarm if P99 latency is over 2 s. Visualizations, while useful, don’t help with that.
Second, the author is looking for a solution that can run in soft real time, so that it can be used for system monitoring.
Third, they’re looking for a solution that does not have to aggregate the full raw data set from across the fleet. It is implied that they are working with reasonably large fleets such that full aggregation is impractical; or maybe just too costly or too slow.
If you were able to aggregate the full raw data set in real time and compute the Nth percentile, then that statistic would meet the author’s needs. Their point is that actually computing the Nth percentile is expensive and not commonly done in real-time monitoring (hence the statistic that gets reported is usually an average of host-level Nth percentiles).
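To make the gap between "average of host-level percentiles" and the real fleet-wide percentile concrete, here is a minimal sketch. The two hosts and their latencies are made up for illustration, and `percentile` is a plain nearest-rank helper, not anything from the article.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    # Ceiling division in integer arithmetic to avoid float rounding surprises.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical fleet: host A is busy and healthy, host B is nearly idle and slow.
host_a = [10] * 990 + [50] * 10   # 1000 requests, latencies in ms
host_b = [2000] * 10              # 10 requests, all slow

# What many dashboards report: the average of each host's own p99.
avg_of_p99 = (percentile(host_a, 99) + percentile(host_b, 99)) / 2

# What you would get if you could pool every raw sample first.
fleet_p99 = percentile(host_a + host_b, 99)

print(avg_of_p99, fleet_p99)
```

With these made-up numbers the averaged figure lands around one second while the pooled p99 is 50 ms, because the average ignores how many requests each host actually served.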
The challenge they’ve proposed is to define a statistic that is more useful for alarming while still avoiding the need to aggregate the entire raw data set.
I thought this was a thoughtful article with a clever suggestion. “Percent of requests over threshold” meets these criteria. One criticism of this approach however is that the threshold needs to be known ahead of time, prior to aggregation.
As long as your histogram has the final long tail bucket (>99%) included, you'll be fine.
After all, the 100th-percentile latencies are what your users will experience as the worst case. That's what they will perceive and remember. That matters for usability. While there is no sane way to ever eliminate the most obscene outliers, you can target the worst-case behaviour and find ways to limit how badly it impacts your users.
Anecdote from work: our exchange team (who routinely consider 4ms for service response too slow) monitor p99 for general performance and p100 for the nastiest outliers. They want to know exactly how bad the performance is for the observed worst-case scenarios.
Technically, you could use something like Gödelization and the information would be there... I know saying this is kind of pedantic, but I'm just trying to show how percentiles vs. histograms is kind of like discussing aesthetics.
The compression format is really smart, there's a neat trick to make the logarithm calculation fast, and there's a concurrent thread handoff mechanism so you can swap out a histogram without disturbing the thread you're measuring (though the last two are probably only in the Java version). Those three make it super useful for very low-impact performance measurements.
Stop trying to re-invent statistics. Use a box and whisker plot of latency. You quickly get to see the median, the quartiles, and all the outliers, and you get it in a format which is familiar and easy to understand. You can even plot box and whisker plots next to each other for quick, meaningful comparisons between different things.
I've recently grown to like violin plots for latency (https://en.wikipedia.org/wiki/Violin_plot). I've also added 99%ile tick marks, which with the already present median mark, gives a relatively full picture of latency that is easily digestible.
Mathematics is not something handed down by the gods. It's possible to encounter not just completely new problems, but also limitations to existing methods for solving a known problem.
In this particular case the challenge is aggregating statistics from a very large fleet & having automated alarms. Visualization tools don't help with any of that. More specifically, the reporting tools out there apparently have a very common & persistent flaw of reporting an average of percentiles across agents which is a statistically meaningless metric. It makes no difference how you visualize it - the data is bunk.
This article flips it so that agents simply report how many requests they got & how many exceeded the required threshold. This lets them report the percentage of users having a worse experience than the desired SLA. You can also build reliable tools on top of this metric. It's not a universal solution but it's a neat trick to maintain the performance properties of not needing to pull full logs from all agents & still have a meaningful representation of the latency of your users.
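A minimal sketch of that reporting scheme, as I understand it: each agent ships only two counters, and the central side sums them before dividing. The `AgentReport` type, the function names, and the 2000 ms threshold are my own assumptions for illustration, not the article's code.

```python
from dataclasses import dataclass

@dataclass
class AgentReport:
    total: int  # requests the agent saw in the reporting interval
    over: int   # requests slower than the agreed threshold

def summarize(latencies_ms, threshold_ms):
    """Runs on each agent; only two integers leave the host."""
    over = sum(1 for x in latencies_ms if x > threshold_ms)
    return AgentReport(total=len(latencies_ms), over=over)

def fleet_percent_over(reports):
    """Runs centrally; sums the counters first, then divides once."""
    total = sum(r.total for r in reports)
    over = sum(r.over for r in reports)
    return 100.0 * over / total if total else 0.0

THRESHOLD_MS = 2000  # hypothetical SLA threshold, fixed ahead of time

reports = [
    summarize([10] * 990 + [50] * 10, THRESHOLD_MS),  # busy, healthy host
    summarize([2500] * 10, THRESHOLD_MS),             # idle, slow host
]
print(fleet_percent_over(reports))  # a bit under 1% of all requests
```

Because the counters are summed before the division, a host with few requests cannot skew the result the way it does when per-host percentiles are averaged.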
The article states that almost everyone is doing percentile latencies wrong by averaging on agent level (thus creating nonsensical data) and proposes using the percentage of requests that are over the threshold instead, a metric that can be averaged properly.
He additionally suggests always using actionable dashboards tailored to their users (dev/ops/manager).
I agree with the premise, but it seems that there are more solutions out there. As other commenters noted, you can collect histograms or HdrHistograms. Those have the problem of needing to be preconfigured and of not being mergeable unless they are configured the same way.
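To illustrate the merge constraint, here is a toy fixed-bucket histogram. It is not HdrHistogram or any real library's API; the class name and bucket bounds are hypothetical. Merging is just element-wise addition of counts, which only makes sense when both sides use the same bucket layout.

```python
from bisect import bisect_left

class BucketHistogram:
    """Toy fixed-bucket histogram; bucket i covers (bounds[i-1], bounds[i]]."""

    def __init__(self, upper_bounds):
        self.upper_bounds = tuple(sorted(upper_bounds))
        # One extra slot at the end for values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def record(self, value):
        self.counts[bisect_left(self.upper_bounds, value)] += 1

    def merge(self, other):
        # Counts line up only if both histograms share the same layout.
        if self.upper_bounds != other.upper_bounds:
            raise ValueError("cannot merge histograms with different buckets")
        merged = BucketHistogram(self.upper_bounds)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

a = BucketHistogram([10, 100, 1000])  # bounds in ms, chosen up front
b = BucketHistogram([10, 100, 1000])
for v in (5, 50, 50):
    a.record(v)
for v in (500, 5000):
    b.record(v)

print(a.merge(b).counts)  # [1, 2, 1, 1]
```

Trying to merge with a histogram built on, say, `[20, 200, 2000]` raises the `ValueError`, which is the practical headache with agents that were configured differently.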
Instead you can use the t-digest (https://github.com/tdunning/t-digest), a very cool online quantile estimation data structure from Ted Dunning (which he has recently improved with the Merging approach). There are a number of implementations out there. It is not unreasonable to serialize them and merge them. Unfortunately there’s no easy way to set this up in Prometheus, but making that easy could be a fun project.
Histograms require you to configure buckets into which your samples are allocated; to allocate the buckets appropriately, you need to know what your expected values are. That is, to measure latency, you need to know your latency. While this can work (I think most of us have a clear idea, or can obtain an idea, of what our typical latencies are, and can configure buckets around that), it is inelegant. I feel like I would rather have X=percentile, Y=latency, but such a bucketing gives you X=latency, Y=request count. Still useful, but only as informative as you are good at choosing buckets. (There is the histogram_quantile function, but I am not sure that its assumption of a linear distribution within buckets really makes much sense, since most things would be long-tail distributions, and thus I would think that once you get past the main "hump" of typical latencies, most samples would cluster towards the lower end of any particular bucket.)
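To show what the linear-within-bucket assumption does in practice, here is a simplified sketch of quantile estimation from cumulative bucket counts. It is my own toy version of the idea, not Prometheus's implementation, and it ignores edge cases such as an infinite top bucket; the function name and bucket values are made up.

```python
def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from (upper_bound, cumulative_count) pairs.

    Assumes buckets are sorted by upper bound, the lowest bucket starts at 0,
    and samples are spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation between the bucket's two edges.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound

# Hypothetical data: 900 requests at or under 100 ms, 100 more at or under 1000 ms.
buckets = [(100.0, 900), (1000.0, 1000)]
print(estimate_quantile(0.95, buckets))  # roughly 550 ms
```

The estimate comes out near the middle of the 100 to 1000 ms bucket. If those 100 slow requests actually sat around 110 to 150 ms, as a long-tail distribution would suggest, the true p95 would be far below that, which is the concern raised above.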
I am not clear on how Summaries actually work; they appear to report count and sum of the thing they're monitoring; that is, if one were to use them for latencies (and the docs do indeed suggest this), it would report a value like "3" and "2000ms", indicating that 3 requests took a total of 2000ms together; how is one supposed to derive a latency histogram/profile from that?
Prometheus's fatal flaw here, IMO, is that it requires sampling of metrics. That works for things like CPU usage, which is essentially a continuous function that you're sampling over time. But its collection method/format doesn't seem to work that well when you have an event-based metric, such as request latency, which only happens at discrete points. (If no requests are being served, what is the latency? It makes no sense to ask, unlike CPU usage or RAM usage.)
To me, ideally, you want to collect up all the samples in a central location and then compute percentiles. Anything else seems to run afoul of the very "doing percentiles on the agents, then 'averaging' percentiles at the monitoring system" critique pointed out in the video posted in this sibling comment: https://news.ycombinator.com/item?id=18194507
Your points are largely valid, but Prometheus is a monitoring solution, not a scientific or financial tool.
Certain tradeoffs are taken since the monitoring aspect comes first and being scientifically correct comes second.
Hence poll vs push, for instance.
For diagnosing, I like building up a cumulative distribution function (CDF) plot. If you're collecting data either for percentiles or thresholds you likely have the data already. If you're setting thresholds, it's a useful plot to judge how likely a given threshold might trigger an alarm.
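A minimal sketch of building the points for such a CDF plot and reading off how often a candidate threshold would be exceeded. The sample latencies and the 2000 ms threshold are invented for illustration; feed the `(latency, fraction)` pairs to whatever plotting tool you prefer.

```python
def empirical_cdf(samples):
    """Return (value, fraction of samples <= value) pairs, one per sample."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

def fraction_over(samples, threshold):
    """Share of samples strictly above the threshold, i.e. 1 - CDF(threshold)."""
    return sum(1 for x in samples if x > threshold) / len(samples)

latencies_ms = [120, 80, 300, 95, 2500]  # hypothetical measurements

for value, fraction in empirical_cdf(latencies_ms):
    print(f"{value:>6} ms  {fraction:.0%}")

print(fraction_over(latencies_ms, 2000))  # 0.2
```

Reading the curve at a proposed threshold tells you what fraction of historical requests would have tripped it, which is a quick sanity check before wiring the threshold into an alarm.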
I really like the idea of displaying what percentage is over a certain threshold. At my work, we kinda sorta simulate this by having separate alarms for many percentiles (with increasing thresholds). The approach suggested by the article seems to be quite obviously better tbh.