M3DB, a distributed timeseries database

(m3db.io)

320 points | by Anon84 1525 days ago

13 comments

  • roskilli 1525 days ago
    Thanks for the interest. I gave a talk at FOSDEM a few weeks ago on querying the large datasets that M3DB can warehouse and serve in real time:

    Slides https://fosdem.org/2020/schedule/event/m3db/attachments/audi...

    Video https://video.fosdem.org/2020/UD2.120/m3db.mp4

  • missosoup 1525 days ago
    Uber has started many projects that ended up getting open sourced. And many of them are now either abandoned or on life support. H3 comes to mind as something we almost ended up using but luckily avoided.

    These open-sourcings seem a bit like PR pieces with no guarantees of any support or evolution after being published.

    • richieartoul 1525 days ago
      Chronosphere, a startup founded by two of the early M3 engineers, just raised 11 million dollars to build a monitoring platform based around M3DB: https://techcrunch.com/2019/11/05/chronosphere-launches-with...

      Uber also uses M3DB extensively internally and the project is nowhere near being abandoned or on life support: https://github.com/m3db/m3/commits/master

      • StreamBright 1524 days ago
        >> While the founders, CEO Martin Mao and CTO Rob Skillington, were working at Uber, they recognized a gap in the monitoring industry, particularly around cloud-native technologies like containers and microservices

        What is the actual gap that is not addressed with one or all of the following?

        - Prometheus / Grafana [1]

        - Datadog [2]

        - Cloudwatch [3]

        1. https://docs.docker.com/config/thirdparty/prometheus/

        2. https://www.datadoghq.com/blog/introducing-live-container-mo...

        3. https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...

        • johncolanduoni 1523 days ago
          Prometheus doesn’t have a good built-in story for horizontal scaling of storage and queries, or for long-term storage. That’s why there’s M3DB, Cortex, etc., which let you send metrics to a set of Prometheus servers that then write to a cluster handling broad queries and tiered storage. So these are less competitors to Prometheus and Grafana and more augmentations of them, since they support both ingesting metrics in Prometheus form and querying them with PromQL.

          Datadog gets very expensive beyond a certain point; much bigger than most startups, but much smaller than Uber.

          Cloudwatch is usually used as a source of metrics rather than a destination. It doesn’t have as rich a data model as Prometheus for custom metrics, and has a lot of quite limiting restrictions (like only 10 dimensions per metric).

          • StreamBright 1522 days ago
            Thanks for the detailed explanation. I agree with you on all points. I also think that even with the limitations you mentioned one company can find suitable monitoring for almost every scenario. I guess there is still a big enough market for M3DB where you need a huge amount of metrics, horizontal scalability and efficient long term storage.
        • nwsm 1523 days ago
          Logging / Monitoring is the hottest market to disrupt right now.
      • bfung 1524 days ago
        Yep, another way to pre-market/pre-signal to investors.

        A pretty common story these days:

        1. I built an X to solve Unicorn U’s problem.

        2. Open source it, give talks on it

        3. Leave Unicorn U and start a company based on X

        4. ...

        5. Profit(???)

        Also, debatable whether or not Unicorn U actually needed a freshly built X instead of using existing tools/tech.

    • gtirloni 1525 days ago
      It's open source. Why should Uber give any guarantees? They are not in the business of selling software.

      Unless Uber is actively blocking contributions, it's not Uber's fault if no community formed around something they open sourced.

      As for this being a PR piece, they could have achieved the same with just a detailed blog post and no code. It would be an expensive PR piece if they have to open source work that probably took hundreds of development hours.

      • missosoup 1525 days ago
        I agree with everything you say.

        But without any certainty around the roadmap, support, and long-term commitment by Uber to maintain these projects, they're nothing more than interesting repos amongst a sea of interesting repos.

        The way Uber brands them suggests that they're suitable for use in production environments, but so far that hasn't been the case with anything they open sourced outside a narrow envelope that resembles their own operating model. Maybe this project will set a new trend, but so far nothing they put out gained any traction or became suitable for general purpose production use. In that regard, H3 and their other projects have remained at the level of decent 'show HN' pieces rather than something you'd ever use professionally. In other words, marketing.

        That's based on previous news coming out, e.g. https://news.ycombinator.com/item?id=20931644

        It seems like Uber had too big of an engineering department with too little work to do, so they started reinventing wheels. Which is cool if they're willing to support them in the long term, but so far that hasn't proven to be the case.

        • carlisle_ 1525 days ago
          >It seems like Uber had too big of an engineering department with too little work to do, so they started reinventing wheels. Which is cool if they're willing to support them in the long term, but so far that hasn't proven to be the case.

          Former Uber engineer here. I can assure you that while our engineering team was massive, there was anything but too little work. If anything, most engineers were massively overtaxed. Whether or not the work we were undertaking was meritorious and valuable is an entire branch of philosophy, I'm pretty sure.

          Part of the struggle at big companies is that a lot of existing solutions just don't work. Let me use an example with chat. A few years ago Slack was evaluated as a replacement for HipChat, since Atlassian's outages had finally started affecting us during our own outages.

          Everybody wanted to go to Slack, but the cost of Slack was tremendously prohibitive and the state of the service then (as I was told) was such that it could not support a company of Uber's size. Tremendous effort would have had to be undertaken by Slack to support Uber, and they didn't want to expend that effort for a single customer. This was late 2015 / early 2016.

          There were tons of options, but ultimately an in-house chat software was created. At the time it seemed necessary to build our own highly reliable chat, considering how distributed the engineering teams were. I think if you talk to anybody without the background of how chat evolved at Uber, they would think the in-house chat project was a boondoggle.

          Not all over-scoped engineering projects are actually so noble. There was certainly a ton of "reinventing wheels" going on. There was significantly more "these problems are really hard and I only have bad solutions."

          Though if the result is ultimately "nothing more than interesting repos amongst a sea of interesting repos", sign me up.

          • jjeaff 1524 days ago
            There seems to be a pervasive misconception among the very employees at Uber that they built their own chat platform, when in reality (and someone please correct me if I am wrong) uChat was a white-labeled Mattermost.

            I have heard that the team that put it together actually tried to hide that fact from the company (for the glory, I guess). But that could be apocryphal.

            • carlisle_ 1524 days ago
              That's entirely not true. The team was always forthcoming about the fact it was Mattermost, at least to other engineers.

              Mattermost didn't work out of the box, and certainly not the way and at the scale Uber needed it to. I'm not overly familiar with the technical details, but one thing in particular stands out as an example. There was a Town Hall channel that every user had to be a member of. This unfortunately did not scale, and not enough ACLs were available to limit all the ways users could use this universal room. Eventually they really fixed the problem, but it was a tremendous pain point for a long time. There were a lot of fundamentally "less than great" things about Mattermost that had to get updated to work for Uber.

              There was the amusing time employees found out anybody could change the topic in the room, even if our chat permissions had been disabled. It was absolute chaos for at least an hour. I can't remember if it actually negatively impacted the deployment, though it sounds vaguely familiar.

              It's pretty telling of employees that badmouth the uChat team. That team ultimately was trying to do what they thought was best for the company, even if at the time it seemed like they bit off more than they could chew. There was no other engineering team so directly visible and exposed to the entire company internally like they were. People dismissive of their efforts are generally not used to the difficulty of making so many very vocal customers happy all at the same time, and could be more sympathetic.

              • jjeaff 1523 days ago
                I dug up the comment I saw that mentioned this. They also claim to be an Uber employee.

                https://news.ycombinator.com/item?id=19101617

                The fact that uChat is commonly described by Uber employees as "the custom chat solution we built in-house" leads me to believe that comment's claim that the team tried to hide that it was built on open source.

                I don't doubt for a second that scaling Mattermost for a huge organization like Uber was a big undertaking. But it seems disingenuous for people to always mention that Uber built uChat when it should be more like "Uber put a lot of work into Mattermost to scale it up."

            • iamleppert 1524 days ago
              Why not just use IRC?? It scaled for the entire internet.
              • cerberusss 1523 days ago
                These IRC replacements do more than that. There are voice and video calls, integrated file sharing, and a whole bunch of tools that you can add.

                To be honest, I don't much like Slack because I feel the desktop app doesn't feel like a real macOS app. And I don't use all these features. So in the end, IRC would be fine for me. But it wouldn't for the rest of the company.

          • remote_phone 1524 days ago
            Uber is in the process of ditching uChat and moving to Slack
            • carlisle_ 1524 days ago
              I accidentally left this point out. In retrospect it's easy to say Uber made the wrong decision to make uChat but it was one of few options at the time.
              • hitekker 1524 days ago
                Seems like a huge point to leave out.

                Are you affiliated with Uber?

                • pc86 1524 days ago
                  Well the comment starts with "former Uber engineer here" so I'd venture yes but not anymore.
                • jjeaff 1524 days ago
                  Doesn't seem relevant unless the point about slack being too expensive and unable to handle the load is untrue.
          • creddit 1524 days ago
            > There were tons of options, but ultimately an in-house chat software was created.

            You drank too much kool-aid. uChat was just a reskinned Mattermost.

            • carlisle_ 1524 days ago
              I think you're being overly dismissive of how much work that team did.
        • Scarbutt 1525 days ago
          Ignore the project and move on? If you expect every open source project to cater to all your entitlements, you will be repeatedly disappointed.
        • lazaroclapp 1524 days ago
          > The way Uber brands them suggests that they're suitable for use in production environments, but so far that hasn't been the case with anything they open sourced outside a narrow envelope that resembles their own operating model.

          Not a contradiction. Many of these tools are suitable for use in production, almost by definition, since they are being used in production, at Uber. They might or might not work in your environment out of the box. But even when they don't, they are often a better starting point than an empty editor. Most of the ones I am familiar with are happy to get PRs generalizing them to more varied environments.

          > they're nothing more than interesting repos amongst a sea of interesting repos.

          As someone who has open-sourced on GitHub: research prototypes hacked together for a research paper deadline in grad school, class projects, for-fun hacks, and also production tooling I built as a paid engineer, I'd say there is a big difference! :) And there would still be a big difference even if the latter were somehow never touched again after the first "we are open-sourcing this!" commit.

          That said, we do try to maintain the things we open-source. Standards of support vary because individuals maintaining these projects, and their situations, vary. This is true for non-OSS internal tools too. In my experience, having gone through the Uber OSS process twice, and having started it a third time and decided against releasing (yet?), Uber does try to make reasonably sure that it's open-sourcing stuff that will be useful and is planned to be maintained. At the same time, they have to balance it with making it easy to open-source tools, otherwise too many useful things would remain internal only.

          Also, note, some of these tools have exactly one developer internally as the maintainer, and not even as their full time job. For example, I am the sole internal maintainer[1] for https://github.com/uber/NullAway and also have 3-4 other projects internally on my plate, most of which are in earlier stages and need more frequent attention[2]. If and when said developer leaves, effort is made to find a new owner. This is not always successful, particularly if the tool has become non-critical internally. Sometimes, leaving owners retain admin rights on the repos and keep working on the tool (Manu, NullAway's original author, co-maintains it), but I don't think anyone is suggesting that that should be an obligation.

          Finally, obviously, nothing here is the official Uber position on anything, just my own personal observations. This doesn't represent my employer, and so on. I am also pretty sure most of this is not even Uber specific :)

          [1] Not the only internal contributor! Also, there is one external maintainer, as mentioned a few sentences later. But in terms of this being anyone's actual responsibility...

          [2] Just to clarify, I think between Manu's interest, my own, and it being relatively critical tooling at Uber, NullAway is pretty well maintained. But I can understand why that isn't always a given for all projects.

        • mamon 1525 days ago
          Isn't that kind of the point of open-sourcing your internal tools? You don't want to be bothered with maintenance and support, so you're hoping that some anonymous volunteers will do that for you :)
          • lopsidedBrain 1525 days ago
            Pretty much every successful open source project that people pay attention to is one that has had long-term support behind it. Linux, Mozilla, gcc, clang, git. Almost always, that support begins with the original author.

            Projects that don't do that are therefore unlikely to remain interesting for long.

          • cfors 1525 days ago
            Maybe I’m a cynic about all big corp companies, but if you’ve ever worked with a big corp open source department, you know that’s almost the entire point: to build PR for the engineering department. Same goes for tech blogs. These things will be PR pieces first, and valid production tools/frameworks second (mostly).
            • closeparen 1525 days ago
              In my experience software gets written in the first place for the usual internal reasons. Corporate or individual prestige may be the driving factor in open sourcing, though, rather than a genuine interest in having it used externally.
      • _jal 1525 days ago
        > Why should Uber give any guarantees?

        First, "guarantee" is the wrong word to take too literally here. Depending on how you want to look at it, there are no guarantees, even with guarantees.

        But looked at more loosely, answering that is really Uber's responsibility. Why did they release it? If it is just a PR release, fire-and-forget works fine for that.

        If they want to see wider adoption outside of their firm, there are some fairly obvious things they should do to foster that. Sometimes you release just the right thing at just the right time and everyone else does your evangelism and support work for you, but it is much more normal for your next great thing to take a while to build a user base.

        • tylerl 1524 days ago
          There are lots of reasons to open source internal software, and only a minority of them involve establishing a serious community and driving significant adoption. But the PR claim you're making isn't particularly credible. The ROI is abysmal if that's all you're after, and there are easier ways to get it.

          Based on your specific complaints, it sounds like your opinion doesn't matter in this case; you're not the audience. You want support and a predictable future: you're looking for a product, not for technology. This isn't a product, and it's not a platform.

          If instead you represented another company looking into solving this same problem yourself, and are looking at starting points, then you're the perfect audience. In that case, you'd have time and motivation to contact the developers directly rather than gripe on HN. You'd be less interested in whether there was an organized community, and more interested in how to directly influence the roadmap. You'd care about what the code looks like, how they solved Problem X and Problem Y, that kind of thing.

      • ForHackernews 1525 days ago
        Everything you say is true, but tossing useless code releases over the wall isn't really participating in the open source community, either.

        It looks to me like maybe their engineers internally are fans of the idea of "open source", and the PR department is happy to try and get some good press out of it, but the company culture isn't really set up to develop in public or maintain these things they've nominally "released".

        Sadly, this isn't unusual among tech companies, but it'd be more obvious what's happening if they just put up a bare-bones FTP with a README: "Here's some code under <LICENSE>. Use it at your own risk."

      • papito 1524 days ago
        Even I think twice about releasing something that I cannot commit to for at least a little while - the initial bugfix stage, at minimum. I would say a giant like Uber hurts itself more in terms of PR by not being conservative enough about putting source out there. People inevitably gravitate towards big-player "open sauce" as it implies some commitment.

        A company should first put their own system through hell and decide that "yeah, this is good, we are sticking with this", before luring people to use it.

      • grogenaut 1524 days ago
        Open source where the steward is just limping the project along is the worst type of open source, because the steward usually isn't going to make any real decisions around the project, and people are reticent to fork it and drive it because there is a steward doing some activity. Selenium was in that state for years.
      • Udik 1524 days ago
        > They are not in the business of selling software.

        But they are probably in the business of selling themselves as a software/ tech company. People value tech companies, so you'd better be one, even if you do taxi services, produce and distribute tv series, or rent office space.

    • reichardt 1525 days ago
      Why do you consider H3 to be on life support? It's basically a finished spec with actively developed implementations. https://github.com/uber/h3
      • ForHackernews 1524 days ago
        So is Gopher, but almost nobody uses it.
    • excerionsforte 1524 days ago
      https://github.com/facebookarchive - 11 Pages of unsupported open source software.

      Beringei, a TSDB (https://github.com/facebookarchive/beringei), in particular fits what you are saying about PR pieces (https://engineering.fb.com/core-data/beringei-a-high-perform...), since it was never really used by anyone outside of FB.

      I really don't see the negative part of free code that you can learn from and/or incorporate at all.

      • Fellshard 1524 days ago
        Free code is good, yes.

        When you rely on the systems themselves, and do so with expectation of support from the originating company, your expectations will almost certainly be broken.

        I think that's the simplest takeaway - not to run away from any open-sourced project, but to take into proper consideration if/how they plan on supporting the tool, and how much you would be capable of adapting and owning yourself if the worst happened.

        • excerionsforte 1524 days ago
          Yeah exactly. If one wants support they can pay for it i.e. SaaS if available. If an open source project has not created a contract with any user then there is no guarantee of support. I don't believe any company creates a contract with users automatically because they made source code available. That is unsustainable.

          Chronosphere is the SaaS part for M3DB in this case. The negativity around someone open sourcing code for PR is nuts especially when all of the code is available. I love reading the code and getting ideas about how things work.

    • api 1525 days ago
      Many large companies push this stuff out for publicity and recruitment. Sometimes employees are encouraged to spend a little bit of their time on it or to brand extracurricular activities with the company name for publicity.

      The test for open source is if it keeps getting maintained and supported for years. That only happens when the project is a core business effort, has some direct means of support (e.g. dual licensing or SaaS), or happens to be one of the few genuinely volunteer driven large scale open source projects.

    • ajfriend 1524 days ago
      We're definitely still working on H3. We just got a nice, new domain: h3geo.org

      We're also basically done with a new Python wrapper written in Cython. https://github.com/uber/h3-py/tree/cython

      We could probably use some help with the last step of packaging, if anyone is interested!

    • parentheses 1524 days ago
      The issue here is not the company but the fact that the owners of the original library did not figure out how to “disown” the library.

      Open sourcing something is naturally more expensive than not. It’s seldom the case that impact to the community triggers contributions that outweigh that cost.

      The fallacy we hold is that companies will prop up software that is open source for everyone to use, despite lacking community contributions.

      We as engineers should push ourselves to contribute when we find issues - rather than simply create tickets that represent work we want to have done for free. This is how open source software dies.

      There is a minority that does this.

    • scrappyjoe 1524 days ago
      Huh? The last commit to the H3 github repo was 3 days ago. In what sense is it abandoned? Genuinely interested as we are considering using it as a core library.
    • throwaway5752 1524 days ago
      Lots of good open source projects fail. Not every company is willing to open source code like this, though, and I'm very happy that Uber did so in this case.

      I get your frustration, but everyone should remember there are never any promises of support with open source software, regardless of how well supported it is at a particular time.

    • iblaine 1524 days ago
      Open sourcing projects is the new merit badge for engineers. But at least there’s more good than bad from it. Hudi is at least one Uber project I can point to off the top of my head that is a great idea.
  • monstrado 1525 days ago
    On a related note, one of their engineers wrote a POC that uses FoundationDB instead of their custom storage engine.

    https://github.com/richardartoul/tsdb-layer

    The README does a really good job explaining the internals and motivation.

  • tnolet 1525 days ago
    I get Uber is huge. But honestly, there was nothing out there that could fulfill their use case? Cassandra, ElasticSearch, Influx, etc.? I might be completely wrong, but I just highly doubt that.
    • hkarthik 1525 days ago
      I can give you an ex-insiders view on this.

      Uber made an early strategic decision to invest in on-premise infrastructure due to fears that either Amazon or Google would enter the on-demand market as competitors and bring their cloud infrastructure to bear and potentially squeeze us for costs. Azure wasn’t much of an option during this time. This decision limited our adoption of cloud-native solutions like Spanner and DynamoDB. We ended up doing a lot of sharded MySQL in our own data centers instead.

      This on-prem decision led to a lot of challenges internally where we would adopt OSS and then have difficulty scaling it to our needs. For some tech like Kafka it worked out, and we hired Kafka contributors who helped us scale it. For other tech like Cassandra it was a pretty epic failure. I am sure more of these war stories exist that I wasn’t privy to myself.

      Coupled with the fact that we were early adopters of Golang, which had its own OSS ecosystem, we found that writing a lot of our own infrastructure solutions was the only viable option at our scale.

      What you are seeing now is a lot of that home-grown infrastructure being open sourced in a big way, as people who have left Uber continue to see value in investing in the tech that they worked so hard to build. There is probably a nontrivial amount of work to scale the Uber OSS down for smaller use cases, but some startups are emerging to make that happen.

      Source: I worked at Uber from 2015-2019 on product and platform teams and had several close colleagues in infra.

      • pas 1525 days ago
        Netflix loves Cassandra, right? [0][1] So could someone describe why it wasn't a great fit for Uber? How come it was easier to reinvent the wheel in Go compared to cobbling together something with Cassandra/ES/Kafka (or other Java gadgets from the Hadoop ecosystem)?

        [0]: https://netflixtechblog.com/scaling-time-series-data-storage... [1]: https://www.datastax.com/resources/video/cassandra-netflix-a...

        • remote_phone 1524 days ago
          It was an epic failure because you need a team to support and guide Cassandra use properly but no one wanted to do the grunt work. The VP of infrastructure MM openly called it “toil vs talent”, meaning those that did the grunt work would be held in high esteem and get yearly bonuses, but the promotions would go to those with “talent”, ie creating new things.

          When people are openly and stupidly incentivized like this, expect those people to behave in a predictable way. People started building new services to get promotions instead of “toiling” at supporting their fellow engineers.

          It affected most of engineering, but especially in teams like Cassandra, where you needed guidance and support to use it effectively, it was a disaster. There should have been open office hours to help people with questions and to ensure that teams were using it properly, but there weren’t. Instead people were left to do what they wanted with no structure or guidance, and Cassandra was completely misused. Production problems ensued, people left the team because they didn’t want to be oncall fixing fires all the time, and eventually it came to the point where they decided to stop supporting it altogether. It was a complete disaster caused by very poor engineering management.

          We all knew that Netflix and Facebook use it without issues, but because of stupid management, it failed at Uber.

          • pas 1523 days ago
            Oh, wow, that sounds worse than bad. Thanks for elaborating on the "root cause" too!
        • roskilli 1525 days ago
          Netflix actually built their own metrics time series store, called Atlas, for reasons similar to Uber's for building M3DB (the FOSDEM talk mentions hardware reduction and oncall reduction); however, open source Atlas only has an in-memory store component, which was too expensive for Uber to run (since the dataset is in petabytes).

          https://github.com/Netflix/atlas

          • ckdarby 1524 days ago
            > which was too expensive for Uber to run (since the dataset is in petabytes).

            Ok, but I am fairly confident Netflix also is at that kind of scale.

            Netflix has a section on Atlas's documentation about how they get around this: https://github.com/Netflix/atlas/wiki/Overview#cost

            They also did this nice video that outlines their entire operation including how they do rollups: https://www.youtube.com/watch?v=4RG2DUK03_0

            This is how they do the rollup but keep their tails accurate to parts per million and the middle to be parts per hundred: https://github.com/tdunning/t-digest

            • roskilli 1524 days ago
              I want to first say, I have a great amount of respect for Netflix's engineering and for Atlas itself; it's great that it exists and is more accessible than other scalable in-memory TSDBs open sourced by large companies.

              A few thoughts on this, and this has come up before. Firstly, Netflix itself acknowledges that running an in-memory TSDB for metrics is expensive - for instance, Roy's Operations Engineering talk on Atlas says as much[0] at the 37min mark: "It scales kind of efficiently. I'd love to say efficiently instead of efficiently-ish however that's hard to claim when my platform until this last quarter cost Netflix more than any other element of the cloud ecosystem ... Atlas and the associated telemetry costs Netflix 100s of thousands of dollars a week".

              At Uber, M3 cost a significant amount to run at first as well, and that is why M3DB was born: to drive that cost down as much as possible while still providing a ton of instrumentation to engineers. Either way, giving engineers tons of room to instrument their code will result in a high cost no matter what, since it will be viewed as a free lunch. That is why squeezing the economics on this matters: you want to provide as much instrumentation as possible at the lowest cost.

              Regarding your points about their documentation on cost:

              1) Yes, reducing cardinality by dropping the node dimension on metrics, etc. is one way to save cost - but keeping things on disk is an alternative, and great, way to save cost while keeping the data at high fidelity. The challenge is making on-disk lookup fast too, which is what we were focused on with M3DB.

              2) Dropping replication of the data to a single replica is another way to save cost, however it also comes with operational complexity, as now you need to do backup/restore if you lose data and you lose the ability to query that data in the meantime. This is why M3DB is always recommended (as per the documentation) to run at RF=3 with quorum reads and writes, so losing a single machine does not impact the availability of your operational monitoring and alerting platform.

              3) Regarding rollups and keeping the tails accurate, we always push for people to use histograms, as they can be aggregated over any arbitrary time window and across time series (a short sketch of merging fixed-bucket histograms follows after the footnotes). T-digests are much more expensive to store raw and aggregate later. Bjorn talked about histograms, their use in Prometheus, and why they're more desirable than t-digests or other similar aggregations at FOSDEM[1].

              [0]: https://www.infoq.com/presentations/netflix-monitoring-syste... (video, quote is at 37minutes in)

              [1]: https://fosdem.org/2020/schedule/event/histograms/ (slides and videos)
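
              To make point 3 concrete, here is a minimal, hypothetical Go sketch (not M3 or Prometheus code) of why fixed-bucket histograms aggregate cheaply: as long as the bucket boundaries are shared, merging histograms across series or across time windows is just an element-wise sum of counts. Quantiles are then estimated from the merged buckets, with error bounded by bucket width rather than compounding with each re-aggregation.

                  package main

                  import "fmt"

                  // Histogram is a hypothetical fixed-bucket histogram: Counts[i] is the number
                  // of observations <= UpperBounds[i] (cumulative, Prometheus-style).
                  type Histogram struct {
                      UpperBounds []float64
                      Counts      []uint64
                  }

                  // Merge adds other's counts into h. Because the buckets are fixed and shared,
                  // aggregation is a plain element-wise sum, with no loss beyond the original
                  // bucket resolution - which is what makes arbitrary rollups cheap.
                  func (h *Histogram) Merge(other Histogram) {
                      for i := range h.Counts {
                          h.Counts[i] += other.Counts[i]
                      }
                  }

                  func main() {
                      bounds := []float64{0.1, 0.5, 1, 5}
                      a := Histogram{UpperBounds: bounds, Counts: []uint64{10, 14, 16, 17}}
                      b := Histogram{UpperBounds: bounds, Counts: []uint64{7, 16, 16, 19}}
                      a.Merge(b)
                      fmt.Println(a.Counts) // [17 30 32 36]
                  }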

              • pas 1523 days ago
                Thanks for the FOSDEM link. I know the videos are out, but just looking at the schedule to find the interesting talks took more time than I wanted to spend on it. (The conference became so huge.)

                Maybe Bjorn's talk has the answers, but would you mind explaining how histograms are easy to aggregate? Don't you need either fixed buckets or raw data to produce a new histogram over a different dataset? (I know there are tricks to get great estimates, but naturally every re-aggregation would add larger and larger +/- intervals, no?)

    • roskilli 1525 days ago
      As per the sibling comment, they do most definitely work until they don't. M3 actually started with ElasticSearch and Cassandra for index and storage respectively, but they were then replaced with M3DB. I mentioned the FOSDEM talk elsewhere in the thread, but you might be interested in the evolution segment where it's mentioned that ”With M3DB 7x less servers from Cassandra, while increasing RF=2 to RF=3”, and something that's not on the slides but is in the talk is a reference to an order of magnitude reduction in operational overhead (incidents/oncall debugging). Both slides and video are linked from the FOSDEM talk's page https://fosdem.org/2020/schedule/event/m3db/.
    • buro9 1525 days ago
      It's a database for a metric platform.

      Think of OpenTSDB and Prometheus. Or for a better comparison think of Thanos https://thanos.io/

      As to whether they could fulfil Uber's needs, the thing about scale (real massive scale - I work at Cloudflare) is that everything breaks in weird ways according to your specific uses of a technology. The things listed above work for companies, until they don't. There are few things that seem to truly work at every scale; Kafka and ClickHouse come to mind, for wholly different use cases than a time series database.

      • manigandham 1524 days ago
        Clickhouse (and other columnstore RDBMS) are all perfectly fine for time-series and usually better than the standard options because they have SQL querying.
      • linuxhansl 1524 days ago
        We run OpenTSDB (which stores its data in Apache HBase) at scale.

        150m-200m events/minute and about 20-30 trillion (10^12) events stored. Doubling about every 12-18 months or so.

        While it's true that things start to creak at scale, this has worked remarkably well for us so far. I doubt M3DB is somehow magical in this regard.

        • roskilli 1524 days ago
          M3DB ingested 30 million datapoints per second (so 1.8 billion per minute) with each node writing hundreds of thousands of writes per second. The dataset was in the petabytes.

          For us the cost savings vs OpenTSDB (millions of dollars of hardware), the faster query time and the reduction in oncall overhead was worthwhile.

          • shaklee3 1524 days ago
            Hundreds of thousands per second isn't very high when you compare that to clickhouse or kdb+.
      • 1996 1524 days ago
        > ClickHouse come to mind for wholly different use cases than a time series database.

        ClickHouse works fine as a TSDB if you don't mind getting a little dirty

        • valyala 1524 days ago
          There is a TSDB solution if you don't want to get a little dirty - VictoriaMetrics [1]. It is built on the same principles as ClickHouse [2].

          [1] https://github.com/VictoriaMetrics/VictoriaMetrics/blob/mast...

          [2] https://medium.com/@valyala/how-victoriametrics-makes-instan...

          • 1996 1520 days ago
            "-retentionPeriod - retention period in months for the data. Older data is automatically deleted. Default period is 1 month."

            Not sure I want that in a TSDB!

            "Reducing disk space usage by deleting unneded time series. This doesn't work as expected, since the deleted time series occupy disk space until the next merge operation, which can never occur."

            Ouch. But ok, disk space is cheap.

            The killer point: it seems to be purely JSON-based - no SQL of any kind. I'm not sure about that. A lot of code would have to be changed to fit that model.

    • cube2222 1525 days ago
      Having deployed m3 recently, I’ve not found an alternative which is cost effective and fast at the same time. Granted, it uses a lot of memory, but other than that I’ve been incredibly happy with it.
      • hagen1778 1524 days ago
        Since you've done your research, would you mind posting a short list of alternatives and the reasons why they were rejected? Thanks!
        • cube2222 1524 days ago
          This was as of November.

          Raw Prometheus: Isn't able to hold my data.

          Thanos: I liked the project, its architecture and ease of deployment, but after spending a non-trivial amount of time with it I wasn't able to set up any long-term caching. Thanos uses the Prometheus storage format, so whenever I was querying one metric, it was downloading all metrics which were in the same block (all metrics, basically, afaik). This resulted in gigabytes/s of network traffic where it definitely wasn't necessary, and fairly long query times. (I used it with Ceph.) Though I know the maintainers were planning to add some kind of caching, so this may be fixed. By using the native Prometheus data format you also don't get storage space savings over it.

          Cortex: Didn't spend any time on it, as I expected similar problems as with Thanos, so I left it for the end (which didn't come after all). I know it does contain a caching element.

          Victoria Metrics: As far as I know it's very well engineered and performs great. But I see only one active maintainer so am afraid to use it.

          M3DB: Requires a non-trivial amount of memory (I have 3 machines, each with 128GB RAM, to handle 70k writes/s each (though one was able to handle 120k and stay stable)). However, with all machines on bunches of RAID 0 SSDs, querying is quite snappy. You can set it up with different storage resolutions, so you get detailed data for recent queries, but also fast long-range queries. It also uses an order of magnitude less storage space than raw Prometheus. The documentation is lacking in my opinion in terms of performance tuning; however, the code is well written, so I've just spent a while reading it, and it exports very good metrics for itself. Network traffic between the m3coordinator (the Prometheus remote write gateway) and m3db nodes is kinda huge (5-10x the traffic prometheus->gateway) but that wasn't an issue. Another bonus is that it handles statsd metrics, though I haven’t yet tried that.

          For anybody afraid of it operationally, I’ve had no problems. It mostly worked as is.

          • cube2222 1524 days ago
            Another comment on m3db:

            m3query did have some inconsistencies compared to Prometheus in how queries using intervals were evaluated (and sometimes didn't return any data because of it).

            Having stepped through the code I don't remember the reason why that was, but I ended up using Prometheus instances doing remote_read from the m3coordinator (gateway), and that works like a charm.

          • bboreham 1524 days ago
            Cortex has a design much more like M3DB than Thanos; I wouldn’t expect the same problems as Thanos at all.

            (I am a Cortex maintainer)

          • hagen1778 1524 days ago
            Thanks for details! Really appreciate it.

            > It also uses a magnitude of storage space less than raw prometheus

            AFAIK, Prometheus compression is about 1.2-3 bytes per datapoint. A magnitude less is 0.12-0.3 bytes - are these numbers correct?

            • cube2222 1524 days ago
              Here you have the specifics: https://m3db.github.io/m3/m3db/architecture/engine/

              I admit I’ve exaggerated a bit, as Prometheus doesn’t support downsampling; in m3db I only keep 2 weeks of data at full resolution, 2 months at lower, and 5 years at even lower.
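
              For what it's worth on the raw-compression side: M3DB stores values with M3TSZ, its variation of the Gorilla/TSZ XOR scheme for float64 samples, which is where most of the per-datapoint savings come from. A very rough, hypothetical Go sketch of the core XOR idea (the real encoder also compresses timestamps and packs variable-length bit runs, and the header sizes below are illustrative only):

                  package main

                  import (
                      "fmt"
                      "math"
                      "math/bits"
                  )

                  // xorCostBits estimates the per-sample cost under a Gorilla/TSZ-style XOR
                  // scheme: a repeated value costs ~1 control bit, a similar value costs only
                  // its differing "meaningful" bits plus a small header.
                  func xorCostBits(prev, cur float64) int {
                      x := math.Float64bits(prev) ^ math.Float64bits(cur)
                      if x == 0 {
                          return 1
                      }
                      meaningful := 64 - bits.LeadingZeros64(x) - bits.TrailingZeros64(x)
                      return 2 + 6 + 6 + meaningful // control bits + leading/length headers + payload
                  }

                  func main() {
                      series := []float64{12, 12, 12, 12.5, 24}
                      total := 64 // first value stored raw
                      for i := 1; i < len(series); i++ {
                          total += xorCostBits(series[i-1], series[i])
                      }
                      fmt.Printf("~%d bits for %d samples, vs %d bits uncompressed\n",
                          total, len(series), 64*len(series))
                  }

              Metrics that change slowly or repeat compress down to a bit or two per sample, which is roughly where the low bytes-per-datapoint figures discussed above come from once timestamps are included.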

              • sciurus 1524 days ago
                Downsampling might not save as much space as you think depending on how m3db works. https://github.com/thanos-io/thanos/issues/813 goes in to why for a similar project.
                • roskilli 1524 days ago
                  Whether it saves space or not, looking at metrics over a period of months or years when the data is raw is far slower and more expensive than looking at downsampled data.

                  If you still want to be able to quickly graph and view old data, downsampling is the only way to keep your queries interactive.

                  Take for instance 30s data vs 10min data. 20x more computation, network exchange and everything else of that nature needs to happen.

                  Also if you want to keep only a subset of your data for a very long time, you need to have retention policies - otherwise you end up storing all that extra data forever.

                  At large numbers (terabytes to petabytes) this stuff is impactful, at smaller numbers (gigabytes) my points here are far less relevant.
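
                  To illustrate the 20x point with code, a tiny, hypothetical Go sketch (not M3DB's actual aggregation) that rolls 30s datapoints up into 10-minute averages - the query side then touches roughly 20x fewer points over long ranges:

                      package main

                      import "fmt"

                      // Datapoint is a hypothetical (unix-seconds timestamp, value) pair.
                      type Datapoint struct {
                          Timestamp int64
                          Value     float64
                      }

                      // downsampleAvg rolls raw points up into fixed windows (e.g. 600s) by
                      // averaging; retention policies then keep only these rollups long term.
                      func downsampleAvg(points []Datapoint, windowSec int64) []Datapoint {
                          sums := map[int64]float64{}
                          counts := map[int64]float64{}
                          for _, p := range points {
                              bucket := p.Timestamp - p.Timestamp%windowSec
                              sums[bucket] += p.Value
                              counts[bucket]++
                          }
                          out := make([]Datapoint, 0, len(sums))
                          for ts, sum := range sums {
                              out = append(out, Datapoint{Timestamp: ts, Value: sum / counts[ts]})
                          }
                          return out
                      }

                      func main() {
                          raw := []Datapoint{{0, 1}, {30, 3}, {60, 5}, {600, 7}, {630, 9}}
                          fmt.Println(downsampleAvg(raw, 600)) // two 10-minute buckets: averages 3 and 8
                      }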

                • cube2222 1524 days ago
                  I know about the Thanos problem, and it's one of the things I didn't really like about its use of the Prometheus storage format.

                  It does work in m3db.

                  Following are the on-disk sizes of one replica:

                  2 weeks at 15s res: 90G

                  2 months at 1m res: 160G

                  3 months at 5m res: 60G

                  EDIT: The only thing is that m3db doesn't really downsample. You just create one namespace (table) for each resolution, and set the m3coordinator up so that it writes to each at the wanted interval, then set up a different retention for it. (this way you have duplicates in recent data)

                  M3 namespaces aren't set up for a specific resolution. The writer decides and you can write various series in different resolutions to one namespace in theory.

          • sagichmal 1524 days ago
            > Victoria Metrics: As far as I know it's very well engineered and performs great

            As someone who has auditioned it, briefly, let me assure you that it is certainly not the former, and only appears to be the latter due to a lot of cut corners and spec-violating implementations.

            • cube2222 1524 days ago
              Could you elaborate on the specifics? I'd be very interested.
      • jandrewrogers 1524 days ago
        Not M3 specific, but this is the summary in a nutshell. If you need both scale/performance and operational cost efficiency at the same time, there is not much in open source for you.
    • whyreplicate 1525 days ago
      Cassandra and ElasticSearch would probably have been fine except that Uber dramatically under-provisioned the hardware used for them. The database redundancy was so low that any minor hardware issue could quickly turn into a major outage for all of Uber's monitoring services.
      • roskilli 1525 days ago
        Well, if you’re going to run at RF=2 and push them when they can only do 60,000 writes per second, versus multiple hundreds of thousands per second with specialized software on the same hardware, it’s hard to justify using tens of millions of dollars more of hardware to run Cassandra.

        • apta 1524 days ago
          What do you think of ScyllaDB then? Would it have been able to do the job?
          • roskilli 1524 days ago
            I think ScyllaDB would have definitely done better than Cassandra (which we were using alongside ElasticSearch), although another thing I mention in this thread is that a lot of existing distributed databases do not have a multi-dimensional inverted index available that can index keys in their primary storage engine.

            This makes it tough to use ScyllaDB, ClickHouse or Cassandra on their own for metrics workloads at scale, since those workloads need to find a needle in a haystack - a few thousand time series amongst a set of millions to billions, where users specify only a subset of the dimensions on the metrics, in any order they want. This is hard to do without an inverted index.

    • manigandham 1524 days ago
      Any columnstore RDBMS would work great. Clickhouse, MemSQL, KDB, Greenplum, Vertica, etc. Fast, efficient, and with the full flexibility of SQL queries.
    • jandrewrogers 1524 days ago
      It is relatively common for companies to not use open source for this type of application above a certain scale, even if they use open source for most other things. I've seen it happen at multiple companies big and small. There are two major reasons for this.

      The first reason is that open source platforms struggle beyond a certain scale due to architectural weaknesses, which becomes an ongoing operational headache. Most companies just deal with it but it gets worse as the workload grows.

      The second reason is that it is expensive to run the open source platforms due to their very low efficiency. I've seen companies reduce their hardware footprint by a factor of 10 by rolling their own metrics/time-series implementations due solely to superior software design. When you are running a petabyte of metrics per day through these systems, that adds up to a lot of money.

      tl;dr: it is technically straightforward for a company to design their own metrics infrastructure that massively outperforms the open source tooling, and the limitations of the open source implementations are often painful enough as the data models scale up that many companies do.

  • synack 1524 days ago
    I set up a lot of Uber's early metrics infrastructure, so I can speak to how they got to the place where building a custom solution was the right answer.

    In the beginning, we didn't really have metrics, we had logs. Lots of logs. We tried to use Splunk to get some insight from those. It kinda worked and their sales team initially quoted a high-but-reasonable price for licensing. When we were ready to move forward, the price of the license doubled because they had missed the deadline for their end of quarter sales quota. So we kicked Splunk to the curb.

    Having seen that the bulk of our log volume was noise and that we really only cared about a few small numbers, I looked for a metrics solution at this point, not a logs solution. I'd operated RRDtool-based systems at previous companies, and that worked okay, but I didn't love the idea of doing that again. I had seen Etsy's blog about statsd and set up a statsd+carbon+graphite instance on a single server just to try it out and get feedback from the rest of the engineering team. The team very quickly took to Graphite and started instrumenting various codebases and systems to feed metrics into statsd.

    statsd hit capacity problems first, as it was a single-threaded Node.js process and used UDP for ingest, so once it approached 100% CPU utilization, events got dropped. We switched to statsite, which is pretty much a drop-in replacement written in C.

    The next issue was disk I/O. This was not a surprise. Carbon (Graphite's storage daemon) stores each metric in a separate file in the whisper format, which is similar to RRDtool's files, but implemented in pure Python and generally a bit easier to interact with. We'd expected that a large volume of random write ops on a spinning disk would eventually be a problem. We ordered some SSDs. This worked okay for a while.

    At this point, the dispatch system was instrumented to store metrics under keys with a lot of dimensions, so that we could generate per-city, per-process, per-handler charts for debugging and performance optimization. While very useful for drilling down to the cause of an issue, this led to an almost exponential growth in the number of unique metrics we were ingesting. I set up carbon-relay to shard the storage across a few servers - I think there were three, but it was a long time ago. We never really got carbon-relay working well. It didn't handle backend outages and network interruptions very well, and would sometimes start leaking memory and crash, seemingly without reason. It limped along for a while, but wasn't going to be a long-term solution.

    We started looking for alternatives to carbon, as we wanted to get away from whisper files... SSDs were still fairly expensive, and we believed that we should be able to store an append-only dataset on spinning disks and do batch sequential writes. The infrastructure team was still fairly small and we didn't have the resources to properly maintain a HBase cluster for OpenTSDB or a Cassandra cluster, which would've required adapting carbon- I understand that Cassandra is a supported backend these days, but it was just an idea on a mailing list at that point.

    InfluxDB looked like exactly what we wanted, but it was still in a very early state, as the company had just been formed weeks earlier. I submitted some bug reports but was eventually told by one of the maintainers that it wasn't ready yet and I should quit bugging them so they could get to MVP.

    Right around this time, we started having serious availability issues with metrics, both on the storage side- I estimated we were dropping about 60% of incoming statsd events, and on the query side- Graphite would take seconds-to-minutes to render some charts and occasionally would just time out. We had also built an ad-hoc system for generating Nagios checks that would poll Graphite every minute to trigger threshold-based alerts, which would make noise if Graphite was down and the monitored system was not. This led to on-call fatigue, which made everybody unhappy.

    We started running an instance of statsite on every server which would aggregate the individual events for that server into 10 second buckets with the server's hostname as a key prefix, then pushed those to carbon-relay. This solved the dropped packets issue, but carbon-relay was still unreliable.
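
    For anyone who hasn't seen that pattern, here's a rough, hypothetical Go sketch of the per-host aggregation idea (statsite itself is C and does far more): accept statsd-style counter packets over UDP, sum them for 10 seconds, then emit them with the hostname as a key prefix.

        package main

        import (
            "fmt"
            "net"
            "os"
            "strconv"
            "strings"
            "time"
        )

        func main() {
            host, _ := os.Hostname()
            conn, err := net.ListenPacket("udp", ":8125") // statsd's usual port
            if err != nil {
                panic(err)
            }
            counters := map[string]float64{}
            flush := time.NewTicker(10 * time.Second) // 10-second buckets, as described above
            buf := make([]byte, 1500)
            for {
                select {
                case <-flush.C:
                    for name, v := range counters {
                        // Prefix with the host so per-server series stay distinct downstream.
                        fmt.Printf("%s.%s %v %d\n", host, name, v, time.Now().Unix())
                    }
                    counters = map[string]float64{}
                default:
                    conn.SetReadDeadline(time.Now().Add(100 * time.Millisecond))
                    n, _, err := conn.ReadFrom(buf)
                    if err != nil {
                        continue // read timeouts just loop back around
                    }
                    // statsd counter format: "metric.name:123|c"
                    parts := strings.SplitN(strings.TrimSpace(string(buf[:n])), ":", 2)
                    if len(parts) != 2 {
                        continue
                    }
                    valueAndType := strings.SplitN(parts[1], "|", 2)
                    v, err := strconv.ParseFloat(valueAndType[0], 64)
                    if err == nil {
                        counters[parts[0]] += v
                    }
                }
            }
        }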

    We were pretty entrenched in the statsd+graphite way of doing things at this point, so switching to OpenTSDB wasn't really an option and we'd exhausted all of the existing carbon alternatives, so we started thinking about modifying carbon to use another datastore. The scope of this project was large enough that it wasn't going to get built in a matter of days or weeks, so we needed a stopgap solution to buy time and keep the metrics flowing while we engineered a long term solution.

    I hacked together statsrelay, which is basically a re-implementation of carbon-relay in C, using libev. At this point, I was burned out and handed off the metrics infrastructure to a few teammates that ran with statsrelay and turned it into a production quality piece of code. Right around the same time, we'd begun hiring for an engineering team in NYC that would take over responsibility for metrics infrastructure. These are the people that eventually designed and built M3DB.

    • 4d617832 1524 days ago
      Really interesting read for me. I am currently not so far from your SSD point, but our setup still works fine most of the time. It’s just 100k/m though. I am trying to use more of the Go implementations of the graphite stack, which did improve load. I will probably consider m3db to get some benchmarks. All the other options would require some more people, as you said.
    • winrid 1524 days ago
      Awesome story, thanks.
  • ksec 1524 days ago
    How does it compare to TimescaleDB [1] ?

    [1] https://www.timescale.com

    • akulkarni 1524 days ago
      TimescaleDB co-founder here.

      TimescaleDB is a more versatile time-series database. It supports a variety of datatypes (text, ints, floats, arrays, json), allows for out-of-order writes and backfilling of old data, supports full SQL, JOINs between tables (eg for metadata), flexible continuous aggregates, native compression, and is backed by the reliability of Postgres. [0]

      M3DB seems much more limited in scope [1]:

      "Current Limitations

      Due to the nature of the requirements for the project, which are primarily to reduce the cost of ingesting and storing billions of timeseries and providing fast scalable reads, there are a few limitations currently that make M3DB not suitable for use as a general purpose time series database.

      The project has aimed to avoid compactions when at all possible, currently the only compactions M3DB performs are in-memory for the mutable compressed time series window (default configured at 2 hours). As such out of order writes are limited to the size of a single compressed time series window. Consequently backfilling large amounts of data is not currently possible.

      The project has also optimized the storage and retrieval of float64 values, as such there is no way to use it as a general time series database of arbitrary data structures just yet."

      [0] https://www.timescale.com/

      [1] https://m3db.github.io/m3/m3db/#current-limitations

  • jmakov 1525 days ago
    So how does this compare to e.g. Clickhouse?
    • bdcravens 1525 days ago
      Clickhouse is an analytic column-based RDBMS. It's not a timeseries database. Each class of product is used to solve different problems.
      • manigandham 1524 days ago
        Time-series data is just data where every record has a "time" field. That's it.

        Any database can handle it, and columnstore RDBMS are designed to store and query trillion-row tables with full SQL functionality. The only advantage a "time-series" database gives you is some time-based query operators (like gap filling, last value, smoothing, etc). Those are now being added to SQL support for RDBMS so there's really nothing to be gained from a time-series database anymore.
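
        As a concrete example of that kind of operator, here is a small, hypothetical Go sketch (not tied to any particular database) of gap filling with last-observation-carried-forward over a fixed step - the sort of helper a time-series query layer adds on top of ordinary storage:

            package main

            import "fmt"

            // Sample is a hypothetical (unix seconds, value) pair from a sparse series.
            type Sample struct {
                TS    int64
                Value float64
            }

            // gapFillLOCF produces one sample per step between start and end, carrying the
            // last observed value forward across gaps. (Before the first observation this
            // toy version just emits zero.)
            func gapFillLOCF(in []Sample, start, end, step int64) []Sample {
                out := []Sample{}
                i := 0
                last := 0.0
                for ts := start; ts <= end; ts += step {
                    for i < len(in) && in[i].TS <= ts {
                        last = in[i].Value
                        i++
                    }
                    out = append(out, Sample{TS: ts, Value: last})
                }
                return out
            }

            func main() {
                sparse := []Sample{{0, 1.0}, {30, 2.0}, {90, 5.0}} // nothing observed at t=60
                fmt.Println(gapFillLOCF(sparse, 0, 120, 30))       // t=60 reuses 2.0, t=120 reuses 5.0
            }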

        • akulkarni 1524 days ago
          "Time-series data is just data where every record has a "time" field. That's it."

          That's a pretty big simplification. It's like saying one could go running in dress shoes. (Yes, it's possible, but don't you want to use the right tool for the job?)

          As one time-series database example, because TimescaleDB [0] is focused on time-series data, its users benefit from [1]:

          * 40-50x compression for metrics data (so storage costs for compressed data are 2-2.5% what they would normally be)

          * Versatile continuous aggregate policies

          * Variable data retention policies

          * Overall much more efficient compute and memory utilization (because of faster insert and query rates)

          * And yes, also time-based query operators for gap filling, first/last, LOCF, etc

          (And I'm sure roskilli could describe M3DB's own advantages over non-time-series DBs.)

          [0] Disclaimer to other readers, I'm a co-founder (although OP already knows this, as we've jousted on HN before :-) )

          [1] All of our benchmarks and other engineering notes are published here: https://blog.timescale.com/tag/engineering/

          • manigandham 1524 days ago
            I'd say you started with a standard RDBMS and added the other features like sharding, column-oriented storage, time-series helper functions to SQL, and more to it to end up similar to the other native column stores but with a more flexible and popular Postgres frontend.

            At the very least, I think we both agree that the relational/SQL options work just fine compared to limited time-series databases like Influx. And for the record, we do use timescale so you've won me over on the PG usability front.

            • akulkarni 1524 days ago
              "And for the record, we do use timescale so you've won me over on the PG usability front."

              Great! Didn't realize that.

              And yes, I would agree that a relational/SQL time-series database like TimescaleDB can work quite well. :-)

          • shaklee3 1524 days ago
            It seems clickhouse also has most of those features, but is not considered a time series database. Is that wrong?
      • mbell 1524 days ago
        Clickhouse has a table engine for graphite. We've used it for a couple of years now, after outgrowing InfluxDB and working around it several times. Clickhouse works _extremely_ well for graphite data; it can handle several orders of magnitude more load than Influx in my experience.
      • aeyes 1525 days ago
        Clickhouse works exceptionally well as a TSDB.
        • roskilli 1524 days ago
          While this is true, for a metrics workload it does not work great, from what I have both seen myself and heard from others, mainly because it does not have an inverted index. Finding a small subset of metrics in a dataset of billions of metrics ends up taking significant time, due to the scan required to match the arbitrary set of dimensions specified for the timeseries you're looking for.

          If you're building it for a specific application with a concrete schema you can design for fast queries, and you don't need lookups by arbitrary dimensions, then yes, it's great as a TSDB.

          Prometheus, M3DB, etc all use an inverted index alongside the column store TSDB to help with metrics workloads.
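
          As a hedged sketch (not M3DB's or Prometheus's actual code) of why the inverted index helps: each label=value pair maps to a postings list of series IDs, and a query with several dimensions intersects those lists instead of scanning every series in the time window.

          ```go
          package main

          import "fmt"

          // Index maps "label=value" terms to sorted postings lists of series IDs.
          // Series must be added in increasing ID order to keep the lists sorted.
          type Index struct {
              postings map[string][]int
          }

          func (ix *Index) Add(id int, labels map[string]string) {
              for k, v := range labels {
                  key := k + "=" + v
                  ix.postings[key] = append(ix.postings[key], id)
              }
          }

          // Lookup intersects the postings lists of all requested dimensions.
          func (ix *Index) Lookup(matchers map[string]string) []int {
              var result []int
              first := true
              for k, v := range matchers {
                  list := ix.postings[k+"="+v]
                  if first {
                      result = append([]int(nil), list...)
                      first = false
                      continue
                  }
                  result = intersect(result, list)
              }
              return result
          }

          // intersect merges two sorted ID lists, keeping only common entries.
          func intersect(a, b []int) []int {
              var out []int
              i, j := 0, 0
              for i < len(a) && j < len(b) {
                  switch {
                  case a[i] < b[j]:
                      i++
                  case a[i] > b[j]:
                      j++
                  default:
                      out = append(out, a[i])
                      i++
                      j++
                  }
              }
              return out
          }

          func main() {
              ix := &Index{postings: map[string][]int{}}
              ix.Add(1, map[string]string{"service": "api", "host": "web-01"})
              ix.Add(2, map[string]string{"service": "api", "host": "web-02"})
              ix.Add(3, map[string]string{"service": "db", "host": "web-01"})
              fmt.Println(ix.Lookup(map[string]string{"service": "api", "host": "web-01"})) // [1]
          }
          ```

          Real implementations compress the postings lists (e.g. with roaring bitmaps) and support regex matchers, but the intersection idea above is the core of why arbitrary-dimension lookups stay cheap relative to a full scan.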

          • mbell 1524 days ago
            Most practical applications using ClickHouse for metrics store the metric index separately. Which index you want really depends on the metric system; e.g. with Graphite data you don't want an inverted index, you want a trie.
            • roskilli 1524 days ago
              Yes, I've seen that work too. It's a lot of stitching things together yourself, and we had to put a lot of caching in front of the inverted index we were using, but it's definitely plausible. ClickHouse also doesn't do any streaming of data between nodes as you scale up and down, which was a big thing for us since we had large datasets and needed to rebalance when the cluster expanded or shrank.

              With regard to trie vs inverted index for Graphite data, I'd actually still be inclined to say the inverted index is better, based on the number of queries I saw at Uber where people did `servers.*.disk.bytes-used` style queries. These are way faster with an inverted index, since you have a postings list for each part of the dot-separated metric name, rather than traversing a trie with thousands to tens of thousands of entries at index 1 (the host part of the Graphite name). This is what M3DB does[0].

              [0]: https://github.com/m3db/m3/blob/b2f5b55e8313eb48f023e08f6d53...
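
              As a small hedged sketch of that layout (not M3DB's actual schema): index each Graphite name with one term per dot-separated position, and a query like `servers.*.disk.bytes-used` then only needs the postings lists for the concrete positions, skipping the wildcarded host segment entirely.

              ```go
              package main

              import (
                  "fmt"
                  "strings"
              )

              // terms turns "servers.web-01.disk.bytes-used" into
              // ["0:servers", "1:web-01", "2:disk", "3:bytes-used"], one indexable
              // term per dot-separated position.
              func terms(metric string) []string {
                  parts := strings.Split(metric, ".")
                  out := make([]string, 0, len(parts))
                  for i, p := range parts {
                      out = append(out, fmt.Sprintf("%d:%s", i, p))
                  }
                  return out
              }

              // queryTerms does the same for a query pattern but drops wildcard
              // segments, leaving only the postings lists that need intersecting.
              func queryTerms(pattern string) []string {
                  var out []string
                  for i, p := range strings.Split(pattern, ".") {
                      if p == "*" {
                          continue
                      }
                      out = append(out, fmt.Sprintf("%d:%s", i, p))
                  }
                  return out
              }

              func main() {
                  fmt.Println(terms("servers.web-01.disk.bytes-used"))
                  fmt.Println(queryTerms("servers.*.disk.bytes-used")) // [0:servers 2:disk 3:bytes-used]
              }
              ```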

              • idjango 1524 days ago
                Just to point out that there is an inverted index implementation for Graphite data running on ClickHouse.

                Regarding the auto-rebalance feature, I couldn't agree with you more. It's something that ClickHouse definitely needs to handle internally.

                • roskilli 1523 days ago
                  That's interesting; I had not heard of ClickHouse as a backend for Graphite with an inverted index. Let me know if you have any links to that.

                  I'm assuming this is an out-of-process inverted index used alongside ClickHouse? Or is it more of a secondary table inside ClickHouse that can be searched to find the metrics, after which the data is looked up?

                  The latter doesn't scale as well with billions of unique metrics, since it's always a scan across the unique metrics stored in the time window your query covers (since any arbitrary dimensions can be specified, all must be evaluated). This is the drawback of PromHouse, which is an implementation of Prometheus remote storage on top of ClickHouse - and the major reason why PromHouse was only ever a proof of concept rather than a production offering.

        • idjango 1524 days ago
          I can also confirm that. Several companies have successfully transitioned their monitoring stack from Graphite's initial Python implementation to a ClickHouse-based backend.
          • rixed 1524 days ago
            Not to bad-mouth ClickHouse, but the original Python implementation of Graphite + Carbon set the bar very low, and transitioning from there to almost anything would have increased performance by orders of magnitude.
            • idjango 1524 days ago
              You should read the link below [1]. Even if it's not Uber scale, I suspect Yandex uses something similar.

              I agree that the Python implementation of Graphite was not particularly fast, but there were faster implementations in C that companies used first to significantly increase performance. Then coordination of the storage backend becomes complex when you try to scale the initial design. This is where ClickHouse really shines: it provides out-of-the-box distributed storage with compaction, rollup and fast querying. The other layers are stateless, which means they'll scale with your computing resources.

              M3DB is roughly doing the same thing as ClickHouse, but ClickHouse is a much more advanced database with a proven record of running at petabyte scale without breaking a sweat. For example, it now has tiered storage, which means you can store recent events on NVMe and roll up to standard HDDs...

              [1] https://medium.com/avitotech/metrics-storage-how-we-migrated...

        • jmakov 1524 days ago
          That is also my experience. Also, in benchmarks it is almost as fast as GPU analytical DBs or kdb+.
      • jmakov 1524 days ago
        Hm... I would say that the workload is the same, is it not? After all, Yandex is using it for both logs and metrics.
  • MichaelRazum 1525 days ago
    OK, so none of the existing open-source options was good enough. Please publish a simple benchmark; without one, it is hard to make decisions.
  • katzgrau 1524 days ago
    Nice... patiently waits for AWS to create a managed version of it...
  • TheRealPomax 1525 days ago
    admins/mods: this needs an apostrophe to turn it into "Uber's M3DB".

    For anyone who's never heard of M3DB and lives in a place where Uber doesn't operate or is even banned (and so isn't part of daily life or conversation), "Ubers" might just as easily be some DB researcher affiliated with the university of who-knows-where, showing off something they came up with last summer and got a grant for.

    • tlb 1524 days ago
      Fixed, thanks
  • heliodor 1525 days ago
    When the Android app is broken in so many easy-to-fix ways that blatantly interfere with usability, how does a company allow its developers to spend time on making custom internal tools or even spend time open-sourcing them? The company has so much money and yet seems so utterly mismanaged.
    • rossjudson 1524 days ago
      Sounds like off-the-shelf tooling just didn't work. What's your solution for that?
      • heliodor 1522 days ago
        In the grander scheme that I'm discussing, the solution you're asking about is to gather no more metrics than the off-the-shelf tools allow and spend the engineering effort on fixing the simple bugs that prevent customers from getting cars. Why do they need such a mountain of metrics when they can't fix simple things such as:

        - the car icon doesn't move as location updates come in (as evidenced by the route line getting shorter)

        - the on-screen keyboard does not allow me to type anything after the first message (no other app on my phone has this problem)

        - after I rate a driver, the app shows a map and none of the UI. I have to kill the app and restart it in order for it to be usable again. Picture this: I'm trying to get a ride, I open the app, I get nagged to rate the last driver. I agree just to be nice to the driver. (I should skip instead and get back to my task of getting a ride). After putting up with the nagware, the app fails 100% and I cannot complete the original task!

        Alternatively, if they demand the collection of so many metrics, why don't they collect the metrics that would show them just how pathetically broken their Android app is?

  • clircle 1525 days ago
    Does "Time Series Database" mean anything technical, or is this just some Uber marketing? In statistics, time series has a technical meaning.
    • bostik 1524 days ago
      TSDBs are a special case of databases. And oh boy, time-series is hard.

      Your regular RDBMS is going to be either write-heavy or read-heavy. You can pretty easily[ß] optimise the database for one of these utilisation patterns. But a TSDB basically combines the worst of both worlds: telemetry at any scale is important, and monitoring reliability in an always-online system is not optional.

      TSDBs are written to very frequently; even at a reasonably low scale we could be talking about a couple of hundred thousand writes every few seconds. But because they are also used for system-wide monitoring, they are read from all the time.

      ß: a read-heavy regular DB has the ratio of reads:writes in thousands, perhaps millions; a write-heavy DB can be read from a couple of times every few seconds, but can be written to at a rate of tens of thousands of entries per second. You - or your expensive DBA - can optimise the DB for one of these patterns, but not for both. TSDBs have to support both patterns at the same time, so their internals have been geared to this one specific domain.

      • rixed 1524 days ago
        > But because they are also used for system-wide monitoring, they are read from all the time.

        And this is the billion-dollar mistake of the current devops culture. I believe we are doing monitoring wrong. Real-time monitoring needs no persistent storage. Troubleshooting does need persistent storage, but monitoring does not, and unless your infrastructure is broken all the time, querying past data should occur only rarely.

        From what I have seen, this mistake seems to stem from the web culture that tends to favor designs centered on a database, whereas you want your real-time monitoring to be centered on stream processing, with one output going to the persistent store for later retrieval.

        There is no good reason for your dashboards or your alerts to hit a database every few minutes; this design is just wrong.

        I'm actually working on the prototype of a stream processor tailored to small-scale network monitoring and would welcome any discussion/criticism on this topic. Notice that "small scale" for a stream processor is much larger than "small scale" for a database, and I believe a good stream processor capable of running arbitrary persistent queries for monitoring on the order of a million data points per second should fit on a single server and be more than enough to monitor an above-average-sized business infrastructure.

        • closeparen 1524 days ago
          When you have very high cardinality series, it's decently common to slot in a "rollup" or "reducer" query between collection and persistence.

          This is similar to the insight you mentioned: I don't actually need to store a separate series per hostname in perpetuity just to know when the worst one is out of control. I can ask it to persist the top five individually, and then an average.

          But when responding to a PagerDuty alert, the first thing I want to see is a plot of the alerting series. The next thing I want to see is a plot of every series about that service (bonus points if you can find the ones that have discontinuities with similar timing). So you're really not getting out of the "fast serving" business, just reducing the load on it.
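
          A minimal sketch of that reducer step, with made-up names, assuming the policy is "persist the worst five hosts individually plus an average" rather than one series per hostname:

          ```go
          package main

          import (
              "fmt"
              "sort"
          )

          type point struct {
              host  string
              value float64
          }

          // rollup keeps the `keep` highest-valued hosts individually plus an
          // average across all hosts, which is what actually gets persisted.
          func rollup(latest map[string]float64, keep int) (top []point, avg float64) {
              for h, v := range latest {
                  top = append(top, point{h, v})
                  avg += v
              }
              avg /= float64(len(latest))
              sort.Slice(top, func(i, j int) bool { return top[i].value > top[j].value })
              if len(top) > keep {
                  top = top[:keep]
              }
              return top, avg
          }

          func main() {
              latest := map[string]float64{
                  "web-01": 0.91, "web-02": 0.15, "web-03": 0.42,
                  "web-04": 0.88, "web-05": 0.07, "web-06": 0.65, "web-07": 0.33,
              }
              top, avg := rollup(latest, 5)
              fmt.Println(top, avg) // persist these five series plus the average, not all seven
          }
          ```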

          • rixed 1524 days ago
            > But when responding to a PagerDuty alert, the first thing I want to see is a plot of the alerting series.

            When you do receive a page, you can then hit a database. This happens rarely enough (hopefully!) so that your database can be optimised for writes only, not for writes and reads, as $parent suggested.

            I would even argue that this is a similar use case to dashboarding: when paged you want to see the last 3 days or so of accurate data plus some longer-term averages/trends/baseline. All of this could fit in RAM. At least in the current prototype I mentioned above, the ambition is to be able to serve recent data for such dashboards out of RAM without bothering with a DB, and to keep the DB completely out of the way for the monitoring use case (while still having a persistent store for long-term capacity planning / business analyses).

            Assuming doubles are compressed and a collection rate of one sample per 10s, 3 days of data is <100 KiB per timeseries, i.e. ~0.5 GiB for 10k source timeseries.
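
            A back-of-the-envelope check of those figures, assuming Gorilla-style compression at roughly 1.5 bytes per sample (the real ratio depends on the data, so this is an estimate rather than a measurement):

            ```go
            package main

            import "fmt"

            func main() {
                const (
                    days             = 3
                    secondsPerSample = 10
                    bytesPerSample   = 1.5 // assumed compressed size per sample
                    seriesCount      = 10000
                )
                samples := days * 24 * 3600 / secondsPerSample // 25,920 samples per series
                perSeriesBytes := float64(samples) * bytesPerSample
                fmt.Printf("per series: %.0f KiB\n", perSeriesBytes/1024)                // ~38 KiB, under 100 KiB
                fmt.Printf("10k series: %.2f GiB\n", perSeriesBytes*seriesCount/(1<<30)) // ~0.36 GiB
            }
            ```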

    • abvdasker 1525 days ago
      A time series database is specialized for use cases where the data and query patterns are solely temporal in nature and must show the latest data in real-time (performance metrics/monitoring and stock prices come to mind). Relational and NoSQL databases tend to degrade rapidly with these query patterns at scale (think of the complexity of SQL queries to bucket rows by timestamp).

      https://en.m.wikipedia.org/wiki/Time_series_database
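
      To illustrate the point about bucketing rows by timestamp: in a vanilla relational database without time-series helpers, this typically means epoch arithmetic along these lines (Postgres-flavoured SQL; the metrics table and its columns are made up):

      ```go
      package main

      import "fmt"

      // bucketQuery groups samples into 1-minute buckets using plain epoch
      // arithmetic, with no time-series-specific SQL functions involved.
      const bucketQuery = `
      SELECT to_timestamp(floor(extract(epoch FROM ts) / 60) * 60) AS minute,
             avg(value)                                            AS avg_value
      FROM   metrics
      WHERE  ts > now() - interval '1 hour'
      GROUP  BY 1
      ORDER  BY 1`

      func main() {
          // Printed for brevity; in practice this would be executed via database/sql.
          fmt.Println(bucketQuery)
      }
      ```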

      • roskilli 1525 days ago
        I touch on this a little in the podcast I did with Jeff[0], but it boils down to this: when we benchmarked OpenTSDB, it could only do low tens of thousands of writes per second per node, whereas M3DB is hyper-optimized and can do hundreds of thousands to millions of writes per second per node, depending on compute/disk.

        Also with a fast inverted index we were able to achieve much faster query times than OpenTSDB at scale.

        [0]: https://softwareengineeringdaily.com/2019/08/21/time-series-...

      • refset 1525 days ago
        Note that temporal databases are also a thing, so it's probably wise to avoid using the word "temporal" when discussing time series databases. As far as I know kdb+ is the only technology that has a foot in both camps.

        https://en.m.wikipedia.org/wiki/Temporal_database

        • CharlesW 1525 days ago
          > As far as I know kdb+ is the only technology that has a foot in both camps.

          Teradata Vantage also supports both. And you're absolutely right, it's important not to conflate "temporal" and "time series" support.

    • namanaggarwal 1525 days ago
      It's not a marketing term. It's a database optimised for storing and querying time based metrics.

      Uber didn't invent the term, there are a lot of existing products in market.

      The question is why none of them worked for them. I have used OpenTSDB and it worked great at Mastercard scale. What issues did Uber have?

      • rossjudson 1524 days ago
        "Mastercard scale" doesn't mean anything in particular, unless you quantify it. How many metrics? Write rate? Query rate? Query complexity?
    • idunno246 1525 days ago
      A DB that backs lots of time-based graphs. They generally hit some pathological cases for general DBs: frequent writes of small data; the most recent time window is very hot, so sharding is tricky; queries generally pull lots of little bits of data from long time ranges; resolution of past data can often be lowered; etc.

      OpenTSDB, for instance, was built on top of HBase because implementing one naively in HBase hits tons of performance issues.

    • steve_adams_86 1525 days ago
      There are many flavours of time series databases, but my understanding is that at their core they're optimised for storing and querying timestamped/time series data, typically very quickly and in large volumes. That's maybe the primary criterion for defining a database as a time series database.

      Someone could probably elaborate on this a massive amount. I'm sure there is some nuance and a lot of relevant details around how that optimization is done.