Faster – Fast key-value store from Microsoft Research

(github.com)

401 points | by JeffCyr 2077 days ago

22 comments

  • skyde 2077 days ago
    I had the opportunity to look at the code. I believe this should be compared against embedded hash databases such as Kyoto Cabinet, LevelDB, and RocksDB. They introduce a novel latch-free hashtable that they say is faster than other in-memory data structures.

    They also introduce a new disk persistence system called HybridLog that combines in-place updates (in memory) and log-structured organization (on disk).

    The interesting aspect of the HybridLog is that it acts as a buffer pool but seems to work at the record level, instead of working with pages like a B-tree buffer pool would, or with compressed blocks as LevelDB's block_cache does.
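
    To give a flavor of what "latch-free" means here: index updates are done with atomic compare-and-swap on 8-byte bucket entries instead of taking locks. A toy sketch of the idea in C# (not FASTER's actual code; the real index packs tag bits and log addresses into each 64-bit word and handles overflow buckets):

      using System.Threading;

      // Toy latch-free index: each bucket slot is a single 64-bit word,
      // so it can be claimed with one atomic compare-and-swap, no lock.
      class ToyLatchFreeIndex
      {
          private readonly long[] _slots = new long[1 << 20]; // 0 == empty slot

          public bool TryInsert(long keyHash, long recordAddress)
          {
              int bucket = (int)((ulong)keyHash % (ulong)_slots.Length);
              // Claim the slot only if it is still empty.
              return Interlocked.CompareExchange(ref _slots[bucket], recordAddress, 0) == 0;
          }

          public long Lookup(long keyHash)
          {
              int bucket = (int)((ulong)keyHash % (ulong)_slots.Length);
              return Volatile.Read(ref _slots[bucket]); // 0 means not found
          }
      }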

    • brongondwana 2076 days ago
      I was very surprised to see pretty much the entire implementation inside a file called faster.h rather than in a .cc file. Maybe that's all the rage in C++ libraries, I don't work with many of them.
    • fahrradflucht 2075 days ago
      To compare it to rocksdb or leveldb: it looks like there is no iteration support or the notion of ordered keys to support partial scans...
    • captain_crabs 2077 days ago
      Although I never got a CS degree, because I kind of understood this comment, I guess I can officially consider myself a nerd now.
  • atombender 2077 days ago
    For those wondering what this is, it is not a client/server app, from what I can tell, but an embedded engine. It looks like it's intended to be a library, and it's been implemented in two languages (C# and C++).

    To get something like Redis or Riak you would have to build an API, clustering, etc. on top of it. So it's more analogous to libraries like RocksDB, BoltDB, BDB, etc.

    Paper: https://www.microsoft.com/en-us/research/uploads/prod/2018/0...

    • dlahoda 2077 days ago
      The C++ part is a one-liner intrinsic, which may be supported directly by newer .NET, plus several I/O methods that would be heavy on P/Invoke calls (something like 5 per method) if done in C#, but these are very simple. So the C++ part could be done in C and easily ported to Unix. I guess C# should allow for a P/Invoke strategy like Lua's to replace more of the C++.
      • int_19h 2077 days ago
        What kind of different strategies did you have in mind? So far as I know, P/Invoke already tries to be as low overhead as possible - e.g. primitive types and blittable structs are just passed as is, with no conversions. The recently added Span<T> is also special-cased for P/Invoke.
        • chrisseaton 2076 days ago
          > So far as I know, P/Invoke already tries to be as low overhead as possible

          As low as possible, maybe, but there are lots of limiting factors with P/Invoke, such as:

          - They are not inlined

          - Registers need to be saved pessimistically

          - The state at the point of the call is visible to observers outside of the code the compiler controls, so options for scheduling are constrained

          - Objects need to be materialised

          - All parameter values need to be available even if they aren't actually used

          - Objects passed have to be pinned or given an indirect handle

          - The caller has to be in a safepoint so GC can keep running

          - The compiler has to juggle things into place for the C ABI, where it may normally lay things out differently

          - etc

          Going through a P/Invoke call is a whole inconvenient circus compared to an intrinsic.
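
          For contrast, here's roughly what the two call shapes look like in C# (the native library name and entry point below are made up, just to show the shape):

            using System;
            using System.Runtime.InteropServices;
            using System.Threading;

            static class Native
            {
                // P/Invoke: not inlined, needs a managed-to-native transition,
                // argument pinning/marshalling, and the C ABI. "mylib" and
                // "my_cas64" are hypothetical names.
                [DllImport("mylib", EntryPoint = "my_cas64")]
                public static extern long CompareExchangeNative(ref long location, long value, long comparand);
            }

            class Demo
            {
                static long _slot;

                static void Main()
                {
                    // Intrinsic: on x64 the JIT emits a single lock cmpxchg inline,
                    // with none of the overheads listed above.
                    Interlocked.CompareExchange(ref _slot, 42L, 0L);
                    Console.WriteLine(_slot); // 42
                }
            }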

    • yarper 2076 days ago
      Can anyone work out what platforms it works on? .NET Core?
  • mooman219 2077 days ago
    "What differentiates FASTER are its cache-optimized index that achieves very high performance — up to 160 million operations per second when data fits in memory;"

    I really dislike it when papers make performance claims like this in the introduction. That "160 million" number is so meaningless at face value, because everything from the runtime environment to the hardware plays a huge role in ops/sec. I'd rather see how it compares against other key-value stores across a number of configurations and use cases.

    • haney 2077 days ago
      It'd be awesome if they provided a comparison with other tools on the same hardware, like "on an AWS m4.xlarge instance we were able to achieve 160M ops/sec when the dataset fit into memory, whereas Redis only did X".

      Lacking that I agree it's a pretty meaningless stat.

      • mgraczyk 2077 days ago
        Check out the linked paper for a detailed performance comparison.
      • ddorian43 2077 days ago
        Don't do benchmarks on a shared host.
        • ebikelaw 2077 days ago
          If that’s your production environment then that’s where you should run them.
          • dymk 2077 days ago
            That's not why you don't run them on a shared host.

            It's because every other tenant on the machine is going to make runs of the same benchmark unpredictable, and results will likely vary greatly through the day.

            Even taking multiple runs of each benchmark isn't sufficient, because you don't know the usage patterns of other tenants.

            • pvg 2077 days ago
              You're making it sound like performance on such hosts is unknowable, which isn't really accurate. "Multiple runs of each benchmark" is vague enough to be potentially insufficient in just about any environment, to boot.
              • eklavya 2076 days ago
                Variance matters though. Test where you can reasonably be sure about low variance.
                • pvg 2076 days ago
                  Sure. But it can be measured and accounted for. The idea that you can't or shouldn't benchmark such environments seems weird, given that they're pretty popular.
      • andrewstuart 2077 days ago
        No benchmark means anything on an EC2 shared instance (or probably any other cloud instance) because you don't know what else is running on the machine.
        • haney 2076 days ago
          What about running the benchmark multiple times on instances of the same type? I get that it would be noisy but lots of workloads run on shared instances so it’s a useful measuring stick in that way.
    • kodablah 2077 days ago
      And: "FASTER achieves higher throughput than current systems, by more than two orders of magnitude, and scales better than current pure in-memory data structures, for in-memory working sets."

      Looking at section 7.2 of the paper, they probably mean "more than 2x"; it's definitely not 100x (two orders of magnitude) faster in most cases. Still nice work though.

      • jamesblonde 2077 days ago
        Two orders of magnitude is patently not true. MySQL Cluster has been benched at 200M ops/s (where operations are part of 2PC transactions!). And that was in 2015! https://www.mysql.com/why-mysql/benchmarks/mysql-cluster/

        The echo chamber of Silicon Valley is an important reason why MySQL Cluster is not more popular.

        • cliffordj 2077 days ago
          From your link: "This was achieved with 32 (out of a maximum 48) data nodes." The project being discussed is benchmarked on a single node, not a cluster. This is why they compare performance to RocksDB.
        • CyberDildonics 2077 days ago
          As well as being owned by Oracle.
    • feyman_r 2077 days ago
      The "our website" link on the GitHub page goes to this paper [1]. I haven't read through it, but I did see a graph comparing Intel TBB, RocksDB, and Masstree.

      [1] https://www.microsoft.com/en-us/research/uploads/prod/2018/0...

    • hedora 2076 days ago
      The claim about being two orders of magnitude faster than other stores is also pretty annoying. Many systems perform at most one disk access for data sets with working sets hundreds of times larger than DRAM, and that's essentially optimal for those workloads.

      It would be nice if they said what workload patterns they’ve optimized for, and which ones they perform poorly on.

    • nolok 2077 days ago
      Agreed. Hardware and software have had a fair amount of upgrades since then, but reading that I half-remember a quote from the memcached guy that goes something like "you can't tell whether this or that tweak is giving you a few % more requests, because you're I/O-limited by your network link anyway".
      • dormando 2077 days ago
        close enough.

        heavily batched on a 48 core I can pull 50 million keys/sec over localhost. if you remove syscalls and use it as a library it should double at least.

        writes are another story, but they're slower because nobody asks for them to be faster.

    • imhoguy 2076 days ago
      This. Somebody already filed an issue: https://github.com/Microsoft/FASTER/issues/2
    • xjia 2076 days ago
      So 160 million packets per second?
  • iamleppert 2077 days ago
    For key-value stores that make performance claims, it would be helpful if they published time and space measurements for various sample workloads: read/write mixes and varying data homogeneity, on at least one standard architecture, across different memory and disk configurations, since what is being touted is the novel storage architecture. And they should provide the scripts to generate said sample datasets.
  • Rauchg 2077 days ago
    Haven't read the papers yet, but it immediately reminds me of Anna:

    https://databeta.wordpress.com/2018/03/09/anna-kvs/

    • sometimesijust 2077 days ago
      That link has comments between the Anna and FASTER authors. The best part:

      > There are two high-level design goals that separate Anna and FASTER.

      > First of all, for Anna, we set out to explore an execution model that’s truly coordination-free; each thread accepts requests, performs computation and sends out response without communicating or waiting for other threads. We believe having a coordination-free execution model is the key to fully exploiting multi-core parallelism within a single machine, and scale out smoothly to a distributed setting. We acknowledge that the fundamental caveat of having a coordination-free execution model is that strong consistencies (linearizability, serializability) are not achievable. Anna instead offers a wide-spectrum of coordination-free consistencies taxonomized in Bailis’s HAT paper (http://www.vldb.org/pvldb/vol7/p181-bailis.pdf).

      > In addition, in Anna we focus on exploring a unified architecture that works at any scale, from a single multi-core machine to NUMA to a geo-distributed setting. Under this goal, architectures that rely on shared memory within a machine (including FASTER) need to be redesigned as we move to a distributed setting. This complicates the software, and can introduce challenges in maintaining consistency as the execution model within nodes and across nodes are now different.

      > Anna currently focuses on workloads that fit in memory. For larger-than-memory data, we believe Anna can benefit from the hybrid-logging technique in FASTER for efficiently persisting data to stable storage.

  • fizx 2077 days ago
    If you want concrete benchmarks, they compare to RocksDB and Redis around page 10 of their academic paper. (https://www.microsoft.com/en-us/research/uploads/prod/2018/0...)

    TL;DR:

    I find their choice of benchmarks to be very convenient.

    They tested on in-memory 8-byte payloads and were way faster than RocksDB and Redis. They then tested only against different configurations of themselves for the configurations that hit disk. They also tested an embedded version of their software against a Redis that hits loopback (rather than, e.g., running against the raw Redis data structure implementation, which would have been a fairer comparison).

    • makmanalp 2077 days ago
      > They then tested against only different configurations of themselves for configurations that hit disk.

      In the case of Redis, AFAIK it can't support larger than memory use cases, right? And in fact they do compare to RocksDB for larger-than-memory, see Fig 10. Granted, it could be more detailed, but I think it makes their point.

      • fizx 2076 days ago
        They sure hid figure 10! Unexpected section, bad header choice, etc.
    • utopcell 2076 days ago
      8-byte payloads are a necessary limitation of their system because of atomic operations.
  • baybal2 2077 days ago
    I'm sceptical. Modern top-tier in-memory key-value stores like LMDB are within 20-30% of the CPU's maximum I/O throughput on server-class hardware. There must be some tricks in their metric.

    Do they actually fetch data, or do they just measure how many memory pointers per second they can dump?

  • frugalmail 2077 days ago
    Why the hell is Microsoft so bad at naming?!

    I'm sure it will be easy to find information about this,

    .NET,

    the abysmally performing FAST Search product (which is named too close to this, with too much overlap),

    Creators Update,

    SQL Server, like they're the only SQL-capable server out there, even though it's based on somebody else's product,

    etc...

    • HNLogInShit 2075 days ago
      Google is worse. The “Play Store,” which sounds like it should only be games.

      Google Video... we all know how that turned out.

      Google Plus: Not only a worthless, idiotic name, but Google actually crippled its core product (search) by removing “+” as the “requires” operator for search terms... apparently because it “interfered” with Google Plus.

      Then there’s the mess that is Android.

  • hashrate 2077 days ago
    This is really interesting.

    This is usually what people do when using a NoSQL database: reduce latency by caching elements in an in-memory database.

    Basically they are mixing Redis with RocksDB, which is what devs usually do to get even higher throughput, storing IDs in Redis to save a call to RocksDB.
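
    Roughly the pattern I mean, sketched with hypothetical interfaces (stand-ins for a Redis client and RocksDB, not their actual APIs):

      // Cache-aside read: check the in-memory cache first, fall back to the
      // persistent store on a miss, then populate the cache for next time.
      public interface ICache
      {
          string Get(string key);
          void Set(string key, string value);
      }

      public interface IPersistentStore
      {
          string Get(string key);
      }

      public static class CacheAside
      {
          public static string Read(string key, ICache cache, IPersistentStore store)
          {
              var hit = cache.Get(key);
              if (hit != null) return hit;      // served from memory

              var value = store.Get(key);       // disk-backed read
              if (value != null) cache.Set(key, value);
              return value;
          }
      }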

    What bothers me is the look of the repository; it looks completely rushed out.

    No logo, unclear description of the tech...

    As others mentioned, it's just "an engine"; it doesn't actually contain the network layer, the clustering mechanism, etc.

    I guess it's probably the tech powering their flagship engine, Cosmos DB.

  • Nican 2076 days ago
    I have seen plenty of fast local-machine key-value stores, such as LevelDB (by Google) or RocksDB (by Facebook), but I have a hard time imagining what they are for.

    What are the use cases for such a library?

    • DSingularity 2076 days ago
      Imagine you want to run a service. This service needs to maintain some intermediate state. This state might be frequently read. There might be little value in persisting this state. Also, your service is used by many users, so this state can grow to be pretty big. For example, the contents of a shopping cart.

      One solution is to maintain such state in some key-value store. Different functions of your service can query this state within the data center and never have to suffer disk delays, which are often much longer than a round trip over your data center's internal network.

      • jcelerier 2074 days ago
        sooo... what's the difference from your run-of-the-mill hash map / ordered map / whatever? How does it compare to other maps such as these? https://tessil.github.io/2016/08/29/benchmark-hopscotch-map....
      • wiradikusuma 2076 days ago
        I think it's an in-memory embedded library, so why not just use a global variable map/dictionary?
        • seabrookmx 2066 days ago
          > so this state might grow pretty big

          As mentioned in the comment you replied to, it's because FASTER (and similar libraries) allow persisting this state to disk in an efficient manner. If it can fit in memory and your language/framework has an efficient hashmap implementation, then you're right: it's probably not worth using something like this.

      • xmichael999 2076 days ago
        Great summary.
    • hedora 2076 days ago
      They sit under a service you would actually use. For instance, I think MySQL can use RocksDB as a storage backend.
  • andreygrehov 2077 days ago
    What does FASTER stand for? Because otherwise it's a terrible name, imo.
    • comboy 2077 days ago
      Yeah, it seems like an acronym, maybe some internal codename. But for some reason they didn't make it public.. Inappropriate? So... Fuck Anything Slower Than Expected Results? ;)
    • badloginagain 2077 days ago
      Microsoft tends to choose bad names: See Visual Studio Code, the most ungooglable name ever.
      • andrewstuart 2077 days ago
        What are your thoughts on how "Googleable" these product names are: ".NET" and "Azure Functions"?
        • mcbits 2076 days ago
          Google makes it work. But then there are C# and F#, which are Googleable but fail on probably 99.9% of other sites and software with search functionality, including some of Microsoft's own products (e.g. NuGet).
    • gitgud 2076 days ago
      A very arrogant name from a very arrogant company...

      (Someone should make something faster than faster)

  • faragon 2077 days ago
    "up to 160 million operations per second"

    Read operations? Write operations? Per core, or using N cores?

    For the case of one core at 3 GHz, it would mean roughly 20 CPU cycles per operation (3x10^9 cycles / 1.6x10^8 ops ≈ 19), which is barely enough to store the data in RAM for a log write. Using more than one core, e.g. one for writing the log and others for updating the actual data structures, could imply much longer times for completing a single operation despite doing 160M per second on average; in that case their "faster" would mean scalable rather than faster.

  • utopcell 2077 days ago
    The VLDB reviewers of their paper [1] must not have thought much of it, because it was accepted as a short paper, which is not a great signal. No comment on the quality of the work though, as I have only just started reading it.

    [1] https://www.microsoft.com/en-us/research/uploads/prod/2018/0...

  • fein 2077 days ago
    It would be nice to know why I would use this instead of something tried and tested like Redis, but the sparse description doesn't really help.
    • makmanalp 2077 days ago
      You wouldn't - first and foremost, this is probably most useful as a storage layer for some higher-level data store.

      Second, think of software artifacts from papers as a proof of concept, or a prototype. It's intended more to demonstrate some architectural innovation, and less to serve customer needs. Their ideas revolve around supporting a high level of concurrent access and removing the locking overhead that would normally stem from this, and an interesting way to handle spilling over to disk in larger-than-memory scenarios. This might evolve into a customer-oriented product eventually. Or perhaps get retrofitted into the guts of something like MS SQL server.

    • staticassertion 2077 days ago
      Redis is like 0 for 3 from a CAP perspective as far as I understand it. Fine for a cache, but if you want a KV store you can rely on for some properties, it does not seem viable.

      So redis is 'tried and true' but may not meet your constraints.

      https://www.quora.com/What-is-Redis-in-the-context-of-the-CA...

      I haven't read the Faster paper yet (planning to) so I don't know that it provides better guarantees. But I personally would like a simple-as-redis KV store that provides better guarantees.

      edit: Ah, yeah, Faster doesn't seem to even really be directly comparable to Redis - seems more like rocksdb, and not a distributed system.

    • kodablah 2077 days ago
      On the surface, I'd guess the statically linked vs separate daemon differences apply (both have obvious pros and cons depending upon requirements and deployment scenarios).

      Also, MS Research is just that, a research wing, and many of their libs go stale/unsupported after being written.

  • throwaway010317 2073 days ago
    Could somebody give an example of a use case for this?
  • 21 2077 days ago
    Faster? Really?

    Name for fast X: Quicker

    Name for fast Y: Speedy

    • Iwan-Zotow 2076 days ago
      Name for fast Z: Gonzales
    • gaius 2077 days ago
      Didn’t Google call one of their HTTP extensions SPDY? Or am I thinking of something else?
      • tialaramex 2077 days ago
        Yes, Google's binary replacement for HTTP when run over TLS was named SPDY, and it is ultimately the origin of the HTTP/2 standard.

        Their encrypted replacement for TCP was named QUIC and is now being worked up into a standard also to be named QUIC (people working in this space call Google's version GQUIC). The IETF's standard QUIC is firming up and may be finished in 2019 or so, but then I expected TLS 1.3 in 2016, so what do I know.

      • dragonwriter 2077 days ago
        > Didn’t Google call one of their HTTP extensions SPDY? Or am I thinking of something else?

        SPDY was the experiment that became HTTP/2, yes.

      • utopcell 2077 days ago
        SPDY was a good 4-letter acronym.
  • tribesman 2077 days ago
    So what's the underlying storage? Is it LSM with more tricks and parallelization + probabilistic algorithms + SIMD intrinsics?
  • HNLogInShit 2075 days ago
    The term is “key/value store.” It doesn’t just store key values, which is what a key-value store would do.
  • graycat 2077 days ago
    Lots of people here are much deeper into key-value stores, Redis, etc. than I am. So here I outline a key-value store I wrote long ago for a Web server session state store, which I am still using, and I ask for expert comments on any pros/cons.

    So, for my Web site, (A) when I first send a Web page to a user, I send in the HTML of that page, in an encrypted, hidden field, a key that identifies the user's session. The data I want to keep on the user's session I make a value and then write the key-value pair to my Web session state store; (B) when a user does an HTTP POST back to my Web server(s), I get the user's key and from my session state store get the value that is the user's session state. Each such value is just a byte array that is the serialized instance of my session state class, the instance particular to that user.

    My session state store is just some simple .NET Framework software running as a Windows console application. The core of the store is just two instances of the standard .NET collection class, hopefully as fast as AVL trees (as in Knuth, Sorting and Searching) or red-black trees (IIRC, in Sedgewick). One collection class instance holds the key-value pairs. The other collection class instance holds, for each key, the time of the last access to the key-value pair and is used to implement session time-outs. The communication with the Web servers is just via simple TCP/IP sockets sending/receiving byte arrays.

    The session state store is single-threaded, so there are no issues about, or efforts for, concurrency.
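
    In rough outline the core is just this (a simplified sketch; SortedDictionary stands in for the "standard collection class", and the real code adds the TCP/IP handling, serialization, and error handling):

      using System;
      using System.Collections.Generic;

      // Single-threaded core of the session state store: one collection for the
      // key-value pairs, one for the time of last access, used for time-outs.
      class SessionStateStore
      {
          private readonly SortedDictionary<string, byte[]> _values =
              new SortedDictionary<string, byte[]>();          // red-black tree
          private readonly SortedDictionary<string, DateTime> _lastAccess =
              new SortedDictionary<string, DateTime>();
          private readonly TimeSpan _timeout;

          public SessionStateStore(TimeSpan timeout) { _timeout = timeout; }

          public void Put(string key, byte[] serializedState)
          {
              _values[key] = serializedState;
              _lastAccess[key] = DateTime.UtcNow;
          }

          public byte[] Get(string key)
          {
              byte[] value;
              if (!_values.TryGetValue(key, out value)) return null;
              _lastAccess[key] = DateTime.UtcNow;              // refresh the time-out
              return value;
          }

          // Called periodically to drop timed-out sessions.
          public void ExpireStale()
          {
              var cutoff = DateTime.UtcNow - _timeout;
              var stale = new List<string>();
              foreach (var pair in _lastAccess)
                  if (pair.Value < cutoff) stale.Add(pair.Key);
              foreach (var key in stale)
              {
                  _values.Remove(key);
                  _lastAccess.Remove(key);
              }
          }
      }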

    Then it would be good for the session state store to be sufficiently fast for the Web site and also to have a FIFO (first in, first out) queue of incoming requests. TCP/IP provides the desired FIFO queue, and, for the FIFO queue length, I'm setting that with just the standard option for TCP/IP.

    So far, the session state store is all in main memory, but, of course, that memory might page as in virtual memory.

    I'm guessing that there is more computing just for the TCP/IP than for the core of the session state store itself, i.e., the two collection class instances. If so, then as long as I'm using just TCP/IP communications, e.g., over my server farm LAN, the speed of the session state store code becomes a secondary issue -- the whole session state store operation as seen by the Web server(s) might not be much faster even if the session state store ran in 0 time. Maybe.

    So far with my server farm software architecture, it should be easy to have as many executing instances of the session state store as needed for Web site performance and with no user affinity between a particular user and a particular Web server. There would be affinity between a particular user and a particular session state store via whatever Web server instance the user most recently got assigned to via load leveling.

    Q 1. Is there a fundamental and serious flaw in my design?

    Q 2. Would Redis actually be much better for me?

    Q 3. If my Web site becomes very busy and has me using, say, a full rack of servers just for session state store, should I look for a still faster way to do the work?

    Q 4. Maybe solid state disks (SSDs) are now so fast that I should just program my session state store to use an SSD instead of main memory -- e.g., I can get SSDs of several terabytes, and that much main memory would be much more expensive. Of course, this assumes that the LAN and the software for the TCP/IP stack are fast enough that the performance bottleneck is just the work on the collection class instances, and that, even with the rest of the assumptions, an SSD would be fast enough -- sounds like a strange situation.

    Q 5. Should I plan on using a 10 Gbps LAN for communicating with a session state store? That is, might such a fast LAN significantly reduce the latency of the communications between the Web server(s) and the session state server(s)? Having the Web server wait less time for the read/write with the session state store might speed up the whole server farm. Does anyone have any actual experience with such speeds?

    Thanks!!

  • magoon 2077 days ago
    Faster than the leading brand
  • bufferoverflow 2077 days ago
    But is it faster than Aerospike or Tarantool?
    • dfee 2077 days ago
      Is it webscale?