Algorithms Behind Modern Storage Systems

(queue.acm.org)

572 points | by matt_d 210 days ago

8 comments

  • throwawaypls 210 days ago

    I read this book titled "Designing Data Intensive Applications", which covers this and a lot of other stuff about designing applications in general. https://www.amazon.com/Designing-Data-Intensive-Applications...

    • _nrvs 210 days ago

      Just finished reading DDIA, can't recommend it enough! I learned a lot of new info in every single chapter even for topics I thought I had a firm grasp on. Great job Martin Kleppmann!

      • kuwze 210 days ago

        He also published this awesome paper on CRDTs, "A Conflict-Free Replicated JSON Datatype".

        [0]: https://arxiv.org/abs/1608.03960

        • pbowyer 210 days ago

          It's the best written computer-related book I've read. On a par with Friedl's "Mastering Regular Expressions".

          Very highly recommended.

          • _jal 210 days ago

            > On a par with Friedl's "Mastering Regular Expressions".

            That comparison sold me. You deserve a commission.

            • pbowyer 210 days ago

              Thank you!

              Be warned, the O'Reilly print quality is miles apart between the two. Their print-on-demand text quality and the binding are real let-downs.

              • _jal 210 days ago

                That's too bad. The MRE book was great - the special typography for zero-width indicators and whatnot was really well done. Progress.

            • sah2ed 209 days ago

              Currently reading the book and I agree with your recommendation; it's very well written, perhaps with an intuitive learner in mind.

            • drej 210 days ago

              Too bad I can't upvote this multiple times.

              • acidtrucks 210 days ago

                Thanks, I literally just came here to ask for book recommendations on this topic.

                Are there any other suggestions?

                • commandlinefan 210 days ago

                  I know this sounds like a cliche at this point, but volume 3 of Donald Knuth's "The Art of Computer Programming" goes into more depth on the theory that underpins these algorithms than anything else I've ever seen/read/heard of (in fairness, I haven't read OP's suggestion yet, though).

                • Joeri 210 days ago

                  This is one of the best technology books I've ever read. I spent a few years diving into big data architecture. I thought I had a reasonably good handle on it, then I read this book. So many insights.

                  • rollulus 210 days ago

                    That book is a must-read for anyone dealing with data.

                    • tw1010 210 days ago

                      The simultaneity of your and mkandas89's comment is what I'd call a canonical example of spooky action at a distance.

                    • mkandas89 210 days ago

                      It's a must read for anyone dealing with data

                      • polymathemagics 210 days ago

                        Completely agree, fantastic book. Does anyone know of any similarly wonderful technical books?

                      • JelteF 210 days ago

                        There is one important detail that is never fully articulated in the article. It says B-trees are read-optimized. That's true, but the big difference is for random reads, not sequential reads. Seeking a key in an LSM is relatively expensive because it has to read multiple indexes, but once that's done, reading a lot of keys and values from that point on is not expensive (or at least not much more so than in a B-tree). The reason for this is mentioned in the article: the SSTables. They allow for quick scanning through the keys, because they are stored on disk in sorted order.
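
A minimal sketch of that point, using a toy in-memory SSTable (the class and names below are made up for illustration, not from the article): once a binary search finds the starting key, a range scan is just a sequential walk over the sorted entries.

```python
import bisect

class SSTable:
    """Toy stand-in for one on-disk SSTable: entries kept sorted by key."""
    def __init__(self, entries):              # entries: dict of key -> value
        self.keys = sorted(entries)
        self.vals = [entries[k] for k in self.keys]

    def scan(self, start_key, count):
        """Seek once (binary search), then read sequentially -- the cheap part."""
        i = bisect.bisect_left(self.keys, start_key)
        return list(zip(self.keys[i:i + count], self.vals[i:i + count]))

table = SSTable({"b": 2, "a": 1, "d": 4, "c": 3})
print(table.scan("b", 3))   # [('b', 2), ('c', 3), ('d', 4)]
```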

                        • dhd415 210 days ago

                          LSMs do not guarantee that sequential key/value pairs are stored in the same SSTable so reading "n" k/v pairs, in the worst case, could require seeking through "n" different SSTables. There are datastores that use LSMs such as HBase that provide guarantees on top of LSMs to facilitate efficient range reads and, of course, many LSM compaction algorithms improve range reads, but the basic LSM is optimized only for single k/v reads. That's the reason, for example, why Cassandra doesn't even offer range reads such as "SELECT * FROM mytable WHERE KEY > x AND key < (x + k)".
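
A rough sketch of that worst case, with made-up tables: a range read has to merge-iterate every SSTable whose keys overlap the requested range.

```python
import heapq

# Each toy "SSTable" is just a sorted list of (key, value) pairs.
sstables = [
    [("a", 1), ("d", 4), ("g", 7)],
    [("b", 2), ("e", 5)],
    [("c", 3), ("f", 6)],
]

def range_read(tables, lo, hi):
    """Worst case: every table overlaps [lo, hi], so every table is consulted."""
    runs = [[(k, v) for k, v in t if lo <= k <= hi] for t in tables]
    return list(heapq.merge(*runs))          # k-way merge of the per-table runs

print(range_read(sstables, "b", "f"))
# [('b', 2), ('c', 3), ('d', 4), ('e', 5), ('f', 6)] -- drawn from all 3 tables
```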

                          • misframer 210 days ago

                            > That's the reason, for example, why Cassandra doesn't even offer range reads

                            This is not true. The reason why Cassandra doesn't support that is because of hashing of keys across the cluster -- you'd have to query all shards and merge the results. That has nothing to do with the LSM storage.
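
A toy illustration of this (the hash function and node count are hypothetical; real Cassandra partitions by token ranges): hashing scatters adjacent keys across nodes, so a key-range query would have to fan out to every node regardless of the storage engine underneath.

```python
NUM_NODES = 4

def node_for(key):
    # Adjacent keys land on unrelated nodes once you hash them.
    return hash(key) % NUM_NODES

for key in ["user:100", "user:101", "user:102", "user:103"]:
    print(key, "-> node", node_for(key))
# A range query over user:100..user:103 would have to ask every node.
```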

                        • koverstreet 210 days ago

                          cough bcachefs's b+ tree can push about a million random updates per second (almost half a million single threaded, last I checked).

                          I haven't heard of anything faster...

                          edit: found the benchmarks I ran https://www.patreon.com/posts/more-bcachefs-16647914

                          • espadrine 210 days ago

                            Congrats! That is Redis-on-Xeon territory! https://redis.io/topics/benchmarks

                            Though I assume that, like Redis, a reboot would lose acked writes?

                            An SSD can't save a write in a microsecond; its latency is at least an order of magnitude higher, right?

                            • FullyFunctional 210 days ago

                              Enterprise drives usually have ultracaps with enough energy to power the writeback of the write buffer to stable storage in case of power loss, so your write latency is just the time it takes to hit the memory buffer, typically dominated by the PCIe transfer, i.e. a few µs (unless, of course, the buffer has filled, but that means you're hitting peak write bandwidth).

                              Consumer drives typically don't pay for this, but instead good drives buffer writes in SLC flash which has lower latency than MLC.

                              ADDENDUM: funny thing is that you have to actively manage the energy in the caps; the writebuffer must never have more data than you have energy to write back. This becomes especially important on power-up where the empty cap is charging.
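
A back-of-envelope sketch of that last point, with entirely made-up numbers: the dirty write buffer must never exceed what the hold-up caps can flush after power loss.

```python
cap_energy_j     = 5.0      # usable energy in the hold-up caps (hypothetical)
flush_power_w    = 4.0      # controller + NAND power while flushing (hypothetical)
flush_bw_bytes_s = 500e6    # sustained program bandwidth in bytes/s (hypothetical)

flush_time_s     = cap_energy_j / flush_power_w
max_buffer_bytes = flush_time_s * flush_bw_bytes_s
print(f"max safe write buffer ~= {max_buffer_bytes / 1e6:.0f} MB")   # ~625 MB
```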

                              • koverstreet 209 days ago

                                Yes, unless you did a journal flush. Default journal flush interval is 1 second because of hard drives, but on a good SSD it can be turned up to around 10 ms and still maintain those numbers.

                                • surajrmal 210 days ago

                                  NAND program operations are on the order of 100 µs to 1 ms.

                                  • surajrmal 210 days ago

                                    I see how you made that mistake now. SSDs typically have many (e.g. 10s-100s of) program operations happening in parallel. They also pack writes into units of around 16 KB or larger, so it's possible that multiple b+ tree ops land in each program.

                              • riskneutral 209 days ago

                                Isn’t this pretty traditional stuff? What is modern about it?

                                Does any of this map to a GPU for column-oriented analytical data processing? Basically, machines are only good at reading and writing large, contiguous chunks of data. As these machines evolve, the optimal width of those chunks keeps getting larger. The volume of data available is growing. And the types of operations being done on data are becoming more “analytical” (meaning column-oriented and streaming access, rather than row-oriented random access). I would expect “modern storage” algorithms to therefore be cache-friendly, column-oriented, and take the modern in-memory storage hierarchy into account (from on-chip registers, to high-bandwidth GPU-type parallel devices, to NVRAM system memory).

                                This article comes off to me like a CS101 intro doing Big-O asymptotic analysis on linked lists, without even mentioning the existence and effects of memory caches.
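
For what it's worth, a tiny sketch of the row-versus-column contrast being argued for here (in a real engine the column would be a contiguous buffer scanned with SIMD; Python's array module only hints at that):

```python
from array import array

# Row-oriented: a list of (id, value, payload) tuples.
rows = [(i, i * 2.0, "payload") for i in range(1_000_000)]

# Column-oriented: the "value" column copied into one contiguous array.
values = array("d", (r[1] for r in rows))

row_sum = sum(r[1] for r in rows)   # analytical query touches every row object
col_sum = sum(values)               # same query streams one contiguous column
assert row_sum == col_sum
```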

                                • akeck 210 days ago

                                  This reminds me of RAID 6’s use of abstract algebra.

                                  • akalin 210 days ago

                                    I wrote a blog post a while back that might be of interest: https://www.akalin.com/intro-erasure-codes . It's similar to Igor's but fills in a few more details.

                                    • jorangreef 210 days ago

                                      That's a fantastic introduction (although it's far more than an introduction!). I wrote a native addon for Node.js that does Reed Solomon, also using optimized Cauchy matrices (MIT license): https://github.com/ronomon/reed-solomon

                                      • akalin 210 days ago

                                        Very nice! I think I stumbled on your implementation when researching my post. :)

                                        • jorangreef 209 days ago

                                          Thanks. If that was in November last year, then it's changed a lot since then (and now a few times faster). Back then it was using a Vandermonde matrix as it was based on Backblaze's Java implementation, but a month ago everything was rewritten to use optimized Cauchy matrices.

                                    • gameshot911 210 days ago

                                      Link below with more info, for those who were interested like me!

                                      http://igoro.com/archive/how-raid-6-dual-parity-calculation-...
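
For anyone who wants to play with it, a hedged sketch of the usual RAID 6 P/Q construction the linked article walks through: P is plain XOR, and Q weights each disk's byte by successive powers of a generator in GF(2^8) (illustrative code, not production quality).

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with the usual 0x11d reduction polynomial."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
        b >>= 1
    return p

def pq_parity(stripe_bytes):
    """P is the XOR of all data bytes; Q weights disk i by g**i with g = 2."""
    p = q = 0
    g_i = 1                          # g**i, starting at g**0
    for d in stripe_bytes:
        p ^= d
        q ^= gf_mul(g_i, d)
        g_i = gf_mul(g_i, 2)
    return p, q

print(pq_parity([0x12, 0x34, 0x56, 0x78]))
```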

                                      • 210 days ago
                                        [deleted]
                                      • hinkley 210 days ago

                                        Reed Solomon Coding.

                                        Linear algebra. It’s so hot right now.

                                        • vanderZwan 210 days ago

                                          Until recently I worked for a molecular neurobiology research group that, among other things, used Single-molecule RNA Fluorescence in situ Hybridization (smFISH) to measure gene expression in tissue[0].

                                          Basically: design viruses with fluorescent molecules attached to them such that they attach to specific sections of RNA in the cell that are associated with particular genes. Soak tissue in viruses. Look at tissue through a microscope and count individual fluorescent dots, each of which represents one RNA molecule (I find this absolutely mind-blowing). Wash off the viruses with a specific type of chemical, and repeat with a new one for a different gene.

                                          You can only use a few colours at a time because otherwise the microscope cannot discern them. But that would severely limit the throughput - there are a lot of genes we want to check, but the tissue will also degenerate after repeated washing. So, what can we do? Well, as I've understood it, scientists using smFISH and similar techniques now use multiplexing with Hamming codes to get around that.

                                          So yeah, linear algebra is definitely so, so hot right now.

                                          [0] http://linnarssonlab.org/osmFISH/
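
A toy illustration of the coding-theory part only (gene names and codeword layout below are made up): each gene gets a Hamming(7,4) codeword, one bit is read per imaging round, and a single misread round can still be corrected.

```python
def hamming74_encode(d1, d2, d3, d4):
    """Classic Hamming(7,4): parity bits placed at positions 1, 2 and 4."""
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return (p1, p2, d1, p3, d2, d3, d4)

genes = {"GeneA": (0, 0, 0, 1), "GeneB": (1, 0, 1, 0)}
for name, bits in genes.items():
    print(name, "->", hamming74_encode(*bits))   # 7 imaging rounds per gene
```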

                                        • bogomipz 209 days ago

                                          Doesn't RAID 6 just use Reed-Solomon/Hamming codes? Is that based on abstract algebra?

                                        • jacksmith21006 210 days ago

                                          The most interesting thing I have seen for database indexes is this paper.

                                          https://www.arxiv-vanity.com/papers/1712.01208v1/

                                          Using a neural network in place of a B-tree. What is interesting is that a NN can use many processors at the same time, whereas you cannot with a B-tree.

                                          In the end it comes down to power, as in joules, to get something done.

                                          • koverstreet 210 days ago

                                            it's more about space efficiency. if a model can fit in only one or two cachelines, and get you "close enough", you may end up touching a lot less memory than if you just did a binary search.
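
A toy version of the learned-index idea from the paper upthread, combined with the point about small models: fit a tiny model (here just a least-squares line rather than a neural net) that predicts a key's position in the sorted array, then correct it with a bounded local search. All names here are illustrative.

```python
import bisect

keys = sorted(range(0, 10_000, 7))          # any sorted key set

# Fit position ~= slope * key + intercept by least squares over (key, index).
n = len(keys)
mean_k = sum(keys) / n
mean_i = (n - 1) / 2
slope = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(keys)) \
        / sum((k - mean_k) ** 2 for k in keys)
intercept = mean_i - slope * mean_k

# The model's worst-case error bounds how far the local search has to look.
max_err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))

def lookup(key):
    guess = int(slope * key + intercept)
    lo = max(0, guess - int(max_err) - 1)
    hi = min(n, guess + int(max_err) + 2)
    j = bisect.bisect_left(keys, key, lo, hi)   # search only a small window
    return j if j < n and keys[j] == key else None

assert lookup(7 * 123) == 123
assert lookup(5) is None
```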

                                          • walrus01 210 days ago
                                            • KirinDave 210 days ago

                                              This was a bad post and shouldn't be preserved. And I can still edit it. So I did. Please see below, it's a better post (even if it's bad).

                                              • truncate 210 days ago

                                                >> LSM-trees are great, but better stuff exists.

                                                Sure. My knowledge is pretty much limited to what this article talks about, so I'm interested to know what else is out there.

                                                • willvarfar 210 days ago

                                                  Fractal Trees as used in TokuDB (now owned by Percona).

                                                  The big thing in practice with TokuDB vs InnoDB is how it handles large datasets.

                                                  However, I don't know where the very latest MyRocks stands vs TokuDB.

                                                • sanderjd 210 days ago

                                                  This might be an interesting comment if it had some keywords to search for or (better) links. As is, I really don't see the point of it.

                                                  • KirinDave 210 days ago

                                                    I don't feel like it tonight. Drinking whiskey and trying not to think about how effed up the Israel footage and the North Korea situation are. Maybe that's why it's not the best post.

                                                    I don't mean to crap on the work presented. Great article, good summary, and the tech is solid. It's just older than what excites me; a lot of progress has been made in 20 years and the majority of it hasn't found commercial applications.

                                                    But here is a citation root for a lot of amazing work in this space:

                                                    https://www.dcc.uchile.cl/~gnavarro/ps/alenex10.pdf

                                                    My absolute hero Edward Kmett gave a talk stitching a lot of this work together a long time ago: https://www.youtube.com/watch?v=uA0Z7_4J7u8 . I have no idea if he's pursued it, it's just one of his many talks that left me with an incredibly lasting impression.

                                                    Variants of this technique work for arbitrary documents and structures, work better at very high volume, have cache-oblivious properties, and support transactions. Universal indexes that are reasonably good for all queries (as opposed to specific queries) are also possible. Coupled with discrimination-based techniques for O(n) table joins, there's probably a whole startup around there.

                                                    Sorry I can't do better right now.

                                                    • ovao 210 days ago

                                                      I'm curious what you mean by "cache oblivious". How can a data structure be "cache oblivious"?

                                                      • KirinDave 210 days ago

                                                        If you're optimal for a cache without knowing how big the cache is, you're optimal for all caches. It's tough to do, but it happens.

                                                        This property probably doesn't get the respect it deserves in this super weird world where you can't really say how many caches are between you and your data.
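
The textbook illustration of this property (not something from the thread) is a recursive blocked transpose: no block size is tuned to any particular cache, but the recursion produces sub-problems at every scale, so some level of the recursion fits whatever caches happen to exist.

```python
def transpose(src, dst, r0, r1, c0, c1):
    """Cache-oblivious transpose: recursively split the longer dimension."""
    if (r1 - r0) * (c1 - c0) <= 16:            # tiny base case, not cache-tuned
        for r in range(r0, r1):
            for c in range(c0, c1):
                dst[c][r] = src[r][c]
    elif r1 - r0 >= c1 - c0:
        mid = (r0 + r1) // 2
        transpose(src, dst, r0, mid, c0, c1)
        transpose(src, dst, mid, r1, c0, c1)
    else:
        mid = (c0 + c1) // 2
        transpose(src, dst, r0, r1, c0, mid)
        transpose(src, dst, r0, r1, mid, c1)

a = [[r * 8 + c for c in range(8)] for r in range(8)]
b = [[0] * 8 for _ in range(8)]
transpose(a, b, 0, 8, 0, 8)
assert all(b[c][r] == a[r][c] for r in range(8) for c in range(8))
```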

                                                        • rusbus 210 days ago
                                                          • ovao 210 days ago

                                                            Really good read, thanks. My brain couldn't quite grok the idea of a "cache-oblivious" data structure, but the article suggests they'd be more aptly described as "cache size-oblivious".

                                                        • pasbesoin 210 days ago

                                                          Thank you. That YouTube presentation was quite interesting.

                                                          • ddorian43 210 days ago

                                                            Are you connected somehow to the Israel and North Korea situations? (meaning more than a normal person)

                                                            • sanderjd 210 days ago

                                                              Great response, thanks!