Kyoto Tycoon in 2019

(charlesleifer.com)

111 points | by coleifer 223 days ago

14 comments

  • HardLuckLabs 223 days ago

    I keep hoping that the creators at FAL Labs make a triumphant return. They did post a small bugfix for a build error not too long ago, so the lights aren't completely out.

    We used kyotocabinet exclusively for a project that ran for 3 years with ~1 billion txns/day in a big AWS cluster. Totally bomb-proof and a great piece of software. There's no reason not to use it today, really. Tycoon, the server, is good, but the library kyotocabinet is multithreaded and highly performant.

    The flexibility of B+ tree or hash type stores, on-disk or in-memory with the same interface is extremely useful for building distributed systems. Very stable under load, and no data bloat as records pile up.

    edit: nobody should feel guilty for using stable software with a reliable track record. Just because a project isn't being printed on tee-shirts anymore and breathlessly fawned over during the free happy hour at conferences doesn't mean that it isn't still a perfectly useful piece of software.

    • diekhans 223 days ago

      KT doesn't compile with modern C++ compilers; it has been a real headache for us. It needs some TLC if it is to be used.

    • romwell 223 days ago

      I've used Tokyo Tyrant / Tokyo Cabinet and Kyoto Tycoon / Kyoto Cabinet in production in my previous job in 2018.

      There was a need to cache a large number of computation results, hundreds of gigabytes, over network, for a scientific computing process that could run for days.

      The cache would not fit in memory, so the KV store needed to be persistent.

      I wrote a caching layer with switchable backend, with an API inspired by KT, so that plugging Kyoto Tycoon in would be easy.
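
      Not the original code, but a minimal sketch of what such a switchable-backend caching layer might look like, assuming a KT-style get/set/remove interface (the class and method names here are hypothetical):

        # Hypothetical sketch of a switchable-backend cache with a KT-inspired API.
        # The backend interface and class names are illustrative, not the original code.
        from abc import ABC, abstractmethod
        from typing import Callable, Optional

        class KVBackend(ABC):
            """Minimal key-value interface modelled on Kyoto Tycoon's get/set/remove."""

            @abstractmethod
            def get(self, key: str) -> Optional[bytes]: ...

            @abstractmethod
            def set(self, key: str, value: bytes) -> None: ...

            @abstractmethod
            def remove(self, key: str) -> None: ...

        class ComputationCache:
            """Caches expensive computation results behind whichever backend is plugged in."""

            def __init__(self, backend: KVBackend):
                self.backend = backend

            def get_or_compute(self, key: str, compute: Callable[[], bytes]) -> bytes:
                cached = self.backend.get(key)
                if cached is not None:
                    return cached
                value = compute()              # potentially hours of work
                self.backend.set(key, value)
                return value

      Swapping TT, KT, or Aerospike in or out is then a matter of writing one small KVBackend subclass per store.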

      Tested with TT, KT, and Aerospike. In the end we went with one of TT or KT, both being stable, easy to use, performant, and able to compile on something as old as RH5 (which some of our clients were apparently still running).

      Compared to KT, Aerospike had more features (ssd optimization, sharding) when used as a KV store, but we didn't need them.

      As of last year there were not that many options in the world of persistent KV stores. Kyoto Tycoon is feature-complete and works well.

      That's why you don't hear many people talking about it: it just works, there's not much to talk about.

      And the source/API/documentation is a work of art.

      • continuations 223 days ago

        How did KT's performance compare to Aerospike's?

        • romwell 222 days ago

          Sadly, I don't recall offhand, and I don't have access to those files anymore.

          IIRC Aerospike would be faster initially because it keeps a lot in RAM, but they'd be about the same once the DB fills up and you can't get away with that. In the end, either network or disk would be the bottleneck for the workload I tried it with.

      • stakhanov 223 days ago

        KyotoCabinet is one of the most important weapons in my holster as a data scientist.

        With that kind of a specialty, there are essentially two different kinds of jobs / work environments. One is where it's all about infrastructure and implementing the database to end all databases (and "finding stuff out" is a mere afterthought to that). The other is where 100% of the focus is on "finding stuff out BY NEXT WEEK" with zero mandate for any infrastructure investment of any kind. When I find myself in the latter kind of work environment and I need to quickly get sortable and/or indexable data structures of any kind, then a key-value store is the way to go, and KyotoCabinet is a really good one that I've used LOTS and that has never let me down.

        Just don't let your boss find out about it if yours is of the pointy-haired variety. If he finds out that it's an open source project that saw its last commit some 6 years ago and has been abandoned since, he will be less than pleased. -- Personally I find that's a really bad argument. It's feature-complete w.r.t. anything I've ever wanted to do with it, and its feature set is actually way richer than most of the younger alternatives that are still being actively developed (because they are still in need of very active development, and aren't nearly as well battle-tested). Plus, what's the worst that could happen? That you will one day want to move to an environment where it will no longer work. Well, in that case it shouldn't be so hard to substitute another key-value store for this one. After all, the simplicity of the API around a key-value store, where get & set is basically all you need, should make that really easy.

        • ims 223 days ago

          > When I find myself in the latter kind of work environment and I need to quickly get sortable and/or indexable datastructures of any kind, then a key-value store is the way to go

          Interesting, can you expand on this?

          • stakhanov 223 days ago

            Cf. my answer to zzleeper's comment below.

          • zzleeper 223 days ago

            Can you tell me a bit about what use cases you had for KC as a data scientist? I sometimes work with large datasets, but I'm not sure how exactly it would fit.

            Are you basically using it as a nosql database for up to whatever your SSD drive has capacity for?

            • stakhanov 223 days ago

              Sure. Sorry for going a bit overboard length-wise in what follows; there are a lot of moving parts that interact, and I need them all to really explain my thinking around this.

              First off, most use cases don't actually fall within the "big data" family of use cases. Three reasons for that: (a) You will often work with datasets that are so small that the sheer amount of data and efficiency of processing won't present a difficulty anyway. (b) You will often be in a situation where it is workable to do random sampling of your real dataset very early on in the data processing pipeline, which will allow you to still obtain valid estimates of the statistics you're interested in while reducing the size of the datasets that need to be juggled. (c) You will often be able to do pre-aggregation. (For example: instead of each observation being one record, a record might represent a combination of properties, plus a count of how many observations have that combination of properties.)

              My strategy will be roughly as follows: A "database object" that's tabular in nature is, by default, a CSV file. An object that's a collection of structured documents is, by default, a YAML file. The data analysis is split up into processing steps, each turning one or more input files into one output file. Each processing step is a Python script or Perl script or whatever. You can get pretty far with just Python, but say there's one processing step where you need to make a computation where there's a great library to do it in Java, but not in Python; feel free to drop a Java program into the data analysis pipeline that otherwise consists of Python. Then you tie the data processing pipeline together with a Makefile.
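
              As an illustration of the pattern (this is a made-up toy, not code from a real project; the file and column names are invented), a processing step can be a tiny script that reads one CSV and writes another, exiting nonzero on failure so Make can react to errors:

                # Toy pipeline step: aggregate raw observations into (key, count) rows.
                # Invoked from the Makefile as:  python3 aggregate.py raw.csv > counts.csv
                import csv
                import sys
                from collections import Counter

                def main(in_path):
                    counts = Counter()
                    with open(in_path, newline="") as f:
                        for row in csv.DictReader(f):
                            counts[(row["country"], row["product"])] += 1
                    writer = csv.writer(sys.stdout)
                    writer.writerow(["country", "product", "count"])
                    for (country, product), n in counts.items():
                        writer.writerow([country, product, n])

                if __name__ == "__main__":
                    try:
                        main(sys.argv[1])
                    except Exception as exc:
                        print("step failed: %s" % exc, file=sys.stderr)
                        sys.exit(1)   # nonzero exit so Make treats this step as failed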

              This general design pattern has several things to recommend it:

              (1) Everything is files, which is great! If I work in a big dysfunctional corporate environment, I might be faced with this scenario: I have a bunch of databases at my disposal, like an enterprise-scale Oracle server. But waiting for signoff from managers and waiting for database admins to provision a tablespace, create a schema, make some grants, would take so long that in the same amount of time I can be half-finished implementing the whole thing with a file-based system, since I need no one's permission and no one's cooperation to just create a file on some machine. A bit further into the project, I might face a tough deadline for producing a report that depends on a heavy payload of data crunching. Going with a shared Oracle, I might find myself in a tough spot where the database server is completely hammered by what other people are doing with it, and there will be zero I can do about it to get my payload finished in time. With a file-based solution I can usually work in a compute environment where resources are less contended company-wide than on a central database server, and I can easily move from one server to another if I should need to for capacity reasons. It sounds like I'm incapable of system-level thinking and being a team player. But it's just the way real life is. DBAs are unsung heroes who don't get the resources they need. Shared resources fall victim to the tragedy of the commons. But at the same time, going into a meeting saying "I don't have the numbers today, because Oracle was slow" sounds like "the dog ate my homework" and will reflect poorly on me personally, rather than the organization, so I try not to put myself in that situation.

              (2) I like to equip my scripts with the ability to report progress and extrapolate an ETA for the computation to finish (see the small sketch after this list). If, 1% into the computation, it becomes apparent that it takes too long, I cancel the job and think about ways to optimize. I'm not saying it's impossible to do that with a SQL database, but your SQL-fu needs to be pretty damned good to make sense of query plans and track the progress of their execution etc. etc. If you have a CSV file, it might be as simple as saying "if rownum % 10000 == 0: print( rownum/total_rows )", then control-C if necessary. In practice, doing things with a database often means that you send off a SQL query with no idea of how long it's going to take, and if it's still running after a few hours you start to investigate. But that's a few hours of lost productivity. -- Things are particularly painful when the scenarios described under (1) and (2) combine. You might be used to a certain query taking, let's say, 4 hours. Today, for some reason, it's been running for 8 hours and is still not finished. You start suspecting that the database is busy with other people's payloads and give it a few more hours, but it's still not finished. Only now do you start investigating what's happening. But this sort of lost productivity is often the difference between making a deadline on reporting some numbers or something and missing it. (Think about the scenario where it's a "daily batch", and you need to go into that all-important meeting reporting on TODAY's numbers, not yesterday's.)

              (3) "Make" is a great tool for doing datascience, but in order for it to be able to work its magic you mostly have to stick to the "one data object equals one file" equation. You want parallel processing? No problem. You've already told Make about the dependencies in your data processing. So wherever there isn't dependency, there's an opportunity for parallelization. Just go "make -j8" and make will do its best to keep 8 cpus busy at all times. You want robust error handling? No problem. Just make sure your scripts have exit code zero on success and nonzero on failure. "make -k" will "keep going". So when there's an error it will still work off the parts of the dependency-graph where there wasn't an error. Say you run something over the weekend. You can come back monday, inspect the errors, fix them, hit "make" again, and it will continue and not need to redo the parts of the work that were error-free. Etc. Etc. Etc.

              Now, after this whole prelude around my philosophy of doing data-crunching pipelines in a data science context, we finally get to the point about KyotoCabinet.

              Even though you usually find that CSV or YAML is fine for MOST of the data objects in your pipeline, there will almost always be SOME where you can't be so laissez-faire about the computational side of things. Say you have one CSV file which you've already managed down to a manageable size (1M rows, let's say) through random sampling. But it contains an ID that you need to use as a join criterion. Let's say the table that the ID resolves to is biggish (100M rows). You can't really apply any of the above "tricks" to manage down the size of the second file: by random-sampling you'd end up throwing away most of the rows, and your join would, for most rows, not produce a hit, even though there would have been one to begin with which you've just decided to throw away, which would be pretty bad. So, for that file, you can't get around having it sitting around in its entirety to be able to do the join, and CSV is not the way to go. You can't have each of the 1M rows on the left-hand side of the join trigger a linear search through 100M rows in the CSV.

              Your two options for the right-hand side would be to load it into memory and join against the in-memory data structure, or to use something like KyotoCabinet. The latter is preferable for a number of reasons (a minimal sketch of the KyotoCabinet approach follows below).

              (a) Scalability. A data science project usually has a tendency for the size of data objects to get bigger over time as additional feature requests get added to the project. If a computation that you've initially implemented in-memory grows past the size where that is feasible, you're in trouble. If you go to a pointy-haired boss and tell them "I can't accommodate this additional feature request without first doing some refactoring. So I'll do the refactoring this week, then start working on your feature request next week", it sounds in their ears like "I'm going to do NOTHING this week". So I have a strong bias against doing things in-memory, so as to not put myself in that situation.

              (b) By making the data structure persistent, the computational effort that goes into producing it doesn't have to be expended over and over again as you go through development cycles, fixing errors etc. on the payload side of the computation.

              It may not even be slower in terms of performance, thanks to the mmapped I/O that KyotoCabinet and other key-value stores do.
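
              For what it's worth, a rough sketch of that KyotoCabinet-backed join, using the kyotocabinet Python binding (file names, column names and the bnum value are made up, and the API calls are from memory, so treat it as a sketch rather than a recipe):

                import csv
                import kyotocabinet

                # One-off build step: load the 100M-row table into an on-disk hash database.
                db = kyotocabinet.DB()
                if not db.open("lookup.kch#bnum=200000000",
                               kyotocabinet.DB.OWRITER | kyotocabinet.DB.OCREATE):
                    raise RuntimeError(str(db.error()))
                with open("big_table.csv", newline="") as f:
                    for row in csv.DictReader(f):
                        db.set(row["id"], row["payload"])
                db.close()

                # Join step: stream the 1M-row left-hand side and look up each ID.
                db = kyotocabinet.DB()
                if not db.open("lookup.kch", kyotocabinet.DB.OREADER):
                    raise RuntimeError(str(db.error()))
                with open("sampled.csv", newline="") as f:
                    for row in csv.DictReader(f):
                        payload = db.get(row["id"])   # hash lookup instead of a linear CSV scan
                        if payload is not None:
                            pass                      # emit the joined record here
                db.close()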

              ...so this is roughly where I'm coming from as a KyotoCabinet frequent-flyer.

              • ims 222 days ago

                I'd love to hear more about this - email is in my profile if you have a minute.

          • iten 223 days ago

            We use KT's in-memory database in a scientific computing application I work on (to store a graph 30-100GB in size accessed by 100-1000 workers). The performance is very impressive, and it's been reliable for our use case. But I don't think I'd recommend anyone use it in 2019 -- having an active community (and someone to actively maintain the software!) is too important. A small performance gain over alternatives like Redis is probably not worth the tradeoff of using software that is (sadly) abandoned.

            edit: That's not to say it is really in need of much development -- it's pretty feature-complete. But it's undergone a bit of software rot: the Debian package, for example, ships header files which fail to compile under many recent gcc versions. And the network effect is just not there. If you run into some database slowness, searching the Web for "Redis performance problem" might get you some ideas. Searching for "kyototycoon performance problem" will get you nowhere.

          • booleandilemma 223 days ago

            Fun fact: “tycoon” is a loanword from Japanese.

            https://en.wiktionary.org/wiki/tycoon

            • ohbarye 223 days ago

              That's new to me even though I'm Japanese. lol

              • escherplex 223 days ago

                Jim Breen's EDICT dictionary lists 'great' and 'old boy' as core meanings for the associated kanji. Aside from being an in-group honorific suffix, the kimi kanji is also associated with moro no kimi, a Shinto wolf spirit, which may be a good definition of a tycoon. :)

                • ramchip 223 days ago

                  Another reading of 大君 is ookimi, which is how the Emperor was called in ancient times. In this context the kanji means lord, prince.

                  • unscaled 222 days ago

                    Yes, ookimi is the common reading for 大君 today. Taikun seems to have come from the Edo period. According to several Japanese dictionaries, this was the title applied to the Shogun, especially in foreign correspondence, which I assume was primarily with China and Korea for most of that period. This title was probably more appropriate for use with Chinese or Korean speakers, who would read Shogun (將軍) as General, a purely military title. In any case, it seems that after Japan opened up to Western countries following Commodore Perry's gunboat diplomacy, the term was imported to the West referring to the Shogun.

              • contingencies 223 days ago

                Fun fact... the Japanese word is itself phonetically, semantically and philologically Chinese. The English adaptation taipan, from Cantonese, is of similar derivation, and the character the Japanese use for 'coon' (Mandarin jun) has been the traditional Chinese word for gentleman or learned lord since at least Confucius' time.

              • jgrahamc 223 days ago

                Cloudflare used to use KT. That's very true, but we removed it and replaced it with our own distributed KV store and replication mechanism called Quicksilver.

                • numbsafari 223 days ago

                  What were the engineering motivations for moving from KT to Quicksilver? Did that have to do with performance/throughput, or more with tooling and functionality?

                  • majke 223 days ago

                     There were a number of problems, but they mostly boiled down to the sheer scale of our deployment. As an engineer I would say that KT gave us surprisingly good mileage. We were able to grow the company from a couple of servers to hundreds, but at some point we just outgrew KT.

                    The main issues that triggered serious internal discussions about KT:

                     (1) Tuning the data structures was hard. Each data backend has many toggles, and it's not obvious what they actually do. I ended up writing a script to try out all the different toggles just to figure out what the disk and performance cost would be (a rough sketch of that kind of script follows after this list).

                     (2) If we decided to tweak the data backend toggles, it was not obvious how to upgrade the storage. There isn't a single command to take the KT store and rewrite it with a different optimization setting. This meant our disks stored stuff in a non-optimal format, wasting disk and memory.

                     (3) We added encryption to the replication protocol; it was fine for some time, but not perfect. If I remember right, the SSL handshake was blocking, which meant the master stalled for some time on each replica connect. This code was ours, sure, but the problem is: integrating a non-blocking SSL accept() into the KT event loop was very much non-trivial.

                     (4) The replication protocol is great (and trivial!) if the lifetime of the replicas is roughly in line with the lifetime of the master. This was not the case in our deployment. We might bring a new colo up, or just format disks in some region. In such a case, how do you bootstrap the replica? This is an unsolved problem. Sadly, you can't just rsync the data directory while keeping the KT master running: it will end up inconsistent. We had to: stop the master, rsync the disk, start the replica, start the master, and make sure the replica picks up the log in the right place. This is fine if you have two replicas, but not hundreds.

                     (5) The HA story of the "KT master" was unsatisfactory. The docs say that you should configure two masters, each replicating to the other. But I never understood how it was supposed to work in case of conflicts, clock skew, or disk failures. Think about problems in the second or third layer of replicas.
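
                     To give a flavour of the toggle-trying script mentioned in (1): something like this hypothetical sketch, using the kyotocabinet Python binding (Kyoto Cabinet accepts tuning parameters appended to the file name; the particular toggles, sizes and counts below are just examples):

                       # Compare write throughput and on-disk size across tuning settings
                       # passed via the '#key=value' suffix on the database path.
                       import os
                       import time
                       import kyotocabinet

                       SETTINGS = [
                           "bench.kch",                          # defaults
                           "bench.kch#bnum=2000000",             # more hash buckets
                           "bench.kch#bnum=2000000#msiz=256m",   # larger memory-mapped region
                       ]
                       N = 1000000

                       for path in SETTINGS:
                           if os.path.exists("bench.kch"):
                               os.remove("bench.kch")
                           db = kyotocabinet.DB()
                           if not db.open(path, kyotocabinet.DB.OWRITER | kyotocabinet.DB.OCREATE):
                               raise RuntimeError(str(db.error()))
                           start = time.time()
                           for i in range(N):
                               db.set("key-%d" % i, "value-%d" % i)
                           db.close()
                           elapsed = time.time() - start
                           size_mb = os.path.getsize("bench.kch") / 1e6
                           print("%s: %.0f writes/s, %.0f MB" % (path, N / elapsed, size_mb))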

                     Most importantly, (4) and (5) lead to data inconsistencies, which is unacceptable. Furthermore, as we grew we were constantly toggling our KT architecture, adding more and more replica layers. This only made (4) worse.

                     I personally love KT, I think it's a great piece of software. Sadly, it looks abandoned, lacking a community. I guess at some point the pains of operating it outgrew the benefits it gave us.

                    Our replacement, Quicksilver, is brilliant. Perhaps we should blog about it more.

                    • jamesog 223 days ago

                       And to add to point (4): as Marek says, KT's replication is great... to a point. When Cloudflare was small and we only had to replicate to a relatively small number of machines, it coped fine.

                       As we scaled out each data centre we quickly hit limitations in KT's ability to replicate to many clients. We ended up having to install many more replication points in each data centre or region just to allow KT to replicate to all downstream clients; otherwise it would regularly stall while trying to service all replication requests, massively slowing down replication, or sometimes even missing updates.

                       There are also two more points to be made:

                       (6) As our KC (Kyoto Cabinet) databases grew, KT had trouble keeping the file on disk consistent. Sometimes a restart of the service would leave you with a corrupted KC database, forcing us to resync the node from scratch. This was a huge operational burden - especially if corruption somehow replicated downstream.

                       (7) KT allows you to write to replicas. This one we always found crazy. You can write (or overwrite) a key on a downstream replica and that write only ever stays on the replica - it can't propagate upwards (and nor should it). Until the top-level master writes to that key again, it will retain this erroneous value. This made monitoring problematic because it would cause a key count or replication diff whenever it happened.

                • peatfreak 223 days ago

                  At a previous organization I worked at, about 5-7 years ago, I made very heavy use of Tokyo Tyrant and Tokyo Cabinet (the predecessors of Kyoto Tycoon and KC). We had many instances throughout our infrastructure, and many engineers had at least some knowledge of it.

                  It was the fastest key-value store I'd ever used. I was floored by its speed and relative stability.

                  And the article and comments rightly point out KT's terrible documentation and thin user community on the ground. TT was just the same. This caused a lot of surprising (and mostly undocumented) issues whenever somebody was trying to get onboarded.

                  I was surprised that it was so obscure and so undocumented. Especially considering its speed and simplicity (in terms of getting a basic k-v store set up).

                  I depended a lot on DBAs and DB engineers, our sysadmins, and other engineers with more experience of TT than I had. Usually, whenever I needed to understand a configuration option in detail, deploy a new instance with different characteristics, or optimize an existing instance, I would consult them. I think we had just as much, if not more, knowledge of this software than any community on the Internet.

                  But all of this raw speed, at the cost of a lack of support, documentation, and community, bit us in the ass, hard. Either the TT/TC service itself crashed, or the host it was running on crashed (I can't remember).

                  Some data was lost because it was designed with no fault tolerance. Most data was recovered from backups. An executive decision was immediately made to move to a different DB, with proper resilience, etc. We quickly moved to a well known, open source, commercially supported system with all the usual value-added proprietary parts.

                  Then the $$$$$ started pouring into the support contract. At least we had something trustworthy that we could understand, lots of documentation, and good upstream support engineers :-)

                  It's a bit sad to imagine what could have become of TT/TC. I've no idea about KT/KC. They had a LOT of potential. Maybe the potential is still there? Maybe the projects deserve resurrection, at least for educational purposes?

                  At least the source code still exists if anybody is interested enough. Documentation generally consists of reading struct definitions and comments; a couple of decent StackOverflow posts; and now-mostly defunct mailing lists on Google Groups.

                  • k-ian 223 days ago

                    I was really hoping that this was a videogame

                    • anoother 223 days ago

                      Does anyone have comparisons (even anecdata) with (L)MDB?

                      • jamesog 223 days ago

                        Cloudflare's replacement for KT that jgrahamc mentioned - Quicksilver - uses LMDB for storage.

                        I don't have numbers, but it performs great for our needs. It's heavier on the disk because of fsync, but unlike KT we never get database corruption on disk.

                        • hyc_symas 222 days ago

                          Here ya go. Quite critical of Kyoto Cabinet. https://www.anchor.com.au/blog/2013/05/second-strike-with-li...

                        • acd 223 days ago

                          When investigating key-value stores while trying to beat a NoSQL speed record, Kyoto Tycoon and LevelDB came up as low-latency alternatives.

                          http://blog.creapptives.com/post/8330476086/leveldb-vs-kyoto...

                          Seems like Kyoto Cabinet is beating LevelDB on performance?

                        • jnwatson 223 days ago

                          I used KT back in the day, but found LMDB to be much more reliable and performant.

                          • totoglazer 223 days ago

                            I was a huge Tokyo Cabinet fan back in the day. It was a big step up from BerkeleyDB for some workloads we were doing a lot of. I was always disappointed that it got so fractured across successors and saw less adoption than I felt it deserved.

                            • aasasd 223 days ago

                              The benchmark is likely limited by Python. You can't really test Redis' speed with Python, not in one process.

                              • coleifer 223 days ago

                                The purpose of the benchmark was to see the relative performance of the two databases.

                                • aasasd 223 days ago

                                  You won't see the performance of the database if the time is spent in the Python driver.
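
                                  One rough way to check that (a sketch assuming the redis-py client; counts and key names are arbitrary): compare one-round-trip-per-command timings against a pipelined batch, which takes most of the per-call Python and network overhead out of the loop.

                                    import time
                                    import redis

                                    r = redis.Redis()
                                    N = 50000

                                    start = time.time()
                                    for i in range(N):
                                        r.set("k:%d" % i, "x")   # one Python call + one round trip per key
                                    naive = time.time() - start

                                    pipe = r.pipeline(transaction=False)
                                    start = time.time()
                                    for i in range(N):
                                        pipe.set("k:%d" % i, "x")   # buffered client-side
                                    pipe.execute()                  # sent in large batches
                                    pipelined = time.time() - start

                                    print("naive: %.0f ops/s, pipelined: %.0f ops/s"
                                          % (N / naive, N / pipelined))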