SuperSQLite: SQLite library for Python (2018)

(github.com)

135 points | by jkldotio 1707 days ago

12 comments

  • dest 1707 days ago
    Interesting pick from one of the links in the article:

    "SQLite has fantastic write performance as well. By default SQLite uses database-level locking (minimal concurrency), and there is an “out of the box” option to enable WAL mode to get fantastic read concurrency — as shown by this test. But lesser known is that there is a branch of SQLite that has page locking, which enables for fantastic concurrent write performance."

    https://blog.expensify.com/2018/01/08/scaling-sqlite-to-4m-q...

    • patelajay285 1707 days ago
      Author of SuperSQLite here. SQLite is under-utilized in my opinion! Because of the name, people seem to have the misconception that it can't be used for production workloads. It's true that it generally works well for embedded use cases, but it can be used for so much more and has an extremely simple codebase without a lot of bloat.
      • agumonkey 1707 days ago
        Some guy on IRC told me they made a world-scale DNS server work because of SQLite.
      • NilsIRL 1707 days ago
        Would you recommend it on the server side as well?

        For example to store user credentials.

        • patelajay285 1706 days ago
          You could, plenty of sites do! It really depends on your write workload, but I would say it would scale pretty well for most small to medium size SaaS companies unless your product is Google scale.
    • marktangotango 1707 days ago
      When you have a write-heavy workload with multiple servers that need to write concurrently to a shared database (e.g. the backend of a website), you would probably want to choose something with a client-server model instead, like PostgreSQL.

      It's easy to get really stellar concurrent performance out of SQLite using a many-reader, single-writer model (i.e. many threads, single process). In testing we did, it easily surpassed Postgres.

      • Moto7451 1707 days ago
        I had great success using SQLite as the backing store for fast sharing and reindexing of a domain specific search engine component my company licensed. Before we developed our own algorithms and moved to Elasticsearch I was playing with augmenting the licensed component with the full text search capabilities of SQLite. The whole thing ended up being shuffled around via Gluster so I was able to offload the responsibility for sharing shards to it.

        My experience pretty much matched what you describe, and it was a great opportunity to really lean on a fabulous piece of software. Given this was many years ago, before ES was stable and when SOLR wasn't working well for us, I think it was the right choice. These days ES is good enough that if I had to do it all over I'd go straight to it. Nonetheless, I don't think SQLite should be ignored as an option when you have a high level of control over data access patterns.

      • skrebbel 1707 days ago
        I really like this idea.

        Does anyone use this seriously in production on a typical web service? I wonder how e.g. backups and stuff like that work out in real scenarios.

      • thunderbong 1707 days ago
        That's really interesting.

        I've always been a big fan of SQLite and this is the one challenge I've always faced.

        Can you give some more insights as to how you achieved that?

        • marktangotango 1707 days ago
          For reads, just create a new connection for every request (be sure to set connection properties for WAL, etc.). Then create a global (or something equivalent to global, like a singleton) connection to serve as the writer and put a mutex around it when doing writes. Easy, scales like a mofo.
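
          A minimal sketch of that pattern with the stdlib sqlite3 module (the path and pragma choices here are illustrative assumptions):

            import sqlite3
            import threading

            DB_PATH = "app.db"  # hypothetical path

            # One shared writer connection, guarded by a mutex.
            _write_lock = threading.Lock()
            _writer = sqlite3.connect(DB_PATH, check_same_thread=False)
            _writer.execute("PRAGMA journal_mode=WAL")
            _writer.execute("PRAGMA synchronous=NORMAL")

            def read(sql, params=()):
                # Fresh connection per request; under WAL, readers don't block
                # each other or the writer.
                conn = sqlite3.connect(DB_PATH)
                try:
                    return conn.execute(sql, params).fetchall()
                finally:
                    conn.close()

            def write(sql, params=()):
                # All writes funnel through the one connection, one at a time.
                with _write_lock:
                    with _writer:  # commits on success, rolls back on error
                        _writer.execute(sql, params)
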
          • firebacon 1707 days ago
            What does "mofo" mean here. Can you give us a rough estimate on the transaction rate you achieved with this setup? My own experience and all independent benchmarks I can find seem to indicate a limit of 100-1000TPS on reasonable hardware.

            Note that you can "batch" up many inserts into a transaction, which gives you a high "nominal" insert rate but still only ~100 actual transactions or so per second. To see why this is not the most useful number when comparing to a database like Postgres, consider that the limiting factor in a SQLite/Posgres design are cache flushes which outweight the costs of actually writing the data, so the number of rows per batch is mostly arbitrary; using this metric you can always claim a huge insert performance by choosing a suitable N. Also, if you do the batching, you of course loose some of SQLite's consistency/durability guarantees for your writes, which is probably fine if you didn't need them in the first place, but begets the question if an embedded ACID database is the best tool for the job at hand.

            • marktangotango 1707 days ago
              Test it yourself; you can try all the different combinations of journaling modes and synchronous modes. I'm not here to prescribe anything to anyone. Every use case is different.
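
              Something like this, roughly (a throwaway harness; table and insert count are arbitrary):

                import itertools, os, sqlite3, tempfile, time

                # Measure single-insert transactions per second for each
                # journal_mode x synchronous combination.
                for journal, sync in itertools.product(["DELETE", "WAL"],
                                                       ["FULL", "NORMAL", "OFF"]):
                    path = os.path.join(tempfile.mkdtemp(), "bench.db")
                    conn = sqlite3.connect(path, isolation_level=None)  # autocommit
                    conn.execute(f"PRAGMA journal_mode={journal}")
                    conn.execute(f"PRAGMA synchronous={sync}")
                    conn.execute("CREATE TABLE t (v INTEGER)")
                    start = time.perf_counter()
                    for i in range(500):
                        conn.execute("INSERT INTO t VALUES (?)", (i,))  # one txn each
                    tps = 500 / (time.perf_counter() - start)
                    print(f"{journal}/{sync}: {tps:.0f} TPS")
                    conn.close()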
            • skrebbel 1707 days ago
              > What does "mofo" mean here

              Motherfucker

              • no_wizard 1707 days ago
                I think they meant what makes it a positive `mofo`, if you will.
                • skrebbel 1707 days ago
                  Ah yes of course. Not deleting my comment because I think my misunderstanding is funny.
    • nbevans 1707 days ago
      In case anyone is curious, "BEGIN CONCURRENT" is what is being discussed here (page-level locks). There is also a WAL2 mode, which is basically WAL mode operating in an A/B hot-swap fashion, to facilitate checkpointing without holding up writes.

      SQLite has progressed a lot in the last few years. It is no longer advisable to mock it over SQLITE_BUSY et al; you will come unstuck on the Internet very quickly ;-)
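
      For the curious, usage on a build from the begin-concurrent branch is plain SQL. A hedged sketch from Python (stock builds reject the statement outright, and the table here is hypothetical):

        import sqlite3

        conn = sqlite3.connect("app.db", isolation_level=None)  # manual transactions
        conn.execute("BEGIN CONCURRENT")  # syntax error on a stock build
        conn.execute("INSERT INTO accounts VALUES (1, 100)")
        try:
            conn.execute("COMMIT")
        except sqlite3.OperationalError:
            conn.execute("ROLLBACK")  # page-level conflict with another writer: retry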

    • pdw 1707 days ago
      I knew Oracle offered something like that (implemented by grafting the SQLite frontend/VM on top of Berkeley DB). But it'd be really cool to see page-level locking appear in the standard SQLite version.
    • alberth 1707 days ago
      Dumb questions: if BedrockDB has all of these huge benefits over stock SQLite, why hasn't SQLite merged all of those changes in? Why does BedrockDB have to exist as a separate fork?
      • nbevans 1707 days ago
        I think it is because the changes are currently considered esoteric and slightly experimental, in the sense that they want to reserve the right to make breaking changes to them. SQLite has a relatively slow cadence when it comes to things like this. Other reasons may be that they're still working on the tests to validate and support those features, as well as documentation. I am hopeful BEGIN CONCURRENT and WAL2 will make it into the amalgamation at some future date.
        • rogerbinns 1707 days ago
          > SQLite has a relatively slow cadence

          They intend to support it until 2050 - https://sqlite.org/lts.html - which includes keeping the file format and API backwards compatible. That accounts for the more careful decision-making, since things added now have to be supported for another 30 years.

    • cheez 1707 days ago
      uh.... concurrent writes? southparkejaculation.gif

      Unfortunately, I use SQLite through Python so I'm stuck with the system version :-(

  • _pgmf 1707 days ago
    Looks like it just bundles a full build of the SQLite shared library, along with some gross hacks. It bundles pysqlite2, which notably is missing a ton of features and improvements from upstream. I'll just compile my own, thanks...
    • patelajay285 1707 days ago
      Hi @coleifer, I'm the author, and I'll disagree: I don't think they are gross hacks, just a more sensible default configuration. In fact, I used a lot of your advice from your blog post https://charlesleifer.com/blog/going-fast-with-sqlite-and-py... while making this library.

      Fair enough if you want to compile your own version, but there are a lot of people, Python devs included, who don't know how or won't go through the effort, and that's who this project is for.

      It's also a work in progress and not meant to be released yet. Shoot me an e-mail ajay@plasticityai.com if you have suggestions for how to improve it.

      • _pgmf 1707 days ago
        The advice in my post that you linked can be supplemented by this:

        http://charlesleifer.com/blog/compiling-sqlite-for-use-with-...

        It describes several handy ways to compile SQLite for use with Python (2 or 3), as a system-wide shared library or compiled directly into the Python module. This can also be applied to SQLCipher, etc.

        If you don't know how to compile SQLite, I'd argue that you have no business trying to use its more advanced features. How can you tune or optimize something you don't understand?

        Furthermore, your library is stale. Anyone using it is dependent on you to merge in upstream changes constantly. And based on what I saw, it's already well out of date.

        • patelajay285 1707 days ago
          Yes, I've seen that post as well. Thanks! It's good advice.

          >> If you don't know how to compile sqlite, I'd argue that you have no business trying to use it's more advanced features. How can you tune or optimize something you don't understand?

          I'm not sure I agree with that; some people understand the advanced features but have a miserable time with compilers and compiling something like SQLite. There are a lot of great engineers with Python and SQL expertise who just don't know how to compile a C project. That might sound contradictory, but it's just something a lot of engineers don't do day to day, since so many people are working in scripting languages (Node/Python/Ruby).

          For example, there are probably a lot of people who want to use the JSON1 or FTS5 modules with SQLite in Python. That doesn't require advanced knowledge, but it does require them to re-compile SQLite!
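
          As a sketch, you can check what a given build was compiled with, and with FTS5 present no extra Python code is needed (the CREATE below fails on builds without the module):

            import sqlite3

            conn = sqlite3.connect(":memory:")

            # PRAGMA compile_options lists the optional modules in this build.
            opts = {row[0] for row in conn.execute("PRAGMA compile_options")}
            print("ENABLE_FTS5" in opts, "ENABLE_JSON1" in opts)

            # With FTS5 compiled in, full-text search works out of the box.
            conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
            conn.execute("INSERT INTO docs VALUES ('the quick brown fox')")
            print(conn.execute("SELECT body FROM docs WHERE docs MATCH 'fox'").fetchall())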

          >> Furthermore, your library is stale. Anyone using it is dependent on you to merge in upstream changes constantly. And based on what I saw, it's already well out of date.

          It is stale, I agree, but it's still a WIP; this was posted here early. My plan is to pull in changes from the upstream sqlite3 module at the latest CPython 2.7 tag and the latest CPython 3 tag.

  • patelajay285 1707 days ago
    Hi, I'm the author and founder at Plasticity (YCS17), and I built this for some internal use cases at our startup. This is actually a work in progress and not meant to be released yet, but it seems someone found it online. Given the interest, I will re-submit it here as a Show HN when it is ready; shoot me an e-mail at ajay@plasticityai.com if you would like to be kept up to date!
  • nabdab 1707 days ago
    This looks like a dead project that just bundles together the built-in sqlite3 and another wrapper library, APSW, which itself is more active than this project. Why is this on HN?
    • stedaniels 1707 days ago
      No commits in 9 months, "looks like a dead project".
    • patelajay285 1707 days ago
      Hi, I'm the author! It's not a dead project, it's just not released yet, someone seems to have submitted it early :). Will post it back here when it's ready for prime time.
    • airstrike 1707 days ago
      > Why is this on hn?

      Because anyone can submit links?

    • typon 1707 days ago
      We need comments like yours; let the downvotes do the rest.
  • tony 1707 days ago
    Any features in this library you'd like to see in the standard library's sqlite3 [1]? Maybe a PEP [2] (Python Enhancement Proposal) could do it.

    [1] https://docs.python.org/3/library/sqlite3.html

    [2] https://www.python.org/dev/peps/pep-0001/

    • mehrdadn 1707 days ago
      > Any features in this library you'd like to see in the standard library's sqlite3 [1]? Maybe a PEP [2] (Python Enhancement Proposal) could do it.

      Not sure if it's in this library, but I'd want the ability to open a database that's already loaded in memory as a byte array.
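
      For what it's worth, this later landed in the stdlib: Python 3.11 exposes SQLite's serialize/deserialize API, which is exactly this capability. A sketch ("app.db" is a placeholder):

        import sqlite3

        src = sqlite3.connect("app.db")
        blob = src.serialize()            # the whole database as a bytes object

        conn = sqlite3.connect(":memory:")
        conn.deserialize(blob)            # open those in-memory bytes as a live database
        print(conn.execute("SELECT name FROM sqlite_master").fetchall())
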

    • nerdponx 1707 days ago
      In the standard library? Probably nothing.

      But if someone published an alternative APSW wheel with JSON1, ICU, and FTS5 enabled, I'd be happy.

      • rogerbinns 1707 days ago
        I'm the APSW author. The binary builds for Windows are distributed with those extensions all compiled in, although my doc needs some updating. It is also only a single flag for other platforms to include all extensions during compilation. What can I do?
        • no_wizard 1707 days ago
          What's the holdup with getting it via pip install as a wheel? I'm curious (happy APSW user here); it's not well explained from what I could find. This is something I would love to help with, but I'm not sure where to start, as I'm not sure where in the build process things fall down.

          Would Cython help, perhaps?

        • nerdponx 1707 days ago
          First of all thanks for producing APSW and providing good documentation.

          The problem for my case is that if I want to distribute anything with a Sqlite+JSON1 workflow I also have to host and distribute a version of sqlite3 or apsw with those flags enabled. If that's already done on Windows then great, but many people I'm sure (myself included) would like a cross-platform solution.

          It would be great if we could use the Setuptools "extras" feature to select which flags we want enabled. Not sure what the technical requirements for that are.

          Alternatively, just having an "everything enabled" version of the package would, I think, be good for most people who don't need such fine control over their project dependencies. If anything, someone who cares that much about keeping the library minimal is probably in the minority and should be free to disable extensions as needed.

          • rogerbinns 1707 days ago
            I'm very happy to take this general approach, but someone else needs to figure out exactly what needs to be done to get a package that way. Discussion is welcomed on python-sqlite mailing list.
    • dragonsh 1707 days ago
      Maybe just create a new PEP to replace sqlite3 with it.
  • johnisgood 1707 days ago
    > SQLite is extremely reliable and durable for large amounts of data (up to 140TB). It is considered one of the most well-engineered and well-tested software solutions today, with 711x more test code than implementation code.

    I keep seeing this statement. Why is it considered one of the most well-engineered pieces of software?

    • firebacon 1707 days ago
      About the well-tested bit: per its own documentation, SQLite has a massive test suite. Not all of the test suite is actually open source, though, so the overlap between commenters selling you on how well tested SQLite is and those who have seen how the sausage is made is probably close to zero.

      However, pointing this or any of SQLite's other practical shortcomings out on Hacker News is blasphemy and will invariably get you downvoted into oblivion by people who (apparently) never ran into them.

      • mytherin 1707 days ago
        Part of the SQLite test suite is open source (the logic tests), but unfortunately the biggest part is not. You can purchase a license for these tests, though, so saying nobody has seen them is likely incorrect.

        Either way, the fact that it is DO-178B certified for use on aircraft should tell you something about its reliability. It is certainly more reliable than most other software, OSS or otherwise.

        I am curious what practical problems you encountered with SQLite. Could you perhaps expand on that?

      • smush 1707 days ago
        What are some other shortcomings of SQLite? I'm genuinely curious even though the below probably sounds like some text from the ~Rust~ Sqlite evangelism strike force.

        With the default settings, which I semi-affectionately refer to as paranoid mode, an untuned database can start to show worse performance after 500,000+ records. Then things like indexes, RediSQL, and WAL mode start being necessities rather than just best practices.

        But if you set your pragmas correctly and so on, SQLite scales up just fine. I haven't done a large-scale SQLite deployment, simply due to caution around needing to use a 'real' database in production like SQL Server, Postgres, Maria, etc. SQLite is excellent for ephemeral databases to be created, seeded with test data, run tests against, and deleted in repeatable automated testing.

        Based on the link to the Expensify article, it sounds like SQLite can scale up even better than SQL Server under some circumstances.

        But I have barely tried using it in production because of that aforementioned caution. What are some pitfalls to watch out for?

        • zzzeek 1707 days ago
          The elephant in the room with SQLite is that they refuse to support most forms of ALTER TABLE, that is, the ability to change the structure of a schema.

          Their rationalization for this is that, given the way SQLite stores data, this is more efficiently performed by simply creating a new table and copying all the rows from the old table into the new one, and that one would want to batch all the table changes together rather than emitting individual ALTER statements.

          From the POV of people who make tools like my own Alembic migrations, this is an annoyingly insufficient answer, because the logistics of recreating table schemas and copying data over are much more complicated than just emitting an ALTER directive. I'd like it if SQLite could at least be extended with a third-party "ALTER" plugin that would run the intricacies of copying the tables around under the hood, rather than pushing this out to the tool creators. For SQLite's use cases, it doesn't really matter in most situations that ALTER would be inefficient.

          Instead, my users bugged me for years to solve this problem, and I have to maintain this thing https://alembic.sqlalchemy.org/en/latest/batch.html which I mostly hate completely.
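
          For reference, the rebuild dance that a batch migration automates boils down to something like this (hypothetical "user" table; real migrations also have to recreate indexes, triggers, and foreign keys, which is where it gets hairy):

            import sqlite3

            conn = sqlite3.connect("app.db")
            conn.executescript("""
                BEGIN;
                CREATE TABLE user_new (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
                INSERT INTO user_new (id, name) SELECT id, name FROM user;
                DROP TABLE user;
                ALTER TABLE user_new RENAME TO user;
                COMMIT;
            """)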

          SQLite's typing model is also very idiosyncratic and is based on a naming-convention approach, which I think most users of SQLite don't understand very well, because it works in a completely strange way: it looks for substrings inside the completely arbitrary names you can assign to types. I can create a column with the datatype ELEPHINT and that is a legitimate datatype which will store integers. There's also the "INTEGER PRIMARY KEY" / "INTEGER PRIMARY KEY AUTOINCREMENT" silliness, but that's a relatively mild API wart compared to the things MySQL does all over the place, I suppose.
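
          That claim is easy to verify; the substring "INT" in the declared type gives the column INTEGER affinity:

            import sqlite3

            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE t (x ELEPHINT)")  # "INT" -> INTEGER affinity
            conn.execute("INSERT INTO t VALUES ('42')")  # insert text...
            print(conn.execute("SELECT x, typeof(x) FROM t").fetchone())
            # (42, 'integer') -- the text was coerced on the way in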

        • firebacon 1707 days ago
          It's hard to point those out without coming across as a hater, partly because they might seem so obvious. Trying...:

          With the way transactions and durability work in SQLite, a normal setup will not do more than on the order of ~100 transactions per second without adding additional layers of complexity. That is completely fine for its intended use case of being an embedded, lightweight database, and of course is not due to bad engineering decisions but due to inherent tradeoffs in how transactions are handled.

          So basically it's all fine until you try to use SQLite for something for which it is not a good fit, like a large volume of inserts. Other solutions exist that are conceptually better equipped for this use case out of the box (Postgres/MySQL/Elastic/etc). Not a shortcoming of SQLite per se, but rather a practical shortcoming that comes up when you try to use it for the wrong job.

          Case in point is TFA, which takes the theoretical SQLite database size limit (140TB) and runs with it, claiming that is the useful limit on how much data you can store in SQLite. LOL! That value is where SQLite page IDs start overflowing, not an estimate of how much data can be usefully handled in a single SQLite database.

          SQLite is, by the nature of what it is, not a good fit for big data volumes. Try loading 100TB of data into SQLite and running even a single query. Even assuming you have a hyper-fast SSD, a single query could take days to complete! But it still gets mentioned as a potential solution for that problem every so often. In this post, even...

          So you could consider those practical shortcomings or ill-advised usage; it depends on the definition/perspective, I guess. At any rate, SQLite, even if generally excellent, is not the panacea it sometimes gets sold as here on HN. No database is; there are just too many tradeoffs involved.

      • ritchiea 1707 days ago
        Why are portions of the test suite closed source?
        • firebacon 1707 days ago
          They charge money for it...

          Note that I'm not saying all of the test suite should be open source. It's clearly a valid/cool business model, and SQLite is, for a lot of use cases, an excellent piece of software that I have often used myself. The criticism was honestly directed more at the cherry-picking of facts that sometimes happens in discussions on Hacker News, especially those related to SQLite.

    • PudgePacket 1707 days ago
      Not a direct answer but it's certainly the most used, which correlates highly with getting a lot of dev attention :)

      https://sqlite.org/mostdeployed.html

      • firebacon 1707 days ago
        That completely depends on the chosen metric/definition, though. For example, you could consider the most used database to be the one that handles the most queries per day; it's not so clear who the winner is then. The claim that SQLite is one of the 5 most widely deployed pieces of software worldwide is completely unsubstantiated, and most likely untrue once you count things like ICU, the Linux kernel, etc.
        • quietbritishjim 1707 days ago
          > ICU, the linux kernel, etc.

          It seems unlikely that either of these are more widely deployed than SQLite, especially the Linux kernel.

          Probably the vast majority of deployments of the Linux kernel include SQLite. Add in all the non-Linux devices (e.g. the billions of iOS and Windows devices) and that's probably far more than the number of Linux installations without SQLite.

          I don't know much about ICU, and perhaps you're right that it's one of the most used bits of software. But I can certainly imagine embedded devices that never communicate with users and wouldn't need ICU but do need to store data.

          • firebacon 1707 days ago
            My point was less about arguing which is more popular, but more that the same claim can equally be made for at least hundreds if not thousands of other software packages. Still, SQLite stands out to me as the only package I can recall that consistently puts this message all over the documentation front and center. Which is definitely still a useful signal, even if we both agree that it is, in fact, widely used.
        • sigzero 1707 days ago
          Are you deliberately misreading the linked page?

            Most Widely Deployed and Used Database Engine
          
          That last part is rather important.
          • firebacon 1707 days ago
            No, I'm referring to the second section.

            > MOST WIDELY DEPLOYED SOFTWARE MODULE OF ANY KIND? -- SQLite is probably one of the top five most deployed software modules of any description. Other libraries with similar reach include: zlib... blah blah blah

            It stops just short of claiming it's the most deployed software in the world...

        • typon 1707 days ago
          Barring an actual study with data, I don't think their claim is that far fetched.
    • hanche 1707 days ago
      A small (?) part of the answer is likely to be found in the massive amount of testing: https://sqlite.org/testing.html
    • platz 1707 days ago
      I would like not to have to deal with SQLITE_BUSY errors for once. It even throws when trying to obtain a connection. It got so bad I had to put a mutex around obtaining a sqlite connection.
      • breakingcups 1707 days ago
        It's possible you are using it wrong.
        • firebacon 1707 days ago
          The less flippant explanation is that SQLite can only handle a single writer at any time and when you try to access it with two concurrent writers (or a concurrent reader and writer in some modes) it will by default return a "BUSY" error instead of blocking.

            So, if you were getting unexpected "BUSY" errors then, yes, you would be using it incorrectly. However, AIUI, you are always expected to see some amount of BUSY errors during normal, concurrent operation and have to deal with them explicitly. So the fact that you're seeing BUSY errors alone doesn't mean you're doing anything wrong...

          To use SQLite correctly from multiple processes, you have to do one of two things:

            - Add explicit code to retry on BUSY errors everywhere you do SQL queries
              - Serialize all access to the database, e.g. by using a mutex
          
          GP appears to have chosen the second option.
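
            A sketch of both options with the stdlib module (helper names are made up; note that connect()'s timeout argument already installs SQLite's internal busy handler):

              import sqlite3, time

              def connect(path):
                  # timeout: how long SQLite retries internally before raising
                  # "database is locked" (SQLITE_BUSY).
                  return sqlite3.connect(path, timeout=5.0)

              def write_with_retry(conn, sql, params=(), attempts=5):
                  # The first option above: explicit retry with backoff.
                  for attempt in range(attempts):
                      try:
                          with conn:  # commits on success, rolls back on error
                              return conn.execute(sql, params)
                      except sqlite3.OperationalError as exc:
                          if "locked" not in str(exc):
                              raise  # a real error, not BUSY
                          time.sleep(0.05 * (attempt + 1))
                  raise sqlite3.OperationalError("still locked after retries")
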
    • patelajay285 1707 days ago
      Hi, I'm the author of this library. This is a pretty common opinion; the SQLite code base is rigorously tested for performance and correctness.

      See: https://www.sqlite.org/testing.html and https://www.sqlite.org/hirely.html

    • mkj 1707 days ago
      It's well thought through and commented. Read for example the commented design decisions of https://github.com/sqlite/sqlite/blob/master/src/pager.c or os_unix.c
    • tempguy9999 1707 days ago
      https://sqlite.org/about.html

      The bit about testing.

  • Gys 1707 days ago
    It supports 'Remote Streaming over HTTP' without explaining what that means. Maybe someone here knows?
    • niea_11 1707 days ago
      From what I saw in the source code, it's a feature that lets SQLite open a database file stored on a web server. There is a class in the code, HTTPVFS [1], that proxies SQLite's filesystem operations to HTTP requests.

      [1]:https://github.com/plasticityai/supersqlite/blob/01e54bbb829...

    • patelajay285 1707 days ago
      I'm the author; yes, @niea_11's comment is right. We add an HTTP Virtual File System, so you can stream SQLite files over a static HTTP server like Amazon S3, using HTTP Range headers to do it efficiently. No dynamic web software needed.
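
      The conceptual core is small: translate SQLite's "read N bytes at offset O" into ranged GETs against static storage. A sketch (hypothetical URL, using the requests library):

        import requests

        def vfs_read(url, offset, length):
            # One page read becomes one ranged GET against static storage.
            headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
            resp = requests.get(url, headers=headers)
            resp.raise_for_status()
            return resp.content

        header = vfs_read("https://example-bucket.s3.amazonaws.com/db.sqlite", 0, 100)
        print(header[:16])  # a SQLite file starts with b'SQLite format 3\x00'
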
    • egorfine 1707 days ago
      Got me, too
  • ddorian43 1707 days ago
    Title is misleading. It needs something like `python-library` added. I thought they had supercharged SQLite itself.
    • patelajay285 1707 days ago
      Hi, I'm the author; see my parent comment in the thread. This is still a WIP. You're right that it's mostly a Python library, but when it's released, it will come with pre-built native static SQLite libraries that can be linked into any C application, so it actually is supercharged SQLite itself :).
  • jgalt212 1707 days ago
    > SQLite is extremely reliable and durable for large amounts of data (up to 140TB).

    This has not been our experience. Our experience is that it sort of bogs down around 12GB in file size. Linux box, ext3/4, spinning disk (not SSD).

    • patelajay285 1707 days ago
      Hi @jgalt, I'm the author of this library, and we've been using it at Plasticity (YCS17) to serve extremely large files, over ~130GB, of AI model data! Shoot me an email at ajay@plasticityai.com if you'd like to chat, but it should remain extremely responsive even at that file size.
      • jgalt212 1706 days ago
        Thanks, I may take you up on that after running some new tests. It was years ago (approx. 3-4) that we had performance issues, so we'll see if time has healed all wounds.
  • d_burfoot 1707 days ago
    I've been toying with the idea of using SQLite as a data exchange format instead of JSON or XML. I can't stand navigating through complex JSON trees, I want to just use SQL queries. Has anyone else tried this?
    • sametmax 1707 days ago
      Works well if your data is big and doesn't have to be read on a web browser.
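
      A sketch of the idea with a made-up schema: the producer ships a .sqlite file, and the consumer queries it instead of walking a tree.

        import sqlite3

        # Producer side: ship a .sqlite file instead of a JSON document.
        out = sqlite3.connect("export.sqlite")
        with out:
            out.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
            out.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                            [(1, "alice", 9.99), (2, "bob", 24.50)])
        out.close()

        # Consumer side: one SQL query replaces tree navigation.
        conn = sqlite3.connect("export.sqlite")
        print(conn.execute("SELECT customer, total FROM orders WHERE total > 10").fetchall())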
  • mtw 1707 days ago
    How accurate is this statement? "SQLite is faster than nearly every other database". It links to a page that's decades old.
  • trollied 1707 days ago
    The title would be better as "SuperSQLite: a supercharged SQLite library for Python".