Ten years of improvements in PostgreSQL's optimizer

(rmarcus.info)

299 points | by samaysharma 13 days ago

11 comments

  • darksaints 12 days ago
    As someone who has worked with Postgres for 15 years and also has spent most of my career modeling and solving mathematical optimization problems, I have so much to say on this topic, but I'll leave it at these three points:

    * Every optimization problem needs cost data, and the more accurate and complete that data is, the better. Postgres has made some improvements here, especially with cross-column statistics, but there are still massive improvements left on the table. The most glaring omission is data on syscall latencies. Reading a page from disk has dramatically different latencies from system to system, yet Postgres still relies on configuration values for these costs when it could very easily be measuring them. Another omission is foreign key statistics. Joins along foreign keys should never have a bad plan, but they still occasionally do.

    * Deferred or alternative scenario planning should be adopted, especially for large and expensive queries. As it is today, your plan is finalized before it is executed, even though earlier stages of execution could provide information (like rowcounts or cardinality estimates) that could dramatically improve later stage plans.

    * Machine learning could most definitely be an area for improvement, but I've been unimpressed with the efforts that I've seen. Don't use machine learning for planning; use machine learning for cost discovery and estimation. Build better cost models, and then let the optimization engine work with that data.

    • RMarcus 12 days ago
      I'm curious to hear a bit more of your opinion. For example, I'm surprised that syscall latency is something near the top of your list. I think the usual wisdom in the DB community is that the cost models are mostly fine, but the cardinality estimation is really bad.

      In terms of the deferred / alternative planning, do you think adaptive query execution is a reasonable way to achieve this? It certainly allows for information early in the query execution to impact later plans. My worry with these approaches is that if you get the first couple of joins wrong (which is not uncommon), unless you have something like Yannakakis/SIPs, you still can't recover.

      I am obviously biased on the whole "ML for query optimization" thing. One thing I would note is that every "ML for planning" approach I've seen does, under the hood, use ML for cost discovery/estimation. These approaches are just trying to balance the data they collect (exploration) with the quality of the plans they produce (exploitation). Interestingly, if you use ML in a way that is completely removed from planning, you actually get worse query plans despite more accurate estimates: https://people.csail.mit.edu/tatbul/publications/flowloss_vl... (again, I've got a horse in this race, so my opinion should come with a side of salt :D)

      • aeyes 12 days ago
        As a DBA, I have a very hard time properly tuning parameters like random_page_cost, and the default of 4.0 is no longer appropriate for most database servers.

        I don't want to be tuning this; it takes a lot of time to do properly, and I haven't retested it in a long time. I just set something that has worked in the past, which is probably bad.

        I completely agree that Postgres should be able to figure this out on its own. And this is just one example; there are more such parameters that should be adjusted to the hardware, but most people will leave the defaults.
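
        For context, these are the knobs in question. A minimal sketch of the usual SSD adjustment (the 1.1 below is a common rule of thumb, not a measured value):

            SHOW seq_page_cost;       -- default 1.0
            SHOW random_page_cost;    -- default 4.0, still assumes spinning disks

            -- Commonly lowered for SSD/NVMe storage:
            ALTER SYSTEM SET random_page_cost = 1.1;
            SELECT pg_reload_conf();  -- picked up without a restart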

      • darksaints 11 days ago
        Syscall latency was a much bigger deal back when we had spinning disks, but it still matters today (you can get dramatically different plans depending on slightly different costs of pulling a page from disk), and I find it silly that we've never even measured those costs. A bigger impact might be for SQL functions...all of the Postgres SQL functions have configured cost values, but they could easily be measured. A simple cost model for functions can also be a dramatic oversimplification. For example, some PostGIS functions have O(n) or O(n^x) behavior depending on the size and complexity of the input geometry. If we could measure exact costs, or model costs with statistical distributions, or possibly predict them with ML, that would be a huge improvement.
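
        For what it's worth, those function costs are just hand-set constants today. A hedged sketch (the function name and numbers are made up for illustration):

            -- A function's planner cost is a single static constant, in units of
            -- cpu_operator_cost (default 100 for SQL/PL functions, 1 for internal/C).
            CREATE FUNCTION slow_check(x int) RETURNS boolean
                LANGUAGE sql IMMUTABLE
                COST 10000          -- one fixed number, regardless of input size
                AS $$ SELECT x > 0 $$;

            -- The only way to correct it later is by hand:
            ALTER FUNCTION slow_check(int) COST 500;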

        My opinion on ML is that there is nothing on the execution planning side that couldn't be modeled and solved as a linear program, with extremely fast and mathematically optimal results. By trying to use ML for the planning part, you're really just using ML to reverse-engineer LP solvers, which is a poor use of compute resources.

        The reason some ML planners might get better results than typical SQL query planners is that typical SQL engines are optimized towards OLTP workloads that require small transactions executed very quickly. In order to do that, they purposefully don't explore the true planning space...they might explore 3-10 alternative ways of executing, whereas there might be hundreds or thousands of ways to do the same thing. While Postgres has explicitly chosen to not implement planning pragmas to override planner behavior, it would be really cool if you could have multiple planners optimized for different types of workloads, and be able to explicitly choose a query planner that takes 3 seconds to plan a query that takes an hour to execute and for which a better plan could save several minutes. I would even love a fairly naive query planner that does index scans to get deterministic, exact cardinalities before planning joins.

        BTW, I really like your blog and your research focuses. You're working on exceptionally hard problems that have a huge impact on computing.

        • anarazel 11 days ago
          Where I think ML would be much better than what we (postgres) do is in iteratively improving selectivity estimation. In today's postgres there's zero feedback from noticing at runtime that the collected statistics lead to bad estimates. In a better world we'd use that knowledge to improve future selectivity estimates.

          > In order to do that, they purposefully don't explore the true planning space...they might explore 3-10 alternative ways of executing, whereas there might be hundreds or thousands of ways to do the same thing.

          FWIW, postgres' planner often explores many more plan shapes than that (although not as complete plans; different subproblems are compared on a cost basis).

          > While Postgres has explicitly chosen to not implement planning pragmas to override planner behavior, it would be really cool if you could have multiple planners optimized for different types of workloads,

          FWIW, it's fully customizable by extensions. There's a hook to take over planning, and it can still invoke postgres' normal planner for queries it doesn't handle. Obviously that's not the same as actually providing pragmas.

      • adgjlsfhk1 12 days ago
        My guess is that the reason for putting syscall latency high on the list is that it should be easy to fix. Cardinality tracking is a hard problem, but running a loop at install time that measures the cost of a couple dozen syscalls really could be done automatically.
      • srcreigh 12 days ago
        Awesome paper! Might be my cue to delve into the world of cost estimates. My naive brain thinks, how can it be so hard :-)
    • zeroimpl 12 days ago
      > Deferred or alternative scenario planning should be adopted, especially for large and expensive queries. As it is today, your plan is finalized before it is executed, even though earlier stages of execution could provide information (like rowcounts or cardinality estimates) that could dramatically improve later stage plans.

      Alternative planning sounds awesome. I saw a query plan the other day where the planner expected about 1000 rows from some subquery, so it did a nested loop against an index scan. In reality there were about a billion rows. I’m not sure yet why the estimate was so bad, but being able to pivot from a nested loop to a hash join if the row count crossed some threshold would be great for avoiding some catastrophic plans.
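
      For anyone who hasn't chased one of these down: the mismatch shows up directly in EXPLAIN ANALYZE as the planner's row estimate next to the actual row count. A hypothetical sketch (table names and numbers invented):

          EXPLAIN (ANALYZE, BUFFERS)
          SELECT *
          FROM orders o
          JOIN order_items i ON i.order_id = o.id
          WHERE o.created_at > now() - interval '1 day';

          -- Nested Loop  (cost=0.56..8421.10 rows=1000 width=64)
          --              (actual time=...  rows=1000000000 loops=1)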

    • da_chicken 12 days ago
      > Another omission is foreign key statistics. Joins along foreign keys should never have a bad plan, but they still occasionally do.

      I'm curious what you mean. Postgres, like many RDBMSs, does not automatically create an index on a FK. But you have to know that already.

      Is it join sequencing?

      • Sesse__ 12 days ago
        Having a foreign key should probably be a hint to the planner that it should create a cross-table correlation estimate on the source and destination columns. This is about join cardinality estimation, not having an index.
      • darksaints 11 days ago
        It's not about indexing, though most casual users don't know that FKs are not indexed automatically. Postgres should be smart enough to know when an index is appropriate for a foreign key, but I respect the fact that the core team wants people to make explicit well-reasoned decisions to index instead of doing it automatically.

        Regardless of indexing, what actually matters is cardinality estimates. Foreign keys are almost always a reasonable hint to the SQL engine that there will be a large number of joins along them, so the engine should be able to provide cross-table cardinality estimates.
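
        Concretely, the single-table half of this already exists as extended statistics (PostgreSQL 10+); the cross-table half doesn't. A sketch, with illustrative names:

            -- Extended statistics: correlations between columns of ONE table.
            CREATE STATISTICS order_stats (ndistinct, dependencies)
                ON customer_id, region_id FROM orders;
            ANALYZE orders;

            -- There is no equivalent spanning two tables, e.g. nothing that tells
            -- the planner how orders.customer_id relates to customers.id beyond
            -- basic per-column stats, which is the gap described above.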

    • nextaccountic 12 days ago
      Do you think that MSSQL is better on this front?
  • DoubleFree 12 days ago
    The postgres query optimizer will try to minimize the number of pages read from disk (and the number of intermediate pages written to disk). Benchmarking the query optimizer by making the shared buffers large enough to hold all the data therefore seems wrong, as you're then measuring the speed of the query optimizer and the join processor, instead of the quality of the generated query plans. It would not surprise me if the generated plans for these versions are actually all the same and this is only measuring execution speed.
    • Sesse__ 12 days ago
      No, it optimizes cost, which includes pages read from disk _and_ things like CPU use. Cost is an arbitrary unit meant to correlate with time spent, not with disk loads, so it's completely reasonable to compare plans with everything loaded in RAM. (Cost is by convention scaled such that one page read from disk is 1.0, but that's a different thing from “the optimizer will try to minimize the number of pages read from disk”. It could just as well have been scaled so that 1.0 was 1 ms on some arbitrary machine.)
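
      For reference, the default cost parameters make that scaling explicit (the values in the comments are the documented defaults):

          SHOW seq_page_cost;         -- 1.0    (the unit: one sequential page fetch)
          SHOW random_page_cost;      -- 4.0
          SHOW cpu_tuple_cost;        -- 0.01   (per row processed)
          SHOW cpu_index_tuple_cost;  -- 0.005  (per index entry processed)
          SHOW cpu_operator_cost;     -- 0.0025 (per operator or function call)
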
    • RMarcus 12 days ago
      It is certainly possible that the plans are similar, and that improvements to the execution engine are being measured. The join order benchmark was designed to test optimizer quality. It is worth noting that in addition to trying to measure the number of pages read from disk, the PG optimizer also tries to reduce the number of tuples examined by the CPU, the number of predicate evaluations, etc. All these numbers are rolled up into a "cost," which is the function that the optimizer minimizes.

      It is also true that measuring cold cache and warm cache performance can produce different results, and this experiment is certainly in the warm cache scenario. But, the cold cache scenario suffers from the problem you mention as well: an improvement to PG's B-tree that saves a few IOs will dominate any kind of CPU-based improvement (at least at the data size of the join order benchmark).

      FWIW, the plan for the query with the P90 latency changes from a plan that uses loop and merge join in PG8.4 to a plan that uses hash join in PG16 (where it is no longer the P90 query), which is at least some evidence of optimizer improvements.

      • petergeoghegan 12 days ago
        > It is certainly possible that the plans are similar, and that improvements to the execution engine are being measured. The join order benchmark was designed to test optimizer quality.

        I don't think that it's possible to test optimizer quality in isolation -- not really, not if it's to be in any way practical. Many (most?) individual improvements to the optimizer are hopelessly intertwined with executor enhancements. This is usually fairly obvious, because the same commit changes both the optimizer and the executor together. Sometimes it's much less obvious, though, because it's more a case of the optimizer and executor coevolving.

        It's probably still the case that a lot of the improvements seen here are pure executor enhancements, but I can't say that I'm very confident about that. (As the main person behind those B-Tree IO savings you might expect me to have more confidence than that, but I find it horribly difficult to tease these issues apart while keeping the discussion high level/relevant.)

  • BitPirate 12 days ago
    The author mentions PostgreSQL's JIT compiler. To this day, I've only seen it degrade the performance of queries. Disabling it is on my install checklist.
    • baba92 12 days ago
      A customer had the worst performance issues after switching to Postgres. Weirdly, it only happened in the Docker and test-server setups and not on the developer's machine (he ran Postgres via Homebrew).

      Turns out Homebrew had installed Postgres without JIT support: one query ran in 200ms on the developer's machine, while with JIT it took 4-5 seconds. It took some time to figure out, since I'm not a heavy Postgres user, but since then I've always disabled JIT and never looked back.

    • LunaSea 12 days ago
      The JIT compiler is great for analytical queries.

      You can also configure the thresholds for JIT activation in PostgreSQL if you want to raise the bar at which the JIT kicks in.
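
      For concreteness, those thresholds are ordinary settings keyed off the plan's total estimated cost (the commented values are the documented defaults; the last lines are just examples):

          SHOW jit;                      -- on (default since PostgreSQL 12)
          SHOW jit_above_cost;           -- 100000
          SHOW jit_inline_above_cost;    -- 500000
          SHOW jit_optimize_above_cost;  -- 500000

          SET jit_above_cost = 1000000;  -- only JIT genuinely expensive plans
          -- or simply: SET jit = off;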

      • refset 12 days ago
        For analytical queries note that you really have to learn how to express the queries efficiently with Postgres - unfortunately the optimizer is still missing lots of tricks found in more sophisticated engines [0][1][2]. JIT compilation alone can't get close to making up for those gaps.

        [0] https://pganalyze.com/blog/5mins-postgres-optimize-subquerie...

        [1] https://duckdb.org/2023/05/26/correlated-subqueries-in-sql.h...

        [2] "Unnesting Arbitrary Queries" https://cs.emis.de/LNI/Proceedings/Proceedings241/383.pdf

        • Sesse__ 12 days ago
          Postgres isn't really hot on implementing the latest stuff from academia (most of what they do seems to be fairly home-grown, not least because some of it was never really written about before they started doing it). Arbitrary unnesting is certainly one of the optimizations they don't do; fancy sketches and other estimation techniques generally likewise. But what they're really good at is polish. A thousand small tweaks and optimizations to iron out issues for 0.1% of queries. It tends to have fairly few sharp edges, fewer than any other RDBMS I've used.
          • fuy 5 days ago
            Thomas Neumann and Alfons Kemper's papers can hardly be called "home-grown"; their stuff has been implemented in multiple industrial systems. The Postgres optimizer, though, is very bad even with simple correlated subqueries, let alone "arbitrary" ones, so it'd be useful to have at least some version of unnesting.
      • dur-randir 12 days ago
        But PG is much more commonly used for OLTP than OLAP. I'm baffled to this day that they enabled it by default.
        • LunaSea 12 days ago
          OLAP is a very common use case for PostgreSQL as well.
    • masklinn 12 days ago
      Yeah, pg’s JIT is a pretty good demonstration that LLVM is not great at JITs, worsened by Postgres not having a good persistent shared query cache.

      It would probably be less detrimental if it could perform asynchronous compilation for future queries (which really is closer to how normal JITs actually behave, at least for optimising backends)

      • MaxBarraclough 12 days ago
        > LLVM is not great at JITs

        That seems like a generalisation. It's true LLVM will never be a lightweight toolkit, but if you want to generate highly optimised code it seems like a solid choice, assuming you're doing relatively 'vanilla' code-gen work.

      • SigmundA 12 days ago
        PG just needs to do like the other databases and cache query plans, also add hints please.
        • masklinn 12 days ago
          > PG just needs to do like the other databases and cache query plans

          The inline blocking compilation step will still fuck your query.

          > also add hints please

          I’d much rather they allowed bypassing the planner entirely and had an interface to feed plans directly to the executor.

          • RMarcus 12 days ago
            PostgreSQL has coarse-grain query hints (like `enable_hashjoin`), and the excellent `pg_hint_plan` extension allows you to specify a complete or partial execution plan: https://github.com/ossc-db/pg_hint_plan
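
            A sketch of what that looks like in practice (table names invented; the hint comment follows pg_hint_plan's documented syntax):

                /*+ Leading((a b)) HashJoin(a b) SeqScan(a) */
                SELECT *
                FROM accounts a
                JOIN balances b ON b.account_id = a.id;

                -- Built-in, coarse-grained alternative: steer the whole session.
                SET enable_nestloop = off;
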
          • SigmundA 12 days ago
            >The inline blocking compilation step will still fuck your query

            Only on the first execution; the long plan/JIT time is usually only an issue when running a fast query over and over again.

            However, if plans were cached, you could also plan without JITing, then JIT in the background and update the cached plan for next time, which would be even smarter.

            >I’d much rather they allowed bypassing the planner entirely and had an interface to feed plans directly to the executor

            Other databases have plan locking, where you can load an existing plan from somewhere else in serialized form (XML) and force the database to use it. But most of the time I find that a few hints in the query source solve almost all issues with the planner.
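
            Postgres' closest built-in equivalent is per-session only. A sketch under that caveat (statement name and query are illustrative):

                -- Prepared statements reuse a plan within one session; since PG 12
                -- you can force the cached generic plan instead of re-planning.
                PREPARE get_user (int) AS
                    SELECT * FROM users WHERE user_id = $1;

                SET plan_cache_mode = force_generic_plan;
                EXECUTE get_user(42);
                EXECUTE get_user(43);   -- no fresh planning here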

    • nextaccountic 12 days ago
      Can't postgres JIT a query once and run the compiled query multiple times?
      • jeltz 5 days ago
        Nope, because nobody has implemented relocatable symbols for the JIT, so plans cannot be reused.
  • uhoh-itsmaciek 13 days ago
    Neat, but Postgres version numbering changed with v10. 9.6, 9.5, 9.4, 9.3, 9.2, 9.1, 9.0, 8.4, 8.3, 8.2, 8.1, and 8.0 are effectively all distinct major versions. It'd be interesting to see how perf changed with those.
    • guiriduro 12 days ago
      But at least they maintained the courtesy of major-version filesystem compatibility for an extended period (from v9.0 to 9.6), allowing faster in-place upgrades by simply changing the binaries. Possibly that held them back, but yearly upgrades that require more downtime/reindexing aren't much fun, which is possibly the reason why many sites skip upgrades until previous versions drop out of support (esp. AWS RDS users). While the alternative of a logical replication upgrade (from v10 onward) has availability benefits, it's a major project with unavoidable costs and significant risks unless schemas are relatively simple.
      • ioltas 12 days ago
        The bar to merge a patch that would break on-disk format is so high that it’s really impossible to reach it ;)

        More seriously, we care a lot about the format of the 8k pages to ease upgrade paths as much as possible.

    • RMarcus 12 days ago
      I totally agree -- I picked the latest version for each major version using the semver interpretation of the version numbers, which is not how PostgreSQL has traditionally done major version numbers (i.e., PG 8.2 and 8.1 are different major versions, but I interpreted the version numbers as if they were minor versions).

      The main reason I did this was to reduce the number of versions I had to test, but I agree a more complete analysis would test each (true) major version.

  • uhoh-itsmaciek 13 days ago
    >Of course, not all of these improvements are attributable to the query optimizer.

    It would be interesting to see plan changes, if any, across versions.

  • nsilvestri 13 days ago
    Proebsting's Law comes to mind: https://proebsting.cs.arizona.edu/law.html
    • fear91 12 days ago
      The nice thing about compiler optimizations is that you can improve the performance of existing CPUs without physically touching them. Year by year. You squeeze more out of the machine someone designed. It adds up.

      Imagine what environmental impact you would have if you optimized Python's performance by 1%? How much CO2 are you removing from the atmosphere? It's likely to overshadow the environmental footprint of you, your family and all your friends combined. Hell, maybe it's the entire city you live in. All because someone spent time implementing a few bitwise tricks.

      • hnben 12 days ago
        > Imagine what environmental impact you would have if you optimized Python's performance by 1%?

        I imagine this would also increase the complexity of the Python interpreter by more than 1%, which in turn would increase the number of bugs in the interpreter by more than 1%, which would burn both manpower and CPU cycles on a global scale.

        (I assume that optimizations that reduce complexity are already exhausted, e.g. stuff like "removing redundant function calls".)

        • fear91 12 days ago
          This is an assumption that a reasonable person naively comes up with.

          Then if you actually go ahead and check, it turns out it's not true! It's quite a shocking revelation.

          When you dig into popular compilers/runtimes (with the exception of things like LLVM), many of them still have low-hanging fruit of the form:

          a = b + c - b

          Yes, the above is still not fully optimized in the official implementations of some popular programming languages.

          Also an optimization of "removing redundant function calls" isn't a binary on/off switch. You can do it better or worse. Sometimes you can remove them, sometimes not. If you improve your analysis, you can do more of that and improve performance. Same for DSE, CSE, etc...

          • Sesse__ 12 days ago
            In many languages, you can't just optimize away the + b - b willy-nilly, as there could be side effects and non-obvious interactions. For instance, in JavaScript, where every number is a 64-bit double, a + b - b is definitely not the same as a for large enough or small enough a or b. The same is true for floats in LLVM, certainly.
            • fear91 12 days ago
              It's an example of how trivial some of this low-hanging fruit is; the above is a concrete case that I have personally implemented (arithmetic on in-register 64/32-bit integers). You can get into semantics and restrictions, but I think the point I'm raising is clear.
    • devjab 12 days ago
      Why? That law seems to say that software performance improvements aren't relevant, while this article shows that the improvements to Postgres have been significant.

      Is it because you view the 15% as a low number? Because in context it really, really isn't. It's certainly smaller than the 60% from your linked law, especially if you do the whole 15/10 thing, but you really shouldn't compare Postgres's performance to hardware increases because they aren't the same. You would need absolutely massive amounts of hardware improvement to match even a 1% improvement on what is being measured here.

      I don’t think the law is as silly as other posters here make it out to be, but it’s talking about programming language compilers. I wouldn’t compare something as unimportant as that to what is arguably one of the most important things in computer science… considering how much data storage and consumption mean to our world.

    • cpeterso 12 days ago
      In this case, the researcher built each PostgreSQL version using the same GCC version (13.2) and tested on the same OS.
    • waterproof 12 days ago
      That seems like a very weak “law”. Is it meant as a joke? The supporting evidence is just numbers pulled apparently from nowhere (“let’s assume…”), and the conclusion is wildly off base. They seem to imply that if optimization improves the performance of much of the world’s software by 4% each year, then it’s a waste of time.

      Moore’s law is the only comparison given. I wonder what the comparative expense is to develop faster and faster hardware vs. to keep improving compilers. Depending on how the ROIs compare (in dollars per percent gained, let’s say), maybe this “law” is worth some weight.

      OTOH, the Postgres article in question seems to show diminishing returns from optimization, which belies the whole premise of the “law”, which assumes the gain is consistent from year to year. And it might prove Proebsting’s implied point about optimization being a bad investment over the long term!

  • mehulashah 12 days ago
    I’m a bit confused by the analysis here. How do you confirm a downward trend in the data that is not visible on the graph? It looks like the median drops a little in the first few versions, but then rises in the last few. The R² is really low, so I’m not sold on the correlation here. Basically, it looks like tail latency has improved and YMMV for everything else.
    • RMarcus 12 days ago
      I'm the author of the blog post.

      "Tail latency has improved by YMMV for everything else" => yes, I think that's a valid (but conservative) read. Of course, in many (most?) applications, tail latency is very important. Tail latency also tends to be the thing that optimizer engineers target (i.e., reduce the runtime of the longest running query).

  • edweis 12 days ago
    What does query optimisation look like? Does it optimize at the SQL level or at the algorithm level?
    • magicalhippo 12 days ago
      For the databases I've used, which excludes PostgreSQL, most optimization happens at the algorithmic level, that is, selecting the best algorithms to use for a given query and in what order to execute them.

      From what I gather this is mostly because various different SQL queries might get transformed into the same "instructions", or execution plan, but also because SQL semantics doesn't leave much room for language-level optimizations.

      As a sibling comment noted, an important decision is if you can replace a full table scan with an index lookup or index scan instead.

      For example, if you need to do a full table scan and do significant computation per row to determine whether a row should be included in the result set, the optimizer might replace the full table scan with a parallel table scan and then merge the results from each parallel task.

      When writing performant code for a compiler, you'll want to know how your compiler's optimizer transforms source code into machine instructions, so you prefer writing code the optimizer handles well and avoid writing code the optimizer turns into slower machine instructions. After all, the optimizer is programmed to detect certain patterns and transform them.

      Same thing with a query optimizer and its execution plan. You'll have to learn which patterns the query optimizer in the database you're using can handle and generate efficient execution plans for.

      • Sesse__ 12 days ago
        Postgres is a fairly standard System R implementation. You convert your SQL into a join tree with some operations at the end, and then try various access paths and their combinations.
    • BenoitP 12 days ago
      It enumerates all the ways the SQL could be executed, then chooses the fastest plan. For example: if you're looking for the user row with user_id xx, do you read the full table and then filter it (you have to look at all the rows)? Or do you use a dedicated data structure (an index lets you do it in time logarithmic in the number of rows)?

      A lot more can be done: choosing the join order, choosing the join strategies, pushing filter predicates down to the sources, etc. That's the vast topic of SQL optimization.
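
      To make the index-versus-full-scan choice above concrete, a hypothetical example (table, index, and cost numbers are invented):

          EXPLAIN SELECT * FROM users WHERE user_id = 42;
          --  Index Scan using users_user_id_idx on users  (cost=0.43..8.45 rows=1 ...)

          DROP INDEX users_user_id_idx;
          EXPLAIN SELECT * FROM users WHERE user_id = 42;
          --  Seq Scan on users  (cost=0.00..180000.00 rows=1 ...)
          --    Filter: (user_id = 42)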

      • abhishekjha 12 days ago
        Is there a more general reading for software engineers?

        Seems like jumping right into the code can be a bit overwhelming if you have no background on the topic.

    • mdavidn 12 days ago
      At a very high level, the query planner's goal is to minimize the cost of reading data from disk. It gathers pre-computed column statistics, like counts of rows and distinct values, to estimate the number of rows a query is likely to match. It uses this information to order joins and choose indexes, among other things. Joins can be accomplished with several different algorithms, like hashing or looping or merging. The cheapest option depends on factors like whether one side fits in working memory or whether both sides are already sorted, e.g. thanks to index scans.
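
      A quick way to see the raw material the planner works from (table name is illustrative):

          ANALYZE users;   -- (re)collects the per-column statistics

          SELECT attname, n_distinct, null_frac, most_common_vals
          FROM pg_stats
          WHERE schemaname = 'public' AND tablename = 'users';
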
    • paulddraper 12 days ago
      The query optimizer chooses which algorithms will provide the results requested by the SQL.
  • morepork 12 days ago
    Site seems to be down, you can try this instead: https://web.archive.org/web/20240417050840/https://rmarcus.i...
  • CodePhoenix 11 days ago
    I'm not an expert in developing databases, but I understand that creating an excellent optimizer is quite challenging. I've been following your blog, and I believe it will help me deepen my knowledge of databases.
  • throwaway984393 13 days ago
    It would arguably be more accurate to compare the actual systems running Postgres and their precompiled binaries. Nearly everything else in a given system running a given version of an application may impact its performance in subtle ways. Its development occurred in conjunction with these other system components, and thus it was optimized for them. In addition, distributions tend to ship their binaries with patches not found upstream, which can also affect performance. Everything from the kernel and scheduler to glibc and more may affect performance. Newer isn't always faster. Even the containerization you now use might add a subtle penalty that an older non-containerized version may not experience.

    Basically, it's theoretically possible that some old ass system running version 8 runs it faster than your current system runs version 8 (excluding hardware).

    Add to that all the other usual caveats about benchmarks, etc. (like the fact that IO wasn't tested here but is pretty dang relevant to real-world performance).