25 comments

  • Diggsey 1618 days ago
    I have also found the lack of transactional guarantees in typical job queues to be very problematic.

    One problem with using PostgreSQL in this way (using either advisory locks or LOCK FOR UPDATE) is that it requires you to keep an open connection to the database whilst the job is being worked on.

    For a MySQL database, this would be just fine, but PostgreSQL uses a process-per-connection model which caps the number of active connections to the database to a relatively low number (on the order of 1000x fewer connections than a similarly sized MySQL instance) and tools like PgBouncer do nothing to help with this.

    As a result, if your jobs take more than a few milliseconds to execute (let's say you make external HTTP requests as part of your job) this is not a good approach to take.

    I use a similar approach which avoids this problem, but it only works because I have relatively low throughput requirements. I essentially implement in-database advisory locks using a separate table - before taking a job, workers create a row in the table, and the primary key of this table is used as a worker ID. Jobs are "taken" by assigning them a worker ID. Each row in the worker table has an expiry date, so if workers die, the corresponding row will be deleted and any linked jobs released back into the queue.
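
    Roughly, the scheme looks something like this in SQL (table and column names here are illustrative, not my actual schema):

      -- register a worker; the primary key doubles as the worker ID
      -- (a live worker keeps pushing its expires_at forward)
      INSERT INTO workers (expires_at) VALUES (now() + interval '5 minutes') RETURNING id;

      -- take a job by assigning it the worker ID
      UPDATE jobs SET worker_id = $1
      WHERE id = (SELECT id FROM jobs WHERE worker_id IS NULL LIMIT 1 FOR UPDATE)
      RETURNING id, payload;

      -- periodic sweep: delete expired workers and release their jobs
      DELETE FROM workers WHERE expires_at < now();
      UPDATE jobs SET worker_id = NULL WHERE worker_id NOT IN (SELECT id FROM workers);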

    As well as transactional guarantees, using a database as a job queue gives you a lot of power over how jobs are executed: for example, our service for delivering webhooks has a separate queue per customer, and we can ensure that within a single queue jobs are processed strictly in order. Meanwhile, our service for search indexing supports different priority levels, so that newly created records are indexed with a higher priority.

    • bgentry 1618 days ago
      I elaborated a bit on this elsewhere (https://news.ycombinator.com/item?id=21537414) but Que’s design has changed significantly and no longer holds open a connection or transaction for the duration of working a job. It holds one database connection per worker process. Each worker process handles all job locking and assignment to the individual worker threads in that process, each of which only use connections that they themselves decide to open.

      Otherwise, I completely agree on the benefits of transactional job enqueueing. It lets you push off so much complexity until you actually need it (when you’ve scaled to the point that a single database no longer handles your needs well). Only at that point do you have to take on the challenges you would otherwise have been solving all along: jobs that run while the transactions they depend on have not committed (yet, or ever). I believe this model is almost always the right starting point for a web application, barring some unusual job requirements or massive initial scale.

      • ianai 1618 days ago
        Is this what you’re talking about?

        https://github.com/que-rb/que

      • Diggsey 1618 days ago
        That is a very neat solution. The one downside I see is that you need to be much more careful about gracefully handling errors within workers.

        For example: if a single worker thread crashes but the process doesn't realise it, it may be possible for jobs to become stuck, because other workers rely on the process releasing the job back to the queue. It's definitely solvable but something to be aware of.

      • grumpy8 1618 days ago
        > I believe this model is almost always the right starting point for a web application, barring some unusual job requirements or massive initial scale.

        Would you run this job queue on the same postgres database as the rest of the application or rather use a different one specific for workers?

        • bgentry 1618 days ago
          One of the main benefits of this model is the ability to transactionally enqueue any jobs that are being triggered at the same time as other changes your app is making. Also the ability to transactionally apply changes to your database at the same time as marking the job completed.
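
          Concretely, something like this (hypothetical table names), where the job row and the rest of your changes commit or roll back together:

            BEGIN;
            INSERT INTO orders (user_id, total) VALUES (42, 9.99);
            INSERT INTO jobs (job_class, args) VALUES ('SendReceipt', '{"order_id": 123}');
            COMMIT;  -- either both rows exist afterwards, or neither does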

          You only get this benefit if you’re doing everything in a single Postgres database. By the time you outgrow that setup, you may as well move to a more dedicated job queue or something built on Redis, because you’ve already lost the benefits of having a single system.

          I suppose you could maintain the queue in a separate Postgres database, but without sharing ACID with the app I’m not sure what you’d gain vs another system.

      • zahreeley 1618 days ago
        Good job
    • z92 1618 days ago
      // One problem with using PostgreSQL in this way (using either advisory locks or LOCK FOR UPDATE) is that it requires you to keep an open connection to the database whilst the job is being worked on. //

      Not necessarily. You lock the row just long enough to update a field - say "status" - to "running", then disconnect from the database within milliseconds. Finish your job taking as much time as you want, and then connect again to change the row's status to "finished".

      This is how it's always designed, in my experience. The locking problem is only for querying rows where status = "waiting" and instantly changing them.

      It's not "keep the record locked, and the DB connection up, until I finish my batch job". That would be a bad design.
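
      In other words, the only locking is around the short claim query, something roughly like (illustrative schema):

        -- claim: lock briefly, flip the status, disconnect
        UPDATE jobs SET status = 'running', started_at = now()
        WHERE id = (SELECT id FROM jobs WHERE status = 'waiting' LIMIT 1 FOR UPDATE)
        RETURNING id, payload;

        -- ...do the actual work without holding a connection...

        -- reconnect and mark it done
        UPDATE jobs SET status = 'finished' WHERE id = $1;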

      • jimktrains2 1618 days ago
        What happens when the worker processing the job dies and never updates the status?
        • shrimpx 1618 days ago
          An easy way to implement this is to have a "heartbeat" column that the worker updates every N seconds in a thread. A periodic cron-like job reaps jobs that missed M heartbeats. It's possible a worker was alive but unable to update the heartbeat due to, e.g., a temporary network partition. It's also possible the worker's job execution thread crashed while the heartbeat thread kept chugging, causing the job to remain unfinished indefinitely. You can minimize the probability of these failures with various client-side logic, though: if the worker can't update the heartbeat due to network errors it cancels itself, and you can structure workers so that you can detect stuck/crashed job threads and put the job in a "failed" state, or just exit and allow the job to be reaped.
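
          A rough sketch of the two halves (made-up column names):

            -- worker, every N seconds:
            UPDATE jobs SET heartbeat_at = now() WHERE id = $1;

            -- reaper cron, releasing jobs that have missed M heartbeats
            -- (here M * N is hard-coded as 30 seconds):
            UPDATE jobs SET status = 'waiting', worker_id = NULL
            WHERE status = 'running' AND heartbeat_at < now() - interval '30 seconds';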

          But come to think of it, these two classes of problems also occur with systems that hold a db lock for the duration of the job. If the worker loses its connection it needs to somehow cancel itself unless the work is idempotent and computation waste doesn't matter. And if the job crashes you need to make sure you release the lock.

          Btw to add another related point, databases do have a lock timeout that you have to worry about if you hold a lock for the duration of the job. Your job execution time cannot exceed the lock timeout.

        • bgentry 1618 days ago
          It helps to reference Que's schema [1] and source code to explain this further. But I'm also going from memory so it's possible I will miss some details :)

          * If the entire worker process dies, then it will lose its Postgres connection which is holding an advisory lock on the jobs being worked. This releases those jobs to be worked by another worker. I don't recall how the built-in retry & back-off mechanisms work in this scenario. This advisory lock is indeed held for the entire time the jobs are being worked on, but only from a single supervisor connection (rather than one connection per job).

          * If the job thread crashes, the worker supervisor catches this and the job is marked for a retry at a later time.

          [1]: https://github.com/que-rb/que/blob/master/lib/que/migrations...
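
          For reference, the basic advisory-lock claim looks roughly like this (a generic sketch, not Que's actual query):

            -- grab the first job whose advisory lock we can take
            SELECT id, args FROM jobs
            WHERE pg_try_advisory_lock(id)
            LIMIT 1;
            -- session-level advisory locks are released automatically if the
            -- connection dies, so a crashed worker implicitly frees its jobs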

        • drabinowitz 1618 days ago
          You can include a locked_at field and have your update query target not_started rows, plus started rows where locked_at is older than the job timeout.
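
          Roughly (illustrative, with the job timeout hard-coded):

            UPDATE jobs SET status = 'started', locked_at = now()
            WHERE id = (
              SELECT id FROM jobs
              WHERE status = 'not_started'
                 OR (status = 'started' AND locked_at < now() - interval '10 minutes')
              LIMIT 1 FOR UPDATE
            )
            RETURNING id;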
          • shrimpx 1618 days ago
            A global job timeout might be unreasonable with high variance in workload. Eg some jobs taking 0.5 seconds and others 30 seconds. You might set a global timeout of say 60s but it sucks to wait 59.5s to reap that short job whose worker crashed. A better system is to make workers update a timestamp on an interval and you reap any jobs that haven't been updated in N seconds.
            • pas 1617 days ago
              It's a trade off between updates per sec and latency.

              Maybe simply using a timeout per job type is a better way. (That of course trades off simplicity.)

              • shrimpx 1617 days ago
                I agree. Frequency of updates also becomes more of an issue as you add workers. Say you have 1000 workers each updating every 2 seconds. That's ~500 timestamp update statements per second which is not trivial in terms of added load on the DB.
        • ukd1 1618 days ago
          Some of these systems also check whether the worker running the job is still alive and, if it isn't, kill the transaction so the job restarts. You can also do this by time: when was the job locked, and how long does it have to complete before you nuke it? Of course there are edge cases, but you have to choose how to deal with this; just leaving it means doing it manually.
    • ukd1 1618 days ago
      I think the biggest bonus for me of transactional queues like this is knowing that either everything worked - the inserted/updated/deleted things, and the enqueueing of any jobs - or nothing did. While using Resque, or anything else outside of Postgres, occasionally things would roll back but the jobs had already been sent. This sucks: you either get errors or unexpected behaviour, or have to code around it.

      FYI, we've been using QueueClassic at Rainforest for a while. Not sure how many actual jobs we do, as we've reset at least once when moving hosts, but we're currently at 849202793 jobs (~850m) through QC.

      [edit; looking at the last failed job from ~24h ago, we're doing 2.5-3m jobs per day]

      • faizshah 1618 days ago
        Did you guys ever evaluate Que? What made you choose QueueClassic?
        • ukd1 1618 days ago
          We did, but I think it came out after we started. It was faster at the time, and when we looked at it, it was also well written - we'd probably have used it if it had been out, but we didn't really care about the speed - we're not latency-sensitive for what we're doing. Fast forward a little, and we now maintain QueueClassic, and it's now comparable / slightly faster.
          • ukd1 1618 days ago
            Maybe chanks recalls differently, but I think the backstory was that he saw / was interested in this, but wanted to use newer features that had come out in Postgres. QueueClassic's first release was early 2011; it wasn't fast to support newer Postgres stuff, mostly because it was being used a lot (inside Heroku) and wanted to maintain support for existing users. I think (I might recall this wrong) that Chris ended up writing Que ~2013 because QC was too slow to update / didn't want to do these features!
  • bgentry 1618 days ago
    The author of this post, Chris Hanks, created the Que queueing library for Ruby: https://github.com/que-rb/que

    It’s changed significantly since this post as the 1.x betas use a very different structure which should actually be more efficient, use fewer Postgres connections, cause less lock contention, and cause less table bloat.

    Not sure if the benchmarks have been run recently or not but I’m definitely curious how things stack up to this post from 6 years ago :)

    • ukd1 1618 days ago
      I actually ran them in July 2019 against QueueClassic using the Que benchmark;

        queue_classic jobs per second: avg = 1879.1, max = 2072.1, min = 1779.4, stddev = 128.3
        que jobs per second: avg = 1500.5, max = 1550.8, min = 1405.2, stddev = 58.1
      
      Full deets: https://github.com/QueueClassic/queue_classic/pull/303#issue...
    • aldoushuxley001 1618 days ago
      Wish I could use that in a Django app. There doesn't seem to be a viable Python queueing library that uses PostgreSQL.
      • subleq 1618 days ago
        I wrote django-postgres-queue for this purpose. It uses postgres transactions to keep queue and application state in sync. It also uses SKIP LOCKED to avoid some of the typical issues with using a database as a queue.

        https://github.com/gavinwahl/django-postgres-queue

        • elnygren 1618 days ago
          Thanks for making this, I've been using it to help me send emails, push notifications etc. to 20-30k users in a Django project :)
        • aldoushuxley001 1617 days ago
          That actually does look like exactly what I need. Beautiful. Thank you! Great work.
      • elamje 1618 days ago
        Recently I was wishing for this myself. I’ve been using rq, which, although I hear it is much faster than an RDBMS, is also another dependency and service to worry about failing. Trying to keep things simple is nice; I am already having trouble trying to grok what kind of weird state I can have in my application if Redis goes down, or my DB goes down but Redis remains
        • codetrotter 1618 days ago
          > I am already having trouble trying to grok what kind of weird state I can have in my application if Redis goes down, or my DB goes down but Redis remains

          If it is possible to set up a smaller test environment with one or a few instances of your DB and Redis, and you have time to do it, you could do some testing where you purposely shut down each of them at various points in time and inspect what happens to your application state compared to how it behaves when all is well.

          Perhaps you could even outfit the testing version of your application with two proxies, one that will proxy the DB connection and one that will proxy the Redis connection, and have these proxies randomly decide within some threshold whether to pass the data along to their upstreams, or to simulate a broken connection. Then test the application with some different threshold values, for example, 1% probability of failure, 5% probability of failure, 50% probability of failure and 100% probability of failure. Repeat the tests some number of times for each threshold value and observe the behavior of the application each time. Also, make sure that you log the decisions made by the proxies each run, so that you are able to look at these afterwards.

          • ukd1 1618 days ago
            I mean, I love testing, but this is a great example of why you should just use less things if you can. Combinations of failure modes aren't fun to get right.
          • nicoburns 1618 days ago
            docker-compose is excellent for these kind of testing setups.
      • michelpp 1618 days ago
        There is a Python library that uses SKIP LOCKED https://github.com/malthe/pq
      • StavrosK 1618 days ago
        Yeah, it would be nice if Dramatiq supported it. django-postgres-queue sounds promising, though.
      • dimino 1618 days ago
        rq is pretty good, have you checked it out?
  • bloody-crow 1618 days ago
    One of the interesting unforeseen downsides of RDBMS-based queues is explored here: https://brandur.org/postgres-queues

    TL;DR: the lock time grows with the number of dead tuples in the table, which naturally grows as you use long-running transactions.

    • bgentry 1618 days ago
      While this is a great article and was completely accurate as of the time it was written, I believe it predates SKIP LOCKED, as well as some of the more recent design changes in Que which minimize the likelihood of this being an issue: https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...
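
      With SKIP LOCKED, a claim query can simply skip over rows another worker already has locked rather than queueing up behind them, e.g. (a generic sketch, not Que's actual query):

        SELECT id, job_class, args FROM jobs
        WHERE run_at <= now()
        ORDER BY priority, run_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED;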

      In particular, Que’s new design locks jobs in a single connection per worker process, not a connection per-job. Job assignments are handled in an in-memory/unlogged table. Jobs are also automatically assigned to a free worker upon being enqueued via LISTEN/NOTIFY and an unlogged table of available workers. And jobs are locked in batches by each worker process, not one-at-a-time. So polling becomes much less frequent and is far more efficient when it happens.

      This does not mean that it’s suddenly ok for jobs to take out transactions which run a long time (that is usually a bad idea in a production database) but it does substantially minimize the scenarios where these problems might occur.

      • bloody-crow 1618 days ago
        TIL. This is very useful. I believe 9.4 was the latest version around the time the article was published.

      I'm glad the state of things has improved, since I really love database-backed job queues and all the transactional guarantees they give you.

    • olefoo 1618 days ago
      If you are experiencing this problem, look at your autovacuum settings; Postgres (versions > 9.0) can take per-table settings for autovacuum.

      Percona has a good article on this https://www.percona.com/blog/2018/08/10/tuning-autovacuum-in...
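
      For example, a queue table can be given much more aggressive settings than the global defaults (the values here are just a starting point):

        ALTER TABLE jobs SET (
          autovacuum_vacuum_scale_factor = 0.0,
          autovacuum_vacuum_threshold    = 1000,
          autovacuum_vacuum_cost_delay   = 0
        );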

    • Exuma 1618 days ago
      Is that fixed by the VACUUM process?
      • bloody-crow 1618 days ago
        No, because VACUUM can't kill those dead tuples while the transaction is still running. Think of it this way...

        When you open a transaction, you need a guarantee that you can touch rows that existed at the moment the transaction started. Your job queue is chugging along and processes, let's say, 1000 jobs per minute. Processing a job involves deleting the row from the queue, but since you have a transaction running, Postgres only marks the row as deleted and keeps it around in case a transaction wants to access it at some point. Each time you need to process a job, Postgres needs to lock a row. The way this mechanism works involves iterating over the rows until you find one you can lock. If your transaction has been running for 30 minutes, each job would have to iterate through 30k dead rows (deleted, but still around for the sake of the transaction). The growing lock time leads to overall degraded performance of the job queue, which leads to jobs being added faster than they're being processed, which further exacerbates the problem.

        • biggestdecision 1618 days ago
          I wonder if this could be solved by moving in-progress jobs to a separate table...

          Guess you lose job atomicity that way.

      • taesko 1618 days ago
        It naturally grows due to long running transactions because the vacuum process cannot clean dead tuples that a transaction can "see". Though more aggressive vacuum settings might solve this problem.
  • ukd1 1618 days ago
    This post is pretty out of date; Que and QueueClassic moved to SKIP LOCKED at some point, over using advisory locks / lock head methods. It's much faster, but only supported in Postgres >= 9.5.
  • oftenwrong 1618 days ago
    Why use a RDBMS as a queue? Because you already have it, and it works well. One day you may need a purpose-built component, but today YAGNI.

    http://boringtechnology.club/

    • pas 1617 days ago
      Breaking up the monolith is hard. Yes, ignoring YAGNI is terrible - over-engineering makes everything brittle and hard to test and so on - but at the same time I've burnt myself with spaghetti obelisks more than with the too-many-Docker-containers approach.

      That said, using a common persistence store initially makes sense, but trying to compartmentalize the jobqueue/batch-processing stuff never hurts.

  • bloody-crow 1618 days ago
    This is an article from 2013. I suggest reflecting this in the HN title, because it's not immediately obvious.
  • ukd1 1618 days ago
    There is also QueueClassic (https://github.com/QueueClassic/queue_classic), another Ruby-based project. Recently I benchmarked some improvements we made to it vs Que (before and after) - it's now faster, which is kinda cool - https://github.com/QueueClassic/queue_classic/pull/303#issue...
  • kraih 1618 days ago
    The Minion job queue is another implementation of this idea. But using the even more efficient and safe FOR UPDATE SKIP LOCKED. https://github.com/mojolicious/minion
  • voldacar 1618 days ago
    I'm not very knowledgable about db internals so sorry if this comes off as ignorant, but in an era where cpus execute billions of instructions per second per core, is 10000 jobs per second supposed to be impressive? Is this kind of problem bottlenecked by memory?
    • qaq 1618 days ago
      By IOPS - you fsync the transaction to the write-ahead log (WAL) on commit.
      • Noumenon72 1616 days ago
        Is this supposed to explain why 10000 per second is impressive? I don't follow.
        • qaq 1614 days ago
          This is supposed to explain that it's orthogonal to CPU performance.
  • golergka 1618 days ago
    Seems related to this one from a week ago: https://layerci.com/blog/postgres-is-the-answer/
  • cwalv 1618 days ago
    Another potential benefit of using an RDBMS as a queue is that it can make it much simpler to express priorities in cases where it's not a simple FIFO, i.e., if the next job a consumer should take depends on more than just the time the job was added to the queue.

    One place this has come up for me is when the next job that's picked depends on currently running jobs, e.g., each job is associated with a user, and if a single user already has N tasks running you may want to prioritize another user's tasks for the N+1 slot.
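
    For instance, a "fair share" pick can be expressed directly in the claim query, something like (a sketch with a made-up schema):

      -- prefer users with the fewest jobs currently running
      SELECT j.id FROM jobs j
      WHERE j.status = 'queued'
      ORDER BY (SELECT count(*) FROM jobs r
                WHERE r.user_id = j.user_id AND r.status = 'running'),
               j.created_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED;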

  • bufferoverflow 1618 days ago
    Now compare to OpenAMQ, ZeroMQ, RabbitMQ, NSQ, Kafka.

    I have seen benchmarks reaching millions of messages per second.

    • madhadron 1618 days ago
      The article is discussing job queues, not messaging. The MQ systems plus Kafka that you are referring to are message transport systems.
      • bufferoverflow 1618 days ago
        What's the actual difference between a job queue and a message queue filled with job IDs?
        • mlyle 1618 days ago
          In a decent job queue you know when jobs get completed, whether the worker died, and whether they're otherwise stuck.
        • icheishvili 1618 days ago
          Message queues achieving the rate you said are typically not durable.
        • wjd2030 1618 days ago
          The difference is nothing. A message can be a serialized object representing a job. Or the message can be a jobId that points to a record in a db.
        • ubu7737 1617 days ago
          When I used PG for my job queue, I was already using Kafka instead, to handle immediate job entries. The reason I needed PG was for jobs scheduled at a future time.

          I had no difficulty with long-running jobs because servicing jobs out of PG was simply a matter of pushing them onto the Kafka queue for immediate uptake there.

    • wjd2030 1618 days ago
      Exactly. It is so easy to achieve using a dedicated queuing system. You could just as easily achieve much higher throughput by passing a jobId as a message to Rabbit and having the worker pop the required data out of the db. That's assuming you don't want to just pass in a serialized object. All this talk of it not being a durable system is just wrong. Rabbit has strong durability guarantees across a cluster with queue mirroring, confirmations and acknowledgements.
      • ukd1 1618 days ago
        Rabbit and most others do; but only once the message reaches the queue itself. The problem that using a db to queue stuff gets around is guaranteeing it earlier: your changes to the database and the queued messages are either all committed, or not - a single transaction. This is hard to impossible with queues outside the DB. However, it's obviously a trade-off: it's slower, and it's more work and more connections to your DB. But you don't have to code around messages that were never sent, or that were sent from a transaction that later rolled back.
    • baq 1618 days ago
      i don't need millions of messages per second for my job queues, i'm not google.
  • rhacker 1618 days ago
    About 22 years back I got in trouble for recommending we replace Tuxedo Queue with just a database table, since we could get all the transactional options for free, just the same, without all the headaches and operational issues involved in connecting all our code with the C++ middleware - especially since Java and PHP were in the mix and Tuxedo was kind of in its own world, and was pretty much the only non-free software we were using, and we were using it in a horribly idiotic way.
  • sbov 1618 days ago
    Depends on the project. We have some that use PostgreSQL backed queues. Some using Redis. As you get more traffic RDBMS resources are precious - wasting them on queues starts to become a poor tradeoff. Redis is so generally useful we tend to use it on most projects anyways, so it doesn't add another moving part. And we tend to be really conservative about adding new things to our production environment, and despite that Redis has found a home in our stack.
    • mikeklaas 1618 days ago
      You're allowed to use more than one PostgreSQL instance for your queue if RDBMS resources are precious <g>
  • dpedu 1618 days ago
    How does this compare to Redis? Seems like Redis would handily beat it.
    • danharaj 1618 days ago
      Postgres is not best in class in many things it does well but it suits the needs most people will actually encounter in the wild while being built on a rock solid foundation. It truly is a fantastic hammer for every nail involving data integrity or synchronization.

      It's well designed so that when you do venture into territory where you need a specialized solution it's not difficult to replace. But you'll definitely miss ACID and relational data if you're leveraging them.

      There is a great operational economy to making Postgres your default solution for these sorts of problems.

    • dickeytk 1618 days ago
      The author addresses Redis:

      > So, many developers have started going straight to Redis-backed queues (Resque, Sidekiq) or dedicated queues (beanstalkd, ZeroMQ...), but I see these as suboptimal solutions - they're each another moving part that can fail, and the jobs that you queue with them aren't protected by the same transactions and atomic backups that are keeping your precious relational data safe and consistent. Inevitably, a machine is going to fail, and you or someone else is going to be manually picking through your data, trying to figure out what it should look like.

      I disagree though. BLPOP is far easier to grok than any Postgres solution and Redis is rock-solid. Either the new Redis Streams or a good queueing library will guarantee you don't miss any jobs, without any need for transactions across both systems.

      I would also be very hesitant to add work to my database. That’s often a sensitive part of systems.

      • erichocean 1618 days ago
        > Redis is rock-solid

        I've lost gigabytes of data with Redis; nothing with Postgres. I run both in production.

        Redis has a lot of obscure failure modes, lacks transactions (with rollback), and has no query language. For simple stuff where it's okay to lose data regularly, it's fantastic (that's when we use it).

        That said, we've been doing more and more with Postgres over time, and less and less with Redis. The benefits of having all of your data in the same storage, transactionally consistent, with a query planner for ad hoc visualizations and reporting is just too great.

      • ukd1 1618 days ago
        You can disagree, but you're still at least partially wrong; if you're using Postgres already, adding Redis WILL make your system less reliable, as you now have another point of failure to deal with (something else to patch, keep up to date, restart, etc, etc). Unless you have super-high-traffic, low-latency requirements, using Postgres adds some load but saves you having to write code to deal with rollbacks / race conditions between Postgres and a Redis queue.
      • shandor 1618 days ago
        I was also wondering about the "safe and consistent" part. Doesn't Redis Streams with persistence solve those pretty well?

        Edit: oh, someone mentioned this being from 2013. No Redis Streams back then.

        • ukd1 1618 days ago
          Ya, it doesn't matter even with Streams - even if it's consistent in isolation, you're using it with Postgres - how do you make actions in one consistent with actions in the other? (TLDR: you won't/don't)
    • n4r9 1618 days ago
      You may be right in terms of performance (I've no idea) but the article states:

      > many developers have started going straight to Redis-backed queues (Resque, Sidekiq) or dedicated queues (beanstalkd, ZeroMQ...), but I see these as suboptimal solutions - they're each another moving part that can fail, and the jobs that you queue with them aren't protected by the same transactions and atomic backups that are keeping your precious relational data safe and consistent

      • dimino 1618 days ago
        I find this argument to be pretty weak; it seems to be a stand-in for, "I don't want to use a new technology because it'd require me to learn something, so I'm going to shoehorn in something I feel more comfortable with, even though it's not the best tool for the job."

        Edit: I'm rate limited so to elaborate a bit.

        Queueing technologies are not "new interesting bit of technology", they're tailored solutions to solve a specific problem.

        Using a RDBMS for queueing is not what it was built to do, and you will run into issues doing so (I have).

        By trying to use one tool for everything, you lose out on all kinds of optimizations, features, and performance enhancements that are specific to the problem you're trying to solve.

        It's bad engineering to try and force Postgres into the role of queue when vastly superior technologies exist. You're actively hurting the engineers you work with, and the company you work for, if you force the same tech into use cases it isn't optimal for.

        I cannot stress this enough; fear of learning is anathema to software, and trying to hide it behind a veneer of caution is not only disingenuous but potentially malicious as well, maximizing exclusively for the benefit of the individual against the interests of the group.

        • mlyle 1618 days ago
          Fewer pieces in the technology stack is better, because each piece is operationally expensive (it is its own set of work in maintenance, upgrades, migrations, etc). There are also the mentioned costs (you can't atomically do things across divergent pieces of infrastructure, at least not very easily at all).

          Fighting the tendency of everyone wanting to draw in each new interesting bit of technology is important.

          Every piece of complexity you incur should be proven as required, and you should build things to just somewhat larger than near term projected scale... and not expect that you are a temporarily embarrassed unicorn with loads of unexpected demand showing up.

          If you run out of performance, you have some headroom from tuning; from that point, you can then choose whether it makes sense to retrofit to a higher scale technology or throw larger hardware at the problem (or both).

          • ukd1 1618 days ago
            So much love for this; 'Every piece of complexity you incur should be proven as required', plus 'Fewer pieces in the technology stack is better, because each piece is operationally expensive'. Also, this totally ignores (and these, imho, are enough in themselves) the actual practical difference between using a single thing and two: synchronizing state. Most of the cases here guarantee atomicity individually, but not as a pair - you (the engineer) have to deal with that, which is non-trivial.
          • threeseed 1618 days ago
            These days it's pretty trivial to have a cloud managed component e.g. Redis that is maintained, upgraded and supported.

            And then you have a component that is designed for the job instead of trying to use a database as a poor man's queue.

            • mlyle 1618 days ago
              > These days it's pretty trivial to have a cloud managed component e.g. Redis that is maintained, upgraded and supported.

              It's still another moving part. Things should be as simple as they can be, but no simpler.

              > trying to use a database as a poor man's queue.

              Of course, it's not really a "poor man's" queue-- it's got some superior capabilities. It just loses on top-end performance. (Of course, using those capabilities is dangerous, because it creates some degree of lock-in, so go into it open-eyed).

              For as much as you accuse others of looking down their nose / not willing to seriously consider other technologies... you seem to be inclined that way yourself.

              Redis is great. But if you have Postgres already, and modest to moderate queuing requirements, why add another piece to your stack? Postgres-by-default is not a bad technology sourcing strategy.

              • dimino 1618 days ago
                It is very much a bad technology sourcing strategy, and you will kill your business if you attempt it. Period.

                It comes from a place of ignorance, and you're promoting ignorance. Learn why technologies exist and make an informed decision about tradeoffs, instead of being lazy and incompetent by blindly choosing technology based on what's a very locally maxima for you personally. It's selfish and damaging.

                Edit since I'm rate limited: The folks landing on Postgres are not landing there after due consideration, they're landing on a technology they're familiar with because they don't know how to learn.

                It is lazy, ignorant and selfish, and while of course people don't like having truth spoken to them, that's what it is; truth.

                Edit 2: Making these poor technology choices will kill your business because you won't have the agility that using a specific message broker technology gives you. You won't have built-in solutions to common queueing problems, you won't have dedicated specific logging/metrics to monitor, you won't have the libraries in your preferred language to directly tackle your problem, you won't have the support community available to you (it will be much smaller), you won't be able to pivot onto related patterns as your needs change, your scaling will always be more complicated because fewer people are doing it and the tool you use isn't tailored for your use case; the list goes on. Your competitors will swallow you because they move faster than you do, and your business will die.

                This is annoying to me because I've been in situations where I've had to maintain and write features against technology that was a poor fit for its use, but people like those in this comment sections bitched and moaned about a "new thing" existing in our stack. The reality was they didn't want to learn anything, and were fine pushing the hard work off onto the developers, so they could safely continue to do as little as they possibly could get away with.

                Do your job, learn technology that actually fits your use case, and stop trying to push work off onto other people.

                • edoceo 1618 days ago
                  Bruh, folks are making an informed decision - and landing on PostgreSQL. Their scenario/conclusion is just different than yours.

                  It's not lazy, blind or selfish as you claim - and frankly when you disagree with authority & condescended tone - it's a turn off.

                • ukd1 1618 days ago
                  Wow, you're super jaded on this, do you sell queuing software for a living? Also wrong: how will this kill a business, at all?
                  • ukd1 1618 days ago
                    Downvote away; but really “kill” your business!!?
        • ukd1 1618 days ago
          It's not bad engineering at all; how many engineers ever have to operate at that level of your stack? Close to zero, as normally they just use an abstraction on top of it - in Rails, for instance, ActiveJob. No one has to care what's behind it, except the folks running production - and for them it's a trade-off: it's simpler, one less moving part, but also more load on one single part. Is that worth it, or not? Depends.

          I can't stress this enough; claiming anything is "vastly superior", especially with no examples, isn't useful - things aren't this black and white in reality. Most tech-choices like this are somewhere between trade-offs, preferences, ignorance or wrongly held opinions about something being vastly superior > something else.

          • dimino 1618 days ago
            Sorry, but this isn't one of those cases. A RDBMS is objectively inferior for queueing compared to dedicated message broker technologies. This is blatantly obvious if you've ever used both for queueing.

            You're applying generic heuristics against a problem where folks have specific domain knowledge that contradicts those heuristics.

            I cannot stress this enough; you are flat wrong if you think Postgres is appropriate to use for queueing.

            • tomtheelder 1618 days ago
              I believe you are missing the point a bit here. Postgres is not the optimal tool for queueing. No one is arguing that position. However, it's also true that for a very large number of use cases, it is entirely sufficient.

              If you already have tooling and knowledge for managing Postgres, and Postgres is sufficient for your queueing needs, then why would you incur the cost of introducing a new service? Most situations don't require the best tool for the job, and in that case, you are well served to use a tool you already have.

              If you are using an abstraction layer over your queues (you almost certainly are) then you should be able to easily change to a different queue implementation later, should the need arise.

              • Thaxll 1618 days ago
                Except that PG and other RDBMSes don't offer "queue" commands in the driver, because the queue is built on top of other basic commands; when you use a system designed for queueing, you can leverage the API offered by the driver directly.

                So if my language doesn't have a library like Que (no thanks, I'm not using Ruby), I can't do anything with PG, whereas with Redis or any other such solution you can use "subscribe" / "publish" etc...

              • ukd1 1618 days ago
                This ^ x 100.
              • dimino 1618 days ago
                I'm not missing the point, I'm discounting the point, because the point that adding technology is additional complexity, and complexity is bad is a nonsense argument that actually stems from a fear of new things.

                This has nothing to do with getting shit done, and everything to do with people who don't want to learn new things because they're afraid of not being smart enough to make the new thing work.

                • DLA 1618 days ago
                  Every different technology added to an infrastructure adds potential failure points, raises the DevOps burden, and adds surface area subject to bugs and security attacks.

                  If a PG-based solution meets the needs then this reduces complexity in the large. And this has nothing to do with learning or not learning.

                  Sorry but your argument just does not hold up.

                  • dimino 1618 days ago
                    It only adds the burden to DevOps who can't figure out how to monitor/maintain a system, which is not rocket science, and if you're a DevOps person who doesn't know how to handle such a mainstream technology as, say, ZeroMQ or RabbitMQ or even Redis, then maybe you shouldn't be in the position you're in.

                    The only argument that doesn't hold up is "use one tool for everything". This ideology is for lazy people who want to anchor themselves in a period of time, and refuse to learn new things.

                    • DLA 1618 days ago
                      Nobody is saying use one tool for everything. In fact the argument is why not use PG until you grow out of it (most people won’t).

                      Agree on devops. Of course.

                      There’s still the security surface area argument and the overall complexity argument—complexity of the system of systems is NOT reduced by adding technologies.

                      By way of analogy, using military aircraft, take the F35 vs. the A-10. The F35 is insanely more complex than the nearly indestructible A10, in large part due to the massive number of subsystems, software, sensors, etc. Sure, the F35 is way more capable than the A10 (this is NOT a head-to-head argument about the military capabilities of said aircraft), but it also has orders of magnitude more failure modes. More parts. More complexity. More opportunities for failure. This is engineering fact.

                      • dimino 1618 days ago
                        People are literally saying to default to using PG for "everything". That's the attitude I'm arguing against.

                        What's frustrating about this conversation is that I'm literally, right now, supporting two different queueing systems based on PG and Redis, so I get on a very real level, the tradeoffs. I know in great detail the problems that come up, but HN is not conducive to talking at that level of detail. At this point every comment I make is flagged and downvoted, so why would I pay time into a system that has clearly decided my opinion isn't relevant?

                        • ukd1 1618 days ago
                          I mean, I think it's mostly the blunt, binary right/wrong nature of your posts getting you this reaction. Most folks have moved past that, and almost all of us have non-perfect real-world systems where we've made trade-offs. It's just not my (and I guess a lot of folks') reality to be so black/white on this.
                          • dimino 1618 days ago
                            Of course there are trade offs, of course we live in imperfect situations where we have to deal with shit that isn't ideal, but sitting around pretending like everything is alright is just lying to ourselves.

                            You're stuck in PG because your DevOps team is a bunch of incompetent dolts, so yeah, make the best out of the bad situation and go ahead and use something like Que.

                            Just don't develop Stockholm Syndrome while doing it.

                            People in this thread are protecting their own egos by expressing how great PG is as a message broker, but it's harmful to the industry to let that kind of attitude perpetuate.

                    • AlfeG 1618 days ago
                      But what about "box" products? We can't force clients to have dedicated DevOps just to support our product. Less is better in this case. Having the same code in cloud and offline installs is even better.
            • ukd1 1618 days ago
              An array can be appropriate for queuing for some things. You're wrong - as you are trying to state "a problem" has a specific solution, yet you don't know the problem. Yes, there are situations when using a dedicated message broker would be objectively better; there are many when that's totally wrong.

              Nothing is likely to be correct about your arguments if you don't know what you are trying to fix or the situation, which should also be blatantly obvious. You're applying some unstated subjective situation in your head, then dumping out something you've heard / done before. LOL.

              Also "if you've ever used both" - I have, which is why I know. It's gray; sometimes you should, sometimes you shouldn't. You're assuming I haven't as I have differing point of view to you.

              • dimino 1618 days ago
                No, I'm assuming you haven't otherwise you'd know that redis AND postgres are inferior message brokers, both of them, and the fact that you haven't corrected me means you're not well versed at all in this problem domain.
                • ukd1 1618 days ago
                  LOL; what problem domain? We've not talked about any specific situation - which is why you're wrong; you assume far too much. I was attempting to enlighten you a little, but after a few redeliveries, it seems it's gone to DLQ.
                  • dimino 1618 days ago
                    The fact that you can't even recognize this as a problem domain means you do have zero clue what you're talking about. How about THAT?!?

                    Thanks for making my point for me.

            • edoceo 1618 days ago
              Earlier you said something about making excuses for learning, and now you claim something subjective is 100% true and blatantly obvious -- and that could block your own learning.

              Like many things in this space there are a number of good ways - each with minor trade offs

              Other posters have detailed those trade-offs very clearly.

              So now we can all be more educated about which one of the six good ways we'd choose.

        • outworlder 1618 days ago
          > I cannot stress this enough; fear of learning is anathema to software, and trying to hide it behind a veneer of caution is not only disingenuous but potentially malicious as well, maximizing exclusively for the benefit of the individual against the interests of the group.

          This is wrong.

          You are ultimately getting paid to solve the company's problems. If those problems actually require deployment of a new technology to satisfy the requirements, then it is fine. Note that I said satisfy requirements, not exceed them.

          But more often than not, * that's not the case *. If your queue requirements can be served with PostgreSQL, and you already have it, then why not? To do otherwise would be over engineering.

          Because you may know a new and fancy technology that would be objectively better. Cool. Now, do you have enough know-how in the company to use it effectively? Do you know what the best practices are? Can you deploy, monitor, audit and patch this in production? If you have answered NO to any of those questions, you'll either hire people or you shouldn't deploy. Period.

          And if you have only one person answering YES to the previous questions, you still shouldn't deploy it. Because one day that person will leave, and you will have left the company in a worse situation than it was in before. For no reason other than satisfying your engineering itch.

          THAT is anathema to working, reliable production software. Toys, you can do whatever.

          If you want to introduce something new, you can. But you need to do it responsibly. It needs a reason to exist - a valid reason, not "it's better". How does it being "better" help the company? Will it reduce costs? Maintenance? Future development work will happen faster? Will it make a faster user experience? Scale to the projected company growth?

          It will have a cost that will have to be accounted for – including opportunity costs.

          > By trying to use one tool for everything, you lose out on all kinds of optimizations, features, and performance enhancements that are specific to the problem you're trying to solve.

          You only care about this if you have a reason to care about this. Otherwise it's all irrelevant.

          • dimino 1618 days ago
            Sorry, but what you call a "toy", I call a highly reliable technology that's used in production widely across the majority of Fortune 500 companies. No one here is suggesting anything remotely like what you're strawmanning, but it's clear the straw man is the only argument you can take down.

            The reason you are completely wrong and a toxic member of your team is that your logic can be used to justify all kinds of terrible, short term wins that sacrifice any kind of long term/sustainable design. You'd rather make a bad technology play so your boss gets off your back, than think through your decision and make a long term investment into a sustainable, maintainable, improvable, and agile technology that's actually suited to solve the real problem you have.

            OF COURSE you're there to accomplish the tasks your business needs to succeed. That's a given. What's not a given is that you need to slap together the shittiest piece of software that'll do that job right then and there.

            Your attitude is what keeps companies from solving problems, and forces them to pay dollar after dollar to fix the same issue over and over again. You're selfish, you're greedy, and you cost your company insane amounts of money.

            You should not be employed in this industry if what you wrote here is how you think, period.

            • dang 1618 days ago
              Would you please stop posting flamewar comments and crossing into personal attack? This breaks the site guidelines and is not ok here.

              https://news.ycombinator.com/newsguidelines.html

              • dimino 1618 days ago
                I'm not doing anything like that, but please, pretend that I am, seems like your go-to move here anyway.
                • zentiggr 1617 days ago
                  > You should not be employed in this industry if what you wrote here is how you think, period.

                  That seems pretty personal. Not that I'm hopping on dang's bandwagon, but you can't post that sort of attack and say you're not doing anything like that.

                  I've wound up seeing a lot of your recent posts and I can understand where a lot of your responders are coming from... you are very strongly opinionated and have no problem crossing the line from "I hold this opinion" to "you're wrong for not agreeing and therefore you <insert consequence>".

                  THAT's the behavior that's getting you the pushback. Be more open to discussing things and comparing experiences and maybe you'll see more cordial response back.

                  • dimino 1617 days ago
                    It's not a personal attack, it's my opinion. I've worked with people who think that way, and I'd prefer if they didn't exist in the industry. How would you prefer I shape that comment so it's not crossing some magical line?

                    I am aware, generally, how things are. My complaining here is not confusion, it's frustration. I know I can change my behavior to elicit a better response, but I'm frustrated that it's relevant. People should have thicker skin, and when people dish it out (as what was going on here), I should be allowed to give it back. Dang's not commenting on anyone else's posts here, though there were many other rule violators.

                    It's an unfair application of the rules.

                    Edit: This post is 0 minutes old, how does it have a downvote already? This is the shit I'm talking about...

                • dang 1618 days ago
                  We cut you much more slack than we give to most accounts who break the site guidelines, so I'm not really feeling your complaints.

                  Would you please pick one account to post with? Using multiple accounts to work around moderation restrictions is obviously abusive.

                  • dimino 1618 days ago
                    You know moderation is at an IP level, we've had multiple discussions about it, and I'm clearly still getting hit with the rate limit, so when you say "to work around restrictions" it's obviously disingenuous on your part. Someone/multiple people are auto-downvoting everything I comment on with my other account, so I switched to see if it kept happening. Unsurprisingly, it didn't until I mentioned it again, and it started up immediately after.

                    As for the other comment you made (can't reply directly, rate limit), it isn't a low information post, I'm showing a screenshot of the message I get when I hit the rate limit.

                    You're literally gaslighting right now, it's insane.

                    You also know if you actually do anything else beyond rate limiting I'll just go dark, which is even "worse" for you, so this so-called "leniency" is nonsense. Just stop.

                    Edit: You can't use multiple accounts to get around rate limits, as you and I have discussed multiple times, so I'm not "doing" anything that constitutes a ban, but feel free to ban at your leisure. Over the years you've established yourself firmly as an unreasonable person, so there's no point trying to treat you like one.

                    • dang 1618 days ago
                      I'm afraid I don't really follow what you've written here.

                      Rate limits are meaningless if people can just use multiple accounts to get around them. If you keep doing that, we're going to have to ban at least one of your accounts—but more likely all of them. After several years of trying to persuade you to use HN as intended, I'm beginning to lose patience.

            • aldoushuxley001 1618 days ago
              You gotta relax your tone eh. Not sure why you’re so personally/emotionally invested in this topic but your aggressive tone makes your myriad comments grating.
              • dimino 1618 days ago
                This isn't an aggressive tone, except to people who have their own emotional investment into the topic.

                I put out what is given.

    • tracer4201 1618 days ago
      If your requirements don’t necessitate such scale, I can see Postgres being a viable alternative. I’d rather not introduce another dependency unless it’s absolutely needed.
      • dickeytk 1618 days ago
        I’ve never in my life worked on something where we didn’t keep having to upgrade our database resources
        • oftenwrong 1618 days ago
          When you scale your db, now you're scaling your queue, too. One thing to scale, one thing to monitor.
    • truth_seeker 1618 days ago
      Redis would be faster. In PG you can use an UNLOGGED table, which skips WAL writes and so gets you closer to in-memory speed.
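
      For example (note the durability trade-off):

        -- skips the write-ahead log: fast, but the table is truncated after a
        -- crash, so any queued jobs in it would be lost
        CREATE UNLOGGED TABLE job_queue (
          id      bigserial PRIMARY KEY,
          payload jsonb NOT NULL,
          status  text NOT NULL DEFAULT 'waiting'
        );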
    • mobilemidget 1618 days ago
      I'm still waiting for a work project to come by where I can replace Redis with https://keydb.dev and test that in action.
      • wjd2030 1618 days ago
        I have tried this in our dev environment several times and failed because it didn't support all RESP2 commands (using stackexchange.redis as the lib).
    • Exuma 1618 days ago
      Because Que has transactional/ACID guarantees, and Redis does not.
      • adverbly 1618 days ago
        Actually, Redis makes several guarantees - just not as ACID as an RDBMS. Source: https://redis.io/topics/transactions
        • balfirevic 1618 days ago
          Enqueueing jobs in Redis is not transactional with regard to your primary data store.
          • adverbly 1617 days ago
            Oh, I see what you mean. So yeah, it's not transactional across systems - good point. I usually manage that by just updating the primary data store in the job, but I can see how that might not always be an option.
    • echlebek 1617 days ago
      With postgres, you have a chance to not lose data.
  • mianos 1618 days ago
    This reminds me of pgqueue https://github.com/markokr/skytools as used at skype. It works and it works well but needs a plugin so maybe not so good for RDS.
  • maxpert 1618 days ago
    Maybe I am being ignorant/arrogant, but I have seen similar toy systems crash and burn in prod environments. And if you go through https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock... it clearly states:

    > A queue implemented in the RDBMS will never match the performance of a fast dedicated queueing system, even one that makes the same atomicity and durability guarantees as PostgreSQL. Using SKIP LOCKED is better than existing in-database approaches, but you’ll still go faster using a dedicated and highly optimised external queueing engine.

    People only take these warnings seriously once they have built a toy system and rolled it out to a production product that takes off. By that time you will realize "ohhh, I don't need all the ACID guarantees for every job in my system", and "I want to run my workers as lambdas / elastically scaling workers" (which needs more connections). That is when companies end up spending the efforts of their best engineers to move away from Postgres as a queue. Which comes down to the question: why do it in the first place? What makes it cheaper (other than local dev) to deploy a system that is bound to fail in the future? Don't get me wrong, I love Postgres, I just don't believe it's the right tool for this job.

  • ptrwis 1618 days ago
    I would love to see PostgreSQL now invest more in improving the developer experience: job queues, a job scheduler, packages.
  • tuldia 1618 days ago
    PostgreSQL never ceases to amaze me.

    The great thing is that it is a mature tool and you don't need to bring another beast to the zoo.

  • Exuma 1618 days ago
    This is awesome, I love Que
  • 29athrowaway 1618 days ago
    What is the problem with Kafka?
  • effnorwood 1618 days ago
    TrumpQL is the new name. 10K jobs per second!
  • danmg 1618 days ago
    I always thought that using a database server for RPC was considered an anti-pattern.
  • Thaxll 1618 days ago
    Bad idea to use that, because if the worker crashes the event is lost; don't use PG for that.
    • bloody-crow 1618 days ago
      Wait, what do you mean here? The job is a row in a database. You can crash as much as you want, the row is not going anywhere until you explicitly mark it as "worked" and delete it.

      There are different performance constraints to this approach, but data integrity and robustness of it are unmatched, really.

    • dragonwriter 1618 days ago
      > Bad idea to use that because if the worker crash the event is lost

      The only case where this would be problematic is if the worker crashed after doing some side effect external to the queue server that rendered the job non-idempotent, but before committing an acknowledgement. But that's not really a “what you use as a queue server” issue as a “are you also using a distributed transaction system to coordinate all side effects including queue updates” issue. (If the same Postres DB is the operational DB and the queue server, you may be able to avoid distributed transactions by having all the side effects in the DB, though that creates its own issues.)

    • ecnahc515 1618 days ago
      This is pubsub and the data is in an ACID compliant database. The worker will just re-subscribe and poll for events on startup.