I love Django / Django Rest Framework and have used it for a long time, but we recently dumped it from a project in favor of FastAPI.
There are just so many layers of magic in Django that it was becoming impossible for us to improve the performance to an acceptable level. We isolated the problems to serialization / deserialization. Going from DB -> Python object -> JSON response was taking far more time than anything else, and just moving over to FastAPI has gotten us a ~5x improvement in response time.
I am excited to see where Django async goes though. It's something I had been looking forward to for a while now.
-- Boring code - Business logic, CRUD, management, security, ...: django
-- Perf: JWT services on another stack (GPU, Arrow streaming, ...)
So stuff is either Boring Code or Performance Code. Async is great b/c Boring Code can now simply await Performance Code :) Boring Code gets predictability & the general ecosystem, and Performance Code does wilder stuff where we don't worry about non-perf ecosystem stuff, just perf ecosystem oddballs. We've been systematically dropping node from our backend, where we tried to have it all, and IMO that's too much lift for most teams.
Similarly, we ended up doing the same. Boring CRUD/CMS stuff is all in Django. That's 90% of our codebase and by far the most important. Our "user scale" endpoints are all implemented in Lua in NGINX and just read/write to Redis; data changes go into SQS and are processed by Celery back in the Django app. It scales phenomenally well and we don't lose any of the great things about developing all of our core biz-critical stuff in Django.
That is quite interesting. There are a lot of things like management, tests, and such that I love and miss from Django. Going to have to really think about what I make of this.
Edit: Although, now that I think about it a little more, it's not that surprising. Our initial tests literally just defined FastAPI schemas on top of our existing DB. The co-mingling while actually running is an interesting concept though.
Just FYI, for anyone reading this and having the same problem: try Serpy, which is a near drop-in replacement for the default DRF serializers. It might solve your performance problem without having to switch to a completely different API framework.
I recently wrote a Python API for work and used FastAPI. I want to like it, but it was doing so much magic behind the scenes that it ended up being frustrating to use and just got in my way, so I ended up dropping it in favour of using Starlette directly.
The way it tries to construct the return values kept getting in the way.
I'd define the class and add it as the return value, and I was manually instantiating the class and returning that, but FastAPI didn't like that and would constantly throw errors about it. I think Pydantic was the root cause there.
The Depends functionality refused to inject my classes as well, but I was probably doing something wrong there...
Dropping back to Starlette was good because it gave me everything I needed and got out of my way. I’ve still got everything fully typed and passing MyPy.
If your DB is Postgres and you can do everything you need to fetch the data in SQL, Postgres can output JSON directly. It’s pretty fast at it. Usually it’s not too hard to do this on a few performance-sensitive endpoints in a framework web project.
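To make the idea concrete, here's a minimal sketch assuming a psycopg2-style connection and a hypothetical `books` table — the table and column names are illustrative, not from the comment:

```python
# Let Postgres build the JSON response body itself, skipping the
# DB row -> Python object -> JSON round trip entirely.
# `json_agg` collects rows into a JSON array; `row_to_json` turns
# each row into a JSON object; coalesce handles the empty case.
JSON_BOOKS_SQL = """
SELECT coalesce(json_agg(row_to_json(b)), '[]'::json)
FROM (
    SELECT id, title, author
    FROM books
    WHERE published
    ORDER BY id
) AS b;
"""

def fetch_books_json(conn) -> str:
    """Return the response body as a JSON string produced by Postgres."""
    with conn.cursor() as cur:
        cur.execute(JSON_BOOKS_SQL)
        (body,) = cur.fetchone()
        return body
```

The view then writes the returned string straight into the HTTP response with a `application/json` content type, with no serializer in the path.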
For performance-sensitive endpoints, Django just isn't the right tool. You can do a lot of optimizations in Django but in reality the WSGI/ASGI overhead and Django's request routing through middleware and view functions or CBVs is extremely slow. Is anyone handling 1,000 requests/second in their Django app without having to run 50 servers? The answer is no. If you're getting to the point where you're trying to figure out how to emit JSON from your database directly, then you've already lost. Django is exceptionally well suited to exactly what it was originally designed for: a content management system and "source-of-truth" for all of the business data in your application. High-velocity "user-scale" is better done in another service.
Have any interest in expanding this into a blog post? I've been working on a similar post. Maybe we can compare notes. I'm at michael at testdriven dot io, if interested.
Django Rest Framework has really slow serialization. After seeing it in action, I wrote my own simple serializer that I have been using quite a bit. Deserialization isn't even really needed: just feed the submitted JSON into vanilla Django forms. It works better anyway.
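The commenter's serializer isn't shown, but the general shape of such a hand-rolled serializer might look something like this sketch (not DRF's or the commenter's actual code):

```python
class SimpleSerializer:
    """Minimal field-list serializer: plain attribute lookup, none of
    the per-field validation machinery where DRF spends its time."""
    fields: tuple = ()

    @classmethod
    def serialize(cls, obj) -> dict:
        return {name: getattr(obj, name) for name in cls.fields}

    @classmethod
    def serialize_many(cls, objs) -> list:
        return [cls.serialize(o) for o in objs]


# Usage, with a plain object standing in for a model instance:
class BookSerializer(SimpleSerializer):
    fields = ("id", "title")


class Book:
    def __init__(self, id, title):
        self.id, self.title = id, title
```

`BookSerializer.serialize(instance)` returns a plain dict ready for `json.dumps`, which is usually all a read endpoint needs.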
FastAPI uses Pydantic under the hood for Python objects. And we have been tinkering with orjson for the actual JSON serialization, since it appears to be the winner in JSON serialization at the moment.
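A hedged sketch of the orjson swap, with a stdlib fallback in case orjson isn't installed. One gotcha worth knowing: orjson returns `bytes`, not `str`:

```python
# orjson serializes straight to bytes and is typically several times
# faster than the stdlib json module; fall back if it's unavailable.
try:
    import orjson

    def dumps(obj) -> bytes:
        return orjson.dumps(obj)
except ImportError:
    import json

    def dumps(obj) -> bytes:
        return json.dumps(obj).encode()

payload = dumps({"id": 1, "tags": ["a", "b"]})
```

Since the output is already bytes, it can be written to the response body directly without an extra encode step.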
Why didn't you use Pydantic with Django if the DRF serializers were too slow?
You can also skip the object serialization from the ORM and work with python dicts directly to significantly improve serialization performance from the database.
Was there still a significant speedup using the standard library json module?
For the DB requests, are you writing SQL directly, using a different ORM, or something like SQLAlchemy Core that makes SQL pythonic without being an ORM?
Yeah, the main improvement was seen even before playing with orjson. It did help too, I think, but I only started with it yesterday so I haven't actually profiled the two side by side to get real numbers.
And it uses SQLAlchemy under the hood. Can use all of it. But if you want full async all the way down, can just use core and something like encode/databases for the DB access.
How detailed was the profiling on this? Reason I ask is I’ve faced this myself and had to spend a lot of time on both query and serializer optimization.
We used `silk` a lot to profile the app. Basically all the time was being spent inside Django somewhere between getting the data from the DB and spitting out the response. We would have things like 15ms in the DB, but 250ms to actually create the response, on simple things. Some of our responses ran into multiple seconds (large amounts of data) but still only spent maybe 150ms in the DB. And there were at least two weeks spent on and off trying to improve it before we finally decided we had to go somewhere else. And that's after having to redo some of our queries by hand because the ORM was doing something like 15 left joins.
Basically just a complex permission model based on relationships, much better handled with a subquery. Mostly on us. I don't blame the ORM entirely, but it was more joins than necessary too.
Yeah, FastAPI uses SQLAlchemy under the hood, along with Pydantic to define schemas with typing. And then we just started tinkering with orjson for the JSON serialization. It seems to be the fastest library at the moment.
I have also been experimenting with encode/databases for async DB access. It still uses the SA core functions, which is nice, but that means it does not do the nice relationships stuff that SA has built in when using it to handle everything. At least not that I have found. However it does allow for things like gets without relationships, updates of single records, and stuff like that quite nicely.
FastAPI is database agnostic, although tutorials talk about using SQLAlchemy (probably because it's most popular).
I am using asyncpg[1] (much more performant and provides close mapping to PostgreSQL, making it much easier to use its advanced features) through raw SQL statements without problems.
Yeah, you do end up defining everything more than once: once for SA, and then again for Pydantic. Create, Read, and Update may all be different Pydantic models as well; they define what comes in and out of the actual API. Your create request may not have the id field yet and may have some optional fields, while the response has everything. And an update may have everything optional except the id. I've only been using it a few weeks now, but I'm liking it a lot so far.
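A sketch of that Create/Read/Update split, using stdlib dataclasses to stand in for Pydantic models so it runs without dependencies — real FastAPI code would subclass `pydantic.BaseModel`, and the field names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemCreate:
    # Request body for POST: no id yet, some fields optional.
    name: str
    description: Optional[str] = None


@dataclass
class ItemRead:
    # Response body: everything present, including the assigned id.
    id: int
    name: str
    description: Optional[str]


@dataclass
class ItemUpdate:
    # Request body for PATCH: everything optional except the id in the URL.
    name: Optional[str] = None
    description: Optional[str] = None
```

The repetition is real, but each model documents exactly one wire shape, which is what makes the generated OpenAPI docs accurate.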
Great article. But I think this part may need a second look:
If your views involve heavy-lifting calculations or long-running network calls to be done as part of the request path, it’s a great use case for using async views.
That seems true for long-running network calls (IO). But for heavy-lifting calculations? I thought that was the canonical example of situations async won't improve. CPU bound and memory bound, after all.
This will only help if the workers are separate processes. Thread workers will hold the GIL in Python and prevent network I/O while they are doing CPU bound tasks.
You are correct, async will only help for long-running network calls, which happens when calling another service or querying a database.
When doing a long computation the CPU is not idle so there is no free compute power to use for something else.
Finally, when doing IO calls in Python the GIL is usually released, so the kernel can already schedule another thread while waiting for IO. So it is not certain that converting to async will yield an improvement, and you should benchmark if you plan on converting an existing program.
> when doing IO calls in Python the GIL is usually released so the kernel can already schedule another thread while waiting for IO
This is true, but scheduling another thread through the kernel can have higher overhead since it requires context switches. Running multiple threads also has other potential issues with lock contention; how problematic they are will depend on the use case.
The potential advantage of scheduling another thread is, of course, that it can do CPU bound work; but in Python, unfortunately, doing that means the GIL doesn't get released so that thread will prevent any further network I/O while it's running, the same as would happen in an async framework if a worker did a lot of CPU work. So Python doesn't really let you realize the advantages of threads in this context.
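One standard workaround for that limitation, sketched below, is to ship CPU-bound work to a process pool via `run_in_executor`, so the computation holds a child process's GIL rather than the event loop's:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def crunch(n: int) -> int:
    # Stand-in CPU-bound work; it holds its own process's GIL,
    # not the one guarding the event loop's thread.
    return sum(i * i for i in range(n))


async def handler() -> int:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # The event loop stays free to service other requests' I/O
        # while the child process does the computation.
        return await loop.run_in_executor(pool, crunch, 10_000)

# asyncio.run(handler()) would return the result without starving the loop.
```

This buys back parallelism at the cost of pickling the arguments and result across the process boundary, so it only pays off for genuinely heavy computations.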
> doing that means the GIL doesn't get released so that thread will prevent any further network I/O while it's running, the same as would happen in an async framework if a worker did a lot of CPU work. So Python doesn't really let you realize the advantages of threads in this context.
> Computing intensive, no. Code that is doing a CPU intensive computation but makes no system calls will never release the GIL.
Any code that does not involve Python objects can release the GIL, no matter whether it makes system calls or not.
For example, NumPy, the most popular scientific computation package in Python, on which many other popular packages like Pandas are based, releases the GIL when doing operations on matrices. This is documented at https://numpy.org/doc/stable/reference/internals.code-explan...:
> If NPY_ALLOW_THREADS is defined during compilation, then as long as no object arrays are involved, the Python Global Interpreter Lock (GIL) is released prior to calling the loops. It is re-acquired if necessary to handle error conditions.
And does not involve running Python bytecode. Yes, numpy and other packages that provide C extensions do this when they are doing computations that don't require running Python bytecode.
There is an advantage to threads in the CPU-bound case, which is that the work of other threads will not be blocked by a CPU-intense operation. With an IO-event-based scheduler, your CPU-bound task will not context switch, leading network logic elsewhere to simply time out. A particularly acute example is something like a network library logging into the MySQL database, which gives the client a ten-second window to respond to the initial security challenge. It was both an extremely difficult bug for me to diagnose and helpful for my role at work that I was able to track that one down in OpenStack :).
I thought the only reason to use ASGI is to use web sockets, and the only reason to use web sockets is to avoid making multiple requests for things that don't matter if a particular message is lost.
What's the most elegant way for cutting-edge Django to do websockets? Is it still to 'tack on' the channels package [0]?
Compared to FastAPI [1] I really don't want to use it; I only miss the ORM, since in FastAPI it looks like you have to manually write the code to insert stuff [2].
As someone who's done over a decade of Django work: Do use FastAPI, especially if you need websockets and such.
Django is great for CRUD apps, MVPs and such. And I've used it with success for larger platforms, but it doesn't take long for me to want something closer to the metal whenever I need custom work. FastAPI has filled that need wonderfully well.
I rolled a microservice in Node to handle websockets from our Django app. I can push a notification into an event queue, make an HTTP call to send a notification, and even use Postgres notify/listen.
FWIW I'd have just used Pusher.io and triggered via django signals but I didn't want clients to be able to query the list of channels, or have to think about costs.
This was more of a Friday hack for fun rather than a core product requirement for profit. But I feel like using a separately deployed and scaled service, agnostic of business logic, was the right move.
Time and time again, whenever I start a new project, Django has always been my go-to choice after analyzing the alternatives. I've worked on everything from large-scale, mono-repo, billion-user systems to side projects over the weekend, and Django really stays true to the batteries-included philosophy.
I think the article and some of the comments are not really looking at this the right way.
For most things you're probably better off "doing the work" in a celery task, regardless of whether it is IO bound or CPU bound. Then use websockets just for your status updates/progress bar, instead of having your front end poll on a timer.
I'm also coming from Django. I've found Keystone to fit my brain a lot better than most alternatives. Tried strapi and it looks good for basic stuff, but the documentation is just absolutely terrible, and once you get out of the really basic stuff you're on your own digging into their source code to understand how to do anything. Nonetheless, it looks promising and maybe in a few years it could be something more interesting (to me at least).
I’ve seen lots of blog posts, even the Django docs, saying async is available but still haven’t seen any real world examples yet. Do any exist?
Also, I still haven't seen how the async addition will work with class-based views. And Django Rest Framework is still deciding whether to spend time on support. Until these two use cases are viable, many users won't benefit.
> If your views involve heavy-lifting calculations ...
Nooo, not at all. Your tasks should be I/O bound, not CPU bound, to take advantage of asyncio. Maybe if the async server uses multiple threads with multiple event loops, but don't ever do a CPU-heavy task in an event loop, because you just invalidated using asyncio completely.
The JavaScript ecosystem went out of its way to bring in the async/await keywords, despite Node.js already being asynchronous via callbacks and promises. The argument was code readability, while async/await are just wrappers around the Promise system.
Wasn't there an article about how the async syntax was benchmarked to actually be slower than the traditional way of using threads? What's the current story on python async?
I think the story with async is always "it depends", unless we are questioning whether the specific implementation is broken.
For some web applications, it might actually be faster (in meaningful aggregate volume) to service a complete request on the calling thread rather than deferring to the thread pool periodically throughout execution. I think the break-over point between sync and async comes down to how much I/O (database) work is involved in satisfying the request. If each request only hits the database 1-2 times on average, incurring a few milliseconds of added latency, making it sync all the way down might be better than any amount of added context switching. If each request may take 100-1000 milliseconds to complete overall due to various long-running I/O operations, then async is certainly one good approach for maximizing the number of possible concurrent requests.
In most of my applications (C#/.NET Core) I default to async/await for backend service methods, because 9/10 times I am going to the database multiple times for something and I cannot always guarantee that it will return quickly under heavy load. For other items, I explicitly go wide on parallelizable CPU-bound tasks. All of these are handled as a blocking call against a Parallel.ForEach(). Never would a CPU-bound task be explicitly wrapped with async/await, but one may be included as part of a larger asynchronous operation.
This stuff used to confuse the hell out of me, and then I finally wrapped my head around the 2 essential code abstractions: async/await for I/O, Parallel.For() (et al.) for CPU-bound tasks which have parallelism opportunities. Never try to Task.Run or async/await your way out of something that is CPU-bound and is blocking the flow of execution. Try to leverage asynchrony responsibly when delays >1ms are possible in large concurrent volumes.
The "slower" is not really the problem--as the article notes, the sync frameworks it tested have most of the heavy lifting being done in native C code, not Python bytecode, whereas the async frameworks are all pure Python. Pure Python is always going to be slower than native C code. I'm actually surprised that the pure Python async frameworks managed to do as well as they did in throughput. But of course this issue can be solved by coding async frameworks in C and exposing the necessary Python API using bindings, the same way the sync frameworks do now. So the comparison of throughput isn't really fair.
The real issue, as the article notes, is latency variation. Because async frameworks rely on cooperative multitasking, there is no way for the event loop to preempt a worker that is taking too long in order to maintain reasonable latency for other requests.
There is one thing I wonder about with this article, though. The article says each worker is making a database query. How is that being done? If it's being done over a network, that worker should yield back to the event loop while it's waiting for the network I/O to complete. If it's being done via a database on the local machine, and the communication with that database is not being done by something like Unix sockets, but by direct calls into a database library, then that's obviously going to cause latency problems because the worker can't yield during the database call. The obvious way to fix that is to have the local database server exposed via socket instead of direct library calls.
>whereas the async frameworks are all pure Python.
No, it's not pure Python; it's a combination. The underlying event loop uses libuv, the C library that makes up the core of Node.js. The "Uvicorn" marker is an indicator of this, as Uvicorn uses uvloop, which is built on libuv.
Overall the benchmark is testing a bit of both. The event loop runs on C but it has to execute a bit of python code when handling the request.
>If it's being done via a database on the local machine, and the communication with that database is not being done by something like Unix sockets, but by direct calls into a database library, then that's obviously going to cause latency problems because the worker can't yield during the database call.
I am almost positive it is being done with some form of non blocking sockets. The only other way to do this without sockets is to write to file and read from file.
There is no "direct library calls" as the database server exists as a separate process to the server process. Here's what occurs:
1. Server makes a socket connection to database.
2. Server sends a request to database
3. database receives request, reads from database file.
4. database sends information back to server.
Any library call you're thinking of here is to a "client side" library, meaning that the library actually makes a socket connection to the SQL server.
> I am almost positive it is being done with some form of non blocking sockets.
Database libraries in Python that support this (as opposed to blocking, synchronous sockets, which are of course common) are pretty thin on the ground. That's why I would have liked to see more details in the article about exactly how the benchmark was doing the database queries.
> There is no "direct library calls" as the database server exists as a separate process to the server process.
Yes, you're right, I wasn't being very clear. The key question is, as above, whether nonblocking sockets are being used or not.
the psycopg2 driver for PostgreSQL supports an async mode which uses PostgreSQL's full blown non-blocking API, this is what I used when I did my tests and might be what was used here. There is also the asyncpg driver that is native to PG's non-blocking API. PG is the one database that does lend itself to async because it has a fully non-blocking client library available.
None of the other replies acknowledge this, but it seems you are conflating concurrency with asynchrony. An asynchronous program can be executed sequentially; it is a distinct concept.
Wow, thank you for this link. I appreciate it when my assumptions are challenged like this, particularly given the fact that I have a tendency to take benchmark synopses like FastAPI's [1] for granted. I'll have to be more conscious of the ways in which authors hamstring the competition to game their results.
I didn't downvote it, but apart from the fact that async io is not meant to be faster (it's all about throughput, after all), the benchmark is flawed and it's been discussed in full before https://news.ycombinator.com/item?id=23496994
asyncio is meant to be "faster" for IO heavy tasks and low compute. The benchmark tests requests per second which is indeed directly testing what you expect it to test.
It's been discussed before but the outcome of that discussion (in the link you brought up) was divided. Highly highly divided. There was no conclusion and it is not clear whether the benchmark was flawed.
The discussion is also littered with people who don't understand why async is fast for only certain types of things and slow for others. It's also littered with assumptions that the test focused on compute rather than IO which is very very evidently not the case.
> asyncio is meant to be "faster" for IO heavy tasks and low compute.
The point is that it's not meant to be any faster than a parallel pool of processes that perform the same heavy IO without blocking all requesting clients. asyncio is about packing as many concurrent socket interactions into a single process as possible, optimizing for throughput by giving up the speed that gets eaten up by task context switching. Hence the flaw in the benchmark: it was run on the same machine where Postgres was operating, it used a different number of processes for sync and async workloads, and the connection pool was not set up to prevent blocking when a coroutine tries to acquire a connection from an exhausted pool (for benchmark purposes it should have no upper bound and be pre-populated with already established connections).
I see that aiohttp has 5, uwsgi has 16, and gunicorn has 12, 14, and 16 depending on a web-framework, is this your definition of the same?
The author says:
> The rule I used for deciding on what the optimal number of worker processes was is simple: for each framework I started at a single worker and increased the worker count successively until performance got worse.
That's not how a benchmark is supposed to be conducted. One doesn't fit the number of workers to the result one finds "optimal"; one should use the same number of workers, find the bottlenecks, and either eliminate them or explain why they cannot be eliminated without affecting benchmark invariants.
> this wouldn't affect the variance between sync and async results very much because both frameworks were run on the same machine.
It will affect the variance. Firstly, because the DB will spawn processes on the same machine, pgbouncer will spawn processes on the same machine, and they will all compete for the same CPU, so the order of preemptive context switches affects individual benchmark runs differently. On top of that, there are periodic and expensive WAL checkpoints, and fsync competes with the benchmark for kernel system call interrupts and context switches, so the multi-process worker setup may be affected dramatically. If you don't believe that external processes affect the numbers to the extent they become incomparable, try surfing the Internet with your web browser randomly while running a benchmark.
> Real world connection pools have an upper bound limit. I don't see why setting an upper bound limit to be closer to reality is not a good test.
Because benchmarks are not real-world workloads. They are designed to show the unbound performance of the implementation detail selected for the test, where external resources are non-exhaustible, precisely to avoid side effects external to the functionality being tested.
> Also you're completely wrong about the connection pool blocking when it is exhausted. See source code:
> If all connections are exhausted then the system still yields to compute and incoming requests.
I didn't say that it wouldn't yield; I said that the coroutine will be blocked at the point where it tries to acquire a non-existing connection from the pool, which affects the benchmark. Now, instead of one blocking context switch at the network socket call that queries Postgres, the coroutine will yield AND WAIT twice: at the exhausted connection pool, and at the network socket call after the connection is acquired. This is exactly the reason why resources should be unbound, why the DB should be on a separate machine (unbound spawning of connection processes upon request), and why the number of OS workers should be the same in all benchmarks, because the sync version will also block twice, and the consequence of blocking there will be much more dramatic and different than in the async case, WHICH IS THE POINT of a proper benchmark - https://github.com/calpaterson/python-web-perf/blob/master/s...
This article doesn't evaluate the case that you actually want ASGI for, so I don't think it's very useful. (Or at least, it confirms something that should have already been clear).
If you're compute-bound, then Python async (which uses cooperative scheduling, similar to green threads) isn't going to help you. You get concurrency, but not parallelism, from this programming model; only one logical thread of execution is running on the CPU at a time (per process), so this can only slow you down if you are CPU-constrained.
The standard usecase of a sync API backed by a local DB with low request latency is typically going to be compute-bound.
The case where async workers are interesting is for I/O-bound workloads. Say you're building an API gateway, or your monolithic API starts to need to call out to other API services, particularly external ones like Google Maps API. In this case, the worst-case result is that the proxied HTTP request times out; this could block your Django API's work thread for many seconds.
In the async / green-threaded model, this case is fine; you have a green thread/async function call per request, and if that gthread is blocked on an upstream I/O operation, the event loop will just start working on a different API call until the OS gives a response on the network socket.
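A toy illustration of that interleaving, with `asyncio.sleep` standing in for the slow upstream call (the names here are invented for the sketch):

```python
import asyncio
import time


async def call_upstream(delay: float) -> str:
    # Stand-in for a slow proxied request (e.g. an external API);
    # the await yields control back to the event loop until the
    # "I/O" completes.
    await asyncio.sleep(delay)
    return "ok"


async def serve() -> float:
    start = time.monotonic()
    # Ten requests each blocked on upstream I/O for 0.1s. Because the
    # event loop interleaves them, total wall time is ~0.1s, not ~1s.
    results = await asyncio.gather(*(call_upstream(0.1) for _ in range(10)))
    assert results == ["ok"] * 10
    return time.monotonic() - start


elapsed = asyncio.run(serve())
```

With a sync worker pool the same ten requests would need ten workers to finish in the same wall time; here one thread handles them all.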
Essentially, there's no reason to use Django async if you're doing a traditional monolithic DB-backed application. It's going to give you benefits in usecases where the standard sync model struggles.
(Note, there's an argument that you might want green threads even in a normal monolith, to guard against cases like "developer accidentally wrote a chunky DB query that takes 60 seconds to run for some inputs", but most DB engines don't support one-DB-connection-per-HTTP-connection. There was a bunch of discussion on this topic a few years ago, with the SQLAlchemy author arguing that async is not useful for DB connections: https://techspot.zzzeek.org/2015/02/15/asynchronous-python-a... although asyncio support was added: https://docs.sqlalchemy.org/en/14/orm/extensions/asyncio.htm...)
> The standard usecase of a sync API backed by a local DB with low request latency is typically going to be compute-bound.
> This is covered in the Django async docs
Where is this mentioned in the linked docs? I see four mentions of the ORM in the linked pages, including where it says they are working on async ORM, but I see nothing about performance for the ORM typically being compute bound.
I read the article, thanks. I think you've missed my point. I'll TL;DR it for you to make it clearer:
> This article doesn't evaluate the case that you actually want ASGI for, so I don't think it's very useful.
> The standard usecase of a sync API backed by a local DB with low request latency is typically going to be compute-bound.
(Note, I'm specifically talking about Django here)
> Essentially, there's no reason to use Django async if you're doing a traditional monolithic DB-backed application. It's going to give you benefits in usecases where the standard sync model struggles.
My claim is that these benchmarks are not looking at the use-case that Django async is intended to solve. It's not about increasing throughput to your local DB, and so it's not surprising that you don't see an improvement in benchmarks testing that case. Django's async is intended to enable API-gateways and other long-running requests where the upstream latency can be long enough to starve your API worker threads.
Regarding compute-bound vs. I/O-bound, I'm sure YMMV, but my APM tracing for a mature non-trivial production Django API shows that waiting on the DB accounts for about 25% of the total request time across all my endpoints.
Serialization/deserialization takes an embarrassing amount of time in Django, see https://news.ycombinator.com/item?id=24161828 for example. This framework is optimized for developer productivity, not for performance.
I had a bad experience with Django. I found it cluttered and slow. I really wanted to like it.
It might seem funny, but a more straightforward framework like Symfony didn't get in the way and ended up much faster.
Python should be much much faster than PHP but I guess the framework matters a lot too.
There is just so many layers of magic in Django, that it was becoming impossible for us to improve the performance to an acceptable level. We isolated the problems to serialization / deserialization. Going from DB -> Python object -> JSON response was taking far more time than anything else, and just moving over to FastAPI has gotten us a ~5x improvement in response time.
I am excited to see where Django async goes though. Its something I had been looking forward to for a while now.
-- Boring code - Business logic, CRUD, management, security, ...: django
-- Perf: JWT services on another stack (GPU, Arrow streaming, ...)
So stuff is either Boring Code or Performance Code. Async is great b/c now Boring Code can now simply await Performance Code :) Boring Code gets predictability & general ecosystem, and Performance Code does wilder stuff where we don't worry about non-perf ecosystem stuff, just perf ecosystem oddballs. We've been systematically dropping node from our backend, where we tried to have it all, and IMO too much lift for most teams.
Edit: Although, now that I think a little more its not that surprising. Our initial tests did literally just define FastAPI schemas on top of our existing DB. The co-mingling while actually running is an interesting concept though.
I’d define the class, add it as the return value, I was manually instantiating the class and returning that, but for it didn’t like that and would constantly throw errors about it. I think it was PyDantic which was the root cause there.
The Depends functionality refused to inject my classes as well, but I was probably doing something wrong there...
Dropping back to Starlette was good because it gave me everything I needed and got out of my way. I’ve still got everything fully typed and passing MyPy.
Have any interest in expanding this into a blog post? I've been working on a similar post. Maybe we can compare notes. I'm at michael at testdriven dot io, if interested.
You can also skip the object serialization from the ORM and work with Python dicts directly to significantly improve serialization performance from the database.
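A sketch of the idea, with a plain dataclass standing in for an ORM model (the `Article` class and field names are made up; in Django itself, `QuerySet.values()` gives you the dicts directly):

```python
import json
from dataclasses import dataclass

# Hypothetical model; in Django this would be a Model subclass and the
# rows would come back from the database driver.
@dataclass
class Article:
    id: int
    title: str
    body: str

rows = [{"id": i, "title": f"title {i}", "body": "..."} for i in range(3)]

# Slow path: hydrate full objects, then convert back to dicts for JSON.
objects = [Article(**r) for r in rows]
via_objects = json.dumps([vars(o) for o in objects])

# Fast path: keep the rows as dicts (Django: queryset.values("id", ...))
# and serialize them directly, skipping object construction entirely.
via_dicts = json.dumps(rows)

assert via_objects == via_dicts  # same JSON, one less layer of work
```

The output is identical either way; only the object-hydration layer is removed, which is where much of the per-row overhead lives.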
For the DB requests, are you writing SQL directly, using a different ORM, or something like SQLAlchemy Core that makes SQL Pythonic without being an ORM?
And it uses SQLAlchemy under the hood. Can use all of it. But if you want full async all the way down, can just use core and something like encode/databases for the DB access.
I have also been experimenting with encode/databases for async DB access. It still uses the SA core functions, which is nice, but that means it does not do the nice relationships stuff that SA has built in when using it to handle everything. At least not that I have found. However it does allow for things like gets without relationships, updates of single records, and stuff like that quite nicely.
https://fastapi.tiangolo.com/tutorial/sql-databases/
I am using asyncpg[1] (much more performant and provides close mapping to PostgreSQL, making it much easier to use its advanced features) through raw SQL statements without problems.
[1] https://github.com/MagicStack/asyncpg
https://fastapi.tiangolo.com/tutorial/sql-databases/#create-...
This is somewhat inaccurate. They use SQLAlchemy in the tutorial, but FastAPI is in no way tied to SQLAlchemy.
- https://docs.python.org/3/library/concurrent.futures.html
- https://docs.python.org/3/library/asyncio-eventloop.html#asy...
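Those two links combine into the usual pattern for keeping heavy work off the event loop; a minimal sketch (the `crunch` function is made up):

```python
import asyncio

def crunch(n: int) -> int:
    # CPU-bound work that would starve the event loop if run inline.
    return sum(i * i for i in range(n))

async def main() -> int:
    loop = asyncio.get_running_loop()
    # None selects the default ThreadPoolExecutor. For pure-Python
    # CPU-bound work like this you would pass a
    # concurrent.futures.ProcessPoolExecutor instead, since threads
    # still contend for the GIL.
    return await loop.run_in_executor(None, crunch, 10_000)

result = asyncio.run(main())
```

The `await` means other coroutines keep running while the executor does the computation.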
Using cython:
If you're doing heavy calculations from Python you should at least be considering Cython. When doing a long computation the CPU is not idle, so there is no free compute power to use for something else.
Finally, when doing I/O calls in Python the GIL is usually released, so the kernel can already schedule another thread while waiting for I/O. It is therefore not certain that converting to async will yield an improvement, and it should be benchmarked if you plan on converting an existing program.
This is true, but scheduling another thread through the kernel can have higher overhead since it requires context switches. Running multiple threads also has other potential issues with lock contention; how problematic they are will depend on the use case.
The potential advantage of scheduling another thread is, of course, that it can do CPU bound work; but in Python, unfortunately, doing that means the GIL doesn't get released so that thread will prevent any further network I/O while it's running, the same as would happen in an async framework if a worker did a lot of CPU work. So Python doesn't really let you realize the advantages of threads in this context.
I don't think that's true, the GIL gets released for many computing-intensive or I/O-bound tasks in Python, for example when reading from a socket the GIL gets released at https://github.com/python/cpython/blob/e822e37946f27c09953bb...
I/O bound, yes, since that requires system calls, and system calls (reading from a socket is an example of a system call) release the GIL.
Computing intensive, no. Code that is doing a CPU intensive computation but makes no system calls will never release the GIL.
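A quick way to see the I/O side of this: threads blocked in a call that releases the GIL (`time.sleep` here, standing in for a blocking socket read) overlap instead of serializing:

```python
import threading
import time

def blocking_io() -> None:
    # time.sleep stands in for a blocking system call (e.g. a socket
    # read); like socket reads, it releases the GIL while waiting.
    time.sleep(0.2)

start = time.monotonic()
threads = [threading.Thread(target=blocking_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# The four 0.2 s waits overlap, so total time is ~0.2 s, not ~0.8 s.
assert elapsed < 0.6
```

Replace `time.sleep` with a pure-Python loop and the four threads would take roughly four times as long, because that loop never releases the GIL.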
Any code that does not involve Python objects can release the GIL, no matter whether it makes a system call or not.
For example, NumPy, the most popular scientific computation package in Python, on which many other popular packages like Pandas are based, releases the GIL when doing operations on matrices. This is documented at https://numpy.org/doc/stable/reference/internals.code-explan...:
> If NPY_ALLOW_THREADS is defined during compilation, then as long as no object arrays are involved, the Python Global Interpreter Lock (GIL) is released prior to calling the loops. It is re-acquired if necessary to handle error conditions.
To do so it uses the same macro used by the socket module when doing system calls: https://github.com/numpy/numpy/blob/18a6e3e505ee416ddfc617f3...
And does not involve running Python bytecode. Yes, numpy and other packages that provide C extensions do this when they are doing computations that don't require running Python bytecode.
> no matter whether it makes system call or not
Yes, you're right, my statement was too broad.
https://docs.python.org/3/c-api/init.html#releasing-the-gil-...
Compared to FastAPI[1] I really don't want to use it; I only miss the ORM, since in FastAPI it looks like you have to manually write the code to insert stuff[2].
[0] https://realpython.com/getting-started-with-django-channels/
[1] https://fastapi.tiangolo.com/advanced/websockets/#create-a-w...
[2] https://fastapi.tiangolo.com/tutorial/sql-databases/#create-...
Django is great for CRUD apps, MVPs and such. And I've used it with success for larger platforms, but it doesn't take long for me to want something closer to the metal whenever I need custom work. FastAPI has filled that need wonderfully well.
I also miss the ORM though… SQLAlchemy is a pain.
FWIW I'd have just used Pusher.io and triggered via django signals but I didn't want clients to be able to query the list of channels, or have to think about costs.
This was more of a Friday hack for fun rather than a core product requirement for profit. But I feel like using a separately deployed and scaled service, agnostic of business logic, was the right move.
For most things you're probably better off "doing the work" in a celery task, regardless of whether it is I/O bound or CPU bound. Then use web sockets just for your status updates/progress bar, instead of having your front end poll on a timer.
I'm consistently surprised that there are not awesome web frameworks in JavaScript similar to Django
Also, I still haven't seen how the async addition will work with class-based views. And Django Rest Framework is still deciding whether to spend time on supporting it. Until these two use cases are viable, many users won't benefit.
Nooo, not at all. Your tasks should be I/O bound, not CPU bound, to take advantage of asyncio. Maybe run the async server with multiple threads and multiple event loops, but don't ever do a CPU-heavy task in an event loop, because then you've invalidated using asyncio completely.
reference: http://calpaterson.com/async-python-is-not-faster.html
For some web applications, it might actually be faster (in meaningful aggregate volume) to service a complete request on the calling thread rather than deferring to the thread pool periodically throughout execution. I think the break-over point between sync and async comes down to how much I/O (database) work is involved in satisfying the request. If each request only hits the database 1-2 times on average, incurring a few milliseconds of added latency, going sync all the way down might be better than any amount of added context switching. If each request may take 100-1000 milliseconds to complete overall due to various long-running I/O operations, then async is certainly one good approach for maximizing the number of possible concurrent requests.
In most of my applications (C#/.NET Core) I default to async/await for backend service methods, because 9/10 times I am going to the database multiple times for something and I cannot always guarantee that it will return quickly under heavy load. For other items, I explicitly go wide on parallelizable CPU-bound tasks. All of these are handled as a blocking call against a Parallel.ForEach(). Never would a CPU-bound task be explicitly wrapped with async/await, but one may be included as part of a larger asynchronous operation.
This stuff used to confuse the hell out of me, and then I finally wrapped my head around the 2 essential code abstractions: async/await for I/O, Parallel.For() (et al.) for CPU-bound tasks which have parallelism opportunities. Never try to Task.Run or async/await your way out of something that is CPU-bound and is blocking the flow of execution. Try to leverage asynchrony responsibly when delays >1ms are possible in large concurrent volumes.
The real issue, as the article notes, is latency variation. Because async frameworks rely on cooperative multitasking, there is no way for the event loop to preempt a worker that is taking too long in order to maintain reasonable latency for other requests.
There is one thing I wonder about with this article, though. The article says each worker is making a database query. How is that being done? If it's being done over a network, that worker should yield back to the event loop while it's waiting for the network I/O to complete. If it's being done via a database on the local machine, and the communication with that database is not being done by something like Unix sockets, but by direct calls into a database library, then that's obviously going to cause latency problems because the worker can't yield during the database call. The obvious way to fix that is to have the local database server exposed via socket instead of direct library calls.
No, it's not pure Python. It's a combination. The underlying event loop uses libuv, a C library that makes up the core of Node.js. The name "Uvicorn" is an indicator of this, as Uvicorn uses uvloop, which is built on libuv.
Overall the benchmark is testing a bit of both. The event loop runs on C but it has to execute a bit of python code when handling the request.
>If it's being done via a database on the local machine, and the communication with that database is not being done by something like Unix sockets, but by direct calls into a database library, then that's obviously going to cause latency problems because the worker can't yield during the database call.
I am almost positive it is being done with some form of non blocking sockets. The only other way to do this without sockets is to write to file and read from file.
There is no "direct library calls" as the database server exists as a separate process to the server process. Here's what occurs:
Any library call you're thinking of that's called from the library here may be a "client side" library, meaning that the library actually makes a socket connection to the SQL server.

Database libraries in Python that support this (as opposed to blocking, synchronous sockets, which are of course common) are pretty thin on the ground. That's why I would have liked to see more details in the article about exactly how the benchmark was doing the database queries.
> There is no "direct library calls" as the database server exists as a separate process to the server process.
Yes, you're right, I wasn't being very clear. The key question is, as above, whether nonblocking sockets are being used or not.
https://www.postgresql.org/docs/12/libpq-async.html
https://www.psycopg.org/docs/advanced.html#green-support
https://github.com/MagicStack/asyncpg
[1] https://fastapi.tiangolo.com/benchmarks/
If you've heard of Gunicorn: Uvicorn is the version of Gunicorn built on libuv (via uvloop), hence the name.
It's been discussed before but the outcome of that discussion (in the link you brought up) was divided. Highly highly divided. There was no conclusion and it is not clear whether the benchmark was flawed.
The discussion is also littered with people who don't understand why async is fast for only certain types of things and slow for others. It's also littered with assumptions that the test focused on compute rather than IO which is very very evidently not the case.
The point is that it's not meant to be any faster than a parallel pool of processes that perform the same heavy I/O without blocking all requesting clients. asyncio is about packing as many concurrent socket interactions into a single process as possible, optimising for throughput by giving up the speed that gets eaten up by task context switching. Hence the flaws in the benchmark: it was run on the same machine where Postgres was operating, it used different numbers of processes for the sync and async workloads, and the connection pool was not set up to prevent blocking when a coroutine tries to acquire a connection from an exhausted pool (for benchmark purposes it should have no upper bound and be pre-populated with already established connections).
Wrong. Workers amounts are the same. See the chart with benchmark results. http://calpaterson.com/async-python-is-not-faster.html
>The benchmark was run on the same machine where Postgres was operating.
This wouldn't affect the variance between sync and async results very much because both frameworks were run on the same machine.
>the connection pool was not setup to prevent blocking when a coroutine tries to acquire a connection from the pool when the pool is exhausted.
Real world connection pools have an upper bound limit. I don't see why setting an upper bound limit to be closer to reality is not a good test.
Also you're completely wrong about the connection pool blocking when it is exhausted. See source code:
https://github.com/calpaterson/python-web-perf/blob/master/a...
If all connections are exhausted then the system still yields to compute and incoming requests.
>(for benchmark purposes it should not have upper bound limit and to be pre-populated with already established connections).
Disagree. The real world sets an upper bound. There's nothing wrong with simulating this in a test.
I see that aiohttp has 5, uwsgi has 16, and gunicorn has 12, 14, and 16 depending on a web-framework, is this your definition of the same?
The author says:
> The rule I used for deciding on what the optimal number of worker processes was is simple: for each framework I started at a single worker and increased the worker count successively until performance got worse.
That's not how a benchmark is supposed to be conducted: one doesn't fit the worker count to the result one finds "optimal". One should use the same number of workers, find the bottlenecks, and either eliminate them or explain why they cannot be eliminated without affecting benchmark invariants.
> this wouldn't affect the variance between sync and async results very much because both frameworks were run on the same machine.
It will affect the variance. Firstly, because the DB will spawn processes on the same machine, pgbouncer will spawn processes on the same machine, they all will compete for the same CPU, and the order of preemptive context switches affects individual benchmark runs differently. On top of that, there are periodic and expensive WAL checkpoints, and fsync competes with the benchmark for kernel system call interrupts and context switches, so the multi-process worker setup may be affected dramatically. If you don't believe that external processes affect the numbers to the extent they become incomparable, try to surf the Internet with your web browser randomly while running a benchmark.
> Real world connection pools have an upper bound limit. I don't see why setting an upper bound limit to be closer to reality is not a good test.
Because benchmarks are not real-world workloads, they are designed to show unbound performance of the implementation detail that is selected for a test, where external resources are non-exhaustable for the purpose of avoiding side-effects external to the functionality that is being tested.
> Also you're completely wrong about the connection pool blocking when it is exhausted. See source code:
> If all connections are exhausted then the system still yields to compute and incoming requests.
I didn't say that it wouldn't yield, I said that the coroutine will be blocked at the point where it tries to acquire a non-existing connection from the pool, which affects the benchmark. Now, instead of one blocking context switch at a network socket call that queries Postgres, the coroutine will yield AND WAIT twice - at the exhausted connection pool, and at the network socket call after the connection is acquired. This is exactly the reason why resources should be unbound, and why DB should be on a separate machine (unbound spawning of connection processes upon request), and why the number of OS workers should be the same in all benchmarks, because the sync version will also block twice, and the consequence of blocking there will be much more dramatic and different than in the case of async, WHICH IS THE POINT of a proper benchmark - https://github.com/calpaterson/python-web-perf/blob/master/s...
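The "wait twice" effect with a bounded pool can be sketched with a semaphore standing in for the connection pool (pool size and timings are arbitrary):

```python
import asyncio

async def handle_request(pool: asyncio.Semaphore) -> None:
    # First wait: acquiring a connection from the exhausted pool.
    async with pool:
        # Second wait: the actual network round trip to the database.
        await asyncio.sleep(0.05)

async def main() -> float:
    # A pool of size 1 stands in for an exhausted connection pool; a
    # real pool (e.g. asyncpg's) behaves similarly once max_size is hit.
    pool = asyncio.Semaphore(1)
    loop = asyncio.get_running_loop()
    start = loop.time()
    await asyncio.gather(*(handle_request(pool) for _ in range(4)))
    return loop.time() - start

elapsed = asyncio.run(main())
# With only one "connection" the four requests serialize (~4 x 0.05 s);
# with an unbounded pool their waits would overlap (~0.05 s total).
assert elapsed >= 0.19
```

Whether that bounded-pool behavior is a benchmark flaw or realism is exactly the disagreement in this subthread.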
If you're compute-bound, then Python async (which uses cooperative scheduling, similar to green threads) isn't going to help you. You get concurrency, but not parallelism, from this programming model; only one logical thread of execution is running on the CPU at a time (per process), so this can only slow you down if you are CPU-constrained.
The standard usecase of a sync API backed by a local DB with low request latency is typically going to be compute-bound.
This is covered in the Django async docs (https://docs.djangoproject.com/en/3.1/topics/async/) and also in green threading libraries like gevent (http://www.gevent.org/intro.html#cooperative-multitasking).
The case where async workers are interesting is for I/O-bound workloads. Say you're building an API gateway, or your monolithic API starts to need to call out to other API services, particularly external ones like Google Maps API. In this case, the worst-case result is that the proxied HTTP request times out; this could block your Django API's work thread for many seconds.
In the async / green-threaded model, this case is fine; you have a green thread/async function call per request, and if that gthread is blocked on an upstream I/O operation, the event loop will just start working on a different API call until the OS gives a response on the network socket.
Essentially, there's no reason to use Django async if you're doing a traditional monolithic DB-backed application. It's going to give you benefits in usecases where the standard sync model struggles.
(Note, there's an argument that you might want green threads even in a normal monolith, to guard against cases like "developer accidentally wrote a chunky DB query that takes 60 seconds to run for some inputs", but most DB engines don't support one-DB-connection-per-HTTP-connection. There was a bunch of discussion on this topic a few years ago, with the SQLAlchemy author arguing that async is not useful for DB connections: https://techspot.zzzeek.org/2015/02/15/asynchronous-python-a... although asyncio support was added: https://docs.sqlalchemy.org/en/14/orm/extensions/asyncio.htm...)
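A toy sketch of the API-gateway case, with `asyncio.sleep` standing in for the slow upstream call (in a real gateway this would be an HTTP request to an external service):

```python
import asyncio

async def call_upstream(i: int) -> str:
    # Stand-in for a slow upstream API call; while it waits, the event
    # loop is free to service other in-flight requests.
    await asyncio.sleep(0.2)
    return f"response {i}"

async def main():
    loop = asyncio.get_running_loop()
    start = loop.time()
    results = await asyncio.gather(*(call_upstream(i) for i in range(10)))
    return results, loop.time() - start

results, elapsed = asyncio.run(main())
# Ten 0.2 s upstream waits overlap on a single thread: ~0.2 s in total,
# not ~2 s. A sync worker pool would need ten workers for the same.
assert len(results) == 10 and elapsed < 1.0
```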
> This is covered in the Django async docs
Where is this mentioned in the linked docs? I see four mentions of the ORM in the linked pages, including where it says they are working on async ORM, but I see nothing about performance for the ORM typically being compute bound.
> This article doesn't evaluate the case that you actually want ASGI for, so I don't think it's very useful.
> The standard usecase of a sync API backed by a local DB with low request latency is typically going to be compute-bound.
(Note, I'm specifically talking about Django here)
> Essentially, there's no reason to use Django async if you're doing a traditional monolithic DB-backed application. It's going to give you benefits in usecases where the standard sync model struggles.
My claim is that these benchmarks are not looking at the use-case that Django async is intended to solve. It's not about increasing throughput to your local DB, and so it's not surprising that you don't see an improvement in benchmarks testing that case. Django's async is intended to enable API-gateways and other long-running requests where the upstream latency can be long enough to starve your API worker threads.
Regarding compute-bound vs. I/O-bound, I'm sure YMMV, but my APM tracing for a mature non-trivial production Django API shows that waiting on the DB accounts for about 25% of the total request time across all my endpoints.
How? Afaik PHP is faster than Python in most aspects.