Python Task Queue Performance Benchmarks 2026: Which Library Is Fastest for I/O-Bound Work?
I needed to pick a task queue for a new project. My workload was pretty simple — lots of I/O: database calls, cache lookups, external API requests. I’d always reached for Celery by default. But I started wondering — is Celery actually the fastest option for I/O-bound work, or is it just the most popular?
I found a set of benchmarks by aleksul that tested five Python task queue libraries head-to-head: Celery, Dramatiq, Repid, FastStream, and Taskiq. The results changed how I think about this decision.
The benchmark setup
Four scenarios were run against RabbitMQ 4.3 on an 8-core/16-thread machine. Task durations modeled real-world operations:
- 0.01s — cache lookup
- 0.1s — database call
- 0.5s / 1s — external API call
- 5s — LLM response
The benchmark code is available on GitHub if you want to reproduce or extend it.
I/O-bound throughput: async-native dominates
When concurrency was unlimited, the async-native libraries (Repid, FastStream, Taskiq) dramatically outperformed sync-first Celery and Dramatiq. At the shortest task durations, the difference was an order of magnitude.
Celery and Dramatiq could approach async throughput when paired with gevent green threads, but they showed higher variance due to process overfetching. The gevent approach also requires monkey-patching the standard library, which I’ve seen cause subtle bugs in production: any dependency that assumes real threads or blocking sockets can misbehave once everything underneath it has been patched.
```python
import asyncio
import time

# Illustrative fragments: `router` (Repid), `app` (Celery), and the
# external_api_call* / cache_result* helpers are assumed to be defined elsewhere.

# Repid (async-first)
@router.actor(channel="io-heavy")
async def process_io_task(data: dict) -> None:
    await asyncio.sleep(0.1)  # simulate DB call
    result = await external_api_call(data)
    await cache_result(result)


# Celery (sync-first, gevent boosted)
@app.task
def process_io_task(data: dict) -> None:
    time.sleep(0.1)  # simulate DB call (blocking)
    result = external_api_call_sync(data)
    cache_result_sync(result)
```

The async version can keep thousands of I/O operations in flight per process. The sync version blocks its worker thread for the full duration of every I/O call, so scaling it up means adding threads or processes, each with its own memory and context-switching overhead.
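To make that difference concrete, here is a minimal stdlib-only sketch. It is not taken from the benchmark code: `time.sleep` and `asyncio.sleep` stand in for a blocking and a non-blocking I/O call, and the function names and task count are made up for illustration.

```python
import asyncio
import time

IO_DELAY = 0.01  # stand-in for a cache lookup, in seconds
TASKS = 50       # arbitrary number of simulated jobs


def run_blocking(n: int = TASKS, delay: float = IO_DELAY) -> float:
    """Run n blocking 'I/O calls' one after another on a single thread."""
    start = time.perf_counter()
    for _ in range(n):
        time.sleep(delay)  # the thread can do nothing else while it waits
    return time.perf_counter() - start


async def _fake_io(delay: float) -> None:
    await asyncio.sleep(delay)  # yields to the event loop while waiting


async def _run_all(n: int, delay: float) -> None:
    await asyncio.gather(*(_fake_io(delay) for _ in range(n)))


def run_async(n: int = TASKS, delay: float = IO_DELAY) -> float:
    """Run n non-blocking 'I/O calls' concurrently on one event loop."""
    start = time.perf_counter()
    asyncio.run(_run_all(n, delay))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"blocking: {run_blocking():.3f}s")
    print(f"async:    {run_async():.3f}s")
```

The blocking run should take roughly `n * delay`, while the async run should finish in about one `delay` plus event-loop overhead. Real brokers, serialization, and network latency are not modeled here.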
Limited concurrency: same story
When concurrency was capped at 2000, the async-native libraries still maintained their lead. Non-green-thread Celery and Dramatiq were “barely visible” on the chart. The asyncio libraries reached higher CPU utilization — meaning better vertical scaling on the same hardware.
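How the benchmark enforces its cap is not described here, so the following is only a sketch of one common way to bound in-flight work in asyncio: an `asyncio.Semaphore`. The `run_capped` helper and its parameters are hypothetical.

```python
import asyncio


async def run_capped(total: int, limit: int, delay: float = 0.01) -> int:
    """Run `total` simulated I/O tasks with at most `limit` in flight.

    Returns the peak number of tasks observed running at the same time.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def one_task() -> None:
        nonlocal in_flight, peak
        async with semaphore:           # waits here once `limit` tasks are active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(delay)  # stand-in for the real I/O call
            in_flight -= 1

    await asyncio.gather(*(one_task() for _ in range(total)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(run_capped(total=100, limit=10)))
```

The returned peak never exceeds `limit`, which is the property a concurrency cap is meant to guarantee.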
Steady-rate: confirming the pattern
The steady-rate I/O test ruled out the possibility that warmup or drain phases were hiding the real story. Same shape, same conclusion.
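For readers who want to try a steady-rate test on their own setup, here is a rough sketch of a fixed-rate submitter. It assumes an in-process `asyncio.sleep` as the job rather than a real broker publish, and the `steady_rate` function is a hypothetical helper, not part of the benchmark repository.

```python
import asyncio


async def steady_rate(rate_per_s: float, total: int, delay: float = 0.01) -> list[float]:
    """Submit `total` simulated jobs at a fixed rate; return submit offsets in seconds."""
    interval = 1.0 / rate_per_s
    loop = asyncio.get_running_loop()
    start = loop.time()
    offsets: list[float] = []
    tasks: list[asyncio.Task] = []

    for i in range(total):
        # Sleep until the i-th slot of an absolute schedule so that
        # small delays do not accumulate into drift.
        target = start + i * interval
        await asyncio.sleep(max(0.0, target - loop.time()))
        offsets.append(loop.time() - start)
        tasks.append(asyncio.create_task(asyncio.sleep(delay)))  # stand-in job

    await asyncio.gather(*tasks)  # wait for jobs still in flight (the drain phase)
    return offsets


if __name__ == "__main__":
    print(asyncio.run(steady_rate(rate_per_s=200, total=10)))
```

Because submissions are spread evenly over time, throughput measured this way is less affected by an initial burst or a long drain at the end.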
CPU-bound: now it’s a different game
For CPU-bound workloads (SHA-256 hash iterations), the differences were small across all libraries. Once the CPU is saturated, the library has less room to matter. If your background jobs are crunching numbers or transforming data, pick based on features, ecosystem, or operational preferences — not throughput.
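A CPU-bound job of this kind might look like the sketch below. The exact hashing loop used in the benchmark is not shown here, so `hash_iterations`, the seed, and the round count are assumptions for illustration.

```python
import hashlib
import time


def hash_iterations(seed: bytes, rounds: int) -> str:
    """Feed SHA-256 output back into itself `rounds` times; pure CPU work."""
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


if __name__ == "__main__":
    start = time.perf_counter()
    hash_iterations(b"benchmark", 200_000)
    print(f"elapsed: {time.perf_counter() - start:.3f}s, none of it spent waiting on I/O")
```

There is nothing to await in this loop, so an event loop gains no advantage over a plain worker process.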
What this means for your architecture
The practical takeaway: if your workload is I/O-heavy (most web backend jobs are), async-native libraries let you handle the same throughput with fewer worker processes. Fewer processes means less memory, fewer database connections, simpler deployment.
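A back-of-the-envelope way to see this uses Little's law (jobs in flight = arrival rate × time per job). The numbers below are hypothetical, not benchmark results, and `workers_needed` is a made-up helper that ignores CPU cost, broker limits, and headroom.

```python
def workers_needed(jobs_per_s: int, latency_ms: int, concurrency_per_worker: int) -> int:
    """Estimate worker processes via Little's law: in-flight = rate x latency."""
    in_flight = -(-jobs_per_s * latency_ms // 1000)         # ceiling division
    return max(1, -(-in_flight // concurrency_per_worker))  # ceiling division


# Hypothetical load: 1,000 jobs/s, each waiting 100 ms on a database call.
print(workers_needed(1000, 100, 1))     # one blocking job per process -> 100
print(workers_needed(1000, 100, 2000))  # one event loop, up to 2,000 in flight -> 1
```

In practice a single process will hit CPU or connection limits well before the theoretical figure, so treat the result as a lower bound.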
Celery’s market dominance is based on ecosystem maturity — not raw throughput. That’s fine. It has more monitoring tools, more documentation, more battle-tested deployment patterns. But if performance matters for I/O work, async-native is the better technical choice.
Caveats
The benchmarks were run by Repid’s author. There’s always a risk of unintentional bias. The code is open source though — you can inspect it, modify it, run it yourself.
Production behavior also differs from benchmarks. Shutdown handling, broker recovery, monitoring integration — these matter in practice. Performance alone isn’t sufficient.
Bottom line
For I/O-bound Python workloads (the most common background job pattern), async-native task queues delivered the highest throughput in these benchmarks. Repid led, with FastStream and Taskiq close behind. For CPU-bound work, library choice is secondary to feature fit.
Benchmark against your actual workload before committing. Your production traffic pattern is the only benchmark that truly counts.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Repid Benchmarks Article (aleksul.space)
- 👨💻 Reddit Discussion: Python task queue benchmarks
- 👨💻 Repid Benchmarks Repository
- 👨💻 Repid
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!