Cython vs Numba: When to Use Each for Python Optimization
Should I use Cython or Numba to speed up my Python code? I benchmarked both on the Benchmarks Game problems and found the answer depends on your use case, not just raw performance numbers.
The Quick Answer
| Tool               | n-body Speedup | spectral-norm Speedup | Effort Level |
|--------------------|----------------|-----------------------|--------------|
| Numba              | 56x            | 135x                  | Low          |
| Cython (optimized) | 99x            | 124x                  | High         |
| Cython (naive)     | 10x            | -                     | Medium       |

Numba delivers excellent speedups with a single @njit decorator. Cython can outperform it, but only if you know what you’re doing. My first Cython attempt got 10x instead of 124x, and nothing warned me.
Compilation Models: JIT vs AOT
The fundamental difference between these tools is when compilation happens.
Numba: Just-In-Time Compilation
Numba compiles your Python code to machine code at runtime, on first call:
```text
Python bytecode -> Numba IR -> Type inference -> LLVM IR -> Machine code
```

```python
from numba import njit
import numpy as np

@njit  # That's all you need
def nbody_step(pos, vel, mass, dt, n):
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i, 0] - pos[j, 0]
            dy = pos[i, 1] - pos[j, 1]
            dz = pos[i, 2] - pos[j, 2]
            dist = np.sqrt(dx*dx + dy*dy + dz*dz)
            mag = dt / (dist * dist * dist)
            # Update all three velocity components of both bodies
            vel[i, 0] -= dx * mag * mass[j]
            vel[i, 1] -= dy * mag * mass[j]
            vel[i, 2] -= dz * mag * mass[j]
            vel[j, 0] += dx * mag * mass[i]
            vel[j, 1] += dy * mag * mass[i]
            vel[j, 2] += dz * mag * mass[i]
    return vel
```

The first call compiles the function; subsequent calls reuse the cached machine code.
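To make the compile-on-first-call behavior visible, here is a minimal sketch that times two identical calls. The kernel `sum_of_roots` and the helper `timed_call` are illustrative names, not part of the benchmark, and the no-op fallback decorator is an assumption added so the script still runs (without any speedup) where Numba is not installed.

```python
import math
import time

try:
    from numba import njit  # the real Numba decorator
except ImportError:
    # Assumption for this sketch: fall back to a no-op decorator so the
    # script still runs (uncompiled) when Numba is not installed.
    def njit(func):
        return func


@njit
def sum_of_roots(n):
    """Sum sqrt(i) for i in [0, n): a tiny stand-in for a numerical kernel."""
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)
    return total


def timed_call(n):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = sum_of_roots(n)
    return result, time.perf_counter() - start


first_result, first_time = timed_call(100_000)    # includes JIT compilation
second_result, second_time = timed_call(100_000)  # reuses the compiled code
print(f"first call:  {first_time:.4f} s")
print(f"second call: {second_time:.4f} s")
```

With Numba installed, the first call should be noticeably slower than the second because it includes compilation. The compiled code lives in memory for the current process; passing `cache=True` to `@njit` also stores it on disk so later runs can skip recompilation.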
Cython: Ahead-Of-Time Compilation
Cython transpiles your code to C, then compiles it as a Python extension module:
```text
.pyx file -> C code -> C compiler -> Python extension module (.so/.pyd)
```

```cython
import numpy as np
cimport numpy as np
from libc.math cimport sqrt

def nbody_step(np.ndarray[np.float64_t, ndim=2] pos,
               np.ndarray[np.float64_t, ndim=2] vel,
               np.ndarray[np.float64_t, ndim=1] mass,
               double dt, int n):
    cdef int i, j
    cdef double dx, dy, dz, dist, mag

    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i, 0] - pos[j, 0]
            dy = pos[i, 1] - pos[j, 1]
            dz = pos[i, 2] - pos[j, 2]
            dist = sqrt(dx*dx + dy*dy + dz*dz)
            mag = dt / (dist * dist * dist)
            # Update all three velocity components of both bodies
            vel[i, 0] -= dx * mag * mass[j]
            vel[i, 1] -= dy * mag * mass[j]
            vel[i, 2] -= dz * mag * mass[j]
            vel[j, 0] += dx * mag * mass[i]
            vel[j, 1] += dy * mag * mass[i]
            vel[j, 2] += dz * mag * mass[i]
    return vel
```

Plus a build file:
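```python
# setup.py
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("nbody_cython.pyx"),
    # NumPy headers are needed because the .pyx file uses `cimport numpy`
    include_dirs=[np.get_include()],
)
```

```bash
python setup.py build_ext --inplace
```

The Hidden Trap in Cython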
I enjoyed writing Cython, but I learned the hard way that it has silent performance pitfalls.
My Naive Attempt (10x speedup)
```cython
# This compiled and ran correctly...
def compute(values):
    result = 0.0
    for v in values:
        result += v ** 0.5  # Silent trap!
    return result
```

The code worked and gave a 10x speedup, but it should have been much faster.
The Problem: Python Object Operations
In my benchmark, Cython’s ** operator with a float exponent fell back to Python’s slow object operations instead of making a direct C call:

```cython
from libc.math cimport sqrt

def compute_slow(double[:] values):
    cdef double result = 0.0
    for v in values:
        result += v ** 0.5  # 40x slower than sqrt!
    return result

def compute_fast(double[:] values):
    cdef double result = 0.0
    cdef double v
    for v in values:
        result += sqrt(v)  # Direct C call
    return result
```

The Solution: Annotation Reports
Always use cython -a to generate an HTML report:
```bash
cython -a mymodule.pyx
# Open mymodule.html in a browser
```

The report highlights lines that interact with Python in yellow (slow) and leaves pure C lines white (fast). My naive code was full of yellow lines I didn’t know about.
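The annotation report shows where Python interaction happens; a quick timing check shows whether a rewrite actually paid off. The following is a minimal sketch using the standard-library `timeit` module. The functions `compute_pow` and `compute_sqrt` are pure-Python stand-ins and `compare` is a hypothetical helper; in practice you would pass in the two compiled functions from your own extension module, and the timings printed here say nothing about Cython itself.

```python
import math
import timeit


def compute_pow(values):
    """Stand-in for the variant that uses ** 0.5."""
    result = 0.0
    for v in values:
        result += v ** 0.5
    return result


def compute_sqrt(values):
    """Stand-in for the variant that calls sqrt directly."""
    result = 0.0
    for v in values:
        result += math.sqrt(v)
    return result


def compare(funcs, data, repeat=5, number=20):
    """Return {function name: best seconds per call}, after checking
    that every function agrees with the first one on `data`."""
    reference = funcs[0](data)
    timings = {}
    for func in funcs:
        assert math.isclose(func(data), reference, rel_tol=1e-9)
        best = min(timeit.repeat(lambda: func(data), repeat=repeat, number=number))
        timings[func.__name__] = best / number
    return timings


data = [float(i) for i in range(10_000)]
timings = compare([compute_pow, compute_sqrt], data)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```

Checking that both variants return the same result before timing them guards against "optimizing" a function into a different answer.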
When Numba Wins
Numba excels when you want speed without learning C semantics:
| Criteria           | Numba                |
|--------------------|----------------------|
| Setup              | pip install numba    |
| Code changes       | Add @njit decorator  |
| Learning curve     | Low                  |
| NumPy integration  | Excellent            |
| Parallel execution | @njit(parallel=True) |
| GPU support        | CUDA via @cuda.jit   |

Automatic Parallelization
```python
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def compute_distances(points):
    n = len(points)
    distances = np.zeros((n, n))
    for i in prange(n):  # Parallel loop
        for j in range(i+1, n):
            diff = points[i] - points[j]
            distances[i, j] = np.sqrt(np.sum(diff**2))
            distances[j, i] = distances[i, j]
    return distances
```

GPU Acceleration
```python
from numba import cuda
import numpy as np

@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)
    if idx < len(a):
        c[idx] = a[idx] + b[idx]
```

When Cython Wins
Cython is the better choice when you need C integration or distribution:
| Criteria              | Cython                  |
|-----------------------|-------------------------|
| C library wrapping    | Excellent               |
| Maximum performance   | 99-124x                 |
| Distributable wheels  | Yes (no JIT at runtime) |
| Compile-time errors   | Catches type mismatches |
| Long-term maintenance | More robust             |

Wrapping C Libraries
```cython
from libc.math cimport sin, cos, sqrt, pow, exp, log

def fast_sqrt(double x):
    """Direct C sqrt - 40x faster than ** 0.5"""
    return sqrt(x)

def fast_sin(double x):
    """Direct C sin"""
    return sin(x)

def fast_exp(double x):
    """Direct C exp"""
    return exp(x)
```

This is considerably more awkward with Numba’s pure Python approach, which has to reach C functions through ctypes or cffi bindings.
Building Distributable Packages
Cython compiles to .so/.pyd files you can ship in wheels:
| Tool   | Distribution Method                   |
|--------|---------------------------------------|
| Numba  | Requires Numba + JIT at runtime       |
| Cython | Compiled wheel, no build tools needed |

Decision Matrix
| Criteria                  | Numba        | Cython         |
|---------------------------|--------------|----------------|
| Setup complexity          | Low          | Medium-High    |
| Learning curve            | Low          | Medium-High    |
| Maximum performance       | 56-135x      | 99-124x        |
| NumPy integration         | Excellent    | Good           |
| C library wrapping        | Limited      | Excellent      |
| Distribution              | JIT required | Compiled wheel |
| Debugging                 | Easier       | Harder         |
| GPU support               | CUDA         | No             |
| Automatic parallelization | Yes          | Manual         |
| Silent performance traps  | Rare         | Common         |

Common Pitfalls
Numba: Object Mode Fallback
With a bare @jit, Numba releases before 0.59 silently fall back to slow object mode when they can’t compile a function in nopython mode:

```python
from numba import jit

# WRONG: No error on older Numba, but falls back to slow object mode
@jit
def process(data):
    result = {}
    for i, v in enumerate(data):
        result[i] = v * 2
    return result

# RIGHT: Force nopython mode, get error if it fails
from numba import njit

@njit  # Equivalent to @jit(nopython=True)
def process_fast(data):
    # Will raise TypingError if it can't compile
    result = 0.0
    for v in data:
        result += v
    return result
```

Always use @njit instead of @jit to catch compilation failures early.
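One way to catch these failures early is to compile each hot function once in a test and treat a Numba error as a failure. The sketch below assumes nothing beyond Numba’s public `njit` decorator and its `NumbaError` base exception; `compiles_in_nopython`, `total`, and `uses_fraction` are hypothetical names for illustration, and the helper returns None when Numba is not installed so the script still runs.

```python
from fractions import Fraction

try:
    from numba import njit
    from numba.core.errors import NumbaError  # base class of TypingError
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False


def total(n):
    """Simple numeric loop: expected to compile in nopython mode."""
    acc = 0.0
    for i in range(n):
        acc += i * 0.5
    return acc


def uses_fraction(n):
    """Uses fractions.Fraction, which Numba cannot type in nopython mode."""
    return float(Fraction(n, 3))


def compiles_in_nopython(func, *args):
    """Return True/False if Numba is installed, or None if it is not.

    Compilation is triggered by calling the jitted function once with
    representative arguments.
    """
    if not HAVE_NUMBA:
        return None
    try:
        njit(func)(*args)
    except NumbaError:
        return False
    return True


print("total:", compiles_in_nopython(total, 10))
print("uses_fraction:", compiles_in_nopython(uses_fraction, 3))
```

Running a check like this in a test suite turns a silent slowdown into a visible failure.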
Cython: Silent Slowdowns
Cython compiles and runs code that looks correct but isn’t optimized:
```cython
# WRONG: Compiles fine, but uses slow Python operations
def compute(data):
    result = 0.0
    for v in data:
        result += v ** 0.5  # Falls back to Python pow
    return result

# RIGHT: Use C functions with typed variables
from libc.math cimport sqrt

def compute_fast(double[:] data):
    cdef double result = 0.0
    cdef double v
    for v in data:
        result += sqrt(v)  # Direct C call
    return result
```

Always check the cython -a annotation report.
Practical Workflow
Here’s the workflow I recommend:
```text
START: Need to optimize Python code?
|
+-- Is it numerical/NumPy-heavy?
|   +-- YES -> Try Numba @njit first
|   |   +-- Good speedup? -> DONE
|   |   +-- Need more? -> Try parallel=True, fastmath=True
|   |
|   +-- NO -> Need to wrap C/C++ libraries?
|       +-- YES -> Use Cython
|       +-- NO -> Evaluate both:
|           - Numba for ease/speed balance
|           - Cython for max performance
```

Step 1: Profile First
```python
import cProfile
import pstats

cProfile.run('your_function()', 'profile_stats')
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 time-consuming functions
```

Step 2: Try Numba
```python
from numba import njit

@njit
def hot_function(data):
    # Your hot path here
    pass

# Warm up (first call compiles)
hot_function(small_test_data)

# Measure (%timeit is an IPython/Jupyter magic, not plain Python)
%timeit hot_function(real_data)
```

Step 3: If Numba Isn’t Enough, Try Cython
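```bash
# Write your .pyx file with type annotations
# Generate annotation report to find slow lines
cython -a mymodule.pyx

# Build
python setup.py build_ext --inplace

# Benchmark: build the input in the -s setup string so it is not timed
# (the random array is a placeholder; substitute your real data)
python -m timeit -s "import numpy as np, mymodule; data = np.random.rand(1_000_000)" "mymodule.function(data)"
```

When They Work Together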
You don’t have to choose just one. A common pattern:
```text
Python orchestrator
|
+-- Numba: Numerical kernels (quick iteration)
|
+-- Cython: C library bindings (stable interface)
```

Use Numba for rapid prototyping of numerical code, then migrate stable kernels to Cython for production builds.
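One possible shape for such an orchestrator is a loader that prefers a prebuilt Cython extension, falls back to a Numba-compiled kernel, and finally to plain Python. This is only a sketch of the idea: the module name `fast_kernels_cy`, the kernel `sum_of_squares`, and the helper `load_kernel` are all hypothetical and do not come from the benchmark repo.

```python
import importlib


def sum_of_squares(n):
    """Pure-Python reference kernel: 0^2 + 1^2 + ... + (n-1)^2 as a float."""
    total = 0.0
    for i in range(n):
        total += float(i) * float(i)
    return total


def load_kernel(cython_module="fast_kernels_cy"):
    """Return (backend_name, callable), preferring compiled backends."""
    # 1. A prebuilt Cython extension (hypothetical module name) wins if present.
    try:
        module = importlib.import_module(cython_module)
        return "cython", module.sum_of_squares
    except (ImportError, AttributeError):
        pass
    # 2. Otherwise let Numba JIT-compile the reference kernel.
    try:
        from numba import njit
        return "numba", njit(sum_of_squares)
    except ImportError:
        pass
    # 3. Last resort: plain Python, slow but always available.
    return "python", sum_of_squares


backend, kernel = load_kernel()
print(f"using {backend} backend: {kernel(1000)}")
```

Keeping one pure-Python reference kernel also gives you something to test the compiled versions against.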
Key Takeaways
- Numba wins on effort-to-reward ratio: Add `@njit`, get 56-135x speedup. No build system, no C knowledge required.
- Cython wins on maximum performance: 99-124x possible, but requires understanding C semantics and using annotation reports.
- Cython has hidden traps: Silent performance issues like `** 0.5` being 40x slower than `sqrt()` require the annotation report to debug.
- Use `@njit`, not `@jit`: Force nopython mode to catch Numba compilation failures early.
- Always check `cython -a`: The annotation report is essential for finding yellow (slow) lines in Cython code.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Cython Documentation: Overview
- Numba 5-Minute Guide
- The Optimization Ladder - Comprehensive Python Benchmark
- GitHub: faster-python-bench
- Reddit Discussion: Benchmarked Every Python Optimization Path
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!