
Cython vs Numba: When to Use Each for Python Optimization

Should I use Cython or Numba to speed up my Python code? I benchmarked both on the Benchmarks Game problems and found the answer depends on your use case, not just raw performance numbers.

The Quick Answer

Performance Comparison
| Tool | n-body Speedup | spectral-norm Speedup | Effort Level |
|-------------------|----------------|----------------------|---------------|
| Numba | 56x | 135x | Low |
| Cython (optimized)| 99x | 124x | High |
| Cython (naive) | 10x | - | Medium |

Numba delivers excellent speedups with a single @njit decorator. On n-body, optimized Cython outperformed it (99x vs 56x), but only if you know what you’re doing. My first Cython attempt got 10x instead of 99x, and nothing warned me.

Compilation Models: JIT vs AOT

The fundamental difference between these tools is when compilation happens.

Numba: Just-In-Time Compilation

Numba compiles your Python code to machine code at runtime, on first call:

Numba Compilation Flow
Python bytecode -> Numba IR -> Type inference -> LLVM IR -> Machine code
numba_example.py
from numba import njit
import numpy as np

@njit  # That's all you need
def nbody_step(pos, vel, mass, dt, n):
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i, 0] - pos[j, 0]
            dy = pos[i, 1] - pos[j, 1]
            dz = pos[i, 2] - pos[j, 2]
            dist = np.sqrt(dx*dx + dy*dy + dz*dz)
            mag = dt / (dist * dist * dist)
            # Update all three velocity components of both bodies
            vel[i, 0] -= dx * mag * mass[j]
            vel[i, 1] -= dy * mag * mass[j]
            vel[i, 2] -= dz * mag * mass[j]
            vel[j, 0] += dx * mag * mass[i]
            vel[j, 1] += dy * mag * mass[i]
            vel[j, 2] += dz * mag * mass[i]
    return vel

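The first call compiles the function for the argument types it receives; later calls with the same types reuse the machine code already in memory (pass cache=True to the decorator to also keep it on disk between runs).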
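
To see the compile-on-first-call cost yourself, a minimal sketch like the following times two identical calls. The function name sum_of_roots and the helper time_call are made up for illustration, and the ImportError fallback is an assumption added so the script still runs (without any speedup) when Numba is not installed.

```python
import math
import time

try:
    from numba import njit  # real JIT compilation when Numba is installed
except ImportError:
    def njit(func):
        """Fallback (assumption for this sketch): return the plain Python function."""
        return func


@njit
def sum_of_roots(n):
    # Simple numeric loop of the kind Numba compiles well.
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)
    return total


def time_call(n):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = sum_of_roots(n)
    return result, time.perf_counter() - start


# Under Numba the first call includes compilation; the second reuses the compiled code.
first_result, first_seconds = time_call(100_000)
second_result, second_seconds = time_call(100_000)
print(f"first call:  {first_seconds:.4f} s")
print(f"second call: {second_seconds:.4f} s")
```

With Numba installed, the first timing is typically dominated by compilation, which is why benchmarks should always exclude the first call.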

Cython: Ahead-Of-Time Compilation

Cython transpiles your code to C, then compiles it as a Python extension module:

Cython Compilation Flow
.pyx file -> C code -> C compiler -> Python extension module (.so/.pyd)
nbody_cython.pyx
import numpy as np
cimport numpy as np
from libc.math cimport sqrt

def nbody_step(np.ndarray[np.float64_t, ndim=2] pos,
               np.ndarray[np.float64_t, ndim=2] vel,
               np.ndarray[np.float64_t, ndim=1] mass,
               double dt, int n):
    cdef int i, j
    cdef double dx, dy, dz, dist, mag
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i, 0] - pos[j, 0]
            dy = pos[i, 1] - pos[j, 1]
            dz = pos[i, 2] - pos[j, 2]
            dist = sqrt(dx*dx + dy*dy + dz*dz)
            mag = dt / (dist * dist * dist)
            # Update all three velocity components of both bodies
            vel[i, 0] -= dx * mag * mass[j]
            vel[i, 1] -= dy * mag * mass[j]
            vel[i, 2] -= dz * mag * mass[j]
            vel[j, 0] += dx * mag * mass[i]
            vel[j, 1] += dy * mag * mass[i]
            vel[j, 2] += dz * mag * mass[i]
    return vel

Plus a build file:

setup.py
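import numpy as np
from setuptools import setup
from Cython.Build import cythonize
# include_dirs is needed because the .pyx file uses "cimport numpy"
setup(ext_modules=cythonize("nbody_cython.pyx"),
      include_dirs=[np.get_include()])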
Build the extension
python setup.py build_ext --inplace

The Hidden Trap in Cython

I enjoyed writing Cython, but I learned the hard way that it has silent performance pitfalls.

My Naive Attempt (10x speedup)

naive_cython.pyx
# This compiled and ran correctly...
def compute(values):
    result = 0.0
    for v in values:
        result += v ** 0.5  # Silent trap!
    return result

The code worked. It gave 10x speedup. But it should have been much faster.

The Problem: Python Object Operations

When the operands of ** are untyped Python objects, as v is in my naive code, Cython falls back to Python’s slow object operations instead of emitting a C call:

slow_vs_fast.pyx
from libc.math cimport sqrt

def compute_slow(double[:] values):
    cdef double result = 0.0
    for v in values:
        result += v ** 0.5  # 40x slower than sqrt!
    return result

def compute_fast(double[:] values):
    cdef double result = 0.0
    cdef double v
    for v in values:
        result += sqrt(v)  # Direct C call
    return result
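
Before swapping ** 0.5 for sqrt, it helps to know where the two spellings agree and where they differ. The sketch below uses plain Python only (it says nothing about the C code Cython generates); the helper names root_via_pow and root_via_sqrt are made up for illustration. The difference for negative inputs may be one reason a compiler cannot always treat ** 0.5 as a bare sqrt call.

```python
import math


def root_via_pow(v):
    """Square root spelled with the power operator."""
    return v ** 0.5


def root_via_sqrt(v):
    """Square root spelled with math.sqrt."""
    return math.sqrt(v)


# For non-negative floats the two spellings agree to within rounding.
samples = (0.0, 2.0, 9.0, 1e10)
agree = all(math.isclose(root_via_pow(v), root_via_sqrt(v)) for v in samples)
print("agree on non-negative inputs:", agree)

# For negative floats they behave differently: ** returns a complex number,
# while math.sqrt raises ValueError.
negative_pow = root_via_pow(-4.0)
print("(-4.0) ** 0.5 has type:", type(negative_pow).__name__)

try:
    root_via_sqrt(-4.0)
    sqrt_raised = False
except ValueError:
    sqrt_raised = True
print("math.sqrt(-4.0) raised ValueError:", sqrt_raised)
```

If your data can contain negative values, decide which behavior you want before making the replacement.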

The Solution: Annotation Reports

Always use cython -a to generate an HTML report:

Generate annotation report
cython -a mymodule.pyx
# Open mymodule.html in browser

The report shows yellow lines (Python interaction, slow) and white lines (pure C, fast). My naive code was full of yellow lines I didn’t know about.

When Numba Wins

Numba excels when you want speed without learning C semantics:

Numba Advantages
| Criteria | Numba |
|---------------------|---------------------------|
| Setup | pip install numba |
| Code changes | Add @njit decorator |
| Learning curve | Low |
| NumPy integration | Excellent |
| Parallel execution | @njit(parallel=True) |
| GPU support | CUDA via @cuda.jit |

Automatic Parallelization

numba_parallel.py
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def compute_distances(points):
    n = len(points)
    distances = np.zeros((n, n))
    for i in prange(n):  # Parallel loop
        for j in range(i+1, n):
            diff = points[i] - points[j]
            distances[i, j] = np.sqrt(np.sum(diff**2))
            distances[j, i] = distances[i, j]
    return distances

GPU Acceleration

numba_cuda.py
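from numba import cuda
import numpy as np

@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)  # Global thread index
    if idx < len(a):
        c[idx] = a[idx] + b[idx]

# Launch with one thread per element (requires a CUDA-capable GPU)
a = np.arange(1024, dtype=np.float32)
b = np.ones_like(a)
c = np.zeros_like(a)
vector_add[(a.size + 255) // 256, 256](a, b, c)  # [blocks, threads per block]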

When Cython Wins

Cython is the better choice when you need C integration or distribution:

Cython Advantages
| Criteria | Cython |
|-----------------------|--------------------------------|
| C library wrapping | Excellent |
| Maximum performance | 99-124x |
| Distributable wheels | Yes (no JIT at runtime) |
| Compile-time errors | Catches type mismatches |
| Long-term maintenance | More robust |

Wrapping C Libraries

wrap_libm.pyx
from libc.math cimport sin, cos, sqrt, pow, exp, log

def fast_sqrt(double x):
    """Direct C sqrt - 40x faster than ** 0.5"""
    return sqrt(x)

def fast_sin(double x):
    """Direct C sin"""
    return sin(x)

def fast_exp(double x):
    """Direct C exp"""
    return exp(x)

Numba can call individual C functions through ctypes or cffi, but wrapping a full C API (headers, structs, callbacks) this way is far more awkward than in Cython.

Building Distributable Packages

Cython compiles to .so/.pyd files you can ship in wheels:

Distribution Comparison
| Tool | Distribution Method |
|--------|----------------------------------------|
| Numba | Requires Numba + JIT at runtime |
| Cython | Compiled wheel, no build tools needed |

Decision Matrix

Cython vs Numba Decision Matrix
| Criteria | Numba | Cython |
|---------------------------|--------------|----------------|
| Setup complexity | Low | Medium-High |
| Learning curve | Low | Medium-High |
| Maximum performance | 56-135x | 99-124x |
| NumPy integration | Excellent | Good |
| C library wrapping | Limited | Excellent |
| Distribution | JIT required | Compiled wheel |
| Debugging | Easier | Harder |
| GPU support | CUDA | No |
| Automatic parallelization | Yes | Manual |
| Silent performance traps | Rare | Common |

Common Pitfalls

Numba: Object Mode Fallback

In Numba releases before 0.59, a bare @jit falls back to slow object mode instead of failing when it can’t compile (newer releases default to nopython mode):

numba_pitfall.py
from numba import jit, njit

# RISKY: before Numba 0.59, a bare @jit fell back to slow object mode instead of raising
@jit
def process(data):
    result = {}
    for i, v in enumerate(data):
        result[i] = v * 2
    return result

# RIGHT: Force nopython mode, get error if it fails
@njit  # Equivalent to @jit(nopython=True)
def process_fast(data):
    # Will raise TypingError if it can't compile
    result = 0.0
    for v in data:
        result += v
    return result

Always use @njit instead of @jit to catch compilation failures early.

Cython: Silent Slowdowns

Cython compiles and runs code that looks correct but isn’t optimized:

cython_pitfall.pyx
# WRONG: Compiles fine, but uses slow Python operations
def compute(data):
    result = 0.0
    for v in data:
        result += v ** 0.5  # Falls back to Python pow
    return result

# RIGHT: Use C functions with typed variables
from libc.math cimport sqrt

def compute_fast(double[:] data):
    cdef double result = 0.0
    cdef double v
    for v in data:
        result += sqrt(v)  # Direct C call
    return result

Always check the cython -a annotation report.

Practical Workflow

Here’s the workflow I recommend:

Optimization Decision Tree
START: Need to optimize Python code?
|
+-- Is it numerical/NumPy-heavy?
|   +-- YES -> Try Numba @njit first
|   |   +-- Good speedup? -> DONE
|   |   +-- Need more? -> Try parallel=True, fastmath=True
|   |
|   +-- NO -> Need to wrap C/C++ libraries?
|       +-- YES -> Use Cython
|       +-- NO -> Evaluate both:
|           - Numba for ease/speed balance
|           - Cython for max performance
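
The same tree can be written as a small function, which makes the branch order explicit. This is only a sketch: the function name recommend_tool and its three boolean parameters are my own labels for the questions in the tree, not part of either library.

```python
def recommend_tool(numeric_heavy, wraps_c_library=False, numba_fast_enough=True):
    """Walk the decision tree above and return a recommendation string.

    numeric_heavy:      is the hot path numerical / NumPy-heavy?
    wraps_c_library:    do you need to wrap C/C++ libraries?
    numba_fast_enough:  did plain @njit already give a good speedup?
    """
    if numeric_heavy:
        if numba_fast_enough:
            return "Numba @njit"
        # Numerical code that needs more: stay with Numba, turn on extra options.
        return "Numba with parallel=True, fastmath=True"
    if wraps_c_library:
        return "Cython"
    return "Evaluate both: Numba for ease/speed balance, Cython for max performance"


print(recommend_tool(numeric_heavy=True))
print(recommend_tool(numeric_heavy=False, wraps_c_library=True))
```

Note that the C-wrapping question is only asked on the non-numerical branch, exactly as in the tree.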

Step 1: Profile First

profile.py
import cProfile
import pstats
cProfile.run('your_function()', 'profile_stats')
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 time-consuming functions
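
If you want something you can run as-is, here is a self-contained variant of the same idea. The functions slow_norm and workload are stand-ins invented for this sketch; replace them with your own code. It uses cProfile.Profile directly and writes the report to a string instead of a stats file.

```python
import cProfile
import io
import math
import pstats


def slow_norm(values):
    # Deliberately loop-heavy pure Python: the kind of hot spot worth compiling.
    total = 0.0
    for v in values:
        total += v * v
    return math.sqrt(total)


def workload():
    data = [float(i) for i in range(50_000)]
    return [slow_norm(data) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Collect the report in memory, sorted by cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

In the printed report, slow_norm should account for most of the cumulative time, which marks it as the candidate for @njit or Cython.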

Step 2: Try Numba

try_numba.py
from numba import njit

@njit
def hot_function(data):
    # Your hot path here
    pass

# Warm up (first call compiles)
hot_function(small_test_data)
# Measure (%timeit is an IPython/Jupyter magic, not plain Python)
%timeit hot_function(real_data)
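
Outside IPython, the standard timeit module does the same job. The sketch below is one way to turn two timings into a speedup figure; measure_speedup, baseline and candidate are hypothetical names, and the two pure-Python functions merely stand in for "original code" and "optimized code" so the example runs without Numba or Cython.

```python
import math
import timeit


def baseline(values):
    """Stand-in for the original, slow implementation."""
    return sum(v ** 0.5 for v in values)


def candidate(values):
    """Stand-in for the optimized implementation (e.g. an @njit function)."""
    return sum(map(math.sqrt, values))


def measure_speedup(slow, fast, args, repeat=3, number=10):
    """Return slow_time / fast_time, after checking both give the same answer."""
    # The correctness check doubles as the warm-up call for JIT-compiled code.
    if not math.isclose(slow(*args), fast(*args), rel_tol=1e-9):
        raise ValueError("implementations disagree; speedup would be meaningless")
    slow_time = min(timeit.repeat(lambda: slow(*args), repeat=repeat, number=number))
    fast_time = min(timeit.repeat(lambda: fast(*args), repeat=repeat, number=number))
    return slow_time / fast_time


data = [float(i) for i in range(10_000)]
speedup = measure_speedup(baseline, candidate, (data,))
print(f"speedup: {speedup:.1f}x")
```

Taking the minimum over several repeats reduces noise from other processes; checking the results first guards against "fast but wrong" code.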

Step 3: If Numba Isn’t Enough, Try Cython

cython_workflow.sh
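# Write your .pyx file with type annotations
# Generate annotation report to find slow lines
cython -a mymodule.pyx
# Build
python setup.py build_ext --inplace
# Benchmark: imports and input go in -s so only the call is timed
# (the random array is a placeholder input; use your own data)
python -m timeit -s "import numpy as np, mymodule; data = np.random.rand(1_000_000)" "mymodule.function(data)"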

When They Work Together

You don’t have to choose just one. A common pattern:

Hybrid Architecture
Python orchestrator
|
+-- Numba: Numerical kernels (quick iteration)
|
+-- Cython: C library bindings (stable interface)

Use Numba for rapid prototyping of numerical code, then migrate stable kernels to Cython for production builds.
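
One way to make that migration painless is to keep the orchestrator dependent on a function interface rather than on a specific backend. The sketch below assumes a hypothetical module named fast_kernels that exposes norm(); it could hold a Numba-jitted kernel today and a compiled Cython extension later. When the module is not importable, a pure-Python fallback is used, so the example runs anywhere.

```python
import importlib
import math


def pure_python_norm(values):
    """Reference implementation used when no compiled backend is importable."""
    return math.sqrt(sum(v * v for v in values))


def load_backend(module_name, attribute, fallback):
    """Return module_name.attribute if it can be imported, else the fallback."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attribute)
    except (ImportError, AttributeError):
        return fallback


# "fast_kernels" is a hypothetical module name used only for illustration.
norm = load_backend("fast_kernels", "norm", pure_python_norm)


def normalize(values):
    """Orchestrator code: depends only on the norm() interface, not the backend."""
    length = norm(values)
    return [v / length for v in values]


print(normalize([3.0, 4.0]))
```

Because normalize never imports Numba or Cython directly, swapping the kernel's implementation does not touch the orchestrator.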

Key Takeaways

  1. Numba wins on effort-to-reward ratio: Add @njit, get 56-135x speedup. No build system, no C knowledge required.

  2. Cython can win on peak performance: 99x vs Numba’s 56x on n-body (Numba was ahead on spectral-norm, 135x vs 124x), but it requires understanding C semantics and using annotation reports.

  3. Cython has hidden traps: Silent performance issues like ** 0.5 being 40x slower than sqrt() require the annotation report to debug.

  4. Use @njit not @jit: Force nopython mode to catch Numba compilation failures early.

  5. Always check cython -a: The annotation report is essential for finding yellow (slow) lines in Cython code.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

