Cython vs Numba: When to Use Each for Python Optimization

Mar 11, 2026

Should I use Cython or Numba to speed up my Python code? I benchmarked both on the Benchmarks Game problems and found the answer depends on your use case, not just raw performance numbers.

The Quick Answer

| Tool              | n-body Speedup | spectral-norm Speedup | Effort Level |
|-------------------|----------------|----------------------|---------------|
| Numba             | 56x            | 135x                 | Low           |
| Cython (optimized)| 99x            | 124x                 | High          |
| Cython (naive)    | 10x            | -                    | Medium        |

Numba delivers excellent speedups with a single @njit decorator. Cython can outperform it, but only if you know what you’re doing. My first Cython attempt at n-body got 10x instead of 99x, and nothing warned me.
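A speedup figure like the ones in the table is the baseline time divided by the optimized time. Here is a minimal sketch of how such a ratio could be measured with the standard library. The helper names (best_time, speedup) and the two toy functions are illustrative assumptions, not the harness behind the numbers above.

```python
import timeit


def best_time(func, *args, repeat=5, number=10):
    """Return the fastest per-call time in seconds over several runs."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


def speedup(baseline, optimized, *args):
    """Speedup factor: baseline time divided by optimized time."""
    return best_time(baseline, *args) / best_time(optimized, *args)


# Hypothetical stand-ins for a "before" and an "after" implementation.
def sum_squares_loop(values):
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_builtin(values):
    return sum(v * v for v in values)


values = list(range(10_000))
print(f"speedup: {speedup(sum_squares_loop, sum_squares_builtin, values):.2f}x")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when comparing ratios as large as the ones reported here.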

Compilation Models: JIT vs AOT

The fundamental difference between these tools is when compilation happens.

Numba: Just-In-Time Compilation

Numba compiles your Python code to machine code at runtime, on first call:

Python bytecode -> Numba IR -> Type inference -> LLVM IR -> Machine code

from numba import njit
import numpy as np

@njit  # That's all you need
def nbody_step(pos, vel, mass, dt, n):
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i, 0] - pos[j, 0]
            dy = pos[i, 1] - pos[j, 1]
            dz = pos[i, 2] - pos[j, 2]
            dist = np.sqrt(dx*dx + dy*dy + dz*dz)
            mag = dt / (dist * dist * dist)
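            # Apply equal and opposite velocity changes on all three axes
            vel[i, 0] -= dx * mag * mass[j]
            vel[i, 1] -= dy * mag * mass[j]
            vel[i, 2] -= dz * mag * mass[j]
            vel[j, 0] += dx * mag * mass[i]
            vel[j, 1] += dy * mag * mass[i]
            vel[j, 2] += dz * mag * mass[i]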
    return vel

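The first call compiles the function for the argument types it receives; later calls with the same types reuse the compiled machine code held in memory. Passing cache=True to @njit also stores the compiled code on disk between runs.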
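The compile-on-first-call cost is easy to see by timing two identical calls back to back. This is a small sketch under a few assumptions: sum_of_roots and time_call are made-up names, and the try/except fallback exists only so the snippet still runs where Numba is not installed (in which case both timings are plain-Python timings and show no compile cost).

```python
import math
import time

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch runs without Numba: a no-op decorator.
    def njit(func):
        return func


@njit
def sum_of_roots(n):
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)
    return total


def time_call(func, *args):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# With Numba installed, the first call includes JIT compilation time.
first_result, first_seconds = time_call(sum_of_roots, 100_000)
# The second call reuses the machine code compiled for an int argument.
later_result, later_seconds = time_call(sum_of_roots, 100_000)

print(f"first call: {first_seconds:.4f}s, later call: {later_seconds:.6f}s")
```

This is also why benchmarks should exclude the first call: otherwise compilation time gets counted as run time.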

Cython: Ahead-Of-Time Compilation

Cython transpiles your code to C, then compiles it as a Python extension module:

.pyx file -> C code -> C compiler -> Python extension module (.so/.pyd)

import numpy as np
cimport numpy as np
from libc.math cimport sqrt

def nbody_step(np.ndarray[np.float64_t, ndim=2] pos,
               np.ndarray[np.float64_t, ndim=2] vel,
               np.ndarray[np.float64_t, ndim=1] mass,
               double dt, int n):
    cdef int i, j
    cdef double dx, dy, dz, dist, mag

    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i, 0] - pos[j, 0]
            dy = pos[i, 1] - pos[j, 1]
            dz = pos[i, 2] - pos[j, 2]
            dist = sqrt(dx*dx + dy*dy + dz*dz)
            mag = dt / (dist * dist * dist)
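            # Apply equal and opposite velocity changes on all three axes
            vel[i, 0] -= dx * mag * mass[j]
            vel[i, 1] -= dy * mag * mass[j]
            vel[i, 2] -= dz * mag * mass[j]
            vel[j, 0] += dx * mag * mass[i]
            vel[j, 1] += dy * mag * mass[i]
            vel[j, 2] += dz * mag * mass[i]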
    return vel

Plus a build file:

import numpy as np
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("nbody_cython.pyx"),
    include_dirs=[np.get_include()],  # NumPy headers needed by `cimport numpy`
)

python setup.py build_ext --inplace

The Hidden Trap in Cython

I enjoyed writing Cython, but I learned the hard way that it has silent performance pitfalls.

My Naive Attempt (10x speedup)

# This compiled and ran correctly...
def compute(values):
    result = 0.0
    for v in values:
        result += v ** 0.5  # Silent trap!
    return result

The code worked. It gave a 10x speedup. But it should have been much faster.

The Problem: Python Object Operations

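In my code, Cython’s ** operator with a float exponent did not compile down to a plain C call and fell back to Python’s slow object operations, while sqrt from libc.math is a direct C call: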

from libc.math cimport sqrt

def compute_slow(double[:] values):
    cdef double result = 0.0
    for v in values:
        result += v ** 0.5   # 40x slower than sqrt!
    return result

def compute_fast(double[:] values):
    cdef double result = 0.0
    cdef double v
    for v in values:
        result += sqrt(v)    # Direct C call
    return result
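Before swapping ** 0.5 for sqrt, it helps to confirm the two agree on your data and to know where they differ. The sketch below uses plain Python and the math module, with made-up function names. It only shows Python-level semantics; what Cython generates depends on how the operands are typed and on compiler directives, so the annotation report remains the authority.

```python
import math


def roots_with_pow(values):
    return sum(v ** 0.5 for v in values)


def roots_with_sqrt(values):
    return sum(math.sqrt(v) for v in values)


values = [0.0, 1.0, 2.0, 9.0, 1e6]

# For non-negative input the two approaches agree to floating-point tolerance.
assert math.isclose(roots_with_pow(values), roots_with_sqrt(values))

# They differ on negative input: Python's ** returns a complex number...
negative_pow = (-4.0) ** 0.5
print(type(negative_pow).__name__)  # complex

# ...while math.sqrt refuses the input outright.
try:
    math.sqrt(-4.0)
    sqrt_raised = False
except ValueError as exc:
    sqrt_raised = True
    print("math.sqrt rejected a negative input:", exc)
```

The general-purpose power operator has more cases to handle than a dedicated square root, which is one plausible reason a sqrt call is cheaper.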

The Solution: Annotation Reports

Always use cython -a to generate an HTML report:

cython -a mymodule.pyx
# Open mymodule.html in browser

The report shows yellow lines (Python interaction, slow) and white lines (pure C, fast). My naive code was full of yellow lines I didn’t know about.

When Numba Wins

Numba excels when you want speed without learning C semantics:

| Criteria            | Numba                     |
|---------------------|---------------------------|
| Setup               | pip install numba         |
| Code changes        | Add @njit decorator       |
| Learning curve      | Low                       |
| NumPy integration   | Excellent                 |
| Parallel execution  | @njit(parallel=True)      |
| GPU support         | CUDA via @cuda.jit        |

Automatic Parallelization

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def compute_distances(points):
    n = len(points)
    distances = np.zeros((n, n))
    for i in prange(n):  # Parallel loop
        for j in range(i+1, n):
            diff = points[i] - points[j]
            distances[i, j] = np.sqrt(np.sum(diff**2))
            distances[j, i] = distances[i, j]
    return distances

GPU Acceleration

from numba import cuda
import numpy as np

@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)
    if idx < len(a):
        c[idx] = a[idx] + b[idx]
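A kernel like this does nothing until it is launched with a grid size. The usual arithmetic is a ceiling division of the element count by the threads per block, which is why the kernel needs its idx < len(a) guard. In this sketch, launch_config is a hypothetical helper and 256 threads per block is an assumed starting point rather than a tuned value. The commented launch line uses Numba's kernel[blocks, threads](...) syntax and needs a CUDA-capable GPU.

```python
def launch_config(n_elements, threads_per_block=256):
    """Return (blocks_per_grid, threads_per_block) covering n_elements."""
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")
    # Ceiling division: round up so the last partial block is included.
    blocks_per_grid = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, threads_per_block


blocks, threads = launch_config(1_000_000)
print(blocks, threads)  # 3907 256

# With a GPU available, the kernel would then be launched as:
# vector_add[blocks, threads](a, b, c)
```

Because 3907 * 256 is slightly more than 1,000,000, a few threads in the last block have no element to process, and the bounds check makes them exit without touching memory.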

When Cython Wins

Cython is the better choice when you need C integration or distribution:

| Criteria              | Cython                         |
|-----------------------|--------------------------------|
| C library wrapping    | Excellent                      |
| Maximum performance   | 99-124x                        |
| Distributable wheels  | Yes (no JIT at runtime)        |
| Compile-time errors   | Catches type mismatches        |
| Long-term maintenance | More robust                    |

Wrapping C Libraries

from libc.math cimport sin, cos, sqrt, pow, exp, log

def fast_sqrt(double x):
    """Direct C sqrt - 40x faster than ** 0.5"""
    return sqrt(x)

def fast_sin(double x):
    """Direct C sin"""
    return sin(x)

def fast_exp(double x):
    """Direct C exp"""
    return exp(x)

Numba can call individual C functions through ctypes or cffi, but wrapping a full C API this directly is much harder with its pure-Python approach.

Building Distributable Packages

Cython compiles to .so/.pyd files you can ship in wheels:

| Tool   | Distribution Method                    |
|--------|----------------------------------------|
| Numba  | Requires Numba + JIT at runtime        |
| Cython | Compiled wheel, no build tools needed  |

Decision Matrix

| Criteria                  | Numba        | Cython         |
|---------------------------|--------------|----------------|
| Setup complexity          | Low          | Medium-High    |
| Learning curve            | Low          | Medium-High    |
| Maximum performance       | 56-135x      | 99-124x        |
| NumPy integration         | Excellent    | Good           |
| C library wrapping        | Limited      | Excellent      |
| Distribution              | JIT required | Compiled wheel |
| Debugging                 | Easier       | Harder         |
| GPU support               | CUDA         | No             |
| Automatic parallelization | Yes          | Manual         |
| Silent performance traps  | Rare         | Common         |

Common Pitfalls

Numba: Object Mode Fallback

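In Numba releases before 0.59, a bare @jit silently fell back to slow object mode when it couldn’t compile in nopython mode. Since 0.59, @jit defaults to nopython mode and raises an error instead, but older environments still show the old behavior:

from numba import jit

# WRONG on Numba < 0.59: no error, but may fall back to slow object mode
@jit
def process(data):
    result = {}
    for i, v in enumerate(data):
        result[i] = v * 2
    return result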

# RIGHT: Force nopython mode, get error if it fails
from numba import njit

@njit  # Equivalent to @jit(nopython=True)
def process_fast(data):
    # Will raise TypingError if it can't compile
    result = 0.0
    for v in data:
        result += v
    return result

Prefer @njit over a bare @jit so compilation failures surface as errors regardless of the installed Numba version.

Cython: Silent Slowdowns

Cython compiles and runs code that looks correct but isn’t optimized:

# WRONG: Compiles fine, but uses slow Python operations
def compute(data):
    result = 0.0
    for v in data:
        result += v ** 0.5  # Falls back to Python pow
    return result

# RIGHT: Use C functions with typed variables
from libc.math cimport sqrt

def compute_fast(double[:] data):
    cdef double result = 0.0
    cdef double v
    for v in data:
        result += sqrt(v)  # Direct C call
    return result

Always check the cython -a annotation report.

Practical Workflow

Here’s the workflow I recommend:

START: Need to optimize Python code?
|
+-- Is it numerical/NumPy-heavy?
|   +-- YES -> Try Numba @njit first
|   |         +-- Good speedup? -> DONE
|   |         +-- Need more? -> Try parallel=True, fastmath=True
|   |
|   +-- NO -> Need to wrap C/C++ libraries?
|             +-- YES -> Use Cython
|             +-- NO -> Evaluate both:
|                       - Numba for ease/speed balance
|                       - Cython for max performance
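The same decision tree can be written as a small function, which makes the branch order explicit. The function name, parameter names, and return strings are my own shorthand for the branches in the diagram.

```python
def recommend_tool(numeric_heavy, needs_c_wrapping=False):
    """Mirror the flowchart: NumPy-heavy code goes to Numba first."""
    if numeric_heavy:
        # Try @njit first; reach for parallel=True / fastmath=True if needed.
        return "numba"
    if needs_c_wrapping:
        return "cython"
    # Neither condition applies: weigh ease (Numba) against peak speed (Cython).
    return "evaluate both"


print(recommend_tool(numeric_heavy=True))                           # numba
print(recommend_tool(numeric_heavy=False, needs_c_wrapping=True))   # cython
print(recommend_tool(numeric_heavy=False, needs_c_wrapping=False))  # evaluate both
```

The numerical check comes first on purpose: in the flowchart, wrapping C/C++ libraries is only considered once the code is known not to be NumPy-heavy.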

Step 1: Profile First

import cProfile
import pstats

cProfile.run('your_function()', 'profile_stats')
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 time-consuming functions
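If you prefer not to write a stats file to disk, the same profile can be captured in memory. This is a self-contained sketch using only the standard library; hot_loop and pipeline are hypothetical stand-ins for your own code.

```python
import cProfile
import io
import pstats


def hot_loop(n):
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total


def pipeline():
    return sum(hot_loop(20_000) for _ in range(20))


# Profile only the region of interest.
profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Render the report into a string instead of a file on disk.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Functions that dominate the cumulative column, like hot_loop here, are the candidates worth handing to Numba or Cython.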

Step 2: Try Numba

from numba import njit

@njit
def hot_function(data):
    # Your hot path here
    pass

# Warm up (first call compiles)
hot_function(small_test_data)

# Measure (%timeit is an IPython/Jupyter magic, not plain Python)
%timeit hot_function(real_data)

Step 3: If Numba Isn’t Enough, Try Cython

# Write your .pyx file with type annotations
# Generate annotation report to find slow lines
cython -a mymodule.pyx

# Build
python setup.py build_ext --inplace

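# Benchmark (the -s setup runs once and is not timed; the array is a placeholder input)
python -m timeit -s "import mymodule, numpy as np; data = np.random.rand(1_000_000)" "mymodule.function(data)"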

When They Work Together

You don’t have to choose just one. A common pattern:

Python orchestrator
|
+-- Numba: Numerical kernels (quick iteration)
|
+-- Cython: C library bindings (stable interface)

Use Numba for rapid prototyping of numerical code, then migrate stable kernels to Cython for production builds.

Key Takeaways

Numba wins on effort-to-reward ratio: Add @njit, get 56-135x speedup. No build system, no C knowledge required.
Cython wins on maximum performance: 99-124x possible, but requires understanding C semantics and using annotation reports.
Cython has hidden traps: Silent performance issues like ** 0.5 being 40x slower than sqrt() require the annotation report to debug.
Use @njit not @jit: Force nopython mode to catch Numba compilation failures early.
Always check cython -a: The annotation report is essential for finding yellow (slow) lines in Cython code.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in a way that helps others. If you want to get in touch, email me.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Cython Documentation: Overview
👨‍💻 Numba 5-Minute Guide
👨‍💻 The Optimization Ladder - Comprehensive Python Benchmark
👨‍💻 GitHub: faster-python-bench
👨‍💻 Reddit Discussion: Benchmarked Every Python Optimization Path

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!