Numba Python Optimization: 56-135x Speedup with @jit and @njit Decorators

I had a Python simulation running for 14 seconds per iteration. After adding @njit to my function and restructuring my data into NumPy arrays, it dropped to 104 milliseconds. That’s a 135x speedup for the cost of a decorator.

Numba sits at the sweet spot of Python optimization: significant performance gains with minimal code changes. Here’s what I learned from benchmarking it against other approaches.

The Quick Result

Numba Performance Benchmarks
| Benchmark | CPython 3.14 | Numba @njit | Speedup |
|------------------------|--------------|-------------|---------|
| n-body (500K iter) | 1,242ms | 22ms | 56x |
| spectral-norm (N=2000) | 14,046ms | 104ms | 135x |

These aren’t toy examples. They’re the Benchmarks Game problems that real performance engineers use. Numba approaches Cython (99-124x) and Rust PyO3 (113-154x) performance without requiring you to learn a new language.

How Numba Works

Numba reads your Python bytecode, combines it with type information from your input arguments, and uses LLVM to generate machine code tailored to your CPU. The compiled version is cached and reused.

Numba Compilation Pipeline
Python bytecode -> Numba IR -> Type inference -> LLVM IR -> Machine code

The key insight: Numba is a scalpel, not a saw. It targets numerical loops with NumPy arrays. If you try to use it on string processing or Pandas DataFrames, you’ll be disappointed.

My First Attempt: Adding @njit

I started with a simple numerical computation:

before_numba.py
import numpy as np

def compute_slow(n):
    result = 0.0
    for i in range(n):
        result += np.sqrt(i) * np.sin(i)
    return result

Adding Numba required one line:

with_numba.py
from numba import njit
import numpy as np

@njit
def compute_fast(n):
    result = 0.0
    for i in range(n):
        result += np.sqrt(i) * np.sin(i)
    return result

Benchmark Results
n = 10_000_000
compute_slow(n): ~2.5 seconds
compute_fast(n): ~0.02 seconds (125x faster!)

The first call to compute_fast took longer because it compiled the function. Subsequent calls used the cached machine code.

@jit vs @njit: What’s the Difference?

@njit is shorthand for @jit(nopython=True). This is the mode you almost always want.

decorator_comparison.py
from numba import jit, njit

# These are equivalent:
@njit
def func1(x):
    return x * 2

@jit(nopython=True)
def func2(x):
    return x * 2

Nopython mode compiles your function to run entirely without the Python interpreter. This gives the best performance but has strict requirements:

  • Use NumPy arrays, not Python lists
  • Stick to numeric types (int, float, complex)
  • Avoid strings, dicts, and Python objects

nopython_mode.py
from numba import njit
import numpy as np

@njit
def nbody_step(dt, n, pos, vel, mass):
    """N-body simulation - works in nopython mode"""
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i, 0] - pos[j, 0]
            dy = pos[i, 1] - pos[j, 1]
            dz = pos[i, 2] - pos[j, 2]
            dist = np.sqrt(dx*dx + dy*dy + dz*dz)
            mag = dt / (dist * dist * dist)
            vel[i, 0] -= dx * mag * mass[j]
            vel[i, 1] -= dy * mag * mass[j]
            vel[i, 2] -= dz * mag * mass[j]
            vel[j, 0] += dx * mag * mass[i]
            vel[j, 1] += dy * mag * mass[i]
            vel[j, 2] += dz * mag * mass[i]
    return vel

Object Mode (Fallback)

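With @njit there is no fallback: if nopython compilation fails, Numba raises an error. Object mode, which runs through the Python interpreter, used to be the automatic fallback for plain @jit in older releases (before Numba 0.59); in current releases you have to request it explicitly with @jit(forceobj=True). Its performance is often worse than pure Python due to Numba overhead.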

Performance Mode Comparison
| Mode | Speedup | Use Case |
|-----------|--------------|-----------------------------|
| nopython | 56-135x | Numerical loops (default) |
| object | ~1x (slower) | Fallback, rarely useful |

Common Pitfalls I Encountered

Pitfall 1: Including Compilation Time

The first call to a jitted function compiles it, so timing that call measures the compiler, not your code. Warm the function up with one call before measuring, and prefer timeit or time.perf_counter over time.time for short runs:

benchmarking.py
from numba import njit
import numpy as np
from time import perf_counter

@njit
def compute(n):
    result = 0.0
    for i in range(n):
        result += np.sqrt(i) * np.sin(i)
    return result

# WRONG: First call includes compilation
start = perf_counter()
result = compute(1_000_000)  # Includes compile time!
print(perf_counter() - start)  # Misleading

# RIGHT: Warm up first, then measure
compute(1)  # Trigger compilation
start = perf_counter()
result = compute(1_000_000)  # Uses cached machine code
print(perf_counter() - start)  # Accurate

Pitfall 2: Using Python Objects Inside Jitted Functions

I tried to use a dict inside a jitted function and it went badly. Under @njit there is no silent fallback: Python objects that Numba can't type raise an error, and the containers it does support are usually slower than plain array indexing:

pitfall_objects.py
from numba import njit

# AVOID: a dict in the hot path. Depending on the Numba version this either
# fails to compile or becomes a typed dict that is slower than array indexing.
@njit
def process_with_dict(data):
    lookup = {'a': 1, 'b': 2, 'c': 3}
    result = 0.0
    for i in range(len(data)):
        result += data[i]
    return result

# RIGHT: Use NumPy arrays only
@njit
def process_with_array(data, lookup_values):
    result = 0.0
    for i in range(len(data)):
        idx = int(data[i])  # Use integer indexing
        result += lookup_values[idx]
    return result

Pitfall 3: Passing Pandas DataFrames

Numba doesn’t understand Pandas. Extract NumPy arrays first:

pitfall_pandas.py
import pandas as pd
import numpy as np
from numba import njit

df = pd.DataFrame({'values': np.random.rand(1000)})

# WRONG: Numba can't process DataFrames
@njit
def process_df(df):
    return df.sum()  # This fails when called with a DataFrame

# RIGHT: Extract the NumPy array
@njit
def process_array(arr):
    total = 0.0
    for i in range(len(arr)):
        total += arr[i]
    return total

result = process_array(df['values'].to_numpy())

Pitfall 4: Modifying Global Variables

Numba treats globals as compile-time constants:

pitfall_globals.py
from numba import njit

THRESHOLD = 100

@njit
def check_value(x):
    return x > THRESHOLD  # THRESHOLD is baked in at compile time

check_value(150)  # Returns True
THRESHOLD = 200   # This change is ignored!
check_value(150)  # Still returns True (not False)

Useful Decorator Options

decorator_options.py
from numba import njit, prange
import numpy as np

# cache=True: Save compiled code to disk
@njit(cache=True)
def expensive_compile(data):
    # Long compilation time saved for future runs
    pass

# parallel=True: Auto-parallelize loops
@njit(parallel=True)
def parallel_process(data):
    n = len(data)
    result = np.empty(n)
    for i in prange(n):  # prange enables parallel execution
        result[i] = np.sin(data[i]) ** 2 + np.cos(data[i]) ** 2
    return result

# fastmath=True: Aggressive floating-point optimizations
@njit(fastmath=True)
def fast_math(x):
    return x * x + 2.0 * x + 1.0

# nogil=True: Release GIL during execution (for threading)
@njit(nogil=True)
def release_gil(data):
    result = 0.0
    for i in range(len(data)):
        result += data[i]
    return result

When Numba Excels

Numba Sweet Spots
| Use Case | Numba Fit | Why |
|-----------------------------|-----------|----------------------------------|
| Loop-heavy NumPy operations | Excellent | Direct machine code for loops |
| Mathematical simulations | Excellent | LLVM optimizes math operations |
| Repeated function calls | Excellent | Compile once, run many times |
| String processing | Poor | Limited, slow string support in nopython |
| Dict-heavy code | Poor | Typed dicts limited |
| Pandas operations | Poor | Extract arrays first |

Comparison: Numba vs Other Optimizations

From the optimization ladder benchmarks:

Speedup Comparison (relative to CPython 3.14)
| Approach | N-body Speedup | Effort Required |
|-----------------|----------------|------------------------|
| PyPy | 13x | Zero code changes |
| Mypyc | 2.4-14x | Type annotations |
| Numba | 56x | Decorator + NumPy |
| Cython | 124x | Rewrite in Cython |
| Rust PyO3 | 113-154x | Rewrite in Rust |

Numba sits between low-effort solutions (PyPy, Mypyc) and high-effort solutions (Cython, Rust). For many numerical workloads, it offers the best balance.

Going Further: GPU Acceleration with CUDA

Numba supports GPU programming through CUDA:

cuda_example.py
from numba import cuda
import numpy as np

@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)
    if idx < len(a):
        c[idx] = a[idx] + b[idx]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)

# Copy to GPU
a_gpu = cuda.to_device(a)
b_gpu = cuda.to_device(b)
c_gpu = cuda.to_device(c)

# Execute on GPU
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a_gpu, b_gpu, c_gpu)

# Copy result back
c = c_gpu.copy_to_host()

GPU programming adds complexity but can provide massive speedups for parallelizable workloads.

Debugging with Annotation Reports

Numba provides an annotation tool to identify slow regions:

Generate annotation report
numba --annotate-html annotation.html your_script.py
# Writes annotation.html, showing which lines compile to fast machine code
# vs which fall back to Python

Yellow lines indicate Python object manipulation (slow). White lines indicate compiled machine code (fast).

When to Choose Numba

Choose Numba when:

  • Your code has heavy loops with NumPy arrays
  • You want near-C performance without leaving Python
  • Your data fits naturally in NumPy arrays
  • You can restructure code to avoid Python objects

Look elsewhere when:

  • Heavy string or dict manipulation (consider Cython)
  • Need to accelerate Pandas operations (consider vectorization)
  • Want Ahead-Of-Time compilation for distribution (consider Cython or Nuitka)
  • Your code uses many Python objects that can’t be converted to NumPy

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
