Numba Python Optimization: 56-135x Speedup with @jit and @njit Decorators

Mar 11, 2026

I had a Python simulation running for 14 seconds per iteration. After adding @njit to my function and restructuring my data into NumPy arrays, it dropped to 104 milliseconds. That’s a 135x speedup for the cost of a decorator.

Numba sits at the sweet spot of Python optimization: significant performance gains with minimal code changes. Here’s what I learned from benchmarking it against other approaches.

The Quick Result

| Benchmark              | CPython 3.14 | Numba @njit | Speedup |
|------------------------|--------------|-------------|---------|
| n-body (500K iter)     | 1,242ms      | 22ms        | 56x     |
| spectral-norm (N=2000) | 14,046ms     | 104ms       | 135x    |

These aren’t toy examples. They’re the Benchmarks Game problems that real performance engineers use. Numba approaches Cython (99-124x) and Rust PyO3 (113-154x) performance without requiring you to learn a new language.
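The speedup column is just the CPython time divided by the Numba time. A quick check of that arithmetic, using only the millisecond figures from the table above (the dictionary and variable names are made up for this sketch):

```python
# Timings in milliseconds, copied from the benchmark table above.
timings_ms = {
    "n-body (500K iter)": (1_242, 22),
    "spectral-norm (N=2000)": (14_046, 104),
}

speedups = {}
for name, (cpython_ms, numba_ms) in timings_ms.items():
    # Speedup = baseline time / optimized time
    speedups[name] = cpython_ms / numba_ms
    print(f"{name}: {speedups[name]:.1f}x")
```

Rounded to whole numbers, the ratios match the 56x and 135x quoted in the table.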

How Numba Works

Numba reads your Python bytecode, combines it with type information from your input arguments, and uses LLVM to generate machine code tailored to your CPU. The compiled version is kept in memory and reused whenever the function is called again with the same argument types.

Python bytecode -> Numba IR -> Type inference -> LLVM IR -> Machine code

The key insight: Numba is a scalpel, not a saw. It targets numerical loops with NumPy arrays. If you try to use it on string processing or Pandas DataFrames, you’ll be disappointed.

My First Attempt: Adding @njit

I started with a simple numerical computation:

import numpy as np

def compute_slow(n):
    result = 0.0
    for i in range(n):
        result += np.sqrt(i) * np.sin(i)
    return result

Adding Numba required one line:

from numba import njit
import numpy as np

@njit
def compute_fast(n):
    result = 0.0
    for i in range(n):
        result += np.sqrt(i) * np.sin(i)
    return result

n = 10_000_000
# compute_slow(n): ~2.5 seconds
# compute_fast(n): ~0.02 seconds (125x faster)

The first call to compute_fast took longer because it compiled the function. Subsequent calls used the cached machine code.

@jit vs @njit: What’s the Difference?

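@njit is shorthand for @jit(nopython=True). This is the mode you almost always want. Since Numba 0.59, a bare @jit also defaults to nopython mode, so @njit mainly makes the intent explicit.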

from numba import jit, njit

# These are equivalent:
@njit
def func1(x):
    return x * 2

@jit(nopython=True)
def func2(x):
    return x * 2

Nopython Mode (Recommended)

Nopython mode compiles your function to run entirely without the Python interpreter. This gives the best performance but has strict requirements:

Use NumPy arrays, not Python lists
Stick to numeric types (int, float, complex)
Avoid strings, dicts, and Python objects

from numba import njit
import numpy as np

@njit
def nbody_step(dt, n, pos, vel, mass):
    """N-body simulation - works in nopython mode"""
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i, 0] - pos[j, 0]
            dy = pos[i, 1] - pos[j, 1]
            dz = pos[i, 2] - pos[j, 2]
            dist = np.sqrt(dx*dx + dy*dy + dz*dz)
            mag = dt / (dist * dist * dist)
            vel[i, 0] -= dx * mag * mass[j]
            vel[i, 1] -= dy * mag * mass[j]
            vel[i, 2] -= dz * mag * mass[j]
            vel[j, 0] += dx * mag * mass[i]
            vel[j, 1] += dy * mag * mass[i]
            vel[j, 2] += dz * mag * mass[i]
    return vel

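Object Mode (Explicit Opt-In)

@njit never falls back: if nopython compilation fails, Numba raises an error instead. Object mode, which runs the function through the Python interpreter, has to be requested explicitly with @jit(forceobj=True); only older Numba releases (before 0.59) fell back to it automatically from a bare @jit. Performance in object mode is often worse than pure Python due to Numba overhead.

| Mode      | Speedup      | Use Case                    |
|-----------|--------------|-----------------------------|
| nopython  | 56-135x      | Numerical loops (default)   |
| object    | ~1x (slower) | Opt-in, rarely useful       |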

Common Pitfalls I Encountered

Pitfall 1: Including Compilation Time

The first call to a jitted function compiles it, so make one untimed warm-up call before you measure:

from numba import njit
import numpy as np
from time import perf_counter  # high-resolution timer meant for benchmarking

@njit
def compute(n):
    result = 0.0
    for i in range(n):
        result += np.sqrt(i) * np.sin(i)
    return result

# WRONG: First call includes compilation
start = perf_counter()
result = compute(1_000_000)  # Includes compile time!
print(perf_counter() - start)  # Misleading

# RIGHT: Warm up first, then measure
compute(1)  # Trigger compilation for an integer argument
start = perf_counter()
result = compute(1_000_000)  # Uses cached machine code
print(perf_counter() - start)  # Accurate
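The warm-up pattern is easy to wrap in a small helper built on the standard library's timeit. This is a sketch rather than a full benchmarking harness: best_time and sum_of_squares are made-up names, and the workload is a plain-Python stand-in so the snippet runs even without Numba installed.

```python
import timeit

def best_time(fn, *args, repeat=5):
    """Return the fastest of `repeat` timed calls, after one untimed warm-up."""
    fn(*args)  # warm-up: for a jitted function, this is where compilation happens
    runs = timeit.repeat(lambda: fn(*args), number=1, repeat=repeat)
    return min(runs)

# Stand-in workload; in practice pass the jitted function instead,
# e.g. best_time(compute, 1_000_000).
def sum_of_squares(n):
    return sum(i * i for i in range(n))

elapsed = best_time(sum_of_squares, 100_000)
print(f"best of 5: {elapsed * 1e3:.3f} ms")
```

Taking the minimum rather than the mean filters out runs that were disturbed by other processes. The warm-up call uses the same arguments as the timed calls, so it compiles the version that is then measured.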

Pitfall 2: Using Python Objects Inside Jitted Functions

I tried to use a dict inside a jitted function and watched performance collapse:

from numba import njit

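# WRONG: Dict in the hot path. @njit never falls back to object mode, so this
# either fails to compile or builds a typed dict with slow hashed lookups
@njit
def process_with_dict(data):
    lookup = {'a': 1, 'b': 2, 'c': 3}  # Avoid dicts in nopython code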
    result = 0.0
    for i in range(len(data)):
        result += data[i]
    return result

# RIGHT: Use NumPy arrays only
@njit
def process_with_array(data, lookup_values):
    result = 0.0
    for i in range(len(data)):
        idx = int(data[i])  # Use integer indexing
        result += lookup_values[idx]
    return result

Pitfall 3: Passing Pandas DataFrames

Numba doesn’t understand Pandas. Extract NumPy arrays first:

import pandas as pd
import numpy as np
from numba import njit

df = pd.DataFrame({'values': np.random.rand(1000)})

# WRONG: Numba can't process DataFrames
@njit
def process_df(df):
    return df.sum()  # This fails

# RIGHT: Extract the NumPy array
@njit
def process_array(arr):
    total = 0.0
    for i in range(len(arr)):
        total += arr[i]
    return total

result = process_array(df['values'].to_numpy())

Pitfall 4: Modifying Global Variables

Numba treats globals as compile-time constants:

from numba import njit

THRESHOLD = 100

@njit
def check_value(x):
    return x > THRESHOLD  # THRESHOLD is baked in at compile time

check_value(150)  # Returns True

THRESHOLD = 200   # This change is ignored!
check_value(150)  # Still returns True (not False)

Useful Decorator Options

from numba import njit, prange
import numpy as np

# cache=True: Save compiled code to disk
@njit(cache=True)
def expensive_compile(data):
    # Long compilation time saved for future runs
    pass

# parallel=True: Auto-parallelize loops
@njit(parallel=True)
def parallel_process(data):
    n = len(data)
    result = np.empty(n)
    for i in prange(n):  # prange enables parallel execution
        result[i] = np.sin(data[i]) ** 2 + np.cos(data[i]) ** 2
    return result

# fastmath=True: Aggressive floating-point optimizations
@njit(fastmath=True)
def fast_math(x):
    return x * x + 2.0 * x + 1.0

# nogil=True: Release GIL during execution (for threading)
@njit(nogil=True)
def release_gil(data):
    result = 0.0
    for i in range(len(data)):
        result += data[i]
    return result

When Numba Excels

| Use Case                    | Numba Fit | Why                              |
|-----------------------------|-----------|----------------------------------|
| Loop-heavy NumPy operations | Excellent | Direct machine code for loops    |
| Mathematical simulations    | Excellent | LLVM optimizes math operations   |
| Repeated function calls     | Excellent | Compile once, run many times     |
| String processing           | Poor      | Limited string support, slow     |
| Dict-heavy code             | Poor      | Typed dicts limited              |
| Pandas operations           | Poor      | Extract arrays first             |

Comparison: Numba vs Other Optimizations

From the optimization ladder benchmarks:

| Approach        | N-body Speedup | Effort Required        |
|-----------------|----------------|------------------------|
| PyPy            | 13x            | Zero code changes      |
| Mypyc           | 2.4-14x        | Type annotations       |
| Numba           | 56x            | Decorator + NumPy      |
| Cython          | 124x           | Rewrite in Cython      |
| Rust PyO3       | 113-154x       | Rewrite in Rust        |

Numba sits between low-effort solutions (PyPy, Mypyc) and high-effort solutions (Cython, Rust). For many numerical workloads, it offers the best balance.

Going Further: GPU Acceleration with CUDA

Numba supports GPU programming through CUDA:

from numba import cuda
import numpy as np

@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)
    if idx < len(a):
        c[idx] = a[idx] + b[idx]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)

# Copy to GPU
a_gpu = cuda.to_device(a)
b_gpu = cuda.to_device(b)
c_gpu = cuda.to_device(c)

# Execute on GPU
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a_gpu, b_gpu, c_gpu)

# Copy result back
c = c_gpu.copy_to_host()

GPU programming adds complexity but can provide massive speedups for parallelizable workloads.
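The launch configuration above rounds the block count up, which is why the kernel needs its idx < len(a) guard. The arithmetic can be checked in plain Python with no GPU required (total_threads and idle_threads are names introduced for this sketch):

```python
n = 1_000_000
threads_per_block = 256

# Ceiling division: the smallest block count whose threads cover all n elements.
blocks = (n + threads_per_block - 1) // threads_per_block

total_threads = blocks * threads_per_block
idle_threads = total_threads - n  # threads that the bounds check must skip

print(blocks, total_threads, idle_threads)  # 3907 1000192 192
```

Without the guard, those 192 extra threads would read and write past the end of the arrays.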

Debugging with Annotation Reports

Numba provides an annotation tool to identify slow regions:

numba --annotate-html annotation.html your_script.py
# Writes annotation.html, showing how Numba typed and compiled each line

The report marks lines that still go through Python object handling (slow) separately from lines compiled to native machine code (fast).

When to Choose Numba

Choose Numba when:

Your code has heavy loops with NumPy arrays
You want near-C performance without leaving Python
Your data fits naturally in NumPy arrays
You can restructure code to avoid Python objects

Look elsewhere when:

Heavy string or dict manipulation (consider Cython)
Need to accelerate Pandas operations (consider vectorization)
Want Ahead-Of-Time compilation for distribution (consider Cython or Nuitka)
Your code uses many Python objects that can’t be converted to NumPy

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Numba @jit Decorator Documentation
👨‍💻 Numba 5-Minute Guide
👨‍💻 The Optimization Ladder - Comprehensive Python Benchmark
👨‍💻 GitHub: faster-python-bench
👨‍💻 Numba Performance Tips

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!