BitNet CPU Performance: Doubling Inference Speed with Kernel Optimizations

Mar 19, 2026

I ran BitNet on my laptop and got 12 tokens per second. Acceptable for casual use, but I wanted more. Then I discovered the kernel optimizations in BitNet’s source code - and my inference speed jumped to 35 tokens per second.

Here’s what I learned about squeezing maximum performance from BitNet on CPU.

The Performance Gap Problem

BitNet’s selling point is CPU inference. But out of the box, you might not get the best possible speed. The default configuration works, but it’s not optimized.

I noticed the gap when comparing my results to the official benchmarks:

My setup (default config):
  - Model: BitNet-b1.58-2B-4T
  - Speed: 12 tokens/second
  - CPU: Apple M2

Official benchmark (optimized):
  - Same model, different config
  - Speed: 35+ tokens/second
  - Same CPU class

That’s a nearly 3x difference. The key? Understanding how BitNet’s kernel optimizations work.

Understanding the Speedup Numbers

BitNet’s CPU optimizations deliver impressive gains across architectures:

+----------------+-------------------+------------------+
| Architecture   | Speedup Range     | Energy Reduction |
+----------------+-------------------+------------------+
| ARM (NEON)     | 1.37x - 5.07x     | 55.4% - 70.0%    |
| x86 (AVX2)     | 2.37x - 6.17x     | 71.9% - 82.2%    |
+----------------+-------------------+------------------+

Note: Larger models see greater performance gains.

The speedup isn’t magic. It comes from three main techniques:

  - Parallel kernel implementations: process multiple operations simultaneously
  - Configurable tiling: optimize cache usage for your CPU
  - Architecture-specific instructions: leverage AVX2 or NEON

How Parallel Kernels Work

BitNet supports three kernel modes for matrix operations. I tested all three to understand the difference.

The Three Kernel Types

┌────────────────────────────────────────────────────────────┐
│  NO PARALLEL (Baseline)                                    │
│  ┌──────┐  ┌──────┐  ┌──────┐                             │
│  │ Op 1 │→ │ Op 2 │→ │ Op 3 │  Sequential processing      │
│  └──────┘  └──────┘  └──────┘                             │
│  Time: 173ms (2048x2048 matrix)                            │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│  WEIGHT PARALLEL                                           │
│  ┌──────┐  ┌──────┐                                        │
│  │ Op 1 │  │ Op 2 │  Process multiple weight rows         │
│  └──┬───┘  └──┬───┘  simultaneously                        │
│     └────┬────┘                                             │
│          ↓                                                  │
│  Time: 103ms (1.68x faster)                                │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│  ACTIVATION PARALLEL (Recommended for I2_S)                │
│  ┌──────┐  ┌──────┐  ┌──────┐                             │
│  │ Act1 │  │ Act2 │  │ Act3 │  Unpack weights once,       │
│  └──┬───┘  └──┬───┘  └──┬───┘  apply to multiple acts     │
│     └─────────┼─────────┘                                  │
│               ↓                                            │
│  Time: 93ms (1.86x faster)                                 │
└────────────────────────────────────────────────────────────┘

The key insight: I2_S quantization packs weights in a compressed format. Unpacking them is expensive. Activation parallelism amortizes this cost across multiple operations.
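
To make the amortization concrete, here is a minimal Python sketch. The 2-bit, four-weights-per-byte layout and all function names are my own illustration, not BitNet's actual I2_S layout or kernel code. The only point is how many times the weights get unpacked in each loop order.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) four per byte, 2 bits each (illustrative layout)."""
    assert len(weights) % 4 == 0
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map -1, 0, +1 to 0, 1, 2
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed):
    """Inverse of pack_ternary: recover the list of -1/0/+1 weights."""
    return [((byte >> (2 * j)) & 0b11) - 1 for byte in packed for j in range(4)]


def dot(weights, activation):
    return sum(w * a for w, a in zip(weights, activation))


def matmul_unpack_every_time(packed_rows, activations):
    """Baseline: unpack each weight row again for every activation vector."""
    unpacks = 0
    results = []
    for act in activations:
        out = []
        for packed in packed_rows:
            weights = unpack_ternary(packed)
            unpacks += 1
            out.append(dot(weights, act))
        results.append(out)
    return results, unpacks


def matmul_activation_parallel(packed_rows, activations):
    """Unpack each weight row once, then reuse it for all activation vectors."""
    unpacks = 0
    results = [[] for _ in activations]
    for packed in packed_rows:
        weights = unpack_ternary(packed)
        unpacks += 1
        for k, act in enumerate(activations):
            results[k].append(dot(weights, act))
    return results, unpacks


rows = [pack_ternary(r) for r in ([1, 0, -1, 1], [0, 0, 1, -1])]
acts = [[1, 2, 3, 4], [5, 6, 7, 8], [0, 1, 0, 1]]
print(matmul_unpack_every_time(rows, acts)[1], "unpacks vs",
      matmul_activation_parallel(rows, acts)[1])
```

Both functions return the same numbers. The second one does one unpack per weight row regardless of how many activation vectors there are, which is the saving the diagram describes.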

My Benchmark Results

I ran the kernel comparison on an AMD EPYC 7V13 (similar principles apply to consumer CPUs):

Matrix Size: [32, 2048] x [2048, 2048]
┌─────────────────┬──────────┬──────────┐
│ Kernel Type     │ Time     │ Speedup  │
├─────────────────┼──────────┼──────────┤
│ No Parallel     │ 2.400ms  │ 1.00x    │
│ Weight Parallel │ 1.599ms  │ 1.50x    │
│ Activation Par. │ 1.202ms  │ 2.00x    │
└─────────────────┴──────────┴──────────┘

Matrix Size: [128, 2048] x [2048, 2048]
┌─────────────────┬──────────┬──────────┐
│ Kernel Type     │ Time     │ Speedup  │
├─────────────────┼──────────┼──────────┤
│ No Parallel     │ 10.820ms │ 1.00x    │
│ Weight Parallel │ 6.458ms  │ 1.68x    │
│ Activation Par. │ 5.805ms  │ 1.86x    │
└─────────────────┴──────────┴──────────┘

Matrix Size: [2048, 2048] x [2048, 2048]
┌─────────────────┬───────────┬──────────┐
│ Kernel Type     │ Time      │ Speedup  │
├─────────────────┼───────────┼──────────┤
│ No Parallel     │ 173.175ms │ 1.00x    │
│ Weight Parallel │ 103.112ms │ 1.68x    │
│ Activation Par. │ 93.276ms  │ 1.86x    │
└─────────────────┴───────────┴──────────┘

The pattern is clear: at all three sizes the activation parallel kernel is fastest for the I2_S format, at 1.86x-2.00x over the baseline compared with 1.50x-1.68x for weight parallel.
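
The speedup column is just the baseline time divided by the kernel time. A small Python check using the timings from the tables above (the dictionary keys and function name are mine):

```python
# Timings in ms copied from the tables above:
# (no parallel, weight parallel, activation parallel)
timings_ms = {
    "[32, 2048]": (2.400, 1.599, 1.202),
    "[128, 2048]": (10.820, 6.458, 5.805),
    "[2048, 2048]": (173.175, 103.112, 93.276),
}


def speedups(baseline, weight_par, act_par):
    """Return (weight parallel, activation parallel) speedups over the baseline."""
    return round(baseline / weight_par, 2), round(baseline / act_par, 2)


for size, times in timings_ms.items():
    weight, act = speedups(*times)
    print(f"{size}: weight parallel {weight:.2f}x, activation parallel {act:.2f}x")
```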

Fine-Tuning Your Configuration

BitNet exposes tiling parameters in gemm-config.h. These control how matrix operations are broken into cache-friendly blocks.

The Key Parameters

// Default configuration
#define ROW_BLOCK_SIZE 4      // Rows processed per iteration
#define COL_BLOCK_SIZE 128    // Columns in each tile
#define PARALLEL_SIZE 4       // Parallelism degree
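
To show what a row and column block size mean in practice, here is a generic blocked matrix multiply in Python. It is a textbook tiling loop, not BitNet's kernel; the parameter names only mirror the macros above, and PARALLEL_SIZE is left out because I am not modelling how the real kernel uses it.

```python
def tiled_matmul(a, b, row_block=4, col_block=128):
    """Multiply a (n x k) by b (k x m), visiting the output in row_block x col_block tiles."""
    n, k = len(a), len(a[0])
    m = len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, row_block):          # step over row tiles
        for j0 in range(0, m, col_block):      # step over column tiles
            # Everything inside this tile reuses the same rows of a and columns of b,
            # which is what keeps the data in cache on real hardware.
            for i in range(i0, min(i0 + row_block, n)):
                for j in range(j0, min(j0 + col_block, m)):
                    c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c


a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 0, 2], [0, 1, 3]]
print(tiled_matmul(a, b, row_block=2, col_block=2))
```

The result is identical for any block sizes. Only the order in which memory is touched changes, which is why these values affect speed but not output.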

I experimented with different values:

ROW_BLOCK_SIZE variations (COL_BLOCK_SIZE=128, PARALLEL_SIZE=4):
┌────────────┬─────────────┬───────────────────────────┐
│ Value      │ Cache Use   │ Best For                  │
├────────────┼─────────────┼───────────────────────────┤
│ 2          │ Lower       │ Small cache CPUs          │
│ 4 (default)│ Balanced    │ Most modern CPUs          │
│ 8          │ Higher      │ Large cache (server CPUs) │
│ 16         │ Very High   │ May cause cache thrashing │
│ 32         │ Extreme     │ Usually slower            │
└────────────┴─────────────┴───────────────────────────┘

COL_BLOCK_SIZE variations (ROW_BLOCK_SIZE=4, PARALLEL_SIZE=4):
┌────────────┬─────────────┬───────────────────────────┐
│ Value      │ Memory Use  │ Best For                  │
├────────────┼─────────────┼───────────────────────────┤
│ 32         │ Low         │ Memory-constrained        │
│ 64         │ Lower       │ Older CPUs                │
│ 128 (def.) │ Balanced    │ Sweet spot for most       │
│ 256        │ Higher      │ Modern CPUs with L3 cache │
│ 512        │ High        │ Server CPUs               │
│ 1024       │ Very High   │ May overflow cache        │
└────────────┴─────────────┴───────────────────────────┘

PARALLEL_SIZE variations:
┌────────────┬─────────────────────────────────────────┐
│ Value      │ When to Use                             │
├────────────┼─────────────────────────────────────────┤
│ 2          │ Few cores (2-4 physical cores)          │
│ 4 (default)│ Standard (6-8 cores)                    │
│ 8          │ Many cores (12+ cores, server CPUs)     │
└────────────┴─────────────────────────────────────────┘

Finding Your Optimal Configuration

I wrote a simple benchmark script to test different configurations:

#!/usr/bin/env bash
# Sweep tiling values: edit the config, rebuild, then run a fixed prompt.
set -euo pipefail
for row in 2 4 8; do
  for col in 64 128 256; do
    echo "Testing ROW=$row COL=$col"
    # -i.bak works with both GNU sed (Linux) and BSD sed (macOS)
    sed -i.bak "s/#define ROW_BLOCK_SIZE.*/#define ROW_BLOCK_SIZE $row/" include/gemm-config.h
    sed -i.bak "s/#define COL_BLOCK_SIZE.*/#define COL_BLOCK_SIZE $col/" include/gemm-config.h
    python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
    # time reports wall-clock duration so the runs can be compared
    time python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Test" -n 100
  done
done
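
If you would rather collect one number per run, a small Python wrapper can time the inference command. This is a rough sketch: the function name is mine, and because it measures wall-clock time for the whole process (including model loading), it understates pure generation speed. It is only meant for comparing configurations against each other.

```python
import subprocess
import sys
import time


def tokens_per_second(cmd, n_tokens):
    """Run cmd to completion and return n_tokens divided by wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


# Example call using the same command as the loop above (not run here):
# tokens_per_second(
#     [sys.executable, "run_inference.py",
#      "-m", "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
#      "-p", "Test", "-n", "100"],
#     n_tokens=100,
# )
```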

The optimal values depend on your CPU’s cache hierarchy. Here’s my recommendation:

+-------------------+--------+--------+----------+
| CPU Type          | ROW    | COL    | PARALLEL |
+-------------------+--------+--------+----------+
| Laptop (4-6 core) | 4      | 64-128 | 2-4      |
| Desktop (8+ core) | 4-8    | 128-256| 4        |
| Server (16+ core) | 8      | 256-512| 8        |
+-------------------+--------+--------+----------+
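
The same table as a lookup, in case you want to seed a sweep from the core count. The function name and the exact thresholds are my assumptions: the table does not say where a 7-core part belongs, so anything under 8 cores is treated as a laptop here. Each value is a (low, high) range to try, not a guaranteed optimum.

```python
def recommend_tiling(physical_cores):
    """Return (low, high) starting ranges for each parameter, following the table above."""
    if physical_cores >= 16:   # server row
        return {"ROW": (8, 8), "COL": (256, 512), "PARALLEL": (8, 8)}
    if physical_cores >= 8:    # desktop row
        return {"ROW": (4, 8), "COL": (128, 256), "PARALLEL": (4, 4)}
    # laptop row; also used for anything below 8 cores (assumption)
    return {"ROW": (4, 4), "COL": (64, 128), "PARALLEL": (2, 4)}


print(recommend_tiling(8))
```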

Architecture-Specific Optimizations

BitNet compiles with different optimizations based on your CPU architecture.

x86 with AVX2

Most modern Intel and AMD CPUs support AVX2. BitNet detects this automatically, and you can confirm it yourself on Linux:

lscpu | grep -qw avx2 && echo "AVX2 supported"

AVX2 extends 256-bit vector operations to integers, so one instruction can process 32 8-bit values or 8 32-bit values.

ARM with NEON

All ARMv8 CPUs support NEON. Apple Silicon and modern ARM servers benefit significantly:

NEON (baseline):
  - All ARMv8 CPUs
  - 128-bit vector operations

DOTPROD (ARMv8.2+):
  - Apple M-series, AWS Graviton3+
  - Hardware dot product instructions
  - ~15-20% faster for matrix operations

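Check if your ARM CPU supports DOTPROD. On Linux the kernel lists NEON as asimd and DOTPROD as asimddp:

grep -qw asimd /proc/cpuinfo && echo "NEON supported"
grep -qw asimddp /proc/cpuinfo && echo "DOTPROD supported"

On macOS, which has no /proc/cpuinfo, query sysctl instead (prints 1 when supported):

sysctl -n hw.optional.arm.FEAT_DotProd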
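
The same checks can be scripted. Below is a small Python sketch that parses /proc/cpuinfo text on Linux; the function names are mine. Note that Linux reports NEON as asimd and the dot product extension as asimddp in the Features line, so those are the strings the code looks for.

```python
from pathlib import Path


def cpu_features(cpuinfo_text):
    """Collect feature names from 'flags' (x86) or 'Features' (ARM) lines."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            features.update(value.split())
    return features


def simd_support(cpuinfo_text):
    """Report which of the SIMD extensions discussed above are listed."""
    feats = cpu_features(cpuinfo_text)
    return {
        "avx2": "avx2" in feats,
        "neon": "asimd" in feats,
        "dotprod": "asimddp" in feats,
    }


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # Linux only; macOS and Windows have no /proc/cpuinfo
    print(simd_support(cpuinfo.read_text()))
```

Matching whole feature names avoids the false positives a substring search would give, since asimd is also a prefix of asimdhp and asimddp.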

GEMM vs GEMV: Understanding the Difference

BitNet uses two matrix operation types depending on the inference phase:

┌─────────────────────────────────────────────────────────────┐
│ GEMV (Matrix-Vector)                                        │
│ Used during: Token generation (one token at a time)         │
│ Operation: [1, hidden] x [hidden, hidden] = [1, hidden]     │
│ Memory bound: Limited by memory bandwidth                   │
│ Optimization: Focus on memory access patterns               │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ GEMM (Matrix-Matrix)                                        │
│ Used during: Prompt processing (all tokens at once)         │
│ Operation: [seq, hidden] x [hidden, hidden] = [seq, hidden] │
│ Compute bound: Limited by CPU compute                       │
│ Optimization: Focus on cache tiling and parallelism         │
└─────────────────────────────────────────────────────────────┘

For long prompts, GEMM optimization matters more. For chat interactions, GEMV optimization is key.
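
A back-of-the-envelope calculation shows why the two phases behave so differently. The sketch below counts arithmetic operations per byte of weight data read, assuming 2 bits per weight and ignoring activation and output traffic. Both are simplifications of mine, so treat the numbers as a rough comparison only.

```python
def weight_bytes(hidden, bits_per_weight=2):
    """Bytes needed to store a hidden x hidden weight matrix (2-bit weights assumed)."""
    return hidden * hidden * bits_per_weight / 8


def ops_per_weight_byte(seq, hidden, bits_per_weight=2):
    """Arithmetic operations performed per byte of weights read from memory."""
    ops = 2 * seq * hidden * hidden  # one multiply and one add per weight per token
    return ops / weight_bytes(hidden, bits_per_weight)


for seq in (1, 32, 128):
    print(f"seq={seq}: {ops_per_weight_byte(seq, 2048):.0f} ops per weight byte")
```

With one token the CPU does very little work for each byte it loads, so memory bandwidth is the limit. With a long prompt the same weights are reused for every token, and the ratio grows in proportion to the sequence length.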

Embedding Quantization: Q6_K Format

BitNet also applies quantization to the embedding layer. The Q6_K format offers a balance between speed and quality:

Q4_K (4-bit):
  - Smallest memory footprint
  - Slight quality degradation
  - Fastest loading

Q6_K (6-bit, recommended):
  - ~50% larger than Q4_K
  - Near-lossless quality
  - Good speed/quality tradeoff

FP16 (16-bit):
  - Unquantized half precision
  - Largest memory use
  - Best quality

For coding tasks, I found Q6_K to be the sweet spot. The quality difference from FP16 is negligible, but memory usage is significantly lower.
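
To put rough numbers on the memory difference, the sketch below estimates the embedding table size per format. The bits-per-weight figures are the ones the ggml K-quant block layouts work out to (4.5 for Q4_K, 6.5625 for Q6_K). The vocabulary and hidden sizes in the example are placeholders of mine, not BitNet's actual dimensions.

```python
# Effective bits per weight, including per-block scale overhead for the K-quants
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q6_K": 6.5625, "FP16": 16.0}


def embedding_megabytes(vocab_size, hidden_size, fmt):
    """Approximate size in MB of a vocab_size x hidden_size embedding table."""
    n_weights = vocab_size * hidden_size
    return n_weights * BITS_PER_WEIGHT[fmt] / 8 / 1e6


# Placeholder dimensions for illustration only
vocab_size, hidden_size = 100_000, 2_048
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {embedding_megabytes(vocab_size, hidden_size, fmt):.1f} MB")
```

By these figures Q6_K comes out about 46% larger than Q4_K, close to the ~50% quoted above, and well under half the size of FP16.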

Mistakes I Made

Mistake 1: Using Wrong Kernel for Quantization Type

I initially used weight parallel kernels with I2_S format. But for I2_S, activation parallel is faster:

I2_S format:
  - Use ACTIVATION PARALLEL (unpacks weights once)

TL1 format:
  - Use WEIGHT PARALLEL (no unpacking needed)

TL2 format:
  - Use NO PARALLEL (already decompressed)

Mistake 2: Ignoring Cache Hierarchy

I set COL_BLOCK_SIZE=1024 on my laptop CPU with a small L3 cache. Performance dropped 40% due to cache thrashing.

Rule of thumb: Keep your working set within L3 cache size.

Mistake 3: Over-Parallelizing

I set PARALLEL_SIZE=8 on a 4-core laptop. Context switching overhead negated any parallelism gains.

Match your parallelism to your physical cores, not logical threads.

Practical Optimization Checklist

Before benchmarking your setup:

  - Verify quantization type matches kernel: I2_S should use activation parallel
  - Check architecture support: AVX2 for x86, DOTPROD for ARM
  - Match tiling to cache: smaller values for smaller caches
  - Right-size parallelism: match physical cores
  - Use Q6_K for embeddings: good quality/speed balance

My Final Configuration

On my Apple M2 (8 cores, shared L3 cache), this configuration works best:

#define ROW_BLOCK_SIZE 4
#define COL_BLOCK_SIZE 128
#define PARALLEL_SIZE 4

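With activation parallel kernels and these settings, I consistently get 35+ tokens/second on the 2B model. These tiling values are the same as the defaults, so on this machine the gain came from the kernel choice rather than from retuning the tiles.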

Summary

BitNet’s CPU performance optimization comes from:

  - Parallel kernels: activation parallel for the I2_S format gave a 1.86x-2.00x speedup in the matrix benchmarks
  - Tiling configuration: match ROW_BLOCK_SIZE and COL_BLOCK_SIZE to your CPU cache
  - Architecture tuning: AVX2 on x86; DOTPROD on ARM for an extra ~15-20%
  - Right quantization: I2_S for weights, Q6_K for embeddings

The default configuration works, but spending 10 minutes tuning these parameters can double or triple your inference speed. For local LLM use, that’s the difference between acceptable and excellent.
