How to Benchmark BitNet Performance: End-to-End Testing Guide

Mar 19, 2026

Problem

I set up BitNet on my machine and ran a few prompts. Everything seemed to work, but I had no idea if I was getting the performance I should be getting. The paper claims impressive speedups, but how do I verify that on my hardware?

I tried running some informal tests with a stopwatch, but the results were inconsistent. I needed a proper benchmarking methodology.

What BitNet Provides

BitNet includes a built-in benchmark script at utils/e2e_benchmark.py. Here’s how to use it:

python utils/e2e_benchmark.py -m /path/to/model.gguf -n 200 -p 256 -t 4

The parameters:

-m: Path to your BitNet model file (GGUF format)
-n: Number of tokens to generate
-p: Number of prompt tokens (input size)
-t: Number of CPU threads to use

My First Benchmark

I started with a basic test:

python utils/e2e_benchmark.py \
  -m models/bitnet_b1_58-2b-4t/ggml-model-i2_s.gguf \
  -n 128 \
  -p 128 \
  -t 8

The output looked like this:

Loading model...
Model loaded in 2.34 seconds

Running benchmark with 128 prompt tokens and 128 generated tokens...

Prompt processing: 156.23 tokens/second
Token generation: 42.87 tokens/second
Total time: 4.12 seconds

Memory usage: 1.2 GB

But what do these numbers mean? I needed to understand the metrics.

Understanding the Metrics

BitNet benchmarking produces two key metrics:

Prompt Processing (pp)

This measures how fast the model processes your input. It’s the “reading” speed.

Prompt Processing Speed = Input Tokens / Processing Time

Higher is better. This matters for:

Long context inputs
Document analysis
Code review tasks

Token Generation (tg)

This measures how fast the model produces output. It’s the “writing” speed.

Token Generation Speed = Output Tokens / Generation Time

Higher is better. This matters for:

Chat applications (user experience)
Content generation
Code completion

Why the Difference Matters

In every run, token generation was slower than prompt processing. This is expected:

┌─────────────────────────────────────────────────────────────┐
│                    INFERENCE PIPELINE                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  PROMPT PROCESSING          TOKEN GENERATION                │
│  (Parallel computation)     (Sequential generation)         │
│                                                              │
│  [Token][Token][Token]      [Token] → [Token] → [Token]    │
│     ↓    ↓    ↓                ↓         ↓         ↓       │
│  All at once              One at a time                    │
│                                                              │
│  Fast: 150+ tokens/sec     Slower: 30-50 tokens/sec         │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Prompt processing can handle all input tokens in parallel, because they are known up front. Token generation must be sequential, because each new token depends on the tokens before it.
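The two rates combine into a rough end-to-end estimate: time to read the prompt plus time to write the answer. Here is a minimal Python sketch using the numbers from my first run. The function name is my own, and the estimate ignores model load time and other fixed overhead, so treat it as a lower bound rather than a prediction.

```python
def estimate_latency(prompt_tokens, gen_tokens, pp_rate, tg_rate):
    """Estimate seconds for one request from the two benchmark rates.

    pp_rate and tg_rate are in tokens/second. Fixed overhead (model
    loading, sampling, I/O) is deliberately ignored.
    """
    prompt_time = prompt_tokens / pp_rate  # "reading" phase
    gen_time = gen_tokens / tg_rate        # "writing" phase
    return prompt_time, gen_time


# Rates from the first benchmark above (128 prompt / 128 generated tokens).
prompt_time, gen_time = estimate_latency(128, 128, pp_rate=156.23, tg_rate=42.87)
total = prompt_time + gen_time
print(f"prompt: {prompt_time:.2f}s  generation: {gen_time:.2f}s  total: {total:.2f}s")
print(f"generation share of total: {gen_time / total:.0%}")
```

Even with equal token counts, generation accounts for most of the time, which is why the tg number matters most for chat-style workloads.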

Comparing Configurations

I ran multiple benchmarks to find the optimal configuration for my hardware.

Thread Count Impact

# Test with 4 threads
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 4

# Test with 8 threads
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 8

# Test with 16 threads
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 16

Results on my 8-core machine:

Threads | Prompt (tok/s) | Generation (tok/s) | Memory
--------|----------------|--------------------|--------
   4    |      98.45     |       28.12        | 1.1 GB
   8    |     156.23     |       42.87        | 1.2 GB
  16    |     152.89     |       41.54        | 1.3 GB

The 16-thread run was slightly slower than the 8-thread run. The likely reasons:

My CPU has 8 physical cores (16 logical cores with hyperthreading)
Hyperthreading doesn’t help much for this workload
Extra threads add coordination overhead
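To make the diminishing returns concrete, the sketch below computes speedup and parallel efficiency from the table above. The helper name and the choice of generation speed as the basis are my own; 100% efficiency would mean perfectly linear scaling with thread count.

```python
# Measured rates from the table above: threads -> (prompt tok/s, generation tok/s).
results = {
    4: (98.45, 28.12),
    8: (156.23, 42.87),
    16: (152.89, 41.54),
}


def scaling_report(results):
    """Return (threads, speedup, efficiency) rows relative to the smallest thread count.

    Speedup is based on generation speed; efficiency is speedup divided by
    the increase in thread count (1.0 would be perfect linear scaling).
    """
    base_threads = min(results)
    base_rate = results[base_threads][1]
    rows = []
    for threads in sorted(results):
        speedup = results[threads][1] / base_rate
        efficiency = speedup / (threads / base_threads)
        rows.append((threads, speedup, efficiency))
    return rows


for threads, speedup, efficiency in scaling_report(results):
    print(f"{threads:>2} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")

# Pick the thread count with the highest generation speed.
best_threads = max(results, key=lambda t: results[t][1])
print("best thread count:", best_threads)
```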

Prompt/Generation Ratios

Different workloads have different characteristics:

┌─────────────────┬────────────────┬───────────────────────────────┐
│ Configuration   │ Prompt/Gen     │ Use Case                       │
├─────────────────┼────────────────┼───────────────────────────────┤
│ pp128 / tg128   │ 128 / 128      │ Standard benchmark (balanced)  │
│ pp512 / tg128   │ 512 / 128      │ Long context (RAG, documents)  │
│ pp64  / tg512   │ 64  / 512      │ Generation-heavy (writing)     │
│ pp1024/ tg64    │ 1024/ 64       │ Analysis-heavy (code review)   │
└─────────────────┴────────────────┴───────────────────────────────┘

I ran tests with different ratios:

# Balanced workload
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 8

# Long context workload
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 512 -t 8

# Generation-heavy workload
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 512 -p 64 -t 8
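If you test many ratios, it can help to generate the command lines from a table instead of typing them. The sketch below is a hypothetical helper, not part of BitNet. It only uses the -m, -n, -p and -t flags shown above, and it prints the commands rather than running them.

```python
import shlex

# Hypothetical helper: only the -m/-n/-p/-t flags come from the article.
WORKLOADS = {
    "balanced": {"prompt": 128, "gen": 128},
    "long-context": {"prompt": 512, "gen": 128},
    "generation-heavy": {"prompt": 64, "gen": 512},
    "analysis-heavy": {"prompt": 1024, "gen": 64},
}


def build_command(model, prompt, gen, threads):
    """Build the argv list for one e2e_benchmark.py run."""
    return [
        "python", "utils/e2e_benchmark.py",
        "-m", model,
        "-n", str(gen),
        "-p", str(prompt),
        "-t", str(threads),
    ]


for name, cfg in WORKLOADS.items():
    cmd = build_command("models/bitnet.gguf", cfg["prompt"], cfg["gen"], threads=8)
    # Print instead of executing; pass cmd to subprocess.run() to run it.
    print(f"# {name}")
    print(shlex.join(cmd))
```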

Using Dummy Models for Testing

What if you don’t have access to the full BitNet model? BitNet provides a dummy model generator:

python utils/generate-dummy-bitnet-model.py \
  models/bitnet_b1_58-large \
  --outfile models/dummy-bitnet-125m.tl1.gguf \
  --outtype tl1 \
  --model-size 125M

Then benchmark the dummy model:

python utils/e2e_benchmark.py \
  -m models/dummy-bitnet-125m.tl1.gguf \
  -p 512 \
  -n 128

Dummy models are useful for:

Testing kernel implementations
Validating architecture support
Comparing CPU optimizations
CI/CD pipeline validation

Interpreting Results

After running multiple benchmarks, I created a comparison table:

┌──────────────────┬─────────────────┬──────────────────┬──────────┐
│ Hardware         │ Prompt (tok/s)  │ Gen (tok/s)      │ Config   │
├──────────────────┼─────────────────┼──────────────────┼──────────┤
│ Intel i7-13800H  │ 180-220         │ 48-55            │ 8 threads│
│ AMD EPYC 7V13    │ 210-260         │ 52-60            │ 16 thr   │
│ Cobalt 100       │ 320-380         │ 72-85            │ Arm64    │
│ My machine       │ 156             │ 43               │ 8 thr    │
└──────────────────┴─────────────────┴──────────────────┴──────────┘

My results were reasonable for my hardware class. If your numbers are significantly lower than expected, check:

CPU governor settings (use performance mode)
Memory bandwidth (close other apps)
Thermal throttling (check CPU temperature)
NUMA configuration on multi-socket systems
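The first item on that list can be checked from a script. The sketch below reads the CPU frequency governor from the standard Linux cpufreq sysfs files. It is Linux-only, and it assumes those files are exposed, which is often not the case in VMs and containers. The function name is my own.

```python
from pathlib import Path


def read_governors(cpufreq_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} from Linux cpufreq sysfs files.

    Returns an empty dict when the files are missing (non-Linux systems,
    some VMs and containers).
    """
    governors = {}
    for gov_file in sorted(Path(cpufreq_root).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        try:
            governors[gov_file.parent.parent.name] = gov_file.read_text().strip()
        except OSError:
            continue  # file unreadable; skip this core
    return governors


governors = read_governors()
if not governors:
    print("No cpufreq information found (not Linux, or not exposed here).")
else:
    not_performance = {cpu: g for cpu, g in governors.items() if g != "performance"}
    if not_performance:
        print("Cores not in performance mode:", not_performance)
    else:
        print("All cores use the performance governor.")
```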

Creating a Benchmark Script

I automated my testing with a simple script:

#!/bin/bash

MODEL="models/bitnet_b1_58-2b-4t/ggml-model-i2_s.gguf"
THREADS="4 8 16"

echo "BitNet Benchmark Suite"
echo "======================"
echo "Model: $MODEL"
echo "Date: $(date)"
echo ""

for t in $THREADS; do
  echo "--- Testing with $t threads ---"
  python utils/e2e_benchmark.py -m "$MODEL" -n 128 -p 128 -t "$t"
  echo ""
done

echo "Benchmark complete."

Run it:

chmod +x benchmark_all.sh
./benchmark_all.sh | tee benchmark_results.txt
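Once the results are in a text file, a short parser turns them into numbers you can compare. This sketch assumes the output layout shown earlier in this post, plus the "--- Testing with N threads ---" headers printed by the script above. If your BitNet version prints results differently, the patterns need adjusting.

```python
import re

# Assumes the output layout shown earlier in this post; adjust the
# patterns if your BitNet version prints results differently.
HEADER = re.compile(r"^--- Testing with (\d+) threads ---$")
PROMPT = re.compile(r"^Prompt processing:\s*([\d.]+) tokens/second$")
GEN = re.compile(r"^Token generation:\s*([\d.]+) tokens/second$")


def parse_results(text):
    """Return {threads: {"pp": float, "tg": float}} from benchmark_all.sh output."""
    results = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if m := HEADER.match(line):
            current = results.setdefault(int(m.group(1)), {})
        elif current is not None and (m := PROMPT.match(line)):
            current["pp"] = float(m.group(1))
        elif current is not None and (m := GEN.match(line)):
            current["tg"] = float(m.group(1))
    return results


sample = """\
--- Testing with 4 threads ---
Prompt processing: 98.45 tokens/second
Token generation: 28.12 tokens/second

--- Testing with 8 threads ---
Prompt processing: 156.23 tokens/second
Token generation: 42.87 tokens/second
"""

for threads, rates in parse_results(sample).items():
    print(f"{threads} threads: pp={rates['pp']} tok/s, tg={rates['tg']} tok/s")
```

To parse a real run, replace sample with the contents of benchmark_results.txt.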

Memory Usage Monitoring

Performance isn’t just about speed. Memory matters too:

# Run benchmark and monitor memory
/usr/bin/time -v python utils/e2e_benchmark.py \
  -m models/bitnet.gguf \
  -n 128 -p 128 -t 8

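# Look for "Maximum resident set size (kbytes)" in the output: this is peak memory.
# Note: -v requires GNU time (Linux). On macOS, use /usr/bin/time -l instead.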

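BitNet’s 1.58-bit quantization should need far less memory for weights than FP16 or INT8 models. The table shows approximate weight sizes for a 2B-parameter model; measured runtime usage is higher (1.2 GB in my first run) because it also includes the KV cache, working buffers, and any tensors stored at higher precision:

┌─────────────────┬─────────────────┐
│ Model           │ Weights (2B)    │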
├─────────────────┼─────────────────┤
│ FP16            │ ~4.0 GB         │
│ INT8            │ ~2.0 GB         │
│ BitNet (1.58b)  │ ~0.4 GB         │
└─────────────────┴─────────────────┘
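The figures in the table follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below counts weights only, so the KV cache and runtime buffers are not included. Real file formats may also pack ternary weights less densely than the theoretical 1.58 bits.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 2e9  # a 2B-parameter model, as in the table above

for name, bits in [("FP16", 16), ("INT8", 8), ("BitNet (1.58b)", 1.58)]:
    print(f"{name:<15} ~{weight_memory_gb(N_PARAMS, bits):.2f} GB")
```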

Why BitNet Claims Speedup

BitNet’s 1.58-bit quantization reduces:

Memory bandwidth requirements (smaller model)
Memory footprint (more of the model fits in fast cache)
Computation (ternary weights replace multiplications with additions and subtractions)

For token generation in particular, most of the speedup comes from reduced memory movement rather than from cheaper arithmetic.
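One way to see why memory movement dominates: during generation, each new token requires reading roughly all of the weights once. Memory bandwidth divided by weight size therefore gives a loose upper bound on tokens per second. The sketch below uses the approximate weight sizes from the memory table above and a hypothetical 50 GB/s of memory bandwidth. That bandwidth figure is an assumption for illustration, not a measurement.

```python
def max_tokens_per_second(bandwidth_gb_s, weights_gb):
    """Loose upper bound on generation speed when weight reads are the bottleneck."""
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 50.0  # hypothetical memory bandwidth, not a measured value

# Approximate weight sizes for a 2B-parameter model (from the table above).
for name, weights_gb in [("FP16", 4.0), ("INT8", 2.0), ("BitNet (1.58b)", 0.4)]:
    bound = max_tokens_per_second(BANDWIDTH_GB_S, weights_gb)
    print(f"{name:<15} <= {bound:.1f} tok/s")
```

Real throughput will sit below these bounds, but the ratio between them shows why a smaller model generates faster on the same machine.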

Benchmarking Best Practices

Run multiple times: Single runs can be affected by system state
Warm up: The first run is often slower due to cold caches
Control environment: Close other applications
Document everything: Hardware, OS, model version, parameters
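The first two practices can be sketched as a small helper that discards warm-up runs and reports the spread across the rest. The function name is my own, and the readings below are made-up stand-ins so the example runs without a model. In practice, measure would run the benchmark and return the parsed tokens/second.

```python
import statistics


def summarize_runs(measure, repeats=5, warmup=1):
    """Call measure() warmup + repeats times; summarize the non-warmup results.

    measure is any zero-argument callable returning tokens/second, e.g. a
    wrapper that runs the benchmark and parses its output.
    """
    for _ in range(warmup):
        measure()  # discard: the first run often pays for cold caches
    samples = [measure() for _ in range(repeats)]
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


# Stand-in for a real benchmark run: replays made-up readings so the
# example is runnable without a model. The first value plays the slow warm-up.
fake_readings = iter([35.0, 42.0, 43.0, 42.5, 43.5, 42.0])
summary = summarize_runs(lambda: next(fake_readings), repeats=5, warmup=1)
print(f"{summary['mean']:.2f} ± {summary['stdev']:.2f} tok/s "
      f"(min {summary['min']}, max {summary['max']})")
```

Reporting the standard deviation alongside the mean makes it easier to tell a real configuration difference from run-to-run noise.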

Comparing Against Baselines

To validate BitNet’s claims, compare against:

Standard FP16 model (if available)
INT8 quantized version
Other quantization methods (GPTQ, AWQ)

Summary

In this post, I covered how to benchmark BitNet inference performance using the built-in e2e_benchmark.py script. The key metrics are prompt processing speed (tokens/second for input) and token generation speed (tokens/second for output).

I found that:

8 threads was optimal for my 8-core CPU (hyperthreading didn’t help)
Token generation was consistently slower than prompt processing (sequential vs. parallel)
Different workloads need different benchmark configurations
Memory usage should be significantly lower than with FP16 or INT8 models

The benchmark script lets you validate BitNet’s performance claims on your hardware and optimize your configuration. Without proper benchmarking, you can’t know if you’re getting the performance you should be.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!