
How to Benchmark BitNet Performance: End-to-End Testing Guide

Problem

I set up BitNet on my machine and ran a few prompts. Everything seemed to work, but I had no idea if I was getting the performance I should be getting. The paper claims impressive speedups, but how do I verify that on my hardware?

I tried running some informal tests with a stopwatch, but the results were inconsistent. I needed a proper benchmarking methodology.

What BitNet Provides

BitNet includes a built-in benchmark script at utils/e2e_benchmark.py. Here’s how to use it:

Basic benchmark command
python utils/e2e_benchmark.py -m /path/to/model.gguf -n 200 -p 256 -t 4

The parameters:

  • -m: Path to your BitNet model file (GGUF format)
  • -n: Number of tokens to generate
  • -p: Number of prompt tokens (input size)
  • -t: Number of CPU threads to use
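
If you would rather drive the script from Python than from the shell, a minimal sketch could look like the following. It only uses the four flags listed above. The helper names `build_benchmark_command` and `run_benchmark` are my own and not part of BitNet, and the model path is a placeholder.

```python
import subprocess
import sys


def build_benchmark_command(model_path, n_tokens=128, n_prompt=128, threads=4):
    """Build the argument list for utils/e2e_benchmark.py using the flags above."""
    return [
        sys.executable, "utils/e2e_benchmark.py",
        "-m", str(model_path),
        "-n", str(n_tokens),
        "-p", str(n_prompt),
        "-t", str(threads),
    ]


def run_benchmark(model_path, **kwargs):
    """Run one benchmark from the BitNet repo root and return its stdout as text."""
    cmd = build_benchmark_command(model_path, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Placeholder path: point this at your own model file before calling run_benchmark.
command = build_benchmark_command("/path/to/model.gguf", n_tokens=200, n_prompt=256, threads=4)
print(" ".join(command))
```

Building the command as a list avoids shell quoting problems when the model path contains spaces.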

My First Benchmark

I started with a basic test:

Running initial benchmark
python utils/e2e_benchmark.py \
-m models/bitnet_b1_58-2b-4t/ggml-model-i2_s.gguf \
-n 128 \
-p 128 \
-t 8

The output looked like this:

Benchmark output
Loading model...
Model loaded in 2.34 seconds
Running benchmark with 128 prompt tokens and 128 generated tokens...
Prompt processing: 156.23 tokens/second
Token generation: 42.87 tokens/second
Total time: 4.12 seconds
Memory usage: 1.2 GB

But what do these numbers mean? I needed to understand the metrics.

Understanding the Metrics

BitNet benchmarking produces two key metrics:

Prompt Processing (pp)

This measures how fast the model processes your input. It’s the “reading” speed.

Prompt Processing Speed = Input Tokens / Processing Time

Higher is better. This matters for:

  • Long context inputs
  • Document analysis
  • Code review tasks

Token Generation (tg)

This measures how fast the model produces output. It’s the “writing” speed.

Token Generation Speed = Output Tokens / Generation Time

Higher is better. This matters for:

  • Chat applications (user experience)
  • Content generation
  • Code completion
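
To make the two formulas concrete, here is a small worked example using the numbers from my first benchmark. The function names are my own, chosen for illustration.

```python
def tokens_per_second(tokens, seconds):
    """Throughput as defined above: tokens divided by elapsed time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


def seconds_for(tokens, tok_per_s):
    """Invert the formula: how long a phase took at a given throughput."""
    return tokens / tok_per_s


# Numbers from the sample output earlier in this post.
prompt_s = seconds_for(128, 156.23)  # about 0.82 s of prompt processing
gen_s = seconds_for(128, 42.87)      # about 2.99 s of token generation
print(f"prompt: {prompt_s:.2f} s, generation: {gen_s:.2f} s, sum: {prompt_s + gen_s:.2f} s")
```

The two phases add up to about 3.8 seconds, a little under the 4.12 seconds reported as total time. I assume the remainder is overhead outside the two timed phases.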

Why the Difference Matters

I noticed that token generation is much slower than prompt processing when measured in tokens per second. This is normal:

┌─────────────────────────────────────────────────────────────┐
│                     INFERENCE PIPELINE                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PROMPT PROCESSING           TOKEN GENERATION               │
│  (Parallel computation)      (Sequential generation)        │
│                                                             │
│  [Token][Token][Token]       [Token] → [Token] → [Token]    │
│     ↓      ↓      ↓             ↓         ↓         ↓       │
│       All at once                   One at a time           │
│                                                             │
│  Fast: 150+ tokens/sec       Slower: 30-50 tokens/sec       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

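Prompt processing can push all input tokens through the model in parallel. Token generation must be sequential because each new token depends on all the tokens before it, so every step reads the model weights again to produce a single token.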

Comparing Configurations

I ran multiple benchmarks to find the optimal configuration for my hardware.

Thread Count Impact

Testing different thread counts
# Test with 4 threads
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 4
# Test with 8 threads
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 8
# Test with 16 threads
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 16

Results on my 8-core machine:

Thread comparison results
Threads | Prompt (tok/s) | Generation (tok/s) | Memory
--------|----------------|--------------------|--------
4 | 98.45 | 28.12 | 1.1 GB
8 | 156.23 | 42.87 | 1.2 GB
16 | 152.89 | 41.54 | 1.3 GB

The 16-thread run was actually slightly slower than 8 threads. This is because:

  • My CPU has 8 physical cores (16 logical cores with hyperthreading)
  • Hyperthreading doesn’t help much for this workload, because the two logical cores on each physical core share the same execution units
  • Extra threads add coordination overhead
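
A quick way to see the diminishing returns is to compute scaling efficiency from the table above. This is a small sketch with helper names of my own; the numbers are the generation speeds I measured.

```python
# Generation throughput (tok/s) per thread count, copied from the table above.
generation_tok_s = {4: 28.12, 8: 42.87, 16: 41.54}


def best_thread_count(results):
    """Return the thread count with the highest throughput."""
    return max(results, key=results.get)


def scaling_efficiency(results, baseline_threads):
    """Speedup over the baseline divided by the increase in thread count.

    1.0 means perfect linear scaling; lower values mean diminishing returns.
    """
    base = results[baseline_threads]
    return {
        threads: (tok_s / base) / (threads / baseline_threads)
        for threads, tok_s in sorted(results.items())
    }


print("best:", best_thread_count(generation_tok_s))
for threads, efficiency in scaling_efficiency(generation_tok_s, 4).items():
    print(f"{threads:>2} threads: {efficiency:.2f}")
```

Going from 4 to 8 threads keeps roughly three quarters of the ideal speedup, while 16 threads drops well below half, which matches the hyperthreading explanation.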

Prompt/Generation Ratios

Different workloads have different characteristics:

Common benchmark configurations
┌─────────────────┬────────────────┬───────────────────────────────┐
│ Configuration │ Prompt/Gen │ Use Case │
├─────────────────┼────────────────┼───────────────────────────────┤
│ pp128 / tg128 │ 128 / 128 │ Standard benchmark (balanced) │
│ pp512 / tg128 │ 512 / 128 │ Long context (RAG, documents) │
│ pp64 / tg512 │ 64 / 512 │ Generation-heavy (writing) │
│ pp1024 / tg64 │ 1024 / 64 │ Analysis-heavy (code review) │
└─────────────────┴────────────────┴───────────────────────────────┘

I ran tests with different ratios:

Testing different workload types
# Balanced workload
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 8
# Long context workload
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 512 -t 8
# Generation-heavy workload
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 512 -p 64 -t 8
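
Before running each configuration, it can help to estimate how long it should take. The sketch below assumes the prompt and generation speeds stay the same as in my balanced 8-thread run, which is a simplification: real throughput shifts with prompt length. The function name is my own.

```python
def estimated_latency(n_prompt, n_gen, prompt_tok_s, gen_tok_s):
    """Estimate end-to-end seconds as prompt time plus generation time."""
    return n_prompt / prompt_tok_s + n_gen / gen_tok_s


# Throughputs from my 8-thread run; treat them as rough inputs, not constants.
PROMPT_TOK_S = 156.23
GEN_TOK_S = 42.87

workloads = {
    "balanced (pp128 / tg128)": (128, 128),
    "long context (pp512 / tg128)": (512, 128),
    "generation-heavy (pp64 / tg512)": (64, 512),
}

for name, (n_prompt, n_gen) in workloads.items():
    total = estimated_latency(n_prompt, n_gen, PROMPT_TOK_S, GEN_TOK_S)
    print(f"{name}: ~{total:.1f} s")
```

Under these assumptions the generation-heavy case takes roughly three times as long as the balanced one, even though it processes fewer tokens than the long context case.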

Using Dummy Models for Testing

What if you don’t have access to the full BitNet model? BitNet provides a dummy model generator:

Creating a dummy model for testing
python utils/generate-dummy-bitnet-model.py \
models/bitnet_b1_58-large \
--outfile models/dummy-bitnet-125m.tl1.gguf \
--outtype tl1 \
--model-size 125M

Then benchmark the dummy model:

Benchmarking dummy model
python utils/e2e_benchmark.py \
-m models/dummy-bitnet-125m.tl1.gguf \
-p 512 \
-n 128

Dummy models are useful for:

  • Testing kernel implementations
  • Validating architecture support
  • Comparing CPU optimizations
  • CI/CD pipeline validation

Interpreting Results

After running multiple benchmarks, I created a comparison table:

Benchmark comparison sheet
┌──────────────────┬─────────────────┬──────────────────┬──────────┐
│ Hardware │ Prompt (tok/s) │ Gen (tok/s) │ Config │
├──────────────────┼─────────────────┼──────────────────┼──────────┤
│ Intel i7-13800H │ 180-220 │ 48-55 │ 8 threads│
│ AMD EPYC 7V13 │ 210-260 │ 52-60 │ 16 thr │
│ Cobalt 100 │ 320-380 │ 72-85 │ Arm NEON │
│ My machine │ 156 │ 43 │ 8 thr │
└──────────────────┴─────────────────┴──────────────────┴──────────┘

My results were reasonable for my hardware class. If your numbers are significantly lower than expected, check:

  • CPU governor settings (use performance mode)
  • Memory bandwidth (close other apps)
  • Thermal throttling (check CPU temperature)
  • NUMA configuration on multi-socket systems
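
The first item on that checklist can be checked from a script. This is a minimal sketch for Linux only: it reads the standard cpufreq sysfs file for cpu0, which does not exist on macOS, Windows, or some virtual machines. The function names are my own.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; absent on macOS, Windows, and some VMs.
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")


def read_cpu_governor(path=GOVERNOR_PATH):
    """Return the governor name for cpu0, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def governor_warning(governor):
    """Return a warning string if the governor is known and not 'performance'."""
    if governor is None or governor == "performance":
        return None
    return f"CPU governor is '{governor}'; benchmark numbers may be lower than expected."


governor = read_cpu_governor()
print(governor_warning(governor) or f"Governor check passed or unavailable ({governor}).")
```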

Creating a Benchmark Script

I automated my testing with a simple script:

benchmark_all.sh
#!/bin/bash
MODEL="models/bitnet_b1_58-2b-4t/ggml-model-i2_s.gguf"
THREADS="4 8 16"

echo "BitNet Benchmark Suite"
echo "======================"
echo "Model: $MODEL"
echo "Date: $(date)"
echo ""

for t in $THREADS; do
    echo "--- Testing with $t threads ---"
    python utils/e2e_benchmark.py -m "$MODEL" -n 128 -p 128 -t "$t"
    echo ""
done

echo "Benchmark complete."

Run it:

Running benchmark suite
chmod +x benchmark_all.sh
./benchmark_all.sh | tee benchmark_results.txt
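
Once the results are in a text file, the numbers can be pulled out with a short parser. This sketch assumes the output lines look like the sample shown earlier in this post ("Prompt processing: ... tokens/second"); if your build prints a different format, the pattern needs adjusting. The names `METRIC_PATTERN` and `parse_metrics` are my own.

```python
import re

# Assumes the line format shown in the sample output earlier in this post.
METRIC_PATTERN = re.compile(
    r"^(Prompt processing|Token generation):\s*([\d.]+)\s*tokens/second",
    re.MULTILINE,
)


def parse_metrics(text):
    """Collect tok/s figures per metric label, in the order they appear."""
    metrics = {}
    for label, value in METRIC_PATTERN.findall(text):
        metrics.setdefault(label, []).append(float(value))
    return metrics


sample = """\
--- Testing with 8 threads ---
Prompt processing: 156.23 tokens/second
Token generation: 42.87 tokens/second
"""
print(parse_metrics(sample))
```

To use it on the saved file, read `benchmark_results.txt` and pass its contents to `parse_metrics`; each list then holds one value per thread count, in run order.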

Memory Usage Monitoring

Performance isn’t just about speed. Memory matters too:

Monitoring memory during benchmark
# Run the benchmark under GNU time (Linux); on macOS, use /usr/bin/time -l instead
/usr/bin/time -v python utils/e2e_benchmark.py \
    -m models/bitnet.gguf \
    -n 128 -p 128 -t 8
# In the report, look for:
# Maximum resident set size (kbytes), the peak memory used by the process

BitNet’s 1.58-bit quantization should show significantly lower memory usage than FP16 or INT8 models. Approximate weight-only sizes for a 2B-parameter model:

Memory comparison
┌─────────────────┬─────────────────┐
│ Model │ Memory (2B) │
├─────────────────┼─────────────────┤
│ FP16 │ ~4.0 GB │
│ INT8 │ ~2.0 GB │
│ BitNet (1.58b) │ ~0.4 GB │
└─────────────────┴─────────────────┘
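
The figures in this table follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. A short sketch, with a function name of my own:

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 2e9  # the 2B model size used in the table above

for name, bits in [("FP16", 16), ("INT8", 8), ("BitNet (1.58b)", 1.58)]:
    print(f"{name:<15} ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

This only counts the weights. The memory a running process reports will be higher, since it also includes runtime buffers, the KV cache, and any tensors kept at higher precision.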

Why BitNet Claims Speedup

BitNet’s 1.58-bit quantization reduces:

  • Memory bandwidth requirements (smaller model)
  • Memory footprint (can fit in faster cache)
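  • Computation (ternary weights let matrix multiplication use additions and subtractions instead of multiplications)

Token generation is usually limited by memory bandwidth, so most of the speedup there comes from reduced memory movement. The cheaper arithmetic matters more in compute-bound phases such as prompt processing.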
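
To illustrate why ternary weights allow cheaper operations, here is a toy dot product where every weight is -1, 0, or +1, so no multiplication is needed. This is only a conceptual sketch in plain Python. It is not how BitNet's optimized kernels are implemented, and the function name is my own.

```python
def ternary_dot(weights, activations):
    """Dot product with weights restricted to -1, 0, +1, using only add and subtract."""
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    total = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x       # +1 weight: add the activation
        elif w == -1:
            total -= x       # -1 weight: subtract the activation
        elif w != 0:
            raise ValueError(f"weight {w} is not ternary")
        # 0 weight: skip the activation entirely
    return total


weights = [1, 0, -1, 1]
activations = [0.5, 2.0, 1.5, -1.0]
reference = sum(w * x for w, x in zip(weights, activations))
print(ternary_dot(weights, activations), reference)
```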

Benchmarking Best Practices

  1. Run multiple times: Single runs can be affected by system state, so report the median or mean of several runs
  2. Warm up: The first run is often slower because caches are cold, so discard it
  3. Control environment: Close other applications
  4. Document everything: Hardware, OS, model version, parameters
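
The first two practices can be sketched in a few lines: drop the warm-up run, then summarize the rest. The readings below are hypothetical values made up for illustration, not measurements, and the function name is my own.

```python
import statistics


def summarize_runs(samples, warmup=1):
    """Drop the first `warmup` samples, then report median, mean, and spread."""
    measured = list(samples)[warmup:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs after warm-up")
    return {
        "runs": len(measured),
        "median": statistics.median(measured),
        "mean": statistics.fmean(measured),
        "stdev": statistics.stdev(measured),
    }


# Hypothetical tok/s readings from six runs; the first is a cold-cache warm-up.
readings = [38.9, 42.6, 43.1, 42.8, 42.4, 43.0]
summary = summarize_runs(readings)
print(summary)
```

A large standard deviation relative to the mean is a hint that something else on the machine is interfering with the runs.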

Comparing Against Baselines

To validate BitNet’s claims, compare against:

  • Standard FP16 model (if available)
  • INT8 quantized version
  • Other quantization methods (GPTQ, AWQ)

Summary

In this post, I covered how to benchmark BitNet inference performance using the built-in e2e_benchmark.py script. The key metrics are prompt processing speed (tokens/second for input) and token generation speed (tokens/second for output).

I found that:

  • 8 threads was optimal for my 8-core CPU (hyperthreading didn’t help)
  • Token generation is much slower than prompt processing in tokens per second (sequential vs parallel)
  • Different workloads need different benchmark configurations
  • Memory usage is significantly lower than with FP16 or INT8 models

The benchmark script lets you validate BitNet’s performance claims on your hardware and optimize your configuration. Without proper benchmarking, you can’t know if you’re getting the performance you should be.

Final Words

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.