How to Benchmark BitNet Performance: End-to-End Testing Guide
Problem
I set up BitNet on my machine and ran a few prompts. Everything seemed to work, but I had no idea if I was getting the performance I should be getting. The paper claims impressive speedups, but how do I verify that on my hardware?
I tried running some informal tests with a stopwatch, but the results were inconsistent. I needed a proper benchmarking methodology.
What BitNet Provides
BitNet includes a built-in benchmark script at utils/e2e_benchmark.py. Here’s how to use it:
python utils/e2e_benchmark.py -m /path/to/model.gguf -n 200 -p 256 -t 4

The parameters:
- -m: Path to your BitNet model file (gguf format)
- -n: Number of tokens to generate
- -p: Number of prompt tokens (input size)
- -t: Number of CPU threads to use
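If you want to drive the benchmark from Python rather than typing the command each time, the four flags above map directly onto an argument list. This is a minimal sketch: `build_benchmark_command` is a hypothetical helper (not part of BitNet), and it assumes you run it from the BitNet repository root so that the relative script path resolves.

```python
import shlex
import sys


def build_benchmark_command(model, n_tokens=128, n_prompt=128, threads=4,
                            script="utils/e2e_benchmark.py"):
    """Return the argv list for one e2e_benchmark.py run.

    Only the -m, -n, -p and -t flags described in this article are used.
    """
    return [
        sys.executable, script,
        "-m", str(model),
        "-n", str(n_tokens),
        "-p", str(n_prompt),
        "-t", str(threads),
    ]


cmd = build_benchmark_command("/path/to/model.gguf",
                              n_tokens=200, n_prompt=256, threads=4)
# Print a copy-pasteable shell command for the log.
print(shlex.join(cmd))

# To actually launch the benchmark (assumes the BitNet repo root as cwd):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when the model path contains spaces.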
My First Benchmark
I started with a basic test:
python utils/e2e_benchmark.py \
  -m models/bitnet_b1_58-2b-4t/ggml-model-i2_s.gguf \
  -n 128 \
  -p 128 \
  -t 8

The output looked like this:

Loading model...
Model loaded in 2.34 seconds

Running benchmark with 128 prompt tokens and 128 generated tokens...

Prompt processing: 156.23 tokens/second
Token generation: 42.87 tokens/second
Total time: 4.12 seconds

Memory usage: 1.2 GB

But what do these numbers mean? I needed to understand the metrics.
Understanding the Metrics
BitNet benchmarking produces two key metrics:
Prompt Processing (pp)
This measures how fast the model processes your input. It’s the “reading” speed.
Prompt Processing Speed = Input Tokens / Processing Time

Higher is better. This matters for:
- Long context inputs
- Document analysis
- Code review tasks
Token Generation (tg)
This measures how fast the model produces output. It’s the “writing” speed.
Token Generation Speed = Output Tokens / Generation Time

Higher is better. This matters for:
- Chat applications (user experience)
- Content generation
- Code completion
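The two formulas can be inverted to estimate how long a request takes from the reported speeds. The sketch below plugs in the numbers from my first benchmark (128 prompt tokens, 128 generated tokens); the function name is my own, and the estimate ignores model loading and any other per-run overhead.

```python
def estimated_latency(n_prompt, n_gen, pp_speed, tg_speed):
    """Estimate seconds spent in each phase from tokens/second figures.

    time = tokens / speed, applied separately to the prompt-processing
    and token-generation phases.
    """
    prompt_time = n_prompt / pp_speed
    gen_time = n_gen / tg_speed
    return prompt_time, gen_time, prompt_time + gen_time


# Speeds taken from the sample output earlier in this article.
prompt_s, gen_s, total_s = estimated_latency(128, 128, 156.23, 42.87)
print(f"prompt: {prompt_s:.2f} s, generation: {gen_s:.2f} s, total: {total_s:.2f} s")
print(f"generation share of total: {gen_s / total_s:.0%}")
```

With equal token counts, generation accounts for roughly three quarters of the estimated time, which is why the generation speed dominates how responsive a chat feels. The estimate comes out a little below the 4.12 seconds the benchmark reported; I assume the difference is overhead that the two rates do not capture.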
Why the Difference Matters
In every run I did, token generation was slower than prompt processing. This is expected:
┌─────────────────────────────────────────────────────────────┐
│                     INFERENCE PIPELINE                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PROMPT PROCESSING             TOKEN GENERATION             │
│  (Parallel computation)        (Sequential generation)      │
│                                                             │
│  [Token][Token][Token]         [Token] → [Token] → [Token]  │
│     ↓      ↓      ↓               ↓         ↓         ↓     │
│       All at once                     One at a time         │
│                                                             │
│  Fast: 150+ tokens/sec         Slower: 30-50 tokens/sec     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Prompt processing can use parallel computation. Token generation must be sequential because each new token depends on previous tokens.
Comparing Configurations
I ran multiple benchmarks to find the optimal configuration for my hardware.
Thread Count Impact
# Test with 4 threads
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 4

# Test with 8 threads
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 8

# Test with 16 threads
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 16

Results on my 8-core machine:

Threads | Prompt (tok/s) | Generation (tok/s) | Memory
--------|----------------|--------------------|--------
4       | 98.45          | 28.12              | 1.1 GB
8       | 156.23         | 42.87              | 1.2 GB
16      | 152.89         | 41.54              | 1.3 GB

The 16-thread run was actually slightly slower than 8 threads. This is because:
- My CPU has 8 physical cores (16 with hyperthreading)
- Hyperthreading doesn’t help much for this workload
- Extra threads add coordination overhead
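To put a number on how well the extra threads pay off, you can compute speedup and parallel efficiency relative to the smallest thread count. This is a small sketch using the figures from my table; `scaling` is a hypothetical helper, and efficiency here simply means speedup divided by the increase in thread count.

```python
# threads -> (prompt tok/s, generation tok/s), copied from the table above
results = {4: (98.45, 28.12), 8: (156.23, 42.87), 16: (152.89, 41.54)}


def scaling(results, baseline_threads):
    """Return speedup and efficiency per thread count, relative to a baseline."""
    base_pp, base_tg = results[baseline_threads]
    rows = []
    for threads in sorted(results):
        pp, tg = results[threads]
        thread_ratio = threads / baseline_threads
        rows.append({
            "threads": threads,
            "pp_speedup": pp / base_pp,
            "tg_speedup": tg / base_tg,
            # 1.0 would mean generation speed scaled perfectly with threads
            "tg_efficiency": (tg / base_tg) / thread_ratio,
        })
    return rows


rows = scaling(results, baseline_threads=4)
for row in rows:
    print(f"{row['threads']:>2} threads: "
          f"pp x{row['pp_speedup']:.2f}, "
          f"tg x{row['tg_speedup']:.2f}, "
          f"tg efficiency {row['tg_efficiency']:.0%}")
```

Doubling from 4 to 8 threads gives roughly a 1.5x gain in generation speed, while going to 16 threads gives no further gain, so efficiency drops sharply.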
Prompt/Generation Ratios
Different workloads have different characteristics:
┌─────────────────┬────────────────┬───────────────────────────────┐
│ Configuration   │ Prompt/Gen     │ Use Case                      │
├─────────────────┼────────────────┼───────────────────────────────┤
│ pp128 / tg128   │ 128 / 128      │ Standard benchmark (balanced) │
│ pp512 / tg128   │ 512 / 128      │ Long context (RAG, documents) │
│ pp64 / tg512    │ 64 / 512       │ Generation-heavy (writing)    │
│ pp1024 / tg64   │ 1024 / 64      │ Analysis-heavy (code review)  │
└─────────────────┴────────────────┴───────────────────────────────┘

I ran tests with different ratios:

# Balanced workload
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 128 -t 8

# Long context workload
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 128 -p 512 -t 8

# Generation-heavy workload
python utils/e2e_benchmark.py -m models/bitnet.gguf -n 512 -p 64 -t 8

Using Dummy Models for Testing
What if you don’t have access to the full BitNet model? BitNet provides a dummy model generator:
python utils/generate-dummy-bitnet-model.py \
  models/bitnet_b1_58-large \
  --outfile models/dummy-bitnet-125m.tl1.gguf \
  --outtype tl1 \
  --model-size 125M

Then benchmark the dummy model:

python utils/e2e_benchmark.py \
  -m models/dummy-bitnet-125m.tl1.gguf \
  -p 512 \
  -n 128

Dummy models are useful for:
- Testing kernel implementations
- Validating architecture support
- Comparing CPU optimizations
- CI/CD pipeline validation
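For the CI/CD case, one option is to compare the measured throughput against a stored baseline and fail the job when it drops too far. The sketch below is only an illustration: the function name, the metric keys, the 10% tolerance, and all the numbers are assumptions of mine, not something BitNet ships.

```python
def check_throughput(measured, baseline, tolerance=0.10):
    """Return a list of failure messages; an empty list means the check passed.

    A metric fails when it is missing or more than `tolerance` (a fraction)
    below its baseline value. Higher throughput never fails.
    """
    failures = []
    for metric, expected in baseline.items():
        actual = measured.get(metric)
        if actual is None:
            failures.append(f"{metric}: missing from measured results")
        elif actual < expected * (1.0 - tolerance):
            failures.append(
                f"{metric}: {actual:.2f} tok/s is more than "
                f"{tolerance:.0%} below baseline {expected:.2f} tok/s"
            )
    return failures


# Hypothetical numbers for a dummy model on a CI runner.
baseline = {"prompt_tok_s": 900.0, "generation_tok_s": 250.0}
measured = {"prompt_tok_s": 880.0, "generation_tok_s": 210.0}

failures = check_throughput(measured, baseline)
for message in failures:
    print("FAIL", message)
# In a CI job you would exit non-zero when `failures` is not empty.
```

Because shared CI runners are noisy, a generous tolerance and a baseline recorded on the same runner type tend to be more useful than tight thresholds.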
Interpreting Results
After running multiple benchmarks, I created a comparison table:
┌──────────────────┬─────────────────┬──────────────────┬───────────┐
│ Hardware         │ Prompt (tok/s)  │ Gen (tok/s)      │ Config    │
├──────────────────┼─────────────────┼──────────────────┼───────────┤
│ Intel i7-13800H  │ 180-220         │ 48-55            │ 8 threads │
│ AMD EPYC 7V13    │ 210-260         │ 52-60            │ 16 thr    │
│ Cobalt 100       │ 320-380         │ 72-85            │ AVX-512   │
│ My machine       │ 156             │ 43               │ 8 thr     │
└──────────────────┴─────────────────┴──────────────────┴───────────┘

My results were reasonable for my hardware class. If your numbers are significantly lower than expected, check:
- CPU governor settings (use performance mode)
- Memory bandwidth (close other apps)
- Thermal throttling (check CPU temperature)
- NUMA configuration on multi-socket systems
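On Linux, the governor check can be scripted. The sketch below assumes the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); `read_governors` is a hypothetical helper name. On systems without cpufreq (other operating systems, some VMs and containers) it simply finds nothing.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU exposing a cpufreq governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        # path looks like .../cpu0/cpufreq/scaling_governor
        governors[path.parts[-3]] = path.read_text().strip()
    return governors


governors = read_governors()
slow = {cpu: gov for cpu, gov in governors.items() if gov != "performance"}

if not governors:
    print("No cpufreq information found (not Linux, or no cpufreq driver).")
elif slow:
    print("CPUs not using the performance governor:", slow)
else:
    print("All CPUs use the performance governor.")
```

Running this before each benchmark session and recording the result alongside the numbers makes later comparisons easier to trust.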
Creating a Benchmark Script
I automated my testing with a simple script:
#!/bin/bash

MODEL="models/bitnet_b1_58-2b-4t/ggml-model-i2_s.gguf"
THREADS="4 8 16"

echo "BitNet Benchmark Suite"
echo "======================"
echo "Model: $MODEL"
echo "Date: $(date)"
echo ""

for t in $THREADS; do
  echo "--- Testing with $t threads ---"
  python utils/e2e_benchmark.py -m "$MODEL" -n 128 -p 128 -t "$t"
  echo ""
done

echo "Benchmark complete."

Run it:
chmod +x benchmark_all.sh
./benchmark_all.sh | tee benchmark_results.txt

Memory Usage Monitoring
Performance isn’t just about speed. Memory matters too:
# Run benchmark and monitor memory
/usr/bin/time -v python utils/e2e_benchmark.py \
  -m models/bitnet.gguf \
  -n 128 -p 128 -t 8

# Look for:
# Maximum resident set size (memory used)

BitNet’s 1.58-bit quantization should show significantly lower memory usage than standard models:

┌─────────────────┬─────────────────┐
│ Model           │ Memory (2B)     │
├─────────────────┼─────────────────┤
│ FP16            │ ~4.0 GB         │
│ INT8            │ ~2.0 GB         │
│ BitNet (1.58b)  │ ~0.4 GB         │
└─────────────────┴─────────────────┘

Related Knowledge
Why BitNet Claims Speedup
BitNet’s 1.58-bit quantization reduces:
- Memory bandwidth requirements (smaller model)
- Memory footprint (can fit in faster cache)
- Computation (ternary weights enable faster operations)
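Most of the speedup comes from reduced memory movement: token generation is largely limited by how fast the weights can be read from memory. The cheaper arithmetic on ternary weights helps too, but it is the smaller effect.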
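The memory figures quoted earlier for a 2B model follow from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below reproduces them and adds a rough upper bound on generation speed, on the assumption that every weight is read once per generated token. The 50 GB/s bandwidth is a made-up example value, not a measurement, and real throughput will be lower because of the KV cache, activations, and compute time.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def max_tokens_per_second(weight_gb, bandwidth_gb_s):
    """Bandwidth-bound ceiling: each token requires reading all weights once."""
    return bandwidth_gb_s / weight_gb


N_PARAMS = 2e9          # 2B-parameter model, as in the table
BANDWIDTH_GB_S = 50.0   # hypothetical memory bandwidth, for illustration only

for name, bits in [("FP16", 16), ("INT8", 8), ("BitNet (1.58b)", 1.58)]:
    size = weight_memory_gb(N_PARAMS, bits)
    ceiling = max_tokens_per_second(size, BANDWIDTH_GB_S)
    print(f"{name:<15} weights ~{size:.2f} GB, "
          f"ceiling ~{ceiling:.0f} tok/s at {BANDWIDTH_GB_S:.0f} GB/s")
```

The weights-only size is a lower bound on what a tool like /usr/bin/time reports, since the process also holds the KV cache, activations, and runtime buffers, and on-disk formats may pack ternary weights less tightly than 1.58 bits.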
Benchmarking Best Practices
- Run multiple times: Single runs can be affected by system state
- Warm up: First run often slower due to cold cache
- Control environment: Close other applications
- Document everything: Hardware, OS, model version, parameters
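The first two practices are easy to apply when summarizing results: drop the warm-up run, then report the median and spread rather than a single number. A minimal sketch, with a hypothetical helper name and invented sample values:

```python
import statistics


def summarize_runs(samples, warmup=1):
    """Summarize tokens/second samples after discarding warm-up runs."""
    measured = list(samples)[warmup:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs after warm-up")
    return {
        "runs": len(measured),
        "median": statistics.median(measured),
        "mean": statistics.fmean(measured),
        "stdev": statistics.stdev(measured),
        "min": min(measured),
        "max": max(measured),
    }


# Hypothetical generation speeds from five consecutive runs;
# the first one is the cold-cache warm-up and gets discarded.
tg_samples = [38.9, 42.6, 42.9, 43.1, 42.7]
summary = summarize_runs(tg_samples, warmup=1)
print(f"{summary['runs']} runs: median {summary['median']:.2f} tok/s, "
      f"stdev {summary['stdev']:.2f}, "
      f"range {summary['min']:.1f}-{summary['max']:.1f}")
```

The median is less sensitive than the mean to one run being disturbed by a background process, which is why I prefer it as the headline figure.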
Comparing Against Baselines
To validate BitNet’s claims, compare against:
- Standard FP16 model (if available)
- INT8 quantized version
- Other quantization methods (GPTQ, AWQ)
Summary
In this post, I covered how to benchmark BitNet inference performance using the built-in e2e_benchmark.py script. The key metrics are prompt processing speed (tokens/second for input) and token generation speed (tokens/second for output).
I found that:
- 8 threads was optimal for my 8-core CPU (hyperthreading didn’t help)
- Token generation was slower than prompt processing in every run (sequential vs parallel)
- Different workloads need different benchmark configurations
- Memory usage is significantly lower than for FP16 or INT8 models
The benchmark script lets you validate BitNet’s performance claims on your hardware and optimize your configuration. Without proper benchmarking, you can’t know if you’re getting the performance you should be.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can email me.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 BitNet b1.58 Official Repository
- 👨‍💻 llama.cpp Benchmarking Guide
- 👨‍💻 Understanding LLM Inference Latency
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!