BitNet I2_S vs TL1 vs TL2: Which Kernel Should You Use?
I picked the wrong kernel and my build failed
I was trying to run a BitNet quantized model on my machine when I hit this error:

    RuntimeError: Unsupported quantization type: tl1
    Supported types for x86_64: ['i2_s', 'tl2']

Turns out, I had picked the wrong kernel type for my CPU architecture. BitNet has three kernel types, and they’re not interchangeable - each is optimized for specific hardware.
What are BitNet kernel types?
BitNet uses 1.58-bit quantization to run LLMs efficiently on CPUs. The three kernel types handle the matrix multiplication differently:
| Kernel | ARM Support | x86 Support | Best For |
|---|---|---|---|
| I2_S | Yes | Yes | Compatibility |
| TL1 | Yes | No | ARM performance |
| TL2 | No | Yes | x86 performance |
The kernel type determines how the quantized weights are processed during inference. This isn’t just a naming convention - it fundamentally changes the computation strategy.
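For context, the "1.58 bits" figure is log2(3): each weight takes one of three values (-1, 0, +1). Here is a minimal pure-Python sketch of why that matters for the kernels. It is not how I2_S, TL1, or TL2 are implemented; the function name `ternary_gemv` is made up for illustration. It only shows that a matrix-vector product with ternary weights needs additions and subtractions, no multiplications.

```python
import math

# Three possible weight values carry log2(3) bits of information each.
print(round(math.log2(3), 3))  # 1.585


def ternary_gemv(W, x):
    """Compute y = W @ x for W with entries in {-1, 0, 1} using only add/subtract."""
    y = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing
        y.append(acc)
    return y


W = [[1, 0, -1], [-1, 1, 1]]
x = [0.5, 2.0, -1.0]
print(ternary_gemv(W, x))  # [1.5, 0.5]
```

The real kernels differ in how they pack these ternary weights and batch the adds, which is where the tiling and lookup-table details in the sections below come in.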
I2_S: The Universal Option
I2_S is your safest choice. It works on both ARM and x86 architectures because it integrates directly with llama.cpp’s ggml library.

    +------------------+
    |  llama.cpp/ggml  |
    +------------------+
             |
     +-------v-------+
     |   I2_S GEMM   |  <-- General Matrix Multiply
     |   I2_S GEMV   |  <-- General Matrix-Vector
     +---------------+
             |
     +-------v-------+
     |  ARM or x86   |  <-- Works on both!
     +---------------+

When to use I2_S:
- You want maximum compatibility
- You’re deploying across different CPU types
- You don’t need the absolute best performance
The setup is straightforward - just specify i2_s as your quantization type:

    python setup_env.py --quant-type i2_s

TL1: ARM Optimized
TL1 is designed specifically for ARM CPUs. It uses a tiling strategy that’s optimized for ARM’s vector processing capabilities.
How TL1 Tiling Works
TL1 breaks down matrix multiplication into tiles:

    Original weight matrix: (M, K)
            |
            v
    +---+---+---+---+
    |BM |BM |BM |...|   <-- M/BM blocks, each (BM, K)
    +---+---+---+---+
            |
            v
    +---+---+---+
    |BK |BK |...|       <-- K/BK blocks per row, each (BM, BK)
    +---+---+---+

TL1 Requirements

    M % BM == 0      # M must be divisible by BM
    K % BK == 0      # K must be divisible by BK
    BM % bm == 0     # BM must be divisible by bm
    bm in [32, 64]   # bm must be 32 or 64

TL1 Code Generation
To generate TL1 kernels for a model:

    python utils/codegen_tl1.py \
        --model bitnet_b1_58-large \
        --BM 256,128,256 \
        --BK 128,64,128 \
        --bm 32,64,32

The BM, BK, and bm parameters each take a comma-separated list, with one value per layer size. Tuning these values affects performance significantly.
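Before running the code generator, it can help to check a candidate tiling against the TL1 requirements listed above. This is a minimal sketch, not part of BitNet: `check_tl1` and `parse_tiles` are hypothetical helper names, and the (M, K) shapes are illustrative placeholders that you would replace with your model's actual weight shapes.

```python
def check_tl1(M, K, BM, BK, bm):
    """Return a list of violated TL1 constraints (an empty list means valid)."""
    problems = []
    if M % BM != 0:
        problems.append(f"M={M} is not divisible by BM={BM}")
    if K % BK != 0:
        problems.append(f"K={K} is not divisible by BK={BK}")
    if BM % bm != 0:
        problems.append(f"BM={BM} is not divisible by bm={bm}")
    if bm not in (32, 64):
        problems.append(f"bm={bm} must be 32 or 64")
    return problems


def parse_tiles(arg):
    """Turn a comma-separated CLI value such as '256,128,256' into a list of ints."""
    return [int(v) for v in arg.split(",")]


# Placeholder (M, K) shapes, one per tiling entry. These are assumptions for
# the example, not values read from any model.
shapes = [(1536, 4096), (1536, 1536), (4096, 1536)]
BMs = parse_tiles("256,128,256")
BKs = parse_tiles("128,64,128")
bms = parse_tiles("32,64,32")

for (M, K), BM, BK, bm in zip(shapes, BMs, BKs, bms):
    problems = check_tl1(M, K, BM, BK, bm)
    print((M, K), "OK" if not problems else problems)
```

Returning a list of problems instead of raising on the first one makes it easier to see every constraint a bad tiling breaks in a single pass.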
TL2: x86 Optimized
TL2 is the x86 counterpart. It has more complex requirements due to x86’s different vector processing characteristics.
The ThreeK/TwoK Split
TL2’s main challenge: BK must be divisible by 6. When K doesn’t satisfy this, TL2 splits the computation:

    Total K dimension
            |
            v
    +--------+--------+
    | threeK |  twoK  |   <-- Split based on divisibility
    +--------+--------+
        |         |
        v         v
    +------+  +------+
    | TL2  |  | TL1  |    <-- Different kernels for each part
    +------+  +------+

This hybrid approach ensures all dimensions can be processed efficiently.
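To make the split concrete, here is a small sketch. The exact rule is my assumption, since it isn't spelled out above: I take threeK to be the largest multiple of BK that fits in K and twoK to be whatever is left over. `split_k` is a hypothetical helper name, not a BitNet function.

```python
def split_k(K, BK):
    """Split K into (three_k, two_k).

    Assumption: three_k is the largest multiple of BK not exceeding K,
    and two_k is the remainder.
    """
    three_k = (K // BK) * BK
    two_k = K - three_k
    return three_k, two_k


for K, BK in [(4096, 96), (1536, 192)]:
    three_k, two_k = split_k(K, BK)
    print(f"K={K}, BK={BK}: threeK={three_k}, twoK={two_k}")
# K=4096, BK=96: threeK=4032, twoK=64
# K=1536, BK=192: threeK=1536, twoK=0
```

Under this reading, twoK is exactly `K % BK`, so a constraint on `K % BK` is really a constraint on the size of the leftover part.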
TL2 Requirements

    M % BM == 0        # M must be divisible by BM
    K % BK % 32 == 0   # The remainder of K / BK must be divisible by 32
    BM % bm == 0       # BM must be divisible by bm
    bm in [32]         # bm must be 32 (TL1 also allows 64, TL2 does not)

TL2 Code Generation

    python utils/codegen_tl2.py \
        --model bitnet_b1_58-large \
        --BM 256,128,256 \
        --BK 96,192,96 \
        --bm 32,32,32

Notice the BK values (96, 192, 96) are all divisible by 6 - that’s not a coincidence.
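The TL2 rules can be checked the same way as the TL1 ones. This is a sketch under the constraints stated in this article (including the BK-divisible-by-6 rule); `check_tl2` is a hypothetical helper, and the (M, K) values in the example calls are placeholders rather than shapes read from a model.

```python
def check_tl2(M, K, BM, BK, bm):
    """Return a list of violated TL2 constraints (an empty list means valid)."""
    problems = []
    if M % BM != 0:
        problems.append(f"M={M} is not divisible by BM={BM}")
    if (K % BK) % 32 != 0:
        problems.append(f"K % BK = {K % BK} is not divisible by 32")
    if BM % bm != 0:
        problems.append(f"BM={BM} is not divisible by bm={bm}")
    if bm != 32:
        problems.append(f"bm={bm} must be 32")
    if BK % 6 != 0:
        problems.append(f"BK={BK} is not divisible by 6")
    return problems


# Placeholder shape (M=1536, K=4096) with the first tiling from the command above.
print(check_tl2(1536, 4096, BM=256, BK=96, bm=32))   # []

# Same shape with BK=100: the leftover 96 is fine, but 100 is not divisible by 6.
print(check_tl2(1536, 4096, BM=256, BK=100, bm=32))
```

Note that `K % BK` does not have to be zero for TL2, unlike TL1. It only has to be a multiple of 32.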
Which kernel should you pick?
Here’s my decision process:

                  +-------------+
                  |  What CPU?  |
                  +------+------+
                         |
            +------------+------------+
            |                         |
            v                         v
       +---------+               +---------+
       |   ARM   |               |   x86   |
       +---------+               +---------+
            |                         |
            v                         v
       +---------+               +---------+
       |   TL1   |               |   TL2   |
       +---------+               +---------+
            |                         |
            +------------+------------+
                         |
                         v
                  +-------------+
                  |   Issues?   |
                  +------+------+
                         |
            +------------+------------+
            |                         |
            v                         v
       +---------+               +---------+
       |   Yes   |               |   No    |
       +---------+               +---------+
            |                         |
            v                         v
    +-------------+            +-------------+
    |  Use I2_S   |            | You're done!|
    | (fallback)  |            +-------------+
    +-------------+

Quick Reference

    SUPPORTED_QUANT_TYPES = {
        "arm64": ["i2_s", "tl1"],   # TL1 for speed, I2_S for fallback
        "x86_64": ["i2_s", "tl2"],  # TL2 for speed, I2_S for fallback
    }

Performance Expectations
From my testing on various machines:
| Scenario | Kernel | Performance |
|---|---|---|
| M1/M2 Mac | TL1 | Fastest |
| M1/M2 Mac | I2_S | ~20% slower |
| Intel/AMD CPU | TL2 | Fastest |
| Intel/AMD CPU | I2_S | ~15-25% slower |
| Mixed deployment | I2_S | Consistent across all |
Common pitfalls
Pitfall 1: Wrong architecture kernel

    # On x86 machine trying to use TL1
    RuntimeError: Unsupported quantization type: tl1
    Supported types for x86_64: ['i2_s', 'tl2']

Fix: Check your architecture first:

    uname -m
    # arm64 -> use i2_s or tl1
    # x86_64 -> use i2_s or tl2

Pitfall 2: Incorrect tiling parameters
When generating TL1 or TL2 kernels, the tiling parameters must satisfy specific constraints:

    # Wrong: BK not divisible by 6 for TL2
    python utils/codegen_tl2.py --BK 100,100,100
    # Error: BK must be divisible by 6 for TL2

Pitfall 3: Using pretuned kernels incorrectly
BitNet provides pretuned kernel options. Using the wrong one can cause crashes:

    # Check available pretuned options
    python setup_env.py --list-pretuned

    # Use the correct one for your model
    python setup_env.py --quant-type tl1 --pretuned large

Summary
BitNet’s three kernel types serve different purposes:
- I2_S: Universal compatibility, works everywhere, moderate performance
- TL1: ARM optimized, best performance on Apple Silicon and ARM servers
- TL2: x86 optimized, best performance on Intel/AMD CPUs
Start with TL1 (ARM) or TL2 (x86) for maximum speed. If you hit errors, fall back to I2_S. Always check your architecture before choosing a kernel type.
The tiling parameters matter for performance, but the pretuned options usually work well for standard models. Only dive into custom tiling if you’re optimizing for a specific hardware configuration.
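The "fastest kernel first, I2_S as fallback" rule can be sketched in a few lines of Python. This is an illustration, not BitNet code: `kernel_candidates`, `normalize_arch`, and the alias table are my own assumptions. The aliases are there because `platform.machine()` reports `aarch64` on ARM Linux and `AMD64` on x86 Windows, while the quick reference above uses `arm64` and `x86_64`.

```python
import platform

# Ordered by preference: optimized kernel first, I2_S as the fallback.
KERNELS_BY_ARCH = {
    "arm64": ["tl1", "i2_s"],
    "x86_64": ["tl2", "i2_s"],
}

# Assumed aliases for names that platform.machine() may return.
ARCH_ALIASES = {"aarch64": "arm64", "amd64": "x86_64"}


def normalize_arch(machine):
    """Map a raw machine string to one of the keys in KERNELS_BY_ARCH."""
    name = machine.lower()
    return ARCH_ALIASES.get(name, name)


def kernel_candidates(machine=None):
    """Return kernel types to try, in order, for the given or current machine."""
    arch = normalize_arch(machine if machine is not None else platform.machine())
    if arch not in KERNELS_BY_ARCH:
        raise ValueError(f"No kernel listed for architecture {arch!r}")
    return KERNELS_BY_ARCH[arch]


print(kernel_candidates("x86_64"))   # ['tl2', 'i2_s']
print(kernel_candidates("aarch64"))  # ['tl1', 'i2_s']
```

Calling `kernel_candidates()` with no argument uses the machine the script runs on. A setup script could then try each candidate in order and stop at the first one that builds.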
Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can reach me by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 BitNet GitHub Repository
- 👨‍💻 BitNet Paper
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!