
BitNet I2_S vs TL1 vs TL2: Which Kernel Should You Use?

I picked the wrong kernel and my build failed

I was trying to run a BitNet quantized model on my machine when I hit this error:

build error
RuntimeError: Unsupported quantization type: tl1
Supported types for x86_64: ['i2_s', 'tl2']

Turns out, I had picked the wrong kernel type for my CPU architecture. BitNet has three kernel types, and they’re not interchangeable - each is optimized for specific hardware.

What are BitNet kernel types?

BitNet uses 1.58-bit quantization (ternary weights: each weight is -1, 0, or +1) to run LLMs efficiently on CPUs. The three kernel types handle the matrix multiplication differently:

Kernel | ARM Support | x86 Support | Best For
I2_S   | Yes         | Yes         | Compatibility
TL1    | Yes         | No          | ARM performance
TL2    | No          | Yes         | x86 performance

The kernel type determines how the quantized weights are processed during inference. This isn’t just a naming convention - it fundamentally changes the computation strategy.

I2_S: The Universal Option

I2_S is your safest choice. It works on both ARM and x86 architectures because it integrates directly with llama.cpp’s ggml library.

I2_S architecture overview
+------------------+
|  llama.cpp/ggml  |
+------------------+
         |
 +-------v-------+
 |   I2_S GEMM   |  <-- General Matrix Multiply
 |   I2_S GEMV   |  <-- General Matrix-Vector
 +---------------+
         |
 +-------v-------+
 |  ARM or x86   |  <-- Works on both!
 +---------------+

When to use I2_S:

  • You want maximum compatibility
  • You’re deploying across different CPU types
  • You don’t need the absolute best performance

The setup is straightforward - just specify i2_s as your quantization type:

building with i2_s
python setup_env.py --quant-type i2_s

TL1: ARM Optimized

TL1 is designed specifically for ARM CPUs. It uses a tiling strategy that’s optimized for ARM’s vector processing capabilities.

How TL1 Tiling Works

TL1 breaks down matrix multiplication into tiles:

TL1 tiling strategy
Original weight matrix: (M, K)
        |
        v
+---+---+---+---+
|BM |BM |BM |...|  <-- M/BM blocks, each (BM, K)
+---+---+---+---+
        |
        v
  +---+---+---+
  |BK |BK |...|    <-- K/BK blocks per row, each (BM, BK)
  +---+---+---+
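To make the diagram concrete, here is a minimal sketch that counts and enumerates the tiles for a given shape. The function names and the example shape are my own illustration, not part of the BitNet code, and the sketch only covers the bookkeeping, not the actual kernel math.

```python
def tile_grid(M, K, BM, BK):
    """Return (row_blocks, col_blocks) for an (M, K) matrix cut into (BM, BK) tiles."""
    if M % BM != 0 or K % BK != 0:
        raise ValueError("M and K must be whole multiples of BM and BK")
    return M // BM, K // BK


def iter_tiles(M, K, BM, BK):
    """Yield the (row_start, col_start) offset of every tile, one row block at a time."""
    rows, cols = tile_grid(M, K, BM, BK)
    for r in range(rows):
        for c in range(cols):
            yield r * BM, c * BK


# Illustrative shape only: a (1536, 4096) weight matrix with BM=256, BK=128.
print(tile_grid(1536, 4096, 256, 128))            # (6, 32)
print(list(iter_tiles(1536, 4096, 256, 128))[:3])  # [(0, 0), (0, 128), (0, 256)]
```

Six row blocks of 32 tiles each means 192 tiles for this one matrix, which is why the choice of BM and BK has a visible effect on speed.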

TL1 Requirements

TL1 constraints
M % BM == 0      # M must be divisible by BM
K % BK == 0      # K must be divisible by BK
BM % bm == 0     # BM must be divisible by bm
bm in [32, 64]   # bm must be 32 or 64
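Before running codegen, I find it handy to check these rules up front. The helper below is a hypothetical sketch of mine (check_tl1 is not a BitNet function); it simply encodes the four constraints listed above and assumes all inputs are positive integers.

```python
def check_tl1(M, K, BM, BK, bm):
    """Return the list of violated TL1 constraints; an empty list means the config is valid."""
    problems = []
    if M % BM != 0:
        problems.append(f"M={M} is not divisible by BM={BM}")
    if K % BK != 0:
        problems.append(f"K={K} is not divisible by BK={BK}")
    if BM % bm != 0:
        problems.append(f"BM={BM} is not divisible by bm={bm}")
    if bm not in (32, 64):
        problems.append(f"bm={bm} must be 32 or 64")
    return problems


# Illustrative shapes only.
print(check_tl1(1536, 4096, 256, 128, 32))  # []
print(check_tl1(1536, 4096, 256, 100, 16))  # two problems: K/BK and bm
```

Returning all violations at once, instead of stopping at the first one, saves a few round trips when several parameters are off.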

TL1 Code Generation

To generate TL1 kernels for a model:

codegen for TL1
python utils/codegen_tl1.py \
    --model bitnet_b1_58-large \
    --BM 256,128,256 \
    --BK 128,64,128 \
    --bm 32,64,32

The BM, BK, and bm parameters are comma-separated values for different layer sizes. Tuning these values affects performance significantly.
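One way to read those comma-separated lists is column by column: the first value of each flag belongs to the first layer shape, the second to the second, and so on. The sketch below shows that pairing. TileConfig and parse_tile_args are names I made up for illustration; the real scripts do their own argument parsing.

```python
from typing import NamedTuple


class TileConfig(NamedTuple):
    BM: int
    BK: int
    bm: int


def parse_tile_args(BM, BK, bm):
    """Turn three comma-separated strings into one TileConfig per layer shape."""
    columns = [[int(v) for v in arg.split(",")] for arg in (BM, BK, bm)]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("--BM, --BK and --bm need the same number of values")
    return [TileConfig(*values) for values in zip(*columns)]


# The values from the TL1 command above.
for index, cfg in enumerate(parse_tile_args("256,128,256", "128,64,128", "32,64,32")):
    print(index, cfg)
```

Reading the flags this way makes it obvious that the second layer shape gets BM=128, BK=64, bm=64, which is easy to miss when scanning the command line.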

TL2: x86 Optimized

TL2 is the x86 counterpart. It has more complex requirements due to x86’s different vector processing characteristics.

The ThreeK/TwoK Split

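TL2’s main challenge: BK must be divisible by 6. When K is not a whole multiple of such a BK, TL2 splits the K dimension into two parts: threeK, which covers the whole multiples of BK, and twoK, which covers the remainder: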

TL2 computation split
 Total K dimension
         |
         v
+--------+--------+
| threeK |  twoK  |  <-- Split based on divisibility
+--------+--------+
    |         |
    v         v
+------+   +------+
| TL2  |   | TL1  |  <-- Different kernels for each part
+------+   +------+

This hybrid approach ensures all dimensions can be processed efficiently.
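Here is a small worked example of the split. It is a sketch under one assumption of mine: that threeK is the largest multiple of BK that fits into K and twoK is whatever is left over. The function name split_k and the example shapes are illustrative.

```python
def split_k(K, BK):
    """Split K into (three_k, two_k), assuming three_k is the largest multiple of BK <= K."""
    if BK % 6 != 0:
        raise ValueError(f"BK={BK} must be divisible by 6 for TL2")
    three_k = (K // BK) * BK  # part handled in whole BK-sized blocks
    two_k = K - three_k       # remainder handled separately
    return three_k, two_k


# Illustrative shapes only.
print(split_k(4096, 96))   # (4032, 64)
print(split_k(1536, 192))  # (1536, 0)
```

With K=4096 and BK=96, 42 full blocks cover 4032 columns and 64 columns are left for the twoK path. With K=1536 and BK=192 nothing is left over, so the twoK part is empty.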

TL2 Requirements

TL2 constraints
M % BM == 0        # M must be divisible by BM
K % BK % 32 == 0   # the remainder of K / BK must be divisible by 32
BM % bm == 0       # BM must be divisible by bm
bm in [32]         # bm must be 32 (unlike TL1, 64 is not allowed)
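The same kind of pre-flight check works for TL2. This sketch (check_tl2 is a hypothetical helper, not a BitNet function) encodes the four constraints above plus the divisible-by-6 rule for BK, and assumes positive integer inputs.

```python
def check_tl2(M, K, BM, BK, bm):
    """Return the list of violated TL2 constraints; an empty list means the config is valid."""
    problems = []
    if BK % 6 != 0:
        problems.append(f"BK={BK} is not divisible by 6")
    if M % BM != 0:
        problems.append(f"M={M} is not divisible by BM={BM}")
    if (K % BK) % 32 != 0:
        problems.append(f"K % BK = {K % BK} is not divisible by 32")
    if BM % bm != 0:
        problems.append(f"BM={BM} is not divisible by bm={bm}")
    if bm != 32:
        problems.append(f"bm={bm} must be 32")
    return problems


# Illustrative shapes only.
print(check_tl2(1536, 4096, 256, 96, 32))   # []
print(check_tl2(1536, 4096, 256, 100, 64))  # two problems: BK and bm
```

Note that K does not have to be a multiple of BK here: 4096 % 96 leaves 64, and 64 is divisible by 32, so the first config passes.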

TL2 Code Generation

codegen for TL2
python utils/codegen_tl2.py \
    --model bitnet_b1_58-large \
    --BM 256,128,256 \
    --BK 96,192,96 \
    --bm 32,32,32

Notice the BK values (96, 192, 96) are all divisible by 6 - that’s not a coincidence.

Which kernel should you pick?

Here’s my decision process:

kernel selection flowchart
              +-------------+
              |  What CPU?  |
              +------+------+
                     |
        +------------+------------+
        |                         |
        v                         v
   +---------+               +---------+
   |   ARM   |               |   x86   |
   +---------+               +---------+
        |                         |
        v                         v
   +---------+               +---------+
   |   TL1   |               |   TL2   |
   +---------+               +---------+
        |                         |
        +------------+------------+
                     |
                     v
              +-------------+
              |   Issues?   |
              +------+------+
                     |
        +------------+------------+
        |                         |
        v                         v
   +---------+               +---------+
   |   Yes   |               |   No    |
   +---------+               +---------+
        |                         |
        v                         v
 +-------------+           +-------------+
 |  Use I2_S   |           | You're done!|
 | (fallback)  |           +-------------+
 +-------------+
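The flowchart boils down to "try the fast kernel, then fall back". Here is a rough sketch of that logic. The build argument is a hypothetical callable that you would supply yourself (for example a wrapper that runs setup_env.py and raises when it fails); nothing here is part of BitNet.

```python
PREFERRED = {"arm64": "tl1", "x86_64": "tl2"}


def build_with_fallback(arch, build):
    """Try the architecture-specific kernel first, then I2_S; return the one that worked."""
    if arch not in PREFERRED:
        raise ValueError(f"unknown architecture: {arch}")
    errors = {}
    for quant in (PREFERRED[arch], "i2_s"):
        try:
            build(quant)  # hypothetical: raises if the build fails
            return quant
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors[quant] = exc
    raise RuntimeError(f"all kernels failed: {errors}")


def fake_build(quant):
    """Stand-in for a real build step; pretends the TL2 build is broken."""
    if quant == "tl2":
        raise RuntimeError("codegen failed")


print(build_with_fallback("x86_64", fake_build))  # i2_s
print(build_with_fallback("arm64", fake_build))   # tl1
```

Keeping the errors from each attempt makes it easier to see why the fast kernel was skipped instead of silently ending up on I2_S.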

Quick Reference

supported quant types by architecture
SUPPORTED_QUANT_TYPES = {
    "arm64": ["i2_s", "tl1"],   # TL1 for speed, I2_S for fallback
    "x86_64": ["i2_s", "tl2"],  # TL2 for speed, I2_S for fallback
}
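If you want to look this up from Python rather than by eye, something like the following works. It is a sketch of my own: the ALIASES table is an assumption I added because platform.machine() reports "aarch64" on ARM Linux and "AMD64" on Windows, neither of which appears as a key in the dictionary above.

```python
import platform

SUPPORTED_QUANT_TYPES = {
    "arm64": ["i2_s", "tl1"],
    "x86_64": ["i2_s", "tl2"],
}

# Assumed aliases: different operating systems name the same architecture differently.
ALIASES = {"arm64": "arm64", "aarch64": "arm64", "x86_64": "x86_64", "amd64": "x86_64"}


def supported_quant_types(machine=None):
    """Return the quant types for the given (or current) machine name."""
    name = (machine or platform.machine()).lower()
    arch = ALIASES.get(name)
    if arch is None:
        raise RuntimeError(f"Unrecognised architecture: {name}")
    return SUPPORTED_QUANT_TYPES[arch]


print(supported_quant_types("aarch64"))  # ['i2_s', 'tl1']
print(supported_quant_types("AMD64"))    # ['i2_s', 'tl2']
```

Calling supported_quant_types() with no argument checks the machine the script is running on.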

Performance Expectations

From my testing on various machines:

Scenario         | Kernel | Performance
M1/M2 Mac        | TL1    | Fastest
M1/M2 Mac        | I2_S   | ~20% slower
Intel/AMD CPU    | TL2    | Fastest
Intel/AMD CPU    | I2_S   | ~15-25% slower
Mixed deployment | I2_S   | Consistent across all

Common pitfalls

Pitfall 1: Wrong architecture kernel

error message example
# On an x86 machine, trying to use TL1
RuntimeError: Unsupported quantization type: tl1
Supported types for x86_64: ['i2_s', 'tl2']

Fix: Check your architecture first:

check architecture
uname -m
# arm64 or aarch64 -> use i2_s or tl1 (Linux on ARM reports aarch64)
# x86_64 -> use i2_s or tl2

Pitfall 2: Incorrect tiling parameters

When generating TL1 or TL2 kernels, the tiling parameters must satisfy specific constraints:

tiling parameter error
# Wrong: BK not divisible by 6 for TL2
python utils/codegen_tl2.py --BK 100,100,100
# Error: BK must be divisible by 6 for TL2

Pitfall 3: Using pretuned kernels incorrectly

BitNet provides pretuned kernel options. Using the wrong one can cause crashes:

pretuned kernel selection
# Check which flags your checkout supports (upstream BitNet exposes --use-pretuned, short form -p)
python setup_env.py --help
# Build with the pretuned kernel parameters for your model
python setup_env.py --quant-type tl1 --use-pretuned

Summary

BitNet’s three kernel types serve different purposes:

  • I2_S: Universal compatibility, works everywhere, moderate performance
  • TL1: ARM optimized, best performance on Apple Silicon and ARM servers
  • TL2: x86 optimized, best performance on Intel/AMD CPUs

Start with TL1 (ARM) or TL2 (x86) for maximum speed. If you hit errors, fall back to I2_S. Always check your architecture before choosing a kernel type.

The tiling parameters matter for performance, but the pretuned options usually work well for standard models. Only dive into custom tiling if you’re optimizing for a specific hardware configuration.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.
