BitNet I2_S vs TL1 vs TL2: Which Kernel Should You Use?

Mar 19, 2026

I picked the wrong kernel and my build failed

I was trying to run a BitNet quantized model on my machine when I hit this error:

RuntimeError: Unsupported quantization type: tl1
Supported types for x86_64: ['i2_s', 'tl2']

Turns out, I had picked the wrong kernel type for my CPU architecture. BitNet has three kernel types, and they’re not interchangeable - each is optimized for specific hardware.

What are BitNet kernel types?

BitNet uses 1.58-bit quantization to run LLMs efficiently on CPUs. The three kernel types handle the matrix multiplication differently:

Kernel	ARM Support	x86 Support	Best For
I2_S	Yes	Yes	Compatibility
TL1	Yes	No	ARM performance
TL2	No	Yes	x86 performance

The kernel type determines how the quantized weights are processed during inference. This isn’t just a naming convention - it fundamentally changes the computation strategy.
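
As a quick aside on where the "1.58-bit" figure comes from: BitNet b1.58 weights are ternary (each weight is -1, 0, or +1), and a symbol with three possible values carries log2(3) bits of information. The snippet below only checks that arithmetic; the helper name is my own and not part of BitNet.

```python
import math

def bits_per_symbol(num_values: int) -> float:
    """Information content, in bits, of one symbol with num_values equally likely states."""
    return math.log2(num_values)

# Ternary weights take one of three values: -1, 0, +1.
ternary_bits = bits_per_symbol(3)
print(f"{ternary_bits:.2f} bits per ternary weight")  # 1.58 bits per ternary weight
```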

I2_S: The Universal Option

I2_S is your safest choice. It works on both ARM and x86 architectures because it integrates directly with llama.cpp’s ggml library.

+------------------+
|  llama.cpp/ggml  |
+------------------+
        |
+-------v-------+
|  I2_S GEMM    |  <-- General Matrix Multiply
|  I2_S GEMV    |  <-- General Matrix-Vector
+---------------+
        |
+-------v-------+
|   ARM or x86  |  <-- Works on both!
+---------------+

When to use I2_S:

You want maximum compatibility
You’re deploying across different CPU types
You don’t need the absolute best performance

The setup is straightforward - just specify i2_s as your quantization type:

python setup_env.py --quant-type i2_s

TL1: ARM Optimized

TL1 is designed specifically for ARM CPUs. It uses a tiling strategy that’s optimized for ARM’s vector processing capabilities.

How TL1 Tiling Works

TL1 breaks down matrix multiplication into tiles:

Original weight matrix: (M, K)
                |
                v
+---+---+---+---+
|BM |BM |BM |...|  <-- M/BM blocks, each (BM, K)
+---+---+---+---+
                |
                v
+---+---+---+
|BK |BK |...|  <-- K/BK blocks per row, each (BM, BK)
+---+---+---+
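
To make the blocking concrete, here is a minimal pure-Python sketch of the loop structure in the diagram above. It only illustrates how an (M, K) matrix is walked in (BM, BK) tiles; the real TL1 kernels are generated, table-lookup based code, and the function names here are hypothetical.

```python
def iter_tiles(M: int, K: int, BM: int, BK: int):
    """Yield (row_start, col_start) for each (BM, BK) tile of an (M, K) matrix."""
    if M % BM or K % BK:
        raise ValueError("M must be a multiple of BM and K a multiple of BK")
    for row_start in range(0, M, BM):        # M / BM row blocks
        for col_start in range(0, K, BK):    # K / BK tiles per row block
            yield row_start, col_start

def tiled_matvec(weights, x, BM: int, BK: int):
    """Compute weights @ x tile by tile, accumulating partial sums per output row."""
    M, K = len(weights), len(x)
    out = [0] * M
    for r0, c0 in iter_tiles(M, K, BM, BK):
        for r in range(r0, r0 + BM):
            out[r] += sum(weights[r][c] * x[c] for c in range(c0, c0 + BK))
    return out

# Tiny ternary example: a 4x4 weight matrix split into 2x2 tiles.
W = [[1, 0, -1, 1],
     [0, 1, 1, -1],
     [-1, -1, 0, 1],
     [1, 1, 1, 0]]
x = [2, 3, 5, 7]
naive = [sum(w * v for w, v in zip(row, x)) for row in W]
assert tiled_matvec(W, x, BM=2, BK=2) == naive
```

The tiled result matches the naive product because each tile only contributes a partial sum to its own output rows.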

TL1 Requirements

M  % BM  == 0      # M must be a multiple of BM
K  % BK  == 0      # K must be a multiple of BK
BM % bm  == 0      # BM must be a multiple of bm
bm in [32, 64]     # bm must be 32 or 64
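
If you want to sanity-check a set of parameters before running the code generator, the four TL1 rules above translate directly into a small checker. This is my own helper, not something shipped with BitNet, and the layer shape in the example is an assumption chosen for illustration.

```python
def check_tl1(M: int, K: int, BM: int, BK: int, bm: int) -> list[str]:
    """Return the violated TL1 constraints (an empty list means the parameters pass)."""
    problems = []
    if M % BM != 0:
        problems.append(f"M={M} is not a multiple of BM={BM}")
    if K % BK != 0:
        problems.append(f"K={K} is not a multiple of BK={BK}")
    if BM % bm != 0:
        problems.append(f"BM={BM} is not a multiple of bm={bm}")
    if bm not in (32, 64):
        problems.append(f"bm={bm} must be 32 or 64")
    return problems

# Hypothetical layer shape (M, K) = (1536, 4096), used only for illustration.
print(check_tl1(M=1536, K=4096, BM=256, BK=128, bm=32))   # []
for problem in check_tl1(M=1536, K=4096, BM=256, BK=100, bm=48):
    print("TL1 constraint violated:", problem)
```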

TL1 Code Generation

To generate TL1 kernels for a model:

python utils/codegen_tl1.py \
    --model bitnet_b1_58-large \
    --BM 256,128,256 \
    --BK 128,64,128 \
    --bm 32,64,32

BM, BK, and bm each take a comma-separated list with one value per layer size, so the value at a given position in each list describes the same layer. Tuning these values affects performance significantly.
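
Here is how I read those three arguments, as a rough sketch: the values line up by position, one triple per layer size. This is not the script's actual parsing code, and the requirement that all three lists have the same length is my assumption.

```python
def parse_tile_lists(BM: str, BK: str, bm: str) -> list[dict[str, int]]:
    """Turn three comma-separated strings into one {BM, BK, bm} dict per layer size."""
    columns = [[int(v) for v in s.split(",")] for s in (BM, BK, bm)]
    if len({len(c) for c in columns}) != 1:
        # Assumption: every list must describe the same number of layer sizes.
        raise ValueError("BM, BK, and bm must list the same number of values")
    return [dict(zip(("BM", "BK", "bm"), triple)) for triple in zip(*columns)]

# Same values as the command above.
for i, cfg in enumerate(parse_tile_lists("256,128,256", "128,64,128", "32,64,32")):
    print(f"layer size {i}: {cfg}")
```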

TL2: x86 Optimized

TL2 is the x86 counterpart. It has more complex requirements due to x86’s different vector processing characteristics.

The ThreeK/TwoK Split

TL2’s main constraint is that BK must be divisible by 6. When K is not a whole multiple of BK, TL2 splits the computation along K:

Total K dimension
        |
        v
+--------+--------+
| threeK | twoK   |  <-- Split based on divisibility
+--------+--------+
    |        |
    v        v
+------+  +------+
| TL2  |  | TL1  |  <-- Different kernels for each part
+------+  +------+

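This hybrid approach sends the bulk of K (threeK) through the TL2 kernel and the leftover part (twoK) through a TL1-style kernel, so K does not have to divide evenly into BK-sized blocks.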
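
A minimal sketch of that split, assuming threeK is the largest multiple of BK that fits in K and twoK is whatever is left over. That assumption is my reading of the requirements rather than something taken from the generator's source, and the function and variable names are hypothetical.

```python
def split_k(K: int, BK: int) -> tuple[int, int]:
    """Split K into (three_k, two_k): whole BK blocks first, then the remainder."""
    if BK % 6 != 0:
        raise ValueError(f"BK={BK} must be divisible by 6 for TL2")
    three_k = (K // BK) * BK   # handled by the TL2 path
    two_k = K - three_k        # remainder, handled by the TL1-style path
    return three_k, two_k

print(split_k(4096, 96))   # (4032, 64)
print(split_k(1536, 96))   # (1536, 0)
```

With K = 4096 and BK = 96, 42 full blocks cover 4032 columns and the remaining 64 go to the second path; with K = 1536 nothing is left over.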

TL2 Requirements

M  % BM == 0           # M must be a multiple of BM
K  % BK % 32 == 0      # the remainder of K / BK must be a multiple of 32
BM % bm == 0           # BM must be a multiple of bm
bm in [32]             # bm must be 32 (TL1 also allows 64)
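
The TL2 rules can be checked the same way before you generate anything. The sketch below combines the four requirements above with the BK-divisible-by-6 rule from the previous section; it is my own helper, and the layer shape is an assumption for illustration.

```python
def check_tl2(M: int, K: int, BM: int, BK: int, bm: int) -> list[str]:
    """Return the violated TL2 constraints (an empty list means the parameters pass)."""
    problems = []
    if M % BM != 0:
        problems.append(f"M={M} is not a multiple of BM={BM}")
    if BK % 6 != 0:
        problems.append(f"BK={BK} is not divisible by 6")
    if (K % BK) % 32 != 0:
        problems.append(f"K % BK = {K % BK} is not a multiple of 32")
    if BM % bm != 0:
        problems.append(f"BM={BM} is not a multiple of bm={bm}")
    if bm != 32:
        problems.append(f"bm={bm} must be 32")
    return problems

# Hypothetical layer shape (M, K) = (1536, 4096), used only for illustration.
print(check_tl2(M=1536, K=4096, BM=256, BK=96, bm=32))    # []
for problem in check_tl2(M=1536, K=4096, BM=256, BK=100, bm=64):
    print("TL2 constraint violated:", problem)
```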

TL2 Code Generation

python utils/codegen_tl2.py \
    --model bitnet_b1_58-large \
    --BM 256,128,256 \
    --BK 96,192,96 \
    --bm 32,32,32

Notice the BK values (96, 192, 96) are all divisible by 6 - that’s not a coincidence.

Which kernel should you pick?

Here’s my decision process:

              +-------------+
              | What CPU?   |
              +------+------+
                     |
        +------------+------------+
        |                         |
        v                         v
   +---------+               +---------+
   |   ARM   |               |   x86   |
   +---------+               +---------+
        |                         |
        v                         v
   +---------+               +---------+
   |   TL1   |               |   TL2   |
   +---------+               +---------+
        |                         |
        +------------+------------+
                     |
                     v
              +-------------+
              | Issues?     |
              +------+------+
                     |
        +------------+------------+
        |                         |
        v                         v
   +---------+               +---------+
   |   Yes   |               |   No    |
   +---------+               +---------+
        |                         |
        v                         v
+-------------+            +-------------+
| Use I2_S    |            | You're done!|
| (fallback)  |            +-------------+
+-------------+
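
The same decision process, sketched as code. The try_build callback is a stand-in for whatever build or setup step you actually run; it is hypothetical and only needs to report success or failure.

```python
from typing import Callable

PREFERRED = {"arm64": "tl1", "x86_64": "tl2"}   # speed-optimized kernel per architecture
FALLBACK = "i2_s"                                # works on both architectures

def choose_kernel(arch: str, try_build: Callable[[str], bool]) -> str:
    """Try the architecture's fast kernel first, then fall back to I2_S."""
    if arch not in PREFERRED:
        raise ValueError(f"Unknown architecture: {arch}")
    for quant_type in (PREFERRED[arch], FALLBACK):
        if try_build(quant_type):
            return quant_type
    raise RuntimeError(f"No kernel could be built for {arch}")

# Stand-in build step: pretend the TL2 build fails on this machine.
print(choose_kernel("x86_64", try_build=lambda q: q != "tl2"))   # i2_s
```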

Quick Reference

SUPPORTED_QUANT_TYPES = {
    "arm64": ["i2_s", "tl1"],   # TL1 for speed, I2_S for fallback
    "x86_64": ["i2_s", "tl2"]   # TL2 for speed, I2_S for fallback
}

Performance Expectations

From my testing on various machines:

Scenario	Kernel	Performance
M1/M2 Mac	TL1	Fastest
M1/M2 Mac	I2_S	~20% slower
Intel/AMD CPU	TL2	Fastest
Intel/AMD CPU	I2_S	~15-25% slower
Mixed deployment	I2_S	Consistent across all

Common pitfalls

Pitfall 1: Wrong architecture kernel

# On x86 machine trying to use TL1
RuntimeError: Unsupported quantization type: tl1
Supported types for x86_64: ['i2_s', 'tl2']

Fix: Check your architecture first:

uname -m
# arm64 (macOS) or aarch64 (Linux) -> use i2_s or tl1
# x86_64 -> use i2_s or tl2
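
If you would rather do this check from Python, platform.machine() from the standard library reports the same information, though the spelling varies by operating system. The alias table below is my own assumption and not exhaustive; the supported-kernel lists mirror the Quick Reference.

```python
import platform
from typing import Optional

# Spellings reported by different operating systems (assumed list, not exhaustive).
ARCH_ALIASES = {
    "arm64": "arm64", "aarch64": "arm64",
    "x86_64": "x86_64", "amd64": "x86_64",
}
SUPPORTED = {"arm64": ["i2_s", "tl1"], "x86_64": ["i2_s", "tl2"]}

def supported_kernels(machine: Optional[str] = None) -> list[str]:
    """Map a machine string (default: this machine) to its supported kernel types."""
    machine = (machine or platform.machine()).lower()
    arch = ARCH_ALIASES.get(machine)
    if arch is None:
        raise ValueError(f"Unrecognized architecture: {machine!r}")
    return SUPPORTED[arch]

print(supported_kernels("aarch64"))   # ['i2_s', 'tl1']
print(supported_kernels("AMD64"))     # ['i2_s', 'tl2']
```

Calling supported_kernels() with no argument uses the machine the script is running on.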

Pitfall 2: Incorrect tiling parameters

When generating TL1 or TL2 kernels, the tiling parameters must satisfy specific constraints:

# Wrong: BK not divisible by 6 for TL2
python utils/codegen_tl2.py --BK 100,100,100
# Error: BK must be divisible by 6 for TL2

Pitfall 3: Using pretuned kernels incorrectly

BitNet provides pretuned kernel options. Using the wrong one can cause crashes:

# Check available pretuned options
python setup_env.py --list-pretuned

# Use the correct one for your model
python setup_env.py --quant-type tl1 --pretuned large

Summary

BitNet’s three kernel types serve different purposes:

I2_S: Universal compatibility, works everywhere, moderate performance
TL1: ARM optimized, best performance on Apple Silicon and ARM servers
TL2: x86 optimized, best performance on Intel/AMD CPUs

Start with TL1 (ARM) or TL2 (x86) for maximum speed. If you hit errors, fall back to I2_S. Always check your architecture before choosing a kernel type.

The tiling parameters matter for performance, but the pretuned options usually work well for standard models. Only dive into custom tiling if you’re optimizing for a specific hardware configuration.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 BitNet GitHub Repository
👨‍💻 BitNet Paper

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!