BitNet Supported Models: Complete Guide to 1-bit LLMs You Can Run
Problem
I downloaded BitNet and tried to run a model I found on Hugging Face. It didn’t work:
    ValueError: Unsupported model architecture
    KeyError: 'model_name' not found in SUPPORTED_HF_MODELS

Turns out, BitNet only supports specific models, and the model I picked wasn’t on the list. This guide covers all the models you can actually use with BitNet and which kernel works best for your CPU.
The Short Answer
BitNet currently supports 9 models from Hugging Face:
Official Microsoft:
- microsoft/BitNet-b1.58-2B-4T (2.4B params) - Recommended

Community Models:
- 1bitLLM/bitnet_b1_58-large (0.7B params)
- 1bitLLM/bitnet_b1_58-3B (3.3B params)
- HF1BitLLM/Llama3-8B-1.58-100B-tokens (8.0B params)
- tiiuae/Falcon3-1B-Instruct-1.58bit
- tiiuae/Falcon3-7B-Instruct-1.58bit
- tiiuae/Falcon3-10B-Instruct-1.58bit
- tiiuae/Falcon-E-1B-Instruct
- tiiuae/Falcon-E-3B-Instruct

But there’s a catch: kernel support varies by CPU architecture. Let me explain.
Understanding Kernels
BitNet uses specialized kernels for inference, and not all kernels work on all CPUs:
- i2_s - Integer 2-bit symmetric quantization. Good balance of speed and quality. Works on both x86 and ARM.
- tl1 - Ternary Lookup Table 1. Optimized for ARM processors. Uses lookup tables for faster computation.
- tl2 - Ternary Lookup Table 2. Optimized for x86 processors. Better performance on Intel/AMD CPUs.

The key insight: ARM CPUs use TL1, x86 CPUs use TL2. Both support I2_S.
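The rule above can be sketched as a small helper that maps the host CPU to the kernel this guide recommends. This is only an illustration: `preferred_kernel` and the two name sets are hypothetical, not part of BitNet, and the architecture strings are the values `platform.machine()` commonly reports for 64-bit ARM and x86 systems.

```python
import platform

# Machine names commonly reported by platform.machine() (assumed, not exhaustive).
ARM_MACHINES = {"arm64", "aarch64"}
X86_MACHINES = {"x86_64", "amd64"}


def preferred_kernel(machine=None):
    """Return 'tl1' for ARM and 'tl2' for x86, following the rule in this guide."""
    name = (machine or platform.machine()).lower()
    if name in ARM_MACHINES:
        return "tl1"
    if name in X86_MACHINES:
        return "tl2"
    # The guide only covers ARM and x86, so anything else is rejected.
    raise ValueError(f"Unrecognized CPU architecture: {name!r}")


if __name__ == "__main__":
    try:
        print(preferred_kernel())
    except ValueError as err:
        print(err)
```

Passing the machine name explicitly, as in `preferred_kernel("aarch64")`, makes it easy to check the mapping for a machine other than the one you are on.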
Official Microsoft Model
BitNet-b1.58-2B-4T
This is the model I recommend for beginners. Microsoft trained it on 4 trillion tokens, so it’s reasonably capable:
- Parameters: 2.4 billion
- Training Tokens: 4 trillion
- License: MIT
- Context Length: 4096
- Memory (i2_s): ~400 MB

Kernel Compatibility:

| CPU Type | I2_S | TL1 | TL2 | Best |
|---|---|---|---|---|
| x86 | OK | NO | OK | TL2 |
| ARM | OK | OK | NO | TL1 |

I tested both kernels on my Mac (ARM) and found TL1 about 15% faster than I2_S for this model.
When to Choose the Official Model
Pick this model if you:
- Are new to BitNet and want the “reference” experience
- Have limited RAM (< 2GB available)
- Need a model with MIT license (commercial use allowed)
- Want the most community support and documentation
Community Models by Size
Small Models (0.7B - 1B Parameters)
These models run fast even on older hardware:
| Model | Params | Memory | License |
|---|---|---|---|
| bitnet_b1_58-large | 0.7B | ~150MB | MIT |
| Falcon-E-1B-Instruct | 1.0B | ~200MB | Apache |

I tested bitnet_b1_58-large on an old Raspberry Pi 4 (4GB RAM) - it worked, though slowly. Great for edge devices.
Kernel Compatibility for Small Models:

bitnet_b1_58-large Kernel Support

| CPU Type | I2_S | TL1 | TL2 | Best |
|---|---|---|---|---|
| x86 | OK | NO | OK | TL2 |
| ARM | OK | OK | NO | TL1 |

Falcon-E-1B Kernel Support

| CPU Type | I2_S | TL1 | TL2 | Best |
|---|---|---|---|---|
| x86 | OK | NO | OK | TL2 |
| ARM | OK | OK | NO | TL1 |

Medium Models (2B - 3B Parameters)
The sweet spot for most users:
| Model | Params | Memory | License |
|---|---|---|---|
| BitNet-b1.58-2B-4T | 2.4B | ~400MB | MIT |
| bitnet_b1_58-3B | 3.3B | ~550MB | MIT |
| Falcon-E-3B-Instruct | 3.0B | ~500MB | Apache |

Important: bitnet_b1_58-3B has different kernel support from the other models in this group:

| CPU Type | I2_S | TL1 | TL2 | Best |
|---|---|---|---|---|
| x86 | NO | NO | OK | TL2 |
| ARM | NO | OK | NO | TL1 |

Notice that I2_S is NOT supported for bitnet_b1_58-3B. You must use TL1 on ARM or TL2 on x86. I learned this the hard way when I tried to quantize with I2_S:

    RuntimeError: I2_S kernel not supported for this model architecture

Large Models (7B - 10B Parameters)
For when you need better quality and have more RAM:
| Model | Params | Memory | License |
|---|---|---|---|
| Llama3-8B-1.58-100B | 8.0B | ~1.3GB | Llama |
| Falcon3-7B-Instruct-1.58 | 7.0B | ~1.2GB | Apache |
| Falcon3-10B-Instruct-1.58 | 10.0B | ~1.7GB | Apache |

The Llama3-based model uses 100B training tokens - less than the official 4T, but the base Llama3 architecture is strong.
Kernel Compatibility for Large Models:

Llama3-8B-1.58 Kernel Support

| CPU Type | I2_S | TL1 | TL2 | Best |
|---|---|---|---|---|
| x86 | OK | NO | OK | TL2 |
| ARM | OK | OK | NO | TL1 |

Falcon3 Family Kernel Support

| CPU Type | I2_S | TL1 | TL2 | Best |
|---|---|---|---|---|
| x86 | OK | NO | OK | TL2 |
| ARM | OK | OK | NO | TL1 |

Complete Compatibility Matrix
Here’s the full picture at a glance:
| Model | I2_S x86 | I2_S ARM | TL1 x86 | TL1 ARM | TL2 x86 | TL2 ARM |
|---|---|---|---|---|---|---|
| BitNet-b1.58-2B-4T | OK | OK | NO | OK | OK | NO |
| bitnet_b1_58-large | OK | OK | NO | OK | OK | NO |
| bitnet_b1_58-3B | NO | NO | NO | OK | OK | NO |
| Llama3-8B-1.58 | OK | OK | NO | OK | OK | NO |
| Falcon3 family | OK | OK | NO | OK | OK | NO |
| Falcon-E family | OK | OK | NO | OK | OK | NO |

Choosing the Right Model
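A first step in choosing is ruling out combinations that will not run. As a rough sketch, the compatibility matrix above can be transcribed into a lookup table so a script can reject an unsupported kernel before quantization starts. The names `MATRIX`, `supported_kernels`, and `is_supported` are hypothetical helpers for illustration, and the row labels are copied from the matrix rather than from any BitNet API.

```python
# Kernel support per CPU family, transcribed from the compatibility matrix.
STANDARD = {"x86": {"i2_s", "tl2"}, "arm": {"i2_s", "tl1"}}
LOOKUP_ONLY = {"x86": {"tl2"}, "arm": {"tl1"}}  # no I2_S on either architecture

MATRIX = {
    "BitNet-b1.58-2B-4T": STANDARD,
    "bitnet_b1_58-large": STANDARD,
    "bitnet_b1_58-3B": LOOKUP_ONLY,
    "Llama3-8B-1.58": STANDARD,
    "Falcon3 family": STANDARD,
    "Falcon-E family": STANDARD,
}


def supported_kernels(model, cpu):
    """Return the set of kernels the matrix lists for this model and CPU family."""
    if model not in MATRIX:
        raise KeyError(f"{model!r} is not a row in the matrix")
    if cpu not in MATRIX[model]:
        raise ValueError(f"cpu must be 'x86' or 'arm', got {cpu!r}")
    return MATRIX[model][cpu]


def is_supported(model, cpu, kernel):
    """True if the matrix marks this kernel as OK for the model on this CPU family."""
    return kernel.lower() in supported_kernels(model, cpu)


if __name__ == "__main__":
    print(sorted(supported_kernels("bitnet_b1_58-3B", "arm")))
    print(is_supported("BitNet-b1.58-2B-4T", "x86", "I2_S"))
```

Keeping the one exception in its own `LOOKUP_ONLY` entry makes it obvious which row deviates from the rest.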
I use this decision flow:
| Your RAM limit | Model size | Suggested models |
|---|---|---|
| < 1GB | 0.7B-1B models | bitnet_large, Falcon-E-1B |
| 1-2 GB | 2B-3B models | BitNet-2B, bitnet-3B |
| > 2GB | 7B-10B models | Llama3-8B, Falcon3-10B |

By Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| Quick testing | bitnet_b1_58-large | Fastest download and inference |
| General purpose | BitNet-b1.58-2B-4T | Best quality/size ratio |
| Better quality | Falcon3-7B-Instruct-1.58 | Larger context, more capable |
| Edge devices | Falcon-E-1B-Instruct | Optimized for low resource |
| Maximum quality | Falcon3-10B-Instruct-1.58 | Best results, needs 2GB+ RAM |
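The RAM thresholds from the decision flow can also be written down as code. The sketch below is a minimal illustration, assuming the boundaries are read as "under 1 GB", "1 to 2 GB inclusive", and "over 2 GB"; `pick_tier` is a hypothetical helper, and the model names are the short labels used in the flow.

```python
def pick_tier(ram_gb):
    """Map available RAM (in GB) to the model size tier and suggested models."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    if ram_gb < 1:
        return "0.7B-1B", ["bitnet_large", "Falcon-E-1B"]
    if ram_gb <= 2:
        return "2B-3B", ["BitNet-2B", "bitnet-3B"]
    return "7B-10B", ["Llama3-8B", "Falcon3-10B"]


if __name__ == "__main__":
    for ram in (0.5, 1.5, 8):
        tier, models = pick_tier(ram)
        print(f"{ram} GB -> {tier}: {', '.join(models)}")
```

The tier is an upper bound on what fits: a machine with plenty of RAM can still run the smaller models if speed matters more than quality.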
Downloading Models
Use huggingface-cli to download models:
    # Install huggingface-hub first
    pip install huggingface-hub

    # Official model (recommended)
    huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf \
        --local-dir models/BitNet-b1.58-2B-4T

    # Large model
    huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens \
        --local-dir models/Llama3-8B-1.58

    # Falcon3
    huggingface-cli download tiiuae/Falcon3-7B-Instruct-1.58bit \
        --local-dir models/Falcon3-7B

Resume Interrupted Downloads
If a download fails (common with large models):
    huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf \
        --local-dir models/BitNet-b1.58-2B-4T \
        --resume-download

Running Different Models
After downloading, use setup_env.py with the correct quantization:
    # Official 2B model with I2_S (works on all CPUs)
    python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s

    # 3B model - MUST use TL1 (ARM) or TL2 (x86)
    python setup_env.py -md models/bitnet_b1_58-3B -q tl1  # ARM
    python setup_env.py -md models/bitnet_b1_58-3B -q tl2  # x86

    # Llama3-8B with I2_S
    python setup_env.py -md models/Llama3-8B-1.58 -q i2_s

Then run inference:

    python run_inference.py \
        -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
        -p "You are a helpful assistant" \
        -cnv

Model Quality Comparison
I ran a simple test: asking each model to explain quantum computing in one paragraph. Here’s my subjective ranking:
| Model | Quality | Speed | Overall |
|---|---|---|---|
| BitNet-b1.58-2B-4T | 4/5 | 5/5 | 4.5/5 |
| bitnet_b1_58-large | 2/5 | 5/5 | 3.5/5 |
| bitnet_b1_58-3B | 4/5 | 4/5 | 4.0/5 |
| Llama3-8B-1.58 | 4/5 | 3/5 | 3.5/5 |
| Falcon3-7B-Instruct-1.58 | 4/5 | 3/5 | 3.5/5 |
| Falcon3-10B-Instruct-1.58 | 5/5 | 2/5 | 3.5/5 |

The 2B model hits the sweet spot for most tasks. The 10B model has the best quality but runs noticeably slower on CPU.
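The Overall column is consistent with a plain average of the Quality and Speed scores. The snippet below is a small sketch that recomputes it from the ratings in the table; the equal weighting is inferred from the numbers rather than stated anywhere, and `overall` is a hypothetical helper you could swap for a weighted version if quality matters more to you than speed.

```python
# (quality, speed) ratings out of 5, copied from the comparison table.
RATINGS = {
    "BitNet-b1.58-2B-4T": (4, 5),
    "bitnet_b1_58-large": (2, 5),
    "bitnet_b1_58-3B": (4, 4),
    "Llama3-8B-1.58": (4, 3),
    "Falcon3-7B-Instruct-1.58": (4, 3),
    "Falcon3-10B-Instruct-1.58": (5, 2),
}


def overall(quality, speed):
    """Equal-weight average of the two scores (assumed weighting)."""
    return (quality + speed) / 2


if __name__ == "__main__":
    # Sort best-first so the top pick is printed at the top.
    ranked = sorted(RATINGS.items(), key=lambda item: overall(*item[1]), reverse=True)
    for model, (quality, speed) in ranked:
        print(f"{model}: {overall(quality, speed):.1f}/5")
```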
Summary
When choosing a BitNet model:
- Check your CPU architecture - ARM uses TL1, x86 uses TL2
- Match model to your RAM - 0.7B for <1GB, 2B for 1-2GB, 7B+ for >2GB
- Use I2_S when available - it works on both architectures
- Start with the official 2B model - best balance for beginners
- Upgrade to 7B+ for quality - if you have the RAM and patience
The official BitNet-b1.58-2B-4T model is the safest choice for most users. Community models like Falcon3 and Llama3-8B offer alternatives with different trade-offs between size and quality.
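To tie the summary to the commands shown earlier, here is a minimal sketch that assembles the `setup_env.py` invocation for a given model directory and kernel. It only uses the `-md` and `-q` flags that appear in this guide; `setup_env_command` is a hypothetical helper, and it builds the argument list without running anything.

```python
import shlex

# Kernel names accepted by -q in the commands shown in this guide.
KERNELS = {"i2_s", "tl1", "tl2"}


def setup_env_command(model_dir, kernel):
    """Build the argument list for setup_env.py without executing it."""
    kernel = kernel.lower()
    if kernel not in KERNELS:
        raise ValueError(f"Unknown kernel {kernel!r}; expected one of {sorted(KERNELS)}")
    return ["python", "setup_env.py", "-md", model_dir, "-q", kernel]


if __name__ == "__main__":
    cmd = setup_env_command("models/bitnet_b1_58-3B", "tl1")
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(cmd))
```

Returning a list rather than a string means the result can be passed straight to `subprocess.run` from inside the BitNet checkout once you are happy with it.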
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can email me.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 BitNet-b1.58-2B-4T on Hugging Face
- 👨💻 BitNet-b1.58-2B-4T GGUF
- 👨💻 Falcon3 Model Collection
- 👨💻 Falcon Edge Series
- 👨💻 BitNet GitHub Repository
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!