Export Fine-Tuned LLM to GGUF: Run Custom Models on Ollama or LM Studio
I spent hours fine-tuning a language model. When I tried to run it locally, I hit a wall. The model files were 14GB. My laptop has 16GB RAM total. I couldn’t even load the model, let alone run inference.
The problem wasn’t the fine-tuning. The problem was the format.
The Problem
After fine-tuning an LLM, you end up with model weights in formats like PyTorch (.bin) or SafeTensors. These formats work great for training on GPUs. They’re terrible for local inference on consumer hardware.
I faced several issues:
- File sizes are huge - A 7B parameter model takes ~14GB in FP16 format
- RAM requirements exceed consumer hardware - Loading a 14GB model needs more than 16GB system RAM
- No integration with local inference tools - Ollama and LM Studio can’t load .bin files directly
- LoRA adapters need merging - If you used LoRA, the adapters are separate from the base model
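The ~14GB figure is simple arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming 1 GB = 1e9 bytes and counting only the weights (no activations or context cache); the function name is my own:

```python
def weights_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# FP16 stores each weight in 2 bytes, FP32 in 4 bytes.
fp16_gb = weights_size_gb(7e9, 2)
fp32_gb = weights_size_gb(7e9, 4)
print(f"7B model: ~{fp16_gb:.0f} GB in FP16, ~{fp32_gb:.0f} GB in FP32")
```

On a 16GB laptop the operating system and other applications already use several GB, so a 14GB set of weights leaves no room to run anything.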
I wanted to run my model offline, on my laptop, without requiring a GPU. I needed a different format.
The Solution: GGUF
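GGUF is the single-file model format used by llama.cpp and the tools built on it. It is designed for fast loading and quantized inference, which makes it a good fit for CPU-only machines. The PersonalForge project on Reddit showed me the way: “Exports GGUF with Q4_K_M quantization” and “Run it offline forever.”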
The key benefits:
- Quantized file sizes - 4-bit quantization reduces a 7B model from ~14GB to ~4GB
- Single file distribution - Everything (weights, tokenizer, metadata) in one file
- Works with local tools - Ollama, LM Studio, llama.cpp all support GGUF natively
- CPU inference - No GPU required, runs on standard RAM
The export pipeline has three main steps:
- Merge LoRA adapters with base model (if applicable)
- Convert to GGUF format
- Quantize to reduce file size (Q4_K_M recommended)
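The last two steps are plain command-line calls, so they are easy to script. The sketch below only builds the command lists and does not run anything. The script name convert_hf_to_gguf.py, the binary name llama-quantize, and the llama.cpp directory layout are assumptions about your checkout; the function name is hypothetical. The merge step happens in Python and is not included.

```python
from pathlib import Path


def build_export_commands(merged_dir, out_stem, quant="Q4_K_M", llama_cpp_dir="llama.cpp"):
    """Return the convert and quantize commands as argument lists."""
    root = Path(llama_cpp_dir)
    f16_file = f"{out_stem}_f16.gguf"
    quant_file = f"{out_stem}_{quant.lower()}.gguf"

    # Step 2: Hugging Face directory -> unquantized FP16 GGUF
    convert = [
        "python", str(root / "convert_hf_to_gguf.py"), str(merged_dir),
        "--outfile", f16_file, "--outtype", "f16",
    ]
    # Step 3: FP16 GGUF -> quantized GGUF
    quantize = [str(root / "llama-quantize"), f16_file, quant_file, quant]
    return [convert, quantize]


for cmd in build_export_commands("merged_model", "my_model"):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths match your setup.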
Understanding Quantization
Before diving into code, I needed to understand quantization options. Quantization reduces model precision to save memory:
| Quantization | Size (7B) | Quality | Speed | When to Use |
|---|---|---|---|---|
| Q4_K_M | ~4GB | Excellent | Fast | Default choice |
| Q4_K_S | ~3.8GB | Good | Faster | Low memory situations |
| Q5_K_M | ~4.8GB | Excellent | Medium | Quality-critical apps |
| Q6_K | ~5.5GB | Near-original | Slow | Maximum quality |
| Q8_0 | ~7GB | Near-original | Slowest | Research/benchmarking |
| Q2_K | ~2.7GB | Poor | Fastest | Avoid - too much quality loss |
I chose Q4_K_M as my default. It gave me the best balance: output quality stayed close to the FP16 model in my tests, and the ~4GB file fits comfortably in 8GB of RAM.
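To make the choice less of a guess, here is a minimal sketch that picks the largest quantization fitting a RAM budget. The sizes are the approximate 7B figures from the table above (Q2_K left out on purpose). The 4GB headroom for the operating system and context cache is my own assumption, so adjust it for your machine.

```python
# Approximate 7B file sizes in GB, taken from the table above.
QUANT_SIZES_GB = {
    "Q4_K_S": 3.8,
    "Q4_K_M": 4.0,
    "Q5_K_M": 4.8,
    "Q6_K": 5.5,
    "Q8_0": 7.0,
}


def pick_quant(ram_gb, headroom_gb=4.0):
    """Return the largest quantization that fits in ram_gb minus headroom, or None."""
    budget = ram_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for ram in (4, 8, 16):
    print(f"{ram}GB RAM -> {pick_quant(ram)}")
```

Larger files are not always better for interactive use, so treat the result as an upper bound rather than a recommendation.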
Method 1: Export with Unsloth (Easiest)
If you used Unsloth for fine-tuning, export is built-in. This is the simplest approach.
    from unsloth import FastLanguageModel

    # Load your fine-tuned model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "outputs/checkpoint-500",  # Your checkpoint path
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
    )

    # Export directly to GGUF with Q4_K_M quantization
    model.save_pretrained_gguf(
        "my_finetuned_model",
        tokenizer,
        quantization_method = "q4_k_m",
    )

    print("Export complete!")

Unsloth handles everything automatically:
- Merges LoRA adapters with the base model
- Converts to GGUF format
- Applies quantization in one step
- Saves tokenizer and metadata
I tried this first. It worked perfectly for my Unsloth-trained models.
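The exact output filename is not shown above, so after an export I list whatever .gguf files ended up in the output directory along with their sizes. A small sketch using only the standard library; the function name is my own:

```python
from pathlib import Path


def find_gguf_files(output_dir):
    """Return (path, size in GB) for every .gguf file under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return [(p, p.stat().st_size / 1e9) for p in sorted(root.rglob("*.gguf"))]
```

Calling find_gguf_files("my_finetuned_model") after the export should return at least one entry. An empty list means the export did not write a GGUF file where you expected it.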
Method 2: Manual Conversion with llama.cpp
When I had models trained outside Unsloth, I needed the manual approach. This gave me more control but required more steps.
Step 1: Install llama.cpp
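    # Clone the repository
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp

    # Build the project. Recent versions build with CMake and put binaries in build/bin/;
    # older checkouts used plain `make` and put them in the repository root.
    cmake -B build
    cmake --build build --config Release

    # Verify installation. The later commands in this article call the binaries as ./llama-...,
    # so either copy them next to your model files or prefix them with build/bin/.
    ./build/bin/llama-cli --version

Step 2: Merge LoRA Adapters (If Needed)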
If you used LoRA fine-tuning, the adapters must be merged with the base model before conversion:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-3B",  # Your base model
        torch_dtype="auto",
        device_map="auto",
    )

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

    # Load LoRA adapters
    model = PeftModel.from_pretrained(base_model, "outputs/checkpoint-500")

    # Merge and unload
    merged_model = model.merge_and_unload()

    # Save merged model
    merged_model.save_pretrained("merged_model")
    tokenizer.save_pretrained("merged_model")

    print("LoRA merged successfully!")

Important: Always test the merged model before proceeding. If the merge went wrong, quantization won’t fix it.
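Before generating any text, a cheap structural check catches the most common mistake: pointing the converter at a directory that still holds an unmerged adapter. This is a sketch under my own assumptions about which files matter (config.json, weight files, tokenizer files, and PEFT's adapter_config.json). It does not replace an actual generation test.

```python
from pathlib import Path


def check_merged_dir(model_dir):
    """Return a list of problems found in a merged model directory (empty if none)."""
    d = Path(model_dir)
    problems = []
    if not (d / "config.json").is_file():
        problems.append("missing config.json")
    if not (list(d.glob("*.safetensors")) or list(d.glob("*.bin"))):
        problems.append("no weight files (*.safetensors or *.bin)")
    if not any((d / name).is_file() for name in ("tokenizer.json", "tokenizer.model")):
        problems.append("no tokenizer.json or tokenizer.model")
    if (d / "adapter_config.json").is_file():
        problems.append("adapter_config.json present: looks like an unmerged LoRA checkpoint")
    return problems
```

An empty list from check_merged_dir("merged_model") means the directory at least has the shape the converter expects.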
Step 3: Convert to GGUF Format
Now convert the merged model to unquantized GGUF:
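    # Navigate to llama.cpp directory
    cd llama.cpp

    # Install the converter's Python dependencies (once)
    pip install -r requirements.txt

    # Convert HuggingFace format to GGUF (FP16, unquantized)
    # The script is named convert-hf-to-gguf.py in older checkouts.
    python convert_hf_to_gguf.py /path/to/merged_model \
        --outfile my_model_f16.gguf \
        --outtype f16

    # Verify the file was created
    ls -lh my_model_f16.gguf

At this point, I had a GGUF file, but for a 7B model it was still ~14GB. I needed quantization.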
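Besides checking the size, it is worth confirming that the file really is a GGUF container. A GGUF file starts with the four bytes "GGUF", followed by a version number and two counters. This sketch assumes a little-endian file in GGUF version 2 or later (64-bit counters); the function name is my own.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header: magic, version, tensor count, metadata count."""
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_kv}
```

A tensor count of zero or a ValueError here means the conversion did not finish properly, and quantizing that file is pointless.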
Step 4: Quantize the Model
The quantization step reduces file size dramatically:
    # Quantize to Q4_K_M (recommended)
    ./llama-quantize my_model_f16.gguf my_model_q4_k_m.gguf Q4_K_M

    # Check the result
    ls -lh my_model_q4_k_m.gguf
    # Should be around 4GB for a 7B model

I watched the file shrink from 14GB to 4GB. The quality remained surprisingly good.
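A 4-bit format does not give exactly 4 bits per weight: the K-quant formats also store scaling data and keep some tensors at higher precision. A quick way to see the effective figure is to divide the file size by the parameter count. A sketch, assuming you know the parameter count of your model:

```python
def bits_per_weight(file_size_bytes, n_params):
    """Effective bits stored per model parameter."""
    return file_size_bytes * 8 / n_params


# With the round numbers from this article (7B parameters):
print(f"FP16:   {bits_per_weight(14e9, 7e9):.1f} bits/weight")
print(f"Q4_K_M: {bits_per_weight(4e9, 7e9):.1f} bits/weight")
```

For a real file, pass os.path.getsize("my_model_q4_k_m.gguf") as the first argument.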
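Note: llama.cpp command names sometimes change between versions. The llama-quantize command was previously just quantize, and the conversion script convert_hf_to_gguf.py was previously named convert-hf-to-gguf.py. CMake builds also place binaries in build/bin/ rather than the repository root. If you get “command not found,” run ls in both locations to see what binaries exist.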
Import into Ollama
Ollama makes running GGUF models simple. I created a Modelfile:
    cat > Modelfile << 'EOF'
    FROM ./my_model_q4_k_m.gguf
    TEMPLATE """{{ .System }}{{ .Prompt }}"""
    PARAMETER temperature 0.7
    PARAMETER num_ctx 4096
    EOF

    # Create the Ollama model
    ollama create my-custom-model -f Modelfile

    # Run the model
    ollama run my-custom-model

    # Test it
    >>> Hello, how are you?

Ollama loaded my 4GB model and started chatting. No GPU required.
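When I export several models, I generate the Modelfile instead of typing it each time. A minimal sketch that reproduces the same four directives used above. The function name is hypothetical, and the default template is the simple one from this article, which may not match the chat format your base model was trained on.

```python
def render_modelfile(gguf_path, temperature=0.7, num_ctx=4096,
                     template="{{ .System }}{{ .Prompt }}"):
    """Return Modelfile text with FROM, TEMPLATE and two PARAMETER lines."""
    lines = [
        f"FROM {gguf_path}",
        f'TEMPLATE """{template}"""',
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
    ]
    return "\n".join(lines) + "\n"


print(render_modelfile("./my_model_q4_k_m.gguf"))
```

Write the returned text to a file named Modelfile and pass it to ollama create with -f as before.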
To share the model:
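    # The model name must include your username before you can push it
    ollama cp my-custom-model username/my-custom-model

    # Push to Ollama registry (requires an ollama.com account)
    ollama push username/my-custom-model

    # Others can then pull it
    ollama pull username/my-custom-model

Use with LM Studio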
LM Studio provides a GUI for running local models. It loads GGUF files directly.
Option 1: Direct file loading
- Open LM Studio
- Go to “Local Server” tab
- Click “Select a model to load”
- Navigate to your GGUF file
- Load and start chatting
Option 2: Model directory
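    # Copy GGUF to LM Studio's model directory.
    # LM Studio expects a publisher/model-name folder layout;
    # "my-name" and "my-finetuned-model" are placeholders.
    mkdir -p ~/.lmstudio/models/my-name/my-finetuned-model/
    cp my_model_q4_k_m.gguf ~/.lmstudio/models/my-name/my-finetuned-model/

    # Restart LM Studio to detect the new model

LM Studio’s interface makes it easy to adjust parameters like temperature, context length, and sampling methods.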
Test Your Exported Model
Always test the quantized model to ensure quality isn’t degraded:
    # Test with llama.cpp directly
    ./llama-cli -m my_model_q4_k_m.gguf \
        -p "Explain machine learning in simple terms" \
        -n 512 \
        --temp 0.7

    # Test with Ollama API
    curl http://localhost:11434/api/generate -d '{
      "model": "my-custom-model",
      "prompt": "Explain machine learning in simple terms",
      "stream": false
    }'

Compare the output quality between the original and quantized versions using the same prompts. With Q4_K_M, the answers should stay close to what the original model produces.
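The same API call is easy to script when you want to run a fixed set of prompts against the model. A sketch using only the standard library. It assumes Ollama is listening on its default port and that the model is named my-custom-model as above; the function names are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt):
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the generated text (needs a running Ollama)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("my-custom-model", "Explain machine learning in simple terms") returns the answer as a string, which makes side-by-side comparisons across quantization levels straightforward.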
Troubleshooting Common Errors
I hit several issues during export. Here’s how I solved them:
Error: “Key ‘tokenizer.model’ not found”
    ConversionError: Key 'tokenizer.model' not found in model directory

Solution: Copy tokenizer files from the base model:

    # Copy tokenizer files from base model to your fine-tuned model
    # (not every base model ships all three files; copy the ones that exist)
    cp /path/to/base_model/tokenizer.model /path/to/your_model/
    cp /path/to/base_model/tokenizer.json /path/to/your_model/
    cp /path/to/base_model/tokenizer_config.json /path/to/your_model/

    # Retry conversion (the script is named convert-hf-to-gguf.py in older checkouts)
    python convert_hf_to_gguf.py /path/to/your_model --outfile output.gguf --outtype f16

Error: “GGUF model not recognized by Ollama”
    Error: model architecture not supported

Solution: Check architecture compatibility. Ollama supports:
- Llama family (Llama, Llama 2, Llama 3)
- Mistral family
- Qwen family
- Phi family
- Gemma family
Verify with:
    # Inspect GGUF metadata with the gguf-dump tool from the gguf Python package
    pip install gguf
    gguf-dump my_model.gguf

    # Check the general.architecture field

If your model architecture isn’t supported, you may need to wait for Ollama updates or use llama.cpp directly.
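The architecture name is stored in the GGUF metadata under the key general.architecture, so it can also be read with a few lines of standard-library Python. This is a sketch of a minimal metadata reader, assuming a little-endian GGUF file of version 2 or later. It walks the key/value section and skips everything except the one key it is looking for.

```python
import struct

# Byte widths of GGUF scalar value types (type id -> size in bytes).
_SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read_string(f):
    # GGUF strings are a uint64 length followed by UTF-8 bytes.
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def _skip_value(f, value_type):
    if value_type in _SCALAR_SIZES:
        f.seek(_SCALAR_SIZES[value_type], 1)
    elif value_type == _STRING:
        (length,) = struct.unpack("<Q", f.read(8))
        f.seek(length, 1)
    elif value_type == _ARRAY:
        # Arrays store the item type (uint32) and item count (uint64) first.
        item_type, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _skip_value(f, item_type)
    else:
        raise ValueError(f"unknown GGUF value type {value_type}")


def read_architecture(path):
    """Return the general.architecture string, or None if the key is absent."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _version, _n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        for _ in range(n_kv):
            key = _read_string(f)
            (value_type,) = struct.unpack("<I", f.read(4))
            if key == "general.architecture" and value_type == _STRING:
                return _read_string(f)
            _skip_value(f, value_type)
    return None
```

The returned string (for example a Llama or Qwen family name) is what you compare against the list of supported architectures.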
Error: “Out of memory during quantization”
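    RuntimeError: CUDA out of memory

Solution: This is a PyTorch error, so it comes from the Python merge step rather than from llama-quantize, which runs on the CPU. Run the merge on the CPU, either by loading the base model with device_map="cpu" instead of device_map="auto" or by hiding the GPU:

    # Force CPU-only for the merge (your_merge_script.py is a placeholder for your own script)
    CUDA_VISIBLE_DEVICES="" python your_merge_script.py

Quantization itself doesn’t require a GPU; it runs on the CPU.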
Error: “Output is nonsense after quantization”
If the quantized model produces gibberish:
- Test the merged model first - Before quantizing, test the FP16 GGUF:

    # test_before_quantize.sh
    ./llama-cli -m my_model_f16.gguf -p "Hello" -n 100

- Check LoRA merge - The merge might have failed. Reload and test with transformers:

    # test_merge.py
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("merged_model")
    tokenizer = AutoTokenizer.from_pretrained("merged_model")
    inputs = tokenizer("Hello", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0]))

- Try higher quantization - Q4_K_M too lossy? Try Q5_K_M or Q6_K.
Error: “Wrong outtype - got f32 instead of quantized”
This happens when --outtype isn’t specified correctly during conversion.
Solution: Always specify outtype explicitly:
    # Correct: specify outtype (the script is named convert-hf-to-gguf.py in older checkouts)
    python convert_hf_to_gguf.py model_path --outfile model.gguf --outtype f16

    # Then quantize separately
    ./llama-quantize model.gguf model_q4.gguf Q4_K_M

Why This Matters
I built a fine-tuned model that ran entirely offline on my laptop. No API keys. No cloud costs. No internet required. The PersonalForge project described it perfectly: “Run it offline forever.”
The comparison is striking:
| Factor | Training Format | GGUF Format |
|---|---|---|
| File Size (7B) | ~14GB | ~4GB (Q4_K_M) |
| RAM Required | 16GB+ GPU VRAM | 8GB CPU RAM |
| Inference Speed | Fast (GPU) | Acceptable (CPU) |
| Portability | Low (multiple files) | High (single file) |
| Local Tools | Python stack only (PyTorch, transformers) | Ollama, LM Studio, llama.cpp |
| Offline Use | Possible, but needs the full Python stack | Yes, out of the box |
Best Practices I Learned
Through trial and error, I discovered several key practices:
1. Always merge before quantizing
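LoRA adapters must be merged with the base model before conversion. If you convert the base model alone, the resulting GGUF silently lacks your fine-tuning; if you point the converter at the adapter checkpoint, it either fails or produces garbage.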
2. Test at each step
Don’t skip testing:
- Test merged model (before GGUF conversion)
- Test unquantized GGUF (before quantization)
- Test quantized GGUF (before deployment)
This isolates where problems occur.
3. Q4_K_M is the sweet spot
I tested multiple quantization levels. Q4_K_M consistently provided the best balance:
- Small enough for 8GB RAM
- Fast enough for real-time chat
- Quality nearly indistinguishable from original
4. Keep the FP16 GGUF
After quantizing, I keep the unquantized GGUF file. It’s useful for:
- Re-quantizing to different formats later
- Debugging quantization issues
- Serving from more powerful hardware
5. Document your training config
Include model architecture, training parameters, and base model in the GGUF filename or metadata. Six months from now, you won’t remember which base model you used.
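One low-effort way to do this is a consistent filename. The sketch below builds one from the base model, a short task label, the quantization type, and a date. The naming scheme and the function name are my own convention, not something GGUF tools require.

```python
import re
from datetime import date


def gguf_filename(base_model, task, quant, when=None):
    """Build a descriptive GGUF filename such as base_task_quant_YYYYMMDD.gguf."""
    when = when or date.today()

    def slug(text):
        # Lowercase and replace anything that is not a letter, digit or dot with "-".
        return re.sub(r"[^a-z0-9.]+", "-", text.lower()).strip("-")

    return f"{slug(base_model)}_{slug(task)}_{quant.lower()}_{when:%Y%m%d}.gguf"


print(gguf_filename("Qwen/Qwen2.5-3B", "support bot", "Q4_K_M", date(2025, 1, 15)))
```

Training parameters that do not fit in a filename can go in a small text file stored next to the GGUF.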
What I Built
After following this process, I had:
- A 4GB GGUF file that runs on my laptop
- Integration with Ollama for command-line use
- Integration with LM Studio for GUI use
- Complete offline capability - no internet needed
- The ability to share the model file with others
The fine-tuning was the hard part. Exporting to GGUF was the practical step that made the model usable.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.