Export Fine-Tuned LLM to GGUF: Run Custom Models on Ollama or LM Studio

I spent hours fine-tuning a language model. When I tried to run it locally, I hit a wall. The model files were 14GB. My laptop has 16GB RAM total. I couldn’t even load the model, let alone run inference.

The problem wasn’t the fine-tuning. The problem was the format.

The Problem

After fine-tuning an LLM, you end up with model weights in formats like PyTorch (.bin) or SafeTensors. These formats work great for training on GPUs. They’re terrible for local inference on consumer hardware.

I faced several issues:

  • File sizes are huge - A 7B parameter model takes ~14GB in FP16 format
  • RAM requirements exceed consumer hardware - Loading a 14GB model needs more than 16GB system RAM
  • No integration with local inference tools - Ollama and LM Studio can’t load .bin files directly
  • LoRA adapters need merging - If you used LoRA, the adapters are separate from the base model
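The 14GB figure in the first bullet is just parameter count times bytes per parameter. A minimal sketch of that arithmetic follows; the function name and the dtype table are illustrative, and it counts weights only, ignoring the KV cache and runtime overhead:

```python
# Bytes used to store one parameter in common training/export dtypes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_size_gb(num_params: float, dtype: str = "fp16") -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("fp32", "fp16"):
        print(f"7B in {dtype}: ~{weight_size_gb(7e9, dtype):.0f} GB")
```

The weights alone fill most of a 16GB machine before the operating system or the inference runtime gets any memory.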

I wanted to run my model offline, on my laptop, without requiring a GPU. I needed a different format.

The Solution: GGUF

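GGUF is the single-file model format used by llama.cpp and other GGML-based runtimes. It is built for fast loading and efficient inference on consumer hardware, including CPU-only machines. The PersonalForge project on Reddit showed me the way: “Exports GGUF with Q4_K_M quantization” and “Run it offline forever.”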

The key benefits:

  • Quantized file sizes - 4-bit quantization reduces a 7B model from ~14GB to ~4GB
  • Single file distribution - Everything (weights, tokenizer, metadata) in one file
  • Works with local tools - Ollama, LM Studio, llama.cpp all support GGUF natively
  • CPU inference - No GPU required, runs on standard RAM

The export pipeline has three main steps:

  1. Merge LoRA adapters with base model (if applicable)
  2. Convert to GGUF format
  3. Quantize to reduce file size (Q4_K_M recommended)
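Steps 2 and 3 come down to two llama.cpp commands. The sketch below only builds those commands as argument lists so you can inspect them before running anything. The helper name is hypothetical, the script name `convert_hf_to_gguf.py` is what recent llama.cpp checkouts ship (older ones use hyphens), and it assumes you run from the llama.cpp directory with step 1 already done:

```python
import shlex


def build_export_commands(merged_dir, name, quant="Q4_K_M",
                          convert_script="convert_hf_to_gguf.py"):
    """Return the convert and quantize commands as argv lists (nothing is run)."""
    f16_path = f"{name}_f16.gguf"
    quant_path = f"{name}_{quant.lower()}.gguf"
    # Step 2: Hugging Face directory -> unquantized FP16 GGUF.
    convert = ["python", convert_script, merged_dir,
               "--outfile", f16_path, "--outtype", "f16"]
    # Step 3: FP16 GGUF -> quantized GGUF.
    quantize = ["./llama-quantize", f16_path, quant_path, quant]
    return [convert, quantize]


if __name__ == "__main__":
    for argv in build_export_commands("merged_model", "my_model"):
        print(shlex.join(argv))
```

Passing each list to `subprocess.run(argv, check=True)` would execute the steps in order and stop at the first failure.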

Understanding Quantization

Before diving into code, I needed to understand quantization options. Quantization reduces model precision to save memory:

Quantization  Size (7B)  Quality        Speed    When to Use
Q4_K_M        ~4GB       Excellent      Fast     Default choice
Q4_K_S        ~3.8GB     Good           Faster   Low memory situations
Q5_K_M        ~4.8GB     Excellent      Medium   Quality-critical apps
Q6_K          ~5.5GB     Near-original  Slow     Maximum quality
Q8_0          ~7GB       Near-original  Slowest  Research/benchmarking
Q2_K          ~2GB       Poor           Fastest  Avoid - too much quality loss

I chose Q4_K_M as my default. It provides the best balance: maintaining 95%+ of original quality while fitting comfortably in 8GB RAM.
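If you want to turn the table into a quick decision, here is a rough sketch that picks the largest quantization fitting a RAM budget. The sizes are the approximate 7B figures from the table; the 2GB headroom for the operating system and context cache is my assumption, not a measured value:

```python
# Approximate 7B file sizes in GB, copied from the table above.
SIZES_7B_GB = {
    "Q2_K": 2.0, "Q4_K_S": 3.8, "Q4_K_M": 4.0,
    "Q5_K_M": 4.8, "Q6_K": 5.5, "Q8_0": 7.0,
}


def largest_quant_that_fits(ram_gb, headroom_gb=2.0, sizes=SIZES_7B_GB):
    """Pick the biggest quantization whose file fits in RAM minus headroom."""
    budget = ram_gb - headroom_gb  # headroom_gb is an assumed safety margin
    fitting = {name: size for name, size in sizes.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)
```

This only checks whether the file fits. It says nothing about speed, so on a slow CPU a smaller quantization than the one returned may still be the better pick.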

Method 1: Export with Unsloth (Easiest)

If you used Unsloth for fine-tuning, export is built-in. This is the simplest approach.

export_unsloth.py
from unsloth import FastLanguageModel
# Load your fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "outputs/checkpoint-500",  # Your checkpoint path
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
# Export directly to GGUF with Q4_K_M quantization
model.save_pretrained_gguf(
    "my_finetuned_model",
    tokenizer,
    quantization_method = "q4_k_m",
)
print("Export complete!")

Unsloth handles everything automatically:

  • Merges LoRA adapters with the base model
  • Converts to GGUF format
  • Applies quantization in one step
  • Saves tokenizer and metadata

I tried this first. It worked perfectly for my Unsloth-trained models.

Method 2: Manual Conversion with llama.cpp

When I had models trained outside Unsloth, I needed the manual approach. This gave me more control but required more steps.

Step 1: Install llama.cpp

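install_llamacpp.sh
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build the project (recent versions build with CMake rather than make)
cmake -B build
cmake --build build --config Release
# Link the tools used below into the repo root so ./llama-cli and ./llama-quantize resolve
ln -s build/bin/llama-cli build/bin/llama-quantize .
# Install the Python dependencies for the conversion script
pip install -r requirements.txt
# Verify installation
./llama-cli --version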

Step 2: Merge LoRA Adapters (If Needed)

If you used LoRA fine-tuning, the adapters must be merged with the base model before conversion:

merge_lora.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",  # Your base model
    torch_dtype="auto",
    device_map="auto",
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "outputs/checkpoint-500")
# Merge and unload
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")
print("LoRA merged successfully!")

Important: Always test the merged model before proceeding. If the merge went wrong, quantization won’t fix it.
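Before running a generation test, a quick look at the output directory catches the most common mistake: saving the adapter again instead of the merged weights. This sketch is a heuristic only. The file names are what Hugging Face `save_pretrained` usually writes, and the function name is hypothetical:

```python
from pathlib import Path

# Usual file-name patterns for full model weights (an assumption, not a spec).
WEIGHT_PREFIXES = ("model", "pytorch_model")
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def check_merged_dir(model_dir):
    """Return a list of problems found in a saved Hugging Face model directory."""
    path = Path(model_dir)
    names = {p.name for p in path.iterdir()} if path.is_dir() else set()
    problems = []
    if "config.json" not in names:
        problems.append("missing config.json")
    if not any(n.startswith(WEIGHT_PREFIXES) and n.endswith(WEIGHT_SUFFIXES)
               for n in names):
        problems.append("no full model weights found")
    if "adapter_config.json" in names:
        problems.append("adapter_config.json present, looks like an unmerged LoRA save")
    if not any(n.startswith("tokenizer") for n in names):
        problems.append("no tokenizer files")
    return problems
```

An empty list from `check_merged_dir("merged_model")` means the directory looks complete. It does not prove the merge is numerically correct, so a short generation test is still worth doing.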

Step 3: Convert to GGUF Format

Now convert the merged model to unquantized GGUF:

convert_to_gguf.sh
# Navigate to llama.cpp directory
cd llama.cpp
# Convert HuggingFace format to GGUF (FP16, unquantized)
# Note: the script is named convert_hf_to_gguf.py (underscores) in current llama.cpp
python convert_hf_to_gguf.py /path/to/merged_model \
    --outfile my_model_f16.gguf \
    --outtype f16
# Verify the file was created
ls -lh my_model_f16.gguf

At this point, I had a GGUF file, but it was still ~14GB. I needed quantization.
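Beyond `ls`, you can confirm the file really is GGUF by reading its fixed-size header. A minimal sketch, assuming GGUF version 2 or later (little-endian, 64-bit counts); the function name is mine:

```python
import struct


def read_gguf_header(path):
    """Read magic, version, tensor count and metadata count from a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, uint64 tensor count, uint64 KV count.
        # Assumes GGUF v2+; version 1 files used 32-bit counts.
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_kv}
```

A tensor count of zero, or a `ValueError` on the magic bytes, usually means the conversion was interrupted or wrote to a different path than expected.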

Step 4: Quantize the Model

The quantization step reduces file size dramatically:

quantize_gguf.sh
# Quantize to Q4_K_M (recommended)
./llama-quantize my_model_f16.gguf my_model_q4_k_m.gguf Q4_K_M
# Check the result
ls -lh my_model_q4_k_m.gguf
# Should be around 4GB for a 7B model

I watched the file shrink from 14GB to 4GB. The quality remained surprisingly good.

Note: llama.cpp command names sometimes change between versions. The llama-quantize command was previously just quantize. If you get “command not found,” check with ./quantize or ls to see what binaries exist.
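If you script the export, a small lookup saves you from hard-coding one binary name. This is a sketch under assumptions: the candidate names come from the note above, and `build/bin` is where a CMake build normally puts binaries, which may differ on your setup:

```python
import os
import shutil


def find_tool(candidates=("llama-quantize", "quantize"),
              search_dirs=(".", "build/bin")):
    """Return the first existing executable among the candidate names, or None."""
    for name in candidates:
        # Look in the given directories first (search_dirs is an assumption).
        for directory in search_dirs:
            path = os.path.join(directory, name)
            if os.path.isfile(path) and os.access(path, os.X_OK):
                return path
        # Fall back to whatever is on PATH.
        on_path = shutil.which(name)
        if on_path:
            return on_path
    return None
```

Newer names are listed first, so an old `quantize` binary is only used when nothing newer exists.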

Import into Ollama

Ollama makes running GGUF models simple. I created a Modelfile:

create_modelfile.sh
cat > Modelfile << 'EOF'
FROM ./my_model_q4_k_m.gguf
TEMPLATE """{{ .System }}{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
# Create the Ollama model
ollama create my-custom-model -f Modelfile
# Run the model (opens an interactive prompt)
ollama run my-custom-model
# At the >>> prompt, type a test message, for example:
#   Hello, how are you?

Ollama loaded my 4GB model and started chatting. No GPU required.
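When you export several models, generating the Modelfile from a script is less error-prone than editing a heredoc. A small sketch that renders the same four fields; the function name is hypothetical, and the default template simply mirrors the one above, so swap in the chat template your base model was trained with if it differs:

```python
def render_modelfile(gguf_path, temperature=0.7, num_ctx=4096,
                     template="{{ .System }}{{ .Prompt }}"):
    """Return Modelfile text with the same fields as the shell heredoc above."""
    lines = [
        f"FROM {gguf_path}",
        f'TEMPLATE """{template}"""',
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_modelfile("./my_model_q4_k_m.gguf"), end="")
```

Write the returned text to a file named `Modelfile` and pass it to `ollama create` as before.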

To share the model:

push_to_ollama.sh
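# Push to the Ollama registry (requires an ollama.com account and a model name prefixed with your username)
ollama cp my-custom-model username/my-custom-model
ollama push username/my-custom-model
# Others can then pull it
ollama pull username/my-custom-model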

Use with LM Studio

LM Studio provides a GUI for running local models. It loads GGUF files directly.

Option 1: Direct file loading

  1. Open LM Studio
  2. Go to “Local Server” tab
  3. Click “Select a model to load”
  4. Navigate to your GGUF file
  5. Load and start chatting

Option 2: Model directory

lmstudio_directory.sh
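# Copy GGUF into LM Studio's model directory
# LM Studio expects a publisher/model folder layout; "my-name" and "my-model" are placeholders
mkdir -p ~/.lmstudio/models/my-name/my-model/
cp my_model_q4_k_m.gguf ~/.lmstudio/models/my-name/my-model/
# Restart LM Studio to detect the new model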

LM Studio’s interface makes it easy to adjust parameters like temperature, context length, and sampling methods.

Test Your Exported Model

Always test the quantized model to ensure quality isn’t degraded:

test_gguf.sh
# Test with llama.cpp directly
./llama-cli -m my_model_q4_k_m.gguf \
    -p "Explain machine learning in simple terms" \
    -n 512 \
    --temp 0.7
# Test with Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "my-custom-model",
  "prompt": "Explain machine learning in simple terms",
  "stream": false
}'

Compare the output quality between the original and quantized versions. Q4_K_M should maintain 95%+ of original quality.
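To run the same prompt against several models from a script, the curl call translates directly to the Python standard library. A sketch, assuming Ollama is listening on its default port; the function names are mine:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, url=OLLAMA_URL):
    """Build the same non-streaming POST request as the curl command above."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=300):
    """Send the request and return the generated text from the 'response' field."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, `generate("my-custom-model", "Explain machine learning in simple terms")` returns the full answer as one string. Run it for each quantization you created and read the answers side by side.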

Troubleshooting Common Errors

I hit several issues during export. Here’s how I solved them:

Error: “Key ‘tokenizer.model’ not found”

ConversionError: Key 'tokenizer.model' not found in model directory

Solution: Copy tokenizer files from the base model:

copy_tokenizer.sh
# Copy tokenizer files from base model to your fine-tuned model
cp /path/to/base_model/tokenizer.model /path/to/your_model/
cp /path/to/base_model/tokenizer.json /path/to/your_model/
cp /path/to/base_model/tokenizer_config.json /path/to/your_model/
# Retry conversion
python convert_hf_to_gguf.py /path/to/your_model --outfile output.gguf --outtype f16

Error: “GGUF model not recognized by Ollama”

Error: model architecture not supported

Solution: Check architecture compatibility. Ollama supports:

  • Llama family (Llama, Llama 2, Llama 3)
  • Mistral family
  • Qwen family
  • Phi family
  • Gemma family

Verify with:

inspect_gguf.sh
# Inspect GGUF metadata with the gguf-dump tool from the gguf Python package
pip install gguf
gguf-dump --no-tensors my_model.gguf
# Check the general.architecture field

If your model architecture isn’t supported, you may need to wait for Ollama updates or use llama.cpp directly.
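The architecture name lives in the GGUF metadata under the key `general.architecture`, and you can read it without any extra tooling. The reader below is a minimal sketch, assuming a little-endian GGUF v2 or later file. The helper names are mine, and for anything beyond a quick check the official `gguf` Python package is the safer choice:

```python
import struct

# GGUF metadata value types with a fixed size: type id -> struct format.
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
            6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian value described by a struct format character."""
    fmt = "<" + fmt
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        # Arrays store the element type, the element count, then the elements.
        elem_type, count = _read(f, "I"), _read(f, "Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    return _read(f, _SCALARS[vtype])


def read_architecture(path):
    """Return the general.architecture string from a GGUF file, or None."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _version, _n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        for _ in range(n_kv):
            key = _read_string(f)
            value = _read_value(f, _read(f, "I"))
            if key == "general.architecture":
                return value
    return None
```

The function stops as soon as it finds the key, which in practice is near the start of the file, so it stays fast even on multi-gigabyte models.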

Error: “Out of memory during quantization”

RuntimeError: CUDA out of memory

Solution: This message is raised by PyTorch, so it most likely comes from a Python step (usually the LoRA merge) rather than from llama-quantize, which runs on the CPU. Hide the GPU so the work happens in system RAM:

low_memory_quantize.sh
# Force CPU-only for the merge step
CUDA_VISIBLE_DEVICES="" python merge_lora.py
# llama-quantize itself needs no GPU
./llama-quantize input.gguf output.gguf Q4_K_M

Quantization doesn’t require a GPU. It runs fine on CPU, just slower.

Error: “Output is nonsense after quantization”

If the quantized model produces gibberish:

  1. Test the merged model first - Before quantizing, test the FP16 GGUF:

    test_before_quantize.sh
    ./llama-cli -m my_model_f16.gguf -p "Hello" -n 100
  2. Check LoRA merge - The merge might have failed. Reload and test with transformers:

    test_merge.py
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained("merged_model")
    tokenizer = AutoTokenizer.from_pretrained("merged_model")
    inputs = tokenizer("Hello", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0]))
  3. Try a higher-precision quantization - If Q4_K_M is too lossy, try Q5_K_M or Q6_K.

Error: “Wrong outtype - got f32 instead of quantized”

This happens when --outtype isn’t specified correctly during conversion.

Solution: Always specify outtype explicitly:

correct_outtype.sh
# Correct: specify outtype
python convert_hf_to_gguf.py model_path --outfile model.gguf --outtype f16
# Then quantize separately
./llama-quantize model.gguf model_q4.gguf Q4_K_M

Why This Matters

I built a fine-tuned model that ran entirely offline on my laptop. No API keys. No cloud costs. No internet required. The PersonalForge project described it perfectly: “Run it offline forever.”

The comparison is striking:

Factor           Training Format       GGUF Format
File Size (7B)   ~14GB                 ~4GB (Q4_K_M)
RAM Required     16GB+ GPU VRAM        8GB CPU RAM
Inference Speed  Fast (GPU)            Acceptable (CPU)
Portability      Low (multiple files)  High (single file)
Local Tools      None                  Ollama, LM Studio, llama.cpp
Offline Use      No                    Yes

Best Practices I Learned

Through trial and error, I discovered several key practices:

1. Always merge before quantizing

LoRA adapters must be merged with the base model. Quantizing unmerged adapters produces garbage.

2. Test at each step

Don’t skip testing:

  • Test merged model (before GGUF conversion)
  • Test unquantized GGUF (before quantization)
  • Test quantized GGUF (before deployment)

This isolates where problems occur.

3. Q4_K_M is the sweet spot

I tested multiple quantization levels. Q4_K_M consistently provided the best balance:

  • Small enough for 8GB RAM
  • Fast enough for real-time chat
  • Quality nearly indistinguishable from original

4. Keep the FP16 GGUF

After quantizing, I keep the unquantized GGUF file. It’s useful for:

  • Re-quantizing to different formats later
  • Debugging quantization issues
  • Serving from more powerful hardware

5. Document your training config

Include model architecture, training parameters, and base model in the GGUF filename or metadata. Six months from now, you won’t remember which base model you used.
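One low-effort way to do this is to build the file name from the facts you will want later. A sketch with a hypothetical helper and naming scheme (base model, run name, quantization); adapt the fields to whatever you track:

```python
import re


def gguf_filename(base_model, run_name, quant):
    """Build a descriptive file name such as 'qwen2.5-3b-support-bot-q4_k_m.gguf'."""
    def slug(text):
        # Keep the last path component, lowercase it, replace unsafe characters.
        text = text.split("/")[-1].lower()
        return re.sub(r"[^a-z0-9._]+", "-", text).strip("-")
    return f"{slug(base_model)}-{slug(run_name)}-{slug(quant)}.gguf"
```

For example, `gguf_filename("Qwen/Qwen2.5-3B", "support bot", "Q4_K_M")` gives `qwen2.5-3b-support-bot-q4_k_m.gguf`, which still tells you the base model and quantization long after the training run.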

What I Built

After following this process, I had:

  • A 4GB GGUF file that runs on my laptop
  • Integration with Ollama for command-line use
  • Integration with LM Studio for GUI use
  • Complete offline capability - no internet needed
  • The ability to share the model file with others

The fine-tuning was the hard part. Exporting to GGUF was the practical step that made the model usable.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
