Why Is llama.cpp Faster Than LMStudio and Ollama for Local LLM Inference?

Purpose

This post explains why the llama.cpp CLI can deliver faster inference than LMStudio and Ollama, and how much of that gap comes down to default settings.

Problem

I was testing local LLM inference on my machine and noticed something strange. Running the same Qwen 3.5 9B model, I got 4.6 tokens/second on llama.cpp CLI but only 2.4 tokens/second on LMStudio.

That’s nearly a 2x difference for what should be the same underlying engine.

llama-cli-output.txt
# llama.cpp CLI
llama-cli -m qwen3.5-9b-q4_k_m.gguf -p "Explain quantum computing"
# Output: 4.6 tokens/second

lmstudio-output.txt
# LMStudio
# Same model, same hardware
# Output: 2.4 tokens/second

A Reddit user reported the same observation - nearly double the speed with raw llama.cpp compared to LMStudio and Ollama. So I dug into why.

Environment

  • macOS with Apple Silicon
  • Qwen 3.5 9B (Q4_K_M quantization)
  • 16GB unified memory
  • llama.cpp b4500
  • LMStudio 0.3.x
  • Ollama 0.5.x

The Direct Answer

Raw llama.cpp CLI is faster because it has minimal overhead and allows direct control over all inference parameters. But here’s the key insight: when comparing apples-to-apples (same quantization, context length, GPU offload, batch size, and prompt), the performance gap narrows significantly.

The 2x difference I saw wasn’t because llama.cpp is fundamentally faster. It was because LMStudio and Ollama apply different default settings that prioritize features over raw speed.

Architecture Differences

Let me break down what each tool adds on top of the core inference engine.

architecture-diagram.txt
+------------------+
|  llama.cpp CLI   |  <- Minimal overhead, direct parameter control
+------------------+
         |
         v
+------------------+
|  llama.cpp core  |  <- Core inference on GGUF models (all three tools use this)
+------------------+

+------------------+
|     LMStudio     |  <- Electron UI + model management + chat history
+------------------+
         |
         v
+------------------+
|    llama.cpp     |  <- Bundled engine
+------------------+

+------------------+
|      Ollama      |  <- API server + model registry + container-like storage
+------------------+
         |
         v
+------------------+
|    llama.cpp     |  <- Bundled engine
+------------------+

LMStudio overhead:

  • Electron-based UI (Chromium + Node.js runtime)
  • Model management and discovery features
  • Chat history persistence
  • Real-time UI updates during generation

Ollama overhead:

  • REST API server layer
  • Model registry and versioning
  • Container-like model storage
  • Request queuing and management

These features aren’t free. They consume CPU cycles and memory that could otherwise go to inference.

Default Settings Matter

The bigger factor is that LMStudio and Ollama don’t use the same defaults as raw llama.cpp. Here’s what I found on the versions listed above (defaults change between releases, so check your own install):

Setting          llama.cpp CLI    LMStudio            Ollama
Context length   512 (default)    8192 (default)      4096 (default)
Batch size       512              Varies by UI        Auto-tuned
GPU offload      Manual (-ngl)    Auto-detect         Auto-detect
Flash attention  Off by default   On by default       On by default
KV cache         Default          Optimized for chat  Optimized for API

A larger context length means a larger KV cache allocation, which takes memory away from everything else and can slow inference on a memory-constrained machine. LMStudio defaults to 8192 tokens of context while llama.cpp CLI defaults to 512 - that’s a 16x difference in KV cache size.
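To make the scaling concrete, here is a rough sketch of the usual KV cache size estimate for a standard transformer: 2 (keys and values) x layers x context length x KV heads x head dimension x bytes per element. The model dimensions below are hypothetical placeholders, not the real Qwen 3.5 9B configuration, and the sketch assumes an f16 cache (2 bytes per element). The absolute numbers are illustrative only; the 16x ratio between the two context lengths holds regardless of the dimensions.

```shell
#!/bin/sh
# kv_cache_mib N_CTX N_LAYERS N_KV_HEADS HEAD_DIM [BYTES_PER_ELEM]
# Prints the estimated KV cache size in MiB.
# Formula: 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes
kv_cache_mib() {
    awk -v ctx="$1" -v layers="$2" -v heads="$3" -v dim="$4" -v bytes="${5:-2}" \
        'BEGIN { printf "%.0f\n", 2 * layers * ctx * heads * dim * bytes / (1024 * 1024) }'
}

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128 (assumed values)
kv_cache_mib 512 32 8 128     # context of 512 tokens
kv_cache_mib 8192 32 8 128    # context of 8192 tokens, 16x the size
```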

Apples-to-Apples Comparison

To fairly compare, I ran tests with identical settings:

fair-comparison.sh
# llama.cpp with explicit settings
# -c: context size, -ngl: layers offloaded to the GPU, -b: batch size
llama-cli -m qwen3.5-9b-q4_k_m.gguf \
  -p "Explain quantum computing" \
  -c 4096 \
  -ngl 99 \
  -b 512 \
  --flash-attn
# LMStudio: Manually set context to 4096, same prompt
# Ollama: Set num_ctx=4096 in modelfile

comparison-results.txt
With identical settings (ctx=4096, flash-attn, full GPU offload):
  llama.cpp CLI: 4.2 tokens/second
  LMStudio:      3.8 tokens/second
  Ollama:        3.6 tokens/second

The gap shrank from 2x to about 10-15%. That remaining difference is the actual overhead from UI/API layers.
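For anyone who wants to check the arithmetic, here is a small sketch that computes the relative gap from two tokens/second readings. The function name gap_pct is my own, and the inputs are the measurements quoted in this post.

```shell
#!/bin/sh
# gap_pct BASELINE_TPS OTHER_TPS
# Prints how much slower OTHER is than BASELINE, as a percentage.
gap_pct() {
    awk -v base="$1" -v other="$2" \
        'BEGIN { printf "%.1f\n", (base - other) / base * 100 }'
}

gap_pct 4.2 3.8   # LMStudio vs llama.cpp, matched settings
gap_pct 4.2 3.6   # Ollama vs llama.cpp, matched settings
gap_pct 4.6 2.4   # LMStudio vs llama.cpp, default settings
```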

Where Overhead Comes From

Even with matched settings, LMStudio and Ollama have inherent overhead:

LMStudio (Electron overhead):

process-list.txt
# While LMStudio runs, you'll see these processes:
LM Studio Helper (Renderer)   ~150-300MB RAM
LM Studio Helper (GPU)        ~100-200MB RAM
Electron Framework            ~80-150MB RAM

The Electron framework runs a full Chromium browser and Node.js runtime. That’s memory and CPU not available for inference.
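To check the memory footprint on your own machine, a sketch like the following sums the resident memory of every process whose command contains a given string. The helper name sum_rss_mib is hypothetical, and it assumes ps reports RSS in KiB, which is the case on macOS and Linux.

```shell
#!/bin/sh
# sum_rss_mib PATTERN
# Reads "RSS_KIB COMMAND" lines on stdin and prints the total RSS, in MiB,
# of the lines whose text contains PATTERN (plain substring match).
sum_rss_mib() {
    awk -v pat="$1" \
        'index($0, pat) { kb += $1 } END { printf "%.0f\n", kb / 1024 }'
}

# Usage (run while LMStudio is open):
#   ps axo rss=,comm= | sum_rss_mib "LM Studio"
```

Resident memory is only a rough proxy, since shared pages are counted once per process, so treat the total as an upper bound.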

Ollama (API server overhead):

ollama-processes.txt
# Ollama runs as a service:
ollama serve   # API server process; it spawns a runner subprocess for inference
ollama run     # CLI client that sends requests to the server
# Plus HTTP overhead for each request:
# JSON parsing, request routing, response serialization

Every API call goes through HTTP parsing, JSON serialization, and request routing. The overhead per token is small, but it adds up.
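If you want to measure Ollama’s generation speed yourself, the non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (in nanoseconds). The sketch below turns those two fields into tokens/second. The helper name tokens_per_sec is my own, and the commented usage assumes curl and jq are installed and the server is on its default port.

```shell
#!/bin/sh
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# Converts Ollama's eval_count and eval_duration (nanoseconds) to tokens/second.
tokens_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN {
        if (ns <= 0) { print "0.0"; exit }   # avoid division by zero
        printf "%.1f\n", n / (ns / 1000000000)
    }'
}

# Usage against a local server (assumes curl and jq are available):
#   resp=$(curl -s http://localhost:11434/api/generate \
#     -d '{"model":"qwen3.5:9b","prompt":"Explain quantum computing","stream":false}')
#   tokens_per_sec "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"
```

This measures generation only, so prompt processing and model load time are excluded, which makes it comparable to the generation speed the other tools report.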

Hardware Context Matters

The performance gap also depends on your hardware:

hardware-impact.txt
High-end GPU (RTX 4090):
  - llama.cpp: 80 tokens/s
  - LMStudio: 75 tokens/s
  - Gap: ~6% (GPU is the bottleneck, not software)

Mid-range GPU (RTX 3060):
  - llama.cpp: 25 tokens/s
  - LMStudio: 20 tokens/s
  - Gap: ~20% (more noticeable overhead)

Apple Silicon (M1/M2):
  - llama.cpp: 4.6 tokens/s
  - LMStudio: 2.4 tokens/s
  - Gap: ~50% at default settings (unified memory, overhead more impactful)

On Apple Silicon, the unified memory architecture means UI overhead competes directly with inference for memory and memory bandwidth. That’s likely why I saw a bigger gap at default settings than users with discrete GPUs.

When to Use Each Tool

Use llama.cpp CLI when:

  • You need maximum performance
  • You’re benchmarking or testing
  • You want full control over every parameter
  • You’re running batch processing
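For the benchmarking case, llama.cpp also ships a separate llama-bench tool that reports prompt and generation speed averaged over repeated runs. Below is a minimal wrapper sketch. The function name bench and the DRY_RUN switch are my own additions, and the prompt length, generation length, and repetition count are arbitrary choices rather than recommendations.

```shell
#!/bin/sh
# bench MODEL [NGL] [BATCH]
# Runs llama-bench with explicit settings so runs are comparable.
# With DRY_RUN=1 it only prints the command instead of executing it.
bench() {
    model=$1
    ngl=${2:-99}      # layers to offload to the GPU
    batch=${3:-512}   # batch size
    # -p: prompt tokens, -n: generated tokens, -fa 1: flash attention on,
    # -r: number of repetitions to average over
    set -- llama-bench -m "$model" -p 512 -n 128 \
        -ngl "$ngl" -b "$batch" -fa 1 -r 3
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage:
#   DRY_RUN=1 bench qwen3.5-9b-q4_k_m.gguf   # show the command
#   bench qwen3.5-9b-q4_k_m.gguf             # run it
```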

Use LMStudio when:

  • You want a GUI for model discovery
  • You prefer a chat-style interface
  • You don’t want to memorize CLI flags
  • The 10-15% overhead is acceptable

Use Ollama when:

  • You need an API server for applications
  • You want easy model management (pull/run)
  • You’re building tools that need programmatic access
  • Docker-style workflow appeals to you

Optimizing LMStudio Performance

If you want LMStudio closer to llama.cpp speeds:

  1. Reduce context length - Set it to what you actually need, not the maximum.

  2. Disable unnecessary features - Turn off chat history if you don’t need it.

  3. Check GPU offload - Ensure your layers are fully offloaded.

  4. Use the right quantization - Q4_K_M is usually the sweet spot for speed/quality.

lmstudio-settings.txt
Recommended LMStudio settings for performance:
- Context Length: 2048 (unless you need more)
- GPU Offload: Maximum available
- Flash Attention: On
- Batch Size: 512 (match llama.cpp default)

Optimizing Ollama Performance

For Ollama, create a custom modelfile with explicit settings:

custom-modelfile.txt
FROM qwen3.5:9b
PARAMETER num_ctx 4096
PARAMETER num_batch 512
PARAMETER num_gpu 99

Then create your optimized model:

create-optimized-model.sh
# Assumes the settings above are saved in a file named Modelfile
ollama create qwen-optimized -f Modelfile
ollama run qwen-optimized "Explain quantum computing"

Summary

In this post, I showed why llama.cpp CLI appears faster than LMStudio and Ollama. The key point is that raw CLI has minimal overhead and allows direct control over all parameters.

The 2x speed difference I observed came from two sources:

  1. Default settings differ - LMStudio uses larger context by default, which allocates more memory and slows inference.

  2. Actual overhead - LMStudio’s Electron UI and Ollama’s API layer consume resources that could go to inference.

When I matched settings apples-to-apples, the gap narrowed to 10-15%. That’s the true overhead cost of convenience features.

Choose the tool that matches your needs. If you’re benchmarking or need every token per second, use llama.cpp CLI. If you want a GUI and don’t mind a small performance trade-off, LMStudio is fine. If you need an API for applications, Ollama’s overhead is the price of that convenience.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
