How I Set Up a Local LLM Inference Server (And the Memory Errors I Hit Along the Way)

I downloaded Qwen 3.5-27B and tried to run it locally. The model crashed my GPU within seconds.

The error message was cryptic: CUDA out of memory. But I had 24GB of VRAM—shouldn’t that be enough for a 27B model?

Turns out, setting up a local LLM inference server is more than just loading a model. You need proper memory management, the right software stack, and an understanding of how the initialization process works.
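
A quick back-of-the-envelope check would have answered my question before I even hit download. The sketch below counts only the weights (parameters times bytes per parameter); the function name and the precision table are mine, for illustration, and real usage is higher once the KV cache and runtime overhead are added:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and framework overhead, which all add more.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Return approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    need = weight_memory_gb(27, precision)
    verdict = "fits" if need < 24 else "does not fit"
    print(f"27B @ {precision}: ~{need:.1f} GB of weights, {verdict} in 24 GB")
```

At 16-bit precision a 27B model needs roughly 54 GB for weights alone, so a 24GB card only has a chance with aggressive quantization.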

The Real Problem

When I first tried running a large language model locally, I made every mistake possible:

  1. Used default batch sizes
  2. Ignored deprecation warnings
  3. Didn’t set memory limits
  4. Skipped flash attention installation
  5. Wondered why everything was slow and crashing

After hours of debugging, I found a Reddit thread where someone shared their vLLM startup logs for Qwen 3.5-27B. Those logs revealed exactly what happens during initialization—and helped me understand where I was going wrong.

Understanding the Initialization Process

When you start an LLM inference server, it goes through distinct stages. Here’s what I learned from analyzing those startup logs:

initialization-stages.txt
Stage 1: Tokenizer Loading
├── Load vocabulary files
├── Build tokenizer model
└── Initialize special tokens
Stage 2: Model Weight Loading
├── Map weights to GPU memory
├── Apply quantization if set
└── Verify model architecture
Stage 3: GPU Memory Allocation
├── Reserve memory for KV cache
├── Set up PagedAttention blocks
└── Configure memory pooling
Stage 4: Final Setup
├── Initialize API server
├── Warm up the model
└── Start accepting requests

Each stage has potential pitfalls. Let me walk through how to set this up properly.

Choosing Your Stack

There are three main approaches, each with trade-offs:

Option A: Transformers (Simple but Limited)

The quickest way to get something running:

setup-transformers.sh
# Create and activate virtual environment
python -m venv llm-env
source llm-env/bin/activate
# Install core dependencies
pip install torch transformers accelerate

Then a basic server:

simple_server.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from flask import Flask, request, jsonify
import torch
app = Flask(__name__)
model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,  # Note: torch_dtype is deprecated, use dtype instead
)
@app.route("/v1/chat/completions", methods=["POST"])
def chat():
    data = request.json
    messages = data.get("messages", [])
    # add_generation_prompt=True appends the marker that starts the assistant turn
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return jsonify({"choices": [{"message": {"role": "assistant", "content": response}}]})
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

Pros: Easy setup, great for testing
Cons: Limited performance, no continuous batching, manual API implementation

Option B: vLLM (Production Ready)

This is what I recommend for serious use:

setup-vllm.sh
pip install vllm
# Optional: recent vLLM wheels already bundle attention kernels,
# so this extra install may not be needed for vLLM itself
pip install flash-attn --no-build-isolation

Start the server:

start-vllm-server.sh
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--max-model-len 4096

Pros: OpenAI-compatible API, continuous batching, PagedAttention
Cons: More complex, requires GPU

Option C: llama.cpp (Lightweight)

For CPU or Apple Silicon:

start-llamacpp-server.sh
# Current llama.cpp builds name the binary llama-server (older builds: ./server)
./llama-server -m qwen-7b.gguf \
--host 0.0.0.0 \
--port 8000 \
--ctx-size 4096 \
--n-gpu-layers 32

Pros: Works on CPU, lightweight, GGUF quantization
Cons: Different model format, less feature-rich

Why vLLM Won Me Over

After trying all three, vLLM became my go-to for a few key reasons:

  1. OpenAI Compatibility: Drop-in replacement for OpenAI API
  2. PagedAttention: Efficient memory management for KV cache
  3. Continuous Batching: Handles multiple requests efficiently
  4. Production Ready: Battle-tested in production environments
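
To make the PagedAttention point concrete, here is a toy illustration of the idea of handing out KV cache memory in small fixed-size blocks instead of reserving the full context length per request. This is not vLLM’s implementation; the block size of 16 tokens, the function names, and the request lengths are all assumptions for the sake of the example:

```python
import math

BLOCK_SIZE = 16  # tokens per block; an assumed value for illustration


def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks required to hold num_tokens of KV cache."""
    return math.ceil(num_tokens / block_size)


def wasted_slots_paged(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Unused token slots when memory is handed out block by block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


def wasted_slots_contiguous(num_tokens: int, max_model_len: int) -> int:
    """Unused slots when every request reserves max_model_len up front."""
    return max_model_len - num_tokens


request_lengths = [37, 250, 1200]  # hypothetical request sizes in tokens
for n in request_lengths:
    print(
        f"{n:>5} tokens: paged wastes {wasted_slots_paged(n)} slots, "
        f"contiguous wastes {wasted_slots_contiguous(n, 4096)}"
    )
```

With block-wise allocation the waste per request is always less than one block, which is why many short requests can share the same GPU.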

Here’s what the initialization looks like with vLLM (based on those Reddit logs I found):

vllm-startup-log.txt
INFO: Loading model weights...
WARNING: `torch_dtype` is deprecated! Use `dtype` instead!
INFO: Model weights loaded in 12.3s
INFO: Allocating GPU memory for KV cache...
INFO: Using 90% of GPU memory (21.6GB / 24GB)
INFO: Initializing PagedAttention with 1024 blocks
INFO: Server running on http://0.0.0.0:8000
WARNING: The fast path is not available.
Install flash-linear-attention and causal-conv1d for better performance.

Those warnings matter. Let me explain what they mean.

The Warnings You Shouldn’t Ignore

Warning 1: torch_dtype is deprecated

I kept seeing this and ignoring it. Bad idea: deprecated arguments eventually get removed. Recent Transformers releases accept dtype in place of torch_dtype:

dtype_fix.py
# OLD (deprecated)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
)
# NEW (correct)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float16,
)

Warning 2: Fast path not available

This warning means optional fast kernels are missing, so the model falls back to a slower code path. The warning itself names the packages to install:

install-flash-attention.sh
# Packages named in the warning
pip install flash-linear-attention causal-conv1d
# Flash attention, for faster attention in general
pip install flash-attn --no-build-isolation

After installing these, my inference speed improved by 30-40%.
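
Your numbers will differ from mine, so it is worth measuring before and after. Below is a minimal timing sketch; `generate` is a hypothetical callable that runs one request and returns how many tokens it produced, and `fake_generate` is a stand-in so the example runs without a server:

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[str], int], prompt: str, runs: int = 3) -> float:
    """Average generated tokens per second over several runs."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate(prompt: str) -> int:
    """Stand-in for a real request: sleeps briefly, reports 50 tokens."""
    time.sleep(0.01)
    return 50


print(f"{tokens_per_second(fake_generate, 'Hello'):.0f} tokens/s (fake backend)")
```

Swap `fake_generate` for a function that calls your server and returns the completion token count, then run it once with and once without the extra packages.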

Testing Your Server

Once your server is running, test it with an OpenAI-compatible client:

test_client.py
import openai
# Point to your local server
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Local servers don't need real keys
)
response = client.chat.completions.create(
    # vLLM expects the name the server was started with (the --model value)
    model="Qwen/Qwen2.5-7B",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
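
Model loading can take a while, so the first request often fails simply because the server is not up yet. A small polling helper avoids that. This is a sketch using only the standard library; the `/health` path in the usage note is an assumption about your server, so check which endpoint yours actually exposes:

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float = 300.0, interval_s: float = 2.0
) -> bool:
    """Poll `probe` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Usage would look like `wait_until_ready(lambda: http_probe("http://localhost:8000/health"))` before sending the first chat request.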

Common Mistakes I Made

Mistake 1: Not Setting Memory Limits

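I assumed “auto” meant optimal. It doesn’t: leave the flags off and vLLM falls back to its defaults, which may reserve more memory and a longer context than your GPU can handle.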

memory-settings.sh
# Don't do this - can OOM
python -m vllm.entrypoints.openai.api_server --model huge-model
# Do this - set limits
python -m vllm.entrypoints.openai.api_server \
--model huge-model \
--gpu-memory-utilization 0.85 \
--max-model-len 2048
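
To see what --gpu-memory-utilization leaves you with, subtract the weights from the capped VRAM. The sketch below is a simplification (it ignores activation memory and other overhead), and the weight sizes are rough assumptions, not measurements:

```python
def kv_cache_budget_gb(
    total_vram_gb: float, gpu_memory_utilization: float, weights_gb: float
) -> float:
    """Memory left for the KV cache after weights, within the utilization cap."""
    return total_vram_gb * gpu_memory_utilization - weights_gb


# Assumed weight sizes: ~15 GB for a 7B model in FP16, ~54 GB for 27B in FP16.
for name, weights_gb in [("7B fp16", 15.0), ("27B fp16", 54.0)]:
    budget = kv_cache_budget_gb(24.0, 0.85, weights_gb)
    status = "OK" if budget > 0 else "out of memory before serving starts"
    print(f"{name}: {budget:+.1f} GB for KV cache -> {status}")
```

A negative budget means no flag combination will save you; the model has to be smaller or quantized.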

Mistake 2: Ignoring Context Length

The default context length might be too large for your GPU:

context-length-guide.txt
GPU Memory    Recommended Max Length
────────────────────────────────────
8GB           2048 tokens
12GB          4096 tokens
16GB          8192 tokens
24GB          16384 tokens (with quantization)
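
The table is a rule of thumb. The real cost of context depends on the model’s architecture: every token stores a key and a value in every layer. Here is a rough sketch of that arithmetic. The layer count, KV head count, and head dimension below are assumed values for illustration, so read the real ones from your model’s config.json:

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, per_token_bytes: int) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for one sequence."""
    return context_tokens * per_token_bytes / 1e9


# Assumed architecture values; replace with the ones from config.json.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
for context in (2048, 4096, 8192, 16384):
    print(f"{context:>6} tokens: {kv_cache_gb(context, per_token):.2f} GB per sequence")
```

Multiply the per-sequence figure by the number of concurrent requests you expect, and compare it with what is left after the weights are loaded.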

Mistake 3: Forgetting Environment Variables

Some models need specific settings:

env-variables.sh
export HF_HOME="/path/to/model/cache" # Where models are stored
export CUDA_VISIBLE_DEVICES="0" # Which GPU to use
export VLLM_ATTENTION_BACKEND="FLASHINFER" # Faster attention
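
If you would rather keep these settings in a script than in your shell profile, a small launcher can pass them per run. This is only a sketch: the helper names are mine, and it uses just the flags and variables already shown in this article:

```python
import os
from typing import Dict, List


def build_vllm_command(
    model: str, gpu_memory_utilization: float, max_model_len: int, port: int = 8000
) -> List[str]:
    """Assemble the server command using the flags shown in this article."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--host", "0.0.0.0",
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]


def build_env(overrides: Dict[str, str]) -> Dict[str, str]:
    """Copy the current environment and apply per-launch overrides."""
    env = dict(os.environ)
    env.update(overrides)
    return env


cmd = build_vllm_command("Qwen/Qwen2.5-7B", 0.85, 8192)
env = build_env({"CUDA_VISIBLE_DEVICES": "0"})
print(" ".join(cmd))
# To actually launch the server: subprocess.run(cmd, env=env, check=True)
```

Keeping the environment overrides next to the command makes it obvious which GPU and settings a given run used.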

Why This Matters

Running LLMs locally isn’t just about saving API costs (though that’s nice). It’s about:

  1. Data Privacy: Your data never leaves your machine
  2. No Rate Limits: Generate as much as you want
  3. Learning: Understand how models actually work
  4. Customization: Fine-tune and modify at will
  5. Reliability: No API outages or service changes

When to Use What

stack-decision-tree.txt
Do you have a GPU?
├── Yes → Do you need production features?
│   ├── Yes → Use vLLM
│   └── No → Use transformers (simpler)
└── No → Do you have Apple Silicon?
    ├── Yes → Use llama.cpp with Metal
    └── No → Use llama.cpp with CPU (slow but works)
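
The same decision tree, written as a function in case you want to drop it into a setup script. The function and parameter names are mine; the logic just mirrors the tree:

```python
def choose_stack(
    has_gpu: bool, needs_production: bool = False, apple_silicon: bool = False
) -> str:
    """Mirror the decision tree above as a function."""
    if has_gpu:
        return "vLLM" if needs_production else "transformers"
    if apple_silicon:
        return "llama.cpp with Metal"
    return "llama.cpp with CPU"


print(choose_stack(has_gpu=True, needs_production=True))  # vLLM
print(choose_stack(has_gpu=False, apple_silicon=True))    # llama.cpp with Metal
```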

My Final Setup

After all the trial and error, here’s what I use:

my-final-setup.sh
# vLLM with optimal settings for 24GB GPU
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--tensor-parallel-size 1 \
--enable-prefix-caching

This gives me:

  • Fast inference with flash attention
  • Enough context for most tasks
  • Room for batching multiple requests
  • Stable memory usage that doesn’t crash

Setting up a local LLM inference server takes some trial and error, but once you understand the memory constraints and initialization process, it becomes straightforward. Start with the simple transformers approach to learn, then move to vLLM when you need production features.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

