How I Set Up a Local LLM Inference Server (And the Memory Errors I Hit Along the Way)

Mar 27, 2026

I downloaded Qwen 3.5-27B and tried to run it locally. The model crashed my GPU within seconds.

The error message was cryptic: CUDA out of memory. But I had 24GB of VRAM—shouldn’t that be enough for a 27B model?

Turns out, setting up a local LLM inference server is more than just loading a model. You need proper memory management, the right software stack, and an understanding of how the initialization process works.
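A quick back-of-the-envelope check shows why 24GB was never going to be enough. The sketch below assumes the weights dominate memory and uses the usual bytes-per-parameter figures (2 for FP16/BF16, 1 for 8-bit, 0.5 for 4-bit). The function name is only for illustration, and real usage is higher once the KV cache and activations are added on top.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9


# Weights-only estimate for a 27B-parameter model at common precisions.
for label, bytes_per_param in [("FP16/BF16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"27B @ {label}: ~{weight_memory_gb(27, bytes_per_param):.1f} GB")
```

At 2 bytes per parameter the weights alone come to roughly 54 GB, so on a 24 GB card a 27B model is only realistic with 4-bit quantization (about 13.5 GB) and a modest context length.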

The Real Problem

When I first tried running a large language model locally, I made every mistake possible:

Used default batch sizes
Ignored deprecation warnings
Didn’t set memory limits
Skipped flash attention installation
Wondered why everything was slow and crashing

After hours of debugging, I found a Reddit thread where someone shared their vLLM startup logs for Qwen 3.5-27B. Those logs revealed exactly what happens during initialization—and helped me understand where I was going wrong.

Understanding the Initialization Process

When you start an LLM inference server, it goes through distinct stages. Here’s what I learned from analyzing those startup logs:

Stage 1: Tokenizer Loading
├── Load vocabulary files
├── Build tokenizer model
└── Initialize special tokens

Stage 2: Model Weight Loading
├── Map weights to GPU memory
├── Apply quantization if set
└── Verify model architecture

Stage 3: GPU Memory Allocation
├── Reserve memory for KV cache
├── Set up PagedAttention blocks
└── Configure memory pooling

Stage 4: Final Setup
├── Initialize API server
├── Warm up the model
└── Start accepting requests

Each stage has potential pitfalls. Let me walk through how to set this up properly.
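Stage 3 is where context length turns into memory. As a rough sketch, each token stores one key and one value vector per layer, so the per-token cost can be estimated from a handful of model dimensions. The dimensions below are illustrative assumptions for a 7B-class model with grouped-query attention; read the real values from the model's config.json.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Illustrative dimensions (assumed, not taken from the article).
per_token = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
per_sequence = per_token * 4096  # one sequence at a 4096-token context

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence / 2**20:.0f} MiB per 4096-token sequence")
```

Multiply the per-sequence figure by the number of concurrent requests you expect and it becomes clear why the KV cache, not just the weights, decides whether the server stays up.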

Choosing Your Stack

There are three main approaches, each with trade-offs:

Option A: Transformers (Simple but Limited)

The quickest way to get something running:

# Create and activate virtual environment
python -m venv llm-env
source llm-env/bin/activate

# Install core dependencies
pip install torch transformers accelerate

Then a basic server:

from transformers import AutoModelForCausalLM, AutoTokenizer
from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16  # Note: torch_dtype is deprecated, use dtype instead
)

@app.route("/v1/chat/completions", methods=["POST"])
def chat():
    data = request.json
    messages = data.get("messages", [])

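    # add_generation_prompt=True appends the assistant turn marker so the model answers
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)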

    return jsonify({"choices": [{"message": {"content": response}}]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

Pros: Easy setup, great for testing
Cons: Limited performance, no continuous batching, manual API implementation

Option B: vLLM (Production Ready)

This is what I recommend for serious use:

pip install vllm

# For better performance (optional but recommended)
pip install flash-attn --no-build-isolation

Start the server:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096

Pros: OpenAI-compatible API, continuous batching, PagedAttention
Cons: More complex, requires GPU
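Loading weights can take a while, so scripts that talk to the server should wait until it is actually up. Here is a minimal sketch using only the standard library. It assumes the server answers GET /health with HTTP 200 once it is ready (vLLM's OpenAI-compatible server exposes such a route in the versions I have seen, but check yours); the helper names are my own.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Call `probe` repeatedly until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage (assumes the server above is listening on port 8000):
#   ready = wait_until_ready(lambda: http_probe("http://localhost:8000/health"))
#   print("server ready" if ready else "server did not come up in time")
```

Passing the probe in as a function keeps the waiting logic independent of HTTP, which also makes it easy to test without a running server.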

Option C: llama.cpp (Lightweight)

For CPU or Apple Silicon:

# Newer llama.cpp builds name the binary llama-server (older ones: ./server)
./llama-server -m qwen-7b.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --ctx-size 4096 \
    --n-gpu-layers 32

Pros: Works on CPU, lightweight, GGUF quantization Cons: Different model format, less feature-rich

Why vLLM Won Me Over

After trying all three, vLLM became my go-to for a few key reasons:

OpenAI Compatibility: Drop-in replacement for OpenAI API
PagedAttention: Efficient memory management for KV cache
Continuous Batching: Handles multiple requests efficiently
Production Ready: Battle-tested in production environments
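To make the PagedAttention idea concrete, here is a toy sketch of the bookkeeping behind it: the KV cache is split into fixed-size blocks, and each sequence borrows blocks from a shared pool only as it grows. This is a deliberately simplified illustration with made-up names and an assumed block size of 16 tokens, not vLLM's actual implementation.

```python
class PagedKVAllocator:
    """Toy block allocator: sequences borrow fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> block ids
        self.lengths: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Account for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # every block so far is full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


pool = PagedKVAllocator(num_blocks=1024)
for _ in range(40):
    pool.append_token("request-a")  # 40 tokens fit in 3 blocks of 16
print(len(pool.block_tables["request-a"]), "blocks in use")
pool.free("request-a")
print(len(pool.free_blocks), "blocks free")
```

Because memory is handed out one block at a time, a short request never reserves space for the full context length, which is what leaves room for batching many requests together.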

Here’s what the initialization looks like with vLLM (based on those Reddit logs I found):

INFO: Loading model weights...
WARNING: `torch_dtype` is deprecated! Use `dtype` instead!
INFO: Model weights loaded in 12.3s
INFO: Allocating GPU memory for KV cache...
INFO: Using 90% of GPU memory (21.6GB / 24GB)
INFO: Initializing PagedAttention with 1024 blocks
INFO: Server running on http://0.0.0.0:8000
WARNING: The fast path is not available.
         Install flash-linear-attention and causal-conv1d for better performance.

Those warnings matter. Let me explain what they mean.

The Warnings You Shouldn’t Ignore

Warning 1: `torch_dtype` is deprecated

I kept seeing this and ignoring it. Bad idea: a deprecated argument can be removed in a later release, and the fix is a one-word rename from torch_dtype to dtype:

# OLD (deprecated)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16
)

# NEW (correct)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float16
)

Warning 2: Fast path not available

This warning means the optimized kernels for some layers are not installed, so the model falls back to a slower code path:

# The two packages the warning names
pip install flash-linear-attention causal-conv1d

# Flash attention, for faster attention in general
pip install flash-attn --no-build-isolation

After installing these, my inference speed improved by 30-40%.

Testing Your Server

Once your server is running, test it with an OpenAI-compatible client:

import openai

# Point to your local server
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Local servers don't need real keys
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B",  # vLLM expects the name of the model the server is serving
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response.choices[0].message.content)
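If a request comes back with a model-not-found error, the model field probably does not match what the server is serving. A minimal sketch for checking, assuming the server exposes the OpenAI-style GET /v1/models route and returns the usual list format; the helper names and the sample payload are illustrative.

```python
import json
import urllib.request


def served_model_ids(models_payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style `GET /v1/models` response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_served_model_ids(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Query a running server and return the ids it accepts as `model`."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return served_model_ids(json.load(resp))


# Example payload in the OpenAI list format (values are illustrative).
sample = {"object": "list", "data": [{"id": "Qwen/Qwen2.5-7B", "object": "model"}]}
print(served_model_ids(sample))
```

Whatever id comes back is the string to pass as model in the client call.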

Common Mistakes I Made

Mistake 1: Not Setting Memory Limits

I assumed “auto” meant optimal. It doesn’t.

# Don't do this - can OOM
python -m vllm.entrypoints.openai.api_server --model huge-model

# Do this - set limits
python -m vllm.entrypoints.openai.api_server \
    --model huge-model \
    --gpu-memory-utilization 0.85 \
    --max-model-len 2048
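It helps to see what those two flags leave you with. The utilization flag caps the share of GPU memory vLLM may claim in total, and whatever remains after the weights goes to the KV cache. The sketch below is rough arithmetic under stated assumptions: the 1 GB overhead for activations and runtime buffers is a guess, and the weight size is an approximation for a 7B-class model in FP16.

```python
def kv_cache_budget_gb(total_vram_gb: float, gpu_memory_utilization: float,
                       weights_gb: float, overhead_gb: float = 1.0) -> float:
    """Memory left for the KV cache after weights and a rough runtime overhead."""
    budget = total_vram_gb * gpu_memory_utilization
    return budget - weights_gb - overhead_gb


# 24 GB card, roughly 15 GB of FP16 weights (an approximation).
for utilization in (0.85, 0.90):
    left = kv_cache_budget_gb(24, utilization, weights_gb=15)
    print(f"utilization {utilization:.2f}: ~{left:.1f} GB left for KV cache")
```

A negative result means the model does not fit at that precision at all, no matter how short you make the context.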

Mistake 2: Ignoring Context Length

The default context length might be too large for your GPU:

GPU Memory    Recommended Max Length
────────────────────────────────────
8GB           2048 tokens
12GB          4096 tokens
16GB          8192 tokens
24GB          16384 tokens (with quantization)
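If you script your launches, the table can be turned into a small lookup that picks a starting value for --max-model-len. This sketch just encodes the rows above as rules of thumb; the function name is hypothetical, and the right value still depends on the model and its precision.

```python
from typing import Optional

# (GPU memory in GB, suggested max context length), copied from the table above.
# The 24 GB row assumes a quantized model.
RECOMMENDED_MAX_LEN = [(8, 2048), (12, 4096), (16, 8192), (24, 16384)]


def recommended_max_len(vram_gb: float) -> Optional[int]:
    """Return the table's value for the largest tier that fits in `vram_gb`."""
    best = None
    for tier_gb, max_len in RECOMMENDED_MAX_LEN:
        if vram_gb >= tier_gb:
            best = max_len
    return best  # None when the card is smaller than the smallest tier


print(recommended_max_len(24))
print(recommended_max_len(10))
```

A 10 GB card falls back to the 8 GB row, which errs on the safe side.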

Mistake 3: Forgetting Environment Variables

A few environment variables control where models are cached, which GPU is used, and which attention backend vLLM picks:

export HF_HOME="/path/to/model/cache"  # Where models are stored
export CUDA_VISIBLE_DEVICES="0"        # Which GPU to use
export VLLM_ATTENTION_BACKEND="FLASHINFER"  # Faster attention

Why This Matters

Running LLMs locally isn’t just about saving API costs (though that’s nice). It’s about:

Data Privacy: Your data never leaves your machine
No Rate Limits: Generate as much as you want
Learning: Understand how models actually work
Customization: Fine-tune and modify at will
Reliability: No API outages or service changes

When to Use What

Do you have a GPU?
├── Yes → Do you need production features?
│         ├── Yes → Use vLLM
│         └── No → Use transformers (simpler)
│
└── No → Do you have Apple Silicon?
          ├── Yes → Use llama.cpp with Metal
          └── No → Use llama.cpp with CPU (slow but works)
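The same decision tree, written as a small function in case you want to drop it into a setup script. The function and parameter names are my own; the branches mirror the tree above one to one.

```python
def choose_stack(has_gpu: bool, needs_production: bool = False,
                 apple_silicon: bool = False) -> str:
    """Pick an inference stack following the decision tree above."""
    if has_gpu:
        return "vLLM" if needs_production else "transformers"
    return "llama.cpp (Metal)" if apple_silicon else "llama.cpp (CPU)"


print(choose_stack(has_gpu=True, needs_production=True))
print(choose_stack(has_gpu=False, apple_silicon=True))
```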

My Final Setup

After all the trial and error, here’s what I use:

# vLLM with optimal settings for 24GB GPU
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --tensor-parallel-size 1 \
    --enable-prefix-caching
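If you would rather keep these settings in code than in shell history, here is one possible sketch that builds the same command from a small config object. The class and function names are hypothetical; the flags are exactly the ones from the command above.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ServerConfig:
    """Settings from the command above, with the same defaults."""
    model: str = "Qwen/Qwen2.5-7B"
    host: str = "0.0.0.0"
    port: int = 8000
    gpu_memory_utilization: float = 0.85
    max_model_len: int = 8192
    tensor_parallel_size: int = 1
    enable_prefix_caching: bool = True


def build_command(cfg: ServerConfig) -> list[str]:
    """Translate the config into the argument list for the vLLM API server."""
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",  # current interpreter
        "--model", cfg.model,
        "--host", cfg.host,
        "--port", str(cfg.port),
        "--gpu-memory-utilization", str(cfg.gpu_memory_utilization),
        "--max-model-len", str(cfg.max_model_len),
        "--tensor-parallel-size", str(cfg.tensor_parallel_size),
    ]
    if cfg.enable_prefix_caching:
        cmd.append("--enable-prefix-caching")
    return cmd


def launch(cfg: ServerConfig) -> subprocess.Popen:
    """Start the server as a child process (requires vLLM to be installed)."""
    return subprocess.Popen(build_command(cfg))


print(" ".join(build_command(ServerConfig())))
```

Keeping the defaults in one place makes it easy to try a different context length or utilization without retyping the whole command.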

This gives me:

Fast inference with flash attention
Enough context for most tasks
Room for batching multiple requests
Stable memory usage that doesn’t crash

Setting up a local LLM inference server takes some trial and error, but once you understand the memory constraints and initialization process, it becomes straightforward. Start with the simple transformers approach to learn, then move to vLLM when you need production features.

Final Words + More Resources

My intention with this article was to share my knowledge and experience so that it helps others. If you want to contact me, you can reach me by email.
