What's the Best Local LLM to Run on Mac Mini M4 with 24-32GB RAM?
I just got my Mac Mini M4 with 24GB of RAM, excited to run local LLMs. I fired up Ollama and tried pulling Llama 3.1 70B. It downloaded for 30 minutes, then crashed with an out-of-memory error. My enthusiasm quickly turned to frustration.
Turns out, choosing the right model for your memory constraints isn’t straightforward. After weeks of testing, benchmarking, and learning from mistakes, here’s what actually works.
The Memory Math Problem
The first thing I had to understand: your “24GB” isn’t really 24GB available for models.
    Total RAM:            24 GB
    macOS overhead:       ~4-6 GB
    Running apps:         ~2-4 GB
    Context window:       ~2-4 GB (depends on usage)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Available for model:  ~10-16 GB

This is why the 70B model failed spectacularly: even at 4-bit quantization, its weights alone need roughly 40 GB. Even the 13B models need careful quantization to fit properly.
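The budget above is plain subtraction, so it is easy to redo for your own machine. Here is a minimal sketch; the function name and parameter names are illustrative, and the inputs are the low and high ends of the overhead ranges listed above.

```python
def available_for_model(total_gb, os_gb, apps_gb, context_gb):
    """Return the RAM (in GB) left for model weights after the listed overheads."""
    return total_gb - (os_gb + apps_gb + context_gb)


# Worst case: every overhead at the top of its range.
tight = available_for_model(24, os_gb=6, apps_gb=4, context_gb=4)
# Best case: every overhead at the bottom of its range.
roomy = available_for_model(24, os_gb=4, apps_gb=2, context_gb=2)

print(f"Available for model: {tight}-{roomy} GB")
```

Swap in 32 for the total to see how much extra room the larger configuration buys.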
What Actually Fits: 24GB RAM Edition
After testing dozens of models, here’s what works reliably on a 24GB Mac Mini M4:
Llama 3.1 8B (Q4_K_M) - The Safe Default
    ollama pull llama3.1:8b

This model uses about 5-6GB of memory and leaves plenty of room for context. Performance is impressive:

    Model:    Llama 3.1 8B Q4_K_M
    Memory:   ~5.5 GB
    Speed:    50-80 tokens/sec
    Quality:  Good for general tasks
    Best for: Chat, writing, simple code

I use this as my daily driver for quick questions and text generation. It’s fast enough to feel snappy and capable enough for most tasks.
Mistral 7B (Q4_K_M) - Speed Demon
    ollama pull mistral:7b

Slightly smaller and even faster:

    Model:    Mistral 7B Q4_K_M
    Memory:   ~4.5 GB
    Speed:    60-90 tokens/sec
    Quality:  Excellent for size
    Best for: Fast responses, summarization

Mistral surprised me with how good it is for its size. If you prioritize speed over maximum capability, this is the one.
DeepSeek Coder 6.7B - Code Specialist
    ollama pull deepseek-coder:6.7b

For coding tasks, this outperforms larger general models:

    Model:    DeepSeek Coder 6.7B Q4_K_M
    Memory:   ~4 GB
    Speed:    55-75 tokens/sec
    Quality:  Best for code in this size
    Best for: Code completion, debugging, explanation

I tested it on a complex TypeScript refactoring task, and it nailed the logic while faster models got confused.
Phi-3 Mini 3.8B - Efficiency King
    ollama pull phi3:mini

Tiny but surprisingly capable:

    Model:    Phi-3 Mini 3.8B Q4_K_M
    Memory:   ~2.5 GB
    Speed:    80-100+ tokens/sec
    Quality:  Good for reasoning tasks
    Best for: Quick queries, constrained environments

When I need something that runs alongside other memory-heavy apps, Phi-3 is my go-to.
What Fits: 32GB RAM Edition
If you splurged for the 32GB model (I wish I had), you get access to significantly better models:
DeepSeek R1 14B - The Reasoning Beast
    ollama pull deepseek-r1:14b

This is the standout model for 32GB systems:

    Model:    DeepSeek R1 14B Q4_K_M
    Memory:   ~9-11 GB
    Speed:    35-50 tokens/sec
    Quality:  Excellent reasoning capability
    Best for: Complex reasoning, analysis, math

I tested it on logic puzzles that stumped the 8B models, and it consistently found solutions. The chain-of-thought reasoning is impressive.
Llama 2 13B - Better Quality

    ollama pull llama2:13b

Llama 3.1 ships only in 8B, 70B, and 405B sizes, so there is no llama3.1:13b tag to pull; the 13B model in the Llama family on Ollama is llama2:13b. It is a step up in size from the 8B:

    Model:    Llama 2 13B Q4_K_M
    Memory:   ~7-8 GB
    Speed:    25-40 tokens/sec
    Quality:  Very good general purpose
    Best for: Writing, analysis, detailed responses

Qwen 2.5 Coder 14B - Code + Reasoning
    ollama pull qwen2.5-coder:14b

Combines strong coding with reasoning capabilities:

    Model:    Qwen 2.5 Coder 14B Q4_K_M
    Memory:   ~9 GB
    Speed:    30-45 tokens/sec
    Quality:  Excellent for code + reasoning
    Best for: Complex coding tasks, architecture

Mistral Small 24B - Pushing the Limit
    ollama pull mistral-small:24b

This one requires careful memory management:

    Model:    Mistral Small 24B Q4_K_M
    Memory:   ~16 GB
    Speed:    15-25 tokens/sec
    Quality:  Very high for local
    Best for: When quality > speed

You can run it on 32GB, but close other apps first. I learned this the hard way when my browser tabs caused an OOM crash.
Performance Benchmarks
Here are the actual speeds I measured on my M4:
| Model           | Size | Memory | M4 24GB Speed |
|-----------------|------|--------|---------------|
| Llama 3.1 8B    | 8B   | 5.5 GB | 50-80 t/s     |
| Mistral 7B      | 7B   | 4.5 GB | 60-90 t/s     |
| DeepSeek Coder  | 6.7B | 4 GB   | 55-75 t/s     |
| Phi-3 Mini      | 3.8B | 2.5 GB | 80-100+ t/s   |
| Llama 2 13B     | 13B  | 8 GB   | 25-40 t/s     |
| DeepSeek R1 14B | 14B  | 10 GB  | 35-50 t/s     |

These are with Metal GPU acceleration enabled. Without it, speeds drop by 60-70%.
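If you want to reproduce numbers like these yourself, Ollama's HTTP API reports what you need: the final response from `/api/generate` includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). The sketch below is one way to turn those into tokens per second, not necessarily how the figures above were collected. The helper names are my own, and `sample` is a made-up response used only to show the arithmetic.

```python
import json
import urllib.request


def tokens_per_second(response):
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model, prompt, host="http://localhost:11434"):
    """Send one non-streaming request to a local Ollama server and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Made-up response fields, just to show the calculation: 120 tokens in 2 seconds.
sample = {"eval_count": 120, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/sec")
```

With the server running, `tokens_per_second(generate("llama3.1:8b", "Write a short poem about Mac computers"))` gives a live measurement. Run it a few times, since the first call also pays the model load time.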
Installation Options
Option 1: Ollama (Recommended)
The simplest way to get started:
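    # Install Ollama (on macOS use Homebrew or the app from ollama.com;
    # the install.sh script is Linux-only)
    brew install ollama
    brew services start ollama

    # Pull a model
    ollama pull llama3.1:8b

    # Run it
    ollama run llama3.1:8b

    # Test performance (--verbose prints timing stats after the reply)
    ollama run --verbose llama3.1:8b "Write a short poem about Mac computers"

Ollama automatically uses Metal GPU acceleration on Apple Silicon. No configuration needed.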
Option 2: LM Studio
If you prefer a GUI:
1. Download from lmstudio.ai
2. Open the app
3. Search for "llama 3.1 8b"
4. Download the Q4_K_M quantization
5. Click "Chat" to start

LM Studio gives you more control over parameters and shows memory usage in real-time.
Option 3: OpenClaw + Ollama Integration
For a more integrated experience with your existing tools:
    # OpenClaw can connect to a local Ollama instance
    # Set the OLLAMA_HOST environment variable
    export OLLAMA_HOST="http://localhost:11434"

    # OpenClaw will automatically detect running models
    # and use them for inference

Common Mistakes I Made
Mistake 1: Choosing the Largest Model That “Fits”
I initially thought “24GB RAM, so I can use a 22GB model!” Wrong.
    WRONG:
    Model size:     22 GB
    System + apps:  -6 GB
    Available:      -4 GB (negative!)
    Result:         CRASH

    RIGHT:
    Model size:     6 GB
    System + apps:  -6 GB
    Context:        -4 GB
    Available:      8 GB headroom
    Result:         Smooth operation

Leave at least 30-40% of your RAM free for context and system overhead.
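The 30-40% rule is easy to turn into a quick check before you pull anything. This is a rough sketch: `fits_with_headroom` is a name I made up, and the default of 35% is simply the midpoint of that range.

```python
def fits_with_headroom(model_gb, total_ram_gb, free_fraction=0.35):
    """True if the model leaves at least `free_fraction` of total RAM unused."""
    return model_gb <= total_ram_gb * (1 - free_fraction)


print(fits_with_headroom(22, 24))  # the "WRONG" case above
print(fits_with_headroom(6, 24))   # the "RIGHT" case above
print(fits_with_headroom(16, 32))  # a ~16 GB model on a 32GB machine
```

It ignores how much context you actually use, so treat a pass as "worth trying", not as a guarantee.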
Mistake 2: Ignoring Quantization
I downloaded the full FP16 version of a model, and it used about 3x the memory for minimal quality gain.

    # FP16 (full precision) - DON'T DO THIS
    ollama pull llama3.1:8b-instruct-fp16   # ~16 GB memory

    # Q4_K_M (4-bit quantization) - DO THIS
    ollama pull llama3.1:8b                 # ~5.5 GB memory

Q4_K_M gives you about 95% of the quality for about 35% of the memory. Q5_K_M is slightly better but not worth the extra memory for most use cases.
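You can estimate these sizes before downloading: weight memory is roughly parameter count times bits per weight, divided by 8. A back-of-the-envelope sketch follows. The 4.8 bits-per-weight figure for Q4_K_M is my own approximation (it mixes precisions, so it lands a bit above 4), and the result covers weights only, not context or runtime overhead.

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


Q4_K_M_BITS = 4.8  # assumed average bits per weight; not an exact constant

print(f"8B at FP16:    {weights_gb(8, 16):.1f} GB")
print(f"8B at Q4_K_M:  {weights_gb(8, Q4_K_M_BITS):.1f} GB")
print(f"70B at Q4_K_M: {weights_gb(70, Q4_K_M_BITS):.1f} GB")
```

The gap between the weights-only estimate and what you see in practice is the runtime's own buffers plus the context cache.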
Mistake 3: Not Verifying Metal Acceleration
I ran models for a week before realizing Metal wasn’t enabled:
    # Check if GPU is being used
    ollama run llama3.1:8b "Hello"

    # In another terminal, watch GPU usage
    sudo powermetrics --samplers gpu_power -i 1000

If GPU usage is near zero, Metal isn’t working. For me, reinstalling Ollama fixed it.
Mistake 4: Wrong Model for the Task
Using a general model for coding tasks:
    WRONG: Use Llama 3.1 8B for complex code refactoring
    RIGHT: Use DeepSeek Coder or Qwen Coder for code tasks

    WRONG: Use code model for creative writing
    RIGHT: Use Llama or Mistral for general tasks

Specialized models outperform general ones in their domain, even at smaller sizes.
Local vs Cloud: When Does Local Make Sense?
After months of usage, here’s my honest take:
Local Wins When:
- Privacy matters - Your data never leaves your machine
- High volume usage - No per-token costs, use as much as you want
- Offline needed - Works on airplanes, remote locations
- Low latency - No network round-trip delays
- Learning/experimenting - Great for understanding how LLMs work
Cloud Wins When:
- Occasional use - If you’d spend less than $20/month on cloud usage, that’s cheaper than buying hardware
- Best quality needed - GPT-4/Claude Opus still beat local models
- No technical setup - Just use the web interface
- Complex reasoning - Frontier models handle nuance better
    Local Setup (Mac Mini M4 24GB):
    - Hardware: $599 (one-time)
    - Electricity: ~$5/month (heavy use)
    - Software: Free
    - Total Year 1: ~$659

    Cloud Equivalent (GPT-4 level):
    - API costs: ~$100-300/month (heavy use)
    - Total Year 1: ~$1200-3600

    Break-even: ~4-18 months depending on usage

For my usage (10-20 hours/week of local LLM use), the Mac Mini paid for itself in 8 months.
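Break-even depends almost entirely on what you would otherwise spend in the cloud, so it’s worth plugging in your own numbers. A simple sketch, assuming a one-time hardware cost and flat monthly costs on both sides; the function and its parameter names are illustrative.

```python
def break_even_months(hardware_cost, local_monthly, cloud_monthly):
    """Months until cumulative cloud spend exceeds hardware plus local running costs."""
    monthly_saving = cloud_monthly - local_monthly
    if monthly_saving <= 0:
        return float("inf")  # cloud is never more expensive at this usage level
    return hardware_cost / monthly_saving


# Hardware and electricity from the list above; try your own cloud bill here.
for cloud_bill in (20, 50, 100, 300):
    months = break_even_months(599, 5, cloud_bill)
    print(f"${cloud_bill}/month in the cloud: break-even after {months:.1f} months")
```

The lower your monthly cloud bill, the longer the hardware takes to pay for itself.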
My Recommended Setup
After all this experimentation, here’s my current setup on 24GB:
    primary: llama3.1:8b          # General tasks, good balance
    coding:  deepseek-coder:6.7b  # Code completion and review
    quick:   phi3:mini            # Fast queries, runs alongside apps
    speed:   mistral:7b           # When I need quick responses

I switch between them based on task type. Having multiple models installed only costs disk space, not memory (until you run them).
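If you get tired of remembering which tag goes with which job, a tiny lookup can pick the model for you. This is a sketch of the idea, not a tool I’m claiming exists: `MODELS`, `pick_model`, and `ollama_command` are names made up for illustration, and the function only builds the command without running it.

```python
# Task-to-model mapping, taken from the setup listed above.
MODELS = {
    "primary": "llama3.1:8b",
    "coding": "deepseek-coder:6.7b",
    "quick": "phi3:mini",
    "speed": "mistral:7b",
}


def pick_model(task):
    """Return the model tag for a task, falling back to the general-purpose model."""
    return MODELS.get(task, MODELS["primary"])


def ollama_command(task, prompt):
    """Build the `ollama run` argument list for a task (does not execute it)."""
    return ["ollama", "run", pick_model(task), prompt]


print(ollama_command("coding", "Explain this regex: ^\\d{3}-\\d{4}$"))
```

Pass the returned list to `subprocess.run` when you actually want to run it.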
Final Setup Checklist
Before you start, verify:
    # 1. Check available memory
    vm_stat | head -5

    # 2. Verify Metal GPU support
    system_profiler SPDisplaysDataType | grep "Metal"

    # 3. Install Ollama (Homebrew shown; the install.sh script is Linux-only)
    brew install ollama
    brew services start ollama

    # 4. Pull your first model (start small!)
    ollama pull llama3.1:8b

    # 5. Test it works
    ollama run llama3.1:8b "Hello, are you running on Metal?"

    # 6. Monitor resources while running
    # In another terminal:
    top -l 1 | grep -E "PhysMem|GPU"

Key Takeaways
- 24GB RAM: Stick to 7-8B models (Q4_K_M quantization)
- 32GB RAM: You can run 13-14B models comfortably
- Always use Q4_K_M quantization for best size/quality tradeoff
- Leave 30-40% RAM free for context and system
- Match model to task: code models for code, general for general
- Metal acceleration is critical: verify it’s working
Local LLMs on Mac Mini M4 are genuinely useful. They’re not replacing GPT-4 for complex reasoning, but for day-to-day tasks, privacy-sensitive work, and high-volume usage, they’re fantastic. Start with Llama 3.1 8B, see if it meets your needs, and explore from there.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.