What's the Best Local LLM for Coding on a Mac with 64GB RAM?
Purpose
I thought local LLMs just weren’t for me. After testing llama3.1:8b, qwen2.5-coder:1.5b, and several other models on my Mac with 64GB RAM, the results were disappointing. The models couldn’t complete basic coding tasks without hallucinations or errors.
Turns out, I was doing it wrong. The problem wasn’t local LLMs - it was my model selection. Here’s what I learned about choosing the right model for Mac hardware.
Environment
My setup:
- Machine: Mac with M-series chip
- RAM: 64GB unified memory
- Use Case: Coding assistant, code generation, debugging
- Tools: LM Studio, Ollama, llama.cpp

The 64GB RAM constraint is key. It’s enough for serious models, but not for 70B+ models at a high-quality quantization with a large context window. The sweet spot is in the 27B-35B range with proper quantization.
What Happened
The Failed Experiments
I started with what seemed reasonable - popular models with smaller footprints:
Model              | Size  | Result
-------------------|-------|------------------------------------
llama3.1:8b        | 8B    | Poor reasoning, frequent errors
qwen2.5-coder:1.5b | 1.5B  | Trivial tasks only, hallucinations
qwen2.5-coder:32b  | 32B   | Better, but still disappointing
devstral           | -     | Underwhelming for coding
mistral-nemo       | 12B   | Good chat, bad at code
qwen3.5 (default)  | -     | Wrong quantization settings

The problems were consistent:
- Outdated models - llama3.1:8b lacks modern reasoning capabilities
- Undersized models - 1.5B and 7B variants can’t handle complex code
- Wrong quantization - Using defaults without understanding memory allocation
- Insufficient context - Not allocating enough for large context windows
The Reddit Thread That Changed Everything
I found a thread on r/LocalLLM with someone having the exact same problem. The highest-voted response was blunt:
“You’re using either ancient models or very small models… Qwen3.5 35b at Q6 quant should fit in your RAM with 128k context size. It’s smarter and modern.”
Another user added:
“For coding agents, you need context window of at least 80k, preferably 128k or more.”
The original poster updated later:
“Thanks to the comment calling out ancient models… I got Qwen 3.5… was able to do simple tasks like create directory, install vite, run locally - it did really well.”
How to Solve It
The solution: Qwen 3.5 35B at Q6 quantization.
Why This Model?
- Parameters: 35B
- Quantization: Q6_K (recommended)
- Model Size: ~30GB
- Context Window: 128K tokens
- Memory Required: ~35-40GB with full context
- Your RAM: 64GB
- Headroom: ~24GB for the system and other applications

Q6 quantization provides the best balance of quality and size. Lower quantizations (Q4) lose too much accuracy. Higher ones (Q8) leave little room for a large context.
LM Studio Configuration
LM Studio is the easiest way to get started:
    Model: Qwen/Qwen2.5-35B-Instruct-GGUF
    Quantization: Q6_K
    File Size: ~30GB

    Settings:
      temperature: 0.1
      top_p: 0.9
      repeat_penalty: 1.1
      context_length: 131072
      gpu_layers: -1

The repository name shown in LM Studio’s model search may differ from the one above, so pick the Qwen 3.5 35B build at Q6_K. The gpu_layers: -1 setting offloads all layers to the GPU (unified memory on Mac), which is critical for performance.
Ollama Configuration
If you prefer Ollama:
    # Pull the model
    ollama pull qwen2.5:35b

Adjust the tag to whatever the Ollama library lists for the model you want. Create a Modelfile for proper settings:

    FROM qwen2.5:35b

    PARAMETER temperature 0.1
    PARAMETER num_ctx 131072
    PARAMETER top_p 0.9
    PARAMETER repeat_penalty 1.1

Then create your custom model:

    ollama create qwen-coder -f Modelfile
    ollama run qwen-coder

llama.cpp Direct
For maximum control:
    ./llama-cli \
      -m qwen2.5-35b-instruct-q6_k.gguf \
      -c 131072 \
      -ngl 99 \
      --temp 0.1 \
      --top-p 0.9 \
      --repeat-penalty 1.1 \
      -p "Create a Python function that validates email addresses"

The -ngl 99 flag asks for up to 99 layers on the GPU, which in practice offloads every layer. The -c 131072 flag sets the context window to 128K tokens.
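If you would rather call the model from a script than from a terminal, the qwen-coder model from the Ollama section can be reached over Ollama's local HTTP API. The sketch below is a minimal, untuned example: it assumes an Ollama server is running on the default port, and the helper names `build_payload` and `generate` are my own, not part of any library.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "qwen-coder") -> dict:
    """Build a request body that mirrors the Modelfile parameters."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object, not a stream
        "options": {
            "temperature": 0.1,
            "top_p": 0.9,
            "repeat_penalty": 1.1,
            "num_ctx": 131072,
        },
    }


def generate(prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Create a Python function that validates email addresses")` should return the model's reply as a string, provided the server is up and the model has been created.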
The Reason
Why Model Size Matters
Size    | Reasoning | Code Quality | Context Handling
--------|-----------|--------------|------------------
1.5B    | Poor      | Basic        | Limited
7-8B    | Fair      | Good         | Moderate
27-35B  | Excellent | Excellent    | Large (128K)
70B+    | Superior  | Superior     | Large

The jump from 8B to 35B is dramatic. It’s not incremental - it’s a qualitative difference in capability.
Why Q6 Quantization?
Quantization trades a small quality loss for massive memory savings:
Quantization | Quality Loss | Size (35B model) | Fits 64GB + 128K context?
-------------|--------------|------------------|---------------------------
Q4           | ~5-8%        | ~20GB            | Yes, easily
Q5           | ~3-5%        | ~25GB            | Yes
Q6           | ~1-3%        | ~30GB            | Yes (sweet spot)
Q8           | <1%          | ~38GB            | Tight fit
FP16         | 0%           | ~70GB            | No

Q6 is the sweet spot for 64GB RAM. You get near-full quality with room for a large context window.
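The sizes in the table can be roughly reproduced from the parameter count and the number of bits each quantization spends per weight. The sketch below is a back-of-the-envelope estimate, not a measurement: the bits-per-weight values are approximations I am assuming for common GGUF quant types, the 10GB context allowance comes from the ~35-40GB figure earlier in this article, and the 8GB system reserve is a guess.

```python
# Approximate bits per weight for common GGUF quantization types.
# These values are assumptions; real files vary by model and quant mix.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate the size of the weights in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


def fits(params_billion: float, quant: str, ram_gb: float = 64,
         context_gb: float = 10, system_gb: float = 8) -> bool:
    """Rough check: weights + context allowance + system reserve <= RAM."""
    needed = weights_gb(params_billion, quant) + context_gb + system_gb
    return needed <= ram_gb


for quant in BITS_PER_WEIGHT:
    size = weights_gb(35, quant)
    print(f"{quant:7s} ~{size:5.1f} GB  fits in 64GB: {fits(35, quant)}")
```

Changing `ram_gb` or `params_billion` gives a quick first answer for other machines and model sizes, though actual memory use should always be checked in the tool you run.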
Why Context Window Matters for Coding
Coding requires large context windows. Consider what fits at each size:

Context Size | What Fits
-------------|-----------------------------------------
32K          | ~2-3 medium files
64K          | ~5-6 medium files or 1 large file
80K          | Minimum for most coding tasks
128K         | Full project structure + multiple files

With 128K context, the model can understand your entire project structure, not just the current file.
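To see where your own project lands in that table, you can estimate its token count before loading it into a prompt. The sketch below assumes roughly 4 characters per token, which is only a rule of thumb (real tokenizers differ by model and by language), and the function names, the file suffix list, and the 8192-token reserve are all hypothetical choices of mine.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers differ


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return len(text) // CHARS_PER_TOKEN


def project_tokens(root: str, suffixes=(".py", ".js", ".ts", ".md")) -> int:
    """Sum the estimated tokens of all matching files under root."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total += estimate_tokens(path.read_text(errors="ignore"))
    return total


def fits_context(tokens: int, context: int = 131072, reserve: int = 8192) -> bool:
    """Keep `reserve` tokens free for the instructions and the model's reply."""
    return tokens + reserve <= context
```

For example, `fits_context(project_tokens("."))` run from a project root gives a first indication of whether the whole codebase could go into a 128K window.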
Why Temperature 0.1 for Coding?
Low temperature (0.1-0.3) makes the model more deterministic, which is what you want for code:
Temperature | Behavior
------------|----------------------------------
0.1-0.3     | Deterministic, consistent output
0.5-0.7     | Balanced creativity
0.8-1.0     | Creative, inconsistent code

For coding tasks, consistency beats creativity. You want the same prompt to produce similar, correct code every time.
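The reason low temperature is more deterministic is that the model's raw scores are divided by the temperature before being turned into probabilities, so a small value exaggerates the gap between the top candidate and the rest. The sketch below illustrates this with made-up scores for three candidate tokens; it shows the general technique, not the exact sampling code of any of the tools above.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after scaling by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens (illustrative only).
logits = [2.0, 1.5, 0.5]

for t in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.1 almost all of the probability lands on the highest-scoring token, while at 1.0 the alternatives keep a real chance of being sampled, which is where the run-to-run variation comes from.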
Common Mistakes I Made
Mistake 1: Using “Popular” Models
llama3.1:8b is popular, but it’s outdated for coding tasks. Popularity doesn’t equal capability.
Mistake 2: Going Too Small
1.5B and 7B models seem attractive for speed, but they can’t handle real coding work. The time saved on inference is lost on debugging hallucinated code.
Mistake 3: Ignoring Quantization
I downloaded models without checking quantization. A Q4 model runs faster but produces worse code than Q6.
Mistake 4: Not Allocating Context
I ran with the default 4K context. Coding requires understanding relationships between files, and 4K is useless for anything beyond single functions.
Mistake 5: Wrong Temperature
Using the default temperature of 0.7 produced inconsistent code. Lower is better for deterministic output.
Performance Comparison
After fixing my setup, here’s what I observed:
Task                  | Old Setup (8B Q4) | New Setup (35B Q6)
----------------------|-------------------|----------------------
Create directory      | Failed            | Success
Install dependencies  | Partial           | Success
Write functions       | Basic             | Complex + documented
Debug code            | Poor              | Good
Refactor across files | Failed            | Success
Explain architecture  | Generic           | Specific + accurate

The difference is night and day.
Summary
For Mac users with 64GB RAM, the optimal local LLM for coding is:
Qwen 3.5 35B at Q6 quantization with 128K context window.
Key settings:
- Temperature: 0.1 (deterministic output)
- Context: 131072 tokens (128K)
- GPU layers: -1 or 99 (full offload)
- Quantization: Q6_K (best quality/size ratio)
The mistake isn’t that local LLMs aren’t ready for coding. The mistake is using outdated, undersized, or misconfigured models. With the right setup, local LLMs on Mac hardware are genuinely useful for development work.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.