What's the Best Local LLM for Coding on a Mac with 64GB RAM?

Purpose

I thought local LLMs just weren’t for me. After testing llama3.1:8b, qwen2.5-coder:1.5b, and several other models on my Mac with 64GB RAM, the results were disappointing. The models couldn’t complete basic coding tasks without hallucinations or errors.

Turns out, I was doing it wrong. The problem wasn’t local LLMs - it was my model selection. Here’s what I learned about choosing the right model for Mac hardware.

Environment

My setup:

Hardware Specs
Machine: Mac with M-series chip
RAM: 64GB unified memory
Use Case: Coding assistant, code generation, debugging
Tools: LM Studio, Ollama, llama.cpp

The 64GB RAM constraint is key. It’s enough for serious models, but not for 70B+ models at a high-quality quantization with a large context window. The sweet spot is the 27B-35B range with proper quantization.

What Happened

The Failed Experiments

I started with what seemed reasonable - popular models with smaller footprints:

Initial Model Tests
Model | Size | Result
-------------------|-------|--------
llama3.1:8b | 8B | Poor reasoning, frequent errors
qwen2.5-coder:1.5b | 1.5B | Trivial tasks only, hallucinations
qwen2.5-coder:32b | 32B | Better, but still disappointing
devstral | - | Underwhelming for coding
mistral-nemo | 12B | Good chat, bad at code
qwen3.5 (default) | - | Wrong quantization settings

The problems were consistent:

  1. Outdated models - llama3.1:8b lacks modern reasoning capabilities
  2. Undersized models - 1.5B and 7-8B variants can’t handle complex code
  3. Wrong quantization - Using defaults without understanding memory allocation
  4. Insufficient context - Not allocating enough for large context windows

The Reddit Thread That Changed Everything

I found a thread on r/LocalLLM with someone having the exact same problem. The highest-voted response was blunt:

“You’re using either ancient models or very small models… Qwen3.5 35b at Q6 quant should fit in your RAM with 128k context size. It’s smarter and modern.”

Another user added:

“For coding agents, you need context window of at least 80k, preferably 128k or more.”

The original poster updated later:

“Thanks to the comment calling out ancient models… I got Qwen 3.5… was able to do simple tasks like create directory, install vite, run locally - it did really well.”

How to Solve It

The solution: Qwen 3.5 35B at Q6 quantization.

Why This Model?

Qwen 3.5 35B Specifications
Parameters: 35B
Quantization: Q6_K (recommended)
Model Size: ~30GB
Context Window: 128K tokens
Memory Required: ~35-40GB with full context
Your RAM: 64GB
Headroom: ~24GB for system and context

Q6 quantization provides the best balance of quality and size. Lower quantizations (Q4) lose noticeably more accuracy. Higher ones (Q8) leave little room for a large context window.
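As a rough check on the memory figures above, here is a minimal sketch of how the context window adds to the footprint of the weights. It assumes a plain transformer with grouped-query attention and an fp16 KV cache. The layer count, KV-head count, and head dimension are illustrative placeholders, not Qwen 3.5’s published architecture, so take the real values from the model card before trusting the result.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB for a standard attention transformer."""
    # Every token stores one key and one value vector per layer per KV head.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30


def headroom_gb(ram_gb: float, weights_gb: float, kv_cache_gb: float) -> float:
    """Memory left for the OS and other apps after weights and cache."""
    return ram_gb - weights_gb - kv_cache_gb


# Hypothetical architecture values, chosen only for illustration.
cache = kv_cache_gib(n_layers=40, n_kv_heads=4, head_dim=128,
                     context_tokens=131072)
print(f"KV cache at 128K context: {cache:.1f} GiB")
print(f"Headroom on 64GB: {headroom_gb(64, 30, cache):.1f} GB")
```

The estimate treats GB and GiB as interchangeable, which is fine for a quick budget but not for exact planning. Models with more KV heads or layers need a much larger cache at the same context size.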

LM Studio Configuration

LM Studio is the easiest way to get started:

LM Studio Settings
Model: Qwen 3.5 35B (GGUF)
Quantization: Q6_K
File Size: ~30GB
Settings:
  temperature: 0.1
  top_p: 0.9
  repeat_penalty: 1.1
  context_length: 131072
  gpu_layers: -1

The gpu_layers: -1 setting offloads all layers to GPU (unified memory on Mac), which is critical for performance.

Ollama Configuration

If you prefer Ollama:

ollama-setup.sh
# Pull the model (confirm the exact tag in the Ollama library)
ollama pull qwen3.5:35b

Create a Modelfile for proper settings:

Modelfile
FROM qwen3.5:35b
PARAMETER temperature 0.1
PARAMETER num_ctx 131072
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

Then create your custom model:

create-ollama-model.sh
ollama create qwen-coder -f Modelfile
ollama run qwen-coder
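Once the custom model exists, it can also be called from a script instead of the interactive prompt. The sketch below uses Ollama’s local REST endpoint (`/api/generate` on the default port 11434) with only the standard library. The helper names `build_payload` and `generate` are made up for this example, and the model name assumes the `qwen-coder` model created above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt: str, model: str = "qwen-coder") -> dict:
    """Build a non-streaming generate request for the custom model."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }


def generate(prompt: str) -> str:
    """Send the prompt to a running Ollama server and return its reply text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the Ollama server running, `generate("Create a Python function that validates email addresses")` should return the model’s reply as a string. The temperature and context settings come from the Modelfile, so the request does not need to repeat them.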

llama.cpp Direct

For maximum control:

llamacpp-run.sh
# Point -m at your local Qwen 3.5 35B Q6_K GGUF file
./llama-cli \
  -m qwen3.5-35b-q6_k.gguf \
  -c 131072 \
  -ngl 99 \
  --temp 0.1 \
  --top-p 0.9 \
  --repeat-penalty 1.1 \
  -p "Create a Python function that validates email addresses"

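The -ngl 99 flag requests that up to 99 layers be offloaded to the GPU. The model has fewer layers than that, so in practice everything is offloaded. The -c 131072 flag sets the context window to 128K tokens.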

The Reason

Why Model Size Matters

Model Size vs Capability
Size | Reasoning | Code Quality | Context Handling
--------|-----------|--------------|------------------
1.5B | Poor | Basic | Limited
7-8B | Fair | Good | Moderate
27-35B | Excellent | Excellent | Large (128K)
70B+ | Superior | Superior | Large

The jump from 8B to 35B is dramatic. It’s not incremental - it’s a qualitative difference in capability.

Why Q6 Quantization?

Quantization trades a small quality loss for massive memory savings:

Quantization Comparison
Quantization | Quality Loss | Size (35B model) | Fits 64GB + 128K context?
-------------|--------------|-------------------|---------------------------
Q4 | ~5-8% | ~20GB | Yes, easily
Q5 | ~3-5% | ~25GB | Yes
Q6 | ~1-3% | ~30GB | Yes (sweet spot)
Q8 | <1% | ~38GB | Tight fit
FP16 | 0% | ~70GB | No

Q6 is the sweet spot for 64GB RAM. You get near-full quality with room for a large context window.
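The sizes in the table can be sanity-checked with simple arithmetic: parameters times bits per weight, divided by eight. The sketch below does that for a 35B model. The bits-per-weight figures are approximate values for common GGUF quantization types and are an assumption of this example; real file sizes vary by model because different tensors may use different precisions.

```python
# Approximate bits per weight for common GGUF quantization types.
# These are rough figures; actual file sizes vary by model and tensor mix.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def weights_size_gb(params_billions: float, quant: str) -> float:
    """Estimate weight file size in GB (10^9 bytes) for a quantization type."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{weights_size_gb(35, quant):.0f} GB")
```

The results land close to the table’s rounded numbers, which is a useful check when a download page does not list the file size for the quantization you want.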

Why Context Window Matters for Coding

Coding requires large context windows. Consider what fits in 128K tokens:

Context Window Capacity
Context Size | What Fits
-------------|----------------------------------
32K | ~2-3 medium files
64K | ~5-6 medium files or 1 large file
80K | Minimum for most coding tasks
128K | Full project structure + multiple files

With 128K context, the model can understand your entire project structure, not just the current file.
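To get a feel for whether a set of files fits a given window, a character-based estimate is often enough. The sketch below assumes roughly four characters per token, which is only a rule of thumb (real tokenizers differ, and code often tokenizes less efficiently than prose). The file names, sizes, and the reply reserve are hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb; real tokenizers vary, especially on code


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string from its character length."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(files: dict[str, str], context_tokens: int,
                    reserve_for_reply: int = 8192) -> bool:
    """Check whether all files fit, leaving room for the model's answer."""
    used = sum(estimate_tokens(body) for body in files.values())
    return used + reserve_for_reply <= context_tokens


# Hypothetical project: three files of about 40,000 characters each.
project = {
    "app.py": "x" * 40_000,
    "utils.py": "y" * 40_000,
    "models.py": "z" * 40_000,
}
print(fits_in_context(project, 32_768))   # 32K window
print(fits_in_context(project, 131_072))  # 128K window
```

For an exact count, run the files through the model’s own tokenizer instead of this estimate.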

Why Temperature 0.1 for Coding?

Low temperature (0.1-0.3) makes the model more deterministic, which is what you want for code:

Temperature Impact on Code Generation
Temperature | Behavior
------------|--------------------------------
0.1-0.3 | Deterministic, consistent output
0.5-0.7 | Balanced creativity
0.8-1.0 | Creative, inconsistent code

For coding tasks, consistency beats creativity. You want the same prompt to produce similar, correct code every time.
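The effect of temperature is easy to see on a toy example. Sampling divides the model’s raw token scores (logits) by the temperature before converting them to probabilities, so a low value concentrates almost all probability on the top candidate. The scores below are invented for illustration and do not come from any real model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw token scores into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.5, 0.5]
for t in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.1 the top token takes nearly all of the probability mass, while at 1.0 the alternatives keep a real chance of being picked. That spread is where run-to-run variation in generated code comes from.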

Common Mistakes I Made

Mistake 1: Choosing Models by Popularity

llama3.1:8b is popular, but it’s outdated for coding tasks. Popularity doesn’t equal capability.

Mistake 2: Going Too Small

1.5B and 7-8B models seem attractive for speed, but they can’t handle real coding work. The time saved on inference is lost on debugging hallucinated code.

Mistake 3: Ignoring Quantization

I downloaded models without checking quantization. A Q4 model runs faster but produces worse code than Q6.

Mistake 4: Not Allocating Context

I ran with the default 4K context. Coding requires understanding relationships between files, and 4K is useless for anything beyond single functions.

Mistake 5: Wrong Temperature

Using default 0.7 temperature produced inconsistent code. Lower is better for deterministic output.

Performance Comparison

After fixing my setup, here’s what I observed:

Task Performance Comparison
Task | Old Setup (8B Q4) | New Setup (35B Q6)
--------------------------|-------------------|--------------------
Create directory | Failed | Success
Install dependencies | Partial | Success
Write functions | Basic | Complex + documented
Debug code | Poor | Good
Refactor across files | Failed | Success
Explain architecture | Generic | Specific + accurate

The difference is night and day.

Summary

For Mac users with 64GB RAM, the optimal local LLM for coding is:

Qwen 3.5 35B at Q6 quantization with 128K context window.

Key settings:

  • Temperature: 0.1 (deterministic output)
  • Context: 131072 tokens (128K)
  • GPU layers: -1 or 99 (full offload)
  • Quantization: Q6_K (best quality/size ratio)

The mistake isn’t that local LLMs aren’t ready for coding. The mistake is using outdated, undersized, or misconfigured models. With the right setup, local LLMs on Mac hardware are genuinely useful for development work.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to contact me, you can reach me by email.
