What Are the Best LLM Models to Run on a 128GB MacBook for Agentic AI?
Purpose
I wanted to run agentic AI workflows locally on my 128GB MacBook without relying on cloud APIs. The problem: most frontier-quality models require 200GB+ memory. I needed to find which 100B+ parameter models work with quantization and still support tool calling for autonomous agent operations.
This post shows the best LLM models for agentic AI on a 128GB MacBook with M-series unified memory. The key point is choosing models at Q4/Q5 quantization that retain tool calling capability.
The Problem
Agentic AI workflows have specific requirements:
- Large context windows for complex reasoning chains
- Tool calling capability for autonomous actions
- Sufficient model size for complex task planning
- Fast inference for interactive agent workflows
I tested several approaches and found that frontier-quality models (GPT-4 class) typically require 200GB+ memory or cloud APIs. Running capable agents locally on 128GB seemed impossible at first.
What Works on 128GB
After testing on my M-series MacBook with 128GB unified memory, I found these models run well:
Tier 1: Best for Agentic AI (Full Memory Usage)
| Model | Size | Quantization | Key Strength |
|---|---|---|---|
| GPT-OSS-120B | ~70GB | Q4/Q5 | General reasoning, tool calling |
| Nemotron-3-Super-120B-A12B | ~70GB | Q4/Q5 | NVIDIA-optimized, strong reasoning |
| Qwen3.5-122B-A10B | ~75GB | Q4/Q5 | Multilingual, coding, tool use |
| Qwen3-Coder-Next | ~70GB | Q4/Q5 | Code generation, debugging |
Tier 2: Efficient Models (Room for Large Context)
| Model | Size | Benchmark | Key Strength |
|---|---|---|---|
| JANG_2S MiniMax M2.5 | 60GB | 76% MMLU | Quality compression |
| Standard MiniMax M2.5 4bit | 120GB | 25% MMLU | Full size, lower quality |
I found that all the Tier 1 models “feel like the best frontier models from a year ago” - sufficient for most agentic workflows.
Loading Models with MLX
I use Apple’s MLX framework for inference on unified memory. Here’s how I load a model:
```bash
# Install MLX
pip install mlx-lm

# Download a quantized model
huggingface-cli download Qwen/Qwen3.5-122B-A10B-MLX --local-dir ./models/qwen-122b
```

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load the quantized model
model, tokenizer = load(
    "./models/qwen-122b",
    tokenizer_config={"trust_remote_code": True},
)

# Recent mlx-lm releases take sampling settings through a sampler object
# rather than a temp= keyword on generate()
sampler = make_sampler(temp=0.7)

# Generate response
response = generate(
    model,
    tokenizer,
    prompt="You are a task automation agent. Break down: Send a summary of my unread emails to Slack.",
    max_tokens=1024,
    sampler=sampler,
)

print(response)
```

Memory Estimation for 128GB
When I run these models, I budget memory like this:
```
Total Memory: 128GB

Model (Q4/Q5 120B): ~70GB
KV Cache (32K context): ~30GB
System Overhead: ~10GB
Available for context: ~18GB

Safe context window: 32K-64K tokens
```

Here’s how I estimate memory for different context sizes:
```python
def estimate_memory(model_params_b: float, quantization: str, context_tokens: int) -> dict:
    """Estimate memory usage for a model configuration"""

    # Model weight memory based on quantization
    bits_per_param = {
        "Q4": 4.5,  # 4-bit + overhead
        "Q5": 5.5,
        "Q8": 8.5,
        "FP16": 16,
    }

    model_memory_gb = (model_params_b * bits_per_param[quantization]) / 8

    # KV cache estimation (roughly 0.8MB per token, i.e. ~0.8GB per 1K tokens)
    kv_cache_gb = context_tokens * 0.0008  # Conservative estimate

    # System overhead
    system_overhead_gb = 10

    total_gb = model_memory_gb + kv_cache_gb + system_overhead_gb

    return {
        "model_gb": model_memory_gb,
        "kv_cache_gb": kv_cache_gb,
        "total_gb": total_gb,
        "fits_in_128gb": total_gb < 128,
    }


# Example: 122B model at Q4 with 32K context
result = estimate_memory(122, "Q4", 32000)
print(f"Total memory: {result['total_gb']:.1f}GB")
print(f"Fits in 128GB: {result['fits_in_128gb']}")

# Output:
# Total memory: 104.2GB
# Fits in 128GB: True
```

Running Inference with Tool Calling
For agentic AI, tool calling is essential. Here’s how I set up a simple agent:
```python
import json

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Define tools available to the agent
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the filesystem",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path to read"}
                },
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path to write"},
                    "content": {"type": "string", "description": "Content to write"},
                },
                "required": ["path", "content"],
            },
        },
    },
]


def run_agent_with_tools(model, tokenizer, user_request: str):
    """Run agent with tool calling capability"""

    system_prompt = f"""You are an AI agent with access to tools.

Available tools:
{json.dumps(TOOLS, indent=2)}

When you need to use a tool, respond with a JSON object containing:
- "tool": the tool name
- "arguments": the tool arguments

Otherwise, respond normally to the user."""

    prompt = f"{system_prompt}\n\nUser: {user_request}\n\nAssistant:"

    response = generate(
        model,
        tokenizer,
        prompt=prompt,
        max_tokens=2048,
        # Lower temperature for more deterministic tool use; recent mlx-lm
        # releases expect a sampler object instead of a temp= keyword
        sampler=make_sampler(temp=0.3),
    )

    return response


# Load model
model, tokenizer = load("./models/qwen-122b")

# Run agent
result = run_agent_with_tools(
    model,
    tokenizer,
    "Read the file /Users/me/notes.txt and summarize it",
)

print(result)
```

Why JangQ Quantization Matters
I discovered something important about quantization quality. The standard MiniMax M2.5 at 4-bit on MLX takes 120GB but scores only 25% on the MMLU benchmark, which is what random guessing gets on a four-option multiple-choice test. The JANG_2S quantized version is half the size (60GB) and scores 76% on MMLU.

Standard MLX 4-bit MiniMax M2.5:
- Size: 120GB
- MMLU: 25%
- Barely fits, poor quality

JANG_2S MiniMax M2.5:
- Size: 60GB
- MMLU: 76%
- Fits easily, 3x the MMLU score

The lesson: quantization quality matters more than raw size. A well-quantized smaller file can outperform a poorly quantized larger one.
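
To put the two file sizes in perspective, here is a rough sketch that works backwards from them to an effective bits-per-weight figure. It assumes the 4.5 bits per parameter used for Q4 in the memory estimate earlier in this post; real formats differ, and the function names are made up for illustration.

```python
def implied_params_b(size_gb: float, bits_per_param: float) -> float:
    """Parameter count (in billions) implied by a file size and a bit width."""
    return size_gb * 8 / bits_per_param


def effective_bits(size_gb: float, params_b: float) -> float:
    """Average bits stored per parameter for a file of the given size."""
    return size_gb * 8 / params_b


# Assumption: the 120GB 4-bit build costs about 4.5 bits per parameter
params_b = implied_params_b(120, 4.5)
print(f"Implied parameter count: ~{params_b:.0f}B")
print(f"JANG_2S effective bits per weight: ~{effective_bits(60, params_b):.2f}")
```

Under that assumption the 60GB build keeps only about half as many bits per weight as the 120GB one, so the higher MMLU score cannot come from simply keeping more precision. It points at how the quantization was done.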
Common Mistakes I Made
Mistake 1: Choosing Maximum Size Over Efficiency
I initially loaded the largest model possible. The 120GB MiniMax at 4-bit gave me 25% MMLU. Switching to the 60GB JANG_2S version gave me 76% MMLU with room for larger context windows.
Mistake 2: Ignoring Tool Calling Support
Not all quantized models retain tool calling capability. I wasted time on models that couldn’t reliably call tools. Qwen and GPT-OSS variants generally maintain tool calling well.
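
One way to check whether a model calls tools reliably is to parse its replies and count how many are valid calls. The sketch below assumes the `{"tool": ..., "arguments": ...}` reply format from the system prompt above; `REQUIRED_ARGS` and `parse_tool_call` are hypothetical names, not part of mlx-lm.

```python
import json

# Hypothetical registry: tool name -> argument names the call must include
REQUIRED_ARGS = {
    "read_file": ["path"],
    "write_file": ["path", "content"],
}


def parse_tool_call(response: str):
    """Return (tool, arguments) if the reply holds a valid tool call, else None."""
    # Models often wrap the JSON in extra text, so take the outermost braces
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        call = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    tool = call.get("tool")
    args = call.get("arguments")
    if not isinstance(tool, str) or tool not in REQUIRED_ARGS:
        return None
    if not isinstance(args, dict):
        return None
    if any(key not in args for key in REQUIRED_ARGS[tool]):
        return None
    return tool, args


print(parse_tool_call('{"tool": "read_file", "arguments": {"path": "notes.txt"}}'))
print(parse_tool_call("Sure, I will read the file for you."))
```

Running a handful of tool-requiring prompts through a candidate model and counting the `None` results gives a quick signal before committing to it for agent work.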
Mistake 3: Forgetting Context Memory
128GB minus 70GB for the model leaves 58GB for context and overhead. I initially didn’t budget for KV cache and ran into memory errors with long conversations. Now I budget 30-40GB for large context windows (32K+).
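
The same budget can be turned around to ask how much context fits next to a given model. This is a minimal sketch that reuses the rough figures from this post (10GB system overhead, about 0.0008GB of KV cache per token, which is 1,250 tokens per GB); actual KV cache size depends on the model architecture, so treat the result as a ballpark.

```python
TOTAL_MEMORY_GB = 128
SYSTEM_OVERHEAD_GB = 10   # assumption: same overhead as the budget in this post
KV_TOKENS_PER_GB = 1250   # assumption: 0.0008GB of KV cache per token


def max_context_tokens(model_size_gb: float) -> int:
    """Largest context whose KV cache fits beside the model and the overhead."""
    free_gb = TOTAL_MEMORY_GB - SYSTEM_OVERHEAD_GB - model_size_gb
    return max(0, int(free_gb * KV_TOKENS_PER_GB))


print(max_context_tokens(70))   # a ~70GB Tier 1 model
print(max_context_tokens(120))  # the 120GB 4-bit MiniMax build
```

With these numbers a 70GB model leaves room for roughly 60K tokens, while a 120GB model leaves none once overhead is counted.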
Mistake 4: Using Wrong Quantization Format
MLX has its own format. GGUF files don’t work directly. I had to find MLX-specific quantized models or convert them myself.
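
A quick heuristic for telling the two apart before trying to load anything: GGUF models ship as `.gguf` files, while directories that mlx-lm loads typically contain a `config.json` next to `.safetensors` weight files. The helper below is a hypothetical sketch based on that assumption, not an official check.

```python
from pathlib import Path


def detect_model_format(model_dir: str) -> str:
    """Guess the weight format of a local model directory from its file names."""
    path = Path(model_dir)
    if any(path.glob("*.gguf")):
        return "gguf"
    if (path / "config.json").exists() and any(path.glob("*.safetensors")):
        return "safetensors"
    return "unknown"
```

Calling `detect_model_format("./models/qwen-122b")` should return `"safetensors"` for a directory downloaded as shown earlier. For conversion, mlx-lm includes an `mlx_lm.convert` command that quantizes Hugging Face checkpoints; it starts from the original checkpoint, not from a GGUF file.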
Local vs Cloud for Agents
Why run agents locally instead of using cloud APIs?
Privacy: My code and data never leave my machine. I can work on proprietary codebases without concern.
No Rate Limits: My agents can run continuously without API throttling. Long-running automation tasks don’t get interrupted.
Cost: No per-token charges for agents that process large amounts of text. I paid once for the hardware.
Offline Capability: My agents work without internet. I can code on airplanes or in remote locations.
Summary
In this post, I showed the best LLM models for running agentic AI workflows on a 128GB MacBook. The key point is choosing 100B+ parameter models with Q4/Q5 quantization that retain tool calling capability.
For the best balance of quality and context space, I recommend GPT-OSS-120B, Nemotron-3-Super-120B-A12B, Qwen3.5-122B-A10B, or Qwen3-Coder-Next. These models feel like the best frontier models from a year ago - sufficient for most agentic workflows.
For better efficiency, the JangQ-quantized MiniMax M2.5 at 60GB delivers 76% MMLU with room for large contexts. Quality of quantization matters more than raw model size.
Check omlx.ai for the latest M5 Max benchmarks before selecting your model.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 Reddit: 128GB M5 Max for Local Agentic AI Discussion
- 👨‍💻 omlx.ai - M5 Max Benchmarks
- 👨‍💻 MLX Framework Documentation
- 👨‍💻 Qwen Model Releases
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!