
What Are the Best LLM Models to Run on a 128GB MacBook for Agentic AI?

Purpose

I wanted to run agentic AI workflows locally on my 128GB MacBook without relying on cloud APIs. The problem: most frontier-quality models require 200GB+ of memory. I needed to find out which 100B+ parameter models fit once quantized and still support tool calling for autonomous agent operations.

This post shows the best LLM models for agentic AI on a 128GB MacBook with M-series unified memory. The key point is choosing models at Q4/Q5 quantization that retain tool calling capability.

The Problem

Agentic AI workflows have specific requirements:

  1. Large context windows for complex reasoning chains
  2. Tool calling capability for autonomous actions
  3. Sufficient model size for complex task planning
  4. Fast inference for interactive agent workflows

I tested several approaches and found that frontier-quality models (GPT-4 class) typically require 200GB+ memory or cloud APIs. Running capable agents locally on 128GB seemed impossible at first.

What Works on 128GB

After testing on my M-series MacBook with 128GB unified memory, I found these models run well:

Tier 1: Best for Agentic AI (Full Memory Usage)

| Model | Size | Quantization | Key Strength |
|---|---|---|---|
| GPT-OSS-120B | ~70GB | Q4/Q5 | General reasoning, tool calling |
| Nemotron-3-Super-120B-A12B | ~70GB | Q4/Q5 | NVIDIA-optimized, strong reasoning |
| Qwen3.5-122B-A10B | ~75GB | Q4/Q5 | Multilingual, coding, tool use |
| Qwen3-Coder-Next | ~70GB | Q4/Q5 | Code generation, debugging |

Tier 2: Efficient Models (Room for Large Context)

| Model | Size | Benchmark | Key Strength |
|---|---|---|---|
| JANG_2S MiniMax M2.5 | 60GB | 76% MMLU | Quality compression |
| Standard MiniMax M2.5 4bit | 120GB | 25% MMLU | Full size, lower quality |

I found that all the Tier 1 models “feel like the best frontier models from a year ago”, which is sufficient for most agentic workflows.

Loading Models with MLX

I use Apple’s MLX framework for inference on unified memory. Here’s how I load a model:

Terminal
# Install MLX
pip install mlx-lm

# Download a quantized model
huggingface-cli download Qwen/Qwen3.5-122B-A10B-MLX --local-dir ./models/qwen-122b

load_model.py
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load the quantized model from the local directory
model, tokenizer = load(
    "./models/qwen-122b",
    tokenizer_config={"trust_remote_code": True},
)

# Generate a response. Recent mlx-lm releases take sampling settings
# through a sampler object; older releases accepted temp= directly.
response = generate(
    model,
    tokenizer,
    prompt="You are a task automation agent. Break down: Send a summary of my unread emails to Slack.",
    max_tokens=1024,
    sampler=make_sampler(temp=0.7),
)
print(response)

Memory Estimation for 128GB

When I run these models, I budget memory like this:

Memory Budget
Total Memory: 128GB
Model (Q4/Q5 120B): ~70GB
KV Cache (32K context): ~30GB
System Overhead: ~10GB
Available for context: ~18GB
Safe context window: 32K-64K tokens

Here’s how I estimate memory for different context sizes:

memory_calc.py
def estimate_memory(model_params_b: float, quantization: str, context_tokens: int) -> dict:
    """Estimate memory usage in GB for a model configuration."""
    # Approximate bits per parameter for each quantization level
    bits_per_param = {
        "Q4": 4.5,  # 4-bit + overhead
        "Q5": 5.5,
        "Q8": 8.5,
        "FP16": 16,
    }
    # model_params_b is in billions, so bits / 8 gives gigabytes directly
    model_memory_gb = (model_params_b * bits_per_param[quantization]) / 8

    # KV cache estimation (roughly 0.5-1MB per token; 0.8MB is a conservative estimate)
    kv_cache_gb = context_tokens * 0.0008

    # System overhead
    system_overhead_gb = 10

    total_gb = model_memory_gb + kv_cache_gb + system_overhead_gb
    return {
        "model_gb": model_memory_gb,
        "kv_cache_gb": kv_cache_gb,
        "total_gb": total_gb,
        "fits_in_128gb": total_gb < 128,
    }


# Example: 122B model at Q4 with 32K context
result = estimate_memory(122, "Q4", 32000)
print(f"Total memory: {result['total_gb']:.1f}GB")
print(f"Fits in 128GB: {result['fits_in_128gb']}")
# Output:
# Total memory: 104.2GB
# Fits in 128GB: True
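
The same estimate can be inverted to ask how many context tokens fit once the weights and system overhead are accounted for. The sketch below reuses the constants from memory_calc.py (4.5 bits per parameter for Q4, about 0.8MB of KV cache per token, 10GB of overhead). `max_context_tokens` is a hypothetical helper name, and the per-token figure is a rule of thumb rather than a measured value for any particular model.

```python
# Rule-of-thumb constants copied from memory_calc.py; assumptions, not measurements.
BITS_PER_PARAM = {"Q4": 4.5, "Q5": 5.5, "Q8": 8.5, "FP16": 16}
KV_GB_PER_TOKEN = 0.0008
SYSTEM_OVERHEAD_GB = 10


def max_context_tokens(model_params_b: float, quantization: str,
                       total_memory_gb: float = 128) -> int:
    """Return the largest context (in tokens) that fits in the memory budget."""
    model_gb = model_params_b * BITS_PER_PARAM[quantization] / 8
    free_gb = total_memory_gb - model_gb - SYSTEM_OVERHEAD_GB
    if free_gb <= 0:
        return 0  # The weights plus overhead already exceed the budget
    return int(free_gb / KV_GB_PER_TOKEN)


if __name__ == "__main__":
    for quant in ("Q4", "Q5", "Q8"):
        print(quant, max_context_tokens(122, quant))
```

Under these assumptions a 122B model at Q4 leaves room for roughly 61K tokens, which lines up with the 32K-64K safe window in the budget, while the same model at Q8 does not fit at all.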

Running Inference with Tool Calling

For agentic AI, tool calling is essential. Here’s how I set up a simple agent:

tool_agent.py
import json

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Define tools available to the agent
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the filesystem",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path to read"}
                },
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path to write"},
                    "content": {"type": "string", "description": "Content to write"},
                },
                "required": ["path", "content"],
            },
        },
    },
]


def run_agent_with_tools(model, tokenizer, user_request: str):
    """Run agent with tool calling capability"""
    # The tool schema is embedded in the prompt as JSON
    system_prompt = f"""You are an AI agent with access to tools.
Available tools:
{json.dumps(TOOLS, indent=2)}
When you need to use a tool, respond with a JSON object containing:
- "tool": the tool name
- "arguments": the tool arguments
Otherwise, respond normally to the user."""
    prompt = f"{system_prompt}\n\nUser: {user_request}\n\nAssistant:"
    response = generate(
        model,
        tokenizer,
        prompt=prompt,
        max_tokens=2048,
        # Lower temperature for more deterministic tool use
        sampler=make_sampler(temp=0.3),
    )
    return response


# Load model
model, tokenizer = load("./models/qwen-122b")

# Run agent
result = run_agent_with_tools(
    model,
    tokenizer,
    "Read the file /Users/me/notes.txt and summarize it",
)
print(result)
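
The agent returns raw text, so something still has to detect a tool call in that text and execute it. Here is a minimal sketch of that step, assuming the model follows the prompt's JSON convention (`"tool"` and `"arguments"` keys). `extract_tool_call`, `dispatch`, and `TOOL_IMPLS` are hypothetical names that are not part of mlx-lm, and the scan for the first decodable JSON object is a simple heuristic that will miss malformed output.

```python
import json
from pathlib import Path
from typing import Optional


def read_file(path: str) -> str:
    return Path(path).read_text()


def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"Wrote {len(content)} characters to {path}"


# Maps tool names from the TOOLS schema to local Python functions.
TOOL_IMPLS = {"read_file": read_file, "write_file": write_file}


def extract_tool_call(response: str) -> Optional[dict]:
    """Return the first JSON object with "tool" and "arguments" keys, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(response):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            continue  # Not valid JSON from this brace; keep scanning
        if (isinstance(obj, dict) and "tool" in obj
                and isinstance(obj.get("arguments"), dict)):
            return obj
    return None


def dispatch(response: str) -> Optional[str]:
    """Run the requested tool, or return None if the reply is plain text."""
    call = extract_tool_call(response)
    if call is None:
        return None
    impl = TOOL_IMPLS.get(call["tool"])
    if impl is None:
        return f"Unknown tool: {call['tool']}"
    try:
        return impl(**call["arguments"])
    except (TypeError, OSError) as exc:
        # Wrong argument names or a filesystem failure
        return f"Tool error: {exc}"
```

In a full agent loop, the string returned by `dispatch` would be appended to the conversation and the model called again. A real deployment would also restrict which paths the tools may touch.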

Why JangQ Quantization Matters

I discovered something important about quantization quality. The standard MiniMax M2.5 at 4-bit on MLX is 120GB but scores only 25% on the MMLU benchmark, which is roughly chance level for a four-option multiple-choice test. The JANG_2S quantized version is half the size (60GB) but scores 76% on MMLU.

Quantization Comparison
Standard MLX 4-bit MiniMax M2.5:
- Size: 120GB
- MMLU: 25%
- Barely fits, poor quality
JANG_2S MiniMax M2.5:
- Size: 60GB
- MMLU: 76%
- Fits easily, about 3x the MMLU score

The lesson: quantization quality matters more than raw size. A well-quantized, smaller build can outperform a poorly quantized, larger build of the same model.

Common Mistakes I Made

Mistake 1: Choosing Maximum Size Over Efficiency

I initially loaded the largest model possible. The 120GB MiniMax at 4-bit gave me 25% MMLU. Switching to the 60GB JANG_2S version gave me 76% MMLU with room for larger context windows.

Mistake 2: Ignoring Tool Calling Support

Not all quantized models retain tool calling capability. I wasted time on models that couldn’t reliably call tools. Qwen and GPT-OSS variants generally maintain tool calling well.
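
One way to catch this early is to score a batch of raw model outputs against the tool schema before committing to a model. The sketch below is an assumption about how such a check could look, not the procedure used for this post. `is_valid_call` and `tool_call_rate` are hypothetical helpers, `REQUIRED_ARGS` mirrors the `required` lists from tool_agent.py, and the sample outputs are made up for illustration.

```python
import json

# Required argument names per tool, mirroring the "required" lists in TOOLS.
REQUIRED_ARGS = {
    "read_file": {"path"},
    "write_file": {"path", "content"},
}


def is_valid_call(raw_output: str) -> bool:
    """True if the output is a JSON tool call naming a known tool with all required args."""
    try:
        call = json.loads(raw_output.strip())
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    tool = call.get("tool")
    args = call.get("arguments")
    if not isinstance(tool, str) or not isinstance(args, dict):
        return False
    required = REQUIRED_ARGS.get(tool)
    return required is not None and required.issubset(args)


def tool_call_rate(outputs: list[str]) -> float:
    """Fraction of outputs that are well-formed tool calls."""
    if not outputs:
        return 0.0
    return sum(is_valid_call(out) for out in outputs) / len(outputs)


# Made-up outputs for prompts that should each trigger a tool call.
samples = [
    '{"tool": "read_file", "arguments": {"path": "notes.txt"}}',  # valid
    '{"tool": "write_file", "arguments": {"path": "out.txt"}}',   # missing "content"
    "I will read the file now.",                                  # prose, no call
    '{"tool": "read_file", "arguments": {"path": "a.txt"}}',      # valid
]
print(f"Valid tool calls: {tool_call_rate(samples):.0%}")
```

Running the same prompts through two candidate models and comparing the rates gives a rough signal of which one holds up better after quantization.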

Mistake 3: Forgetting Context Memory

128GB minus 70GB for the model leaves 58GB for context and overhead. I initially didn’t budget for KV cache and ran into memory errors with long conversations. Now I budget 30-40GB for large context windows (32K+).

Mistake 4: Using Wrong Quantization Format

MLX has its own format. GGUF files don’t work directly. I had to find MLX-specific quantized models or convert them myself.
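
For the convert-it-yourself route, mlx-lm ships an `mlx_lm.convert` command that takes Hugging Face weights and writes an MLX-format copy, optionally quantized. The sketch below only assembles and prints that command. `build_convert_command` is a hypothetical helper, the repository ID and output path are placeholders, and the flags shown (`--hf-path`, `--mlx-path`, `-q`, `--q-bits`) should be checked against `mlx_lm.convert --help` for the installed version. Note that the input is the original Hugging Face checkpoint, not a GGUF file.

```python
import shlex


def build_convert_command(hf_path: str, mlx_path: str, q_bits: int = 4) -> list[str]:
    """Assemble an mlx_lm.convert invocation that quantizes while converting."""
    return [
        "mlx_lm.convert",
        "--hf-path", hf_path,     # Hugging Face repo ID or local checkpoint directory
        "--mlx-path", mlx_path,   # Output directory for the MLX-format model
        "-q",                     # Quantize during conversion
        "--q-bits", str(q_bits),
    ]


# Placeholder names; substitute the model you actually want to convert.
cmd = build_convert_command("your-org/your-model", "./models/your-model-4bit")
print(shlex.join(cmd))
# To execute it instead of printing: subprocess.run(cmd, check=True)
```

The resulting directory can then be passed to `load()` in the same way as a downloaded MLX model.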

Local vs Cloud for Agents

Why run agents locally instead of using cloud APIs?

Privacy: My code and data never leave my machine. I can work on proprietary codebases without concern.

No Rate Limits: My agents can run continuously without API throttling. Long-running automation tasks don’t get interrupted.

Cost: No per-token charges for agents that process large amounts of text. I paid once for the hardware.

Offline Capability: My agents work without internet. I can code on airplanes or in remote locations.

Summary

In this post, I showed the best LLM models for running agentic AI workflows on a 128GB MacBook. The key point is choosing 100B+ parameter models with Q4/Q5 quantization that retain tool calling capability.

For the best balance of quality and context space, I recommend GPT-OSS-120B, Nemotron-3-Super-120B-A12B, Qwen3.5-122B-A10B, or Qwen3-Coder-Next. These models feel like the best frontier models from a year ago - sufficient for most agentic workflows.

For better efficiency, the JangQ-quantized MiniMax M2.5 at 60GB delivers 76% MMLU with room for large contexts. Quality of quantization matters more than raw model size.

Check omlx.ai for the latest M5 Max benchmarks before selecting your model.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
