How to Implement Planner/Executor Split with GPT-5.4 and Mini

The Problem

I hit the 5-hour message quota in the middle of a complex refactoring task. My workflow was dead in the water. All context lost. The rollback cost me another hour of work.

This pattern repeated every time I ran multi-step operations through GPT-5.4:

Quota Depletion Pattern
Start complex task → 50 messages later → Quota warning →
Panic → Rush to finish → Context lost → Start over

The core problem: every message, whether complex architectural reasoning or simple code formatting, consumed the same quota. I was burning premium model capacity on tasks that didn’t require it.

A Reddit thread on GPT-5.4-mini optimization revealed the highest-leverage solution: the planner/executor split.

The Solution: Planner/Executor Architecture

The key insight is role separation. Full GPT-5.4 handles strategic work while mini handles tactical work:

Role Separation
Planner (GPT-5.4):
- Analyzes problem scope
- Breaks down into subtasks
- Coordinates agent handoffs
- Reviews and judges outputs
- Makes architectural decisions
Executor (GPT-5.4-mini):
- Implements specific subtasks
- Writes boilerplate code
- Generates tests
- Formats and documents
- Executes narrow, defined operations

The Communication Flow

communication-flow.txt
1. Planner receives initial request
2. Planner decomposes into ordered subtasks
3. Each subtask delegated to executor
4. Executor returns results
5. Planner reviews, adjusts, and coordinates next step
6. Final judgment by planner before completion
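
The six steps above can be sketched as a plain control loop. This is a minimal sketch with both models stubbed out as ordinary functions. The names `run_split`, `plan_fn`, `execute_fn`, and `review_fn` are hypothetical and stand in for the real planner and executor calls.

```python
from typing import Callable, List


def run_split(
    request: str,
    plan_fn: Callable[[str], List[str]],
    execute_fn: Callable[[str], str],
    review_fn: Callable[[List[str]], str],
) -> str:
    """Planner decomposes, executor handles each subtask, planner judges."""
    subtasks = plan_fn(request)              # steps 1-2: receive and decompose
    results = []
    for subtask in subtasks:                 # step 3: delegate each subtask
        results.append(execute_fn(subtask))  # step 4: executor returns a result
        # Step 5 would go here: let the planner adjust the remaining subtasks
    return review_fn(results)                # step 6: final judgment


# Stand-ins for the two models (assumptions for illustration only)
def fake_plan(request: str) -> List[str]:
    return [f"{request}: part {i}" for i in (1, 2)]


def fake_execute(subtask: str) -> str:
    return f"done({subtask})"


def fake_review(results: List[str]) -> str:
    return f"approved {len(results)} results"


verdict = run_split("add login", fake_plan, fake_execute, fake_review)
```

Passing the model calls in as functions keeps the control flow testable without spending any quota.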

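This pattern transforms a single expensive conversation into a coordinated multi-agent workflow. With mini costing roughly 30% of a full-model message, a quota that covers about 300 full-model messages can stretch to somewhere between 800 and 1,000 interactions, depending on how much of the work is routed to mini.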

Why This Matters

The quota efficiency comparison tells the story:

Model          Quota Consumption   Effective Messages/5h
GPT-5.4        100%                223-1120 (Pro)
GPT-5.4-mini   ~30%                743-3733 (Pro)

The planner/executor split lets the planner maintain strategic context while executors handle tactical work at 30% cost.
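
The mini row of the table follows from the full-model row and the 30% figure. A quick check, assuming the 30% ratio applies uniformly to every message (the function name is hypothetical):

```python
MINI_COST = 0.30  # mini's quota cost relative to full GPT-5.4, from the table


def mini_equivalent(full_messages: int, mini_cost: float = MINI_COST) -> int:
    """Messages available if the same quota is spent entirely on mini."""
    return int(full_messages / mini_cost)


low = mini_equivalent(223)    # lower bound of the Pro range
high = mini_equivalent(1120)  # upper bound of the Pro range
```

Both values land on the table's 743-3733 range, so the mini figures are just the full-model figures divided by 0.3.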

Implementation: Python Orchestrator

Here’s the core orchestrator that manages the planner/executor split:

orchestrator.py
import json

from openai import OpenAI

client = OpenAI()


class PlannerExecutorOrchestrator:
    def __init__(self):
        self.planner_model = "gpt-5.4"
        self.executor_model = "gpt-5.4-mini"
        # ID of the latest planner response, used to chain planner turns
        self.previous_response_id = None

    def plan(self, user_request: str) -> dict:
        """Full GPT-5.4 handles planning and decomposition"""
        response = client.responses.create(
            model=self.planner_model,
            input=f"""
            Analyze this request and break it into subtasks.
            Return only JSON with: subtasks[], dependencies[], coordination_notes
            Each subtask needs: description, output_format, parallel (true/false)
            Request: {user_request}
            """,
            previous_response_id=self.previous_response_id
        )
        self.previous_response_id = response.id
        # output_text is the response's text; parse it so callers get a dict.
        # json.loads raises if the planner returns anything other than JSON.
        return json.loads(response.output_text)

    def execute_subtask(self, subtask: dict, context: str) -> str:
        """Mini handles narrow subtasks with limited context"""
        # No previous_response_id here: executors start fresh on every call
        response = client.responses.create(
            model=self.executor_model,
            input=f"""
            Execute this specific subtask. Focus only on what's requested.
            Context: {context}
            Subtask: {subtask['description']}
            Expected output: {subtask['output_format']}
            """
        )
        return response.output_text

    def review(self, results: list) -> str:
        """Full GPT-5.4 reviews and judges executor outputs"""
        response = client.responses.create(
            model=self.planner_model,
            input=f"""
            Review these executor results. Judge quality and completeness.
            Approve, reject with revisions, or escalate to human.
            Results: {results}
            """,
            previous_response_id=self.previous_response_id
        )
        self.previous_response_id = response.id
        return response.output_text

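The previous_response_id parameter is critical. It chains each planner turn to the one before it, so the orchestrator never has to resend the conversation itself. The earlier turns still count as input tokens for the chained request, so the saving comes from prompt caching on the repeated prefix rather than from the history being free. Executor calls deliberately omit the parameter and start from a clean context.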
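
To make the chaining concrete, here is a minimal sketch that isolates the ID bookkeeping from the API call. `PlannerThread` and `fake_create` are hypothetical names. In real use, the `create` argument would be a thin wrapper around `client.responses.create`, and the response would be an SDK object rather than a dict.

```python
from typing import Callable, List, Optional


class PlannerThread:
    """Chains planner turns by remembering the last response ID."""

    def __init__(self, create: Callable[..., dict]):
        self._create = create  # stand-in for the real API call (assumption)
        self.last_id: Optional[str] = None

    def send(self, text: str) -> dict:
        # The first turn passes None; later turns pass the previous ID
        response = self._create(input=text, previous_response_id=self.last_id)
        self.last_id = response["id"]
        return response


# Fake backend that records which previous ID each call received
seen_ids: List[Optional[str]] = []


def fake_create(input: str, previous_response_id: Optional[str]) -> dict:
    seen_ids.append(previous_response_id)
    return {"id": f"resp_{len(seen_ids)}", "text": input.upper()}


thread = PlannerThread(fake_create)
thread.send("plan the refactor")
thread.send("review the results")
```

After the two calls, `seen_ids` holds `None` followed by the first response's ID, which is the chain the planner relies on.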

Running the Workflow

workflow_runner.py
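from orchestrator import PlannerExecutorOrchestrator


def build_minimal_context(subtask: dict) -> str:
    # Placeholder helper: independent subtasks carry their own context
    return subtask.get("context", "")


def build_dependency_context(subtask: dict, results: list) -> str:
    # Placeholder helper: hand earlier executor outputs to dependent subtasks
    return "\n".join(str(r) for r in results)


def run_workflow(user_request: str):
    orchestrator = PlannerExecutorOrchestrator()
    # Step 1: Planner analyzes and decomposes
    plan = orchestrator.plan(user_request)
    results = []
    # Step 2: Executors run subtasks one at a time. Subtasks flagged
    # 'parallel' are independent, so they only need minimal context.
    for subtask in plan['subtasks']:
        if subtask.get('parallel', False):
            # Independent subtask: dispatch to mini with minimal context
            context = build_minimal_context(subtask)
        else:
            # Dependent subtask: include outputs from earlier subtasks
            context = build_dependency_context(subtask, results)
        result = orchestrator.execute_subtask(subtask, context)
        results.append(result)
    # Step 3: Planner reviews and judges
    final_review = orchestrator.review(results)
    return final_review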
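
The runner above dispatches subtasks one at a time, even the ones flagged as independent. One way to actually run those concurrently is a thread pool, since executor calls are I/O-bound. This is a sketch with a stubbed executor. `run_independent` and `fake_execute` are hypothetical names, and the real `execute` would call the mini model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_independent(
    subtasks: List[Dict],
    execute: Callable[[Dict], str],
    max_workers: int = 4,
) -> List[str]:
    """Run subtasks flagged 'parallel' concurrently, preserving input order."""
    independent = [s for s in subtasks if s.get("parallel", False)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order the subtasks were submitted
        return list(pool.map(execute, independent))


def fake_execute(subtask: Dict) -> str:
    # Stand-in for the mini call (assumption for illustration)
    return subtask["description"].upper()


subtasks = [
    {"description": "write tests", "parallel": True},
    {"description": "wire up routes", "parallel": False},
    {"description": "format code", "parallel": True},
]
outputs = run_independent(subtasks, fake_execute)
```

Dependent subtasks are left out on purpose. They still need the sequential path so each one can see the outputs it depends on.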

Workflow State Management

For longer workflows, you need to track state across turns:

workflow_state.py
class WorkflowState:
    """Track planner/executor state across turns"""

    def __init__(self):
        self.planner_response_id = None
        self.executor_results = []
        self.current_phase = "planning"
        self.token_budget = {
            "planner": {"used": 0, "limit": 100000},
            "executor": {"used": 0, "limit": 50000}
        }

    def save_planner_context(self, response_id: str):
        """Store planner response for next turn"""
        self.planner_response_id = response_id

    def get_relevant_files(self, subtask_id: str) -> list:
        """Placeholder: return the files this subtask touches"""
        return []

    def get_dependency_outputs(self, subtask_id: str) -> list:
        """Placeholder: return outputs of the subtasks this one depends on"""
        return []

    def build_executor_context(self, subtask_id: str) -> str:
        """Build minimal context for executor - not full history"""
        # Only include: subtask definition, relevant files, dependency outputs
        return f"""
        Subtask: {subtask_id}
        Relevant files: {self.get_relevant_files(subtask_id)}
        Dependencies: {self.get_dependency_outputs(subtask_id)}
        """

    def should_compact(self) -> bool:
        """Check if context compaction needed"""
        # Compact at 80% of the planner budget or after 10 executor results
        return (
            self.token_budget["planner"]["used"] > 80000 or
            len(self.executor_results) > 10
        )

The build_executor_context method is where you save tokens. Never send the full conversation history to executors. Only send what’s relevant to that specific subtask.
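
As a rough illustration of that filtering, the sketch below builds an executor prompt from only the files and dependency outputs a subtask names. The `files` and `depends_on` keys and the function name are assumptions of this sketch, not fields the planner is known to return.

```python
from typing import Dict


def build_context(
    subtask: Dict,
    files: Dict[str, str],
    outputs: Dict[str, str],
) -> str:
    """Include only the files and dependency outputs the subtask names."""
    relevant = {p: files[p] for p in subtask.get("files", []) if p in files}
    deps = {d: outputs[d] for d in subtask.get("depends_on", []) if d in outputs}
    parts = [f"Subtask: {subtask['description']}"]
    parts += [f"File {path}:\n{body}" for path, body in relevant.items()]
    parts += [f"Output of {dep}:\n{text}" for dep, text in deps.items()]
    return "\n\n".join(parts)


repo = {
    "auth.py": "def login(): ...",
    "billing.py": "def charge(): ...",
    "models.py": "class User: ...",
}
earlier = {"schema": "users(id, email)", "routes": "/login, /logout"}
task = {
    "description": "add password hashing",
    "files": ["auth.py", "models.py"],
    "depends_on": ["schema"],
}
context = build_context(task, repo, earlier)
```

Here `billing.py` and the `routes` output never reach the executor, because the subtask did not ask for them.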

AGENTS.md Configuration

Keep your agent configuration lean. Every token in AGENTS.md gets processed on every turn:

AGENTS.md
# Optimized Agent Configuration for Planner/Executor Split
## Model Routing
planner:
  model: GPT-5.4
  role: planning, coordination, review, judgment
  cache: 24h-extended
executor-mini:
  model: GPT-5.4-mini
  role: implementation, tests, formatting
  cache: standard
## Context Management
- Send only relevant code to executors (not full repo)
- Use compaction for long conversations
- Disable unused MCP servers (saves tokens)
## Cost Optimization
- Default to mini for routine tasks
- Escalate to full 5.4 only when:
  - Architecture decisions needed
  - Complex debugging required
  - Multi-file reasoning required
  - Final review/judgment
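
If the routing is done in code instead of by the agent, the escalation rules above reduce to a small lookup. A minimal sketch, in which the task-type labels are hypothetical and would have to match whatever categories your planner emits:

```python
FULL_MODEL = "gpt-5.4"
MINI_MODEL = "gpt-5.4-mini"

# Task categories that justify the full model, mirroring the escalation list
ESCALATE = {
    "architecture",
    "complex_debugging",
    "multi_file_reasoning",
    "final_review",
}


def choose_model(task_type: str) -> str:
    """Default to mini; escalate only for the listed categories."""
    return FULL_MODEL if task_type in ESCALATE else MINI_MODEL


routine = choose_model("formatting")
escalated = choose_model("architecture")
```

Defaulting to mini means an unrecognized task type costs less, not more. The planner's final review is the safety net for anything routed too low.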

Common Mistakes to Avoid

I made every mistake on this list before finding the right pattern:

Mistake 1: Sending full context to executor

Wrong vs Right
WRONG:
executor_prompt = f"Here's the full conversation: {full_history}\n\nTask: {subtask}"
RIGHT:
executor_prompt = f"Subtask: {subtask}\nRelevant files: {subset}\nDependencies: {outputs}"

Mistake 2: Not caching prompts for repeated subtask types

Wrong vs Right
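WRONG:
# Static prompt and subtask mashed into one string, rebuilt on every call
for subtask in similar_subtasks:
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=f"{SYSTEM_PROMPT}\n\n{subtask}"  # Duplicated every call
    )
RIGHT:
# Keep the static prompt separate and identical across calls so the
# shared prefix can be served from the prompt cache. Caching applies
# automatically to sufficiently long prefixes that match exactly.
SYSTEM_PROMPT = "You are a test generator..."
for subtask in similar_subtasks:
    response = client.responses.create(
        model="gpt-5.4-mini",
        instructions=SYSTEM_PROMPT,  # Static part, identical every call
        input=subtask                # Only this part varies
    )
# 24-hour extended cache available for GPT-5.4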

Mistake 3: Keeping bloated AGENTS.md files

Token Cost
Bloated AGENTS.md (2000 lines) = ~8000 tokens per turn
Lean AGENTS.md (500 lines) = ~2000 tokens per turn
Savings: 75% reduction in context overhead
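
The figures above imply about 4 tokens per line. Assuming that ratio holds for your file (it depends on line length), the per-session overhead is easy to estimate. The function name and the 50-turn session are illustrative only.

```python
TOKENS_PER_LINE = 4  # implied by the figures above: 8000 tokens / 2000 lines


def overhead_tokens(lines: int, turns: int = 1) -> int:
    """Tokens spent re-reading AGENTS.md across a number of turns."""
    return lines * TOKENS_PER_LINE * turns


bloated = overhead_tokens(2000)
lean = overhead_tokens(500)
reduction = 1 - lean / bloated

# Over a hypothetical 50-turn session the gap compounds
saved = overhead_tokens(2000, turns=50) - overhead_tokens(500, turns=50)
```

Under these assumptions a 50-turn session spends 300,000 fewer tokens on configuration alone.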

Mistake 4: Using fast mode unnecessarily

Fast mode consumes 2x credits. Reserve it for genuine urgency, not as default.

Mistake 5: Enabling all MCP servers when only 2-3 are needed

Every enabled MCP server adds tokens to each turn. Disable what you don’t use.

Cost Comparison: Real Example

Here’s the math for a user authentication feature:

cost-comparison.txt
Scenario: Implement user authentication feature
Without planner/executor split:
- 50 messages @ GPT-5.4 = 50 quota units
With planner/executor split:
- 5 planner messages @ GPT-5.4 = 5 quota units
- 45 executor messages @ GPT-5.4-mini = 13.5 quota units (45 * 0.3)
- Total: 18.5 quota units
Savings: 63% reduction in quota consumption
Effective extension: 2.7x more work within same limits
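
The same arithmetic as a small helper, so you can plug in your own message mix. This is a sketch that assumes every mini message costs a flat 0.3 quota units, and the function name is hypothetical.

```python
def quota_units(planner_msgs: int, executor_msgs: int, mini_cost: float = 0.3) -> float:
    """Quota consumed by a mix of full-model and mini messages."""
    return planner_msgs + executor_msgs * mini_cost


baseline = quota_units(50, 0)    # everything on full GPT-5.4
split = quota_units(5, 45)       # 5 planner + 45 executor messages
savings = 1 - split / baseline   # fraction of quota saved
extension = baseline / split     # how much further the same quota goes
```

The extension factor approaches 1 / 0.3, about 3.3x, as the planner's share shrinks toward zero. That is the ceiling for this approach.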

Token-Saving Checklist

  1. Prompt Caching: Use previous_response_id to avoid resending context
  2. Lean AGENTS.md: Keep under 500 lines, remove unused configurations
  3. Disable Unused MCPs: Only enable servers you actively need
  4. Minimal Executor Context: Send only relevant code, not entire codebase
  5. Compaction: Compress conversation history when approaching limits
  6. Lower Verbosity: Reduce output detail when appropriate
  7. Avoid Fast Mode: Use standard speed unless urgency requires 2x cost
  8. Strategic Escalation: Route to full 5.4 only when complexity demands
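
Item 5 is the least obvious to implement. As a rough sketch, compaction can keep the most recent results verbatim and collapse everything older into one summary entry. Truncation stands in for a real summary here, which is an assumption of this sketch. In practice you would likely ask the mini model to summarize.

```python
from typing import List


def compact(history: List[str], keep_recent: int = 3, summary_chars: int = 60) -> List[str]:
    """Keep the last few entries verbatim; collapse older ones into one entry."""
    if len(history) <= keep_recent:
        return list(history)
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    # Truncation is a crude placeholder for an actual summary
    summary = "; ".join(entry[:summary_chars] for entry in older)
    return [f"[summary of {len(older)} earlier results] {summary}"] + recent


history = [f"result {i}" for i in range(5)]
compacted = compact(history, keep_recent=2)
```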

Summary

The planner/executor split is the highest-leverage pattern for extending GPT-5.4 sessions. Full 5.4 handles strategic work (planning, coordination, judgment) while mini handles tactical work (implementation, formatting, testing). Combined with the Responses API’s previous_response_id, prompt caching, and lean configurations, you can achieve a 2.5-3.3x effective message extension without upgrading your subscription.

Key implementation points:

  1. Orchestrator pattern: Central coordinator manages planner/executor handoffs
  2. Context minimization: Executors receive only relevant context, not full history
  3. State management: Track response IDs for efficient context continuation
  4. Lean configuration: Every token in AGENTS.md costs on every turn

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
