How to Implement Planner/Executor Split with GPT-5.4 and Mini
The Problem
I hit the 5-hour message quota in the middle of a complex refactoring task. My workflow was dead in the water. All context lost. The rollback cost me another hour of work.
This pattern repeated every time I ran multi-step operations through GPT-5.4:
Start complex task → 50 messages later → Quota warning → Panic → Rush to finish → Context lost → Start over
The core problem: every message, whether complex architectural reasoning or simple code formatting, consumed the same quota. I was burning premium model capacity on tasks that didn’t require it.
A Reddit thread on GPT-5.4-mini optimization revealed the highest-leverage solution: the planner/executor split.
The Solution: Planner/Executor Architecture
The key insight is role separation. Full GPT-5.4 handles strategic work while mini handles tactical work:
Planner (GPT-5.4):
- Analyzes problem scope
- Breaks down into subtasks
- Coordinates agent handoffs
- Reviews and judges outputs
- Makes architectural decisions
Executor (GPT-5.4-mini):
- Implements specific subtasks
- Writes boilerplate code
- Generates tests
- Formats and documents
- Executes narrow, defined operations
The Communication Flow
1. Planner receives initial request
2. Planner decomposes into ordered subtasks
3. Each subtask delegated to executor
4. Executor returns results
5. Planner reviews, adjusts, and coordinates next step
6. Final judgment by planner before completion
This pattern transforms a single expensive conversation into a coordinated multi-agent workflow. Instead of 300 full-model messages, you achieve 900+ effective interactions within the same quota.
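To make the six steps concrete, here’s a minimal sketch of the control flow with the model calls injected as plain functions. Everything in it is an assumption for illustration: `run_flow`, `plan_fn`, `execute_fn`, `review_fn`, and `max_retries` are hypothetical names, and the stubs stand in for real planner and executor calls.

```python
from typing import Callable


def run_flow(
    request: str,
    plan_fn: Callable[[str], list[str]],
    execute_fn: Callable[[str], str],
    review_fn: Callable[[str, str], bool],
    max_retries: int = 1,
) -> list[str]:
    """Drive the planner/executor loop using injected (hypothetical) model calls."""
    accepted = []
    for subtask in plan_fn(request):            # steps 1-2: planner decomposes the request
        for _ in range(max_retries + 1):
            result = execute_fn(subtask)        # steps 3-4: executor runs and returns a result
            if review_fn(subtask, result):      # step 5: planner reviews the result
                accepted.append(result)
                break
        else:
            # No attempt was approved, so stop instead of building on a rejected result.
            raise RuntimeError(f"Subtask rejected after retries: {subtask}")
    return accepted                             # step 6: the caller passes these on for final judgment


# Stub "models" so the flow can be exercised without any API calls.
demo = run_flow(
    "add login",
    plan_fn=lambda req: [f"{req}: schema", f"{req}: tests"],
    execute_fn=lambda task: f"done({task})",
    review_fn=lambda task, result: result.startswith("done"),
)
```

Injecting the three functions keeps the loop testable offline; in a real setup they would wrap the planner and executor model calls.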
Why This Matters
The quota efficiency comparison tells the story:
| Model | Quota Consumption | Effective Messages/5h |
|---|---|---|
| GPT-5.4 | 100% | 223-1120 (Pro) |
| GPT-5.4-mini | ~30% | 743-3733 (Pro) |
The planner/executor split lets the planner maintain strategic context while executors handle tactical work at 30% cost.
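The numbers in the table follow from a single ratio, so here’s a small sketch of the arithmetic. It assumes mini costs a flat 30% of a full-model message (the approximate figure from the table); the function names are mine, not part of any API.

```python
def effective_messages(full_model_messages: float, mini_cost_ratio: float = 0.30) -> float:
    """How many mini messages fit in the quota of N full-model messages."""
    return full_model_messages / mini_cost_ratio


def blended_quota(planner_msgs: int, executor_msgs: int, mini_cost_ratio: float = 0.30) -> float:
    """Quota units used when planner messages cost 1.0 and executor messages cost the ratio."""
    return planner_msgs * 1.0 + executor_msgs * mini_cost_ratio


# The Pro range from the table: 223-1120 full-model messages per 5 hours.
low = round(effective_messages(223))
high = round(effective_messages(1120))
print(f"mini range: {low}-{high}")
```

If the real ratio differs from 30%, only `mini_cost_ratio` needs to change.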
Implementation: Python Orchestrator
Here’s the core orchestrator that manages the planner/executor split:
import json
from openai import OpenAI

client = OpenAI()

class PlannerExecutorOrchestrator:
    def __init__(self):
        self.planner_model = "gpt-5.4"
        self.executor_model = "gpt-5.4-mini"
        self.previous_response_id = None  # chains planner turns together

    def plan(self, user_request: str) -> dict:
        """Full GPT-5.4 handles planning and decomposition"""
        response = client.responses.create(
            model=self.planner_model,
            input=f"""
            Analyze this request and break it into subtasks.
            Return JSON with: subtasks[], dependencies[], coordination_notes

            Request: {user_request}
            """,
            previous_response_id=self.previous_response_id
        )
        self.previous_response_id = response.id
        # output_text is the reply as a string; response.output is a list of output items.
        # json.loads assumes the planner replied with bare JSON.
        return json.loads(response.output_text)

    def execute_subtask(self, subtask: dict, context: str) -> str:
        """Mini handles narrow subtasks with limited context"""
        response = client.responses.create(
            model=self.executor_model,
            input=f"""
            Execute this specific subtask. Focus only on what's requested.
            Context: {context}

            Subtask: {subtask['description']}
            Expected output: {subtask['output_format']}
            """
        )
        return response.output_text

    def review(self, results: list) -> str:
        """Full GPT-5.4 reviews and judges executor outputs"""
        response = client.responses.create(
            model=self.planner_model,
            input=f"""
            Review these executor results. Judge quality and completeness.
            Approve, reject with revisions, or escalate to human.

            Results: {results}
            """,
            previous_response_id=self.previous_response_id
        )
        self.previous_response_id = response.id
        # The review prompt does not ask for JSON, so return the text verdict as-is.
        return response.output_text
The previous_response_id parameter is critical. It maintains context across planner turns without resending the entire conversation from the client, and the repeated prefix can benefit from the Responses API’s prompt caching.
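The planner is asked to return JSON, but models sometimes wrap JSON in a Markdown code fence or leave out a key. Here’s a minimal sketch of a defensive parser, assuming the three keys named in the planner prompt; `parse_plan` and `REQUIRED_KEYS` are hypothetical helpers, not part of the OpenAI SDK.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker
REQUIRED_KEYS = ("subtasks", "dependencies", "coordination_notes")


def parse_plan(raw_text: str) -> dict:
    """Parse the planner's reply into a dict, tolerating a surrounding code fence."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    plan = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(plan, dict):
        raise ValueError("planner reply must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in plan]
    if missing:
        raise ValueError(f"planner reply is missing keys: {missing}")
    if not isinstance(plan["subtasks"], list):
        raise ValueError("'subtasks' must be a list")
    return plan
```

Failing loudly here is cheaper than letting an executor run on a malformed plan and discovering the problem at review time.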
Running the Workflow
def run_workflow(user_request: str):
    orchestrator = PlannerExecutorOrchestrator()

    # Step 1: Planner analyzes and decomposes
    plan = orchestrator.plan(user_request)
    results = []

    # Step 2: Executors run subtasks. Subtasks marked 'parallel' are independent and
    # could be dispatched concurrently; this loop runs them one at a time.
    # build_minimal_context and build_dependency_context are helpers you supply;
    # they are not defined in this listing.
    for subtask in plan['subtasks']:
        if subtask.get('parallel', False):
            # Dispatch to mini with minimal context
            context = build_minimal_context(subtask)
            result = orchestrator.execute_subtask(subtask, context)
            results.append(result)
        else:
            # Sequential execution with dependency context
            context = build_dependency_context(subtask, results)
            result = orchestrator.execute_subtask(subtask, context)
            results.append(result)

    # Step 3: Planner reviews and judges
    final_review = orchestrator.review(results)

    return final_review
Workflow State Management
For longer workflows, you need to track state across turns:
class WorkflowState:
    """Track planner/executor state across turns"""

    def __init__(self):
        self.planner_response_id = None
        self.executor_results = []
        self.current_phase = "planning"
        self.token_budget = {
            "planner": {"used": 0, "limit": 100000},
            "executor": {"used": 0, "limit": 50000}
        }

    def save_planner_context(self, response_id: str):
        """Store planner response for next turn"""
        self.planner_response_id = response_id

    def build_executor_context(self, subtask_id: str) -> str:
        """Build minimal context for executor - not full history"""
        # Only include: subtask definition, relevant files, dependency outputs.
        # get_relevant_files and get_dependency_outputs are project-specific lookups, not shown here.
        return f"""
        Subtask: {subtask_id}
        Relevant files: {self.get_relevant_files(subtask_id)}
        Dependencies: {self.get_dependency_outputs(subtask_id)}
        """

    def should_compact(self) -> bool:
        """Check if context compaction needed"""
        return (
            self.token_budget["planner"]["used"] > 80000
            or len(self.executor_results) > 10
        )
The build_executor_context method is where you save tokens. Never send the full conversation history to executors. Only send what’s relevant to that specific subtask.
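As one illustration of that rule, here’s a minimal sketch of the two context helpers that run_workflow calls. It rests on assumptions of mine: each subtask dict carries a `files` list and a `depends_on` list of indexes into the results so far. Neither field is defined by the planner prompt above, and `max_chars` is an arbitrary cap.

```python
def build_minimal_context(subtask: dict) -> str:
    """Context for an independent subtask: only its own file list."""
    files = subtask.get("files", [])  # hypothetical field
    return "Relevant files: " + (", ".join(files) if files else "(none)")


def build_dependency_context(subtask: dict, results: list[str], max_chars: int = 2000) -> str:
    """Context for a dependent subtask: its files plus the outputs it depends on."""
    lines = [build_minimal_context(subtask)]
    for index in subtask.get("depends_on", []):  # hypothetical field: indexes into results
        if 0 <= index < len(results):
            # Truncate each dependency so one long output cannot flood the executor prompt.
            lines.append(f"Output of subtask {index}: {results[index][:max_chars]}")
    return "\n".join(lines)
```

Both helpers take only the subtask and prior results as input, so the full conversation history never reaches the executor.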
AGENTS.md Configuration
Keep your agent configuration lean. Every token in AGENTS.md gets processed on every turn:
# Optimized Agent Configuration for Planner/Executor Split
## Model Routing
planner:
  model: GPT-5.4
  role: planning, coordination, review, judgment
  cache: 24h-extended
executor-mini:
  model: GPT-5.4-mini
  role: implementation, tests, formatting
  cache: standard
## Context Management
- Send only relevant code to executors (not full repo)
- Use compaction for long conversations
- Disable unused MCP servers (saves tokens)
## Cost Optimization
- Default to mini for routine tasks
- Escalate to full 5.4 only when:
  - Architecture decisions needed
  - Complex debugging required
  - Multi-file reasoning required
  - Final review/judgment
Common Mistakes to Avoid
I made every mistake on this list before finding the right pattern:
Mistake 1: Sending full context to executor
WRONG:
executor_prompt = f"Here's the full conversation: {full_history}\n\nTask: {subtask}"
RIGHT:
executor_prompt = f"Subtask: {subtask}\nRelevant files: {subset}\nDependencies: {outputs}"
Mistake 2: Not caching prompts for repeated subtask types
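WRONG:
# Repeating the same system prompt every time
for subtask in similar_subtasks:
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=f"{SYSTEM_PROMPT}\n\n{subtask}"  # Duplicated every call
    )
RIGHT:
# Keep the static prompt byte-identical and first, so repeated calls share a cacheable prefix.
# OpenAI caches repeated prompt prefixes automatically; no cache_control field is needed.
TEST_GENERATOR_PROMPT = "You are a test generator..."
for subtask in similar_subtasks:
    response = client.responses.create(
        model="gpt-5.4-mini",
        instructions=TEST_GENERATOR_PROMPT,  # static prefix
        input=subtask  # only the variable part changes
    )
# 24-hour extended cache available for GPT-5.4
Mistake 3: Keeping bloated AGENTS.md files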
Bloated AGENTS.md (2000 lines) = ~8000 tokens per turn
Lean AGENTS.md (500 lines) = ~2000 tokens per turn
Savings: 75% reduction in context overhead
Mistake 4: Using fast mode unnecessarily
Fast mode consumes 2x credits. Reserve it for genuine urgency, not as default.
Mistake 5: Enabling all MCP servers when only 2-3 are needed
Every enabled MCP server adds tokens to each turn. Disable what you don’t use.
Cost Comparison: Real Example
Here’s the math for a user authentication feature:
Scenario: Implement user authentication feature
Without planner/executor split:
- 50 messages @ GPT-5.4 = 50 quota units
With planner/executor split:
- 5 planner messages @ GPT-5.4 = 5 quota units
- 45 executor messages @ GPT-5.4-mini = 13.5 quota units (45 * 0.3)
- Total: 18.5 quota units
Savings: 63% reduction in quota consumption
Effective extension: 2.7x more work within same limits
Token-Saving Checklist
- Prompt Caching: Use previous_response_id to avoid resending context
- Lean AGENTS.md: Keep under 500 lines, remove unused configurations
- Disable Unused MCPs: Only enable servers you actively need
- Minimal Executor Context: Send only relevant code, not entire codebase
- Compaction: Compress conversation history when approaching limits
- Lower Verbosity: Reduce output detail when appropriate
- Avoid Fast Mode: Use standard speed unless urgency requires 2x cost
- Strategic Escalation: Route to full 5.4 only when complexity demands
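The strategic-escalation item can be sketched as a small routing function built on the four escalation criteria from the AGENTS.md section. The tag names and `choose_model` are hypothetical; how a task gets tagged is left to your own tooling.

```python
# Hypothetical tags mirroring the four escalation criteria in the AGENTS.md section.
ESCALATION_TAGS = {"architecture", "complex_debugging", "multi_file_reasoning", "final_review"}


def choose_model(task_tags: set[str]) -> str:
    """Default to mini; return the full model only when a task carries an escalation tag."""
    if task_tags & ESCALATION_TAGS:
        return "gpt-5.4"
    return "gpt-5.4-mini"


routine = choose_model({"formatting", "tests"})
escalated = choose_model({"tests", "architecture"})
```

Making mini the default branch means a task with missing or unknown tags never lands on the expensive model by accident.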
Summary
The planner/executor split is the highest-leverage pattern for extending GPT-5.4 sessions. Full 5.4 handles strategic work (planning, coordination, judgment) while mini handles tactical work (implementation, formatting, testing). Combined with the Responses API’s previous_response_id, prompt caching, and lean configurations, you can achieve 2.5-3.3x effective message extension without upgrading your subscription.
Key implementation points:
- Orchestrator pattern: Central coordinator manages planner/executor handoffs
- Context minimization: Executors receive only relevant context, not full history
- State management: Track response IDs for efficient context continuation
- Lean configuration: Every token in AGENTS.md costs on every turn
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!