How to Implement a Planner/Executor Split with GPT-5.4 and Mini

Apr 5, 2026

The Problem

I hit the 5-hour message quota in the middle of a complex refactoring task. My workflow was dead in the water. All context lost. The rollback cost me another hour of work.

This pattern repeated every time I ran multi-step operations through GPT-5.4:

Start complex task → 50 messages later → Quota warning →
Panic → Rush to finish → Context lost → Start over

The core problem: every message, whether complex architectural reasoning or simple code formatting, consumed the same quota. I was burning premium model capacity on tasks that didn’t require it.

A Reddit thread on GPT-5.4-mini optimization revealed the highest-leverage solution: the planner/executor split.

The Solution: Planner/Executor Architecture

The key insight is role separation. Full GPT-5.4 handles strategic work while mini handles tactical work:

Planner (GPT-5.4):
  - Analyzes problem scope
  - Breaks down into subtasks
  - Coordinates agent handoffs
  - Reviews and judges outputs
  - Makes architectural decisions

Executor (GPT-5.4-mini):
  - Implements specific subtasks
  - Writes boilerplate code
  - Generates tests
  - Formats and documents
  - Executes narrow, defined operations

The Communication Flow

1. Planner receives initial request
2. Planner decomposes into ordered subtasks
3. Each subtask delegated to executor
4. Executor returns results
5. Planner reviews, adjusts, and coordinates next step
6. Final judgment by planner before completion
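
Step 2 only works if subtasks are executed in an order that respects their dependencies. Here is a minimal sketch of that ordering step, assuming each subtask carries a hypothetical `id` and an optional `depends_on` list (these field names are illustrative, not part of any API). It uses the standard library's `graphlib`:

```python
from graphlib import TopologicalSorter


def order_subtasks(subtasks: list[dict]) -> list[str]:
    """Return subtask ids in an order where dependencies always come first.

    Assumes each subtask is a dict with a hypothetical "id" key and an
    optional "depends_on" list of ids (illustrative field names).
    """
    graph = {task["id"]: set(task.get("depends_on", [])) for task in subtasks}
    # static_order() raises graphlib.CycleError if the plan is circular.
    return list(TopologicalSorter(graph).static_order())


example_plan = [
    {"id": "tests", "depends_on": ["login_endpoint"]},
    {"id": "user_model"},
    {"id": "login_endpoint", "depends_on": ["user_model"]},
]
ordered = order_subtasks(example_plan)
```

A circular plan raises an error here instead of stalling halfway through a workflow, which is a cheap sanity check to run before spending any executor messages.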

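This pattern transforms a single expensive conversation into a coordinated multi-agent workflow. Instead of 300 full-model messages, you can approach 900+ effective interactions within the same quota when nearly all of the traffic goes to mini, which costs roughly 30% as much per message. A more typical mix of planner and executor messages lands closer to 2.5-3x.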

Why This Matters

The quota efficiency comparison tells the story:

Model	Quota Consumption	Effective Messages/5h
GPT-5.4	100%	223-1120 (Pro)
GPT-5.4-mini	~30%	743-3733 (Pro)

The planner/executor split lets the planner maintain strategic context while executors handle tactical work at 30% cost.

Implementation: Python Orchestrator

Here’s the core orchestrator that manages the planner/executor split:

import json

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

class PlannerExecutorOrchestrator:
    def __init__(self):
        self.planner_model = "gpt-5.4"
        self.executor_model = "gpt-5.4-mini"
        self.previous_response_id = None  # Last planner response, used to chain turns

    def plan(self, user_request: str) -> dict:
        """Full GPT-5.4 handles planning and decomposition"""
        response = client.responses.create(
            model=self.planner_model,
            input=f"""
            Analyze this request and break it into subtasks.
            Return only JSON with: subtasks[], dependencies[], coordination_notes
            Each subtask needs: description, output_format, parallel (true/false)

            Request: {user_request}
            """,
            previous_response_id=self.previous_response_id
        )
        self.previous_response_id = response.id
        # output_text is the reply as a plain string; parse it into a dict
        return json.loads(response.output_text)

    def execute_subtask(self, subtask: dict, context: str) -> str:
        """Mini handles narrow subtasks with limited context"""
        # No previous_response_id here: executors start fresh on every subtask
        response = client.responses.create(
            model=self.executor_model,
            input=f"""
            Execute this specific subtask. Focus only on what's requested.
            Context: {context}

            Subtask: {subtask['description']}
            Expected output: {subtask['output_format']}
            """
        )
        return response.output_text

    def review(self, results: list) -> str:
        """Full GPT-5.4 reviews and judges executor outputs"""
        response = client.responses.create(
            model=self.planner_model,
            input=f"""
            Review these executor results. Judge quality and completeness.
            Approve, reject with revisions, or escalate to human.

            Results: {results}
            """,
            previous_response_id=self.previous_response_id
        )
        self.previous_response_id = response.id
        return response.output_text

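The previous_response_id parameter is critical. It chains each planner turn to the previous one, so the planner keeps its context without you resending the entire conversation. Earlier turns in the chain are still billed as input tokens, but because that prefix stays identical from turn to turn, it is a good fit for the Responses API’s prompt caching.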
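
The planner is asked to return JSON, but a model reply can arrive wrapped in a Markdown code fence or missing a key. Here is a minimal sketch of a defensive parser, assuming the plan shape named in the planner prompt (`subtasks`, `dependencies`, `coordination_notes`). `parse_plan` is an illustrative helper, not part of the OpenAI SDK:

```python
import json


def parse_plan(raw_text: str) -> dict:
    """Parse the planner's reply into a dict and check the expected keys.

    parse_plan is an illustrative helper, not part of the OpenAI SDK.
    """
    fence = "`" * 3  # Markdown code fence marker
    text = raw_text.strip()
    # Models sometimes wrap JSON in a code fence; strip it if present.
    if text.startswith(fence):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(fence, 1)[0]
    plan = json.loads(text)  # Raises json.JSONDecodeError on malformed output
    if not isinstance(plan, dict) or not isinstance(plan.get("subtasks"), list):
        raise ValueError("planner reply must be a JSON object with a 'subtasks' list")
    # Fill optional keys so later code can rely on them being present.
    plan.setdefault("dependencies", [])
    plan.setdefault("coordination_notes", "")
    return plan
```

Failing loudly at this point is cheaper than discovering a malformed plan after several executor calls have already been spent on it.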

Running the Workflow

def build_minimal_context(subtask: dict) -> str:
    """Placeholder: context for an independent subtask (its own notes only)"""
    return subtask.get("context", "")

def build_dependency_context(subtask: dict, results: list) -> str:
    """Placeholder: context for a dependent subtask (outputs produced so far)"""
    return "\n\n".join(str(result) for result in results)

def run_workflow(user_request: str):
    orchestrator = PlannerExecutorOrchestrator()

    # Step 1: Planner analyzes and decomposes
    plan = orchestrator.plan(user_request)
    results = []

    # Step 2: Executors run subtasks one at a time; only the context differs
    for subtask in plan['subtasks']:
        if subtask.get('parallel', False):
            # Independent subtask: minimal context (safe to dispatch concurrently)
            context = build_minimal_context(subtask)
        else:
            # Dependent subtask: include outputs from earlier subtasks
            context = build_dependency_context(subtask, results)
        result = orchestrator.execute_subtask(subtask, context)
        results.append(result)

    # Step 3: Planner reviews and judges
    final_review = orchestrator.review(results)

    return final_review
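
The loop above calls the executor one subtask at a time. If the independent subtasks should actually run concurrently, a thread pool is one option, since the model calls are I/O-bound. This is a sketch under assumptions: `execute` stands in for something like `orchestrator.execute_subtask` bound to its context, and `max_workers=4` is an arbitrary default that you would tune against your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_independent_subtasks(
    subtasks: list[dict],
    execute: Callable[[dict], str],
    max_workers: int = 4,
) -> list[str]:
    """Run subtasks flagged parallel=True concurrently; results keep input order.

    `execute` is injected so the sketch runs without an API key; in the
    workflow it would wrap the real executor call.
    """
    independent = [task for task in subtasks if task.get("parallel", False)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `independent`.
        return list(pool.map(execute, independent))


# Usage with a stand-in executor instead of a real model call.
demo_tasks = [
    {"description": "write tests", "parallel": True},
    {"description": "wire up routes", "parallel": False},
    {"description": "format code", "parallel": True},
]
demo_results = run_independent_subtasks(
    demo_tasks, lambda task: f"done: {task['description']}"
)
```

Dependent subtasks still need the sequential path, because each one reads the results of the subtasks before it.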

Workflow State Management

For longer workflows, you need to track state across turns:

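class WorkflowState:
    """Track planner/executor state across turns"""

    def __init__(self):
        self.planner_response_id = None
        self.executor_results = []
        self.current_phase = "planning"
        self.token_budget = {
            "planner": {"used": 0, "limit": 100000},
            "executor": {"used": 0, "limit": 50000}
        }
        # Placeholder lookups keyed by subtask id; fill these from your own plan data
        self.relevant_files = {}
        self.dependency_outputs = {}

    def save_planner_context(self, response_id: str):
        """Store planner response for next turn"""
        self.planner_response_id = response_id

    def get_relevant_files(self, subtask_id: str) -> list:
        """Placeholder: files the subtask needs to see"""
        return self.relevant_files.get(subtask_id, [])

    def get_dependency_outputs(self, subtask_id: str) -> list:
        """Placeholder: outputs of the subtasks this one depends on"""
        return self.dependency_outputs.get(subtask_id, [])

    def build_executor_context(self, subtask_id: str) -> str:
        """Build minimal context for executor - not full history"""
        # Only include: subtask definition, relevant files, dependency outputs
        return f"""
        Subtask: {subtask_id}
        Relevant files: {self.get_relevant_files(subtask_id)}
        Dependencies: {self.get_dependency_outputs(subtask_id)}
        """

    def should_compact(self) -> bool:
        """Check if context compaction needed"""
        # Compact at 80% of the planner budget or after more than 10 stored results
        return (
            self.token_budget["planner"]["used"] > 80000 or
            len(self.executor_results) > 10
        )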

The build_executor_context method is where you save tokens. Never send the full conversation history to executors. Only send what’s relevant to that specific subtask.
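
The should_compact check says when to compact, but not what compaction does. One simple approach, shown here as a rough sketch, is to keep the most recent executor results in full and shorten the older ones. Truncation is only a stand-in for real summarization (which could itself be a mini call), and `keep_last` and `preview_chars` are illustrative defaults, not tuned values.

```python
def compact_results(
    results: list[str], keep_last: int = 3, preview_chars: int = 80
) -> list[str]:
    """Shrink older executor results to short previews; keep recent ones intact."""
    # Everything before `cutoff` counts as "old" and may be truncated.
    cutoff = max(len(results) - max(keep_last, 0), 0)
    compacted = []
    for index, text in enumerate(results):
        if index < cutoff and len(text) > preview_chars:
            compacted.append(text[:preview_chars] + " [truncated]")
        else:
            compacted.append(text)
    return compacted
```

The list keeps its length and order, so any code that refers to results by position continues to work after compaction.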

AGENTS.md Configuration

Keep your agent configuration lean. Every token in AGENTS.md gets processed on every turn:

# Optimized Agent Configuration for Planner/Executor Split

## Model Routing
planner:
  model: GPT-5.4
  role: planning, coordination, review, judgment
  cache: 24h-extended

executor-mini:
  model: GPT-5.4-mini
  role: implementation, tests, formatting
  cache: standard

## Context Management
- Send only relevant code to executors (not full repo)
- Use compaction for long conversations
- Disable unused MCP servers (saves tokens)

## Cost Optimization
- Default to mini for routine tasks
- Escalate to full 5.4 only when:
  - Architecture decisions needed
  - Complex debugging required
  - Multi-file reasoning required
  - Final review/judgment
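
The escalation rules above can also live in the orchestrator as a small routing function. This is a sketch, assuming the planner tags each subtask with a kind; the string labels below are hypothetical, so substitute whatever taxonomy your planner emits.

```python
PLANNER_MODEL = "gpt-5.4"
EXECUTOR_MODEL = "gpt-5.4-mini"

# Task kinds that the configuration above escalates to the full model.
# These labels are hypothetical examples, not values defined by any API.
ESCALATE_KINDS = {
    "architecture",
    "complex_debugging",
    "multi_file_reasoning",
    "final_review",
}


def choose_model(task_kind: str) -> str:
    """Default to mini; escalate only for the kinds listed in ESCALATE_KINDS."""
    return PLANNER_MODEL if task_kind in ESCALATE_KINDS else EXECUTOR_MODEL
```

Because unknown kinds fall through to mini, adding a new routine task type never silently increases quota use.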

Common Mistakes to Avoid

I made every mistake on this list before finding the right pattern:

Mistake 1: Sending full context to executor

WRONG:
executor_prompt = f"Here's the full conversation: {full_history}\n\nTask: {subtask}"

RIGHT:
executor_prompt = f"Subtask: {subtask}\nRelevant files: {subset}\nDependencies: {outputs}"

Mistake 2: Not caching prompts for repeated subtask types

WRONG:
# Repeating the same system prompt every time
for subtask in similar_subtasks:
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=f"{SYSTEM_PROMPT}\n\n{subtask}"  # Duplicated every call
    )

RIGHT:
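# Keep the static prompt as an identical prefix so prompt caching can reuse it
# OpenAI prompt caching is automatic for long, repeated prefixes; no per-message flag is needed
SYSTEM_PROMPT = "You are a test generator..."
for subtask in similar_subtasks:
    response = client.responses.create(
        model="gpt-5.4-mini",
        instructions=SYSTEM_PROMPT,  # Static part, identical on every call
        input=subtask  # Only the variable part changes
    )
# 24-hour extended cache available for GPT-5.4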

Mistake 3: Keeping bloated AGENTS.md files

Bloated AGENTS.md (2000 lines) = ~8000 tokens per turn
Lean AGENTS.md (500 lines) = ~2000 tokens per turn
Savings: 75% reduction in context overhead

Mistake 4: Using fast mode unnecessarily

Fast mode consumes 2x credits. Reserve it for genuine urgency, not as default.

Mistake 5: Enabling all MCP servers when only 2-3 are needed

Every enabled MCP server adds tokens to each turn. Disable what you don’t use.

Cost Comparison: Real Example

Here’s the math for a user authentication feature:

Scenario: Implement user authentication feature

Without planner/executor split:
- 50 messages @ GPT-5.4 = 50 quota units

With planner/executor split:
- 5 planner messages @ GPT-5.4 = 5 quota units
- 45 executor messages @ GPT-5.4-mini = 13.5 quota units (45 * 0.3)
- Total: 18.5 quota units

Savings: 63% reduction in quota consumption
Effective extension: 2.7x more work within same limits
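
To rerun this arithmetic for your own message mix, here is a small sketch. It assumes the ~30% mini rate quoted earlier and that every message costs one quota unit on the full model, which is a simplification.

```python
def quota_cost(planner_msgs: int, executor_msgs: int, mini_rate: float = 0.3) -> dict:
    """Compare quota use with and without the planner/executor split.

    mini_rate is the article's ~30% figure for GPT-5.4-mini.
    """
    baseline = planner_msgs + executor_msgs  # Every message on the full model
    if baseline == 0:
        raise ValueError("at least one message is required")
    split = planner_msgs + executor_msgs * mini_rate  # Executor messages discounted
    return {
        "baseline_units": baseline,
        "split_units": split,
        "savings": 1 - split / baseline,
        "extension": baseline / split,
    }


auth_feature = quota_cost(planner_msgs=5, executor_msgs=45)
```

For the scenario above this gives 18.5 units, a 63% saving, and a 2.7x extension. Shifting more messages to the planner pulls the extension factor down quickly, so the planner share is the number to watch.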

Token-Saving Checklist

Context Chaining and Caching: Use previous_response_id to avoid resending context, and keep static prompt prefixes identical so caching can apply
Lean AGENTS.md: Keep under 500 lines, remove unused configurations
Disable Unused MCPs: Only enable servers you actively need
Minimal Executor Context: Send only relevant code, not entire codebase
Compaction: Compress conversation history when approaching limits
Lower Verbosity: Reduce output detail when appropriate
Avoid Fast Mode: Use standard speed unless urgency requires 2x cost
Strategic Escalation: Route to full 5.4 only when complexity demands

Summary

The planner/executor split is the highest-leverage pattern for extending GPT-5.4 sessions. Full 5.4 handles strategic work (planning, coordination, judgment) while mini handles tactical work (implementation, formatting, testing). Combined with the Responses API’s previous_response_id, prompt caching, and lean configurations, you can achieve a 2.5-3.3x effective message extension without upgrading your subscription.

Key implementation points:

Orchestrator pattern: Central coordinator manages planner/executor handoffs
Context minimization: Executors receive only relevant context, not full history
State management: Track response IDs for efficient context continuation
Lean configuration: Every token in AGENTS.md costs on every turn

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit: GPT-5.4-mini vs GPT-5.4 Performance Discussion
👨‍💻 OpenAI Responses API Documentation

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!