
How to Control What Goes Into the Context Window in a Custom AI Coding Harness

Problem

When I ran Claude Code on a long refactor, the model started fixing code that was already fixed. It “remembered” a file from ten turns ago, ignored the diff I had just shown it, and confidently rewrote working logic. I tried to be more explicit in my prompts. Same thing. I trimmed the conversation. Better for a turn, then worse again.

The real reason, based on a long thread of people building their own harnesses, was not the model and not the prompt. It was context. Off-the-shelf harnesses tend to dump the entire chat history plus every file the model ever touched into every turn, and after ten rounds you can be at around 50k tokens of mostly noise.

Environment

  • Claude Code / Pi / OpenClaw (any wrapper with a per-turn context input)
  • A coding task long enough to span 10+ tool-call rounds
  • Python 3.11+ (for the snippets below)

What happened?

I asked one commenter the same question I was asking myself — why is the model getting worse the longer I use it? His answer was the cleanest one in the thread:

The real unlock of rolling your own isn’t the ui, it’s that you control exactly what goes into context each turn. The prebuilt harnesses all over-stuff. — u/agiblox

Three things go wrong as a session grows:

  • Attention dilution. The signal (current task, current plan, current diff) is a smaller and smaller fraction of the input.
  • Stale state. The model “remembers” an old version of a file that has since been edited by another tool call.
  • Instruction drift. Earlier messages effectively become part of the implicit system prompt and start contradicting later ones.

A stacked bar across 10 rounds makes the growth visible — system prompt stays flat, tool descriptions stay flat, but tool calls and conversation history grow until they dwarf the actual work.

Figure: Stacked bar chart of cumulative token usage across 10 rounds of agent tool calls.
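To make the growth concrete, here is a minimal sketch of the arithmetic behind such a chart. It assumes an append-only harness where the fixed part (system prompt plus tool descriptions) never changes and every round adds the same number of tokens. The names `RoundCost` and `simulate`, and the numbers 3,000 and 4,700, are hypothetical and chosen only so that round ten lands at 50k tokens; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class RoundCost:
    """Token counts for one round; all numbers are illustrative, not measured."""
    round_no: int
    fixed: int        # system prompt + tool descriptions (flat across rounds)
    accumulated: int  # tool calls + conversation history so far (grows)

    @property
    def total(self) -> int:
        return self.fixed + self.accumulated

    @property
    def fixed_share(self) -> float:
        return self.fixed / self.total


def simulate(rounds: int, fixed: int, per_round: int) -> list[RoundCost]:
    """Model an append-only harness: every round adds `per_round` tokens."""
    return [RoundCost(r, fixed, per_round * r) for r in range(1, rounds + 1)]


if __name__ == "__main__":
    # Hypothetical inputs: 3,000 fixed tokens, 4,700 new tokens per round.
    for cost in simulate(rounds=10, fixed=3_000, per_round=4_700):
        print(f"round {cost.round_no:2d}: total={cost.total:6d} "
              f"fixed share={cost.fixed_share:.0%}")
```

Under these assumptions the fixed part shrinks from well over a third of the input in round one to a small sliver by round ten, which is the dilution effect described above.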

How to solve it?

The pattern that recurs in the thread is to split the per-turn input into three slots.

Figure: AI agent loop diagram with plan, execute, observe, and reflect steps and token cost annotations.

Slot A — System / role. Small, fixed, version-controlled. Defines the agent’s identity, tool policy, and any project-level invariants. The whole point is that this is the only thing that is identical on every turn, so you can reason about its effect across runs.

Slot B — Working memory. Medium, per-task. Holds the active task, the current plan, the last N tool results, and the in-progress diff. This is the “context window” the model is actually reasoning over.

Slot C — Retrieval. Large, on-demand. Files, prior sessions, PRD, docs, code search results. Only the slice relevant to the current step is pulled in.
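As a rough illustration of "only the relevant slice", here is a minimal retrieval sketch. It ranks files by plain word overlap with the query, which is a deliberately crude stand-in for whatever code search or embedding index a real harness would use. The function names `tokenize` and `retrieve` are hypothetical. Files are re-read from disk on every call, which is one simple way to avoid handing the model a stale copy.

```python
import re
from pathlib import Path


def tokenize(text: str) -> set[str]:
    """Lowercased identifier-like words; a crude stand-in for real code search."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def retrieve(query: str, paths: list[Path], k: int = 5) -> list[tuple[Path, str]]:
    """Return up to k (path, content) pairs ranked by word overlap with the query.

    Files are re-read on every call, so the result always reflects the
    current state on disk rather than a cached, pre-edit version.
    """
    query_words = tokenize(query)
    scored = []
    for path in paths:
        content = path.read_text(encoding="utf-8")
        score = len(query_words & tokenize(content))
        if score > 0:
            scored.append((score, path, content))
    # Highest overlap first; ties broken by path for deterministic output.
    scored.sort(key=lambda item: (-item[0], str(item[1])))
    return [(path, content) for _, path, content in scored[:k]]
```

Files with no overlap are dropped entirely rather than padded in, so an unrelated query returns an empty list instead of filling the slot with noise.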

The harness’s job on each turn is: rebuild Slot A from the project config, rebuild Slot B from the live task state, and only then query Slot C for whatever the current step needs.

Figure: Three-slot context window architecture. Slot A (small fixed system prompt) sits on top, Slot B (medium per-task working memory with plan, recent tool results, and current diff) in the middle, and Slot C (large on-demand retrieval from files, prior sessions, and external docs) on the bottom. Arrows show the harness rebuilding A and B on every turn before querying C.

Here is a sketch of what the per-turn builder looks like:

build_turn_input.py
def build_turn_input(task, project):
    """Assemble one turn's input from three slots, rebuilt from scratch each turn."""
    return {
        # Slot A: small, fixed, version-controlled.
        "system": {
            "role": project.agent_role,
            "tool_policy": project.tool_policy,
            "invariants": project.invariants,
        },
        # Slot B: per-task working memory, read from the live task state.
        "working": {
            "current_goal": task.goal,
            "current_plan": task.plan,
            "recent_tool_results": task.last_n_results(n=5),
            "in_progress_diff": task.current_diff(),
        },
        # Slot C: on-demand retrieval, scoped to the current goal.
        "retrieval": {
            "relevant_files": project.file_index.search(task.goal, k=5),
            "prior_sessions": project.memory.recall(task.goal, k=2),
            "external_docs": project.context7.lookup(task.technologies),
        },
    }
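The builder above assumes a `task` object that exposes `goal`, `plan`, `last_n_results`, and `current_diff`. The thread does not say how that state is stored, so the following is only one possible shape. `TaskState`, `record_result`, `record_diff`, and the `max_results` cap are hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Hypothetical live task state backing Slot B (working memory)."""
    goal: str
    plan: list[str] = field(default_factory=list)
    max_results: int = 20  # assumed cap on how many tool results are kept
    _results: deque = field(default_factory=deque, init=False, repr=False)
    _diff_hunks: list[str] = field(default_factory=list, init=False, repr=False)

    def __post_init__(self) -> None:
        # A bounded deque silently drops the oldest result once it is full.
        self._results = deque(maxlen=self.max_results)

    def record_result(self, tool: str, output: str) -> None:
        self._results.append({"tool": tool, "output": output})

    def last_n_results(self, n: int = 5) -> list[dict]:
        """Return the n most recent tool results, oldest first."""
        if n <= 0:
            return []
        return list(self._results)[-n:]

    def record_diff(self, hunk: str) -> None:
        self._diff_hunks.append(hunk)

    def current_diff(self) -> str:
        return "\n".join(self._diff_hunks)
```

Because the harness records results into this object and the builder only ever reads the last few, old tool output falls out of the prompt without anyone having to trim the conversation by hand.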

The version to avoid is the one most prebuilt harnesses fall back to:

build_turn_input_bad.py
def build_turn_input_bad(history, files):
    return "\n".join(history) + "\n" + "\n".join(open(f).read() for f in files)

That is a single growing string. By round ten it is 50k tokens of mostly old state.

The reason

I think the key reason prebuilt harnesses fail is that they treat context as a string instead of as a build artifact. Every turn they concatenate whatever is at hand, and the model has to re-read all of it. Once you split the input into slots and decide on a per-turn basis what belongs in each one, the input stays small, current, and focused on the work in front of the model.
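One way to read "build artifact" is that the final prompt is assembled under an explicit budget rather than by appending. The sketch below is an assumption about how that could look, not something taken from the thread: Slots A and B always go in, and Slot C chunks are added in relevance order until the budget runs out. The names `estimate_tokens` and `assemble` are hypothetical, and the four-characters-per-token estimate is a rough heuristic that should be replaced with the tokenizer of the model in use.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def assemble(system: str, working: str, retrieved: list[str], budget: int) -> str:
    """Build one turn's prompt: Slots A and B always, Slot C while budget remains.

    `retrieved` is assumed to be sorted most-relevant first.
    """
    parts = [system, working]
    used = estimate_tokens(system) + estimate_tokens(working)
    if used > budget:
        raise ValueError("system + working memory alone exceed the token budget")
    for chunk in retrieved:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # drop this chunk and every less relevant one after it
        parts.append(chunk)
        used += cost
    return "\n\n".join(parts)
```

Failing loudly when Slots A and B do not fit is a deliberate choice in this sketch: it signals that the working memory itself needs trimming, instead of silently cutting the current plan or diff.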

A few related things that came out of the thread:

  • The “memory” layer people keep naming (u/Foxiestofthehounds, u/Downtown-Pear-6509) is just Slot C with a friendly name.
  • u/trmnl_cmdr’s “research in one context window, implementation in another” is Slot B scoped to the current sub-task.
  • “Contextual memory” in gamepad-cli-hub is the same idea: structured slots, not one long string.

Summary

In this post, I showed how to control what goes into the context window of a custom AI coding harness. The key point is to treat the per-turn input as a build artifact with three slots: system (small, fixed), working memory (medium, per-task), and retrieval (large, on-demand). Rebuild Slots A and B on every turn before pulling anything from Slot C.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

