How to Build AI Coding Workflows That Survive Model Changes
Five hours of work. Gone. My AI coding assistant suddenly couldn’t figure out how to create a zip file. On retry, it provided nothing but an archive of the original code plus a description of changes it “would have made.”
But here’s what caught my attention: another developer reported having a governance discipline system that let them use Codex 5.1 with performance “quite close to 5.4”—they didn’t suffer from the model degradation that broke my workflow.
The difference? I had ad-hoc prompts. They had structure.
The Problem
When AI models change or degrade, workflows break. This happens silently—no API version change, no deprecation warning. The endpoint stays the same, but the model behind it behaves differently.
I saw this firsthand:
- Prompts that worked for months suddenly failed
- Instructions were understood but not executed
- Quality dropped while token usage increased
The developer with the governance system had a different experience. As they put it: “Development is slower and burns more tokens indeed, but at least I don’t suffer backlashes from model degradation.”
That trade-off—slower but stable—became my goal.
What is AI Workflow Governance?
Governance is a structured approach to using AI coding assistants. Instead of throwing prompts at a model and hoping for the best, you create systems that:
- Validate outputs before accepting them
- Provide explicit constraints and verification
- Maintain consistency across model versions
- Reduce impact of model degradation
Think of it like building with guardrails. You still get the productivity benefits of AI, but you catch errors before they cascade.
The Four Components
1. Structured Prompt Templates
The most impactful change I made was standardizing prompts. Here’s the template I now use:
```text
## Task
{specific_objective}

## Constraints (MUST follow)
- DO NOT modify: {protected_files}
- DO NOT change: {protected_logic}
- MUST preserve: {required_elements}

## Verification Required
Before making any changes:
1. List ALL files you intend to modify
2. Show EXACT changes you plan to make
3. Explain how each constraint is satisfied
4. WAIT for my approval before applying changes

## Output Format
Provide your response in JSON:
{
  "plan": "description of what you will do",
  "files_to_modify": ["file1", "file2"],
  "changes": [...],
  "constraint_compliance": [...]
}
```
This structure forces the AI to plan before acting and makes failures visible.
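One way to make those failures visible in practice is to reject any response that does not match the requested JSON shape. The following is a minimal sketch, assuming the model returns the JSON object as plain text. `validate_plan` and `REQUIRED_KEYS` are illustrative names I made up for this example, not part of any library.

```python
import json

# Keys the template asks for, with the JSON type each one should have.
REQUIRED_KEYS = {
    "plan": str,
    "files_to_modify": list,
    "changes": list,
    "constraint_compliance": list,
}


def validate_plan(raw_response: str) -> list[str]:
    """Return a list of problems; an empty list means the plan is well-formed."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"response is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["response is not a JSON object"]

    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"{key} should be a {expected_type.__name__}")
    return problems
```

A response that only describes changes it "would have made" fails at `json.loads`, so it never reaches the review stage.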
2. Multi-Stage Validation
I split every interaction into four stages:
Plan → Review → Execute → Verify
At each stage, there’s a gate:
- Plan: AI describes what it will do in structured format
- Review: Automated checks + optional human approval
- Execute: AI makes the actual changes
- Verify: Changes are validated against requirements
This catches issues at the planning stage instead of after broken code is committed.
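The gating idea can be sketched as a small pipeline in which each stage returns a pass/fail result and the first failure stops everything after it. The gate functions below are hypothetical placeholders for illustration; real ones would call the model, run checks, and apply changes.

```python
from typing import Callable

# A gate takes the shared context dict and returns (passed, message).
Gate = Callable[[dict], tuple[bool, str]]


def run_stages(stages: list[tuple[str, Gate]], context: dict) -> dict:
    """Run gates in order and stop at the first one that fails."""
    completed = []
    for name, gate in stages:
        passed, message = gate(context)
        if not passed:
            return {"success": False, "failed_stage": name,
                    "message": message, "completed": completed}
        completed.append(name)
    return {"success": True, "failed_stage": None,
            "message": "all gates passed", "completed": completed}


# Hypothetical gates, for illustration only.
def plan(ctx):
    ctx["files_to_modify"] = ["src/app.py", "README.md"]
    return True, "plan produced"


def review(ctx):
    outside = [f for f in ctx["files_to_modify"] if not f.startswith("src/")]
    return (not outside), f"files outside src/: {outside}"


def execute(ctx):
    ctx["executed"] = True
    return True, "changes applied"


result = run_stages([("plan", plan), ("review", review), ("execute", execute)], {})
print(result["failed_stage"])  # review
```

Because the review gate fails, the execute gate never runs, which is the whole point of the staging.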
3. Model Fallback Chains
When your primary model fails quality checks, you need alternatives:
```python
MODEL_CHAIN = {
    "primary": {"model": "codex-5.4", "quality_threshold": 0.85},
    "secondary": {"model": "codex-5.2", "quality_threshold": 0.80},
    "tertiary": {"model": "claude-sonnet", "quality_threshold": 0.75},
}
```
When codex-5.4 started failing, my system automatically fell back to 5.2. I didn’t lose productivity; I just paid slightly higher token costs while debugging the root cause.
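How a chain like `MODEL_CHAIN` gets walked could look roughly like the sketch below. It assumes two callables that you would supply yourself: `generate(model_name, task)` to call a model and `score(output)` to rate the result between 0 and 1. Both are hypothetical; how quality is actually scored (test pass rate, linting, a rubric) is a choice this sketch does not make for you.

```python
from typing import Callable, Optional

# Ordered chain mirroring MODEL_CHAIN: (model name, quality threshold).
CHAIN = [
    ("codex-5.4", 0.85),
    ("codex-5.2", 0.80),
    ("claude-sonnet", 0.75),
]


def run_with_fallback(
    task: str,
    generate: Callable[[str, str], str],
    score: Callable[[str], float],
    chain=CHAIN,
) -> Optional[dict]:
    """Return the first output whose score meets that model's threshold."""
    for model_name, threshold in chain:
        try:
            output = generate(model_name, task)
        except Exception as exc:  # a provider error also triggers fallback
            print(f"{model_name} raised {exc!r}; trying next model")
            continue
        quality = score(output)
        if quality >= threshold:
            return {"model": model_name, "output": output, "quality": quality}
        print(f"{model_name} scored {quality:.2f} < {threshold}; trying next model")
    return None  # every model fell below its threshold
```

Returning `None` instead of the least-bad output is deliberate: if nothing clears its threshold, the task should stop and wait for a human rather than ship a weak result.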
4. Change Logging and Rollback
Every AI change is logged. When something breaks, I can see exactly what changed and revert:
```python
import json
import uuid
from datetime import datetime
from pathlib import Path


class ChangeLogger:
    def __init__(self, log_dir="ai_changes"):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        # One ID per logger instance, so entries from a single run can be grouped.
        self.session_id = uuid.uuid4().hex

    def get_session_id(self):
        return self.session_id

    def log_change(self, action, files_modified, diff, model_version):
        """Append one change record to today's JSONL log file."""
        now = datetime.now()
        entry = {
            "timestamp": now.isoformat(),
            "action": action,
            "files": files_modified,
            "diff": diff,
            "model": model_version,
            "session_id": self.get_session_id(),
        }

        log_file = self.log_dir / f"{now.strftime('%Y-%m-%d')}.jsonl"
        with open(log_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

        return entry

    def rollback(self, timestamp):
        """Find and revert changes from a specific time."""
        # Implementation depends on your version control
        raise NotImplementedError
```
This gives me visibility into what the AI actually did versus what I asked for.
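Before reverting anything, you first need to know which logged changes fall after the point you want to return to. Here is a sketch of that lookup, assuming the daily one-JSON-object-per-line files that `log_change` writes. `entries_since` is an illustrative helper of my own naming, and the actual revert step is still left to your version control.

```python
import json
from datetime import datetime
from pathlib import Path


def entries_since(log_dir: str, timestamp: str) -> list[dict]:
    """Return logged changes at or after `timestamp`, newest first.

    `timestamp` is an ISO 8601 string, the same format log_change() writes.
    """
    cutoff = datetime.fromisoformat(timestamp)
    matches = []
    for log_file in sorted(Path(log_dir).glob("*.jsonl")):
        with open(log_file) as f:
            for line in f:
                if not line.strip():
                    continue  # tolerate blank lines in the log
                entry = json.loads(line)
                if datetime.fromisoformat(entry["timestamp"]) >= cutoff:
                    matches.append(entry)
    # Newest first, so the most recent change is undone before older ones.
    matches.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]), reverse=True)
    return matches
```

Each returned entry carries the `files` and `diff` fields, which is enough to decide whether to reverse-apply a diff or restore the files from a known-good commit.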
Implementation
Here’s the governance system I built:
```python
import json
import subprocess
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List


class Stage(Enum):
    PLAN = "plan"
    REVIEW = "review"
    EXECUTE = "execute"
    VERIFY = "verify"


@dataclass
class AIAction:
    description: str
    files_to_modify: List[str]
    changes_preview: str
    constraints: List[str]
    approved: bool = False


class GovernanceSystem:
    def __init__(self, model_client, config: dict):
        self.model = model_client
        self.config = config
        # Each fallback must expose the same generate() method as model_client;
        # build client objects from the config entries before passing them in.
        self.fallback_models = config.get("fallback_models", [])
        self.change_log = []

    def structured_prompt(self, task: str, constraints: dict) -> str:
        """Generate a structured prompt with governance constraints."""
        template = """## Task
{task}

## Constraints (MUST follow)
{constraints_text}

## Verification Required
Before making any changes:
1. List ALL files you intend to modify
2. Show EXACT changes you plan to make
3. Explain how each constraint is satisfied
4. WAIT for my approval before applying changes

## Output Format
Provide your response in this JSON structure:
{{
  "plan": "description of what you will do",
  "files_to_modify": ["file1", "file2"],
  "changes": [
    {{"file": "path/to/file", "change": "description"}}
  ],
  "constraint_compliance": [
    {{"constraint": "X", "how_satisfied": "Y"}}
  ]
}}
"""
        # Prefix prohibitions explicitly so "Modify files outside of src/"
        # cannot be read as an instruction to do exactly that.
        lines = [f"- DO NOT: {c}" for c in constraints.get("must_not", [])]
        lines += [f"- MUST: {c}" for c in constraints.get("must", [])]

        return template.format(task=task, constraints_text="\n".join(lines))

    def execute_with_governance(self, task: str, constraints: dict, model=None) -> dict:
        """Full governance workflow: Plan -> Review -> Execute -> Verify."""
        model = model or self.model

        # Stage 1: PLAN
        prompt = self.structured_prompt(task, constraints)
        plan_response = model.generate(prompt, temperature=0.0)

        try:
            action = self.parse_plan(plan_response)
        except (json.JSONDecodeError, AttributeError):
            return {"success": False, "error": "Failed to parse plan", "stage": Stage.PLAN}

        # Stage 2: REVIEW
        if not self.review_action(action):
            return {"success": False, "error": "Action not approved", "stage": Stage.REVIEW}

        # Stage 3: EXECUTE
        results = self.execute_action(action)
        self.change_log.append({
            "action": action,
            "results": results,
            "timestamp": datetime.now().isoformat(),
        })

        # Stage 4: VERIFY
        verification = self.verify_changes(action, results)

        return {
            "success": verification["passed"],
            "stage": Stage.VERIFY,
            "action": action,
            "results": results,
            "verification": verification,
        }

    def review_action(self, action: AIAction) -> bool:
        """Review the proposed action against constraints."""
        # Automated checks
        for constraint in action.constraints:
            if not self.check_constraint(action, constraint):
                print(f"Constraint violation: {constraint}")
                return False

        # Optional human-in-the-loop
        if self.config.get("require_human_approval", False):
            print(f"\nProposed action: {action.description}")
            print(f"Files to modify: {action.files_to_modify}")
            print(f"Preview:\n{action.changes_preview}")

            response = input("Approve? (y/n): ")
            return response.lower() == "y"

        return True

    def execute_action(self, action: AIAction) -> dict:
        """Apply the approved changes. Depends on how your assistant edits files."""
        raise NotImplementedError

    def verify_changes(self, action: AIAction, results: dict) -> dict:
        """Verify the changes were applied correctly."""
        # files_modified and constraints_satisfied are placeholders that
        # always pass; only the test run is actually checked here.
        checks = {
            "files_modified": True,
            "constraints_satisfied": True,
            "tests_pass": True,
            "errors": [],
        }

        # Run tests if configured
        if self.config.get("run_tests_on_change", True):
            test_result = subprocess.run(
                ["pytest", "--tb=short"],
                capture_output=True,
            )
            checks["tests_pass"] = test_result.returncode == 0
            if not checks["tests_pass"]:
                checks["errors"].append(test_result.stdout.decode())

        return {
            "passed": all([
                checks["files_modified"],
                checks["constraints_satisfied"],
                checks["tests_pass"],
            ]),
            "checks": checks,
        }

    def fallback_execute(self, task: str, constraints: dict) -> dict:
        """Try the primary model, then each fallback in order."""
        for model in [self.model] + self.fallback_models:
            try:
                result = self.execute_with_governance(task, constraints, model=model)
                if result["success"]:
                    return result
            except Exception as e:
                print(f"Model failed: {e}")
                continue

        return {"success": False, "error": "All models failed"}

    def parse_plan(self, response) -> AIAction:
        """Parse model response into structured action."""
        # Extract JSON from response
        content = response.content if hasattr(response, "content") else str(response)
        data = json.loads(content)

        return AIAction(
            description=data.get("plan", ""),
            files_to_modify=data.get("files_to_modify", []),
            changes_preview=json.dumps(data.get("changes", []), indent=2),
            constraints=[c.get("constraint") for c in data.get("constraint_compliance", [])],
        )

    def check_constraint(self, action: AIAction, constraint: str) -> bool:
        """Check if action satisfies a specific constraint."""
        # Implement constraint checking logic
        # This is domain-specific
        return True
```
Configuration
Here’s how I configure the system:
```json
{
  "require_human_approval": false,
  "run_tests_on_change": true,
  "fallback_models": [
    {"provider": "openai", "model": "codex-5.2"},
    {"provider": "anthropic", "model": "claude-sonnet"}
  ],
  "constraints": {
    "must_not": [
      "Modify files outside of src/",
      "Delete existing functionality",
      "Change API signatures without explicit approval"
    ],
    "must": [
      "Add tests for new functionality",
      "Update documentation",
      "Follow existing code style"
    ]
  },
  "verification": {
    "run_linter": true,
    "run_tests": true,
    "check_types": true
  }
}
```
Why This Works
I tested this system during the Codex 5.4 degradation incident. Other developers reported:
- “5 hours of work lost”
- “It replaced content instead of creating new page”
- “Suddenly couldn’t figure out how to create a zip file”
My governance system caught failures at the plan stage. When the model proposed wrong targets or shortcuts, the constraint checks flagged them. When quality dropped below thresholds, fallback models took over.
The trade-off is real—development is slower. Each change goes through four stages instead of one. But I haven’t lost work to model degradation since implementing this.
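Most of that protection comes from the constraint checks, and `check_constraint` in the listing above is only a stub. As one concrete case, the "Modify files outside of src/" rule from the configuration can be checked mechanically against the plan's `files_to_modify` list. This is a minimal sketch: `files_outside` is a hypothetical helper, and it assumes the plan lists relative, POSIX-style paths.

```python
from pathlib import PurePosixPath


def files_outside(allowed_root: str, files_to_modify: list[str]) -> list[str]:
    """Return planned paths that are not under `allowed_root`.

    Paths are treated as relative POSIX-style strings, as they would appear
    in a plan; absolute paths and ".." segments are rejected outright.
    """
    root = PurePosixPath(allowed_root)
    violations = []
    for name in files_to_modify:
        path = PurePosixPath(name)
        escapes = path.is_absolute() or ".." in path.parts
        if escapes or root not in path.parents:
            violations.append(name)
    return violations


planned = ["src/api/routes.py", "src/../setup.py", "README.md"]
print(files_outside("src", planned))  # ['src/../setup.py', 'README.md']
```

The check is purely lexical and never touches the filesystem, so it can run at the plan stage before any file has changed.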
The Cost-Benefit Analysis
Before governance:
- Fast development when model works well
- Catastrophic failures when model degrades
- Lost hours debugging “fixes” that introduced bugs
- No visibility into what changed
After governance:
- Slower development (maybe 20-30% more tokens)
- Consistent quality regardless of model changes
- Clear audit trail of all changes
- Automatic fallback to stable models
The math works out: spending 20-30% more tokens is better than losing 5 hours of work.
Common Patterns
I’ve found these patterns consistently useful:
| Pattern | Description | Benefit |
|---|---|---|
| Plan-Review-Execute | AI proposes, human/system approves, then executes | Catches errors early |
| Sandbox Execution | Test changes in isolated environment | Safe experimentation |
| Incremental Changes | Small, atomic modifications | Easy to debug and rollback |
| Constraint Templates | Reusable prompt structures | Consistency across tasks |
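The Sandbox Execution pattern can be approximated with nothing more than a temporary copy of the project. The sketch below assumes a caller-supplied `apply_changes` callback that writes the proposed edits into the copy; that callback and the `run_in_sandbox` name are hypothetical. It isolates file changes only, not network access or other side effects of the command.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def run_in_sandbox(
    project_dir: str,
    apply_changes: Callable[[Path], None],
    command: list[str],
) -> subprocess.CompletedProcess:
    """Copy the project, apply changes to the copy, and run `command` there.

    The original tree is never modified; the copy is deleted afterwards.
    """
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp) / "project"
        shutil.copytree(project_dir, sandbox)
        apply_changes(sandbox)  # e.g. write the AI's proposed edits
        return subprocess.run(command, cwd=sandbox, capture_output=True, text=True)
```

A typical call would pass the test command, for example `["pytest", "--tb=short"]`, and only apply the same changes to the real tree when `returncode` is 0.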
What I Do Now
My production workflow looks like this:
- Every task goes through the governance system
- Automated tests run after each change
- If tests fail or constraints are violated, changes are rejected
- If the primary model fails a quality check, fall back to the previous version
- All changes logged for audit and rollback
This system caught the Codex 5.4 degradation within the first day. I didn’t lose work—I just saw my fallback chain activate and started investigating.
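Noticing degradation quickly comes down to watching how often governed changes pass verification. A rolling pass-rate monitor is one simple way to do that. The sketch below is an illustration only: `QualityMonitor`, the window size, and the 0.8 threshold are assumptions rather than values from my setup.

```python
from collections import deque


class QualityMonitor:
    """Track the pass rate of the last `window` governed changes."""

    def __init__(self, window: int = 20, min_pass_rate: float = 0.8):
        self.results = deque(maxlen=window)  # old results drop off automatically
        self.min_pass_rate = min_pass_rate

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    def pass_rate(self) -> float:
        if not self.results:
            return 1.0  # no evidence of a problem yet
        return sum(self.results) / len(self.results)

    def degraded(self) -> bool:
        # Require a full window so one early failure doesn't trip the alarm.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.pass_rate() < self.min_pass_rate
```

Feeding it the `success` flag from each governed run gives a signal you can alert on or use to switch the primary model.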
Related Knowledge
This governance approach connects to several related problems:
- Model degradation detection: I wrote about how to detect AI model degradation symptoms before they break production
- Instruction following: The core issue of models ignoring constraints is covered in my post on why AI coding assistants ignore instructions
- Benchmarking: Building benchmarking systems for AI models gives you the data to detect degradation early
The governance system I built addresses all of these: detection through monitoring, instruction following through constraints, and benchmarking through quality thresholds.
Governance systems protect your AI coding workflows from model degradation and changes. Structure your prompts, validate outputs, maintain fallbacks, and log all changes. Slower development is better than lost work.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit: Codex 5.4 degradation reports
- OpenAI Model Versioning Documentation
- Anthropic: Claude Best Practices
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!