Why Does GLM5 Lose Context During Long Coding Sessions? Causes and Solutions
Problem
I was working through a complex coding project with GLM5. The first three big steps went smoothly: I created a specs document, developed a detailed plan, and started execution. GLM5 followed instructions, maintained context, and produced quality code.
Then I asked for some adjustments. That’s when things fell apart.
[Earlier in session - worked perfectly]
User: Create a specs document for the authentication module
GLM5: [produces comprehensive specs]

User: Now build a plan for implementation
GLM5: [creates detailed implementation plan]

User: Execute phase 1
GLM5: [implements correctly, follows all constraints]

[Later - context lost]
User: Adjust the rate limiting to use sliding window
GLM5: [completely forgets existing architecture, suggests conflicting changes]

User: Remember we're using PostgreSQL, not Redis
GLM5: "I'll update that" [but then ignores half the existing code]

I found I wasn’t alone. A Reddit user reported the exact same experience: completing “three big steps” successfully, then watching GLM5 “lose its mind” when requesting adjustments.
What I Discovered
The community has a theory: z.ai routes to heavily quantized models after certain usage or context thresholds are reached.
What quantization means:
- Full model: High precision, better reasoning, maintains context
- Quantized model: Compressed, faster/cheaper, degraded quality

z.ai's suspected behavior:
1. Start with full-quality GLM5
2. Monitor usage/context metrics
3. Switch to quantized version when thresholds hit
4. User experiences sudden quality drop

One comment stood out: “glm5 is really good, but now the quant version is on prod and it fuck up all the work.”
Another user noted the service quality has degraded to “1/10 of what it was say 6 months ago.” This suggests a systematic change in how z.ai serves the model, not a one-off issue.
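To make the quantization trade-off concrete, here is a minimal sketch of symmetric round-to-nearest quantization applied to a handful of made-up weights. It is not how z.ai or any other provider quantizes GLM5 (that is not publicly known to me); the weights, bit widths, and function names are illustrative assumptions. It only shows that fewer bits mean a larger rounding error.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up weights, for illustration only
weights = [0.8123, -0.4417, 0.0931, -0.9754, 0.3302, 0.0058]

error_8bit = mean_abs_error(weights, quantize(weights, 8))
error_4bit = mean_abs_error(weights, quantize(weights, 4))

print(f"8-bit mean abs error: {error_8bit:.5f}")
print(f"4-bit mean abs error: {error_4bit:.5f}")
```

The 4-bit error comes out several times larger than the 8-bit error. Whether that level of error is enough to explain the behavior described here is exactly the part that remains a community theory.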
Comparing Providers
I tested GLM5 through different providers and found significant differences.
Provider Comparison:
z.ai (direct)
- Starts well, degrades after extended use
- Inconsistent quality within single session
- No transparency about model switching
- Free tier available

OpenRouter
- Consistent quality throughout session
- Multiple GLM5 variants to choose from
- Clear model specifications
- Pay per use

Ollama Cloud
- Local-like experience via cloud
- Consistent model behavior
- No quantization surprises
- API compatible

The key insight from the community: “Try glm-5 via any other provider (ollama-cloud, openrouter, …) and you’ll have a much better experience.”
How I Now Handle Long Coding Sessions
After this experience, I changed my approach to GLM5 and similar models.
Strategy 1: Use Alternative Providers
I stopped using z.ai for extended coding work. Instead:
# Model identifiers differ per provider; check each provider's model list
# for the exact GLM5 name before running these.

# OpenRouter endpoint
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zhipu/glm-4-plus",
    "messages": [{"role": "user", "content": "..."}]
  }'

# Ollama Cloud endpoint
curl https://api.ollama.cloud/v1/chat/completions \
  -H "Authorization: Bearer $OLLAMA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm5",
    "messages": [{"role": "user", "content": "..."}]
  }'

The same model through different providers maintains context throughout long sessions.
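Because both endpoints accept the same OpenAI-style chat-completions payload, switching providers mostly comes down to swapping a URL, a key, and a model name. The sketch below builds (but does not send) such a request using only the standard library. The `Provider` dataclass, the `build_chat_request` helper, and the placeholder key are assumptions for illustration; the URL and model name are copied from the curl example, so verify them against the provider's documentation.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    url: str      # full chat-completions URL
    api_key: str
    model: str    # provider-specific model identifier


def build_chat_request(provider: Provider, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions POST request without sending it."""
    body = json.dumps({"model": provider.model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        provider.url,
        data=body,
        headers={
            "Authorization": f"Bearer {provider.api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


ollama_cloud = Provider(
    name="ollama-cloud",
    url="https://api.ollama.cloud/v1/chat/completions",  # as in the curl example
    api_key="YOUR_KEY_HERE",                              # placeholder
    model="glm5",
)
request = build_chat_request(ollama_cloud, [{"role": "user", "content": "..."}])
# To actually send it: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```

Keeping provider details in one small record makes it easy to move a session elsewhere when quality drops.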
Strategy 2: Session Checkpointing
I now implement context checkpointing for long coding workflows.
from datetime import datetime
import json


class SessionManager:
    def __init__(self, checkpoint_threshold: int = 50):
        self.messages = []
        self.checkpoint_threshold = checkpoint_threshold
        self.key_decisions = []
        self.constraints = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

        # Checkpoint before degradation threshold
        if len(self.messages) >= self.checkpoint_threshold:
            return self.checkpoint()

        return None

    def checkpoint(self) -> dict:
        """Save essential context before starting fresh session."""
        return {
            "timestamp": datetime.now().isoformat(),
            "message_count": len(self.messages),
            "key_decisions": self.key_decisions,
            "constraints": self.constraints,
            "recent_summary": self._summarize_recent()
        }

    def _summarize_recent(self) -> str:
        """Extract essential context from recent messages."""
        # In practice, use another LLM call or manual extraction
        return "Summary of recent work..."

    def add_decision(self, decision: str):
        """Track important decisions made during session."""
        self.key_decisions.append(decision)

    def add_constraint(self, constraint: str):
        """Track constraints that must be maintained."""
        self.constraints.append(constraint)

Strategy 3: Break Sessions at Natural Boundaries
Instead of pushing through degradation, I start fresh at logical points.
Wrong approach (single session):
[Session 1: messages 1-150]
- Specs document (messages 1-30) ✓
- Implementation plan (messages 31-60) ✓
- Phase 1 execution (messages 61-100) ✓
- Phase 2 adjustments (messages 101-150) ✗ Quality degraded

Correct approach (chunked sessions):
[Session 1: messages 1-30]
- Specs document ✓
- Save checkpoint, start fresh

[Session 2: messages 1-35]
- Load specs summary
- Implementation plan ✓
- Save checkpoint, start fresh

[Session 3: messages 1-50]
- Load specs + plan summary
- Phase 1 execution ✓
- Save checkpoint, start fresh

Strategy 4: Compare Model Outputs
For critical decisions, I run the same prompt through multiple providers.
import asyncio
import os
from urllib.parse import urlparse

import aiohttp

# API keys are read from the same environment variables as the curl examples
OPENROUTER_KEY = os.environ["OPENROUTER_KEY"]
OLLAMA_KEY = os.environ["OLLAMA_KEY"]


async def query_provider(url: str, api_key: str, prompt: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.post(
            url,
            headers={"Authorization": f"Bearer {api_key}"},
            # The model identifier may differ per provider; adjust as needed
            json={"model": "glm-5", "messages": [{"role": "user", "content": prompt}]},
        ) as response:
            response.raise_for_status()  # fail loudly on HTTP errors
            result = await response.json()
            return result["choices"][0]["message"]["content"]


async def compare_outputs(prompt: str):
    """Query multiple providers for the same prompt."""
    providers = [
        ("https://openrouter.ai/api/v1/chat/completions", OPENROUTER_KEY),
        ("https://api.ollama.cloud/v1/chat/completions", OLLAMA_KEY),
    ]

    tasks = [query_provider(url, key, prompt) for url, key in providers]

    # Run all requests concurrently; results keep the order of `providers`
    results = await asyncio.gather(*tasks)

    for (url, _), result in zip(providers, results):
        provider_name = urlparse(url).netloc  # e.g. "openrouter.ai"
        print(f"\n=== {provider_name} ===")
        print(result)

    return results

# Run with: asyncio.run(compare_outputs("..."))

This helps me identify when a provider is giving degraded output.
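Reading two long answers side by side is slow, so a rough similarity score can help flag the cases worth a closer look. The sketch below uses `difflib.SequenceMatcher` from the standard library. The `0.5` threshold, the function names, and the sample strings are arbitrary assumptions, and a low score only means the answers differ, not which one is degraded.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(outputs: dict) -> dict:
    """Return a similarity ratio in [0, 1] for every pair of provider outputs."""
    scores = {}
    for (name_a, text_a), (name_b, text_b) in combinations(outputs.items(), 2):
        scores[(name_a, name_b)] = SequenceMatcher(None, text_a, text_b).ratio()
    return scores


def flag_divergent(outputs: dict, threshold: float = 0.5) -> list:
    """List provider pairs whose similarity falls below the threshold."""
    return [
        pair
        for pair, score in pairwise_similarity(outputs).items()
        if score < threshold
    ]


# Made-up responses, for illustration only
outputs = {
    "provider_a": "Use a sliding window counter stored in PostgreSQL.",
    "provider_b": "Use a sliding window counter stored in PostgreSQL.",
    "provider_c": "Install Redis and rewrite the authentication module.",
}
scores = pairwise_similarity(outputs)
for pair, score in scores.items():
    print(pair, round(score, 2))
print("Needs review:", flag_divergent(outputs))
```

Character-level similarity is crude for code, where two correct answers can look quite different, so treat the score as a prompt to read the outputs rather than a verdict.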
Common Mistakes I Made
I learned these lessons the hard way.
Mistake 1: Assuming all providers are equivalent
I thought “GLM5 is GLM5” regardless of where I accessed it. The provider’s infrastructure and model serving choices significantly impact quality.
Mistake 2: Pushing through degraded sessions
I kept going when quality dropped, hoping it would recover. It never did. Starting fresh is faster than fighting degradation.
Mistake 3: No context preservation
I didn’t save key decisions and constraints. When I had to start a new session, I lost track of important context. Now I checkpoint religiously.
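One way to make that habit concrete is to write the checkpoint to disk and turn it into the opening message of the next session. The sketch below assumes a checkpoint dictionary with `key_decisions`, `constraints`, and `recent_summary` keys, like the one returned by the `SessionManager.checkpoint()` method shown earlier. The file name, the helper names, and the sample values are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(checkpoint: dict, path: str) -> None:
    """Write the checkpoint to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(checkpoint, f, indent=2)


def load_checkpoint(path: str) -> dict:
    """Read a checkpoint back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def seed_messages(checkpoint: dict) -> list:
    """Turn a checkpoint into the first message of a fresh session."""
    lines = ["Context carried over from the previous session."]
    lines.append("Constraints that must be maintained:")
    lines += [f"- {c}" for c in checkpoint.get("constraints", [])]
    lines.append("Key decisions so far:")
    lines += [f"- {d}" for d in checkpoint.get("key_decisions", [])]
    lines.append("Summary of recent work: " + checkpoint.get("recent_summary", ""))
    return [{"role": "system", "content": "\n".join(lines)}]


# Sample values, for illustration only
checkpoint = {
    "key_decisions": ["Rate limiting uses a sliding window"],
    "constraints": ["Use PostgreSQL, not Redis"],
    "recent_summary": "Phase 1 of the authentication module is implemented.",
}
path = os.path.join(tempfile.gettempdir(), "glm5_session_checkpoint.json")
save_checkpoint(checkpoint, path)
messages = seed_messages(load_checkpoint(path))
print(messages[0]["content"])
```

Putting constraints first in the seed message keeps them visible from the very first turn of the new session.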
Mistake 4: Ignoring early warning signs
When GLM5 first forgot a constraint, I dismissed it. By the third time, I’d wasted significant effort on code that needed rework.
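A cheap way to catch the first slip is to check each response against the constraints you have recorded. The sketch below is a deliberately naive keyword check: each constraint is paired with terms that should not show up in a response. The `FORBIDDEN_TERMS` mapping, the function name, and the sample response are assumptions for illustration. A match is only a hint to review the response (a sentence such as "not Redis" would also match), not proof of context loss.

```python
import re

# Constraint -> terms suggesting the constraint was forgotten (illustrative)
FORBIDDEN_TERMS = {
    "Use PostgreSQL, not Redis": ["redis"],
    "Rate limiting uses a sliding window": ["fixed window", "token bucket"],
}


def violated_constraints(response: str, forbidden_terms: dict) -> list:
    """Return the constraints whose forbidden terms appear in the response."""
    violations = []
    for constraint, terms in forbidden_terms.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, response, re.IGNORECASE):
                violations.append(constraint)
                break  # one hit per constraint is enough
    return violations


response = "I'll store the rate limit counters in Redis with a sliding window."
print(violated_constraints(response, FORBIDDEN_TERMS))
```

Running a check like this after every response turns "the third time" into "the first time", which is when starting a fresh session is cheapest.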
Mistake 5: Single-provider dependency
I relied exclusively on z.ai. When their quality degraded, I had no backup. Now I have accounts with multiple providers.
When to Switch Providers vs. Starting Fresh
I’ve developed rules for when to do what.
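The rules that follow can also be condensed into a small helper. This is only a sketch: the threshold of 50 messages is the lower bound of the 50-80 range given in this section, while the function name, its parameters, and the order in which the rules are checked are assumptions.

```python
def next_action(message_count: int, quality_degraded: bool, phase_complete: bool) -> str:
    """Suggest whether to switch providers, start fresh, or keep going."""
    if quality_degraded:
        return "switch_provider"   # degradation or forgotten context mid-session
    if phase_complete or message_count >= 50:
        return "fresh_session"     # natural boundary or long conversation
    return "keep_session"          # short, focused, and still behaving well


print(next_action(message_count=25, quality_degraded=False, phase_complete=False))
print(next_action(message_count=60, quality_degraded=False, phase_complete=False))
print(next_action(message_count=30, quality_degraded=True, phase_complete=False))
```

Quality is checked first on the assumption that a degraded session is not worth continuing regardless of its length.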
Switch providers when:
- Quality degrades mid-session
- You notice inconsistent responses
- The model "forgets" recently established context
- Response quality is noticeably worse than previous sessions

Start a fresh session when:
- Conversation exceeds 50-80 messages
- Task transitions to a new phase
- You need to switch coding contexts
- Current session has accumulated too much noise

Keep current session when:
- Under 40 messages and quality is good
- Task is nearly complete
- Context is simple and focused
- You're in the middle of a tight feedback loop

The Root Cause
From what I’ve gathered, the issue isn’t GLM5 itself. The model performs consistently well through providers like OpenRouter and Ollama Cloud.
The problem is how z.ai serves the model. Community reports suggest:
- z.ai uses dynamic routing based on usage patterns
- After hitting thresholds (usage or context), traffic routes to quantized models
- Quantized models tend to reason less reliably and recall earlier context less accurately
- Users experience sudden quality drops without warning
This is a provider-level optimization that trades user experience for cost efficiency. The model itself is capable of maintaining context through long sessions when served properly.
Summary
In this post, I explained why GLM5 loses context during long coding sessions through z.ai. The root cause appears to be provider-level routing to quantized models after usage thresholds are reached.
The key point is that alternative providers like OpenRouter or Ollama Cloud give you consistent GLM5 quality without the degradation surprises. For extended coding workflows, implement session checkpointing and start fresh at natural boundaries rather than pushing through degraded sessions.
The model isn’t the problem. The provider is. Choose your provider carefully for any work that requires consistent quality over long sessions.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit Discussion: GLM5 Context Loss
- OpenRouter API Documentation
- Ollama Cloud Documentation
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!