
Why Does GLM5 Lose Context During Long Coding Sessions? Causes and Solutions

Problem

I was working through a complex coding project with GLM5. The first three big steps went smoothly: I created a specs document, developed a detailed plan, and started execution. GLM5 followed instructions, maintained context, and produced quality code.

Then I asked for some adjustments. That’s when things fell apart.

[Earlier in session - worked perfectly]
User: Create a specs document for the authentication module
GLM5: [produces comprehensive specs]
User: Now build a plan for implementation
GLM5: [creates detailed implementation plan]
User: Execute phase 1
GLM5: [implements correctly, follows all constraints]
[Later - context lost]
User: Adjust the rate limiting to use sliding window
GLM5: [completely forgets existing architecture, suggests conflicting changes]
User: Remember we're using PostgreSQL, not Redis
GLM5: "I'll update that" [but then ignores half the existing code]

I found I wasn’t alone. A Reddit user reported the exact same experience: they completed “three big steps” successfully, then watched GLM5 “lose its mind” as soon as they requested adjustments.

What I Discovered

The community has a theory: z.ai routes to heavily quantized models after certain usage or context thresholds are reached.

What quantization means:
- Full-precision model: weights stored at higher numeric precision, better reasoning, holds up better over long context
- Quantized model: weights compressed to fewer bits, faster and cheaper to serve, some loss of quality

z.ai's suspected behavior:
1. Start with full-quality GLM5
2. Monitor usage/context metrics
3. Switch to quantized version when thresholds hit
4. User experiences sudden quality drop
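
To make the precision trade-off concrete, here is a minimal sketch of what quantization does to a handful of numbers. It uses simple symmetric uniform quantization on made-up weights; it is an illustration of the general idea only, and says nothing about how z.ai or any GLM5 deployment is actually quantized. The function names are hypothetical.

```python
def quantize_roundtrip(weights, bits):
    """Symmetric uniform quantization to `bits` bits, then back to float."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs else 1.0
    # Snap each weight to the nearest representable step, then rescale.
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up example weights, not taken from any real model.
weights = [0.013, -0.402, 0.250, 0.731, -0.089, 0.555, -0.910, 0.120]

for bits in (8, 4):
    restored = quantize_roundtrip(weights, bits)
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, restored):.4f}")
```

Fewer bits means coarser steps and a larger rounding error per weight. Across billions of weights, that is the plausible mechanism behind a heavily quantized model feeling less reliable.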

One comment stood out: “glm5 is really good, but now the quant version is on prod and it fuck up all the work.”

Another user noted the service quality has degraded to “1/10 of what it was say 6 months ago.” This suggests a systematic change in how z.ai serves the model, not a one-off issue.

Comparing Providers

I tested GLM5 through different providers and found significant differences.

Provider Comparison:

z.ai (direct)
- Starts well, degrades after extended use
- Inconsistent quality within single session
- No transparency about model switching
- Free tier available

OpenRouter
- Consistent quality throughout session
- Multiple GLM5 variants to choose from
- Clear model specifications
- Pay per use

Ollama Cloud
- Local-like experience via cloud
- Consistent model behavior
- No quantization surprises
- API compatible

The key insight from the community: “Try glm-5 via any other provider (ollama-cloud, openrouter, …) and you’ll have a much better experience.”

How I Now Handle Long Coding Sessions

After this experience, I changed my approach to GLM5 and similar models.

Strategy 1: Use Alternative Providers

I stopped using z.ai for extended coding work. Instead:

# OpenRouter endpoint
# The model slug below may not be the GLM5 one; check the provider's model list.
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zhipu/glm-4-plus",
    "messages": [{"role": "user", "content": "..."}]
  }'

# Ollama Cloud endpoint
# Verify the base URL and model name against the provider's current docs.
curl https://api.ollama.cloud/v1/chat/completions \
  -H "Authorization: Bearer $OLLAMA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm5",
    "messages": [{"role": "user", "content": "..."}]
  }'

In my sessions, the same model served through these providers kept its context throughout long coding work.
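
If you script your sessions, it helps to keep provider details in one place so that switching is a one-word change. The following is a minimal sketch using only the standard library. The endpoint URLs and model names are copied from the curl commands above and should be verified against each provider's documentation; `PROVIDERS` and `build_request` are hypothetical names. The function only builds the request and does not send it.

```python
import json
import urllib.request

# URLs and model names taken from the curl examples; treat them as unverified.
PROVIDERS = {
    "openrouter": {
        "url": "https://openrouter.ai/api/v1/chat/completions",
        "model": "zhipu/glm-4-plus",
    },
    "ollama": {
        "url": "https://api.ollama.cloud/v1/chat/completions",
        "model": "glm5",
    },
}


def build_request(provider: str, api_key: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for one provider."""
    cfg = PROVIDERS[provider]
    body = json.dumps({"model": cfg["model"], "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        cfg["url"],
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


history = [{"role": "user", "content": "Adjust the rate limiting to use sliding window"}]
request = build_request("ollama", "test-key", history)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. Because the whole message history travels with every request, moving a session to another provider only requires changing the provider key.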

Strategy 2: Session Checkpointing

I now implement context checkpointing for long coding workflows.

session_manager.py
from datetime import datetime
import json


class SessionManager:
    def __init__(self, checkpoint_threshold: int = 50):
        self.messages = []
        self.checkpoint_threshold = checkpoint_threshold
        self.key_decisions = []
        self.constraints = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Checkpoint before degradation threshold
        if len(self.messages) >= self.checkpoint_threshold:
            return self.checkpoint()
        return None

    def checkpoint(self) -> dict:
        """Save essential context before starting fresh session."""
        return {
            "timestamp": datetime.now().isoformat(),
            "message_count": len(self.messages),
            "key_decisions": self.key_decisions,
            "constraints": self.constraints,
            "recent_summary": self._summarize_recent()
        }

    def _summarize_recent(self) -> str:
        """Extract essential context from recent messages."""
        # In practice, use another LLM call or manual extraction
        return "Summary of recent work..."

    def add_decision(self, decision: str):
        """Track important decisions made during session."""
        self.key_decisions.append(decision)

    def add_constraint(self, constraint: str):
        """Track constraints that must be maintained."""
        self.constraints.append(constraint)
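
A checkpoint is only useful if the next session starts from it. The sketch below shows one way to turn a checkpoint dict (shaped like the one returned by `checkpoint()`) into the opening message of a fresh session. The function name, the wording of the prompt, and the example values are all assumptions for illustration.

```python
def checkpoint_to_messages(checkpoint: dict) -> list:
    """Turn a checkpoint dict into the opening messages of a fresh session."""
    lines = ["You are continuing an existing coding project."]
    if checkpoint.get("constraints"):
        lines.append("Constraints that must be respected:")
        lines += [f"- {c}" for c in checkpoint["constraints"]]
    if checkpoint.get("key_decisions"):
        lines.append("Decisions already made:")
        lines += [f"- {d}" for d in checkpoint["key_decisions"]]
    if checkpoint.get("recent_summary"):
        lines.append("Summary of recent work:")
        lines.append(checkpoint["recent_summary"])
    return [{"role": "system", "content": "\n".join(lines)}]


# Example checkpoint with made-up values.
example = {
    "timestamp": "2025-01-01T12:00:00",
    "message_count": 50,
    "key_decisions": ["Rate limiting uses a sliding window"],
    "constraints": ["Use PostgreSQL, not Redis"],
    "recent_summary": "Phase 1 of the authentication module is implemented.",
}

seed = checkpoint_to_messages(example)
print(seed[0]["content"])
```

Putting constraints first keeps the rules the model most often forgot at the top of the new context.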

Strategy 3: Break Sessions at Natural Boundaries

Instead of pushing through degradation, I start fresh at logical points.

Wrong approach (single session):

[Session 1: messages 1-150]
- Specs document (messages 1-30) ✓
- Implementation plan (messages 31-60) ✓
- Phase 1 execution (messages 61-100) ✓
- Phase 2 adjustments (messages 101-150) ✗ Quality degraded

Correct approach (chunked sessions):

[Session 1: messages 1-30]
- Specs document ✓
- Save checkpoint, start fresh

[Session 2: messages 1-35]
- Load specs summary
- Implementation plan ✓
- Save checkpoint, start fresh

[Session 3: messages 1-50]
- Load specs + plan summary
- Phase 1 execution ✓
- Save checkpoint, start fresh
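
The "save checkpoint" and "load summary" steps can be as simple as writing a JSON file per phase. This is a minimal sketch under that assumption; the function names, file name, and example content are hypothetical, and the demo writes to a temporary directory so it leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(checkpoint: dict, path: Path) -> None:
    """Write a checkpoint to disk as readable JSON."""
    path.write_text(json.dumps(checkpoint, indent=2), encoding="utf-8")


def load_checkpoint(path: Path) -> dict:
    """Read a checkpoint back at the start of the next session."""
    return json.loads(path.read_text(encoding="utf-8"))


# Made-up checkpoint for the end of the specs session.
specs_checkpoint = {
    "phase": "specs",
    "constraints": ["Use PostgreSQL, not Redis"],
    "recent_summary": "Specs document for the authentication module is final.",
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "specs_checkpoint.json"
    save_checkpoint(specs_checkpoint, path)
    restored = load_checkpoint(path)

print(restored["recent_summary"])
```

One file per phase also gives you a plain-text record you can read or edit by hand before loading it into the next session.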

Strategy 4: Compare Model Outputs

For critical decisions, I run the same prompt through multiple providers.

compare_providers.py
import asyncio
import os
from urllib.parse import urlparse

import aiohttp

# API keys are read from environment variables
OPENROUTER_KEY = os.environ.get("OPENROUTER_KEY", "")
OLLAMA_KEY = os.environ.get("OLLAMA_KEY", "")


async def query_provider(url: str, api_key: str, prompt: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.post(
            url,
            headers={"Authorization": f"Bearer {api_key}"},
            # Model names can differ per provider; adjust as needed
            json={"model": "glm-5", "messages": [{"role": "user", "content": prompt}]},
        ) as response:
            response.raise_for_status()  # fail loudly on HTTP errors
            result = await response.json()
            return result["choices"][0]["message"]["content"]


async def compare_outputs(prompt: str):
    """Query multiple providers for the same prompt."""
    providers = [
        ("https://openrouter.ai/api/v1/chat/completions", OPENROUTER_KEY),
        ("https://api.ollama.cloud/v1/chat/completions", OLLAMA_KEY),
    ]
    tasks = [
        query_provider(url, key, prompt)
        for url, key in providers
    ]
    results = await asyncio.gather(*tasks)
    for (url, _), output in zip(providers, results):
        # Use the full hostname; the first label of "api.ollama.cloud" is just "api"
        provider_name = urlparse(url).hostname
        print(f"\n=== {provider_name} ===")
        print(output)
    return results

This helps me identify when a provider is giving degraded output.
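
Reading every pair of outputs by eye gets tedious, so a rough similarity score can tell you which comparisons deserve attention. The sketch below uses `difflib.SequenceMatcher` from the standard library. The function names and the 0.5 threshold are assumptions, and text similarity is a crude signal: two correct answers can be worded very differently, so treat a flag as a prompt to read the outputs, not as a verdict.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(outputs: dict) -> dict:
    """Return a 0..1 similarity ratio for every pair of provider outputs."""
    return {
        (a, b): SequenceMatcher(None, outputs[a], outputs[b]).ratio()
        for a, b in combinations(sorted(outputs), 2)
    }


def flag_divergent(outputs: dict, threshold: float = 0.5) -> list:
    """List the provider pairs whose outputs differ more than expected."""
    scores = pairwise_similarity(outputs)
    return [pair for pair, score in scores.items() if score < threshold]


# Made-up outputs keyed by provider name.
outputs = {
    "openrouter": "Use a sliding window counter stored in PostgreSQL.",
    "ollama": "Use a sliding window counter stored in PostgreSQL.",
}

for pair, score in pairwise_similarity(outputs).items():
    print(pair, round(score, 2))
print("Divergent pairs:", flag_divergent(outputs))
```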

Common Mistakes I Made

I learned these lessons the hard way.

Mistake 1: Assuming all providers are equivalent

I thought “GLM5 is GLM5” regardless of where I accessed it. The provider’s infrastructure and model serving choices significantly impact quality.

Mistake 2: Pushing through degraded sessions

I kept going when quality dropped, hoping it would recover. It never did. Starting fresh is faster than fighting degradation.

Mistake 3: No context preservation

I didn’t save key decisions and constraints. When I had to start a new session, I lost track of important context. Now I checkpoint religiously.

Mistake 4: Ignoring early warning signs

When GLM5 first forgot a constraint, I dismissed it. By the third time, I’d wasted significant effort on code that needed rework.
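
One cheap way to catch that first slip is to check each reply against the constraints you are tracking. The following is a minimal sketch, assuming each constraint can be expressed as a list of terms that should not appear in a reply. The rule format and function name are hypothetical, and plain substring matching will produce false positives (a reply saying "not Redis" still matches), so use it as a nudge to look closer.

```python
def violated_constraints(reply: str, rules: list) -> list:
    """Return the names of rules whose forbidden terms appear in the reply."""
    text = reply.lower()
    return [
        rule["name"]
        for rule in rules
        if any(term.lower() in text for term in rule["forbidden"])
    ]


# One rule based on the PostgreSQL/Redis example from the session above.
rules = [{"name": "Use PostgreSQL, not Redis", "forbidden": ["redis"]}]

reply = "I'll store the sliding window counters in Redis sorted sets."
print(violated_constraints(reply, rules))
```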

Mistake 5: Single-provider dependency

I relied exclusively on z.ai. When their quality degraded, I had no backup. Now I have accounts with multiple providers.

When to Switch Providers vs. Starting Fresh

I’ve developed rules for when to do what.

Switch providers when:
- Quality degrades mid-session
- You notice inconsistent responses
- The model "forgets" recently established context
- Response quality is noticeably worse than previous sessions

Start a fresh session when:
- Conversation exceeds 50-80 messages
- Task transitions to a new phase
- You need to switch coding contexts
- Current session has accumulated too much noise

Keep current session when:
- Under 40 messages and quality is good
- Task is nearly complete
- Context is simple and focused
- You're in the middle of a tight feedback loop
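
These rules can be written down as a small decision function. The sketch below is one possible encoding: the order in which the rules are checked and the use of 50 (the lower end of the 50-80 range) as the cutoff are assumptions, and `quality_ok` stands in for your own judgment of the recent replies.

```python
def next_action(
    message_count: int,
    quality_ok: bool,
    phase_change: bool = False,
    nearly_done: bool = False,
) -> str:
    """Suggest what to do next based on the rules of thumb above."""
    if not quality_ok:
        # Degraded or inconsistent output: try another provider.
        return "switch_provider"
    if nearly_done:
        # Not worth restarting when the task is almost complete.
        return "keep_session"
    if phase_change or message_count > 50:
        # Long conversation or new phase: checkpoint and start fresh.
        return "fresh_session"
    return "keep_session"


print(next_action(message_count=20, quality_ok=True))
print(next_action(message_count=90, quality_ok=True))
print(next_action(message_count=30, quality_ok=False))
```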

The Root Cause

From what I’ve gathered, the issue isn’t GLM5 itself. The model performs consistently well through providers like OpenRouter and Ollama Cloud.

The problem is how z.ai serves the model. Community reports suggest:

  1. z.ai uses dynamic routing based on usage patterns
  2. After hitting thresholds (usage or context), traffic routes to quantized models
  3. Quantized models reason less reliably and recall long context less accurately (quantization lowers weight precision; it does not by itself shrink the context window)
  4. Users experience sudden quality drops without warning

If these reports are accurate, this is a provider-level optimization that trades user experience for cost efficiency. The model itself is capable of maintaining context through long sessions when served properly.

Summary

In this post, I explained why GLM5 loses context during long coding sessions through z.ai. The root cause appears to be provider-level routing to quantized models after usage thresholds are reached.

The key point is that using alternative providers like OpenRouter or Ollama Cloud gives you consistent GLM5 quality without the degradation surprises. For extended coding workflows, implement session checkpointing and start fresh at natural boundaries rather than pushing through degraded sessions.

The model isn’t the problem. The provider is. Choose your provider carefully for any work that requires consistent quality over long sessions.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

