Why Do AI Models Feel Dumber Over Time? The Psychology of AI Perception

Feb 4, 2026

Problem

When I use ChatGPT or Claude, I feel like it’s getting worse over time.

I see Reddit posts with titles like “has sonnet 5 been nerfed?” where users complain:

"when it first dropped i was blown away... now its back to writing
me a 2000 word response about why it can't fold laundry"

"it used to walk my dog and raise my kids, now it can't even
basic coding tasks"

The thread is tagged as humor/satire, has 787 upvotes, and the TL;DR notes:

"a couple of users missed the memo that Sonnet 5 isn't real"

But here’s the thing - even though this is satire, it perfectly captures how I actually feel. I remember when the AI first came out, it was amazing. Now it feels… worse?

So what’s happening? Did the companies secretly nerf their models?

What happened?

I’ve been using AI models for over a year now. When I first started, every output felt like magic. I would share impressive results with friends, save screenshots of amazing responses, and attribute any failures to “how I prompted it.”

Here’s what I noticed:

Stage 1: The Launch Honeymoon (Weeks 1-4)

Every output feels magical. I share examples like:

Me: "Write a Python script to scrape this website"
AI: [Produces perfect, production-ready code with error handling]
Me: "WOW! This is incredible!"

I save these wins. I ignore the times when it fails or gives generic responses. When it doesn’t work, I think “I must have prompted it wrong.”

Stage 2: Familiarity Sets In (Months 1-3)

The AI becomes a normal tool in my workflow. I notice limitations more than capabilities. I start seeing posts like “Has anyone else noticed ChatGPT is getting worse?” on Reddit.

I compare today’s average outputs to my best memories from the honeymoon phase.

Stage 3: Disappointment (Month 3+)

“It used to be better” becomes my default explanation for any failure. Normal variance feels like degradation. I start attributing failures to “silent cost cutting” or “they nerfed it.”

But here’s the problem - I never actually tested whether it got worse. I just feel like it did.

How to solve it?

I tried to figure out if this was real or just my perception. Here’s what I did:

Step 1: Track Actual Performance

I created a simple logging system:

import json
from datetime import datetime
from pathlib import Path

class AIPerformanceTracker:
    def __init__(self, log_file="ai_performance.jsonl"):
        self.log_file = Path(log_file)

    def log_interaction(self, prompt, response, quality_rating, notes=""):
        """Log each AI interaction with objective metrics"""
        entry = {
            "timestamp": datetime.now().isoformat(),
            "prompt": prompt,
            "response_preview": response[:200],  # First 200 chars
            "quality": quality_rating,  # 1-5 scale
            "response_length": len(response),
            "notes": notes
        }

        with open(self.log_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

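    def analyze_trends(self, days=30):
        """Return the average quality rating per ISO week for the last `days` days"""
        if not self.log_file.exists():
            return {}  # Nothing logged yet

        now = datetime.now()
        weekly_ratings = {}
        with open(self.log_file, "r") as f:
            for line in f:
                if not line.strip():
                    continue  # Skip blank lines
                entry = json.loads(line)
                logged_at = datetime.fromisoformat(entry["timestamp"])
                if (now - logged_at).days >= days:
                    continue  # Outside the requested window

                # Group by ISO year and week, e.g. "2026-W05"
                year, week, _ = logged_at.isocalendar()
                weekly_ratings.setdefault(f"{year}-W{week:02d}", []).append(entry["quality"])

        # Calculate averages, oldest week first
        return {week: sum(ratings) / len(ratings)
                for week, ratings in sorted(weekly_ratings.items())}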

I used this for two weeks. Every time I used the AI, I logged:

tracker = AIPerformanceTracker()

# After each AI interaction
tracker.log_interaction(
    prompt="Explain recursion in Python",
    response=ai_response,
    quality_rating=4,  # My subjective 1-5 rating
    notes="Clear explanation, good examples"
)
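With a few weeks of ratings logged, one rough way to check whether a drop is more than noise is a permutation test. The sketch below is only an illustration and not part of the tracker above: the function name `permutation_test` and the sample ratings are hypothetical, and it treats the 1-5 ratings as plain numbers.

```python
import random
from statistics import mean


def permutation_test(early, late, n_resamples=10_000, seed=0):
    """Estimate how often a drop at least as large as the observed one
    appears when the 'early'/'late' labels are shuffled at random."""
    rng = random.Random(seed)
    observed_drop = mean(early) - mean(late)
    pooled = list(early) + list(late)
    n_early = len(early)

    at_least_as_large = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        drop = mean(pooled[:n_early]) - mean(pooled[n_early:])
        if drop >= observed_drop:
            at_least_as_large += 1

    return observed_drop, at_least_as_large / n_resamples


# Hypothetical ratings, for illustration only (not real measurements)
early_ratings = [4, 5, 3, 4, 4, 5, 3, 4]
late_ratings = [4, 3, 4, 5, 3, 4, 4, 4]

drop, p_value = permutation_test(early_ratings, late_ratings)
print(f"Observed drop: {drop:.2f}, share of shuffles with a drop this large: {p_value:.2f}")
```

If a large share of random shuffles produces a drop as big as the one I observed, the "decline" is indistinguishable from ordinary rating noise.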

Step 2: A/B Test with Fixed Prompts

I picked 5 prompts I use regularly and ran them once a week with temperature=0:

import json  # Needed for json.dump() below
import time
from datetime import datetime

import anthropic

client = anthropic.Anthropic()

test_prompts = [
    "Explain the difference between list and tuple in Python",
    "Write a function to reverse a linked list",
    "Debug this code: [paste broken code]",
    "Explain REST APIs to a beginner",
    "Convert this SQL query to SQLAlchemy"
]

def run_weekly_test():
    results = []

    for prompt in test_prompts:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            temperature=0,  # Fixed for consistency
            messages=[{"role": "user", "content": prompt}]
        )

        results.append({
            "timestamp": datetime.now().isoformat(),
            "prompt": prompt,
            "response": response.content[0].text,
            "model": response.model
        })

        time.sleep(1)  # Rate limiting

    # Save results
    with open(f"weekly_test_{datetime.now().strftime('%Y%m%d')}.json", "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    run_weekly_test()

Step 3: Compare Results Objectively

After four weeks, I compared the outputs:

import json
from pathlib import Path

def compare_weekly_outputs():
    results = {}

    for file in Path(".").glob("weekly_test_*.json"):
        date = file.stem.split("_")[-1]
        with open(file) as f:
            results[date] = json.load(f)

    # Response length is only a rough proxy for quality; print it in date order
    for date, data in sorted(results.items()):
        avg_length = sum(len(d["response"]) for d in data) / len(data)
        print(f"{date}: Avg response length = {avg_length:.0f} chars")

    # Manual comparison
    print("\nManual comparison needed:")
    print("1. Read responses side-by-side")
    print("2. Rate each 1-5 without knowing the date")
    print("3. Compare ratings across weeks")

compare_weekly_outputs()

Now test the hypothesis:

$ python analyze_results.py

20250114: Avg response length = 847 chars
20250121: Avg response length = 823 chars
20250128: Avg response length = 861 chars
20250204: Avg response length = 839 chars

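The average lengths stay within about 5% of each other (823 to 861 characters). Length is only a crude proxy for quality, so I also did blind ratings, and I couldn’t tell which week was which.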
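The blind rating step can be made harder to cheat on with a small helper. This is a minimal sketch, assuming the `weekly_test_YYYYMMDD.json` files from Step 2; the function names (`build_blind_sheet`, `average_by_date`, `load_weekly_results`) and the `item-000` id format are my own illustration, not part of the scripts above.

```python
import json
import random
from pathlib import Path


def build_blind_sheet(results_by_date, seed=None):
    """Flatten weekly results into a shuffled list of items with opaque ids.

    Returns (sheet, key): `sheet` is what the rater sees (no dates),
    `key` maps each id back to its date and stays hidden until rating is done.
    """
    rng = random.Random(seed)
    items = [
        (date, entry["prompt"], entry["response"])
        for date, entries in results_by_date.items()
        for entry in entries
    ]
    rng.shuffle(items)

    sheet, key = [], {}
    for i, (date, prompt, response) in enumerate(items):
        item_id = f"item-{i:03d}"
        sheet.append({"id": item_id, "prompt": prompt, "response": response})
        key[item_id] = date
    return sheet, key


def average_by_date(ratings, key):
    """Unblind: group 1-5 ratings (id -> score) by the date stored in `key`."""
    grouped = {}
    for item_id, score in ratings.items():
        grouped.setdefault(key[item_id], []).append(score)
    return {date: sum(s) / len(s) for date, s in sorted(grouped.items())}


def load_weekly_results(folder="."):
    """Load the weekly_test_YYYYMMDD.json files written in Step 2."""
    results = {}
    for file in sorted(Path(folder).glob("weekly_test_*.json")):
        with open(file) as f:
            results[file.stem.split("_")[-1]] = json.load(f)
    return results
```

A typical flow would be `sheet, key = build_blind_sheet(load_weekly_results())`, rating every item in `sheet` by id, and only then calling `average_by_date(ratings, key)` to see the per-week averages.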

Step 4: Review My Own Bias

I looked at my early journal entries from the “honeymoon phase”:

Day 3: AI wrote amazing Python code! Saved the example.
Day 7: Another great response. This tool is incredible.
Day 12: Had to retry 3 times to get good output. Prompting is hard.
Day 15: Perfect SQL query! This model is so smart.

I only saved the wins. I didn’t log:

Times I had to retry 5+ times
Generic or unhelpful responses
Simple tasks it failed at

I was comparing today’s average to my best memories from the past.

The reason

I think the key reason AI models feel dumber is psychological, not technical.

1. Expectation Inflation

My first experiences set an unrealistic baseline. What felt “amazing” now feels “expected.” The AI didn’t change - my threshold for amazement did.

2. The Peak-End Rule

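The peak-end rule says we judge an experience mostly by its most intense moment and by how it ended, not by its average. I remember the most exceptional early outputs (the peak) and forget the mediocre ones, so I end up comparing today’s average to yesterday’s best.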
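A small simulation makes this gap concrete. It is a sketch under made-up assumptions: quality scores are drawn from one fixed normal distribution (mean 3.5, standard deviation 0.8, both invented for illustration), so the "model" never changes, and memory keeps only the top few early results.

```python
import random
from statistics import mean


def simulate_memory_gap(n_early=40, n_recent=40, n_remembered=5, seed=42):
    """Draw 'early' and 'recent' quality scores from the SAME distribution,
    then compare the average of the best-remembered early scores with the
    plain average of the recent scores."""
    rng = random.Random(seed)

    def draw():
        # Assumed quality distribution; identical for both periods
        return rng.gauss(3.5, 0.8)

    early = [draw() for _ in range(n_early)]
    recent = [draw() for _ in range(n_recent)]

    # Selective memory: only the best early outputs are kept
    remembered = sorted(early, reverse=True)[:n_remembered]
    return mean(early), mean(remembered), mean(recent)


early_avg, remembered_avg, recent_avg = simulate_memory_gap()
print(f"True early average:       {early_avg:.2f}")
print(f"Remembered early average: {remembered_avg:.2f}")
print(f"Recent average:           {recent_avg:.2f}")
```

The true early and recent averages come out close to each other, while the remembered average sits well above both. The apparent decline comes entirely from which samples were kept.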

3. Community Amplification

Reddit threads about “nerfs” create confirmation bias. When I read others’ complaints, I become hyper-aware of failures. Even satirical posts (like the Sonnet 5 thread) spread the perception that “AI is getting worse.”

4. Confirmation Bias Loop

Notice a failure
    → Search Reddit for "AI nerfed"
    → Find threads confirming my suspicion
    → Pay more attention to future failures
    → Conclude: "See? It's definitely worse now"

This loop reinforces my belief every time the AI makes a mistake.

5. Normal Variance ≠ Degradation

AI models have variance. Sometimes they give great answers, sometimes average ones. I attribute bad days to “they nerfed it” and good days to “I finally got the prompt right.”

The reality: LLMs are probabilistic. Temperature=0 reduces variance, but doesn’t eliminate it. Context length, server load, and random seeds all affect output.
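One way to put a number on this variance is to run the same prompt several times on the same day and measure how similar the responses are to each other. The sketch below uses Python’s standard `difflib`; the helper name `pairwise_similarity` and the sample strings are hypothetical, and character-level similarity is only a rough stand-in for "same quality."

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def pairwise_similarity(responses):
    """Return the mean similarity ratio (0.0 to 1.0) over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)


# Hypothetical same-day responses to one prompt, for illustration only
same_day = [
    "A list is mutable, while a tuple is immutable.",
    "Lists are mutable; tuples are immutable.",
    "A tuple cannot be changed after creation, but a list can.",
]
print(f"Same-day similarity: {pairwise_similarity(same_day):.2f}")
```

The same-day score gives a baseline. If responses from different weeks are about as similar to each other as responses from the same day, the week-to-week differences are within normal variance.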

Summary

In this post, I explained why AI models can feel dumber over time even when, as in my own tests, their output doesn’t measurably change. The key point is that your perception changes due to expectation inflation, the peak-end rule, and confirmation bias.

To maintain accurate perception:

Track actual performance over time
A/B test with fixed prompts and temperature=0
Recognize that variance is normal, not a conspiracy
Be aware of community amplification of “nerf” narratives

The AI didn’t get worse. You just got used to it.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Peak-End Rule
👨‍💻 Confirmation Bias
👨‍💻 Regression to the Mean
👨‍💻 Expectation Inflation in User Experience

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!