What Is Testing in Production? The Strategy Behind "Ship First, Fix Later"

Mar 27, 2026

I deployed a fix to production last week. All tests passed in staging. The staging environment mirrored production—or so I thought. Within minutes, users started reporting 500 errors on a specific checkout flow.

The staging database had 100 test records. Production had 10 million. My query, perfectly fast in staging, caused a full table scan in production. A query plan difference I never saw coming.

This is why companies like Netflix, Amazon, and even Anthropic test in production. Not because they’re reckless—because staging environments lie.

The Problem with Staging

Staging environments promise safety. They deliver false confidence.

Staging Environment:
- 100 test users
- Predictable traffic patterns
- Known edge cases
- Single region
- Artificial load

Production Environment:
- Millions of real users
- Unpredictable behavior
- Unknown unknowns
- Multi-region latency
- Real load with real consequences

I’ve seen staging pass every test while production burned. Connection pool exhaustion only appears under real concurrency. Race conditions only trigger with actual user timing. Memory leaks only surface after days of uptime.

What Testing in Production Actually Means

Testing in production isn’t skipping QA. It’s acknowledging that staging can’t catch everything.

The strategy involves:

Gradual rollouts - Deploy to 1% of users first
Feature flags - Kill switches for instant disable
Blue-green deployment - Two environments, instant rollback
Chaos engineering - Intentionally break things to find weaknesses

Netflix runs Chaos Monkey in production. It randomly kills production instances to test resilience. They learned that staging tests predict staging behavior—nothing more.
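To make the idea concrete, here is a minimal sketch of the "pick a random instance to kill" step. It is not how Chaos Monkey itself is implemented. The Instance shape, the pickVictim name, and the rule of skipping groups without redundancy are all assumptions of this sketch.

```typescript
// Hypothetical instance record; real inventories carry far more metadata.
interface Instance {
  id: string;
  group: string; // e.g. a service name or scaling group
}

// Pick one instance to terminate, skipping groups that have no redundancy.
// `random` is injectable so the choice can be made deterministic in tests.
function pickVictim(
  instances: Instance[],
  random: () => number = Math.random
): Instance | undefined {
  // Count how many instances each group has.
  const groupSizes = new Map<string, number>();
  for (const inst of instances) {
    groupSizes.set(inst.group, (groupSizes.get(inst.group) ?? 0) + 1);
  }

  // Assumption of this sketch: only groups with at least two instances are eligible.
  const candidates = instances.filter(
    (inst) => (groupSizes.get(inst.group) ?? 0) >= 2
  );
  if (candidates.length === 0) return undefined;

  return candidates[Math.floor(random() * candidates.length)];
}
```

The actual termination call is left out on purpose. The point is that the selection is random within guardrails, so any instance in a redundant group has to survive losing a peer.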

Implementing Safe Production Testing

Strategy 1: Feature Flags with Gradual Rollout

I started using feature flags after a bad deployment took down our checkout for 30 minutes. Now I can disable any feature instantly.

interface FeatureConfig {
  enabled: boolean;
  rolloutPercentage: number;
  whitelistUsers: string[];
}

async function isFeatureEnabled(
  feature: string,
  userId: string
): Promise<boolean> {
  const config = await getFeatureConfig(feature);

  if (!config.enabled) return false;

  // Whitelisted users always get the feature
  if (config.whitelistUsers.includes(userId)) return true;

  // Hash-based consistent rollout
  // (hashUserId must return a non-negative integer, or the modulo can go negative)
  const hash = hashUserId(userId);
  return (hash % 100) < config.rolloutPercentage;
}

The rollout progression looks like:

Week 1: rolloutPercentage: 5   // Internal testers + lucky users
Week 2: rolloutPercentage: 25  // Quarter of users
Week 3: rolloutPercentage: 50  // Half of users
Week 4: rolloutPercentage: 100 // Full rollout

This hash-based approach ensures the same user always gets the same experience—no flip-flopping between feature states.
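The snippet above calls hashUserId without showing it. A minimal sketch of what it could look like, assuming a 32-bit FNV-1a hash is good enough for bucketing. The function body and the rolloutBucket and inRollout helpers are my assumptions for illustration, not part of the original implementation.

```typescript
// Hypothetical implementation of the hashUserId helper.
// Assumption: 32-bit FNV-1a, a simple non-cryptographic hash, applied to
// UTF-16 code units rather than bytes (fine for bucketing purposes).
function hashUserId(userId: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    // Math.imul keeps the multiplication within 32-bit integer arithmetic.
    hash = Math.imul(hash, 0x01000193); // FNV prime
  }
  // >>> 0 converts to an unsigned value, so `hash % 100` is never negative.
  return hash >>> 0;
}

// Bucket in [0, 100). The same user always lands in the same bucket.
function rolloutBucket(userId: string): number {
  return hashUserId(userId) % 100;
}

// A user is in the rollout when their bucket is below the percentage.
function inRollout(userId: string, rolloutPercentage: number): boolean {
  return rolloutBucket(userId) < rolloutPercentage;
}
```

Because the bucket is fixed per user, raising the percentage only adds users: anyone enabled at 5% is still enabled at 25%, 50%, and 100%.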

Strategy 2: Canary Releases

Canary releases limit blast radius. I deploy to 1% of traffic, watch metrics, then expand.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
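spec:
  selector:
    matchLabels:
      app: api-server   # required by apps/v1; must match the template labels
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 10%
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: registry.example.com/api-server:1.2.4   # placeholder image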
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          periodSeconds: 30

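The health checks gate the rollout; they do not roll anything back. A pod that fails its readiness probe is removed from the Service endpoints, so it receives no traffic and the rolling update stalls instead of replacing more healthy pods. A failed liveness probe restarts the container. Kubernetes does not revert a Deployment on its own, so the actual rollback is still kubectl rollout undo deployment/api-server, run by me or by automation. A rolling update also isn't a strict 1% canary: the traffic share follows the ratio of new to old pods, so an exact 1% split needs a separate canary Deployment or weighted routing at the ingress or service mesh.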
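The "watch metrics, then expand" step from this section comes down to comparing the canary against the stable baseline. Here is a minimal sketch of that comparison. The TrafficStats shape, the evaluateCanary name, and both default thresholds are illustrative assumptions, not values from my setup.

```typescript
interface TrafficStats {
  requests: number;
  errors: number;
}

type CanaryVerdict = "promote" | "hold" | "rollback";

// Fraction of requests that failed; 0 when there is no traffic yet.
function errorRate(stats: TrafficStats): number {
  return stats.requests === 0 ? 0 : stats.errors / stats.requests;
}

// Compare canary and baseline error rates.
// minRequests and maxRateIncrease are hypothetical defaults; tune them per service.
function evaluateCanary(
  baseline: TrafficStats,
  canary: TrafficStats,
  minRequests = 500,
  maxRateIncrease = 0.005
): CanaryVerdict {
  // Too little canary traffic to judge: keep waiting.
  if (canary.requests < minRequests) return "hold";

  // Roll back when the canary is worse than the baseline by more than the budget.
  const delta = errorRate(canary) - errorRate(baseline);
  return delta > maxRateIncrease ? "rollback" : "promote";
}
```

Comparing against the baseline rather than a fixed number matters: if the whole site is having a bad hour, an absolute threshold would blame the canary for errors it didn't cause.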

Strategy 3: Automated Monitoring and Rollback

Monitoring without automated response is theater. I added alerts that trigger rollback.

groups:
- name: deployment.rules
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > 0.01
    for: 2m
    annotations:
      summary: "Error rate above 1%, triggering rollback"

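The for: 2m clause means the error rate has to stay above 1% for two minutes before the alert fires, which filters out brief spikes and leaves a short window to react by hand. The alert itself doesn't roll anything back: it goes to Alertmanager, which can call a webhook that runs the rollback. For critical paths, I reduce the window to 30 seconds.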
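One way to connect an alert like HighErrorRate to an actual rollback is an Alertmanager webhook receiver. Here is a minimal sketch of the decision logic only, with no HTTP server. The interfaces cover just the subset of the webhook payload used here. ROLLBACK_ALERTS, shouldRollback, handleAlertWebhook, and the rollback callback are hypothetical names for this example.

```typescript
// Subset of the Alertmanager webhook payload that this sketch reads.
interface WebhookAlert {
  status: "firing" | "resolved";
  labels: Record<string, string>;
}

interface WebhookPayload {
  status: "firing" | "resolved";
  alerts: WebhookAlert[];
}

// Assumption: only HighErrorRate should trigger a rollback.
const ROLLBACK_ALERTS = new Set(["HighErrorRate"]);

// True when at least one rollback-worthy alert is currently firing.
function shouldRollback(payload: WebhookPayload): boolean {
  return payload.alerts.some(
    (alert) =>
      alert.status === "firing" &&
      ROLLBACK_ALERTS.has(alert.labels["alertname"] ?? "")
  );
}

// `rollback` is a hypothetical callback, for example a wrapper around
// `kubectl rollout undo deployment/api-server`.
async function handleAlertWebhook(
  payload: WebhookPayload,
  rollback: () => Promise<void>
): Promise<boolean> {
  if (!shouldRollback(payload)) return false;
  await rollback();
  return true;
}
```

Keeping the decision in a pure function makes it easy to test without a cluster. Resolved notifications are ignored so that a recovering service doesn't trigger a second rollback.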

The Blue-Green Deployment Pattern

Blue-green maintains two identical production environments. Traffic switches between them instantly.

┌─────────────┐
│   Users     │
└──────┬──────┘
       │
       ▼
┌─────────────┐     ┌─────────────┐
│   Load      │────▶│   Blue      │ (Current stable)
│   Balancer  │     │   v1.2.3    │
└─────────────┘     └─────────────┘
       │
       │ Deploy v1.2.4 to Green
       │ Run smoke tests on Green
       │ Switch traffic 1% → 10% → 50% → 100%
       ▼
┌─────────────┐
│   Green     │ (New version)
│   v1.2.4    │
└─────────────┘

The rollback is instant: switch the load balancer back to Blue. No redeployment, no waiting.
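The traffic switch in the diagram can be sketched as a small weighted router. This is an illustration under assumptions: BlueGreenRouter and its methods are hypothetical names, and a real setup would change load balancer weights rather than route in application code.

```typescript
type Color = "blue" | "green";

// Steps from the diagram: percentage of traffic sent to Green.
const SHIFT_STEPS = [1, 10, 50, 100];

class BlueGreenRouter {
  private greenPercent = 0; // everything starts on Blue

  // Route a request given a bucket in [0, 100), e.g. a user hash modulo 100.
  route(bucket: number): Color {
    return bucket < this.greenPercent ? "green" : "blue";
  }

  // Move to the next step and return the new Green percentage.
  // At 100% there is no next step, so the value stays at 100.
  advance(): number {
    const next = SHIFT_STEPS.find((step) => step > this.greenPercent);
    if (next !== undefined) this.greenPercent = next;
    return this.greenPercent;
  }

  // Rollback: send everything back to Blue, which is still running v1.2.3.
  rollback(): void {
    this.greenPercent = 0;
  }
}
```

Rollback is a single assignment because Blue never stopped running. That is what makes it fast compared with redeploying the old version.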

Real-World Lessons

I learned three things the hard way:

Lesson 1: Users are the best QA

Reddit users noticed Anthropic’s bugs faster than any test suite. Real users find edge cases I never imagined. The infinite variety of real usage patterns exposes problems staging never will.

Lesson 2: Small rollouts save jobs

I once deployed to 100% of users. A memory leak took 45 minutes to surface. By then, thousands of users experienced degraded service. Now I roll out to 5% and watch for a full day before expanding.
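A slow leak like that one can be flagged from the trend long before it causes visible damage. Here is a minimal sketch that fits a straight line to heap samples and checks the slope. The MemorySample shape, the function names, and the 1 MB per minute default threshold are assumptions for illustration.

```typescript
interface MemorySample {
  minutes: number; // minutes since deploy
  heapMb: number; // heap usage in MB
}

// Least-squares slope of heap usage over time, in MB per minute.
function heapGrowthRate(samples: MemorySample[]): number {
  const n = samples.length;
  if (n < 2) return 0; // not enough data for a trend

  const meanX = samples.reduce((sum, s) => sum + s.minutes, 0) / n;
  const meanY = samples.reduce((sum, s) => sum + s.heapMb, 0) / n;

  let covariance = 0;
  let variance = 0;
  for (const s of samples) {
    covariance += (s.minutes - meanX) * (s.heapMb - meanY);
    variance += (s.minutes - meanX) ** 2;
  }
  // All samples taken at the same time: no trend can be computed.
  return variance === 0 ? 0 : covariance / variance;
}

// Hypothetical threshold; a sensible value depends on the service.
function looksLikeLeak(samples: MemorySample[], maxMbPerMinute = 1): boolean {
  return heapGrowthRate(samples) > maxMbPerMinute;
}
```

Run against only the 5% cohort, a check like this turns "watch for a full day" into something a machine can do while I sleep.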

Lesson 3: Rollback is not failure

I used to resist rollback, thinking it meant my code was bad. Now I see rollback as a tool, not an admission of defeat. Every deployment should have a rollback plan, and I should feel comfortable executing it.

When to Use Each Strategy

Scenario                          → Strategy
─────────────────────────────────────────────────────
New feature, unknown impact       → Feature flags + 5% rollout
Infrastructure change             → Blue-green deployment
Performance optimization          → Canary with metrics comparison
Resilience testing                → Chaos engineering
Critical bug fix                  → Canary with fast rollback

The Honest Truth About Production Testing

Testing in production isn’t about being careless. It’s about being realistic.

No staging environment perfectly replicates production. The differences in scale, data distribution, user behavior, and timing guarantee that some bugs only surface under real conditions.

The companies doing this well aren’t skipping quality—they’re adding quality where it matters most. They combine staging tests with production validation, catching what staging misses.

Start small. Deploy to 1% with feature flags. Watch your metrics. Expand gradually. Have a rollback plan. Accept that some bugs will reach users, but limit how many users and for how long.
