What Is Testing in Production? The Strategy Behind "Ship First, Fix Later"
I deployed a fix to production last week. All tests passed in staging. The staging environment mirrored production—or so I thought. Within minutes, users started reporting 500 errors on a specific checkout flow.
The staging database had 100 test records. Production had 10 million. My query, perfectly fast in staging, caused a full table scan in production. A query plan difference I never saw coming.
This is why companies like Netflix, Amazon, and even Anthropic test in production. Not because they’re reckless—because staging environments lie.
The Problem with Staging
Staging environments promise safety. They deliver false confidence.
Staging Environment:
- 100 test users
- Predictable traffic patterns
- Known edge cases
- Single region
- Artificial load

Production Environment:
- Millions of real users
- Unpredictable behavior
- Unknown unknowns
- Multi-region latency
- Real load with real consequences

I’ve seen staging pass every test while production burned. Connection pool exhaustion only appears under real concurrency. Race conditions only trigger with actual user timing. Memory leaks only surface after days of uptime.
What Testing in Production Actually Means
Testing in production isn’t skipping QA. It’s acknowledging that staging can’t catch everything.
The strategy involves:
- Gradual rollouts - Deploy to 1% of users first
- Feature flags - Kill switches for instant disable
- Blue-green deployment - Two environments, instant rollback
- Chaos engineering - Intentionally break things to find weaknesses
Netflix runs Chaos Monkey in production. It randomly kills production instances to test resilience. They learned that staging tests predict staging behavior—nothing more.
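To make the idea of randomly terminating instances concrete, here is a minimal sketch of the selection step only. It is not Chaos Monkey's actual code: the `Instance` shape, the `optedOut` flag, and the `pickVictim` function are hypothetical names chosen for illustration, and the random source is injectable so the behavior can be tested deterministically.

```typescript
// Hypothetical instance record; real inventories carry far more metadata.
interface Instance {
  id: string;
  optedOut: boolean; // assumed flag: instances that must never be terminated
}

// Pick one eligible instance at random, or null if none is eligible.
// `random` must return a number in [0, 1), like Math.random.
function pickVictim(
  instances: Instance[],
  random: () => number = Math.random
): Instance | null {
  const candidates = instances.filter((instance) => !instance.optedOut);
  if (candidates.length === 0) return null;
  const index = Math.floor(random() * candidates.length);
  return candidates[index] ?? null;
}
```

A scheduler would call `pickVictim` during working hours and pass the result to whatever terminates instances in your platform; that part is deliberately left out here.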
Implementing Safe Production Testing
Strategy 1: Feature Flags with Gradual Rollout
I started using feature flags after a bad deployment took down our checkout for 30 minutes. Now I can disable any feature instantly.
    interface FeatureConfig {
      enabled: boolean;
      rolloutPercentage: number;
      whitelistUsers: string[];
    }

    async function isFeatureEnabled(
      feature: string,
      userId: string
    ): Promise<boolean> {
      const config = await getFeatureConfig(feature);

      if (!config.enabled) return false;

      // Whitelisted users always get the feature
      if (config.whitelistUsers.includes(userId)) return true;

      // Hash-based consistent rollout
      const hash = hashUserId(userId);
      return (hash % 100) < config.rolloutPercentage;
    }

The rollout progression looks like:

    Week 1: rolloutPercentage: 5    // Internal testers + lucky users
    Week 2: rolloutPercentage: 25   // Quarter of users
    Week 3: rolloutPercentage: 50   // Half of users
    Week 4: rolloutPercentage: 100  // Full rollout

This hash-based approach ensures the same user always gets the same experience, with no flip-flopping between feature states.
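The `hashUserId` helper called in `isFeatureEnabled` is not shown in the article. One possible implementation, as a sketch, is an FNV-1a-style 32-bit hash over the string's UTF-16 code units. Any stable, well-distributed hash would do; `isInRollout` is a hypothetical wrapper added here only to demonstrate the bucket check.

```typescript
// Hypothetical implementation of hashUserId; the article does not show its own.
// FNV-1a-style 32-bit hash over UTF-16 code units: the same input always
// produces the same output, across processes and restarts.
function hashUserId(userId: string): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    // Multiply by the FNV prime 16777619 via shifts, then wrap to unsigned 32 bits.
    hash =
      (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash;
}

// Same bucket test as in isFeatureEnabled: buckets run 0..99 and a user is
// included when their bucket is below the rollout percentage.
function isInRollout(userId: string, rolloutPercentage: number): boolean {
  return hashUserId(userId) % 100 < rolloutPercentage;
}
```

Because a user's bucket never changes, anyone included at 5% is still included at 25%, 50%, and 100%. Raising the percentage only adds users; it never removes them.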
Strategy 2: Canary Releases
Canary releases limit blast radius. I deploy to 1% of traffic, watch metrics, then expand.
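    # Excerpt: selector, pod labels, and container image omitted for brevity
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api-server
    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 10%
      template:
        spec:
          containers:
          - name: api
            readinessProbe:
              httpGet:
                path: /health
                port: 8080
              initialDelaySeconds: 5
              periodSeconds: 10
            livenessProbe:
              httpGet:
                path: /health/live
                port: 8080
              periodSeconds: 30

The health checks gate the rollout rather than roll it back. If a new pod fails its readiness probe, Kubernetes stops routing traffic to it and the rolling update stalls instead of replacing more healthy pods. If the liveness probe fails, the container is restarted. Reverting to the previous version is still a separate step, such as kubectl rollout undo or an automated trigger. Note also that this manifest configures a rolling update; sending exactly 1% of traffic to the new version needs traffic splitting at the load balancer or service mesh on top of it.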
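The "watch metrics, then expand" step can be sketched as a comparison between the canary and the stable baseline. This is an illustration under stated assumptions, not a specific tool's API: `WindowStats`, `evaluateCanary`, and the default thresholds (at least 500 canary requests, at most 0.5 percentage points of extra errors) are all hypothetical.

```typescript
// Request and error counts collected over the same time window.
interface WindowStats {
  requests: number;
  errors: number;
}

type CanaryVerdict = "promote" | "hold" | "rollback";

function errorRate(stats: WindowStats): number {
  return stats.requests === 0 ? 0 : stats.errors / stats.requests;
}

// Compare the canary's error rate with the baseline's.
// minRequests and maxRateIncrease are assumed defaults, not recommendations.
function evaluateCanary(
  baseline: WindowStats,
  canary: WindowStats,
  minRequests: number = 500,
  maxRateIncrease: number = 0.005
): CanaryVerdict {
  // Too little canary traffic to judge: keep waiting instead of guessing.
  if (canary.requests < minRequests) return "hold";
  const increase = errorRate(canary) - errorRate(baseline);
  return increase > maxRateIncrease ? "rollback" : "promote";
}
```

Comparing against the baseline rather than a fixed number keeps the check meaningful when the whole system is having a bad day: the canary is only blamed for errors it adds.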
Strategy 3: Automated Monitoring and Rollback
Monitoring without automated response is theater. I added alerts that trigger rollback.
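    groups:
    - name: deployment.rules
      rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        annotations:
          summary: "Error rate above 1%, triggering rollback"

The expression divides 5xx responses by all responses, because a bare rate() of 5xx requests is a per-second count, not a percentage. The error ratio has to stay above 1% for two minutes before the alert fires, which gives me that window to respond before automated systems kick in. For critical paths, I reduce that to 30 seconds.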
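The "stay above the threshold for a set time, then act" behavior can also be sketched in application code. The class below is a hypothetical illustration, not part of Prometheus or any library: `SustainedBreachTrigger`, `Sample`, and the idea of feeding it periodic samples are assumptions made for the example.

```typescript
// One periodic measurement of traffic.
interface Sample {
  timestampMs: number;
  requests: number;
  errors: number;
}

// Signals a rollback only when the error ratio has exceeded the threshold
// continuously for at least holdMs. Any healthy sample resets the timer.
class SustainedBreachTrigger {
  private readonly threshold: number;
  private readonly holdMs: number;
  private breachStartedAtMs: number | null = null;

  constructor(threshold: number, holdMs: number) {
    this.threshold = threshold;
    this.holdMs = holdMs;
  }

  // Returns true when the caller should start a rollback.
  observe(sample: Sample): boolean {
    const ratio = sample.requests === 0 ? 0 : sample.errors / sample.requests;
    if (ratio <= this.threshold) {
      this.breachStartedAtMs = null;
      return false;
    }
    if (this.breachStartedAtMs === null) {
      this.breachStartedAtMs = sample.timestampMs;
    }
    return sample.timestampMs - this.breachStartedAtMs >= this.holdMs;
  }
}
```

With a threshold of 0.01 and a hold of 120000 ms this mirrors the 1% for two minutes rule; shortening the hold to 30000 ms gives the tighter setting for critical paths.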
The Blue-Green Deployment Pattern
Blue-green maintains two identical production environments. Traffic switches between them instantly.
    ┌─────────────┐
    │    Users    │
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐     ┌─────────────┐
    │    Load     │────▶│    Blue     │  (Current stable)
    │  Balancer   │     │   v1.2.3    │
    └─────────────┘     └─────────────┘
           │
           │  Deploy v1.2.4 to Green
           │  Run smoke tests on Green
           │  Switch traffic 1% → 10% → 50% → 100%
           ▼
    ┌─────────────┐
    │    Green    │  (New version)
    │   v1.2.4    │
    └─────────────┘

The rollback is instant: switch the load balancer back to Blue. No redeployment, no waiting.
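The switch-and-rollback mechanics can be sketched as a weighted router. This is a toy model of what a load balancer does, under the assumption that traffic is split by percentage per request; `BlueGreenRouter` and its methods are hypothetical names, and real load balancers expose this through their own configuration.

```typescript
type Color = "blue" | "green";

// Toy model of a load balancer that splits traffic between two environments.
class BlueGreenRouter {
  private greenPercent = 0; // all traffic starts on Blue

  // Shift a share of traffic to Green, for example 1, 10, 50, then 100.
  setGreenPercent(percent: number): void {
    if (percent < 0 || percent > 100) {
      throw new RangeError("percent must be between 0 and 100");
    }
    this.greenPercent = percent;
  }

  // Instant rollback: send everything back to Blue. Nothing is redeployed.
  rollback(): void {
    this.greenPercent = 0;
  }

  // Choose an environment for one request.
  // `random` must return a number in [0, 1), like Math.random.
  route(random: () => number = Math.random): Color {
    return random() * 100 < this.greenPercent ? "green" : "blue";
  }
}
```

Rollback is a single assignment, which is why it is fast. The cost is paid elsewhere: Blue has to stay deployed and warm for as long as a rollback might be needed.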
Real-World Lessons
I learned three things the hard way:
Lesson 1: Users are the best QA
Reddit users noticed Anthropic’s bugs faster than any test suite. Real users find edge cases I never imagined. The infinite variety of real usage patterns exposes problems staging never will.
Lesson 2: Small rollouts save jobs
I once deployed to 100% of users. A memory leak took 45 minutes to surface. By then, thousands of users experienced degraded service. Now I roll out to 5% and watch for a full day before expanding.
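The "roll out to 5% and watch for a full day" rule can be written down as a small gate. This is a sketch under assumptions: the stage list reuses the 5, 25, 50, 100 progression from earlier in the article, while `StageStatus`, `nextRolloutPercent`, and the choice to drop to 0% when unhealthy are hypothetical.

```typescript
// Rollout stages, reusing the progression shown earlier in the article.
const ROLLOUT_STAGES: number[] = [5, 25, 50, 100];

interface StageStatus {
  currentPercent: number;
  hoursAtCurrent: number; // how long the current stage has been live
  healthy: boolean; // assumed to come from your monitoring
}

// Decide the next rollout percentage.
// Unhealthy: disable the feature. Not soaked long enough: stay put.
// Otherwise: move to the next larger stage, if there is one.
function nextRolloutPercent(status: StageStatus, minSoakHours: number = 24): number {
  if (!status.healthy) return 0;
  if (status.hoursAtCurrent < minSoakHours) return status.currentPercent;
  for (const stage of ROLLOUT_STAGES) {
    if (stage > status.currentPercent) return stage;
  }
  return status.currentPercent; // already at the final stage
}
```

A 24-hour soak would have caught a leak that takes 45 minutes to surface while it still affected only 5% of users.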
Lesson 3: Rollback is not failure
I used to resist rollback, thinking it meant my code was bad. Now I see rollback as a tool, not an admission of defeat. Every deployment should have a rollback plan, and I should feel comfortable executing it.
When to Use Each Strategy
    Scenario                        → Strategy
    ─────────────────────────────────────────────────────────────────
    New feature, unknown impact     → Feature flags + 5% rollout
    Infrastructure change           → Blue-green deployment
    Performance optimization        → Canary with metrics comparison
    Resilience testing              → Chaos engineering
    Critical bug fix                → Canary with fast rollback

The Honest Truth About Production Testing
Testing in production isn’t about being careless. It’s about being realistic.
No staging environment perfectly replicates production. The differences in scale, data distribution, user behavior, and timing guarantee that some bugs only surface under real conditions.
The companies doing this well aren’t skipping quality—they’re adding quality where it matters most. They combine staging tests with production validation, catching what staging misses.
Start small. Deploy to 1% with feature flags. Watch your metrics. Expand gradually. Have a rollback plan. Accept that some bugs will reach users, but limit how many users and for how long.