Why PostgreSQL Queries Use Slow Plans: Understanding Planner Statistics
I added an index on the status column. It should have made my query faster, but PostgreSQL kept doing a sequential scan. I was confused—why would the planner ignore my index?
The answer was in the statistics. PostgreSQL’s query planner doesn’t look at your data directly; it looks at statistics about your data. When those statistics are wrong or missing, the planner makes bad decisions.
The Problem: Index Exists, But Not Used
Here’s the scenario that drove me crazy:
-- I created this indexCREATE INDEX idx_orders_status ON orders(status);
-- But my query still used a sequential scanEXPLAIN SELECT * FROM orders WHERE status = 'completed';-- Result: Seq Scan on orders (cost=0.00..150000.00 rows=5000000)Why? The planner estimated 5 million rows would match. With that many rows, a sequential scan IS faster than an index scan—it would have to jump back and forth between index and table 5 million times.
But the actual count was only 50,000 rows. An index scan would have been perfect.
How Planner Statistics Work
PostgreSQL maintains statistics about each column to estimate query costs. These statistics are the planner’s only view of your data.
┌─────────────────────────────────────────────────────────────┐│ Column Statistics │├─────────────────────────────────────────────────────────────┤│ ││ Most Common Values (MCV) ││ ├── List of frequently occurring values ││ ├── Their frequencies (how often each appears) ││ └── Used for: equality conditions (WHERE col = 'value') ││ ││ Histogram ││ ├── Distribution of values across buckets ││ ├── Used for: range conditions (WHERE col > 100) ││ └── Helps estimate selectivity of comparisons ││ ││ Correlation ││ ├── How well physical order matches value order ││ ├── Values from -1.0 to 1.0 ││ └── Used for: deciding between index scan vs seq scan ││ ││ n_distinct ││ ├── Estimated number of distinct values ││ └── Used for: join planning, GROUP BY estimation ││ │└─────────────────────────────────────────────────────────────┘These statistics are gathered by the ANALYZE command, which samples a portion of your table—not all rows.
Checking Your Statistics
The pg_stats view shows what the planner knows about your columns:
SELECT attname as column_name, n_distinct, most_common_vals as mcv, most_common_freqs as mcv_freqs, histogram_bounds, correlationFROM pg_statsWHERE tablename = 'orders'AND attname = 'status';Key columns to understand:
column_name | statusn_distinct | 5 -- About 5 different status valuesmcv | {pending,completed} -- Most common valuesmcv_freqs | {0.35,0.30} -- 35% pending, 30% completedhistogram | {cancelled,processing}-- Distribution boundariescorrelation | 0.12 -- Low correlation with physical orderWhen I ran this on my orders table, I saw the problem: mcv_freqs showed completed at 0.50 (50%), but my actual data had only 5% completed orders. The statistics were stale.
The Statistics Problem
Statistics can be wrong for several reasons:
1. Outdated Statistics
After bulk operations, statistics don’t automatically update:
-- This doesn't update statistics automaticallyCOPY orders FROM '/bulk_data.csv';
-- The planner still thinks the table has old row counts-- You must run ANALYZEANALYZE orders;2. Sampling Limitations
ANALYZE samples rows, it doesn’t count all of them:
Table with 1,000,000 rows │ ▼default_statistics_target = 100 (default) │ ▼ANALYZE samples ~30,000 rows (300 * target) │ ▼Estimates based on this sampleFor skewed distributions, this can miss important patterns.
3. Single-Column Statistics Miss Correlations
The planner’s default statistics are per-column. They don’t capture relationships between columns:
-- Problem: Two columns that are correlatedSELECT * FROM ordersWHERE status = 'completed' -- 30% of rowsAND region = 'west'; -- 40% of rows
-- Planner estimates: 30% * 40% = 12% of rows (assumes independence)-- Reality: Most completed orders ARE in west region = 28% of rowsThe planner assumed independence and underestimated. It chose a nested loop join when a hash join would be better.
The Solution: Extended Statistics
PostgreSQL 10+ lets you create statistics that capture column correlations:
-- Create extended statistics for correlated columnsCREATE STATISTICS orders_stats (dependencies, mcv)ON status, regionFROM orders;
-- Now ANALYZE will gather multi-column statisticsANALYZE orders;
-- Check what was collectedSELECT stxname, stxkind, stxndistinct, stxdependenciesFROM pg_statistics_extWHERE stxrelid = 'orders'::regclass;The stxdependencies column shows the correlation strength between columns:
status => region: 0.85 -- Strong dependency! 85% of the time, -- knowing status tells us about regionNow the planner knows that status = 'completed' usually means region = 'west'.
A Real Debugging Session
I had a query that suddenly became slow after a data migration:
SELECT * FROM ordersWHERE status = 'pending'AND priority = 'high'ORDER BY created_atLIMIT 50;Step 1: Check the plan
EXPLAIN ANALYZE SELECT * FROM ordersWHERE status = 'pending' AND priority = 'high'ORDER BY created_at LIMIT 50;Step 2: Look at row estimates vs actual
Index Scan using idx_orders_created on orders (cost=0.43..8500.00 rows=50) (actual time=0.05..4500.00 rows=50000) Filter: ((status = 'pending') AND (priority = 'high')) Rows Removed by Filter: 500000The planner thought 50 rows would match. It got 50,000. Wrong plan chosen.
Step 3: Check statistics
SELECT attname, n_distinct, most_common_vals, most_common_freqsFROM pg_stats WHERE tablename = 'orders';
-- Result: Statistics showed old data distribution-- My migration changed the data, but not the statsStep 4: Fix with ANALYZE and extended statistics
-- Update basic statisticsANALYZE orders;
-- Create extended statistics for correlated columnsCREATE STATISTICS orders_status_priority (mcv)ON status, priority FROM orders;
ANALYZE orders;Step 5: Verify the fix
Index Scan using idx_orders_status_priority_created on orders (cost=0.43..150.00 rows=50) (actual time=0.05..2.00 rows=50)From 4.5 seconds to 2 milliseconds.
When to Update Statistics
┌─────────────────────────────────────────┐│ Significant data change? │└────────────────┬────────────────────────┘ │ ┌────────▼────────┐ │ │ │ YES NO │ │ │ │ ▼ ▼ │ Run ANALYZE Check │ immediately plan │ │ │ │ │ ▼ │ │ Row estimates │ │ way off? │ │ │ │ │ ┌────▼────┐ │ │ YES NO │ │ │ │ │ │ ▼ ▼ │ │ ANALYZE OK │ └──┴────────────────┘Run ANALYZE after:
- Bulk
INSERT,UPDATE, orDELETE COPYoperations- Data migrations
- Significant changes in data distribution
Statistics Configuration
You can tune how detailed statistics get:
-- Increase statistics detail for a specific columnALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
-- Default is 100, higher = more accurate but slower ANALYZE-- Range is 0 to 10000
-- For the whole databaseSET default_statistics_target = 200;
-- Larger sample = better stats but longer ANALYZE timeCommon Mistakes to Avoid
1. Assuming ANALYZE happens automatically
Autovacuum does run ANALYZE, but only after certain thresholds. After bulk operations, run it manually.
2. Ignoring correlation between columns
If you query on multiple columns together, create extended statistics.
3. Not checking pg_stats
Before assuming an index should be used, check what the planner knows:
-- If planner thinks your filter matches 90% of rows,-- it will choose a seq scan even with an indexSELECT attname, n_distinct, most_common_vals, most_common_freqsFROM pg_stats WHERE tablename = 'your_table';4. Assuming statistics are exact
ANALYZE samples. For small tables with skewed data, estimates can be off. Consider increasing STATISTICS target.
Quick Diagnostic Checklist
When a query uses the wrong plan:
- Run
EXPLAIN ANALYZEto see estimated vs actual rows - Check
pg_statsfor the columns in your WHERE clause - Look for large discrepancies between estimates and reality
- Run
ANALYZE table_nameif statistics might be stale - Consider extended statistics if filtering on multiple correlated columns
- Verify your index is actually selective (matches small % of rows)
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments