Skip to content

What Are the Pitfalls of Storing JSON in SQL Server and PostgreSQL?

Database server warning

Problem

I designed a table with a JSONB column for storing user metadata. Six months later, the table had 2 million rows and queries that used to take 50ms were now taking 3 seconds. I had no statistics to guide the query planner, storage was bloated with repeated key names, and every query was a runtime experiment.

I thought JSON columns were the perfect solution for flexible data. I could add fields without migrations, store whatever I wanted, and avoid rigid schemas. Then reality hit at scale.

The Reddit discussion I found confirmed my experience wasn’t unique: “Small scale (<100K docs): JSONB is totally fine. Medium scale: tricky. Large scale: hit performance cliffs.”

Pitfall #1: Statistics Blindness

The query optimizer relies on statistics to choose execution plans. JSON columns lack granular statistics on individual fields within the JSON structure. When you filter on a JSON field, the optimizer is flying blind.

SQL Server Statistics Problem
-- This query cannot use statistics effectively
SELECT * FROM Orders
WHERE JSON_VALUE(OrderData, '$.Status') = 'Shipped'
AND JSON_VALUE(OrderData, '$.Region') = 'US-West';
-- The optimizer guesses selectivity, often poorly
-- It might choose a sequential scan when an index scan would work

The same problem exists in PostgreSQL. While JSONB supports GIN indexes for containment queries, the planner cannot understand the semantic structure of your JSON.

Query Planner Blindness
Normal Column Query:
Planner knows: "Status column has 100K 'Shipped' values (10% of rows)"
Planner chooses: Index scan (efficient)
JSON Field Query:
Planner knows: "There's a JSON blob"
Planner guesses: "Maybe 50% match? Maybe 1%? Who knows?"
Planner chooses: Sequential scan (might be wrong)

I assumed CREATE INDEX ON table USING GIN (jsonb_column) would solve everything. It doesn’t. GIN indexes help with containment (@>) but not with extracting and filtering on specific values.

Pitfall #2: Storage Bloat

PostgreSQL JSONB stores the full key name for every attribute in every row. If you have a field named user_preferred_notification_method appearing in 1 million rows, that string is stored 1 million times.

Storage Size Comparison
Regular columns:
user_id (4 bytes) + email (varchar 50) + created_at (8 bytes)
= ~62 bytes per row
JSONB equivalent:
{"user_id": 123, "email": "[email protected]", "created_at": "2024-01-01"}
= ~85 bytes per row (30%+ larger due to key name strings)
At 10 million rows: 620MB vs 850MB (37% larger)

SQL Server stores JSON as text (NVARCHAR typically) with no compression of repeated key names across rows.

Impact of Larger Rows
Row Size → Rows Per Page → I/O Operations → Performance
Small rows (100 bytes): 80 rows/page → Fewer I/O operations
Large rows (500 bytes): 16 rows/page → More I/O operations
Reading 1M rows:
Small: 12,500 page reads
Large: 62,500 page reads (5x more I/O)

The bloat cascades: larger rows mean fewer rows per page, more I/O for scans, memory pressure in buffer cache, longer backup and replication times.

Pitfall #3: Type Safety Roulette

Schema changes in typed columns cause immediate errors. Schema “changes” in JSON are just different data—no errors, just wrong results.

SQL Server Silent Failure
-- This compiles and runs, but returns NULL
SELECT JSON_VALUE(Data, '$.usr_id') -- typo: should be 'user_id'
-- No compile-time error. Silent NULL return.
-- Your app might crash, or worse: silently produce wrong results
PostgreSQL Silent Failure
-- This query returns NULL silently
SELECT data->>'usr_id' FROM users; -- typo: should be 'user_id'
-- No error message. Just NULL values.
-- You discover the bug only when users complain about missing data

I spent an entire day debugging why a report was showing zero results. The JSON key was "customerId" but my query used "customer_id". The database happily returned NULL for every row.

Type Safety Comparison
Typed Column Error:
Query: SELECT usr_id FROM users
Database: "ERROR: column 'usr_id' does not exist"
Result: Immediate feedback, fix in 1 minute
JSON Field Error:
Query: SELECT data->>'usr_id' FROM users
Database: NULL (silently)
Result: Bug discovered days later, data corruption possible

Pitfall #4: The Technical Debt Slippery Slope

The Reddit thread had a warning that hit home: “The practice of adding fields to the blob has let us get away with tightly coupling brittle systems.”

The JSON Column Lifecycle
Month 1: "Just a few optional fields for flexibility"
→ 3 fields in JSON, all truly optional
Month 6: "We need more fields, but migrations are slow"
→ 12 fields in JSON, 7 are "required" by business logic
Month 12: "We need to change a field name"
→ 47 fields in JSON, 15 are required
→ No schema documentation
→ Changing any field requires full data audit
→ No one knows which fields are actually used

JSON columns enable developers to bypass schema review. What starts as flexibility becomes unmanaged complexity. I’ve seen teams spend weeks trying to figure out which JSON fields could be safely removed.

Scale Thresholds: When to Stop Using JSON

JSON Usage by Scale
┌─────────────────┬──────────────────────────────────────────────┐
│ Scale │ Recommendation │
├─────────────────┼──────────────────────────────────────────────┤
│ < 100K rows │ JSON is fine │
│ │ Performance differences negligible │
│ │ Flexibility outweighs costs │
├─────────────────┼──────────────────────────────────────────────┤
│ 100K - 1M rows │ Proceed with caution │
│ │ Index carefully, monitor query plans │
│ │ Start planning extraction strategy │
├─────────────────┼──────────────────────────────────────────────┤
│ 1M - 10M rows │ Extract hot fields to columns │
│ │ Hybrid approach: columns for queries │
│ │ JSON only for truly flexible data │
├─────────────────┼──────────────────────────────────────────────┤
│ > 10M rows │ Avoid JSON for query patterns │
│ │ Performance cliffs, maintenance burden │
│ │ Consider separate document store │
└─────────────────┴──────────────────────────────────────────────┘

The Hybrid Approach: Best of Both Worlds

I learned to use a hybrid schema: typed columns for everything I query, JSONB for truly flexible data.

Hybrid Schema Example
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
price DECIMAL(10,2) NOT NULL,
-- Extract frequently queried fields to typed columns
category TEXT NOT NULL,
in_stock BOOLEAN NOT NULL,
-- Keep JSONB only for truly flexible attributes
attributes JSONB NOT NULL DEFAULT '{}'
);
-- Create appropriate indexes
CREATE INDEX IX_Products_Category ON products(category);
CREATE INDEX IX_Products_Attributes ON products USING GIN (attributes);
Hybrid Query Pattern
-- Typed columns for filters, JSONB for display
SELECT
name,
price,
attributes->>'color' as color,
attributes->>'warranty' as warranty
FROM products
WHERE category = 'Electronics' -- Uses btree index
AND in_stock = true -- Uses btree index
AND attributes @> '{"warranty": "2 years"}'; -- Uses GIN index

The key insight: if a field appears in WHERE, JOIN, ORDER BY, or GROUP BY clauses, it belongs in a typed column. JSON should only hold data you display, not data you query.

Common Mistakes I Made

My JSON Storage Mistakes
1. Using JSON for data with relationships
→ Needed to JOIN on JSON fields
→ Should have used real columns
2. Indexing everything in JSONB
→ GIN indexes are expensive to maintain
→ Should index only what I query
3. Assuming "it's just text, parsing is fast"
→ Extracting data->>'field' a million times adds CPU cost
→ Should have used typed columns for hot fields
4. No migration path planned
→ Started with JSON for flexibility
→ Had no strategy for extracting fields when scale demanded it
→ Should have documented when to extract each field
5. Treating JSON as schemaless
→ No documentation of expected structure
→ Should have used check constraints:
CHECK (jsonb_typeof(metadata->'version') = 'number')

How I Fixed It

I developed a migration checklist for extracting JSON fields:

JSON Field Extraction Checklist
Step 1: Identify hot fields
→ Query logs for WHERE/JOIN conditions
→ Find fields queried >1000 times/day
Step 2: Add typed column
→ ALTER TABLE ADD COLUMN field_name TYPE
→ Default NULL initially
Step 3: Backfill data
→ UPDATE table SET field_name = data->>'field'
→ Do in batches to avoid locking
Step 4: Verify data integrity
→ SELECT COUNT(*) WHERE field_name IS NULL
→ AND data->>'field' IS NOT NULL
Step 5: Update queries
→ Replace JSON extraction with column reference
→ Test query plans show index usage
Step 6: Add index
→ CREATE INDEX ON table(field_name)
Step 7: Remove JSON field (optional)
→ Only after verifying no dependencies
→ UPDATE table SET data = data - 'field'

If you’re considering JSON columns, also understand:

  • Document stores vs RDBMS: MongoDB, CouchDB are optimized for JSON storage. PostgreSQL JSONB is a compromise, not a replacement.
  • Columnar storage: For truly massive JSON data, consider columnar formats like Parquet for analytics workloads.
  • Schema evolution tools: If you need flexibility, tools like Prisma or Flyway handle migrations better than JSON abuse.
  • Partial indexes: PostgreSQL allows CREATE INDEX ON table(JSON_FIELD) WHERE condition to reduce index size.

Conclusion

JSON storage in relational databases is a trade-off between flexibility and performance. The pitfalls—statistics blindness, storage bloat, type unsafety, and technical debt—don’t appear at small scales but compound rapidly as data grows.

The Reddit wisdom holds: JSON in RDBMS is “totally fine” at small scale, “tricky” at medium scale, and hits “performance cliffs” at large scale. I learned to:

  1. Use JSON columns for truly optional, schema-less data
  2. Extract fields that appear in WHERE, JOIN, or ORDER BY to typed columns
  3. Set a scale threshold (500K rows) to review JSON usage
  4. Document expected JSON structures with check constraints
  5. Plan the exit strategy before implementing JSON columns

The hybrid approach works: typed columns for query patterns, JSON for flexibility. But the key is recognizing when to extract fields before hitting performance cliffs.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments