Skip to content

How to Compress JSON Data in PostgreSQL for Storage Efficiency

Data compression concept

The Problem

I built a generic API cache in PostgreSQL using JSONB columns. It worked great for months. Then I checked the table size:

Unexpected Storage Growth
Table: api_cache
Rows: 500,000
Expected size: ~500MB (based on JSON file sizes)
Actual size: 1.8GB

The JSON documents I was caching averaged 1KB each. But PostgreSQL was storing nearly 4KB per row. Where did the extra 3KB come from?

The Root Cause: JSONB Attribute Name Overhead

JSONB stores the attribute name with every row. If your JSON has 20 keys and you have 1 million rows, those 20 key names are stored 1 million times:

JSONB Storage Breakdown
Document structure (1KB):
{
"user_id": 12345,
"session_id": "abc-def-ghi",
"preferences": {...},
"metadata": {...},
...16 more keys...
}
What PostgreSQL actually stores per row (4KB):
- Key name "user_id" + value
- Key name "session_id" + value
- Key name "preferences" + value
- Key name "metadata" + value
- ...16 more key names + values...
The key names repeat for every single row.

This isn’t a bug - it’s how JSONB enables querying. You can filter on any key without a schema. But for pure object storage like a cache, this flexibility becomes expensive.

Should You Compress?

Before jumping to compression, I asked myself: is this actually worth optimizing?

When Compression IS Worth It
- Pure object storage (cache, blob store)
- High volume, simple key lookups
- Storage costs matter ($100+/month)
- Large documents (>1KB each)
- Cold/archival data rarely queried
When Compression IS NOT Worth It
- Need JSON querying (WHERE data->>'status' = 'active')
- Small scale (<100GB total)
- Documents <500 bytes
- Hot paths where latency matters
- Development simplicity is priority

A Reddit comment nailed it: “Don’t worry about data size unless you have enough for it to matter in terms of cost or query time. Keep your code simple and focus on bigger problems.”

My cache was 1.8GB growing 100MB/week. At cloud storage rates, this would cost real money within a year. So compression was worth investigating.

Attempt 1: zstd in bytea Column

I compressed the JSON at the application layer and stored it in a bytea column:

Architecture Change
Before:
JSON document -> JSONB column -> 4KB stored
After:
JSON document -> zstd compress -> bytea column -> 1KB stored

The Python implementation:

zstd Compression Helper
import zstandard as zstd
import json
class JSONCompressor:
def __init__(self):
self.compressor = zstd.ZstdCompressor(level=3)
self.decompressor = zstd.ZstdDecompressor()
def compress(self, data: dict) -> bytes:
json_bytes = json.dumps(data).encode('utf-8')
return self.compressor.compress(json_bytes)
def decompress(self, compressed: bytes) -> dict:
decompressed = self.decompressor.decompress(compressed)
return json.loads(decompressed.decode('utf-8'))

The database schema:

Compressed Cache Table
CREATE TABLE compressed_cache (
id SERIAL PRIMARY KEY,
cache_key TEXT UNIQUE NOT NULL,
compressed_data BYTEA NOT NULL,
size_original INTEGER,
size_compressed INTEGER,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPT
);
CREATE INDEX idx_cache_key ON compressed_cache(cache_key);

The results after migrating 500,000 rows:

Compression Results
Original JSONB: 1.8GB
Compressed bytea: 0.6GB
Compression ratio: 67%
Space saved: 1.2GB

But I lost something important: the ability to query the JSON.

What I Lost
Before (JSONB):
SELECT * FROM api_cache
WHERE data->>'status' = 'active'
AND data->>'user_type' = 'premium';
After (bytea):
-- Cannot query compressed data directly
-- Must decompress in application layer
-- No GIN index on JSON fields

For my cache use case, this was fine. I always retrieved full documents by key. But if your workload needs JSON querying, this approach fails.

Attempt 2: lz4 for Speed-Critical Paths

I tested lz4 compression for a hot cache path where decompression latency mattered:

lz4 vs zstd Comparison
import lz4.frame
import time
# Test data: 2KB JSON document
test_json = {"key": "value", ...} # 2KB
# zstd compression
zstd_compressed = zstd.ZstdCompressor().compress(json_bytes)
zstd_time = time.perf_counter()
zstd_decompressed = zstd.ZstdDecompressor().decompress(zstd_compressed)
zstd_decompress_time = time.perf_counter() - zstd_time
# lz4 compression
lz4_compressed = lz4.frame.compress(json_bytes)
lz4_time = time.perf_counter()
lz4_decompressed = lz4.frame.decompress(lz4_compressed)
lz4_decompress_time = time.perf_counter() - lz4_time
Performance Comparison (2KB document)
Algorithm | Compress Ratio | Decompress Time
-------------|----------------|------------------
zstd level 3 | 65% | 0.8ms
lz4 | 50% | 0.1ms
JSONB | 0% (baseline) | 0.05ms (no decompress)

lz4 decompresses 8x faster than zstd, but the compression ratio is worse. For my hot cache path serving 1000 requests/second, lz4 added 0.1ms per request - acceptable. zstd would have added 0.8ms - noticeable at high load.

Attempt 3: Hybrid Approach

What if I need both compression and some queryability? I tried extracting frequently queried fields to regular columns while compressing the rest:

Hybrid Storage Schema
CREATE TABLE hybrid_storage (
id BIGSERIAL PRIMARY KEY,
-- Frequently queried fields: regular columns
user_id INTEGER NOT NULL,
status TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
-- Rarely queried details: compressed
details_compressed BYTEA,
-- Occasional queries: JSONB (small subset)
metadata JSONB
);
-- Index on extracted columns (fast)
CREATE INDEX idx_user_status ON hybrid_storage(user_id, status);
-- No index on compressed details (cannot query)

This worked for my mixed workload:

Hybrid Approach Benefits
Common queries (80% of requests):
SELECT * FROM hybrid_storage WHERE user_id = 123 AND status = 'active'
-- Uses index, no decompression needed
Full document retrieval (20% of requests):
SELECT details_compressed FROM hybrid_storage WHERE id = 456
-- Decompress in application layer
Rare queries (5% of requests):
SELECT metadata->>'source' FROM hybrid_storage WHERE user_id = 123
-- Uses JSONB GIN index

The trade-off matrix:

Trade-off Summary
Approach | Storage | Query Flexibility | Read Latency | Complexity
---------------|---------|-------------------|--------------|------------
JSONB only | High | Full | Fast | Low
bytea + zstd | Low | None | Medium | Medium
bytea + lz4 | Medium | None | Fast | Medium
Hybrid | Medium | Partial | Fast | High

Common Mistakes I Made

Mistake 1: Premature optimization

I initially tried to compress everything. But tables under 100GB didn’t matter for my storage costs. The added complexity wasn’t worth it.

Wrong vs Right
WRONG:
Compress all JSONB columns by default
RIGHT:
Identify tables >100GB or growing fast
Check if storage costs are significant
Verify query patterns support compression
Only then implement compression

Mistake 2: Compressing small documents

Documents under 500 bytes don’t compress well. The compression overhead sometimes exceeds the savings:

Small Document Compression Test
Document size: 200 bytes
zstd compressed: 180 bytes (10% savings)
But: added metadata + decompression overhead
Result: Net loss for small documents

Mistake 3: Not benchmarking decompression

I deployed zstd compression without measuring the latency impact. At 500 requests/second, the 0.8ms decompression added 400ms of total latency. I should have tested lz4 first for this hot path.

Mistake 4: No versioning

After a year, I wanted to switch from zstd to lz4 for better speed. But all existing data was zstd-compressed. I had no way to identify which algorithm compressed each row.

Correct Approach: Versioning
CREATE TABLE compressed_cache (
id SERIAL PRIMARY KEY,
cache_key TEXT UNIQUE NOT NULL,
compressed_data BYTEA NOT NULL,
compression_algo TEXT DEFAULT 'zstd', -- Track algorithm
created_at TIMESTAMPTZ DEFAULT NOW()
);

Monitoring Compression Effectiveness

I added a view to track compression savings:

Compression Statistics Query
SELECT
COUNT(*) as total_entries,
pg_size_pretty(SUM(size_original)) as original_size,
pg_size_pretty(SUM(size_compressed)) as compressed_size,
ROUND(
100.0 * (1 - SUM(size_compressed)::float / NULLIF(SUM(size_original), 0)),
2
) as compression_ratio_percent,
pg_size_pretty(pg_total_relation_size('compressed_cache')) as total_table_size
FROM compressed_cache
WHERE expires_at IS NULL OR expires_at > NOW();
Sample Output
total_entries: 500,000
original_size: 1800 MB
compressed_size: 600 MB
compression_ratio_percent: 66.67
total_table_size: 650 MB (includes index overhead)

Decision Framework

Here’s how I decide which approach to use:

Compression Decision Flow
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Need JSON query capability? │
│ │ │
│ ├─ YES ─▶ Use JSONB (no compression) │
│ │ │
│ └─ NO ─▶ Is latency critical on hot paths? │
│ │ │
│ ├─ YES ─▶ bytea + lz4 │
│ │ │
│ └─ NO ─▶ bytea + zstd │
│ │
│ Mixed query patterns? ─▶ Hybrid (extract + compress) │
│ │
└─────────────────────────────────────────────────────────────────┘

Summary

PostgreSQL JSONB storage overhead surprised me with 4x space consumption for my cache use case. The solution depends on your query patterns:

  1. zstd in bytea - Best for pure object storage (cache, blob store). Saves 60-80% but loses JSON querying.

  2. lz4 in bytea - Best for speed-critical paths. Faster decompression, lower compression ratio.

  3. Hybrid - Best for mixed workloads. Extract queried fields, compress the rest.

The key insight: compression is a tool, not a default. Measure first. If storage isn’t impacting cost or performance, keep it simple. But if you’re storing terabytes of JSON that’s rarely queried, compression becomes essential.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments