How to Compress JSON Data in PostgreSQL for Storage Efficiency
The Problem
I built a generic API cache in PostgreSQL using JSONB columns. It worked great for months. Then I checked the table size:
Table: api_cacheRows: 500,000Expected size: ~500MB (based on JSON file sizes)Actual size: 1.8GBThe JSON documents I was caching averaged 1KB each. But PostgreSQL was storing nearly 4KB per row. Where did the extra 3KB come from?
The Root Cause: JSONB Attribute Name Overhead
JSONB stores the attribute name with every row. If your JSON has 20 keys and you have 1 million rows, those 20 key names are stored 1 million times:
Document structure (1KB):{ "user_id": 12345, "session_id": "abc-def-ghi", "preferences": {...}, "metadata": {...}, ...16 more keys...}
What PostgreSQL actually stores per row (4KB):- Key name "user_id" + value- Key name "session_id" + value- Key name "preferences" + value- Key name "metadata" + value- ...16 more key names + values...
The key names repeat for every single row.This isn’t a bug - it’s how JSONB enables querying. You can filter on any key without a schema. But for pure object storage like a cache, this flexibility becomes expensive.
Should You Compress?
Before jumping to compression, I asked myself: is this actually worth optimizing?
- Pure object storage (cache, blob store)- High volume, simple key lookups- Storage costs matter ($100+/month)- Large documents (>1KB each)- Cold/archival data rarely queried- Need JSON querying (WHERE data->>'status' = 'active')- Small scale (<100GB total)- Documents <500 bytes- Hot paths where latency matters- Development simplicity is priorityA Reddit comment nailed it: “Don’t worry about data size unless you have enough for it to matter in terms of cost or query time. Keep your code simple and focus on bigger problems.”
My cache was 1.8GB growing 100MB/week. At cloud storage rates, this would cost real money within a year. So compression was worth investigating.
Attempt 1: zstd in bytea Column
I compressed the JSON at the application layer and stored it in a bytea column:
Before: JSON document -> JSONB column -> 4KB stored
After: JSON document -> zstd compress -> bytea column -> 1KB storedThe Python implementation:
import zstandard as zstdimport json
class JSONCompressor: def __init__(self): self.compressor = zstd.ZstdCompressor(level=3) self.decompressor = zstd.ZstdDecompressor()
def compress(self, data: dict) -> bytes: json_bytes = json.dumps(data).encode('utf-8') return self.compressor.compress(json_bytes)
def decompress(self, compressed: bytes) -> dict: decompressed = self.decompressor.decompress(compressed) return json.loads(decompressed.decode('utf-8'))The database schema:
CREATE TABLE compressed_cache ( id SERIAL PRIMARY KEY, cache_key TEXT UNIQUE NOT NULL, compressed_data BYTEA NOT NULL, size_original INTEGER, size_compressed INTEGER, created_at TIMESTAMPTZ DEFAULT NOW(), expires_at TIMESTAMPT);
CREATE INDEX idx_cache_key ON compressed_cache(cache_key);The results after migrating 500,000 rows:
Original JSONB: 1.8GBCompressed bytea: 0.6GBCompression ratio: 67%Space saved: 1.2GBBut I lost something important: the ability to query the JSON.
Before (JSONB): SELECT * FROM api_cache WHERE data->>'status' = 'active' AND data->>'user_type' = 'premium';
After (bytea): -- Cannot query compressed data directly -- Must decompress in application layer -- No GIN index on JSON fieldsFor my cache use case, this was fine. I always retrieved full documents by key. But if your workload needs JSON querying, this approach fails.
Attempt 2: lz4 for Speed-Critical Paths
I tested lz4 compression for a hot cache path where decompression latency mattered:
import lz4.frameimport time
# Test data: 2KB JSON documenttest_json = {"key": "value", ...} # 2KB
# zstd compressionzstd_compressed = zstd.ZstdCompressor().compress(json_bytes)zstd_time = time.perf_counter()zstd_decompressed = zstd.ZstdDecompressor().decompress(zstd_compressed)zstd_decompress_time = time.perf_counter() - zstd_time
# lz4 compressionlz4_compressed = lz4.frame.compress(json_bytes)lz4_time = time.perf_counter()lz4_decompressed = lz4.frame.decompress(lz4_compressed)lz4_decompress_time = time.perf_counter() - lz4_timeAlgorithm | Compress Ratio | Decompress Time-------------|----------------|------------------zstd level 3 | 65% | 0.8mslz4 | 50% | 0.1msJSONB | 0% (baseline) | 0.05ms (no decompress)lz4 decompresses 8x faster than zstd, but the compression ratio is worse. For my hot cache path serving 1000 requests/second, lz4 added 0.1ms per request - acceptable. zstd would have added 0.8ms - noticeable at high load.
Attempt 3: Hybrid Approach
What if I need both compression and some queryability? I tried extracting frequently queried fields to regular columns while compressing the rest:
CREATE TABLE hybrid_storage ( id BIGSERIAL PRIMARY KEY,
-- Frequently queried fields: regular columns user_id INTEGER NOT NULL, status TEXT NOT NULL, created_at TIMESTAMPTZ DEFAULT NOW(),
-- Rarely queried details: compressed details_compressed BYTEA,
-- Occasional queries: JSONB (small subset) metadata JSONB);
-- Index on extracted columns (fast)CREATE INDEX idx_user_status ON hybrid_storage(user_id, status);
-- No index on compressed details (cannot query)This worked for my mixed workload:
Common queries (80% of requests): SELECT * FROM hybrid_storage WHERE user_id = 123 AND status = 'active' -- Uses index, no decompression needed
Full document retrieval (20% of requests): SELECT details_compressed FROM hybrid_storage WHERE id = 456 -- Decompress in application layer
Rare queries (5% of requests): SELECT metadata->>'source' FROM hybrid_storage WHERE user_id = 123 -- Uses JSONB GIN indexThe trade-off matrix:
Approach | Storage | Query Flexibility | Read Latency | Complexity---------------|---------|-------------------|--------------|------------JSONB only | High | Full | Fast | Lowbytea + zstd | Low | None | Medium | Mediumbytea + lz4 | Medium | None | Fast | MediumHybrid | Medium | Partial | Fast | HighCommon Mistakes I Made
Mistake 1: Premature optimization
I initially tried to compress everything. But tables under 100GB didn’t matter for my storage costs. The added complexity wasn’t worth it.
WRONG: Compress all JSONB columns by default
RIGHT: Identify tables >100GB or growing fast Check if storage costs are significant Verify query patterns support compression Only then implement compressionMistake 2: Compressing small documents
Documents under 500 bytes don’t compress well. The compression overhead sometimes exceeds the savings:
Document size: 200 byteszstd compressed: 180 bytes (10% savings)But: added metadata + decompression overheadResult: Net loss for small documentsMistake 3: Not benchmarking decompression
I deployed zstd compression without measuring the latency impact. At 500 requests/second, the 0.8ms decompression added 400ms of total latency. I should have tested lz4 first for this hot path.
Mistake 4: No versioning
After a year, I wanted to switch from zstd to lz4 for better speed. But all existing data was zstd-compressed. I had no way to identify which algorithm compressed each row.
CREATE TABLE compressed_cache ( id SERIAL PRIMARY KEY, cache_key TEXT UNIQUE NOT NULL, compressed_data BYTEA NOT NULL, compression_algo TEXT DEFAULT 'zstd', -- Track algorithm created_at TIMESTAMPTZ DEFAULT NOW());Monitoring Compression Effectiveness
I added a view to track compression savings:
SELECT COUNT(*) as total_entries, pg_size_pretty(SUM(size_original)) as original_size, pg_size_pretty(SUM(size_compressed)) as compressed_size, ROUND( 100.0 * (1 - SUM(size_compressed)::float / NULLIF(SUM(size_original), 0)), 2 ) as compression_ratio_percent, pg_size_pretty(pg_total_relation_size('compressed_cache')) as total_table_sizeFROM compressed_cacheWHERE expires_at IS NULL OR expires_at > NOW();total_entries: 500,000original_size: 1800 MBcompressed_size: 600 MBcompression_ratio_percent: 66.67total_table_size: 650 MB (includes index overhead)Decision Framework
Here’s how I decide which approach to use:
┌─────────────────────────────────────────────────────────────────┐│ ││ Need JSON query capability? ││ │ ││ ├─ YES ─▶ Use JSONB (no compression) ││ │ ││ └─ NO ─▶ Is latency critical on hot paths? ││ │ ││ ├─ YES ─▶ bytea + lz4 ││ │ ││ └─ NO ─▶ bytea + zstd ││ ││ Mixed query patterns? ─▶ Hybrid (extract + compress) ││ │└─────────────────────────────────────────────────────────────────┘Summary
PostgreSQL JSONB storage overhead surprised me with 4x space consumption for my cache use case. The solution depends on your query patterns:
-
zstd in bytea - Best for pure object storage (cache, blob store). Saves 60-80% but loses JSON querying.
-
lz4 in bytea - Best for speed-critical paths. Faster decompression, lower compression ratio.
-
Hybrid - Best for mixed workloads. Extract queried fields, compress the rest.
The key insight: compression is a tool, not a default. Measure first. If storage isn’t impacting cost or performance, keep it simple. But if you’re storing terabytes of JSON that’s rarely queried, compression becomes essential.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit: Store JSON in RDBMS Discussion
- 👨💻 PostgreSQL JSONB Documentation
- 👨💻 Zstandard Compression Algorithm
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments