Skip to content

How to Optimize Python Logging for High-Throughput Applications: Processor Chain Performance

The Problem

My scraping pipeline was processing about 1 million URLs per day. CPU usage spiked during peak hours, and profiling showed logging was consuming significant resources. I thought switching to a faster logging library would solve it.

I tried loguru, then structlog, then picologging. Each claimed to be faster. The benchmarks showed structlog was about 2x faster than alternatives. But after switching, CPU usage barely dropped.

The real bottleneck wasn’t the library. It was the processor chain itself.

The Wrong Optimization Path
My approach:
1. Switch from stdlib logging to structlog → 2x faster (supposedly)
2. CPU still high → Confused
3. Profile deeper → Processor chain eating CPU
Reality:
- Library choice: ~5% improvement
- Processor chain optimization: ~60% improvement

A Reddit user working on similar scraping pipelines confirmed my discovery: “One thing that surprised us at higher volumes: the processor chain itself can become a bottleneck.”

The key insight: WHERE you filter matters more than WHAT library you use.

Understanding the Processor Chain Bottleneck

At high throughput, every processor call adds measurable overhead. Here’s what happens when you log an event:

Processor Chain Flow
Event → Processor 1 → Processor 2 → Processor 3 → ... → JSON Renderer → Output
Each processor:
- Function call overhead
- Event dict manipulation
- Potential string formatting
- JSON serialization (expensive!)

The problem compounds when you log at INFO level for normal operations. Most events get processed, serialized, and then filtered out by level checks. That’s wasted CPU.

Inefficient Flow (What I Had)
INFO event → Add level → Add timestamp → Format exception → JSON render
→ Level filter (drop because INFO < WARNING) → WASTED
All that work, then dropped at the end.

The fix: filter BEFORE expensive operations, not after.

The Zero-Cost Optimization: Level Filtering

structlog’s make_filtering_bound_logger() is the game-changer. It makes disabled levels “essentially free” by replacing filtered log methods with return None.

bound_logger_example.py
import logging
import structlog
# Create filtered logger - DEBUG calls become FREE
FilteredLogger = structlog.make_filtering_bound_logger(logging.WARNING)
structlog.configure(
wrapper_class=FilteredLogger,
processors=[
structlog.processors.add_log_level,
structlog.processors.JSONRenderer(),
],
)
log = structlog.get_logger()
# This is FREE - no processor chain executes at all
log.debug("expensive debug info", data=compute_heavy_debug())
# This runs the full processor chain
log.warning("warning message", url="example.com")

When log.debug() is called on a WARNING-filtered logger, it immediately returns None. No event dict creation, no processor calls, zero overhead.

Zero-Cost Filtering Comparison
WITHOUT make_filtering_bound_logger():
log.debug() → Create event dict → Run 5 processors → JSON render → Check level → Drop
Cost: Full chain execution
WITH make_filtering_bound_logger():
log.debug() → return None (replaced at bind time)
Cost: Literally zero

This is the single most impactful optimization. Set your production level to WARNING or ERROR, and debug/info calls become free.

Sampling Before Serialization

For high-volume operations, even WARNING level might log too many events. Sampling lets you log 1 in N events, reducing volume by 90%+ while still capturing enough for debugging.

The critical mistake I made: sampling AFTER serialization.

Wrong Sampling Location
Event → JSON render → Sample check → Drop (wasted JSON render)

Correct: sample BEFORE serialization.

Correct Sampling Location
Event → Sample check → Drop (cheap) OR Continue → JSON render

Here’s a sampling processor that goes early in the chain:

sampling_processor.py
import structlog
class RateLimiter:
def __init__(self, sample_rate=0.05): # Log 5% of events
self.sample_rate = sample_rate
self.counter = 0
def should_log(self):
self.counter += 1
return self.counter % int(1 / self.sample_rate) == 0
rate_limiter = RateLimiter(sample_rate=0.05)
def sampling_processor(logger, method_name, event_dict):
"""Sample INFO logs at 5% rate, keep all warnings/errors."""
level = event_dict.get("level", "info")
# Always keep errors and warnings
if level in ("warning", "error", "critical"):
return event_dict
# Sample info/debug events
if not rate_limiter.should_log():
return None # Drop event BEFORE expensive processing
return event_dict

Place this processor EARLY in the chain, before JSON rendering:

chain_ordering.py
structlog.configure(
processors=[
# EARLY filtering - before expensive operations
sampling_processor, # Sample first (cheap)
# Context enrichment (also relatively cheap)
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
# FINAL: expensive serialization (only for events that passed filters)
structlog.processors.JSONRenderer(),
],
)

For my scraping pipeline, sampling successes at 1% and keeping all failures worked perfectly. I still see every error, but 99% of successful scrapes don’t consume CPU.

Deduplication Before Serialization

Another bottleneck: identical events processed repeatedly. My scraper hitting the same domain would log the same “connection established” message hundreds of times.

Deduplication prevents this, but again, placement matters:

dedup_processor.py
import hashlib
import time
from collections import defaultdict
last_seen = defaultdict(float)
def dedup_processor(logger, method_name, event_dict):
"""Skip duplicate events within 5 second window."""
# Create hash key from event message and key fields
key = hashlib.md5(
f"{event_dict.get('event', '')}:{event_dict.get('worker_id', '')}".encode()
).hexdigest()
now = time.time()
if now - last_seen[key] < 5.0: # 5 second dedup window
return None # Drop duplicate
last_seen[key] = now
return event_dict

This processor should go right after sampling, before any expensive formatting:

dedup_chain_position.py
structlog.configure(
processors=[
sampling_processor, # Sample first
dedup_processor, # Dedup second (both are cheap)
# Now only unique, sampled events reach here
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.JSONRenderer(),
],
)

The performance gain is significant: avoiding JSON rendering for 90%+ of duplicate events.

Context Enrichment Pipeline

The Reddit insight that changed my approach: “being able to attach stuff like worker_id, target_domain, retry_count to every log line pays off quickly.”

Structured context makes debugging distributed systems possible. When I see an error, I immediately know which worker, which domain, how many retries.

context_processor.py
def add_context_processor(logger, method_name, event_dict):
"""Add worker and request context to every log."""
# These would come from context variables in real use
event_dict.setdefault("worker_id", get_current_worker_id())
event_dict.setdefault("target_domain", get_current_domain())
event_dict.setdefault("retry_count", get_retry_count())
return event_dict

For scraping workflows, add scraping-specific context:

scraping_context.py
import structlog
from opentelemetry import trace
def scraping_context_processor(logger, method_name, event_dict):
"""Add scraping-specific context with trace correlation."""
event_dict.setdefault("url_count", get_url_count())
event_dict.setdefault("batch_id", get_batch_id())
event_dict.setdefault("response_status", get_last_status())
# Add OpenTelemetry trace context for correlation
span = trace.get_current_span()
if span and span.is_recording():
ctx = span.get_span_context()
event_dict["trace_id"] = format(ctx.trace_id, "032x")
return event_dict

This processor is relatively cheap - just dict operations. Place it after sampling/dedup but before serialization.

Complete High-Performance Configuration

Here’s my production configuration for the scraping pipeline:

high_performance_config.py
import logging
import orjson
import structlog
import hashlib
import time
from collections import defaultdict
# ===== Sampling Processor =====
class RateLimiter:
def __init__(self, sample_rate=0.01): # 1% for successes
self.sample_rate = sample_rate
self.counter = 0
def should_log(self):
self.counter += 1
return self.counter % int(1 / self.sample_rate) == 0
rate_limiter = RateLimiter()
def sampling_processor(logger, method_name, event_dict):
level = event_dict.get("level", "info")
if level in ("warning", "error", "critical"):
return event_dict
if not rate_limiter.should_log():
return None
return event_dict
# ===== Deduplication Processor =====
last_seen = defaultdict(float)
def dedup_processor(logger, method_name, event_dict):
key = hashlib.md5(
f"{event_dict.get('event', '')}:{event_dict.get('worker_id', '')}".encode()
).hexdigest()
now = time.time()
if now - last_seen[key] < 5.0:
return None
last_seen[key] = now
return event_dict
# ===== Context Processor =====
def add_context_processor(logger, method_name, event_dict):
event_dict.setdefault("worker_id", "worker-01")
event_dict.setdefault("target_domain", "example.com")
event_dict.setdefault("retry_count", 0)
return event_dict
# ===== HIGH-PERFORMANCE CONFIGURATION =====
structlog.configure(
cache_logger_on_first_use=True,
# Critical: makes INFO/DEBUG calls free
wrapper_class=structlog.make_filtering_bound_logger(logging.WARNING),
processors=[
# EARLY: cheap filtering (before expensive operations)
sampling_processor,
dedup_processor,
# MIDDLE: context enrichment
structlog.contextvars.merge_contextvars,
add_context_processor,
structlog.processors.add_log_level,
structlog.processors.format_exc_info,
structlog.processors.TimeStamper(fmt="iso", utc=True),
# FINAL: expensive serialization (only for filtered events)
structlog.processors.JSONRenderer(serializer=orjson.dumps),
],
logger_factory=structlog.BytesLoggerFactory(),
)
log = structlog.get_logger()

Key ordering principle:

Processor Chain Order (Critical)
EARLY (cheap, drop events):
1. sampling_processor - Drop most info events
2. dedup_processor - Drop duplicates
MIDDLE (enrichment):
3. contextvars merge - Add async context
4. add_context - Add worker/domain info
5. add_log_level - Add level field
6. format_exc_info - Add exception info
7. TimeStamper - Add timestamp
FINAL (expensive):
8. JSONRenderer - Serialize (orjson for speed)

Only events that pass all early filters reach the expensive JSON renderer.

Benchmarking Your Setup

To measure logging overhead:

benchmark_logging.py
import time
import structlog
def benchmark_logging(iterations=100000):
start = time.perf_counter()
for i in range(iterations):
log.info("processing_url", url=f"url-{i}", status=200)
elapsed = time.perf_counter() - start
print(f"{iterations} logs in {elapsed:.2f}s")
print(f"Rate: {iterations/elapsed:.0f} logs/sec")
print(f"Per-log overhead: {elapsed/iterations*1000:.3f}ms")

Run this with different configurations:

Benchmark Results (My Measurements)
Baseline (stdlib logging + JSON):
100,000 logs in 12.5s
Rate: 8,000 logs/sec
Per-log: 0.125ms
With structlog (no optimization):
100,000 logs in 6.3s
Rate: 16,000 logs/sec
Per-log: 0.063ms
With make_filtering_bound_logger(WARNING):
100,000 INFO logs in 0.02s
Rate: 5,000,000 logs/sec (essentially free)
Per-log: 0.000ms
With sampling + dedup before serialization:
Processed 1M events, logged ~50,000
CPU reduced from 80% to 25%

The difference between optimized and unoptimized is dramatic.

When to Use These Optimizations

Signs you need processor chain optimization:

Signs You Need Optimization
- CPU spikes correlated with log volume
- Log storage costs growing faster than traffic
- Application slowing with debug logging enabled
- Processor chain appearing in CPU profiling
- Processing >100k log events per day

Common high-throughput scenarios:

High-Throughput Scenarios
1. Scraping pipelines (~1M URLs/day)
- Sample successes at 1%, keep all failures
- Dedup connection messages
2. Data processing jobs
- Sample progress updates, keep errors
- Context: job_id, stage, progress
3. API gateways
- Sample health checks, keep errors
- Context: endpoint, latency, status
4. Real-time stream processing
- Heavy sampling for throughput events
- Context: stream_id, partition, offset

For low-volume applications (<10k events/day), standard logging is fine. The optimizations matter at scale.

Summary

I spent weeks trying different logging libraries before discovering the real bottleneck. The lesson:

  1. Level filtering first: make_filtering_bound_logger() makes disabled levels free
  2. Sample before serialization: Drop events early, not after expensive JSON rendering
  3. Deduplicate early: Skip identical events before they reach expensive processors
  4. Processor order matters: Cheap filters first, expensive serialization last
  5. Context enrichment: Structured fields pay off in debugging time

The Reddit user who pointed this out was right: “doing sampling and dedup before serialization saved way more CPU than switching logging libraries ever did.”

For my 1M URL/day pipeline, CPU dropped from 80% to 25% without changing the library. Just reordering the processor chain.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments