How to Optimize Python Logging for High-Throughput Applications: Processor Chain Performance

Apr 29, 2026

The Problem

My scraping pipeline was processing about 1 million URLs per day. CPU usage spiked during peak hours, and profiling showed logging was consuming significant resources. I thought switching to a faster logging library would solve it.

I tried loguru, then structlog, then picologging. Each claimed to be faster. The benchmarks showed structlog was about 2x faster than alternatives. But after switching, CPU usage barely dropped.

The real bottleneck wasn’t the library. It was the processor chain itself.

My approach:
1. Switch from stdlib logging to structlog → 2x faster (supposedly)
2. CPU still high → Confused
3. Profile deeper → Processor chain eating CPU

Reality:
- Library choice: ~5% improvement
- Processor chain optimization: ~60% improvement

A Reddit user working on similar scraping pipelines confirmed my discovery: “One thing that surprised us at higher volumes: the processor chain itself can become a bottleneck.”

The key insight: WHERE you filter matters more than WHAT library you use.

Understanding the Processor Chain Bottleneck

At high throughput, every processor call adds measurable overhead. Here’s what happens when you log an event:

Event → Processor 1 → Processor 2 → Processor 3 → ... → JSON Renderer → Output

Each processor:
- Function call overhead
- Event dict manipulation
- Potential string formatting
- JSON serialization (expensive!)

The problem compounds when you log at INFO level for normal operations. Most events get processed, serialized, and then filtered out by level checks. That’s wasted CPU.

INFO event → Add level → Add timestamp → Format exception → JSON render
→ Level filter (drop because INFO < WARNING) → WASTED

All that work, then dropped at the end.

The fix: filter BEFORE expensive operations, not after.

The Zero-Cost Optimization: Level Filtering

structlog’s make_filtering_bound_logger() is the game-changer. It makes disabled levels “essentially free” by replacing filtered log methods with return None.

import logging
import structlog

# Create filtered logger - DEBUG calls become FREE
FilteredLogger = structlog.make_filtering_bound_logger(logging.WARNING)

structlog.configure(
    wrapper_class=FilteredLogger,
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()

# This is FREE - no processor chain executes at all
log.debug("expensive debug info", data=compute_heavy_debug())

# This runs the full processor chain
log.warning("warning message", url="example.com")

When log.debug() is called on a WARNING-filtered logger, it immediately returns None. No event dict creation, no processor calls, zero overhead.

WITHOUT make_filtering_bound_logger():
  log.debug() → Create event dict → Run 5 processors → JSON render → Check level → Drop
  Cost: Full chain execution

WITH make_filtering_bound_logger():
  log.debug() → return None (replaced at bind time)
  Cost: Literally zero

This is the single most impactful optimization. Set your production level to WARNING or ERROR, and debug/info calls become free.

Sampling Before Serialization

For high-volume operations, even WARNING level might log too many events. Sampling lets you log 1 in N events, reducing volume by 90%+ while still capturing enough for debugging.

The critical mistake I made: sampling AFTER serialization.

Event → JSON render → Sample check → Drop (wasted JSON render)

Correct: sample BEFORE serialization.

Event → Sample check → Drop (cheap) OR Continue → JSON render

Here’s a sampling processor that goes early in the chain:

import structlog

class RateLimiter:
    def __init__(self, sample_rate=0.05):  # Log 5% of events
        self.sample_rate = sample_rate
        self.counter = 0

    def should_log(self):
        self.counter += 1
        return self.counter % int(1 / self.sample_rate) == 0

rate_limiter = RateLimiter(sample_rate=0.05)

def sampling_processor(logger, method_name, event_dict):
    """Sample INFO logs at 5% rate, keep all warnings/errors."""
    level = event_dict.get("level", "info")

    # Always keep errors and warnings
    if level in ("warning", "error", "critical"):
        return event_dict

    # Sample info/debug events
    if not rate_limiter.should_log():
        return None  # Drop event BEFORE expensive processing

    return event_dict

Place this processor EARLY in the chain, before JSON rendering:

structlog.configure(
    processors=[
        # EARLY filtering - before expensive operations
        sampling_processor,      # Sample first (cheap)

        # Context enrichment (also relatively cheap)
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,

        # FINAL: expensive serialization (only for events that passed filters)
        structlog.processors.JSONRenderer(),
    ],
)

For my scraping pipeline, sampling successes at 1% and keeping all failures worked perfectly. I still see every error, but 99% of successful scrapes don’t consume CPU.

Deduplication Before Serialization

Another bottleneck: identical events processed repeatedly. My scraper hitting the same domain would log the same “connection established” message hundreds of times.

Deduplication prevents this, but again, placement matters:

import hashlib
import time
from collections import defaultdict

last_seen = defaultdict(float)

def dedup_processor(logger, method_name, event_dict):
    """Skip duplicate events within 5 second window."""
    # Create hash key from event message and key fields
    key = hashlib.md5(
        f"{event_dict.get('event', '')}:{event_dict.get('worker_id', '')}".encode()
    ).hexdigest()

    now = time.time()
    if now - last_seen[key] < 5.0:  # 5 second dedup window
        return None  # Drop duplicate

    last_seen[key] = now
    return event_dict

This processor should go right after sampling, before any expensive formatting:

structlog.configure(
    processors=[
        sampling_processor,   # Sample first
        dedup_processor,      # Dedup second (both are cheap)

        # Now only unique, sampled events reach here
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)

The performance gain is significant: avoiding JSON rendering for 90%+ of duplicate events.

Context Enrichment Pipeline

The Reddit insight that changed my approach: “being able to attach stuff like worker_id, target_domain, retry_count to every log line pays off quickly.”

Structured context makes debugging distributed systems possible. When I see an error, I immediately know which worker, which domain, how many retries.

def add_context_processor(logger, method_name, event_dict):
    """Add worker and request context to every log."""
    # These would come from context variables in real use
    event_dict.setdefault("worker_id", get_current_worker_id())
    event_dict.setdefault("target_domain", get_current_domain())
    event_dict.setdefault("retry_count", get_retry_count())
    return event_dict

For scraping workflows, add scraping-specific context:

import structlog
from opentelemetry import trace

def scraping_context_processor(logger, method_name, event_dict):
    """Add scraping-specific context with trace correlation."""
    event_dict.setdefault("url_count", get_url_count())
    event_dict.setdefault("batch_id", get_batch_id())
    event_dict.setdefault("response_status", get_last_status())

    # Add OpenTelemetry trace context for correlation
    span = trace.get_current_span()
    if span and span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")

    return event_dict

This processor is relatively cheap - just dict operations. Place it after sampling/dedup but before serialization.

Complete High-Performance Configuration

Here’s my production configuration for the scraping pipeline:

import logging
import orjson
import structlog
import hashlib
import time
from collections import defaultdict

# ===== Sampling Processor =====
class RateLimiter:
    def __init__(self, sample_rate=0.01):  # 1% for successes
        self.sample_rate = sample_rate
        self.counter = 0

    def should_log(self):
        self.counter += 1
        return self.counter % int(1 / self.sample_rate) == 0

rate_limiter = RateLimiter()

def sampling_processor(logger, method_name, event_dict):
    level = event_dict.get("level", "info")
    if level in ("warning", "error", "critical"):
        return event_dict
    if not rate_limiter.should_log():
        return None
    return event_dict

# ===== Deduplication Processor =====
last_seen = defaultdict(float)

def dedup_processor(logger, method_name, event_dict):
    key = hashlib.md5(
        f"{event_dict.get('event', '')}:{event_dict.get('worker_id', '')}".encode()
    ).hexdigest()
    now = time.time()
    if now - last_seen[key] < 5.0:
        return None
    last_seen[key] = now
    return event_dict

# ===== Context Processor =====
def add_context_processor(logger, method_name, event_dict):
    event_dict.setdefault("worker_id", "worker-01")
    event_dict.setdefault("target_domain", "example.com")
    event_dict.setdefault("retry_count", 0)
    return event_dict

# ===== HIGH-PERFORMANCE CONFIGURATION =====
structlog.configure(
    cache_logger_on_first_use=True,
    # Critical: makes INFO/DEBUG calls free
    wrapper_class=structlog.make_filtering_bound_logger(logging.WARNING),
    processors=[
        # EARLY: cheap filtering (before expensive operations)
        sampling_processor,
        dedup_processor,

        # MIDDLE: context enrichment
        structlog.contextvars.merge_contextvars,
        add_context_processor,
        structlog.processors.add_log_level,
        structlog.processors.format_exc_info,
        structlog.processors.TimeStamper(fmt="iso", utc=True),

        # FINAL: expensive serialization (only for filtered events)
        structlog.processors.JSONRenderer(serializer=orjson.dumps),
    ],
    logger_factory=structlog.BytesLoggerFactory(),
)

log = structlog.get_logger()

Key ordering principle:

EARLY (cheap, drop events):
  1. sampling_processor   - Drop most info events
  2. dedup_processor      - Drop duplicates

MIDDLE (enrichment):
  3. contextvars merge    - Add async context
  4. add_context          - Add worker/domain info
  5. add_log_level        - Add level field
  6. format_exc_info      - Add exception info
  7. TimeStamper          - Add timestamp

FINAL (expensive):
  8. JSONRenderer         - Serialize (orjson for speed)

Only events that pass all early filters reach the expensive JSON renderer.

Benchmarking Your Setup

To measure logging overhead:

import time
import structlog

def benchmark_logging(iterations=100000):
    start = time.perf_counter()

    for i in range(iterations):
        log.info("processing_url", url=f"url-{i}", status=200)

    elapsed = time.perf_counter() - start
    print(f"{iterations} logs in {elapsed:.2f}s")
    print(f"Rate: {iterations/elapsed:.0f} logs/sec")
    print(f"Per-log overhead: {elapsed/iterations*1000:.3f}ms")

Run this with different configurations:

Baseline (stdlib logging + JSON):
  100,000 logs in 12.5s
  Rate: 8,000 logs/sec
  Per-log: 0.125ms

With structlog (no optimization):
  100,000 logs in 6.3s
  Rate: 16,000 logs/sec
  Per-log: 0.063ms

With make_filtering_bound_logger(WARNING):
  100,000 INFO logs in 0.02s
  Rate: 5,000,000 logs/sec (essentially free)
  Per-log: 0.000ms

With sampling + dedup before serialization:
  Processed 1M events, logged ~50,000
  CPU reduced from 80% to 25%

The difference between optimized and unoptimized is dramatic.

When to Use These Optimizations

Signs you need processor chain optimization:

- CPU spikes correlated with log volume
- Log storage costs growing faster than traffic
- Application slowing with debug logging enabled
- Processor chain appearing in CPU profiling
- Processing >100k log events per day

Common high-throughput scenarios:

1. Scraping pipelines (~1M URLs/day)
   - Sample successes at 1%, keep all failures
   - Dedup connection messages

2. Data processing jobs
   - Sample progress updates, keep errors
   - Context: job_id, stage, progress

3. API gateways
   - Sample health checks, keep errors
   - Context: endpoint, latency, status

4. Real-time stream processing
   - Heavy sampling for throughput events
   - Context: stream_id, partition, offset

For low-volume applications (<10k events/day), standard logging is fine. The optimizations matter at scale.

Summary

I spent weeks trying different logging libraries before discovering the real bottleneck. The lesson:

Level filtering first: make_filtering_bound_logger() makes disabled levels free
Sample before serialization: Drop events early, not after expensive JSON rendering
Deduplicate early: Skip identical events before they reach expensive processors
Processor order matters: Cheap filters first, expensive serialization last
Context enrichment: Structured fields pay off in debugging time

The Reddit user who pointed this out was right: “doing sampling and dedup before serialization saved way more CPU than switching logging libraries ever did.”

For my 1M URL/day pipeline, CPU dropped from 80% to 25% without changing the library. Just reordering the processor chain.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!