Kedro vs pandas .pipe(): Which is Better for Production Data Pipelines?

Apr 30, 2026


Your pandas pipeline works great in a Jupyter notebook. But when you need to run it daily on a server, share it with teammates, or debug why yesterday’s output differs from today’s, things get messy fast.

I recently faced this dilemma. A simple 5-step transformation pipeline grew into a 15-step monster with branching paths. Debugging became a guessing game. Teammates asked “which version produced this output?” and I had no answer.

This article compares two approaches: pandas’ built-in .pipe() method for lightweight pipelines, and Kedro for production-grade workflows. I’ll show you exactly when each approach makes sense.

The Problem: Messy Data Pipeline Code

Without proper structure, data pipelines become:

Hard to debug - Where did the data go wrong?
Hard to test - How do you test step 5 in isolation?
Hard to version - What changed between runs?
Hard to collaborate - Who wrote this transformation and why?

I’ve seen pipelines that look like this:

import pandas as pd

df = pd.read_csv('sales.csv')
df = df[df['price'] > 10]
df = df[df['price'] < 500]
df['total'] = df['price'] * df['quantity']
df = df.sort_values('total', ascending=False)
# ... 10 more steps mixed together
df.to_csv('output.csv')

This works, but it’s fragile. Let me show you two better approaches.

Approach 1: pandas .pipe() - Lightweight and Quick

The .pipe() method lets you chain custom functions, passing the DataFrame as each function’s first argument. It turns messy code into readable pipelines.

import pandas as pd

def filter_by_price(df, min_price=0, max_price=1000):
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]

def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df

def sort_by_total(df, ascending=False):
    return df.sort_values('total', ascending=ascending)

# Clean, readable pipeline
df = (
    pd.read_csv('sales.csv')
    .pipe(filter_by_price, min_price=10, max_price=500)
    .pipe(calculate_total)
    .pipe(sort_by_total, ascending=False)
)

df.to_csv('output.csv', index=False)

Why this works:

Each function is isolated and testable. The pipeline reads top-to-bottom like a story. Parameters pass naturally through the chain. Zero framework overhead.
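
Because each step is a plain function, it can be checked on a tiny hand-built DataFrame. Here is a minimal sketch, assuming pytest-style test functions; the test names and sample values are made up for illustration:

```python
import pandas as pd


def filter_by_price(df, min_price=0, max_price=1000):
    """Keep rows whose price lies in [min_price, max_price]."""
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]


def calculate_total(df):
    """Return a copy with a 'total' column (price * quantity)."""
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df


def test_filter_by_price_keeps_inclusive_bounds():
    # Hypothetical sample data: two rows on the bounds, two outside
    df = pd.DataFrame({'price': [5, 10, 500, 501], 'quantity': [1, 1, 1, 1]})
    result = filter_by_price(df, min_price=10, max_price=500)
    assert result['price'].tolist() == [10, 500]


def test_calculate_total_does_not_mutate_input():
    df = pd.DataFrame({'price': [2, 3], 'quantity': [4, 5]})
    result = calculate_total(df)
    assert result['total'].tolist() == [8, 15]
    # The original frame must stay untouched because the step copies first
    assert 'total' not in df.columns
```

Saved in a file such as test_steps.py (a hypothetical name), these run with a plain `pytest` call and need no sales.csv on disk.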

I used this approach for months. It’s perfect for:

One-off analyses in notebooks
Pipelines under 10 transformations
Solo projects with no versioning needs
Quick prototyping before committing to a framework

Where it falls short:

No built-in versioning or data catalog
No pipeline visualization for stakeholders
Harder to test intermediate outputs
No standard project structure for teams
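
Intermediate outputs are not out of reach with .pipe(), they just need manual plumbing. One possible workaround is a pass-through step that stores a snapshot. In this sketch, `checkpoint` and the `snapshots` dict are hypothetical helpers, not part of pandas, and a small in-memory frame stands in for sales.csv:

```python
import pandas as pd


def checkpoint(df, label, store):
    """Hypothetical helper: save a copy under `label`, return df unchanged."""
    store[label] = df.copy()
    return df


def filter_by_price(df, min_price=0, max_price=1000):
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]


def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df


# Small in-memory stand-in for sales.csv
sales = pd.DataFrame({'price': [5, 20, 600], 'quantity': [1, 2, 3]})

snapshots = {}
result = (
    sales
    .pipe(filter_by_price, min_price=10, max_price=500)
    .pipe(checkpoint, 'after_filter', snapshots)   # extra args follow the DataFrame
    .pipe(calculate_total)
    .pipe(checkpoint, 'after_total', snapshots)
)

# Inspect what each stage produced
for label, frame in snapshots.items():
    print(label, frame.shape)
```

This keeps every snapshot in memory, so it suits debugging small data rather than long-running production jobs.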

Approach 2: Kedro - Production-Grade Framework

When my pipeline grew beyond 10 steps and teammates needed to modify it, I switched to Kedro. It provides project structure, a data catalog, dataset versioning, and (through the Kedro-Viz plugin) pipeline visualization.

Kedro enforces a standard project layout:

project/
├── conf/
│   ├── base/
│   │   ├── catalog.yml    # Data sources and outputs
│   │   └── parameters.yml # Configuration values
├── src/
│   └── project/
│       ├── pipelines/
│       │   └── data_processing/
│       │       ├── pipeline.py
│       │       └── nodes.py
│       └── pipeline_registry.py
└── pyproject.toml

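A comparable pipeline in Kedro is split across configuration files and code. The steps differ slightly from the .pipe() example: this one cleans, enriches, filters, and aggregates.

# conf/base/catalog.yml
# (the type was spelled CSVDataSet in older kedro-datasets releases)
raw_data:
  type: pandas.CSVDataset
  filepath: data.csv

processed_data:
  type: pandas.CSVDataset
  filepath: output.csv

# conf/base/parameters.yml
margin_rate: 0.3
min_revenue: 0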

# src/project/pipelines/data_processing/nodes.py
import pandas as pd

def clean_data(df):
    return df.dropna().drop_duplicates()

def enrich_data(df, margin_rate):
    df = df.copy()
    df['revenue'] = df['price'] * df['quantity']
    df['margin'] = df['revenue'] * margin_rate
    return df

def filter_valid(df, min_revenue):
    return df[df['revenue'] > min_revenue]

def aggregate_by_category(df):
    return df.groupby('category').agg({
        'revenue': 'sum',
        'margin': 'sum'
    }).reset_index()

# src/project/pipelines/data_processing/pipeline.py
from kedro.pipeline import node, pipeline
from .nodes import clean_data, enrich_data, filter_valid, aggregate_by_category

def create_pipeline(**kwargs):
    return pipeline([
        node(clean_data, 'raw_data', 'cleaned_data', name='clean'),
        # 'params:<name>' feeds a value from parameters.yml into the node
        node(enrich_data, ['cleaned_data', 'params:margin_rate'],
             'enriched_data', name='enrich'),
        node(filter_valid, ['enriched_data', 'params:min_revenue'],
             'filtered_data', name='filter'),
        node(aggregate_by_category, 'filtered_data', 'processed_data',
             name='aggregate'),
    ])
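
To make the wiring concrete: each node names its inputs and its output, and the framework looks those names up in the catalog and parameters. The following is a toy illustration of that idea in plain Python, not Kedro’s actual implementation. `run_nodes`, the tuple layout, and the in-memory `catalog` dict are all hypothetical; only the `params:` prefix mirrors Kedro’s naming convention for parameters.

```python
import pandas as pd


def clean_data(df):
    return df.dropna().drop_duplicates()


def enrich_data(df, margin_rate):
    df = df.copy()
    df['revenue'] = df['price'] * df['quantity']
    df['margin'] = df['revenue'] * margin_rate
    return df


def run_nodes(nodes, catalog):
    """Toy runner: call each (func, input_names, output_name) in order,
    resolving inputs by name and storing the result under the output name."""
    for func, input_names, output_name in nodes:
        args = [catalog[name] for name in input_names]
        catalog[output_name] = func(*args)
    return catalog


# Hypothetical in-memory "catalog": names map to datasets or parameter values
catalog = {
    'raw_data': pd.DataFrame({
        'price': [10.0, 10.0, None],   # one duplicate row, one row with a gap
        'quantity': [2, 2, 1],
    }),
    'params:margin_rate': 0.3,
}

nodes = [
    (clean_data, ['raw_data'], 'cleaned_data'),
    (enrich_data, ['cleaned_data', 'params:margin_rate'], 'enriched_data'),
]

catalog = run_nodes(nodes, catalog)
print(catalog['enriched_data'])
```

Every intermediate result ends up addressable by name, which is what makes it possible to inspect or persist 'cleaned_data' without touching the node functions. The real framework additionally handles loading, saving, and execution order.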

Running the pipeline:

kedro run
kedro viz  # Opens interactive visualization (requires the kedro-viz plugin: pip install kedro-viz)

What changed for me:

The Reddit discussion on .pipe() had a comment that resonated: “I am pretty sure you will love Kedro… modifying or adding steps is way easier than plain python.”

Kedro gave me:

Versioning - Datasets marked versioned: true in the catalog get a timestamped copy on every save; code changes are still tracked with Git
Visualization - kedro viz shows the pipeline graph for debugging and stakeholder communication
Testing - Nodes are plain functions, so each one can be unit-tested in isolation with pytest
Configuration - YAML files separate config from code
Data Catalog - Centralized data source management

The trade-off:

Learning curve
Project structure conventions
Setup overhead for small projects
Team buy-in required

Head-to-Head Comparison

| Feature              | pandas .pipe()    | Kedro              |
|---------------------|-------------------|---------------------|
| Setup overhead      | Zero              | Moderate            |
| Learning curve      | Minutes           | Hours               |
| Versioning          | Manual            | Built-in            |
| Visualization       | None              | kedro viz           |
| Testing             | pytest manually   | pytest per node    |
| Configuration       | Code or argparse  | YAML-based          |
| Data catalog        | Manual            | Built-in            |
| Team collaboration  | Ad-hoc            | Structured          |
| Best for            | Solo, simple      | Teams, production   |

Performance-wise, both approaches are similar. Kedro adds minimal overhead since it’s orchestration, not computation. The real difference is in maintainability.
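
To illustrate what “Manual” versioning can mean on the .pipe() side, here is a rough sketch that fingerprints a pipeline’s output so two runs can be compared. The `fingerprint` helper and the sample frames are hypothetical; `pd.util.hash_pandas_object` is the pandas function doing the row hashing.

```python
import hashlib

import pandas as pd


def fingerprint(df):
    """Hypothetical helper: short SHA-256 digest of column names plus row hashes."""
    digest = hashlib.sha256()
    digest.update(','.join(map(str, df.columns)).encode('utf-8'))
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    digest.update(row_hashes.to_numpy().tobytes())
    return digest.hexdigest()[:12]


# Made-up outputs from two runs; a single price differs
yesterday = pd.DataFrame({'price': [10, 20], 'quantity': [1, 2]})
today = pd.DataFrame({'price': [10, 21], 'quantity': [1, 2]})

print('yesterday:', fingerprint(yesterday))
print('today:    ', fingerprint(today))
print('identical:', fingerprint(yesterday) == fingerprint(today))
```

Logging this digest next to the Git commit hash on every run is one lightweight way to answer “which version produced this output?” without a framework.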

Decision Framework: Which Should You Choose?

| Factor                | Use .pipe()           | Use Kedro              |
|-----------------------|----------------------|------------------------|
| Team size             | Solo or 2-3 people   | Multiple developers    |
| Pipeline complexity   | <10 transformations  | Many steps, branches   |
| Production needs      | One-off analysis     | Scheduled, monitored   |
| Data versioning       | Not needed           | Required               |
| Testing               | Informal             | Formal test suite      |
| Stakeholder visibility| Not needed           | Visualization required |

Signs you should migrate from .pipe() to Kedro:

Pipeline exceeds 10 transformations
Multiple team members need to modify it
You need scheduled execution
Debugging takes longer than writing
Data lineage questions arise frequently

Signs you should stick with .pipe():

Quick prototyping or one-off analysis
Pipeline under 10 steps
No team collaboration needed
Jupyter notebook workflow suffices

Common Mistakes to Avoid

I made these mistakes. Learn from them:

Over-engineering simple pipelines - Don’t introduce Kedro for a 3-step transformation you run once a month.
Under-engineering critical pipelines - Don’t stick with .pipe() when your team needs reproducibility and versioning.
Ignoring testing - Both approaches need tests. Neither enforces them automatically.
Mixing approaches inconsistently - Pick one style per project. Mixing .pipe() scripts with Kedro projects creates confusion.

Alternative: Functional Approach with itertools

If you want a middle ground without framework overhead, Python’s itertools.accumulate offers a functional approach:

from itertools import accumulate

import pandas as pd

def filter_by_price(df):
    return df[(df['price'] >= 10) & (df['price'] <= 500)]

def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df

def sort_by_total(df):
    return df.sort_values('total', ascending=False)

df = pd.read_csv('sales.csv')

# Functional pipeline: accumulate() yields every intermediate result,
# so the last element is the fully transformed DataFrame.
# The initial= argument requires Python 3.8+.
transformations = [filter_by_price, calculate_total, sort_by_total]
df_final = list(accumulate(transformations, lambda x, f: f(x), initial=df))[-1]

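This works, but it has no built-in way to pass parameters (you would need functools.partial or lambdas), it keeps every intermediate DataFrame in memory just to read the last one, and it is less readable for most Python developers. I prefer .pipe() for its natural syntax.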
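
If you do go the functional route, a variation worth considering swaps accumulate for functools.reduce, which returns only the final value, and uses functools.partial to bind parameters. This is a sketch under those assumptions, with a small in-memory frame standing in for sales.csv:

```python
from functools import partial, reduce

import pandas as pd


def filter_by_price(df, min_price=0, max_price=1000):
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]


def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df


def sort_by_total(df, ascending=False):
    return df.sort_values('total', ascending=ascending)


# Small in-memory stand-in for the sales data
df = pd.DataFrame({'price': [5, 20, 100], 'quantity': [1, 3, 1]})

# partial() pre-binds keyword arguments so every step takes only the DataFrame
transformations = [
    partial(filter_by_price, min_price=10, max_price=500),
    calculate_total,
    partial(sort_by_total, ascending=False),
]

# reduce() threads the DataFrame through each step and keeps only the result
df_final = reduce(lambda acc, step: step(acc), transformations, df)
print(df_final)
```

The list of steps is now plain data that can be built or reordered at runtime, which is the main thing this style offers over .pipe().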

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can email me.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Kedro Official Documentation
👨‍💻 pandas DataFrame.pipe() Documentation
👨‍💻 Reddit Discussion: .pipe() in pandas changed how I write data pipelines

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!