
Kedro vs pandas .pipe(): Which is Better for Production Data Pipelines?


Your pandas pipeline works great in a Jupyter notebook. But when you need to run it daily on a server, share it with teammates, or debug why yesterday’s output differs from today’s, things get messy fast.

I recently faced this dilemma. A simple 5-step transformation pipeline grew into a 15-step monster with branching paths. Debugging became a guessing game. Teammates asked “which version produced this output?” and I had no answer.

This article compares two approaches: pandas’ built-in .pipe() method for lightweight pipelines, and Kedro for production-grade workflows. I’ll show you exactly when each approach makes sense.

The Problem: Messy Data Pipeline Code

Without proper structure, data pipelines suffer from:

  • Hard to debug - Where did the data go wrong?
  • Hard to test - How do you test step 5 in isolation?
  • Hard to version - What changed between runs?
  • Hard to collaborate - Who wrote this transformation and why?

I’ve seen pipelines that look like this:

messy_pipeline.py
import pandas as pd
df = pd.read_csv('sales.csv')
df = df[df['price'] > 10]
df = df[df['price'] < 500]
df['total'] = df['price'] * df['quantity']
df = df.sort_values('total', ascending=False)
# ... 10 more steps mixed together
df.to_csv('output.csv')

This works, but it’s fragile. Let me show you two better approaches.

Approach 1: pandas .pipe() - Lightweight and Quick

The .pipe() method passes your DataFrame to a function as its first argument and returns the result, so custom functions chain just like built-in DataFrame methods. It turns messy code into a readable pipeline.

pipe_pipeline.py
import pandas as pd

def filter_by_price(df, min_price=0, max_price=1000):
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]

def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df

def sort_by_total(df, ascending=False):
    return df.sort_values('total', ascending=ascending)

# Clean, readable pipeline
df = (
    pd.read_csv('sales.csv')
    .pipe(filter_by_price, min_price=10, max_price=500)
    .pipe(calculate_total)
    .pipe(sort_by_total, ascending=False)
)
df.to_csv('output.csv', index=False)

Why this works:

Each function is isolated and testable. The pipeline reads top-to-bottom like a story. Parameters pass naturally through the chain. Zero framework overhead.
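
To make the "isolated and testable" point concrete, here is a minimal sketch of a test for one step. The sample values and the test name are made up for illustration; the only assumption is that the step is importable as a plain function.

```python
import pandas as pd


def filter_by_price(df, min_price=0, max_price=1000):
    """Keep rows whose price lies in [min_price, max_price] (bounds inclusive)."""
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]


def test_filter_by_price_keeps_inclusive_bounds():
    # Tiny hand-made frame (hypothetical data): no CSV file needed to test the step.
    sample = pd.DataFrame({'price': [5, 10, 250, 500, 750],
                           'quantity': [1, 2, 3, 4, 5]})
    result = filter_by_price(sample, min_price=10, max_price=500)
    assert result['price'].tolist() == [10, 250, 500]


# pytest would discover the function by name; calling it directly also works.
test_filter_by_price_keeps_inclusive_bounds()
```

Because the step never touches the file system, the test runs in milliseconds and fails at the exact transformation that broke.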

I used this approach for months. It’s perfect for:

  • One-off analyses in notebooks
  • Pipelines under 10 transformations
  • Solo projects with no versioning needs
  • Quick prototyping before committing to a framework

Where it falls short:

  • No built-in versioning or data catalog
  • No pipeline visualization for stakeholders
  • Harder to inspect or persist intermediate outputs in the middle of a chain
  • No standard project structure for teams
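
One partial workaround for the intermediate-output problem is a pass-through step that reports on the frame and returns it unchanged. The sketch below is only an illustration: log_shape is a hypothetical helper name, not part of pandas, and the sample data is invented.

```python
import pandas as pd


def log_shape(df, label, log=print):
    """Report the frame's size at this point in the chain, then pass it through unchanged."""
    log(f"{label}: {df.shape[0]} rows, {df.shape[1]} columns")
    return df


def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df


# Hypothetical sample data standing in for sales.csv.
sales = pd.DataFrame({'price': [5, 20, 300], 'quantity': [2, 1, 4]})

messages = []  # collect the reports in a list instead of printing them
result = (
    sales
    .pipe(log_shape, 'raw', log=messages.append)
    .pipe(lambda d: d[d['price'] >= 10])
    .pipe(log_shape, 'after price filter', log=messages.append)
    .pipe(calculate_total)
)
```

This shows where rows disappear, but it is still manual: nothing is saved between runs, which is the gap a data catalog fills.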

Approach 2: Kedro - Production-Grade Framework

When my pipeline grew beyond 10 steps and teammates needed to modify it, I switched to Kedro. It provides structure, versioning, and visualization out of the box.

Kedro enforces a standard project layout:

Kedro project structure
project/
├── conf/
│   ├── base/
│   │   ├── catalog.yml        # Data sources and outputs
│   │   └── parameters.yml     # Configuration values
├── src/
│   └── project/
│       ├── pipelines/
│       │   └── data_processing/
│       │       ├── pipeline.py
│       │       └── nodes.py
│       └── pipeline_registry.py
└── pyproject.toml

A comparable pipeline in Kedro looks like this:

catalog.yml
# Current kedro-datasets releases spell the class CSVDataset; older releases used CSVDataSet.
raw_data:
  type: pandas.CSVDataset
  filepath: data.csv

processed_data:
  type: pandas.CSVDataset
  filepath: output.csv

parameters.yml
margin_rate: 0.3
min_revenue: 0

nodes.py
import pandas as pd

def clean_data(df):
    return df.dropna().drop_duplicates()

def enrich_data(df, margin_rate):
    df = df.copy()
    df['revenue'] = df['price'] * df['quantity']
    df['margin'] = df['revenue'] * margin_rate
    return df

def filter_valid(df, min_revenue):
    return df[df['revenue'] > min_revenue]

def aggregate_by_category(df):
    return df.groupby('category').agg({
        'revenue': 'sum',
        'margin': 'sum'
    }).reset_index()

pipeline.py
from kedro.pipeline import Pipeline, node, pipeline
from .nodes import clean_data, enrich_data, filter_valid, aggregate_by_category

def create_pipeline(**kwargs):
    return pipeline([
        node(clean_data, 'raw_data', 'cleaned_data', name='clean'),
        # 'params:<key>' passes a value from parameters.yml as a node argument
        node(enrich_data, ['cleaned_data', 'params:margin_rate'], 'enriched_data',
             name='enrich'),
        node(filter_valid, ['enriched_data', 'params:min_revenue'], 'filtered_data',
             name='filter'),
        node(aggregate_by_category, 'filtered_data', 'processed_data',
             name='aggregate'),
    ])
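
If the name-based wiring feels like magic, the toy runner below may help. It is emphatically not Kedro's implementation: run_nodes, double, and keep_above are hypothetical names, the "catalog" is a plain dict, and nodes run in list order, whereas Kedro works out the order from the dataset names and loads and saves through the real catalog. It only sketches the idea that each node's inputs and output are looked up by name.

```python
def run_nodes(nodes, catalog):
    """Run (func, input_names, output_name) triples in order against a dict 'catalog'."""
    data = dict(catalog)  # copy so the caller's dict is left untouched
    for func, input_names, output_name in nodes:
        args = [data[name] for name in input_names]  # resolve inputs by name
        data[output_name] = func(*args)              # store the result under its name
    return data


def double(values):
    return [v * 2 for v in values]


def keep_above(values, threshold):
    return [v for v in values if v > threshold]


# A parameter is just another named entry; the 'params:' prefix mirrors Kedro's convention.
catalog = {'raw_data': [1, 2, 3], 'params:threshold': 3}
nodes = [
    (double, ['raw_data'], 'doubled_data'),
    (keep_above, ['doubled_data', 'params:threshold'], 'filtered_data'),
]
result = run_nodes(nodes, catalog)
```

Every intermediate result stays addressable by name afterwards, which is what makes inspecting a single step so much easier than in a method chain.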

Running the pipeline:

Terminal
kedro run
kedro viz # Opens interactive visualization (needs the kedro-viz plugin; recent releases use `kedro viz run`)

What changed for me:

The Reddit discussion on .pipe() had a comment that resonated: “I am pretty sure you will love Kedro… modifying or adding steps is way easier than plain python.”

Kedro gave me:

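  • Versioning - Datasets marked versioned: true in the catalog are saved under a new timestamp on every run; code changes are still tracked with Git
  • Visualization - kedro viz shows the pipeline graph for debugging and stakeholder communication
  • Testing - Nodes are plain Python functions, so each one can be unit-tested in isolation with pytest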
  • Configuration - YAML files separate config from code
  • Data Catalog - Centralized data source management

The trade-off:

Learning curve. Project structure conventions. Setup overhead for small projects. Team buy-in required.

Head-to-Head Comparison

Feature comparison table
| Feature            | pandas .pipe()   | Kedro                    |
|--------------------|------------------|--------------------------|
| Setup overhead     | Zero             | Moderate                 |
| Learning curve     | Minutes          | Hours                    |
| Versioning         | Manual           | Built-in for datasets    |
| Visualization      | None             | kedro viz                |
| Testing            | pytest manually  | pytest on node functions |
| Configuration      | Code or argparse | YAML-based               |
| Data catalog       | Manual           | Built-in                 |
| Team collaboration | Ad-hoc           | Structured               |
| Best for           | Solo, simple     | Teams, production        |

Performance-wise, both approaches are similar. Kedro adds minimal overhead since it’s orchestration, not computation. The real difference is in maintainability.

Decision Framework: Which Should You Choose?

Decision criteria
| Factor                 | Use .pipe()         | Use Kedro              |
|------------------------|---------------------|------------------------|
| Team size              | Solo or 2-3 people  | Multiple developers    |
| Pipeline complexity    | <10 transformations | Many steps, branches   |
| Production needs       | One-off analysis    | Scheduled, monitored   |
| Data versioning        | Not needed          | Required               |
| Testing                | Informal            | Formal test suite      |
| Stakeholder visibility | Not needed          | Visualization required |

Signs you should migrate from .pipe() to Kedro:

  • Pipeline exceeds 10 transformations
  • Multiple team members need to modify it
  • You need scheduled execution (Kedro does not schedule runs itself, but kedro run is easy to call from cron or an orchestrator)
  • Debugging takes longer than writing
  • Data lineage questions arise frequently

Signs you should stick with .pipe():

  • Quick prototyping or one-off analysis
  • Pipeline comfortably under 10 steps
  • No team collaboration needed
  • Jupyter notebook workflow suffices

Common Mistakes to Avoid

I made these mistakes. Learn from them:

  1. Over-engineering simple pipelines - Don’t introduce Kedro for a 3-step transformation you run once a month.

  2. Under-engineering critical pipelines - Don’t stick with .pipe() when your team needs reproducibility and versioning.

  3. Ignoring testing - Both approaches need tests. Neither enforces them automatically.

  4. Mixing approaches inconsistently - Pick one style per project. Mixing .pipe() scripts with Kedro projects creates confusion.

Alternative: Functional Approach with itertools

If you want a middle ground without framework overhead, Python’s itertools.accumulate offers a functional approach:

functional_pipeline.py
from itertools import accumulate
import pandas as pd

def filter_by_price(df):
    return df[(df['price'] >= 10) & (df['price'] <= 500)]

def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df

def sort_by_total(df):
    return df.sort_values('total', ascending=False)

# Functional pipeline: accumulate yields every intermediate frame; the last one is the result
df = pd.read_csv('sales.csv')
transformations = [filter_by_price, calculate_total, sort_by_total]
df_final = list(accumulate(transformations, lambda x, f: f(x), initial=df))[-1]

This works but lacks parameter passing and is less readable for most Python developers. I prefer .pipe() for its natural syntax.
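
If you do go the functional route, one way to recover parameter passing is to bind the arguments up front with functools.partial. The sketch below swaps accumulate for functools.reduce, since only the final frame is needed, and uses invented sample data instead of sales.csv.

```python
from functools import partial, reduce

import pandas as pd


def filter_by_price(df, min_price=0, max_price=1000):
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]


def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df


def sort_by_total(df, ascending=False):
    return df.sort_values('total', ascending=ascending)


# Hypothetical sample data standing in for sales.csv.
sales = pd.DataFrame({'price': [5, 20, 300], 'quantity': [2, 10, 1]})

# partial() fixes the keyword arguments, leaving a one-argument function of the frame.
steps = [
    partial(filter_by_price, min_price=10, max_price=500),
    calculate_total,
    partial(sort_by_total, ascending=False),
]
df_final = reduce(lambda frame, step: step(frame), steps, sales)
```

It works, but compare it with the .pipe() chain earlier: the parameters now sit in a list away from the data flow, which is part of why I find .pipe() easier to read.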

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
