Kedro vs pandas .pipe(): Which is Better for Production Data Pipelines?
Your pandas pipeline works great in a Jupyter notebook. But when you need to run it daily on a server, share it with teammates, or debug why yesterday’s output differs from today’s, things get messy fast.
I recently faced this dilemma. A simple 5-step transformation pipeline grew into a 15-step monster with branching paths. Debugging became a guessing game. Teammates asked “which version produced this output?” and I had no answer.
This article compares two approaches: pandas’ built-in .pipe() method for lightweight pipelines, and Kedro for production-grade workflows. I’ll show you exactly when each approach makes sense.
The Problem: Messy Data Pipeline Code
Without proper structure, data pipelines suffer from:
- Hard to debug - Where did the data go wrong?
- Hard to test - How do you test step 5 in isolation?
- Hard to version - What changed between runs?
- Hard to collaborate - Who wrote this transformation and why?
I’ve seen pipelines that look like this:
import pandas as pd

df = pd.read_csv('sales.csv')
df = df[df['price'] > 10]
df = df[df['price'] < 500]
df['total'] = df['price'] * df['quantity']
df = df.sort_values('total', ascending=False)
# ... 10 more steps mixed together
df.to_csv('output.csv')

This works, but it’s fragile. Let me show you two better approaches.
Approach 1: pandas .pipe() - Lightweight and Quick
The .pipe() method lets you chain custom functions: df.pipe(func, *args, **kwargs) calls func(df, *args, **kwargs), so the DataFrame is always passed as the first argument. That turns a run of reassignments into a readable chain.
import pandas as pd

def filter_by_price(df, min_price=0, max_price=1000):
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]

def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df

def sort_by_total(df, ascending=False):
    return df.sort_values('total', ascending=ascending)

# Clean, readable pipeline
df = (
    pd.read_csv('sales.csv')
    .pipe(filter_by_price, min_price=10, max_price=500)
    .pipe(calculate_total)
    .pipe(sort_by_total, ascending=False)
)

df.to_csv('output.csv', index=False)

Why this works:
- Each function is isolated and testable.
- The pipeline reads top-to-bottom like a story.
- Parameters pass naturally through the chain.
- Zero framework overhead.
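Because each step is an ordinary function that takes and returns a DataFrame, it can be checked on a tiny hand-built frame without touching sales.csv. A minimal sketch, assuming the filter_by_price and calculate_total definitions above (repeated here so the block runs on its own); the sample values and test names are made up for illustration:

```python
import pandas as pd


def filter_by_price(df, min_price=0, max_price=1000):
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]


def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df


def test_filter_by_price_keeps_bounds():
    # Hypothetical sample data: two prices sit exactly on the bounds
    sample = pd.DataFrame({'price': [5, 10, 500, 501], 'quantity': [1, 1, 1, 1]})
    result = filter_by_price(sample, min_price=10, max_price=500)
    assert result['price'].tolist() == [10, 500]


def test_calculate_total_does_not_mutate_input():
    sample = pd.DataFrame({'price': [2.0], 'quantity': [3]})
    result = calculate_total(sample)
    assert result['total'].tolist() == [6.0]
    assert 'total' not in sample.columns  # the input frame is left untouched
```

Saved in a file such as test_steps.py (a hypothetical name), these functions are picked up by pytest as-is.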
I used this approach for months. It’s perfect for:
- One-off analyses in notebooks
- Pipelines under 10 transformations
- Solo projects with no versioning needs
- Quick prototyping before committing to a framework
Where it falls short:
- No built-in versioning or data catalog
- No pipeline visualization for stakeholders
- Harder to test intermediate outputs
- No standard project structure for teams
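The intermediate-output gap can be narrowed without leaving .pipe(): a pass-through step can record what the frame looks like at each stage. A minimal sketch; log_step and the snapshots dict are hypothetical helpers written for this example, not part of pandas, and the sample rows are made up:

```python
import pandas as pd

snapshots = {}  # hypothetical in-memory store for intermediate frames


def log_step(df, label):
    """Pass-through step: record a copy of the frame, then return it unchanged."""
    snapshots[label] = df.copy()
    print(f"{label}: {len(df)} rows, columns={list(df.columns)}")
    return df


def filter_by_price(df, min_price=0, max_price=1000):
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]


def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df


# Made-up rows standing in for sales.csv
sales = pd.DataFrame({'price': [5, 20, 700], 'quantity': [2, 3, 1]})

result = (
    sales
    .pipe(log_step, 'raw')
    .pipe(filter_by_price, min_price=10, max_price=500)
    .pipe(log_step, 'filtered')
    .pipe(calculate_total)
    .pipe(log_step, 'with_total')
)
```

This covers ad-hoc debugging, but the snapshots only live in memory for the current run. Naming and persisting them consistently is still up to you.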
Approach 2: Kedro - Production-Grade Framework
When my pipeline grew beyond 10 steps and teammates needed to modify it, I switched to Kedro. It provides structure, versioning, and visualization out of the box.
Kedro enforces a standard project layout:
project/
├── conf/
│   ├── base/
│   │   ├── catalog.yml        # Data sources and outputs
│   │   └── parameters.yml     # Configuration values
├── src/
│   └── project/
│       ├── pipelines/
│       │   └── data_processing/
│       │       ├── pipeline.py
│       │       └── nodes.py
│       └── pipeline_registry.py
└── pyproject.toml

The same pipeline in Kedro looks like this:
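# conf/base/catalog.yml
raw_data:
  type: pandas.CSVDataSet  # spelled pandas.CSVDataset in kedro-datasets 2.0 and later
  filepath: data.csv

processed_data:
  type: pandas.CSVDataSet
  filepath: output.csv

# conf/base/parameters.yml
margin_rate: 0.3
min_revenue: 0

# src/project/pipelines/data_processing/nodes.py
import pandas as pd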
def clean_data(df):
    return df.dropna().drop_duplicates()

def enrich_data(df, margin_rate):
    df = df.copy()
    df['revenue'] = df['price'] * df['quantity']
    df['margin'] = df['revenue'] * margin_rate
    return df

def filter_valid(df, min_revenue):
    return df[df['revenue'] > min_revenue]
def aggregate_by_category(df):
    return df.groupby('category').agg({
        'revenue': 'sum',
        'margin': 'sum'
    }).reset_index()

# src/project/pipelines/data_processing/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline
from .nodes import clean_data, enrich_data, filter_valid, aggregate_by_category

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(clean_data, 'raw_data', 'cleaned_data', name='clean'),
        # The 'params:' prefix feeds values from parameters.yml into the node;
        # without it, enrich_data and filter_valid would be missing an argument.
        node(enrich_data, ['cleaned_data', 'params:margin_rate'], 'enriched_data', name='enrich'),
        node(filter_valid, ['enriched_data', 'params:min_revenue'], 'filtered_data', name='filter'),
        node(aggregate_by_category, 'filtered_data', 'processed_data', name='aggregate')
    ])

Running the pipeline:
kedro run
kedro viz  # Opens interactive visualization

What changed for me:
The Reddit discussion on .pipe() had a comment that resonated: “I am pretty sure you will love Kedro… modifying or adding steps is way easier than plain python.”
Kedro gave me:
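- Versioning - The data catalog can save timestamped versions of each dataset on every run (code history still comes from Git)
- Visualization - kedro viz shows the pipeline graph for debugging and stakeholder communication
- Testing - Each node is a plain function that can be unit-tested in isolation, and the project template can scaffold a pytest setup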
- Configuration - YAML files separate config from code
- Data Catalog - Centralized data source management
The trade-off:
- Learning curve
- Project structure conventions
- Setup overhead for small projects
- Team buy-in required
Head-to-Head Comparison
| Feature             | pandas .pipe()          | Kedro                            |
|---------------------|-------------------------|----------------------------------|
| Setup overhead      | Zero                    | Moderate                         |
| Learning curve      | Minutes                 | Hours                            |
| Versioning          | Manual                  | Built-in (dataset versioning)    |
| Visualization       | None                    | kedro viz                        |
| Testing             | pytest, set up manually | pytest, scaffolded by template   |
| Configuration       | Code or argparse        | YAML-based                       |
| Data catalog        | Manual                  | Built-in                         |
| Team collaboration  | Ad-hoc                  | Structured                       |
| Best for            | Solo, simple            | Teams, production                |

Performance-wise, both approaches are similar. Kedro adds minimal overhead since it’s orchestration, not computation. The real difference is in maintainability.
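To make "orchestration, not computation" concrete: conceptually, a runner only looks up each step's named inputs, calls the function, and stores the result under the output name. The sketch below is a toy illustration of that idea in plain Python, not Kedro's implementation. run_steps and the tuple layout are invented for this example, the sample rows are made up, and steps are assumed to be listed in dependency order.

```python
import pandas as pd


def clean_data(df):
    return df.dropna().drop_duplicates()


def enrich_data(df, margin_rate):
    df = df.copy()
    df['revenue'] = df['price'] * df['quantity']
    df['margin'] = df['revenue'] * margin_rate
    return df


def run_steps(steps, datasets):
    """Toy runner: resolve inputs by name, call the function, store the output."""
    datasets = dict(datasets)  # leave the caller's mapping untouched
    for func, input_names, output_name in steps:
        args = [datasets[name] for name in input_names]
        datasets[output_name] = func(*args)
    return datasets


# Hypothetical (function, input names, output name) triples
steps = [
    (clean_data, ['raw_data'], 'cleaned_data'),
    (enrich_data, ['cleaned_data', 'margin_rate'], 'enriched_data'),
]

# One duplicate row and one row with a missing price, both removed by clean_data
raw = pd.DataFrame({'price': [10.0, 10.0, None], 'quantity': [2, 2, 1]})
results = run_steps(steps, {'raw_data': raw, 'margin_rate': 0.3})
```

The heavy lifting still happens inside the pandas functions; the runner itself only costs a few dictionary lookups per step. A real framework adds dependency ordering and dataset loading and saving on top of this basic loop.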
Decision Framework: Which Should You Choose?
| Factor                 | Use .pipe()          | Use Kedro              |
|------------------------|----------------------|------------------------|
| Team size              | Solo or 2-3 people   | Multiple developers    |
| Pipeline complexity    | <10 transformations  | Many steps, branches   |
| Production needs       | One-off analysis     | Scheduled, monitored   |
| Data versioning        | Not needed           | Required               |
| Testing                | Informal             | Formal test suite      |
| Stakeholder visibility | Not needed           | Visualization required |

Signs you should migrate from .pipe() to Kedro:
- Pipeline exceeds 10 transformations
- Multiple team members need to modify it
- You need scheduled execution
- Debugging takes longer than writing
- Data lineage questions arise frequently
Signs you should stick with .pipe():
- Quick prototyping or one-off analysis
- Pipeline under 5 steps
- No team collaboration needed
- Jupyter notebook workflow suffices
Common Mistakes to Avoid
I made these mistakes. Learn from them:
- Over-engineering simple pipelines - Don’t introduce Kedro for a 3-step transformation you run once a month.
- Under-engineering critical pipelines - Don’t stick with .pipe() when your team needs reproducibility and versioning.
- Ignoring testing - Both approaches need tests. Neither enforces them automatically.
- Mixing approaches inconsistently - Pick one style per project. Mixing .pipe() scripts with Kedro projects creates confusion.
Alternative: Functional Approach with itertools
If you want a middle ground without framework overhead, Python’s itertools.accumulate offers a functional approach:
from itertools import accumulate

import pandas as pd

def filter_by_price(df):
    return df[(df['price'] >= 10) & (df['price'] <= 500)]

def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df

def sort_by_total(df):
    return df.sort_values('total', ascending=False)

# Functional pipeline
df = pd.read_csv('sales.csv')
transformations = [filter_by_price, calculate_total, sort_by_total]
# accumulate() yields every intermediate frame; the last item is the final result.
# The initial= keyword requires Python 3.8 or later.
df_final = list(accumulate(transformations, lambda x, f: f(x), initial=df))[-1]

This works but lacks parameter passing and is less readable for most Python developers. I prefer .pipe() for its natural syntax.
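If parameter passing is the main gap, functools.partial can bind arguments up front, and functools.reduce is arguably a closer fit than accumulate because it returns only the final result instead of building a list of every intermediate frame. A minimal sketch, reusing the parameterized step functions from the .pipe() section; the sample rows are made up for illustration:

```python
from functools import partial, reduce

import pandas as pd


def filter_by_price(df, min_price=0, max_price=1000):
    return df[(df['price'] >= min_price) & (df['price'] <= max_price)]


def calculate_total(df):
    df = df.copy()
    df['total'] = df['price'] * df['quantity']
    return df


def sort_by_total(df, ascending=False):
    return df.sort_values('total', ascending=ascending)


# partial() binds the keyword arguments so every step takes only the frame
transformations = [
    partial(filter_by_price, min_price=10, max_price=500),
    calculate_total,
    partial(sort_by_total, ascending=False),
]

# Made-up rows standing in for sales.csv
df = pd.DataFrame({'price': [5, 20, 100], 'quantity': [1, 2, 3]})

# reduce() threads the frame through each step and returns only the final one
df_final = reduce(lambda frame, step: step(frame), transformations, df)
```

Even with parameters restored, the chain order is buried in a list and a lambda, which is why .pipe() still reads more naturally.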
Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can email me.
Here are the most important links from this article, along with some further resources on this topic:
- 👨💻 Kedro Official Documentation
- 👨💻 pandas DataFrame.pipe() Documentation
- 👨💻 Reddit Discussion: .pipe() in pandas changed how I write data pipelines
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!