pandas .pipe() vs Variable Assignment: When Simplicity Wins

I recently had a debate with myself about pandas code style. Should I use .pipe() for clean method chaining, or stick with simple variable assignment? After testing both approaches on real data pipelines, I found something unexpected: we sometimes wind up back where we started, and that’s not a failure.

The Core Question

When transforming DataFrames, you have two main options:

Option 1: .pipe() chaining

pipe_approach.py
df_final = (df
    .pipe(filter_by_price, threshold=100)
    .pipe(calculate_total)
    .pipe(sort_by_total)
)

Option 2: Variable assignment

variable_approach.py
# .copy() avoids SettingWithCopyWarning when adding a column to a filtered frame
df_final = df[df['price'] > 100].copy()
df_final['total'] = df_final['price'] * df_final['quantity']
df_final = df_final.sort_values('total', ascending=False)

Both produce identical results. Both are readable to different audiences. The question isn’t which is “better” - it’s which fits your context.

What I Discovered Testing Both

I built the same data pipeline twice. Here’s what happened:

comparison.py
import pandas as pd

# Sample data
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [150, 50, 200, 80],
    'quantity': [10, 20, 5, 15]
})

# Approach 1: Variable assignment
# Direct, visible, no indirection
# (.copy() avoids SettingWithCopyWarning on the column assignment below)
df_final = df[df['price'] > 100].copy()
df_final['total'] = df_final['price'] * df_final['quantity']
df_final = df_final.sort_values('total', ascending=False)

# Approach 2: .pipe() with functions
# Named operations, reusable, composable
def filter_by_price(df, threshold):
    return df[df['price'] > threshold]

def calculate_total(df):
    return df.assign(total=df['price'] * df['quantity'])

def sort_by_total(df, ascending=False):
    return df.sort_values('total', ascending=ascending)

df_final = (df
    .pipe(filter_by_price, threshold=100)
    .pipe(calculate_total)
    .pipe(sort_by_total, ascending=False)
)

# Approach 3: Hybrid - meaningful variable names
# Clear intent, no function overhead
filtered = df[df['price'] > 100]
with_total = filtered.assign(total=filtered['price'] * filtered['quantity'])
sorted_df = with_total.sort_values('total', ascending=False)

The hybrid approach (Approach 3) became my favorite for one-off transformations. Meaningful variable names (filtered, with_total) communicate intent without function overhead.
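The claim that all three approaches give the same result can be checked directly. Below is a minimal sketch, assuming the sample data from comparison.py; the wrapper names run_variables, run_pipe, and run_hybrid are hypothetical and exist only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [150, 50, 200, 80],
    'quantity': [10, 20, 5, 15],
})

def run_variables(data):
    # Variable assignment; .copy() keeps the column assignment warning-free
    out = data[data['price'] > 100].copy()
    out['total'] = out['price'] * out['quantity']
    return out.sort_values('total', ascending=False)

def run_pipe(data):
    # .pipe() chain; lambdas stand in for the named functions
    return (data
        .pipe(lambda d: d[d['price'] > 100])
        .pipe(lambda d: d.assign(total=d['price'] * d['quantity']))
        .pipe(lambda d: d.sort_values('total', ascending=False))
    )

def run_hybrid(data):
    # Hybrid: one meaningful name per step
    filtered = data[data['price'] > 100]
    with_total = filtered.assign(total=filtered['price'] * filtered['quantity'])
    return with_total.sort_values('total', ascending=False)

# Raises AssertionError if any pair of results differs
pd.testing.assert_frame_equal(run_variables(df), run_pipe(df))
pd.testing.assert_frame_equal(run_pipe(df), run_hybrid(df))
```

If either assertion fails, the approaches have drifted apart, which is a useful guard when refactoring from one style to another.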

When Variable Assignment Wins

For exploratory data analysis, visibility trumps abstraction. I want to see each step, inspect intermediate results, and quickly comment out lines to debug:

eda_workflow.py
# EDA workflow - each step is inspected and understood
sales = pd.read_csv('sales.csv')
print(sales.shape)
# Step-by-step with inspection capability
sales = sales.dropna(subset=['customer_id'])
# sales = sales[sales['amount'] > 0] # Maybe skip this?
sales['date'] = pd.to_datetime(sales['date'])
sales['month'] = sales['date'].dt.month
monthly = sales.groupby('month').agg({'amount': ['sum', 'mean', 'count']})
monthly.columns = ['total', 'avg', 'count']
print(monthly.head(10))

I can comment out any step to see its effect. No function definitions to maintain. No jumping between files to understand what process_data() actually does.

When .pipe() Wins

For production pipelines with reusable components, .pipe() pays off. The transformation functions become documented, testable, and reusable:

reusable_pipeline.py
def clean_column_names(df):
    """Standardize column names to snake_case."""
    df = df.copy()  # avoid renaming columns on the caller's DataFrame
    df.columns = (
        df.columns.str.lower()
        .str.replace(' ', '_')
        .str.replace('[^a-z0-9_]', '', regex=True)
    )
    return df

def handle_missing_values(df, strategy='median'):
    """Handle missing values with configurable strategy."""
    df = df.copy()  # avoid filling values in the caller's DataFrame
    numeric_cols = df.select_dtypes(include=['number']).columns
    if strategy == 'median':
        df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    elif strategy == 'drop':
        df = df.dropna(subset=numeric_cols)
    return df

def remove_outliers(df, columns, n_std=3):
    """Remove outliers beyond n standard deviations."""
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        df = df[(df[col] >= mean - n_std * std) & (df[col] <= mean + n_std * std)]
    return df

def prepare_data(df, outlier_strategy=3, missing_strategy='median'):
    """Complete data preparation pipeline."""
    return (df
        .pipe(clean_column_names)
        .pipe(handle_missing_values, strategy=missing_strategy)
        .pipe(remove_outliers, columns=['price', 'quantity'], n_std=outlier_strategy)
    )

# Reuse across contexts (raw_data and test_data are DataFrames loaded elsewhere)
clean_data = prepare_data(raw_data, outlier_strategy=2)
test_clean = prepare_data(test_data, missing_strategy='drop')

Now prepare_data() is a documented, reusable component. I can use it in multiple pipelines with different configurations.
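To make the "testable" part concrete, here is a minimal sketch of a test for one pipeline step. It assumes a remove_outliers function like the one above; the test name and the sample values are hypothetical, and a plain assert stands in for whatever test runner you use.

```python
import pandas as pd

def remove_outliers(df, columns, n_std=3):
    """Remove rows beyond n_std standard deviations in any listed column."""
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        df = df[(df[col] >= mean - n_std * std) & (df[col] <= mean + n_std * std)]
    return df

def test_remove_outliers_drops_extreme_row():
    # 20 ordinary prices plus one extreme value
    df = pd.DataFrame({'price': [10.0] * 20 + [1000.0]})
    result = df.pipe(remove_outliers, columns=['price'], n_std=3)
    assert len(result) == 20               # the extreme row is gone
    assert result['price'].max() == 10.0   # only ordinary rows remain
    assert len(df) == 21                   # the input frame is untouched

test_remove_outliers_drops_extreme_row()
```

Because each step takes a DataFrame and returns a DataFrame, it can be tested in isolation with a tiny hand-built frame, without running the whole pipeline.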

The Philosophical Insight

A Reddit commenter captured something profound: after exploring OO patterns, functional patterns, and chaining patterns, they “wound up essentially back where we started, but with a lot more code to say it.”

Pattern Evolution:
┌─────────────────────────────────────────────────────────────┐
│ Simple code → OO abstraction → Functional → Back to simple │
│ │
│ df[...] → Classes → .pipe() → df[...] │
│ │
│ Full circle. Not failure. Recognition of simplicity. │
└─────────────────────────────────────────────────────────────┘

This isn’t failure. It’s recognition that for many operations, simple variable assignment communicates intent better than layers of abstraction.

For single-line operations, creating a function is often overkill. Why write:

over_engineered.py
def filter_by_price(df, threshold):
    return df[df['price'] > threshold]

df.pipe(filter_by_price, threshold=100)

When this works identically:

simple_alternative.py
df = df[df['price'] > 100]

The function adds extra lines and a layer of indirection for no benefit. You still have to trace back to the function definition to understand what it does.
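If a chain is still preferred for a short transformation, pandas' built-in methods already chain without any custom function. A minimal sketch, assuming the product, price, and quantity columns from the earlier sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [150, 50, 200, 80],
    'quantity': [10, 20, 5, 15],
})

# .loc and .assign accept callables, so each step sees the frame produced
# by the previous step without defining a named function or calling .pipe()
result = (df
    .loc[lambda d: d['price'] > 100]
    .assign(total=lambda d: d['price'] * d['quantity'])
    .sort_values('total', ascending=False)
)
```

This keeps the logic visible at the call site, which is the same property that makes plain variable assignment easy to read.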

Debugging: Both Work Equally

I tested debugging both approaches. Neither has an advantage:

debugging_comparison.py
# Debugging with variables - comment out steps
df_final = df[df['price'] > 100]
# df_final['total'] = ... # Bug here?
print(df_final.head())

# Debugging with .pipe() - add logging function
def debug_step(df, name=""):
    print(f"{name}: shape={df.shape}")
    return df

df_final = (df
    .pipe(debug_step, "Start")
    .pipe(filter_by_price, threshold=100)
    .pipe(debug_step, "After filter")
    .pipe(calculate_total)
)

Both let you inspect intermediate results. Both let you isolate problematic steps. The “pipe debugging advantage” is a myth.

Performance: Negligible Difference

I benchmarked both approaches:

benchmark.py
import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': np.random.rand(10000) * 1000,
    'quantity': np.random.randint(1, 100, 10000)
})

def variable_approach():
    # Kept as benchmarked; may emit SettingWithCopyWarning on older pandas
    result = df[df['price'] > 100]
    result['total'] = result['price'] * result['quantity']
    return result.sort_values('total')

def pipe_approach():
    def filter_price(d): return d[d['price'] > 100]
    def calc_total(d): return d.assign(total=d['price'] * d['quantity'])
    return df.pipe(filter_price).pipe(calc_total).sort_values('total')

# 1000 runs (timeit.timeit returns total seconds)
print(timeit.timeit(variable_approach, number=1000))
print(timeit.timeit(pipe_approach, number=1000))
# Variable: ~480ms
# Pipe: ~490ms
# Difference: <2% - negligible

Don’t choose based on performance. Choose based on readability and maintainability for your specific context.

Decision Framework

┌─────────────────────┬─────────────────────┬─────────────────────┐
│ Factor │ Use Variables │ Use .pipe() │
├─────────────────────┼─────────────────────┼─────────────────────┤
│ Complexity │ Simple (1-2 lines) │ Complex (3+ lines) │
│ Reusability │ One-time │ Across projects │
│ Team style │ Imperative │ Functional │
│ Parameters │ None/minimal │ Multiple params │
│ Functions needed │ Would create 1-2 │ Would create 3+ │
│ Review process │ Quick scan │ Detailed review │
└─────────────────────┴─────────────────────┴─────────────────────┘

Common Mistakes to Avoid

Over-engineering with .pipe(): Creating functions for single-line operations that don’t need abstraction.

Excessive variable proliferation: Using df1, df2, df3 instead of meaningful names like filtered, with_totals.

Assuming readability: What’s readable depends on your team’s familiarity, not the syntax itself.

Forcing patterns: As one commenter noted, it’s “not worth other sacrifices just to force the pipe pattern to work.”

The Pragmatic Middle Ground

My recommendation: mix both approaches strategically:

pragmatic_mix.py
# One-off operations: use variables
# (.copy() avoids SettingWithCopyWarning on the column assignment below)
high_value = df[df['total_value'] > 1000].copy()
high_value['discount'] = high_value['total_value'] * 0.1

# Repeated operations: use .pipe()
def standard_preprocess(df, min_date=None, max_date=None):
    """Standard preprocessing used across pipelines."""
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'])
    if min_date:
        df = df[df['date'] >= min_date]
    if max_date:
        df = df[df['date'] <= max_date]
    return df

q1_data = raw_data.pipe(standard_preprocess, max_date='2024-03-31')
q2_data = raw_data.pipe(standard_preprocess, min_date='2024-04-01')

This hybrid approach gives you reusability where it matters, and simplicity where it suffices.
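The df.copy() at the top of standard_preprocess is what makes it safe to call twice on the same raw_data. The sketch below illustrates this with a small hypothetical raw_data frame (the dates and amounts are made up for illustration), and wraps the bounds in pd.Timestamp to make the comparison explicit.

```python
import pandas as pd

def standard_preprocess(df, min_date=None, max_date=None):
    """Standard preprocessing used across pipelines."""
    df = df.copy()  # work on a copy so the caller's frame is not modified
    df['date'] = pd.to_datetime(df['date'])
    if min_date:
        df = df[df['date'] >= pd.Timestamp(min_date)]
    if max_date:
        df = df[df['date'] <= pd.Timestamp(max_date)]
    return df

# Hypothetical input: dates stored as strings, as they often are after read_csv
raw_data = pd.DataFrame({
    'date': ['2024-01-15', '2024-03-31', '2024-04-01', '2024-06-30'],
    'amount': [100, 200, 300, 400],
})

q1_data = raw_data.pipe(standard_preprocess, max_date='2024-03-31')
q2_data = raw_data.pipe(standard_preprocess, min_date='2024-04-01')

# raw_data still holds the original string dates after both calls
print(raw_data['date'].dtype, q1_data['date'].dtype)
```

Without the copy, the first call would convert the date column on raw_data itself, and later callers would silently receive a frame that was already modified.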

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
