pandas .pipe() vs Variable Assignment: When Simplicity Wins

Apr 30, 2026


I recently had a debate with myself about pandas code style. Should I use .pipe() for clean method chaining, or stick with simple variable assignment? After testing both approaches on real data pipelines, I found something unexpected: we sometimes wind up back where we started, and that’s not a failure.

The Core Question

When transforming DataFrames, you have two main options:

Option 1: .pipe() chaining

df_final = (df
    .pipe(filter_by_price, threshold=100)
    .pipe(calculate_total)
    .pipe(sort_by_total)
)

Option 2: Variable assignment

df_final = df[df['price'] > 100].copy()  # .copy() avoids SettingWithCopyWarning on older pandas
df_final['total'] = df_final['price'] * df_final['quantity']
df_final = df_final.sort_values('total', ascending=False)

Both produce identical results. Both are readable to different audiences. The question isn’t which is “better” - it’s which fits your context.
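One way to check the "identical results" claim is to run both options on a small frame and compare them with pandas.testing.assert_frame_equal. The sketch below reuses the sample frame and helper bodies shown later in this article. The names piped and assigned are only for illustration.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Same sample frame as the one used later in the article
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [150, 50, 200, 80],
    'quantity': [10, 20, 5, 15],
})

def filter_by_price(df, threshold):
    return df[df['price'] > threshold]

def calculate_total(df):
    return df.assign(total=df['price'] * df['quantity'])

def sort_by_total(df, ascending=False):
    return df.sort_values('total', ascending=ascending)

# Option 1: .pipe() chaining
piped = (df
    .pipe(filter_by_price, threshold=100)
    .pipe(calculate_total)
    .pipe(sort_by_total)
)

# Option 2: variable assignment (explicit copy before adding a column)
assigned = df[df['price'] > 100].copy()
assigned['total'] = assigned['price'] * assigned['quantity']
assigned = assigned.sort_values('total', ascending=False)

# Raises AssertionError if values, dtypes, index, or column order differ
assert_frame_equal(piped, assigned)
```

If the comparison passes silently, the two styles really are interchangeable for this pipeline, and the choice comes down to readability.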

What I Discovered Testing Both

I built the same data pipeline twice. Here’s what happened:

import pandas as pd

# Sample data
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [150, 50, 200, 80],
    'quantity': [10, 20, 5, 15]
})

# Approach 1: Variable assignment
# Direct, visible, no indirection
df_final = df[df['price'] > 100].copy()  # .copy() avoids SettingWithCopyWarning on older pandas
df_final['total'] = df_final['price'] * df_final['quantity']
df_final = df_final.sort_values('total', ascending=False)

# Approach 2: .pipe() with functions
# Named operations, reusable, composable
def filter_by_price(df, threshold):
    return df[df['price'] > threshold]

def calculate_total(df):
    return df.assign(total=df['price'] * df['quantity'])

def sort_by_total(df, ascending=False):
    return df.sort_values('total', ascending=ascending)

df_final = (df
    .pipe(filter_by_price, threshold=100)
    .pipe(calculate_total)
    .pipe(sort_by_total, ascending=False)
)

# Approach 3: Hybrid - meaningful variable names
# Clear intent, no function overhead
filtered = df[df['price'] > 100]
with_total = filtered.assign(total=filtered['price'] * filtered['quantity'])
sorted_df = with_total.sort_values('total', ascending=False)

The hybrid approach (Approach 3) became my favorite for one-off transformations. Meaningful variable names (filtered, with_total) communicate intent without function overhead.

When Variable Assignment Wins

For exploratory data analysis, visibility trumps abstraction. I want to see each step, inspect intermediate results, and quickly comment out lines to debug:

# EDA workflow - each step is inspected and understood
sales = pd.read_csv('sales.csv')
print(sales.shape)

# Step-by-step with inspection capability
sales = sales.dropna(subset=['customer_id'])
# sales = sales[sales['amount'] > 0]  # Maybe skip this?
sales['date'] = pd.to_datetime(sales['date'])
sales['month'] = sales['date'].dt.month

monthly = sales.groupby('month').agg({'amount': ['sum', 'mean', 'count']})
monthly.columns = ['total', 'avg', 'count']

print(monthly.head(10))

I can comment out any step to see its effect. No function definitions to maintain. No jumping between files to understand what process_data() actually does.
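As a side note, the aggregate-then-rename pair at the end of that snippet can also be written with pandas named aggregation, which sets the output column names in the same call. A minimal sketch on a tiny inline frame (the values are made up and stand in for sales.csv):

```python
import pandas as pd

# Made-up stand-in for the sales data
sales = pd.DataFrame({
    'amount': [10.0, 20.0, 30.0],
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-10']),
})
sales['month'] = sales['date'].dt.month

# Named aggregation: output_name=(input_column, aggregation)
monthly = sales.groupby('month').agg(
    total=('amount', 'sum'),
    avg=('amount', 'mean'),
    count=('amount', 'count'),
)
print(monthly)
```

This keeps the step on one statement, so it is still easy to comment out or inspect during exploration.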

When .pipe() Wins

For production pipelines with reusable components, .pipe() pays off. The transformation functions become documented, testable, and reusable:

def clean_column_names(df):
    """Standardize column names to snake_case."""
    df = df.copy()  # work on a copy so the caller's column names are not changed
    df.columns = (
        df.columns.str.lower()
        .str.replace(' ', '_')
        .str.replace('[^a-z0-9_]', '', regex=True)
    )
    return df

def handle_missing_values(df, strategy='median'):
    """Handle missing values with configurable strategy."""
    df = df.copy()  # work on a copy so the caller's frame is not modified
    numeric_cols = df.select_dtypes(include=['number']).columns
    if strategy == 'median':
        df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    elif strategy == 'drop':
        df = df.dropna(subset=numeric_cols)
    return df

def remove_outliers(df, columns, n_std=3):
    """Remove outliers beyond n standard deviations."""
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        df = df[(df[col] >= mean - n_std * std) & (df[col] <= mean + n_std * std)]
    return df

def prepare_data(df, outlier_strategy=3, missing_strategy='median'):
    """Complete data preparation pipeline."""
    return (df
        .pipe(clean_column_names)
        .pipe(handle_missing_values, strategy=missing_strategy)
        .pipe(remove_outliers, columns=['price', 'quantity'], n_std=outlier_strategy)
    )

# Reuse across contexts
clean_data = prepare_data(raw_data, outlier_strategy=2)
test_clean = prepare_data(test_data, missing_strategy='drop')

Now prepare_data() is a documented, reusable component. I can use it in multiple pipelines with different configurations.
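To make the "testable" claim concrete, here is a sketch of plain-assert checks for one step. It repeats remove_outliers from above so the block runs on its own. The test function names and input values are hypothetical, and a runner such as pytest would collect functions named this way.

```python
import pandas as pd

def remove_outliers(df, columns, n_std=3):
    """Remove outliers beyond n standard deviations."""
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        df = df[(df[col] >= mean - n_std * std) & (df[col] <= mean + n_std * std)]
    return df

def test_remove_outliers_drops_extreme_row():
    # Ten ordinary prices plus one extreme value (made-up data)
    raw = pd.DataFrame({'price': [10.0] * 10 + [1000.0]})
    cleaned = remove_outliers(raw, columns=['price'], n_std=2)
    assert 1000.0 not in cleaned['price'].values
    assert len(cleaned) == 10

def test_remove_outliers_leaves_input_untouched():
    # The function rebinds df to filtered frames, so the input keeps all rows
    raw = pd.DataFrame({'price': [10.0] * 10 + [1000.0]})
    remove_outliers(raw, columns=['price'], n_std=2)
    assert len(raw) == 11

# Run directly when no test runner is used
test_remove_outliers_drops_extreme_row()
test_remove_outliers_leaves_input_untouched()
```

Each step can be checked in isolation like this, which is harder to do when the same logic lives inline in a notebook cell.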

The Philosophical Insight

A Reddit commenter captured something profound: after exploring OO patterns, functional patterns, and chaining patterns, they “wound up essentially back where we started, but with a lot more code to say it.”

Pattern Evolution:
┌─────────────────────────────────────────────────────────────┐
│  Simple code → OO abstraction → Functional → Back to simple │
│                                                             │
│  df[...] → Classes → .pipe() → df[...]                      │
│                                                             │
│  Full circle. Not failure. Recognition of simplicity.       │
└─────────────────────────────────────────────────────────────┘

This isn’t failure. It’s recognition that for many operations, simple variable assignment communicates intent better than layers of abstraction.

For single-line operations, creating a function is often overkill. Why write:

def filter_by_price(df, threshold):
    return df[df['price'] > threshold]

df = df.pipe(filter_by_price, threshold=100)

When this works identically:

df = df[df['price'] > 100]

For a one-off filter, the function turns one line into three and adds nothing. The reader still has to trace back to the function definition to understand what the step does.
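If a chain is still wanted for a one-line step, a named function is not the only route: .loc accepts a callable, and DataFrame.query takes a string expression. A small sketch with made-up data (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'price': [150, 50, 200, 80], 'quantity': [10, 20, 5, 15]})

# Plain boolean indexing
plain = df[df['price'] > 100]

# Chain-friendly equivalents that need no named helper
via_loc = df.loc[lambda d: d['price'] > 100]
via_query = df.query('price > 100')

# All three select the same rows with the same index and dtypes
assert plain.equals(via_loc) and plain.equals(via_query)
```

Whether these read better than the plain version is again a matter of team familiarity.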

Debugging: Both Work Equally

I tested debugging both approaches. Neither has an advantage:

# Debugging with variables - comment out steps
df_final = df[df['price'] > 100]
# df_final['total'] = ...  # Bug here?
print(df_final.head())

# Debugging with .pipe() - add logging function
def debug_step(df, name=""):
    print(f"{name}: shape={df.shape}")
    return df

df_final = (df
    .pipe(debug_step, "Start")
    .pipe(filter_by_price, threshold=100)
    .pipe(debug_step, "After filter")
    .pipe(calculate_total)
)

Both let you inspect intermediate results. Both let you isolate problematic steps. The “pipe debugging advantage” is a myth.

Performance: Negligible Difference

I benchmarked both approaches:

import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': np.random.rand(10000) * 1000,
    'quantity': np.random.randint(1, 100, 10000)
})

def variable_approach():
    result = df[df['price'] > 100]
    result['total'] = result['price'] * result['quantity']
    return result.sort_values('total')

def pipe_approach():
    def filter_price(d): return d[d['price'] > 100]
    def calc_total(d): return d.assign(total=d['price'] * d['quantity'])
    return df.pipe(filter_price).pipe(calc_total).sort_values('total')

# 1000 runs
# Variable: ~480ms
# Pipe: ~490ms
# Difference: ~2% - negligible

Don’t choose based on performance. Choose based on readability and maintainability for your specific context.
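The timing call itself is not part of the snippet above. One way to reproduce the measurement is timeit.timeit with an explicit number of runs, sketched below. The fixed seed, the run count, and the .copy() in the variable version are my assumptions, not the original setup. Timings depend on machine and pandas version, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (assumption) so reruns use the same data
df = pd.DataFrame({
    'price': rng.random(10_000) * 1000,
    'quantity': rng.integers(1, 100, 10_000),
})

def variable_approach():
    # .copy() added here to avoid SettingWithCopyWarning on older pandas
    result = df[df['price'] > 100].copy()
    result['total'] = result['price'] * result['quantity']
    return result.sort_values('total')

def pipe_approach():
    def filter_price(d): return d[d['price'] > 100]
    def calc_total(d): return d.assign(total=d['price'] * d['quantity'])
    return df.pipe(filter_price).pipe(calc_total).sort_values('total')

# Run count kept small for a quick check; raise it for steadier numbers
runs = 100
variable_s = timeit.timeit(variable_approach, number=runs)
pipe_s = timeit.timeit(pipe_approach, number=runs)

print(f"Variable: {variable_s * 1000:.0f} ms for {runs} runs")
print(f"Pipe:     {pipe_s * 1000:.0f} ms for {runs} runs")
```

Running each measurement several times and comparing the minimum gives a more stable picture than a single pass.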

Decision Framework

┌─────────────────────┬─────────────────────┬─────────────────────┐
│ Factor              │ Use Variables       │ Use .pipe()         │
├─────────────────────┼─────────────────────┼─────────────────────┤
│ Complexity          │ Simple (1-2 lines)  │ Complex (3+ lines)  │
│ Reusability         │ One-time            │ Across projects     │
│ Team style          │ Imperative          │ Functional          │
│ Parameters          │ None/minimal        │ Multiple params     │
│ Functions needed    │ Would create 1-2    │ Would create 3+     │
│ Review process      │ Quick scan          │ Detailed review     │
└─────────────────────┴─────────────────────┴─────────────────────┘

Common Mistakes to Avoid

Over-engineering with .pipe(): Creating functions for single-line operations that don’t need abstraction.

Excessive variable proliferation: Using df1, df2, df3 instead of meaningful names like filtered, with_totals.

Assuming readability: What’s readable depends on your team’s familiarity, not the syntax itself.

Forcing patterns: As one commenter noted, it’s “not worth other sacrifices just to force the pipe pattern to work.”

The Pragmatic Middle Ground

My recommendation: mix both approaches strategically:

# One-off operations: use variables
high_value = df[df['total_value'] > 1000].copy()  # .copy() avoids SettingWithCopyWarning on older pandas
high_value['discount'] = high_value['total_value'] * 0.1

# Repeated operations: use .pipe()
def standard_preprocess(df, min_date=None, max_date=None):
    """Standard preprocessing used across pipelines."""
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'])
    # Compare against None so a bound is only skipped when it was not passed
    if min_date is not None:
        df = df[df['date'] >= min_date]
    if max_date is not None:
        df = df[df['date'] <= max_date]
    return df

q1_data = raw_data.pipe(standard_preprocess, max_date='2024-03-31')
q2_data = raw_data.pipe(standard_preprocess, min_date='2024-04-01')

This hybrid approach gives you reusability where it matters, and simplicity where it suffices.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 pandas DataFrame.pipe documentation

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!