Why is Polars 18x Faster Than Pandas for Large CSV Files?

Mar 9, 2026

Last week I was parsing a 4GB CSV file, and Polars turned out to be nearly 18x faster than pandas.

This caught my attention. I’ve been using pandas for years, but I’d heard good things about Polars. I wanted to understand what made the difference so dramatic.

The Baseline: What I Tried First

I started with my usual pandas workflow:

import pandas as pd

# Reads entire CSV into memory
df = pd.read_csv("large_file.csv")

# Then selects columns and filters
result = df[["user_id", "transaction_amount", "date"]]
result = result[result["transaction_amount"] > 1000]

This took about 45 seconds to load the file, then another 10 seconds for the filtering. Total: 55 seconds.
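If you want to reproduce this kind of measurement, here’s a minimal sketch of a wall-clock timer. The helper name timed is my own, and the workload below is a stand-in so the snippet runs anywhere; swap in the pandas or Polars call you want to compare.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stand-in workload; replace the lambda with e.g. lambda: pd.read_csv("large_file.csv")
total, seconds = timed(lambda: sum(range(1_000_000)))
print(f"result={total}, took {seconds:.3f}s")
```

A single run is noisy, especially for file reads where the OS page cache warms up after the first pass, so it’s worth repeating each measurement a few times.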

The Polars Approach

I tried the equivalent in Polars:

import polars as pl

# Lazy scan with projection and predicate pushdown
result = (
    pl.scan_csv("large_file.csv")
    .select(["user_id", "transaction_amount", "date"])  # Only these columns are parsed
    .filter(pl.col("transaction_amount") > 1000)        # Filter is applied during the scan
    .collect()  # Execute the optimized query
)

This took about 3 seconds total. That’s where the 18x speedup comes from.

Why This Happened

At first, I thought it was just because Polars is written in Rust. But that’s not the main reason.

The key difference is when and how the data gets read.

Lazy Execution vs Eager Execution

Pandas uses eager execution. When you call pd.read_csv(), it reads the entire file into memory immediately. Every subsequent operation works on that in-memory data.

Polars uses lazy execution. When you call pl.scan_csv(), nothing actually happens yet. It builds a query plan instead. The data only gets read when you call .collect().

This might seem like just a different API, but it enables something powerful: query optimization.
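To make the idea concrete, here’s a toy lazy frame in plain Python. It is only a conceptual sketch and not how Polars is implemented: the LazyCSV class and its methods are hypothetical names of mine. It records the select and filter steps, and only opens the source when collect() is called.

```python
import csv
import io


class LazyCSV:
    """Toy lazy frame: records steps, touches the source only in collect()."""

    def __init__(self, open_source):
        self._open = open_source   # zero-argument callable returning a text stream
        self._columns = None       # projection recorded by select()
        self._predicates = []      # filters recorded by filter()

    def select(self, columns):
        self._columns = list(columns)
        return self

    def filter(self, predicate):
        self._predicates.append(predicate)
        return self

    def collect(self):
        out = []
        with self._open() as stream:
            for row in csv.DictReader(stream):
                # Filters run on the full row first, then the projection is applied.
                if all(p(row) for p in self._predicates):
                    keep = self._columns or list(row)
                    out.append({c: row[c] for c in keep})
        return out


DATA = "user_id,transaction_amount,date\n1,500,2026-01-01\n2,1500,2026-01-02\n"
opens = []


def open_source():
    opens.append(1)            # count how often the source is actually opened
    return io.StringIO(DATA)


plan = (
    LazyCSV(open_source)
    .select(["user_id", "transaction_amount"])
    .filter(lambda r: float(r["transaction_amount"]) > 1000)
)
opened_before_collect = len(opens)   # nothing has been read yet
rows = plan.collect()
print(opened_before_collect, len(opens), rows)
```

Because the whole chain is known before any data is read, the collect step is free to reorder and combine the recorded steps, which is the opening a query optimizer needs.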

Projection Pushdown

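I was only selecting 3 columns out of 50 in my CSV. With a plain pd.read_csv() call, pandas still parses all 50 columns into memory, and I then discard 47 of them. (Passing usecols to pd.read_csv() would avoid this, but that is not what my baseline did.)

Polars sees the .select() call and applies projection pushdown. A CSV is row-oriented text, so the scanner still has to walk every line of the file, but it only parses, type-infers, and allocates memory for those 3 columns. The other 47 are skipped as raw bytes.

If you only need 3 of 50 columns, that’s 94% of the columns never materialized. For a 4GB file, most of the parsing and allocation work simply disappears.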
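One caveat is worth sketching in code: because CSV is row-oriented text, a reader has to pass over every line to find the field boundaries, so the savings from projection come from not converting and storing the unneeded fields. The function below is a simplified stdlib illustration with a hypothetical name (read_projected), not Polars’ actual parser.

```python
import csv
import io


def read_projected(stream, wanted):
    """Scan every line, but store only the wanted columns."""
    reader = csv.reader(stream)
    header = next(reader)
    idx = [header.index(name) for name in wanted]
    columns = {name: [] for name in wanted}
    lines_scanned = 0
    for fields in reader:
        lines_scanned += 1          # every data line is still visited
        for name, i in zip(wanted, idx):
            columns[name].append(fields[i])
        # csv.reader splits all fields; a real engine can skip the
        # unwanted ones without converting or allocating them.
    return columns, lines_scanned


SAMPLE = "user_id,note,transaction_amount\n1,a,500\n2,b,1500\n3,c,990\n"
columns, lines_scanned = read_projected(
    io.StringIO(SAMPLE), ["user_id", "transaction_amount"]
)
print(lines_scanned, columns)
```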

Predicate Pushdown

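I was also filtering for transactions greater than $1000. Pandas parses every row, loads it into memory, then applies the filter.

Polars applies predicate pushdown. It evaluates the filter as part of the scan, so rows that don’t match are dropped while the file is being read rather than after everything has been loaded.

If only 10% of your rows match the filter, about 90% of the rows never make it into the resulting DataFrame or any later step. The file is still read in full; what shrinks is memory use and downstream work.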
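Here’s a small stdlib sketch of filtering during the scan. It streams rows through a generator so that non-matching rows are discarded immediately and never accumulate in memory. The function name scan_filtered and the sample data are assumptions for illustration only.

```python
import csv
import io


def scan_filtered(stream, column, threshold):
    """Yield only rows whose numeric `column` exceeds `threshold`."""
    for row in csv.DictReader(stream):
        if float(row[column]) > threshold:
            yield row               # non-matching rows are dropped right here


# Ten synthetic rows with amounts 200, 400, ..., 2000
lines = ["user_id,transaction_amount"] + [f"{i},{200 * i}" for i in range(1, 11)]
stream = io.StringIO("\n".join(lines))

kept = list(scan_filtered(stream, "transaction_amount", 1000))
print(len(kept), [row["user_id"] for row in kept])
```

The eager alternative would build a list of all ten rows first and filter afterwards, which is the pattern the pandas baseline follows.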

Apache Arrow Memory Model

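Polars uses the Apache Arrow format for its memory representation. This is a columnar format. (Pandas is column-oriented as well, traditionally on top of NumPy arrays, so this is a difference of degree rather than row versus column.)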

Arrow stores data column-by-column in contiguous memory. This enables:

Zero-copy operations: Operations can work directly on the underlying memory without copying
SIMD optimizations: Modern CPUs can apply the same operation to multiple data points simultaneously
Better cache locality: Column operations access memory sequentially, which is more cache-friendly
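To show what column-by-column storage in contiguous memory means, here’s a stdlib sketch using the array module as a stand-in for an Arrow buffer. It is an analogy only; real Arrow arrays also carry validity bitmaps and other metadata.

```python
from array import array

# Row-oriented: a list of (user_id, amount) records
rows = [(1, 250.0), (2, 1500.0), (3, 990.0)]

# Column-oriented: one contiguous typed buffer per column
user_id = array("q", (r[0] for r in rows))   # signed 64-bit integers
amount = array("d", (r[1] for r in rows))    # 64-bit floats

# A column operation walks a single buffer sequentially
total = sum(amount)

# A slice through memoryview shares the buffer instead of copying it
view = memoryview(amount)[1:]
print(total, view[0], view.obj is amount)
```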

Multi-threading

Polars is written in Rust and leverages Rust’s threading capabilities. It automatically parallelizes operations that can run independently.

My CSV parsing used all available CPU cores, while pandas mostly ran single-threaded.

When Polars Doesn’t Win

I tested a few more scenarios to understand the boundaries.

Reading All Columns

import pandas as pd

df = pd.read_csv("large_file.csv")
# Uses all columns

import polars as pl

df = pl.scan_csv("large_file.csv").collect()
# Uses all columns

When I read all columns, Polars was about 1.3x faster, not 18x. The projection pushdown benefit disappeared.

Small Datasets

I tested with a 50MB file. Polars was actually slightly slower. The overhead of lazy execution and query planning outweighed the benefits for small datasets.

No Filtering

When I removed the filter and just selected columns, the speedup dropped to about 10x. Predicate pushdown was no longer contributing.

A More Complex Example

I tried a realistic analytics query with grouping and aggregation:

import pandas as pd

# Load everything
df = pd.read_csv("4gb_file.csv")

# Multiple operations require multiple passes
filtered = df[df["category"].isin(["A", "B", "C"])]
grouped = filtered.groupby("category").agg({
    "sales": "mean",
    "quantity": "sum"
})
result = grouped.sort_values("sales", ascending=False)

This took about 50 seconds.

import polars as pl

# Query planner optimizes everything into one pass
result = (
    pl.scan_csv("4gb_file.csv")
    .filter(pl.col("category").is_in(["A", "B", "C"]))  # Predicate pushdown
    .group_by("category")
    .agg(
        avg_sales=pl.col("sales").mean(),
        total_quantity=pl.col("quantity").sum()
    )
    .sort("avg_sales", descending=True)
    .collect()  # Single optimized execution
)

This took about 4 seconds.

What I Learned

The performance gap isn’t about one being “better” in general. It’s about what operations you’re doing.

Polars excels when:

You’re reading large files (>100MB)
You’re selecting only a subset of columns
You’re filtering rows early in your pipeline
You’re working with remote data (S3, cloud storage)
You’re doing analytics operations (aggregations, joins)

Pandas is still fine for:

Small datasets (<100MB)
Interactive data exploration
When you need all columns
Complex string operations (pandas has excellent string method implementations)
Existing pandas codebases where migration cost outweighs benefits

Performance Comparison

Here’s what I measured:

Operation	Pandas	Polars	Speedup
Read 4GB CSV (all columns)	45s	35s	1.3x
Read 4GB CSV (3 of 50 columns)	45s	2.5s	18x
Read + Filter + Select	55s	3s	18x
Complex query with grouping	50s	4s	12.5x
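The speedup column is simply the pandas time divided by the Polars time. A quick check of that arithmetic, with the timings copied from the table above:

```python
# (pandas seconds, polars seconds), taken from the table
measurements = {
    "Read 4GB CSV (all columns)": (45.0, 35.0),
    "Read 4GB CSV (3 of 50 columns)": (45.0, 2.5),
    "Read + Filter + Select": (55.0, 3.0),
    "Complex query with grouping": (50.0, 4.0),
}

speedups = {
    name: pandas_s / polars_s
    for name, (pandas_s, polars_s) in measurements.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```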

How to Get Started

The easiest way to try Polars is to replace pd.read_csv() with pl.scan_csv():

import polars as pl

# Instead of:
# df = pd.read_csv("file.csv")
# result = df[df["value"] > 100]
# result = result[["col1", "col2"]]

# Try:
result = (
    pl.scan_csv("file.csv")
    .filter(pl.col("value") > 100)  # Filter first: "value" is not kept by the select
    .select(["col1", "col2"])
    .collect()
)

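If you prefer the eager API, you can restrict the columns directly in read_csv():

import polars as pl

# Eager read: read_csv() accepts a columns argument (scan_csv() does not;
# with scan_csv(), use .select() and let projection pushdown do the work)
df = pl.read_csv(
    "large_file.csv",
    columns=["id", "name", "value"]
)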

The Takeaway

Polars achieves 18x speedups over pandas for large CSV files not through magic, but through smart architectural choices:

Lazy execution allows query planning before any data is read
Column pruning parses and stores only the columns you need
Predicate pushdown filters rows during the scan, not after loading
Apache Arrow provides a columnar memory model optimized for analytics
Rust implementation leverages multi-threading and SIMD instructions

The key insight: Polars shines when you’re selective. If you’re loading entire datasets and using all columns, the advantage shrinks dramatically.

Next time you’re working with a large CSV, try Polars. Start with a simple read operation and compare. For the right use case, the performance difference can be dramatic.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.
