
Why is Polars 18x Faster Than Pandas for Large CSV Files?

I was parsing a 4GB CSV file last week. With Polars, the job ran nearly 18x faster than with pandas.

This caught my attention. I’ve been using pandas for years, but I’d heard good things about Polars. I wanted to understand what made the difference so dramatic.

The Baseline: What I Tried First

I started with my usual pandas workflow:

pandas_baseline.py
import pandas as pd
# Reads entire CSV into memory
df = pd.read_csv("large_file.csv")
# Then selects columns and filters
result = df[["user_id", "transaction_amount", "date"]]
result = result[result["transaction_amount"] > 1000]

This took about 45 seconds to load the file, then another 10 seconds for the filtering. Total: 55 seconds.

The Polars Approach

I tried the equivalent in Polars:

polars_baseline.py
import polars as pl
# Lazy scan with projection and predicate pushdown
result = (
    pl.scan_csv("large_file.csv")
    .select(["user_id", "transaction_amount", "date"])  # Only these columns are parsed and kept
    .filter(pl.col("transaction_amount") > 1000)  # Rows are filtered as the file is parsed
    .collect()  # Execute the optimized query
)

This took about 3 seconds total. That’s where the 18x speedup comes from.

Why This Happened

At first, I thought it was just because Polars is written in Rust. But that’s not the main reason.

The key difference is when and how the data gets read.

Lazy Execution vs Eager Execution

Pandas uses eager execution. When you call pd.read_csv(), it reads the entire file into memory immediately. Every subsequent operation works on that in-memory data.

Polars uses lazy execution. When you call pl.scan_csv(), nothing actually happens yet. It builds a query plan instead. The data only gets read when you call .collect().

This might seem like just a different API, but it enables something powerful: query optimization.

Projection Pushdown

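I was only selecting 3 columns out of 50 in my CSV. Pandas, the way I called it, still parses all 50 columns into memory, then discards 47 of them. (pd.read_csv() does have a usecols parameter that avoids this; I didn’t use it in my baseline.)

Polars sees the .select() call and applies projection pushdown. Only those 3 columns are converted into typed values and materialized in memory. Because CSV is a row-oriented text format, the reader still has to scan every line to find the field boundaries, but it skips the type conversion and allocation for the other 47 columns.

If you only need 3 of 50 columns, that’s 94% fewer columns to parse and hold in memory (3/50 = 6%). The number of bytes read from disk stays roughly the same.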
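Here is a plain-Python sketch of what projection looks like during CSV parsing. This is not how Polars is implemented, and read_projected is a hypothetical helper I made up for illustration. The point it shows: every line is still read and split, but only the wanted fields are stored.

```python
import csv
import io


def read_projected(text_stream, wanted):
    """Parse every line, but keep only the columns named in `wanted`."""
    reader = csv.reader(text_stream)
    header = next(reader)
    keep = [header.index(name) for name in wanted]
    columns = {name: [] for name in wanted}
    for row in reader:  # every row is still read and split into fields
        for name, i in zip(wanted, keep):
            columns[name].append(row[i])  # unwanted fields are never stored
    return columns


# Made-up sample data standing in for a large file.
sample = io.StringIO(
    "user_id,region,transaction_amount\n"
    "1,eu,1500\n"
    "2,us,20\n"
)
projected = read_projected(sample, ["user_id", "transaction_amount"])
```

The saving comes from work skipped per field (conversion, allocation), not from bytes skipped on disk.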

Predicate Pushdown

I was also filtering for transactions greater than $1000. Pandas reads every row, loads it into memory, then applies the filter.

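Polars applies predicate pushdown. The filter is evaluated while the file is being parsed, chunk by chunk, so rows that don’t match are dropped right away instead of piling up in a full in-memory table first.

If only 10% of your rows match the filter, roughly 90% of the rows never reach the resulting DataFrame. Every row still has to be read from disk to be tested, so the saving is in memory and downstream work, not in raw I/O.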
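The same idea in plain Python, as a rough sketch rather than a description of Polars internals (scan_filtered is a hypothetical helper): the filter runs inside the read loop, so non-matching rows are discarded as soon as they are parsed.

```python
import csv
import io


def scan_filtered(text_stream, column, threshold):
    """Yield only rows whose numeric `column` exceeds `threshold`."""
    reader = csv.DictReader(text_stream)
    for row in reader:  # each row is still read and parsed
        if float(row[column]) > threshold:
            yield row  # non-matching rows are dropped immediately


# Made-up sample data standing in for a large file.
sample = io.StringIO(
    "user_id,transaction_amount\n"
    "1,1500\n"
    "2,20\n"
    "3,2500\n"
)
matches = list(scan_filtered(sample, "transaction_amount", 1000))
```

Compare this with reading everything into a list first and filtering afterwards: the output is identical, but peak memory is bounded by the matching rows instead of the whole file.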

Apache Arrow Memory Model

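Polars stores data in memory using the Apache Arrow columnar format. Pandas is column-oriented too, but it has traditionally been backed by NumPy arrays, with strings often stored as generic Python objects, which is less compact than Arrow’s typed buffers.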

Arrow stores data column-by-column in contiguous memory. This enables:

  • Zero-copy operations: Operations can work directly on the underlying memory without copying
  • SIMD optimizations: Modern CPUs can apply the same operation to multiple data points simultaneously
  • Better cache locality: Column operations access memory sequentially, which is more cache-friendly
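To make the layout idea concrete, here is a small stdlib-only sketch (not Arrow itself, and the data is made up). It holds the same records once as a list of row dicts and once as one typed, contiguous buffer per column.

```python
from array import array

# Row-oriented: each record is a separate Python object.
rows = [
    {"user_id": 1, "transaction_amount": 1500.0},
    {"user_id": 2, "transaction_amount": 20.0},
    {"user_id": 3, "transaction_amount": 2500.0},
]

# Column-oriented: one contiguous typed buffer per column.
columns = {
    "user_id": array("q", (r["user_id"] for r in rows)),
    "transaction_amount": array("d", (r["transaction_amount"] for r in rows)),
}

# Aggregating a column touches every record in the row layout...
row_total = sum(r["transaction_amount"] for r in rows)
# ...but only one sequential buffer in the column layout.
col_total = sum(columns["transaction_amount"])
```

Both totals are the same. What differs is how much unrelated data the CPU has to walk past to compute them, and that is where cache locality and SIMD come in.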

Multi-threading

Polars is written in Rust and leverages Rust’s threading capabilities. It automatically parallelizes operations that can run independently.

My CSV parsing used all available CPU cores, while pandas mostly ran single-threaded.

When Polars Doesn’t Win

I tested a few more scenarios to understand the boundaries.

Reading All Columns

pandas_all_cols.py
import pandas as pd
df = pd.read_csv("large_file.csv")
# Uses all columns

polars_all_cols.py
import polars as pl
df = pl.scan_csv("large_file.csv").collect()
# Uses all columns

When I read all columns, Polars was about 1.3x faster, not 18x. The projection pushdown benefit disappeared.

Small Datasets

I tested with a 50MB file. Polars was actually slightly slower. The overhead of lazy execution and query planning outweighed the benefits for small datasets.

No Filtering

When I removed the filter and just selected columns, the speedup dropped to about 10x. Predicate pushdown was no longer contributing.

A More Complex Example

I tried a realistic analytics query with grouping and aggregation:

pandas_complex.py
import pandas as pd
# Load everything
df = pd.read_csv("4gb_file.csv")
# Multiple operations require multiple passes
filtered = df[df["category"].isin(["A", "B", "C"])]
grouped = filtered.groupby("category").agg({
    "sales": "mean",
    "quantity": "sum"
})
result = grouped.sort_values("sales", ascending=False)

This took about 50 seconds.

polars_complex.py
import polars as pl
# Query planner optimizes everything into one pass
result = (
    pl.scan_csv("4gb_file.csv")
    .filter(pl.col("category").is_in(["A", "B", "C"]))  # Predicate pushdown
    .group_by("category")
    .agg(
        avg_sales=pl.col("sales").mean(),
        total_quantity=pl.col("quantity").sum()
    )
    .sort("avg_sales", descending=True)
    .collect()  # Single optimized execution
)

This took about 4 seconds.

What I Learned

The performance gap isn’t about one being “better” in general. It’s about what operations you’re doing.

Polars excels when:

  • You’re reading large files (>100MB)
  • You’re selecting only a subset of columns
  • You’re filtering rows early in your pipeline
  • You’re working with remote data (S3, cloud storage)
  • You’re doing analytics operations (aggregations, joins)

Pandas is still fine for:

  • Small datasets (<100MB)
  • Interactive data exploration
  • When you need all columns
  • Complex string operations (pandas has excellent string method implementations)
  • Existing pandas codebases where migration cost outweighs benefits

Performance Comparison

Here’s what I measured:

Operation | Pandas | Polars | Speedup
Read 4GB CSV (all columns) | 45s | 35s | 1.3x
Read 4GB CSV (3 of 50 columns) | 45s | 2.5s | 18x
Read + Filter + Select | 55s | 3s | 18x
Complex query with grouping | 50s | 4s | 12.5x
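If you want to produce numbers like these on your own files, a small timing harness is enough. This is a sketch, not the exact script I used. best_of and speedup are hypothetical helper names, and the workload below is a stand-in you would replace with your own read call.

```python
import time


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time, in seconds, over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workload; swap in e.g. lambda: pd.read_csv("large_file.csv")
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.4f}s")
```

Taking the best of several runs reduces noise from disk caching and other processes. For large files, keep in mind that the first read may be slower than later ones because the operating system caches the file.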

How to Get Started

The easiest way to try Polars is to replace pd.read_csv() with pl.scan_csv():

migration_example.py
import polars as pl
# Instead of:
# df = pd.read_csv("file.csv")
# result = df[df["value"] > 100]
# result = result[["col1", "col2"]]
# Try:
result = (
    pl.scan_csv("file.csv")
    .filter(pl.col("value") > 100)  # Filter first: "value" is not in the final selection
    .select(["col1", "col2"])
    .collect()
)

You can also read only specific columns directly:

column_selection.py
import polars as pl
# Eager read limited to three columns.
# Equivalent to pl.scan_csv(...).select([...]).collect()
df = pl.read_csv(
    "large_file.csv",
    columns=["id", "name", "value"]
)

The Takeaway

The 18x speedup I measured over pandas on a large CSV file doesn’t come from magic. It comes from a handful of architectural choices:

  1. Lazy execution allows query planning before any data is read
  2. Column pruning parses and keeps only the columns you need
  3. Predicate pushdown filters rows while the file is being parsed, not after it is fully loaded
  4. Apache Arrow provides a columnar memory model optimized for analytics
  5. Rust implementation leverages multi-threading and SIMD instructions

The key insight: Polars shines when you’re selective. If you’re loading entire datasets and using all columns, the advantage shrinks dramatically.

Next time you’re working with a large CSV, try Polars. Start with a simple read operation and compare. For the right use case, the performance difference can be dramatic.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
