Why is Polars 18x Faster Than Pandas for Large CSV Files?
I was parsing a 4GB CSV file last week. Polars was nearly 18x faster than using pandas.
This caught my attention. I’ve been using pandas for years, but I’d heard good things about Polars. I wanted to understand what made the difference so dramatic.
The Baseline: What I Tried First
I started with my usual pandas workflow:
import pandas as pd

# Reads the entire CSV into memory
df = pd.read_csv("large_file.csv")

# Then selects columns and filters
result = df[["user_id", "transaction_amount", "date"]]
result = result[result["transaction_amount"] > 1000]

This took about 45 seconds to load the file, then another 10 seconds for the filtering. Total: 55 seconds.
The Polars Approach
I tried the equivalent in Polars:
import polars as pl

# Lazy scan with projection and predicate pushdown
result = (
    pl.scan_csv("large_file.csv")
    .select(["user_id", "transaction_amount", "date"])  # Only these columns are parsed and kept
    .filter(pl.col("transaction_amount") > 1000)  # Non-matching rows are dropped during the scan
    .collect()  # Execute the optimized query
)

This took about 3 seconds total. That’s where the 18x speedup comes from.
Why This Happened
At first, I thought it was just because Polars is written in Rust. But that’s not the main reason.
The key difference is when and how the data gets read.
Lazy Execution vs Eager Execution
Pandas uses eager execution. When you call pd.read_csv(), it reads the entire file into memory immediately. Every subsequent operation works on that in-memory data.
Polars uses lazy execution. When you call pl.scan_csv(), nothing actually happens yet. It builds a query plan instead. The data only gets read when you call .collect().
This might seem like just a different API, but it enables something powerful: query optimization.
Projection Pushdown
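I was only selecting 3 columns out of 50 in my CSV. With a plain pd.read_csv() call, pandas still parses all 50 columns into memory, and I then discard 47 of them.
Polars sees the .select() call and applies projection pushdown. A CSV is row-oriented text, so the reader still has to scan every line to find the field boundaries, but it only parses and allocates memory for those 3 columns. The other 47 are skipped without ever being converted or stored.
If you only need 3 of 50 columns, that’s roughly 94% fewer columns to parse and hold in memory.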
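For a fairer baseline, pandas can also be told to keep only some columns at read time through the `usecols` parameter of `read_csv`. This is a minimal sketch with a made-up in-memory CSV. It was not part of the timings in this article, so treat it as an option to test rather than a measured result.

```python
import io

import pandas as pd

# Made-up CSV with one extra column we do not need.
csv_text = (
    "user_id,transaction_amount,date,notes\n"
    "1,500,2024-01-01,a\n"
    "2,1500,2024-01-02,b\n"
)

# usecols asks the parser to keep only the listed columns.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["user_id", "transaction_amount", "date"],
)
print(list(df.columns))
```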
Predicate Pushdown
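I was also filtering for transactions greater than $1000. Pandas reads every row, loads it into memory, then applies the filter.
Polars applies predicate pushdown. It evaluates the filter as part of the scan, so rows that don’t match are dropped as they are parsed instead of after the whole file has been loaded.
Every row still has to be read to check the condition, but if only 10% of your rows match the filter, roughly 90% of the rows never have to be kept in memory.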
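The idea behind both pushdowns can be sketched with the standard library alone. This is a toy illustration, not how Polars is implemented: the hypothetical `scan_with_pushdown` helper still walks every line of the CSV text, but it keeps only the requested columns of rows that pass the filter, so the unwanted data never accumulates in memory.

```python
import csv
import io


def scan_with_pushdown(csv_file, columns, predicate):
    """Yield only the requested columns of rows that satisfy predicate."""
    reader = csv.DictReader(csv_file)
    for row in reader:  # every row is still read and split once
        if predicate(row):  # predicate applied during the scan
            yield {name: row[name] for name in columns}  # projection


# Made-up sample data.
csv_text = (
    "user_id,transaction_amount,date,notes\n"
    "1,500,2024-01-01,a\n"
    "2,1500,2024-01-02,b\n"
    "3,2500,2024-01-03,c\n"
)

rows = list(
    scan_with_pushdown(
        io.StringIO(csv_text),
        columns=["user_id", "transaction_amount"],
        predicate=lambda r: float(r["transaction_amount"]) > 1000,
    )
)
print(rows)
```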
Apache Arrow Memory Model
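Polars uses the Apache Arrow columnar format for its memory representation. Pandas is also column-oriented, but it has traditionally been backed by NumPy arrays, which are less efficient for strings and missing values.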
Arrow stores data column-by-column in contiguous memory. This enables:
- Zero-copy operations: Operations can work directly on the underlying memory without copying
- SIMD optimizations: Modern CPUs can apply the same operation to multiple data points simultaneously
- Better cache locality: Column operations access memory sequentially, which is more cache-friendly
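To make the layout idea concrete, here is a small standard-library sketch contrasting a row-oriented container (a list of dicts) with a column-oriented one (one typed `array` per column). It only illustrates the storage idea. It does not use Arrow itself, and the data is made up.

```python
from array import array

# Row-oriented: each record is its own object, so one field's values are scattered.
rows = [
    {"user_id": 1, "amount": 500.0},
    {"user_id": 2, "amount": 1500.0},
    {"user_id": 3, "amount": 2500.0},
]
total_from_rows = sum(r["amount"] for r in rows)

# Column-oriented: each column is one contiguous, typed buffer.
columns = {
    "user_id": array("q", [1, 2, 3]),
    "amount": array("d", [500.0, 1500.0, 2500.0]),
}
total_from_columns = sum(columns["amount"])  # walks a single buffer

print(total_from_rows, total_from_columns)
print(columns["amount"].itemsize * len(columns["amount"]), "bytes for the amount column")
```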
Multi-threading
Polars is written in Rust and leverages Rust’s threading capabilities. It automatically parallelizes operations that can run independently.
My CSV parsing used all available CPU cores, while pandas mostly ran single-threaded.
When Polars Doesn’t Win
I tested a few more scenarios to understand the boundaries.
Reading All Columns
import pandas as pd

df = pd.read_csv("large_file.csv")  # Uses all columns

import polars as pl

df = pl.scan_csv("large_file.csv").collect()  # Uses all columns

When I read all columns, Polars was about 1.3x faster, not 18x. The projection pushdown benefit disappeared.
Small Datasets
I tested with a 50MB file. Polars was actually slightly slower. The overhead of lazy execution and query planning outweighed the benefits for small datasets.
No Filtering
When I removed the filter and just selected columns, the speedup dropped to about 10x. Predicate pushdown was no longer contributing.
A More Complex Example
I tried a realistic analytics query with grouping and aggregation:
import pandas as pd

# Load everything
df = pd.read_csv("4gb_file.csv")

# Multiple operations require multiple passes
filtered = df[df["category"].isin(["A", "B", "C"])]
grouped = filtered.groupby("category").agg({
    "sales": "mean",
    "quantity": "sum"
})
result = grouped.sort_values("sales", ascending=False)

This took about 50 seconds.
import polars as pl

# Query planner optimizes everything into one pass
result = (
    pl.scan_csv("4gb_file.csv")
    .filter(pl.col("category").is_in(["A", "B", "C"]))  # Predicate pushdown
    .group_by("category")
    .agg(
        avg_sales=pl.col("sales").mean(),
        total_quantity=pl.col("quantity").sum()
    )
    .sort("avg_sales", descending=True)
    .collect()  # Single optimized execution
)

This took about 4 seconds.
What I Learned
The performance gap isn’t about one being “better” in general. It’s about what operations you’re doing.
Polars excels when:
- You’re reading large files (>100MB)
- You’re selecting only a subset of columns
- You’re filtering rows early in your pipeline
- You’re working with remote data (S3, cloud storage)
- You’re doing analytics operations (aggregations, joins)
Pandas is still fine for:
- Small datasets (<100MB)
- Interactive data exploration
- When you need all columns
- Complex string operations (pandas has excellent string method implementations)
- Existing pandas codebases where migration cost outweighs benefits
Performance Comparison
Here’s what I measured:
| Operation | Pandas | Polars | Speedup |
|---|---|---|---|
| Read 4GB CSV (all columns) | 45s | 35s | 1.3x |
| Read 4GB CSV (3 of 50 columns) | 45s | 2.5s | 18x |
| Read + Filter + Select | 55s | 3s | 18x |
| Complex query with grouping | 50s | 4s | 12.5x |
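The speedup column is simply the pandas time divided by the Polars time. The sketch below recomputes it from the timings in the table and includes a small hypothetical `best_of` helper, based on `time.perf_counter`, that could be used to collect such timings. It is an assumption about a reasonable setup, not the harness used for the numbers above.

```python
import time


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over several runs of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# (pandas seconds, polars seconds) copied from the table above.
measured = {
    "Read 4GB CSV (all columns)": (45.0, 35.0),
    "Read 4GB CSV (3 of 50 columns)": (45.0, 2.5),
    "Read + Filter + Select": (55.0, 3.0),
    "Complex query with grouping": (50.0, 4.0),
}

speedups = {
    name: pandas_s / polars_s for name, (pandas_s, polars_s) in measured.items()
}
for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x")

# Example use of the helper with a stand-in workload.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"stand-in workload: {elapsed:.4f}s")
```

Taking the best of several runs reduces noise from disk caching and background processes, which matters when the file being read is several gigabytes.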
How to Get Started
The easiest way to try Polars is to replace pd.read_csv() with pl.scan_csv():
import polars as pl

# Instead of:
# df = pd.read_csv("file.csv")
# result = df[df["value"] > 100]
# result = result[["col1", "col2"]]

# Try:
result = (
    pl.scan_csv("file.csv")
    .filter(pl.col("value") > 100)  # Filter first: "value" is not among the selected columns
    .select(["col1", "col2"])
    .collect()
)

You can also read only specific columns directly:
import polars as pl

# Eager read of specific columns: read_csv() accepts a columns argument (scan_csv() does not)
df = pl.read_csv(
    "large_file.csv",
    columns=["id", "name", "value"]
)

The Takeaway
Polars reached an 18x speedup over pandas on my large CSV not through magic, but through smart architectural choices:
- Lazy execution allows query planning before any data is read
- Column pruning parses and keeps only the columns you need
- Predicate pushdown filters rows during the scan, not after the full load
- Apache Arrow provides a columnar memory model optimized for analytics
- Rust implementation leverages multi-threading and SIMD instructions
The key insight: Polars shines when you’re selective. If you’re loading entire datasets and using all columns, the advantage shrinks dramatically.
Next time you’re working with a large CSV, try Polars. Start with a simple read operation and compare. For the right use case, the performance difference can be dramatic.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Polars Documentation
- Pandas to Polars Migration Guide
- Polars Lazy API Deep Dive
- Apache Arrow
- SIMD Wikipedia