How to Load Large CSV Files Fast in Python

Problem

I had a 2GB CSV file that I needed to load for data analysis. Using Pandas, it took over 3 minutes just to read the file into memory. On larger files, my script would often crash with memory errors or time out in CI/CD pipelines.

The typical Pandas approach:

slow_pandas_load.py
import pandas as pd
import time
start = time.time()
df = pd.read_csv("large_dataset.csv") # 2GB file
print(f"Loaded in {time.time() - start:.2f} seconds")
# Output: Loaded in 187.34 seconds

This was unacceptable for my use case. I needed to process multiple large files daily, and the bottleneck was clear: CSV loading was eating up most of my pipeline runtime.

What Didn’t Work

Before finding the solution, I tried several Pandas “optimizations”:

Attempt 1: Chunking

chunking_attempt.py
import pandas as pd
chunks = []
for chunk in pd.read_csv("large_dataset.csv", chunksize=100000):
    chunks.append(chunk)
df = pd.concat(chunks)

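This lowered peak memory during parsing, but loading got even slower because of the DataFrame concatenation overhead. Since every chunk is kept and concatenated, the full dataset still ends up in memory.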
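Chunking works better when the goal is a summary rather than the full table, because the concatenation step can be skipped entirely. A minimal sketch, assuming `status` and `value` columns like the ones used later in this article; the tiny sample file written here is a hypothetical stand-in for the real dataset:

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for large_dataset.csv.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
pd.DataFrame(
    {
        "id": range(1, 7),
        "status": ["active", "inactive", "active", "active", "inactive", "active"],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
).to_csv(sample_path, index=False)

# Aggregate chunk by chunk instead of collecting chunks for pd.concat.
total_active_value = 0.0
active_rows = 0
for chunk in pd.read_csv(sample_path, chunksize=2):
    active = chunk[chunk["status"] == "active"]
    total_active_value += active["value"].sum()
    active_rows += len(active)

print(active_rows, total_active_value)
```

Only one chunk is alive at a time, so peak memory depends on `chunksize` rather than on the file size. Parsing is still single-threaded, so this does not fix the speed problem.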

Attempt 2: Specifying dtypes

dtype_attempt.py
import pandas as pd
dtypes = {
    "id": "int64",
    "name": "str",
    "value": "float64",
}
df = pd.read_csv("large_dataset.csv", dtype=dtypes)

This saved a bit of memory and type-inference time, but loading was still slow. The default Pandas CSV parser remained single-threaded.

Attempt 3: Using Modin

modin_attempt.py
import modin.pandas as pd
df = pd.read_csv("large_dataset.csv")

Modin parallelized some operations, but CSV loading was still bottlenecked by its internal implementation. Plus, I hit compatibility issues with some of my existing code.

The Solution: Polars

After struggling with these approaches, I discovered Polars - a Rust-based DataFrame library designed for performance from the ground up.

polars_basic.py
import polars as pl
import time
start = time.time()
df = pl.read_csv("large_dataset.csv") # Same 2GB file
print(f"Loaded in {time.time() - start:.2f} seconds")
# Output: Loaded in 23.67 seconds

The same file loaded in under 25 seconds. That’s nearly 8x faster than Pandas.

Why Polars Is Faster

Polars achieves this speed through several key architectural decisions:

  1. Parallel Processing: Polars uses all available CPU cores for CSV parsing, while the default Pandas parser runs on a single thread.

  2. Apache Arrow Memory Model: Zero-copy operations and columnar memory layout reduce overhead.

  3. Efficient Type Inference: Polars infers column types from a sample of rows (the first 100 by default, controlled by infer_schema_length) instead of scanning the whole file.

  4. Memory Efficiency: Fewer intermediate copies during loading.

Lazy Loading for Even Better Performance

For cases where you don’t need all the data, Polars offers lazy evaluation:

polars_lazy.py
import polars as pl
import time
# scan_csv creates a lazy frame - no data loaded yet
lazy_df = pl.scan_csv("large_dataset.csv")
# Build query plan
filtered = lazy_df.filter(pl.col("status") == "active").select(["id", "name", "value"])
# Only load what's needed
start = time.time()
result = filtered.collect()
print(f"Loaded filtered data in {time.time() - start:.2f} seconds")
# Output: Loaded filtered data in 8.42 seconds

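With lazy loading, I can declare the columns and rows I need before any data is materialized. The query optimizer pushes the column selection and the filter down to the scan, so the file is still parsed from disk, but unneeded columns and rows are never built into the DataFrame.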

Streaming for Files Larger Than RAM

When I encountered a 16GB CSV file on a machine with 8GB RAM, I thought I’d need to spin up a larger instance. But Polars streaming saved the day:

polars_streaming.py
import polars as pl
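# Streaming mode processes the file in chunks.
# read_csv has no streaming option; scan lazily, then collect with streaming enabled
# (newer Polars releases spell this collect(engine="streaming"))
df = pl.scan_csv("huge_dataset.csv").collect(streaming=True)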

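This processed the file without running out of memory, though it was slower than in-memory loading. It was still much faster than Pandas chunking approaches. Streaming pays off most when the query reduces the data (filters, aggregations) or writes the result straight to disk, since a fully collected result must still fit in memory.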

Common Mistakes to Avoid

Mistake 1: Using Polars Like Pandas

wrong_polars_usage.py
import polars as pl
# WRONG: Eagerly loading everything
df = pl.read_csv("large_dataset.csv")
df = df.filter(pl.col("status") == "active")
# RIGHT: Use lazy evaluation
df = pl.scan_csv("large_dataset.csv").filter(pl.col("status") == "active").collect()

Mistake 2: Loading All Columns

column_selection.py
import polars as pl
# WRONG: Loading 50 columns when you need 3
df = pl.read_csv("large_dataset.csv")
# RIGHT: Specify columns upfront
df = pl.read_csv("large_dataset.csv", columns=["id", "name", "value"])

Mistake 3: Ignoring Type Specification

type_specification.py
import polars as pl
# Slower: Let Polars infer types
df = pl.read_csv("large_dataset.csv")
# Faster: Specify schema (newer Polars releases name this parameter schema_overrides)
df = pl.read_csv("large_dataset.csv", dtypes={"id": pl.Int64, "name": pl.Utf8})

Benchmark Comparison

I ran tests across several file sizes to quantify the improvement:

benchmark_results.txt
File Size   Pandas Time   Polars Time   Speedup
-----------------------------------------------
100 MB      12.3s         2.1s          5.9x
500 MB      58.7s         7.8s          7.5x
1 GB        124.5s        15.2s         8.2x
2 GB        187.3s        23.7s         7.9x

Memory usage was also significantly lower. For the 2GB file, Pandas peaked at 5.2GB RAM usage, while Polars stayed around 2.8GB.
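If you want to reproduce this kind of comparison on your own files, a small timing helper is enough. This is a hypothetical sketch, not the exact harness behind the table: `time_loader`, `speedup`, and `stdlib_loader` are names made up for illustration, and the stand-in loader uses the standard library so the snippet runs anywhere.

```python
import csv
import os
import tempfile
import time


def time_loader(loader, path, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        loader(path)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def stdlib_loader(path):
    # Stand-in loader; swap in pd.read_csv or pl.read_csv for a real run.
    with open(path, newline="") as f:
        return list(csv.reader(f))


# Hypothetical sample file so the helper has something to read.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", newline="") as f:
    csv.writer(f).writerows([["id", "value"], [1, 10], [2, 20]])

elapsed = time_loader(stdlib_loader, sample_path)
print(f"best of 3: {elapsed:.6f}s")
print(f"2 GB row from the table: {speedup(187.3, 23.7):.1f}x")
```

Taking the best of several runs reduces noise from disk caching and background processes. `time.perf_counter()` is preferable to `time.time()` for measuring intervals.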

When to Stick with Pandas

Polars isn’t always the answer. I still use Pandas for:

  • Small files (under 10MB) where the difference is negligible
  • Quick exploratory analysis with .head() calls
  • Projects with heavy dependencies on Pandas-specific APIs
  • When team members are unfamiliar with Polars syntax

Migration Tips

The syntax is similar enough that migration is straightforward:

pandas_to_polars.py
# Pandas
import pandas as pd
df = pd.read_csv("file.csv")
df_filtered = df[df["status"] == "active"]
df_grouped = df_filtered.groupby("category").sum()
# Polars
import polars as pl
df = pl.read_csv("file.csv")
df_filtered = df.filter(pl.col("status") == "active")
df_grouped = df_filtered.group_by("category").sum()

The main differences to watch for:

  • df["col"] becomes pl.col("col") in expressions
  • groupby() becomes group_by()
  • Method chaining is more idiomatic in Polars

Conclusion

Switching from Pandas to Polars for CSV loading reduced my data pipeline runtime by 70-80%. For a daily ETL job processing 5GB of CSV data, this saved over 15 minutes per run - which adds up to significant infrastructure cost savings over time.

The key optimizations that made the biggest difference:

  1. Use pl.read_csv() for eager loading - fastest option for most cases
  2. Use pl.scan_csv() with lazy evaluation - filter before loading
  3. Use pl.scan_csv() and collect with streaming enabled for files larger than RAM - avoid memory errors
  4. Specify columns and dtypes - skip inference overhead

Polars has become my default choice for any CSV file over 100MB. The performance gains are real, the syntax is intuitive, and the memory efficiency means I can work with larger datasets on smaller machines.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
