When to Use Polars vs DuckDB for Data Analysis
I needed to choose between Polars and DuckDB for a new data analysis project. Both tools promised better performance than pandas, but they work differently. I spent time testing both and understanding their strengths.
The Core Difference
Polars is a DataFrame library first. You use Python code to transform data. DuckDB is a SQL database that runs in your Python process. You write SQL queries to analyze data.
I found this distinction matters more than raw performance numbers. Let me show you what I learned.
When I Chose Polars
I started with a project that needed complex data transformation pipelines. The workflow looked like this:
import polars as pl
# Lazy evaluation - doesn't execute until collect()result = ( pl.scan_csv("sales_data.csv") .filter(pl.col("amount") > 1000) .group_by("category") .agg([ pl.col("amount").sum().alias("total"), pl.col("amount").mean().alias("average"), pl.col("amount").std().alias("std_dev") ]) .sort("total", descending=True) .collect())What impressed me was how Polars optimized this query automatically. It reorganized operations, eliminated redundant computations, and selected the best algorithms. I didn’t need to think about optimization - the lazy API handled it.
Polars shined when I built multi-step transformations:
import polars as pl
# Complex pipeline with conditional logicpipeline = ( pl.scan_parquet("orders.parquet") .with_columns([ (pl.col("price") * pl.col("quantity")).alias("total"), pl.when(pl.col("region") == "US") .then(pl.col("total") * 0.9) # US discount .otherwise(pl.col("total")) .alias("discounted_total") ]) .filter(pl.col("discounted_total") > 100) .group_by("customer_id") .agg([ pl.col("discounted_total").sum().alias("lifetime_value"), pl.col("order_id").count().alias("order_count") ]))This felt natural as a Python developer. I could build transformations step by step, debug intermediate results, and use familiar programming constructs.
When I Switched to DuckDB
Then I worked on a project replicating an OLAP platform locally. The data team already had SQL queries from their production system. Using DuckDB made sense:
import duckdb
con = duckdb.connect()
# Direct SQL on CSV filesresult = con.execute(""" SELECT category, SUM(amount) as total, AVG(amount) as average, STDDEV(amount) as std_dev FROM read_csv('sales_data.csv') WHERE amount > 1000 GROUP BY category ORDER BY total DESC""").df()What I liked about DuckDB:
- No setup required - just connect and query
- Full SQL support including window functions, CTEs, and subqueries
- Familiar syntax if you know SQL
- Great for analytical queries on local datasets
The zero-configuration approach appealed to me. I didn’t need to think about lazy vs eager evaluation or query optimization plans. DuckDB handled everything.
Trying Both Together
I discovered they work well together. I could register a Polars DataFrame in DuckDB and query it with SQL:
import polars as plimport duckdb
# Create data with Polarsdf = pl.DataFrame({ "category": ["A", "B", "A", "B", "C"], "value": [10, 20, 30, 40, 50]})
# Register in DuckDBcon = duckdb.connect()con.register("polars_data", df)
# Query with SQLresult = con.execute(""" SELECT category, SUM(value) as total, AVG(value) as average FROM polars_data GROUP BY category ORDER BY total DESC""").df()This worked with zero data copying - a real performance advantage when working with large datasets.
Polars SQL Interface
I learned that Polars also has SQL capabilities. This surprised me:
import polars as pl
# Create DataFramedf = pl.DataFrame({ "category": ["A", "B", "A", "B"], "value": [10, 20, 30, 40]})
# Create SQL contextctx = pl.SQLContext(data=df)
# Execute SQLresult = ctx.execute(""" SELECT category, SUM(value) as total, AVG(value) as average FROM data GROUP BY category ORDER BY total DESC""").collect()This helped when I had SQL queries but wanted to stay in the Polars ecosystem.
Performance Considerations
I tested both with a 10 million row dataset. Results showed similar performance for analytical queries, but differences emerged based on workflow:
Polars advantages:
- Lazy evaluation optimizes before execution
- Better for complex multi-step transformations
- Zero-copy between operations
- Multi-threading and SIMD support
DuckDB advantages:
- SQL optimization is mature and battle-tested
- Excellent for aggregations and joins
- Vectorized execution engine
- Direct SQL on files without loading into memory
The performance gap wasn’t as important as workflow fit. Both tools were significantly faster than pandas for most operations.
Multi-Language Support
I worked on a project using Python for analysis but needing the same transformations in a Node.js service. Polars made this easy:
const pl = require("nodejs-polars");
// Same API as Pythonconst result = pl .scanCSV("data.csv") .filter(pl.col("value").gt(20)) .groupBy("category") .agg([ pl.col("value").sum().alias("total"), pl.col("value").mean().alias("average") ]) .sort("total", { descending: true }) .collect();DuckDB also supports multiple languages, but the API differs. Polars offered more consistency across languages.
Ibis for Unified Execution
I discovered Ibis, which provides a unified API across multiple backends including Polars and DuckDB:
import ibis
# Use Polars backendcon = ibis.polars.connect({"data": "sales.csv"})
# Switch to DuckDB backendcon = ibis.duckdb.connect("sales.db")
# Same Ibis expressions work with both backendst = con.table("data")result = t.filter(t.amount > 1000).group_by("category").aggregate( total=t.amount.sum(), average=t.amount.mean())This helped when I needed to support multiple execution engines.
My Decision Framework
After extensive testing, here’s how I decide:
Choose Polars when:
- You prefer Python syntax and want to build transformation pipelines
- You need lazy evaluation with automatic optimization
- Your project uses multiple languages
- You want zero-copy between operations
- You’re comfortable with the DataFrame mental model
Choose DuckDB when:
- You’re comfortable with SQL and need full OLAP features
- You’re replicating a production OLAP platform locally
- You have existing SQL queries to port
- You want zero-configuration database functionality
- Your team has strong SQL backgrounds
What I Recommend
Start with what matches your current workflow:
- If you use pandas today and like the DataFrame API, try Polars first. The learning curve is smaller and you get SQL capabilities when needed.
- If you come from a SQL background or work with existing SQL queries, start with DuckDB. The Python integration makes it easy to work with DataFrames.
Both tools are excellent choices. The key is matching the tool to your workflow, not chasing benchmark numbers. They’re also designed to work together, so you can use both in the same project when it makes sense.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Polars Documentation
- 👨💻 DuckDB Documentation
- 👨💻 Ibis Framework
- 👨💻 Reddit Discussion
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments