DuckDB: File Formats and Performance Optimizations

Lately I’ve been working quite a bit with DuckDB, and one of the things that interests me most is understanding how to optimize performance according to the file format we’re using.

Working with Parquet, compressed CSV, and uncompressed CSV are very different experiences, and the performance differences can be dramatic.

Let’s review the key optimizations to keep in mind when working with different file formats in DuckDB.

Parquet: Direct Query or Load First?

DuckDB has advanced Parquet support, including the ability to query Parquet files directly without loading them into the database. But when should you do one or the other?

In Favor of Direct Parquet Query

Basic statistics available: Parquet uses columnar storage and contains basic statistics like zonemaps. This allows DuckDB to apply optimizations like projection pushdown and filter pushdown. Workloads combining projection, filtering, and aggregation work very well on Parquet.
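As a minimal sketch (the file name and columns here are assumptions for illustration), a query of this shape lets DuckDB push both the projection and the filter down into the Parquet reader:

```sql
-- Query the Parquet file directly; no prior load step is needed.
-- DuckDB reads only the referenced columns (projection pushdown)
-- and uses the row-group zonemaps (min/max statistics) to skip
-- row groups that cannot match the filter (filter pushdown).
SELECT user_id, sum(amount) AS total
FROM 'events.parquet'
WHERE event_date >= DATE '2024-01-01'
GROUP BY user_id;
```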

Storage considerations: Loading data from Parquet into DuckDB requires roughly the same amount of disk space again for the database file. If disk space is limited, querying the Parquet files directly is a good option.

Against Direct Parquet Query

Lack of advanced statistics: The DuckDB database format maintains HyperLogLog statistics that Parquet lacks. These improve the accuracy of cardinality estimates, which matters especially in queries with many joins.

Tip: If DuckDB produces a suboptimal join order on Parquet files, try loading the Parquet into DuckDB tables. The improved statistics will help obtain a better join order.
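A sketch of that load step (table and file names are assumptions for illustration):

```sql
-- Materialize the Parquet data as native DuckDB tables.
-- During the load, DuckDB builds its own statistics (including
-- HyperLogLog-based distinct counts), which the optimizer uses
-- to pick a better join order.
CREATE TABLE events AS SELECT * FROM 'events.parquet';
CREATE TABLE users  AS SELECT * FROM 'users.parquet';

-- Multi-join queries now run against the native tables.
SELECT u.country, count(*) AS n
FROM events e
JOIN users u ON e.user_id = u.id
GROUP BY u.country;
```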

Repeated queries: If you plan to run multiple queries over the same dataset, it’s worth loading the data into DuckDB. Each query will run somewhat faster against native tables, amortizing the initial load time.

Row Group Size in Parquet

DuckDB works better with Parquet files that have row groups of 100K-1M rows each.

Why? Because DuckDB can only parallelize Parquet reads across row groups. If a file has a single giant row group, only one thread can process it.
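If you control how the files are written, DuckDB’s `COPY` statement lets you set the row group size explicitly. A sketch (file names are assumptions):

```sql
-- Rewrite a Parquet file with ~500K-row row groups so that
-- subsequent reads can be parallelized across row groups.
COPY (SELECT * FROM 'big_single_group.parquet')
TO 'repartitioned.parquet'
(FORMAT PARQUET, ROW_GROUP_SIZE 500000, COMPRESSION ZSTD);
```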

Recommendations Summary

Parquet

  • ✅ Direct query for projection/filter/aggregation workloads
  • ✅ Load into DuckDB for many joins or repeated queries
  • ✅ Use row groups of 100K-1M rows
  • ✅ Keep files between 100 MB and 10 GB
  • ✅ Prefer Snappy/LZ4/zstd over gzip

CSV

  • ✅ Read .csv.gz directly (DO NOT uncompress first)
  • ✅ Disable sniffer for many small files with same schema
  • ✅ Consider converting to Parquet for analytical workloads
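The first two CSV recommendations can be sketched like this (paths and column names are assumptions for illustration):

```sql
-- Read a gzip-compressed CSV directly; DuckDB decompresses
-- on the fly, so there is no need to gunzip first.
SELECT * FROM 'logs/2024-01-01.csv.gz';

-- For many small files sharing a schema, skip the sniffer by
-- declaring the columns explicitly; this disables auto-detection
-- instead of re-sniffing every file.
SELECT count(*)
FROM read_csv('logs/*.csv.gz',
              header = true,
              columns = {'ts': 'TIMESTAMP',
                         'level': 'VARCHAR',
                         'msg': 'VARCHAR'});
```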

Conclusion

What has struck me most working with DuckDB is that traditional intuitions about data processing sometimes don’t apply.

The .csv.gz case is the perfect example. You’d think uncompressing first would be faster, but the reality is that reading the compressed file directly is faster.

And that’s key when working with large data. Load and process times matter, and understanding these optimizations can make a substantial difference in your data pipeline performance.


Before diving into Iceberg v3’s innovations, it’s crucial to understand the fragmented landscape that existed in geospatial data handling. As Jia Yu, Apache Sedona PMC Chair and Wherobots Co-Founder notes, the final functionality is the result of exhaustive community research that reviewed numerous projects and technologies with geospatial support.