Category: Databases

5 entries found

How PostgreSQL Estimates Your Queries (And Why It Sometimes Gets It Wrong)

11 min read

Every query starts with a plan. Every slow query probably starts with a bad one. And more often than not, the statistics are to blame. But how does the planner actually work?

PostgreSQL doesn’t run the query to find out: it estimates the cost. It reads pre-computed statistics from pg_class and pg_statistic and does the maths to figure out the cheapest path to your data.

When those statistics are accurate, you get the plan you expect. When they’re stale, things unravel: the planner estimates 500 rows, plans a nested loop, and hits 25,000. What looked like an optimal plan turns into a cascading failure.
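To make the arithmetic concrete, here is a simplified sketch of how a row estimate for an equality predicate can be derived from column statistics like those in pg_statistic (a most-common-values list plus n_distinct). The function, table size, and stats are hypothetical illustrations, not PostgreSQL’s actual code:

```python
# Simplified sketch of planner-style row estimation for `col = value`,
# using statistics of the kind stored in pg_statistic.

def estimate_rows(reltuples, mcvs, n_distinct, value):
    """Estimate matching rows from most-common values and n_distinct."""
    if value in mcvs:
        # Known common value: use its recorded frequency directly.
        return reltuples * mcvs[value]
    # Otherwise, spread the leftover frequency evenly across the
    # distinct values not covered by the MCV list.
    remaining_freq = 1.0 - sum(mcvs.values())
    remaining_distinct = n_distinct - len(mcvs)
    return reltuples * remaining_freq / remaining_distinct

# Hypothetical stats: a 1,000,000-row table where the status column
# is 'active' 60% of the time and 'inactive' 30% of the time.
stats = {"active": 0.6, "inactive": 0.3}
print(estimate_rows(1_000_000, stats, 12, "active"))   # a common value
print(estimate_rows(1_000_000, stats, 12, "pending"))  # a rarer value
```

If ANALYZE hasn’t run recently, the inputs to this arithmetic no longer match the table, and the output can be off by orders of magnitude, which is exactly the 500-versus-25,000 scenario above.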

DuckDB: File Formats and Performance Optimizations

3 min read

Lately I’ve been working quite a bit with DuckDB, and one of the things that interests me most is understanding how to optimize performance depending on the file format you’re working with.

Working with Parquet, compressed CSV, or uncompressed CSV are very different propositions, and the performance differences can be dramatic.

Let’s review the key optimizations to keep in mind when working with different file formats in DuckDB.

Parquet: Direct Query or Load First?

DuckDB has advanced Parquet support, including the ability to query Parquet files directly without loading them into the database. But when should you use one approach over the other?

Amazon S3 Vectors: Native vector storage in the cloud

3 min read

Amazon has taken an important step in the world of artificial intelligence with the launch of S3 Vectors, the first cloud storage service with native support for large-scale vectors. This innovation promises to reduce costs by up to 90% for uploading, storing, and querying vector data.

What are vectors and why do we care?

Vectors are numerical representations of unstructured data (text, images, audio, video) generated by embedding models. They are the foundation of generative AI applications that need to find similarities between data using distance metrics.
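As a minimal illustration of a distance metric in action, here is a plain-Python cosine-similarity search over toy 4-dimensional vectors; the vector values and document names are made up, and real embedding models produce hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A query embedding compared against each stored document embedding;
# the most similar document wins.
query = [0.1, 0.9, 0.2, 0.4]
docs = {
    "doc_a": [0.1, 0.8, 0.3, 0.5],
    "doc_b": [0.9, 0.1, 0.7, 0.0],
}
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)
```

Services like S3 Vectors do this kind of comparison at massive scale, with the index and storage managed for you.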

AgentHouse: When databases start speaking our language

5 min read

A few months ago, when Anthropic launched their MCP (Model Context Protocol), I knew we’d see interesting integrations between LLMs and databases. What I didn’t expect was to see something as polished and functional as ClickHouse’s AgentHouse so soon.

I’m planning to test the demo soon, but even just reading about it, the idea of asking a database questions like “What are the most popular GitHub repositories this month?” and getting not just an answer but automatic visualizations seems fascinating.

Apache Iceberg v3: Revolution in Geospatial Data for Modern Analytics

9 min read

The recent ratification of the Apache Iceberg v3 specification marks a significant milestone in the open data ecosystem, especially in the realm of geospatial data. This update not only consolidates Iceberg as the leading standard in open table formats, but also introduces native geospatial capabilities that promise to transform how we handle location and mapping data at scale.

The Challenge of Geospatial Data in the Current Ecosystem

Before diving into Iceberg v3’s innovations, it’s crucial to understand the fragmented landscape that existed in geospatial data handling. As Jia Yu, Apache Sedona PMC Chair and Wherobots Co-Founder notes, the final functionality is the result of exhaustive community research that reviewed numerous projects and technologies with geospatial support.