How PostgreSQL Estimates Your Queries (And Why It Sometimes Gets It Wrong)

Every query starts with a plan. Every slow query probably starts with a bad one. And more often than not, the statistics are to blame. But how does it really work?

PostgreSQL doesn’t run the query to find out — it estimates the cost. It reads pre-computed data from pg_class and pg_statistic and does the maths to figure out the cheapest path to your data.

In the ideal scenario, the numbers read are accurate, and you get the plan you expect. But when they’re stale, the situation gets out of control. The planner estimates 500 rows, plans a nested loop, and hits 25,000. What seemed like an optimal plan turns into a cascading failure.

How Do Statistics Get Stale?

It can be a bulk load, a schema migration, faster-than-expected growth, or simply VACUUM not keeping up. Whatever the cause, the result is the same: the planner is flying blind, choosing paths based on a reality that no longer exists.

Let’s dive into the two catalogs the planner depends on, understand what ANALYZE actually gets for you from a 30,000-row table, and see how those numbers determine whether your query takes milliseconds or minutes.

What the Planner Reads

Every decision the planner makes is based on two sources:

  • Table-level metadata from pg_class
  • Column-level metadata from pg_statistic

pg_class — relation-level stats

Every table, index, and materialized view has a row in pg_class. In fact, it tracks all relations: not just tables and indexes, but also partitions, TOAST tables, sequences, composite types, and views. Before it even looks at column-level statistics, the planner reads three key values:

Column        | Meaning
--------------+---------------------------------------------------------
relpages      | Number of 8KB pages representing the table on disk
reltuples     | Estimated number of live rows in the table
relallvisible | Pages where all tuples are visible to all transactions
SELECT relname, relpages, reltuples, relallvisible
FROM pg_class
WHERE relname = 'orders';

 relname | relpages | reltuples | relallvisible
---------+----------+-----------+---------------
 orders  |      856 |    100000 |           856

The planner sees 100,000 rows spread across 856 pages. Every cost estimate starts from those two numbers. relpages drives sequential scan cost — each page is one unit of I/O work as configured via seq_page_cost. reltuples controls estimates for joins, aggregations, and pretty much everything else.

The reltuples value is only an estimate, not a live count. It’s updated by ANALYZE (and autovacuum), not by individual INSERTs or DELETEs. Between ANALYZE runs, PostgreSQL scales reltuples proportionally when relpages value changes — if the table grows by 10% in pages, the planner assumes 10% more rows too.

This works well enough for normal growth but breaks down with bloat: if dead tuples inflate the page count without adding real rows, the planner overestimates the table size.
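The proportional scaling can be sketched in a few lines (an illustration of the idea, not the planner's actual C code):

```python
def estimate_reltuples(stats_reltuples, stats_relpages, current_relpages):
    """Scale the row count from the last ANALYZE by the change in page
    count, mirroring how the planner extrapolates between ANALYZE runs."""
    if stats_relpages == 0:
        return stats_reltuples
    rows_per_page = stats_reltuples / stats_relpages  # density at last ANALYZE
    return rows_per_page * current_relpages

# ANALYZE saw 100,000 rows in 856 pages; the table is now 942 pages:
print(round(estimate_reltuples(100_000, 856, 942)))  # ~110,047 assumed rows
```

This is also where bloat bites: if those extra pages are full of dead tuples rather than new rows, the same formula still assumes the row count grew.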

pg_statistic (via pg_stats) — column-level stats

Knowing the size of a table is only half the picture. To estimate how many rows might match, the planner needs to understand the data inside each column. PostgreSQL maintains statistics in the pg_statistic catalog, though in practice you’ll use the pg_stats view which presents the data in a human-friendly format.

The most interesting values it exposes are:

Statistic         | What it tells the planner
------------------+------------------------------------------------------------------
null_frac         | Fraction of entries that are NULL values
avg_width         | Average width in bytes
n_distinct        | Number of distinct values (negative means fraction of rows)
most_common_vals  | Most frequent values
most_common_freqs | Frequencies of those values
histogram_bounds  | Values dividing the remaining data into equal-population buckets
correlation       | Statistical correlation between physical and logical ordering
SELECT attname, null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs, histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

attname           | status
null_frac         | 0
avg_width         | 8
n_distinct        | 4
most_common_vals  | {pending,shipped,delivered,cancelled}
most_common_freqs | {0.25396666,0.25,0.24973333,0.2463}
histogram_bounds  |
correlation       | 0.2524199

From this, the planner knows there are exactly 4 distinct values, more or less equally distributed, without any NULL values. When you write a predicate WHERE status = 'pending', it will estimate ~25% of rows are going to match. All that without running a query, just by reading the catalog row.
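The arithmetic is easy to reproduce. Using the sampled frequencies above (a sketch of the lookup the planner performs, not its actual code):

```python
reltuples = 100_000
most_common_vals = ["pending", "shipped", "delivered", "cancelled"]
most_common_freqs = [0.25396666, 0.25, 0.24973333, 0.2463]

# WHERE status = 'pending': selectivity comes straight from the MCV list
selectivity = most_common_freqs[most_common_vals.index("pending")]
print(round(reltuples * selectivity))  # ~25,397 estimated rows
```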

Selectivity in Action

Now that we’ve covered what data the planner has available, let’s have a look at how it’s used to estimate how many rows a certain part of a query will read. This “guess” is called Selectivity, and it’s defined as a floating-point number between 0 and 1.

The formula is pretty simple:

Estimated Rows = Total Rows × Selectivity

But the way Selectivity is calculated depends entirely on the operator you use (=, <, > or LIKE).

Equality (Most Common Values)

Equality is easiest to start with. When you use WHERE status = 'shipped', the planner first checks the most_common_vals (MCV) list. If there’s a match, the selectivity is simply the corresponding frequency from most_common_freqs.

If the value isn’t in the list, the planner assumes the value is part of the “remaining” population. It subtracts all MCV frequencies from 1.0 and divides the remainder by the number of other distinct values.
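Both branches can be sketched with hypothetical numbers (mirroring the rule described above; the real implementation also handles negative n_distinct and other corner cases):

```python
def eq_selectivity(value, mcvs, mcv_freqs, n_distinct):
    """Selectivity for WHERE col = value, following the MCV-first rule."""
    if value in mcvs:
        return mcv_freqs[mcvs.index(value)]
    # Not an MCV: spread the remaining frequency mass evenly across
    # the distinct values that are not in the MCV list.
    remaining_freq = 1.0 - sum(mcv_freqs)
    remaining_distinct = n_distinct - len(mcvs)
    return remaining_freq / remaining_distinct

# A column with 10 distinct values, 3 of them common:
mcvs = ["a", "b", "c"]
freqs = [0.5, 0.2, 0.1]
print(eq_selectivity("b", mcvs, freqs, 10))  # 0.2, straight from the MCV list
print(eq_selectivity("x", mcvs, freqs, 10))  # ~0.0286, i.e. 0.2 spread over 7 values
```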

Range lookup (the histogram)

Life would be easy if we only looked for exact values, but most of the time we need range lookups: for example, WHERE amount > 400.

MCVs are useless in this case as there might be thousands or millions of unique constants. This is where histogram_bounds comes in. PostgreSQL divides the column values into a number of buckets, where each bucket contains an equal number of rows (not values).

The Selectivity in this case is determined by how many buckets your query covers. If we have bounds (100, 200, 300, 400, 500, 600), the planner will establish it covers 2 full buckets. Since there are 5 buckets in total, the Selectivity is going to be 0.4 (2/5).

The histogram’s biggest weakness is linear interpolation: within a bucket, the planner assumes values are spread evenly. If your data has massive spikes in distribution, the estimate inside the affected bucket can be far off.
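The bucket-counting logic, including the linear interpolation inside a partially covered bucket, can be sketched like this (hypothetical bounds; simplified relative to PostgreSQL's real histogram code, which also handles values outside the histogram):

```python
def range_selectivity_gt(threshold, bounds):
    """Selectivity for WHERE col > threshold using equal-population
    histogram buckets, interpolating linearly inside the bucket that
    contains the threshold."""
    n_buckets = len(bounds) - 1
    if threshold <= bounds[0]:
        return 1.0
    if threshold >= bounds[-1]:
        return 0.0
    for i in range(n_buckets):
        lo, hi = bounds[i], bounds[i + 1]
        if lo <= threshold < hi:
            # Fraction of this bucket above the threshold, assuming
            # values are spread evenly within the bucket:
            partial = (hi - threshold) / (hi - lo)
            full_buckets_above = n_buckets - i - 1
            return (partial + full_buckets_above) / n_buckets

bounds = [100, 200, 300, 400, 500, 600]
print(range_selectivity_gt(400, bounds))  # 0.4: two of five buckets
print(range_selectivity_gt(450, bounds))  # 0.3: half a bucket plus one full one
```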

Search and pattern matching

This is probably the most treacherous territory for the planner. For substring matching patterns like WHERE note LIKE '%middle%', there’s no histogram or list of values to rely on. The planner must fall back to “magic constants” hardcoded in the PostgreSQL source code.

The default for generic patterns is 0.5% of the total rows, defined as #define DEFAULT_MATCH_SEL 0.005.

A better situation arises for prefixed patterns like WHERE note LIKE 'boringSQL%', which PostgreSQL can rewrite as a range condition and estimate using histogram bounds. The pattern looks only subtly different, but the quality of the estimate is night and day.
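Because the fallback is a fixed fraction, the guess for an unanchored pattern never changes with the data (row count and pattern below are made up for illustration):

```python
DEFAULT_MATCH_SEL = 0.005  # hardcoded fallback for generic patterns

reltuples = 1_000_000

# WHERE note LIKE '%middle%': no MCV list or histogram can help,
# so the planner guesses a flat 0.5% regardless of the actual data:
print(round(reltuples * DEFAULT_MATCH_SEL))  # 5,000 rows

# WHERE note LIKE 'boringSQL%' is different: the planner can treat it
# roughly like note >= 'boringSQL' AND note < 'boringSQM' and estimate
# it from histogram_bounds, producing a data-driven number instead.
```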

Correlation and index scan cost

Remember correlation from pg_stats? It measures how closely the physical order of rows on disk matches the logical order of column values. Values close to ±1.0 mean high correlation; values near 0 mean the data is laid out randomly across pages.

This matters because it determines whether an index scan is worth it. The planner assumes a random page read costs 4× more than a sequential one (random_page_cost = 4.0 vs seq_page_cost = 1.0). When correlation is high, the rows an index points to are physically adjacent. The planner expects sequential I/O and costs the scan cheaply. When correlation is low, each lookup likely hits a different page, and the planner costs each of those reads at the higher random rate.
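The effect can be sketched as an interpolation between the two page costs (a simplification of the real index cost model, which interpolates between best and worst case using the squared correlation):

```python
def index_scan_io_cost(pages_fetched, correlation,
                       seq_page_cost=1.0, random_page_cost=4.0):
    """Rough sketch of how correlation shifts index scan I/O cost
    between the all-sequential best case and the all-random worst case."""
    max_io = pages_fetched * random_page_cost  # correlation near 0
    min_io = pages_fetched * seq_page_cost     # correlation near 1
    csquared = correlation ** 2
    return max_io + csquared * (min_io - max_io)

print(index_scan_io_cost(856, 0.99))  # ~907, close to the cheap sequential cost
print(index_scan_io_cost(856, 0.25))  # ~3,264, close to the expensive random cost
```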

That difference alone can be enough to make a sequential scan cheaper than an index scan.

What If There Are No Statistics?

So far we’ve touched on why up-to-date statistics are a must. But what if there are no statistics at all? For example, for a new table or new column when ANALYZE has not yet run.

In those cases, PostgreSQL falls back to hardcoded defaults:

Condition type          | Default selectivity | Constant
------------------------+---------------------+--------------------------------
Equality (=)            | 0.5%                | DEFAULT_EQ_SEL = 0.005
Range (>, <)            | 33.3%               | DEFAULT_INEQ_SEL = 0.3333
Range (BETWEEN)         | 0.5%                | DEFAULT_RANGE_INEQ_SEL = 0.005
Pattern matching (LIKE) | 0.5%                | DEFAULT_MATCH_SEL = 0.005
IS NULL                 | 0.5%                | DEFAULT_UNK_SEL = 0.005
IS NOT NULL             | 99.5%               | DEFAULT_NOT_UNK_SEL = 0.995
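With no statistics, every estimate is just rows × constant, whatever the data actually looks like (the table size here is hypothetical):

```python
# Hardcoded fallbacks used when a table has never been ANALYZEd:
DEFAULT_EQ_SEL = 0.005
DEFAULT_INEQ_SEL = 0.3333

rows = 2_000_000  # a freshly bulk-loaded table, no statistics yet

print(round(rows * DEFAULT_EQ_SEL))    # 10,000 for WHERE col = ...
print(round(rows * DEFAULT_INEQ_SEL))  # 666,600 for WHERE col > ...
```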

Nothing that a quick ANALYZE can’t fix, right? Or maybe not.

Where No Statistics Go

While this article focuses on statistics and getting them right, there are situations where statistics are never available, or at least not available predictably:

  • CTEs and subqueries when not inlined/materialized have no statistics
  • Temporary tables are not touched by autovacuum, so no automatic ANALYZE
  • Foreign tables do not guarantee stats are propagated
  • And, perhaps surprisingly, computed expressions in WHERE like WHERE amount * 1.1 > 500 or lower(email) = 'hello@example.com', unless you create an expression index or extended statistics

And by the way, did you know that TRUNCATE is a fast way to get rid of the data, but it leaves the old statistics behind? Until the next ANALYZE, the planner keeps estimating against rows that no longer exist.

How ANALYZE Works

As we’ve already mentioned several times, ANALYZE is the only mechanism that populates pg_class and pg_statistic with fresh data. Understanding what it samples, what it computes, and what it misses is key to understanding why statistics are sometimes wrong.

The sampling mechanism

ANALYZE doesn’t read the entire table. It takes a random sample of 300 × default_statistics_target rows, where 300 is the statistically justified minimum derived from the reservoir sampling algorithm it uses.

SHOW default_statistics_target;

 default_statistics_target
---------------------------
 100

With the default target of 100, that’s 30,000 rows. For a 100,000-row orders table, ANALYZE reads about 30% of the data. For a 50-million-row table, it reads 0.06%.
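The arithmetic behind those percentages is straightforward:

```python
def analyze_sample_rows(default_statistics_target=100):
    """Rows ANALYZE samples: 300 times the statistics target."""
    return 300 * default_statistics_target

print(analyze_sample_rows())                   # 30,000 rows at the default target
print(analyze_sample_rows() / 100_000)         # 0.3 of a 100k-row table (30%)
print(analyze_sample_rows() / 50_000_000)      # ~0.0006 of a 50M-row table (0.06%)
```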

The sampling is two-stage. First, ANALYZE selects a random set of pages. Then it reads all live rows from those pages. This gives a representative cross-section without reading every page.

Computing statistics

Once ANALYZE has its 30,000 sample rows (considering the default values), it processes each column independently. The pipeline for a single column looks roughly like this:

First, it counts NULLs and calculates null_frac and avg_width — the cheapest statistics to compute. Then it sorts the non-null values and builds the MCV list by counting duplicates. Values that appear frequently enough make the cut; the rest are passed to the histogram builder, which divides them into equal-population buckets. Finally, because the values are already sorted, ANALYZE compares the logical sort order against the physical tuple positions (which page each row came from) to compute correlation.

The key detail here is that MCVs and histograms are not built from the same pool. Values that land in most_common_vals are excluded from histogram_bounds. This is why you’ll sometimes see a column with MCVs but no histogram, or a histogram but no MCVs. They represent different slices of the same data.

Controlling statistics quality

The default target of 100 works well for most columns. It means up to 100 MCVs, 101 histogram bounds, and a 30,000-row sample. Increasing it helps when:

  • A column has many distinct values and the top 100 don’t cover enough of the distribution
  • Range queries on skewed data produce bad estimates because histogram buckets are too coarse
  • Join estimates are off because n_distinct is inaccurate

The cost scales linearly. Setting it to 1000 means 300,000 sampled rows, up to 1000 MCVs, more catalog storage, and slower planning from larger arrays to search. The maximum is 10,000.

You don’t have to raise it globally. For a single problematic column:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

Now status gets up to 500 MCVs and 501 histogram bounds, while every other column stays at 100.

Extended statistics

Standard statistics treat each column independently. This means the planner can’t know that city = 'Edinburgh' and country = 'UK' are correlated. It multiplies their selectivities independently, potentially underestimating by orders of magnitude.
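The independence assumption is easy to quantify. With made-up but plausible per-column selectivities:

```python
rows = 1_000_000

# Hypothetical single-column selectivities for a UK-heavy dataset:
sel_city = 0.02      # WHERE city = 'Edinburgh'
sel_country = 0.03   # WHERE country = 'UK'

# Without extended statistics, the planner multiplies them:
print(round(rows * sel_city * sel_country))  # 600 estimated rows

# But every Edinburgh row is also a UK row, so the real match rate
# is just the city selectivity:
print(round(rows * sel_city))  # 20,000 actual rows, off by more than 30x
```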

Extended statistics solve this for specific column combinations:

CREATE STATISTICS orders_status_date (dependencies, ndistinct, mcv)
ON status, created_at FROM orders;
ANALYZE orders;

This tells ANALYZE to compute functional dependencies, combined distinct counts, and combined MCVs between the columns. The planner can then use these to avoid the independence assumption for queries filtering on both columns.

The three types of extended statistics serve different purposes:

  • dependencies capture functional dependencies between columns. Helps when knowing one column’s value determines or narrows another’s (e.g. zip_code largely determines city)
  • ndistinct tracks the number of distinct value combinations across columns. Helps with GROUP BY on multiple columns where the planner would otherwise multiply distinct counts independently
  • mcv builds a combined most-common-values list for column tuples. The most powerful but most expensive option. Helps with multi-column WHERE conditions on correlated values

Diagnosing Bad Estimates

When a query is slow, the first question should always be: did the planner estimate correctly? Compare the estimate to reality with EXPLAIN ANALYZE.

Estimates off by a handful of rows mean the statistics are fine. But when you see estimates off by 10× or more, that’s where planning goes wrong. A nested loop that looks cheap for 100 rows becomes a disaster at 10,000.

The statistics tell you what the planner believed, and comparing that with reality tells you what to do next. Either run ANALYZE, consider tuning the statistics target for a specific column, or create extended statistics if multiple columns are involved.

The planner is only as good as what it reads from the catalog. When estimates go wrong, don’t blame the planner. Check the data it’s working with.
