§POST|~/blog||5 min read

The 1am deploy that only failed in prod

A field note on why a query that ran in 4ms on a seeded dev database stalled the moment it hit real cardinality, and what the planner was actually looking at.


This is a field note from a satus user, anonymized with their permission. The details that matter (schema shape, planner behavior, resolution) are intact. The names and column identifiers are not.

The deploy went out at 1:07am on a Tuesday. It was a two-line change: a new WHERE tenant_id = $1 AND status = 'active' clause on an existing endpoint, plus a composite index to support it. Local ran in 4ms. Staging ran in 6ms. The migration applied in prod in 800ms. The endpoint p99 went from 90ms to 41 seconds within four minutes and the on-call got paged.

The rollback was clean. The interesting part is why every pre-prod signal was green.

What dev looked like

Dev had roughly 1,000 rows in the affected table, generated by a fixture script that predated satus. tenant_id was drawn uniformly at random from 20 tenants. status was drawn uniformly from four values. On that distribution, the new index had ~12 rows per (tenant_id, status) pair. The planner picked an index scan, returned 12 rows in a nested loop against a small parent table, and everything was fast because everything was small.

Staging was a nightly restore of a scrubbed prod snapshot from six months earlier, back when the table had 400k rows. The composite index was still narrow enough on that snapshot that an index scan was cheap. Also fast.

Prod had 51.2M rows. One tenant owned 78% of them. status = 'active' covered 94% of that tenant's rows. The (tenant_id, status) pair the endpoint was called with matched roughly 37 million rows.

What the planner was actually doing

Postgres estimates row counts from pg_statistic, populated by ANALYZE and controlled by default_statistics_target (default 100). For a composite predicate like WHERE a = $1 AND b = $2, the planner multiplies per-column selectivities unless extended statistics tell it the columns are correlated. See the row-estimation walkthrough in the Postgres docs and the planner statistics chapter for the mechanics.

In dev, the estimate and the reality were both "about a dozen rows". Nested loop, index scan, done.

In prod, the estimate was still low (per-column selectivities multiplied to something like 0.05 * 0.25 = 1.25%, which on 51.2M rows is ~640k), but the actual matching set was ~37M. The planner committed to the nested-loop plan the estimate justified, then had to drive 37M outer-loop iterations against the parent table's index. That is the 41-second p99. A hash join over a sequential scan would have finished in under two seconds; the planner never considered it because the estimate said it did not need to.

The abbreviated shape, with representative numbers rounded from the incident's actual EXPLAIN (ANALYZE, BUFFERS):

Nested Loop  (cost=0.86..48210.44 rows=640123 width=84)
             (actual time=0.42..41210.88 rows=37118442 loops=1)
  ->  Index Scan using events_tenant_status_idx on events
        Index Cond: (tenant_id = $1 AND status = 'active')
        rows=640000 (estimate)  vs  rows=37118442 (actual)
  ->  Index Scan using parents_pkey on parents
        Index Cond: (id = events.parent_id)

The rows= estimate being off by ~58× is the whole bug. Once the estimate is wrong by that much, every downstream join method choice is wrong too.

Why seeded dev data hid it

Three separate properties of the fixture worked together to hide the problem:

  1. Uniform distributions. Real tenant data is almost never uniform. One tenant is usually an order of magnitude larger than the median. Uniformly-distributed seed data makes selectivity estimates look accurate on dev because they are accurate on dev.
  2. Small absolute row counts. At 1,000 rows, every plan is fast. Sequential scan, index scan, nested loop, hash join—the wall-clock differences are noise.
  3. No cross-column correlation. In prod, tenant_id and status were correlated: the largest tenant also had the highest active-row ratio. In dev they were independent by construction, which is exactly the assumption the planner already makes without extended statistics. The dev data confirmed the planner's default assumption instead of stress-testing it.

None of these are bugs in the fixture. They are the default behavior of every "generate N rows of random data" script we have ever seen, including the ones satus generated in v0.1. The realism gap is not about volume alone—it is about shape.

What the fix looked like

Two changes shipped the same week:

  • CREATE STATISTICS events_tenant_status (dependencies, ndistinct) ON tenant_id, status FROM events; followed by ANALYZE events;. Extended statistics let the planner see the correlation and produced a hash-join plan on the same query in ~1.4s. The CREATE STATISTICS docs describe the two flavors that matter here.
  • A staging environment restored from a current prod snapshot, not a six-month-old one. This is the change that actually catches the next incident.

The satus-side change was the one that took the rest of the week to design honestly. A profile that generates a heavy-tailed tenant distribution and a correlated status field is not hard to write; the hard part is picking distributions that generalize across schemas without becoming a lie of a different shape. The saas-subscriptions profile, written up here, is the first one to encode heavy-tailed tenant sizes as guidance. It is a specification, not enforcement—the LLM plans against it and the --dry-run validator checks structural conformance, but neither one can guarantee your dev database will trip the same planner branch as prod. What they can do is stop your dev database from confirming a wrong assumption.

The one-line takeaway

Uniformly-distributed seed data is a hypothesis test the planner is guaranteed to pass. If you want to catch the 1am deploy in dev, seed a distribution that could plausibly disagree with the planner's defaults, then run EXPLAIN (ANALYZE, BUFFERS) on the query paths you actually ship.

References

—the satus.sh team


published 2026-07-01 · satus.sh · ·postgres ·planner ·performance ·seeding
← all posts