Postgres DiskFullError Shared Memory Temp Space
PostgreSQL can report a disk-full error (SQLSTATE 53100, which asyncpg raises as DiskFullError) while the main data disk looks healthy. The usual trap is temp-space or /dev/shm pressure from parallel hash joins, sorts, materialized subplans, or expensive summary endpoints. Treat it as a query-and-memory budget incident, not just a disk cleanup task.
Turn one DiskFullError cluster into a reusable query and temp-space policy.
Use this when logs show `could not resize shared memory segment`, `No space left on device`, or asyncpg `DiskFullError` while health checks still report normal disk usage.
df /dev/shm -> temp file stats -> active queries -> EXPLAIN memory nodes
Capture temp files, shared memory, and active query pressure.
These checks avoid table contents. They show whether the incident is data-volume disk, temporary files, Docker/Kubernetes shared memory, or one expensive endpoint plan.
df /dev/shm; pg_stat_database temp; pg_stat_activity; temp file logs
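The temp-file counters in `pg_stat_database` are cumulative, so a single reading says little about one incident. A minimal sketch of the before/after comparison follows. The `TempSnapshot` type and `temp_delta` helper are hypothetical names introduced here for illustration, and the query text is only one way to read the counters. These counters cover temporary files only, so `/dev/shm` usage still has to be captured separately.

```python
from dataclasses import dataclass

# Query to run before and after the failing window (illustrative; it reads
# the cumulative counters for the database the session is connected to).
TEMP_STATS_SQL = """
SELECT temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database()
"""


@dataclass(frozen=True)
class TempSnapshot:
    """One reading of the cumulative temp-file counters (hypothetical type)."""
    temp_files: int
    temp_bytes: int


def temp_delta(before: TempSnapshot, after: TempSnapshot) -> TempSnapshot:
    """Return the temp files and bytes written between two readings."""
    # If a counter went backwards, the statistics were reset in between.
    # In that case the later reading is the best available lower bound.
    if after.temp_bytes < before.temp_bytes or after.temp_files < before.temp_files:
        return after
    return TempSnapshot(
        temp_files=after.temp_files - before.temp_files,
        temp_bytes=after.temp_bytes - before.temp_bytes,
    )
```

Storing the delta next to the timestamp of each error cluster makes it easier to tell a temp-file incident from a shared-memory one: a large delta points at sort or hash spill, while a near-zero delta alongside `could not resize shared memory segment` points at `/dev/shm`.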
Runbook: Fix The Query Budget, Not Just The Disk
- Do not assume the main data volume is full. Check `/dev/shm`, container shared-memory size, temp directories, and Postgres temp counters separately.
- Find the failing endpoint and query family. Repeated `operations_summary` or prediction endpoints usually point to one expensive plan, not random storage pressure.
- Enable or inspect temp-file logging. Large temp files identify sort/hash/materialize nodes that need query-plan work.
- Estimate concurrent memory pressure. `work_mem` applies per operation per worker; raising it globally can make the next incident worse.
- Prefer targeted changes: indexes, precomputed summaries, narrower time windows, lower parallelism on the endpoint, `temp_file_limit`, and statement timeouts.
- Make the incident observable: alert on repeated DiskFullError clusters, high temp_bytes delta, `/dev/shm` free space, and summary endpoint timeout rate.
- After a change, run the same query under expected concurrency and confirm temp_bytes, latency, and error count all move in the right direction.
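One way to keep the targeted changes scoped to a single endpoint is to apply them with `SET LOCAL` inside the transaction that runs the expensive query, so nothing leaks into other sessions. The sketch below only builds the statements. The function name `endpoint_guard_statements` and any values passed to it are hypothetical. It also assumes the connecting role is allowed to change `temp_file_limit`, which normally requires superuser rights or an explicitly granted SET privilege.

```python
def endpoint_guard_statements(
    statement_timeout_ms: int,
    temp_file_limit_kb: int,
    max_parallel_workers_per_gather: int,
) -> list[str]:
    """Build SET LOCAL statements that bound one endpoint's query.

    Hypothetical helper: the settings are real PostgreSQL parameters, but
    the values are for the caller to choose per endpoint.
    """
    settings = {
        "statement_timeout": statement_timeout_ms,  # milliseconds
        "temp_file_limit": temp_file_limit_kb,  # kilobytes
        "max_parallel_workers_per_gather": max_parallel_workers_per_gather,
    }
    statements = []
    for name, value in settings.items():
        # SET does not accept bind parameters, so the value is formatted into
        # the statement; accept only non-negative integers to keep that safe.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
        statements.append(f"SET LOCAL {name} = {value}")
    return statements
```

With asyncpg, these statements would be executed with `conn.execute(...)` inside an `async with conn.transaction():` block, immediately before the summary query. `SET LOCAL` reverts at the end of that transaction, so the rest of the application keeps its defaults.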
Use this when Postgres says disk full but health says disk is fine.
This keeps the thread focused on evidence: temp-space source, query family, concurrency budget, and acceptance checks.
I would treat the Postgres DiskFullError as a temp-space / shared-memory budget incident first, not as ordinary disk cleanup.
Acceptance checks I would add:
- Capture `df -h /dev/shm /tmp /` next to every DiskFullError cluster.
- Log or query `pg_stat_database.temp_files/temp_bytes` before and after the failing window.
- Identify the exact endpoint/query family that triggers `could not resize shared memory segment`.
- Run `EXPLAIN (ANALYZE, BUFFERS)` on the summary/prediction query and look for hash/sort/materialize nodes plus parallel workers.
- Estimate worst-case memory as `work_mem * memory nodes * workers * concurrent requests` before any global `work_mem` increase.
- Add a guard: statement timeout or temp_file_limit for the endpoint, plus an alert on repeated DiskFullError and `/dev/shm` low-space.
- Verify the fix by replaying the endpoint and watching temp_bytes, latency, and timeout count.
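The worst-case estimate from the checklist is simple arithmetic, and writing it down before touching `work_mem` tends to settle the argument. The sketch below is a rough upper bound under stated assumptions, not a model of what the planner will actually allocate. The function name and the example numbers are hypothetical, and counting the leader process as one of the workers is a conservative choice made here. On PostgreSQL 13 and later, hash-based nodes may use `work_mem` multiplied by `hash_mem_multiplier`, so hash-heavy plans can exceed this figure.

```python
def worst_case_memory_mb(
    work_mem_mb: int,
    memory_nodes: int,
    workers: int,
    concurrent_requests: int,
) -> int:
    """Upper-bound estimate: work_mem * memory nodes * workers * requests.

    memory_nodes is the count of sort/hash/materialize nodes seen in EXPLAIN;
    workers is the number of processes running the plan (assumption: include
    the leader to stay conservative).
    """
    factors = (work_mem_mb, memory_nodes, workers, concurrent_requests)
    if any(factor < 1 for factor in factors):
        raise ValueError("all inputs must be at least 1")
    return work_mem_mb * memory_nodes * workers * concurrent_requests


# Hypothetical example: 64 MB work_mem, 3 memory nodes, 4 processes,
# and 10 concurrent requests to the same endpoint.
estimate = worst_case_memory_mb(64, 3, 4, 10)
print(f"{estimate} MB")  # prints "7680 MB"
```

If that figure is larger than the container's memory or its `/dev/shm` allocation, a global `work_mem` increase only moves the failure; the query or its concurrency needs bounding first.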
Turn one Postgres DiskFullError cluster into a reusable temp-space policy.
The $99 policy is for production APIs, transit/analytics dashboards, app templates, and internal services where expensive summary queries can exhaust Postgres temp or shared-memory space. You get the evidence checklist, safe settings boundary, query-plan acceptance tests, and alert thresholds for one representative incident.
Do Not Change First
- Global `work_mem` before estimating per-query nodes, parallel workers, and concurrent request count.
- Postgres data files, WAL, or temp directories before proving the process holding them is stopped or safe to interrupt.
- Container `/dev/shm` size without also bounding the query that can consume it.
- Health checks that only report main disk percentage while shared memory and temp-space pressure remain invisible.
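A health check that covers shared memory and temp space as well as the main disk can be quite small. The following is a minimal sketch using only the Python standard library. The function name `space_report`, the 10% default threshold, and the paths in the usage note are assumptions for illustration; the right mount points depend on where the Postgres process and its temp directories actually live.

```python
import shutil
from pathlib import Path


def space_report(paths, min_free_fraction=0.10):
    """Report free space for each path, not just the main data volume.

    Hypothetical helper: min_free_fraction is an illustrative threshold.
    """
    report = {}
    for path in paths:
        if not Path(path).exists():
            # A missing mount is reported explicitly rather than skipped.
            report[path] = {"present": False}
            continue
        usage = shutil.disk_usage(path)
        free_fraction = usage.free / usage.total if usage.total else 0.0
        report[path] = {
            "present": True,
            "total_bytes": usage.total,
            "free_bytes": usage.free,
            "free_fraction": free_fraction,
            "ok": free_fraction >= min_free_fraction,
        }
    return report
```

Calling `space_report(["/", "/tmp", "/dev/shm"])` from inside the same container as Postgres gives the health endpoint the same three readings as `df -h /dev/shm /tmp /`. Run from a different container, it would measure that container's own mounts instead.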