SafeDisk AI

Postgres DiskFullError: Shared Memory and Temp Space

PostgreSQL can raise a disk-full error (SQLSTATE 53100, which asyncpg surfaces as DiskFullError) while the main data disk looks healthy. The usual trap is temp-space or /dev/shm pressure from parallel hash joins, sorts, materialized subplans, or expensive summary endpoints. Treat it as a query-and-memory budget incident, not just a disk cleanup task.

Do not send credentials, database dumps, private logs, or full query text. A public-safe description of the symptom is enough to scope the policy.

$99 Postgres temp-space policy

Turn one DiskFullError cluster into a reusable query and temp-space policy.

Use this when logs show `could not resize shared memory segment`, `No space left on device`, or asyncpg `DiskFullError` while health checks still report normal disk usage.
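
As a rough illustration of that triage, the sketch below sorts an error into "shared memory" versus "temp files or data disk" from its SQLSTATE and message text. The function name, the category labels, and the sample messages in the usage note are illustrative assumptions, not part of any library; the matching is deliberately simple substring matching.

```python
from typing import Optional

DISK_FULL_SQLSTATE = "53100"  # PostgreSQL's disk_full error code


def classify_disk_full(sqlstate: Optional[str], message: str) -> str:
    """Return a coarse category for a disk-full style error (hypothetical helper)."""
    text = message.lower()
    # Dynamic shared memory for parallel queries usually lives in /dev/shm.
    if "could not resize shared memory segment" in text:
        return "shared_memory"
    # Anything else that ran out of space: temp files or the data volume.
    if sqlstate == DISK_FULL_SQLSTATE or "no space left on device" in text:
        return "temp_or_data_disk"
    return "other"
```

For example, `classify_disk_full("53100", 'could not resize shared memory segment "/PostgreSQL.1" to 1024 bytes: No space left on device')` falls into the shared-memory category, which points at /dev/shm sizing and parallel plans rather than the data volume.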

df /dev/shm -> temp file stats -> active queries -> EXPLAIN memory nodes

Read-only evidence

Capture temp files, shared memory, and active query pressure.

These checks avoid table contents. They show whether the incident comes from the data-volume disk, temporary files, Docker/Kubernetes shared-memory limits, or one expensive endpoint plan.

df /dev/shm; pg_stat_database temp_files/temp_bytes; pg_stat_activity; temp file logs
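
A minimal sketch of how two of these checks could be captured from a script, assuming you take one snapshot of `pg_stat_database` before and one after the failing window. The helper names and the snapshot dictionaries are assumptions for illustration; running the SQL is left to whatever client you already use.

```python
import shutil

# Read-only counters for the current database; no table contents involved.
TEMP_STATS_SQL = (
    "SELECT datname, temp_files, temp_bytes "
    "FROM pg_stat_database WHERE datname = current_database()"
)


def filesystem_usage(path: str = "/dev/shm") -> dict:
    """Report total and free bytes for a mount, similar to `df` on that path."""
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total if usage.total else 0.0
    return {"path": path, "total_bytes": usage.total,
            "free_bytes": usage.free, "free_ratio": free_ratio}


def temp_delta(before: dict, after: dict) -> dict:
    """Difference between two pg_stat_database snapshots (hypothetical helper)."""
    delta = {}
    for key in ("temp_files", "temp_bytes"):
        diff = after[key] - before[key]
        # A negative difference means the counters were reset in between,
        # so the later value is the best available lower bound.
        delta[key] = diff if diff >= 0 else after[key]
    return delta
```

A large `temp_bytes` delta with plenty of free space in /dev/shm suggests sort or hash spills to temp files; a nearly full /dev/shm with a small delta suggests parallel-query shared memory instead.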

Runbook: Fix The Query Budget, Not Just The Disk

  1. Do not assume the main data volume is full. Check /dev/shm, container shared-memory size, temp directories, and Postgres temp counters separately.
  2. Find the failing endpoint and query family. Repeated failures on operations_summary or prediction endpoints usually point to one expensive plan, not random storage pressure.
  3. Enable or inspect temp-file logging (log_temp_files). Large temp files identify sort/hash/materialize nodes that need query-plan work.
  4. Estimate concurrent memory pressure. work_mem applies per sort or hash operation in each process, including every parallel worker; raising it globally can make the next incident worse.
  5. Prefer targeted changes: indexes, precomputed summaries, narrower time windows, lower parallelism on the endpoint (max_parallel_workers_per_gather), temp_file_limit, and statement timeouts (statement_timeout). Note that temp_file_limit caps temporary files per process; it does not limit /dev/shm usage.
  6. Make the incident observable: alert on repeated DiskFullError clusters, high temp_bytes delta, /dev/shm free space, and summary endpoint timeout rate.
  7. After a change, run the same query under expected concurrency and confirm temp_bytes, latency, and error count all move in the right direction.
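
The estimate in step 4 can be sketched as simple arithmetic. This is an upper-bound illustration under stated assumptions, not a model of the planner: every memory node is assumed to use its full work_mem at once, the leader process is counted alongside the workers (PostgreSQL's default behavior), and hash nodes are not scaled by hash_mem_multiplier. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryBudget:
    work_mem_mb: float        # work_mem setting, in megabytes
    memory_nodes: int         # sort/hash/materialize nodes in the plan
    parallel_workers: int     # workers planned per query
    concurrent_requests: int  # simultaneous calls to the endpoint
    leader_participates: bool = True

    def worst_case_mb(self) -> float:
        # Each participating process can use work_mem for each memory node.
        processes = self.parallel_workers + (1 if self.leader_participates else 0)
        processes = max(processes, 1)
        return (self.work_mem_mb * self.memory_nodes
                * processes * self.concurrent_requests)


def fits(budget: QueryBudget, available_mb: float, headroom: float = 0.5) -> bool:
    """True if the worst case stays within a fraction of the available memory."""
    return budget.worst_case_mb() <= available_mb * headroom
```

With work_mem at 64 MB, 3 memory nodes, 2 workers, and 10 concurrent requests, the bound is 64 * 3 * 3 * 10 = 5760 MB, which is why a global work_mem increase deserves this check first.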

Copy-ready issue reply

Use this when Postgres says disk full but health says disk is fine.

This keeps the thread focused on evidence: temp-space source, query family, concurrency budget, and acceptance checks.

I would treat the Postgres DiskFullError as a temp-space / shared-memory budget incident first, not as ordinary disk cleanup.

Acceptance checks I would add:
- Capture `df -h /dev/shm /tmp /` next to every DiskFullError cluster.
- Log or query `pg_stat_database.temp_files/temp_bytes` before and after the failing window.
- Identify the exact endpoint/query family that triggers `could not resize shared memory segment`.
- Run `EXPLAIN (ANALYZE, BUFFERS)` on the summary/prediction query and look for hash/sort/materialize nodes plus parallel workers.
- Estimate worst-case memory as `work_mem` * memory nodes * (parallel workers + 1 leader) * concurrent requests before any global `work_mem` increase.
- Add a guard: `statement_timeout` or `temp_file_limit` for the endpoint, plus an alert on repeated DiskFullError and `/dev/shm` low-space.
- Verify the fix by replaying the endpoint and watching temp_bytes, latency, and timeout count.
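
One way to make the EXPLAIN check repeatable is to request the plan with `FORMAT JSON` and walk the tree. The sketch below assumes the plan has already been parsed into Python objects (for example with `json.loads`); the function names and the returned dictionary shape are illustrative assumptions, and the spill detection only looks at disk-based sorts and multi-batch hashes.

```python
from typing import Iterator

# Node types that hold rows in memory and can spill; hashed aggregates are
# reported as "Aggregate" with Strategy "Hashed" and are handled separately.
MEMORY_NODE_TYPES = {"Sort", "Incremental Sort", "Hash", "Materialize"}


def walk(node: dict) -> Iterator[dict]:
    """Yield a plan node and all of its descendants."""
    yield node
    for child in node.get("Plans", []):
        yield from walk(child)


def summarize_plan(explain_json: list) -> dict:
    """Summarize memory-heavy nodes in EXPLAIN (FORMAT JSON) output."""
    root = explain_json[0]["Plan"]
    memory_nodes, spilled, workers = [], [], 0
    for node in walk(root):
        node_type = node.get("Node Type", "")
        hashed_agg = node_type == "Aggregate" and node.get("Strategy") == "Hashed"
        if node_type in MEMORY_NODE_TYPES or hashed_agg:
            memory_nodes.append(node_type)
        # Present only with ANALYZE: a sort that went to disk, or a hash
        # that needed more than one batch.
        if node.get("Sort Space Type") == "Disk" or node.get("Hash Batches", 1) > 1:
            spilled.append(node_type)
        workers = max(workers, node.get("Workers Planned", 0))
    return {"memory_nodes": memory_nodes, "spilled": spilled,
            "workers_planned": workers}
```

The counts of memory nodes and planned workers feed directly into the worst-case estimate in the checklist, and a non-empty `spilled` list is a reasonable thing to assert against when replaying the endpoint after a fix.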

Paid scope

Turn one Postgres DiskFullError cluster into a reusable temp-space policy.

The $99 policy is for production APIs, transit/analytics dashboards, app templates, and internal services where expensive summary queries can exhaust Postgres temp or shared-memory space. You get the evidence checklist, safe settings boundary, query-plan acceptance tests, and alert thresholds for one representative incident.

Do not send credentials, database dumps, full query text, or private logs. A public-safe summary is enough to start.

Do Not Change First