Litestream LTX Staging Disk Full
A Litestream initial sync can turn a healthy SQLite backup plan into a disk-full outage, because the local LTX staging file can need roughly as much space as the database itself before upload. If the staging write fails silently, the service can still look active while the application sharing the disk starts failing because the SQLite WAL cannot grow.
Prove whether local LTX staging is the disk-full driver.
Capture database size, free bytes, staging directory growth, Litestream version, journal output, and replica upload state before deleting staging files. The core question is whether initial sync needs more free space than the host can provide.
Measure DB size, staging size, and free-space headroom.
These checks avoid database contents. They show whether staging can fit, whether the WAL is blocked by shared disk pressure, and whether operators would see an error from logs alone.
df -h .; du -sh db .db-litestream/ltx/0; journalctl -u litestream -n 300 --no-pager
Runbook: Make Initial Sync Fail Loudly Before Disk Full
- Compute the staging budget before initial sync. If the snapshot builder stages one local LTX file close to database size, free space must exceed database size plus WAL growth and reserve.
- Keep the app database, WAL, and Litestream staging on separate risk budgets when possible. A backup job should not consume the last bytes needed by application writes.
- Make staging write failures visible. ENOSPC should produce an ERROR-level log, a failed health state, and a metric operators can alert on.
- Add a preflight guard before snapshot build that checks database bytes, available bytes, projected staging bytes, reserve bytes, and replica upload progress, and refuses to start when the budget does not fit.
- For recovery, stop Litestream before removing incomplete staging files. Do not touch the SQLite database, WAL, or SHM files as a cleanup shortcut.
- Turn the incident into a policy: max staging bytes, minimum free bytes, alert before reserve breach, and an operator runbook for aborted initial sync.
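The staging budget in the runbook above can be sketched as a small shell guard. This is an illustration under stated assumptions, not a Litestream feature: the function names `staging_preflight` and `preflight_for_db` are hypothetical, the staged LTX file is assumed to approach the database size, the staging directory is assumed to sit on the same filesystem as the database, and the `stat` and `df` flags are the GNU (Linux) forms.

```shell
# Hypothetical helpers; nothing here is a Litestream command or option.

# staging_preflight DB_BYTES WAL_GROWTH_BYTES AVAIL_BYTES RESERVE_BYTES
# Assumes projected staging bytes are about DB_BYTES.
# Prints headroom in bytes; succeeds only when free space exceeds the budget.
staging_preflight() {
  local db_bytes="$1"
  local wal_growth="$2"
  local avail_bytes="$3"
  local reserve_bytes="$4"
  local required
  local headroom
  required=$(( db_bytes + wal_growth + reserve_bytes ))
  headroom=$(( avail_bytes - required ))
  echo "$headroom"
  [ "$headroom" -gt 0 ]
}

# preflight_for_db DB_PATH WAL_GROWTH_BYTES RESERVE_BYTES
# Reads the real database size and free bytes, then applies the budget.
# Returns 2 when a measurement cannot be taken, so the guard fails closed.
preflight_for_db() {
  local db="$1"
  local db_bytes
  local avail_bytes
  db_bytes=$(stat -c %s "$db") || return 2
  avail_bytes=$(df --output=avail -B1 "$(dirname "$db")" | tail -n 1 | tr -d ' ')
  [ -n "$avail_bytes" ] || return 2
  staging_preflight "$db_bytes" "$2" "$avail_bytes" "$3"
}
```

For example, `staging_preflight 1000 200 2000 500` prints `300` and succeeds, while the same call with only 1500 bytes available prints `-200` and fails. The WAL growth and reserve figures are policy inputs that each team has to choose for itself.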
Use this when initial sync silently fills the disk.
This keeps the thread focused on the operational contract: preflight the staging budget, fail loudly on ENOSPC, and protect the app WAL from backup staging pressure.
I would frame this as a missing staging preflight plus missing ENOSPC surfacing.
Read-only evidence I would capture before cleanup:
DB=/path/to/app.db
DB_DIR=$(dirname "$DB")
df -h "$DB_DIR"
df -i "$DB_DIR"
du -sh "$DB" "$DB-wal" "$DB-shm" "$DB_DIR"/.*-litestream/ltx/0 2>/dev/null | sort -h
find "$DB_DIR" -path "*/.*-litestream/ltx/0/*" -type f -size +100M -print 2>/dev/null
journalctl -u litestream -n 300 --no-pager 2>/dev/null || true
For the fix, I would add a preflight check before building the snapshot: database size, available bytes on the staging filesystem, required reserve, and projected staging bytes. If staging writes hit ENOSPC anyway, that should be an ERROR-level event and a failed health state, not a silent active service while the application WAL is starved.
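Until ENOSPC is surfaced by the service itself, a stopgap I would consider is scanning the journal and failing a check when the error appears. The sketch below assumes the log line contains the standard Linux ENOSPC text, "no space left on device"; whether a given Litestream build actually logs that text is something to verify first. `count_enospc` and `enospc_alert` are hypothetical helper names.

```shell
# count_enospc: reads log lines on stdin and prints how many mention ENOSPC.
# grep -c exits non-zero on zero matches, so "|| true" keeps the status clean.
count_enospc() {
  grep -c -i -E 'no space left on device|ENOSPC' || true
}

# enospc_alert: reads log lines on stdin and fails when any ENOSPC line is
# present, so a cron job or systemd timer wrapping it goes red.
enospc_alert() {
  local n
  n=$(count_enospc)
  if [ "$n" -gt 0 ]; then
    echo "ERROR: $n ENOSPC line(s) in litestream log" >&2
    return 1
  fi
}
```

A possible invocation is `journalctl -u litestream --since "-15min" --no-pager | enospc_alert`, run on a schedule that matches the chosen time window.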
Turn one backup staging outage into a storage policy.
The $99 policy is for teams running SQLite apps, Litestream replicas, local snapshot builders, or backup sidecars where staging files share disk with live writes. You get the safe/review/do-not-touch boundary, preflight math, alert thresholds, and recovery runbook for one representative incident.
Do Not Delete First
- The SQLite database, WAL, or SHM files while the app or Litestream may still be writing.
- Incomplete LTX staging files until Litestream is stopped and the owner confirms they are not the only local evidence of the failed sync.
- Replica credentials, bucket contents, or backup metadata. Do not paste any of these into a public issue either.
- Old snapshots or backups without confirming retention, restore point, and upload state.
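The boundary above can be encoded as a guard that runs before any cleanup command. This is a minimal sketch: `staging_cleanup_allowed` is a hypothetical helper, the service is assumed to run as a systemd unit named `litestream`, and owner confirmation is assumed to be passed in as the literal string `yes`.

```shell
# staging_cleanup_allowed SERVICE_STATE OWNER_CONFIRMED
# SERVICE_STATE is the text printed by `systemctl is-active litestream`.
# Succeeds only when the service is not running and the owner has signed off.
staging_cleanup_allowed() {
  local state="$1"
  local confirmed="$2"
  case "$state" in
    inactive|failed) ;;  # service process is not running
    *) echo "refuse: litestream is '$state'" >&2; return 1 ;;
  esac
  if [ "$confirmed" != "yes" ]; then
    echo "refuse: owner has not confirmed" >&2
    return 1
  fi
}
```

One way to use it is `staging_cleanup_allowed "$(systemctl is-active litestream)" "$OWNER_CONFIRMED" && ls -l "$DB_DIR"/.*-litestream/ltx/0`, which only lists the staging files. Any removal stays a separate, deliberate step.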