Shared GPU Scratch Volume Disk Full
When a shared AI training volume reaches 100%, the right answer is rarely "delete the biggest directory." Student homes, active workdirs, model weights, W&B artifacts, checkpoints, and source builds need an owner-approved cleanup boundary before GPU jobs start losing writes.
Separate active experiments from regenerable scratch before deleting anything.
The first policy question is ownership: which paths belong to one user, which are shared infrastructure, which are rebuildable caches, and which are active experiment state that needs explicit approval.
home + workdir + model cache + checkpoint + artifact ownership
Capture owner and reclaimability before cleanup.
This packet is designed for shared homes/workdirs such as /senpai-run/home, /senpai-run/workdirs, Hugging Face caches, W&B runs, checkpoints, and source build outputs.
Evidence to collect: df -h for volume usage, du totals grouped by owner, and find for caches, checkpoints, and artifacts.
Runbook: Recover Without Losing Experiment State
- Stop or gate new writes before cleanup starts. A volume at 95-100% can lose logs, checkpoints, git writes, and W&B flushes while you are measuring it.
- Build an owner table for each large directory: active user, active issue/job, stale candidate, or infrastructure-owned.
- Let users self-delete only their own regenerable caches first: package caches, downloaded model copies, compiled build products, and failed run artifacts they can recreate.
- Keep active workdirs, current checkpoints, final submissions, and experiment logs review-first until the job owner marks them stale.
- Move repeated large writes out of the shared volume: per-pod ephemeral scratch, quotas, cache TTLs, or separate model-cache volumes.
- Add a recurrence guard: warn at 85%, block new large builds at 90%, and require owner approval for shared cleanup at 95%.
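The recurrence guard in the last step can be sketched as a small shell check. This is a minimal sketch, not a finished monitor: the 85/90/95 thresholds come from the runbook, while the function names `usage_tier` and `volume_pct` and the tier labels are hypothetical.

```shell
#!/usr/bin/env bash
# Map an integer usage percentage to a policy tier.
# Thresholds (85/90/95) follow the runbook; the tier labels are assumptions.
usage_tier() {
  local pct="$1"
  if   [ "$pct" -ge 95 ]; then echo "approval-required"
  elif [ "$pct" -ge 90 ]; then echo "block-large-builds"
  elif [ "$pct" -ge 85 ]; then echo "warn"
  else echo "ok"
  fi
}

# Read the current usage of a mount point as an integer percentage.
# Uses the POSIX output format (df -P), where column 5 is the capacity.
volume_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Example, not run automatically:
#   usage_tier "$(volume_pct /senpai-run)"
```

A cron job or a job-submission hook could call `usage_tier` and refuse to start large builds when it prints `block-large-builds`; how that gate is enforced depends on the scheduler in use.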
Use this when a shared GPU workspace is full.
This keeps cleanup focused on ownership and rebuildability instead of deleting the biggest active experiment directory.
I would treat this as a shared-volume ownership problem, not a one-user cleanup problem.
Before deleting anything from student homes or workdirs, I would build a table with:
- owner
- path
- size
- active job / issue
- regenerable cache vs active experiment state
- approved cleanup action
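A skeleton of that table can be generated mechanically and then completed by hand. The sketch below assumes GNU `stat` and one directory per user or job directly under the root; `owner_table` is a hypothetical helper name, and the last three columns are deliberately left as `?` because they need a human decision.

```shell
#!/usr/bin/env bash
# Emit a tab-separated owner table skeleton for each immediate, non-hidden
# subdirectory of the given root. Columns follow the list above; the last
# three are placeholders for the job owner or admin to fill in.
owner_table() {
  local root="$1" dir owner size
  printf 'owner\tpath\tsize_kb\tactive_job\tclass\tapproved_action\n'
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue          # skip the unexpanded glob on empty roots
    dir="${dir%/}"
    owner="$(stat -c '%U' "$dir")"     # GNU stat: owning user name
    size="$(du -sxk "$dir" 2>/dev/null | cut -f1)"
    printf '%s\t%s\t%s\t?\t?\t?\n' "$owner" "$dir" "$size"
  done
}

# Example, not run automatically:
#   owner_table /senpai-run/home > owner-table.tsv
```

The function only reads metadata, so it can run before any write gate or approval is in place.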
Read-only evidence:
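# Filesystem type and space usage for the shared volume and the root filesystem
df -hT /senpai-run /
# Inode usage: a volume can report "full" with free bytes but no free inodes
df -i /senpai-run /
# Largest directories two levels deep, staying on one filesystem
du -xh /senpai-run --max-depth=2 2>/dev/null | sort -h | tail -80
# Cache, checkpoint, artifact, and build directories
find /senpai-run -xdev -maxdepth 5 -type d \( -name ".cache" -o -name "huggingface" -o -name "wandb" -o -name "checkpoints" -o -name "outputs" -o -name "build" \) -print
# Files over 1 GiB with size in bytes and owner, largest last
find /senpai-run -xdev -type f -size +1G -printf "%s %u %p\n" | sort -n | tail -80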
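The large-file listing can also be rolled up into per-owner totals, which is usually the first thing the owner table needs. This is a sketch that assumes input lines of the form `<bytes> <user> <path>`, as produced by the `find -printf "%s %u %p\n"` command; `sum_by_owner` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sum bytes per owner from lines of the form "<bytes> <user> <path>".
# Only the first two fields are read, so spaces in paths are harmless;
# paths containing newlines would be miscounted.
# Output: "<user> <total_bytes>", largest total first.
sum_by_owner() {
  awk '{ total[$2] += $1 }
       END { for (u in total) printf "%s %.0f\n", u, total[u] }' |
    sort -k2,2nr
}

# Example, not run automatically:
#   find /senpai-run -xdev -type f -size +1G -printf "%s %u %p\n" | sum_by_owner
```

The totals are printed with `%.0f` rather than `%d` because some awk implementations clamp `%d` to 32-bit values, which multi-gigabyte sums exceed.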
The safe immediate move is user-owned regenerable cache cleanup plus a write gate. The durable fix is per-user/per-pod scratch isolation or quota/TTL rules so one active build cannot re-fill the shared volume.
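A cache TTL rule can start as a dry run that only lists candidates. The sketch below is an assumption about how such a rule might look, not a prescribed policy: `stale_cache_candidates` is a hypothetical helper, and the TTL value is a choice for the team to make.

```shell
#!/usr/bin/env bash
# List (never delete) regular files under a cache directory whose last
# modification is more than ttl_days days old. Modification time is used
# because access times are often unreliable on volumes mounted with
# noatime or relatime; note that a cached file that is read often but
# never rewritten will still look stale here.
stale_cache_candidates() {
  local cache_dir="$1" ttl_days="$2"
  find "$cache_dir" -xdev -type f -mtime +"$ttl_days" -print
}

# Example, not run automatically:
#   stale_cache_candidates /senpai-run/home/someuser/.cache 30
```

Keeping the helper list-only means the output can be sent to the cache owner for approval before anything is removed.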
Turn one shared-volume outage into a cleanup policy.
The $99 policy is for teams running shared AI/GPU workspaces where one volume holds multiple users, model caches, checkpoints, W&B artifacts, source builds, and active workdirs. You get a safe/review/do-not-touch boundary and a recurrence guard.
Do Not Delete First
- Active workdirs, current checkpoints, and final submissions without owner approval.
- Shared model weights if jobs still reference them and the source is gated or slow to rehydrate.
- W&B/artifact logs before confirming whether they are the only record of a failed experiment.
- Other users' homes or experiment outputs when you are working from an advisor/helper pod that does not have node-wide ownership.
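The safe/review/do-not-touch boundary can be encoded as a first-pass classifier over path names. This is an illustrative sketch: the patterns are assumptions drawn from the directory names mentioned in this page, `classify_path` is a hypothetical helper, and a real policy would also check ownership and active jobs before acting.

```shell
#!/usr/bin/env bash
# Classify a path by name into: safe, review, or do-not-touch.
# Patterns are checked in order, so the most protective match wins.
# Unknown paths default to "review" rather than "safe".
classify_path() {
  case "$1" in
    */checkpoints|*/checkpoints/*|*/workdirs/*)
      echo "do-not-touch" ;;   # active experiment state: owner approval first
    */wandb|*/wandb/*|*/outputs|*/outputs/*|*/huggingface|*/huggingface/*)
      echo "review" ;;         # may be the only record, or slow to rehydrate
    */.cache|*/.cache/*|*/build|*/build/*)
      echo "safe" ;;           # regenerable by the owning user
    *)
      echo "review" ;;
  esac
}

# Example, not run automatically:
#   classify_path /senpai-run/home/someuser/.cache/pip
```

Because workdirs match first, a build directory inside an active workdir is still reported as do-not-touch until its owner marks it stale.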