Shared GPU Scratch Volume Disk Full

When a shared AI training volume reaches 100%, the right answer is rarely "delete the biggest directory." Student homes, active workdirs, model weights, W&B artifacts, checkpoints, and source builds need an owner-approved cleanup boundary before GPU jobs start losing writes.

Free shared-volume decision card

Separate active experiments from regenerable scratch before deleting anything.

The first policy question is ownership: which paths belong to one user, which are shared infrastructure, which are rebuildable caches, and which are active experiment state that needs explicit approval.

home + workdir + model cache + checkpoint + artifact ownership

Read-only evidence

Capture owner and reclaimability before cleanup.

This packet is designed for shared homes and workdirs such as /senpai-run/home and /senpai-run/workdirs, along with Hugging Face caches, W&B runs, checkpoints, and source build outputs on the same volume.

df -h; du by owner; find caches/checkpoints/artifacts
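
The "du by owner" step can be approximated with a single read-only pass over the tree. This is a minimal sketch, assuming GNU find (for -printf) and awk are available; the function name usage_by_owner is hypothetical. It sums apparent file sizes rather than allocated blocks, so sparse files and hard links can skew the totals.

```shell
#!/usr/bin/env bash
# usage_by_owner ROOT: print "owner bytes" per owning user, largest last.
# Read-only: it walks the tree and prints; nothing is modified.
# (Hypothetical helper name; requires GNU find for -printf.)
usage_by_owner() {
  local root="$1"
  # %u = owner name (or numeric uid if unnamed), %s = size in bytes.
  find "$root" -xdev -type f -printf '%u %s\n' 2>/dev/null |
    awk '{ bytes[$1] += $2 }
         END { for (u in bytes) printf "%s %.0f\n", u, bytes[u] }' |
    sort -k2,2n
}

# Example (mount point taken from the text above):
# usage_by_owner /senpai-run
```

Running it once before any cleanup gives a per-user baseline that can be attached to the incident as evidence.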

Runbook: Recover Without Losing Experiment State

Stop or gate new writes before cleanup starts. On a volume at 95-100%, log writes, checkpoints, git writes, and W&B flushes can fail while you are still measuring it.
Build an owner table for each large directory: active user, active issue/job, stale candidate, or infrastructure-owned.
Let users self-delete only their own regenerable caches first: package caches, downloaded model copies, compiled build products, and failed run artifacts they can recreate.
Keep active workdirs, current checkpoints, final submissions, and experiment logs review-first until the job owner marks them stale.
Move repeated large writes off the shared volume, or bound them: per-pod ephemeral scratch, separate model-cache volumes, quotas, or cache TTLs.
Add a recurrence guard: warn at 85%, block new large builds at 90%, and require owner approval for shared cleanup at 95%.
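
The recurrence guard in the last step can be sketched as a small threshold check. This is an illustration only: the function names guard_action and volume_pct and the tier labels are hypothetical, and how each tier is enforced (alerting, scheduler hooks, admission rules) is left open.

```shell
#!/usr/bin/env bash
# guard_action PCT: map an integer usage percentage to the runbook's tiers.
# Tier labels are hypothetical names for the 85/90/95 thresholds above.
guard_action() {
  local pct="$1"
  if   [ "$pct" -ge 95 ]; then echo "require-owner-approval"
  elif [ "$pct" -ge 90 ]; then echo "block-large-builds"
  elif [ "$pct" -ge 85 ]; then echo "warn"
  else                         echo "ok"
  fi
}

# volume_pct MOUNT: current usage of MOUNT as a bare integer percentage.
# Uses POSIX df output, where column 5 of the data row is the capacity.
volume_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Example (mount point taken from the text above):
# guard_action "$(volume_pct /senpai-run)"
```

Keeping the thresholds in one function means the warning, the build block, and the approval gate cannot drift apart if the numbers are tuned later.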

Copy-ready issue reply

Use this when a shared GPU workspace is full.

This keeps cleanup focused on ownership and rebuildability instead of deleting the biggest active experiment directory.

I would treat this as a shared-volume ownership problem, not a one-user cleanup problem.

Before deleting anything from student homes or workdirs, I would build a table with:

- owner
- path
- size
- active job / issue
- regenerable cache vs active experiment state
- approved cleanup action

Read-only evidence:

df -hT /senpai-run /
df -i /senpai-run /
du -xh /senpai-run --max-depth=2 | sort -h | tail -80
find /senpai-run -xdev -maxdepth 5 -type d \( -name ".cache" -o -name "huggingface" -o -name "wandb" -o -name "checkpoints" -o -name "outputs" -o -name "build" \) -print
find /senpai-run -xdev -type f -size +1G -printf "%s %u %p\n" | sort -n | tail -80

The safe immediate move is user-owned regenerable cache cleanup plus a write gate. The durable fix is per-user/per-pod scratch isolation or quota/TTL rules so one active build cannot re-fill the shared volume.
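
A starter version of the owner table described in the reply can be generated read-only and then completed by hand. This is a minimal sketch, assuming GNU stat and du; the function name owner_table and the column names are hypothetical. It only looks at non-hidden immediate subdirectories, and the directory owner is not necessarily the owner of everything inside it, so the last three columns are deliberately left as "?" for a human to fill in.

```shell
#!/usr/bin/env bash
# owner_table ROOT: print a tab-separated starter table, one row per
# immediate (non-hidden) subdirectory of ROOT. Read-only.
# Columns that need human judgment are emitted as "?".
owner_table() {
  local root="$1" dir owner size
  printf 'owner\tpath\tsize_kb\tactive_job\tclass\tapproved_action\n'
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue            # glob matched nothing
    dir="${dir%/}"                       # drop the trailing slash
    owner=$(stat -c '%U' "$dir")         # GNU stat: owner name of the dir
    size=$(du -sxk "$dir" 2>/dev/null | cut -f1)
    printf '%s\t%s\t%s\t?\t?\t?\n' "$owner" "$dir" "$size"
  done
}

# Example (paths taken from the text above):
# owner_table /senpai-run/home
# owner_table /senpai-run/workdirs
```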

Paid scope

Turn one shared-volume outage into a cleanup policy.

The $99 policy is for teams running shared AI/GPU workspaces where one volume holds multiple users, model caches, checkpoints, W&B artifacts, source builds, and active workdirs. You get a safe/review/do-not-touch boundary and a recurrence guard.

Do Not Delete First

Active workdirs, current checkpoints, and final submissions without owner approval.
Shared model weights if jobs still reference them and the source is gated or slow to rehydrate.
W&B/artifact logs before confirming whether they are the only record of a failed experiment.
Other users' homes or experiment outputs, when you are working from an advisor/helper pod that does not have node-wide ownership.
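
One rough, read-only signal for "active" is whether anything under a directory was written recently. This is a heuristic sketch, assuming GNU find; the function name recently_written, the 24-hour default, and the example path are hypothetical. A quiet directory is only a stale candidate and still needs the owner's approval, since a paused job or a final submission can sit untouched for days.

```shell
#!/usr/bin/env bash
# recently_written DIR [MINUTES]: succeed if any file under DIR was modified
# within the last MINUTES minutes (default 1440, i.e. 24 hours). Read-only.
recently_written() {
  local dir="$1" mins="${2:-1440}"
  # -quit stops at the first match, so a hit on a large tree is cheap.
  [ -n "$(find "$dir" -xdev -type f -mmin "-$mins" -print -quit 2>/dev/null)" ]
}

# Example (hypothetical run directory under the workdirs path from the text):
# if recently_written /senpai-run/workdirs/some-run 60; then
#   echo "review-first: recent writes"
# else
#   echo "stale candidate: still needs owner approval"
# fi
```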