VTOrc Primary Disk Full Recovery
When a Vitess or MySQL PRIMARY tablet fills its datadir, orchestration often sees the downstream noise first: stopped replication, hung writes, transaction errors, and lag. The useful question is sharper: is there a healthy replica with free disk that can safely become primary, or should VTOrc only surface a PrimaryDiskFull analysis until an operator frees space?
Turn one disk-full PRIMARY incident into a reusable recovery boundary.
A disk-full primary should not trigger blind cleanup or topology movement. First prove ENOSPC on the current primary, then separate reachable non-full replicas from replicas that are also unsafe targets.
primary ENOSPC -> replica free space -> replication position -> reparent or surface
Measure primary disk pressure and candidate replica safety.
These checks are intentionally read-only. They capture disk pressure, inode pressure, MySQL log evidence, binary-log state, and replica status before cleanup or topology changes.
df -h "$PRIMARY_DATADIR"; df -i "$PRIMARY_DATADIR"; mysql -e "SHOW REPLICA STATUS\G"
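To turn the df output into a comparable number, one option is to parse the POSIX-format output. The sketch below is illustrative only: the function names use_pct and pressure_state and the default threshold of 95 are assumptions, and df -Pi assumes GNU df.

```shell
#!/bin/sh
# Print the use percentage (digits only) from `df -P` or `df -Pi` output on stdin.
# The POSIX format is a header line followed by one data line whose
# fifth field looks like "97%". Filesystem names with spaces are not handled.
use_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Classify a use percentage against a threshold.
# The default threshold of 95 is a hypothetical value, not a Vitess setting.
pressure_state() {
  pct=$1
  limit=${2:-95}
  case $pct in
    ''|*[!0-9]*) echo "unknown"; return 0 ;;  # e.g. "-" when inode data is unavailable
  esac
  if [ "$pct" -ge "$limit" ]; then
    echo "full"
  else
    echo "ok"
  fi
}

# Read-only usage against a live filesystem:
#   bytes=$(df -P "$PRIMARY_DATADIR" | use_pct)
#   inodes=$(df -Pi "$PRIMARY_DATADIR" | use_pct)
#   echo "bytes=$(pressure_state "$bytes") inodes=$(pressure_state "$inodes")"
```

Checking bytes and inodes separately matters because either one can stop writes while the other still looks healthy.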
Runbook: Reparent Only When The Target Is Safer
- Confirm the PRIMARY tablet is actually disk-full. Do not infer from replication lag alone.
- Record both byte pressure and inode pressure for the datadir filesystem. Either can stop writes.
- Capture the recent MySQL or tablet log lines that prove ENOSPC, errno 28, or equivalent write failure.
- Classify replicas by reachability, mysqld health, datadir free space, and replication freshness.
- Permit automated reparent only when a candidate replica is non-full and fresh enough for the shard policy.
- If no candidate is safe, surface PrimaryDiskFull as root cause and require operator action before topology movement.
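The log-evidence step in the runbook above could be sketched as a read-only search. This is a minimal example under assumptions: the function name enospc_evidence and the MYSQL_ERROR_LOG variable are hypothetical, and the pattern list covers common wordings of errno 28 rather than every message MySQL or a tablet may emit.

```shell
#!/bin/sh
# Print numbered log lines that indicate a disk-full write failure.
# Exit status is 0 if at least one matching line was found, non-zero otherwise.
# Matches "No space left on device", "Disk is full", "ENOSPC",
# and "errno 28" / "Errcode: 28" (case-insensitive).
enospc_evidence() {
  log_file=$1
  grep -E -i -n \
    'No space left on device|Disk is full|ENOSPC|(errno|errcode):? ?28([^0-9]|$)' \
    "$log_file"
}

# Usage (MYSQL_ERROR_LOG is a hypothetical variable pointing at the error log):
#   if enospc_evidence "$MYSQL_ERROR_LOG" > /tmp/enospc-evidence.txt; then
#     echo "disk-full evidence captured"
#   else
#     echo "no ENOSPC evidence; do not classify as disk-full"
#   fi
```

Saving the matching lines before any cleanup preserves the proof that the failure was ENOSPC and not a side effect of lag.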
Use this when VTOrc sees disk-full primary symptoms.
This keeps the change focused on analysis and recovery safety, instead of turning every downstream replication symptom into a reparent attempt.
I would model this as a two-branch analysis: PrimaryDiskFullRecoverable when at least one safe candidate replica exists, and PrimaryDiskFullInformational when none does.
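As a rough illustration of the two branches, the decision reduces to two inputs. The function name analyze_primary_disk_full and the NotDiskFull label are hypothetical, and the two analysis names are the ones proposed here, not existing VTOrc analysis codes.

```shell
#!/bin/sh
# Map two facts to an analysis name.
#   $1: "yes" if ENOSPC or inode exhaustion is proven on the primary, else "no"
#   $2: number of replicas that passed the safety checks
analyze_primary_disk_full() {
  primary_full=$1
  safe_candidates=$2
  if [ "$primary_full" != "yes" ]; then
    echo "NotDiskFull"   # hypothetical label: no disk-full analysis applies
    return 0
  fi
  if [ "$safe_candidates" -gt 0 ]; then
    echo "PrimaryDiskFullRecoverable"    # reparent may be considered
  else
    echo "PrimaryDiskFullInformational"  # surface the root cause only
  fi
}

# Example with made-up inputs:
analyze_primary_disk_full yes 2
analyze_primary_disk_full yes 0
```

Keeping the informational branch separate means a disk-full primary with no safe target still produces a root-cause signal instead of a failed reparent attempt.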
Read-only evidence before any cleanup or topology change:
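# Placeholder: set this to the primary tablet's MySQL datadir before running.
PRIMARY_DATADIR="<primary-datadir>"
# Byte and inode usage for the datadir filesystem and for the current directory.
df -h "$PRIMARY_DATADIR" .
df -i "$PRIMARY_DATADIR" .
# The 40 largest first-level entries under the datadir, sorted ascending.
du -h -d 1 "$PRIMARY_DATADIR" 2>/dev/null | sort -h | tail -40
# Datadir, binary-log, relay-log, and GTID settings, plus the current binary log list.
mysql -e "SHOW VARIABLES WHERE Variable_name IN ('datadir','log_bin','relay_log','gtid_mode'); SHOW BINARY LOGS;"
# Replica status; falls back to the older statement name on servers before MySQL 8.0.22.
mysql -e "SHOW REPLICA STATUS\G" 2>/dev/null || mysql -e "SHOW SLAVE STATUS\G" 2>/dev/null || true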
Recovery rule: reparent only if a reachable replica has non-full datadir, healthy mysqld, and acceptable replication position. If every reachable replica is full, stale, or unhealthy, emit the root-cause signal without automated reparent so an operator can free space safely.
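The recovery rule above can be sketched as a filter over one record per replica. Everything here is illustrative: the record layout, the function name safe_candidates, and the MAX_USE_PCT and MAX_LAG_SECONDS thresholds are assumptions, and lag in seconds stands in for whatever replication-position check the shard policy actually requires.

```shell
#!/bin/sh
# Hypothetical thresholds; a real shard policy would supply its own values.
MAX_USE_PCT=${MAX_USE_PCT:-90}
MAX_LAG_SECONDS=${MAX_LAG_SECONDS:-30}

# Read replica records from stdin, one per line:
#   <alias> <reachable yes|no> <mysqld_ok yes|no> <disk_use_pct> <lag_seconds>
# Print the alias of every replica that passes all checks.
# Non-numeric disk or lag values (for example NULL lag) are treated as unsafe.
safe_candidates() {
  awk -v max_pct="$MAX_USE_PCT" -v max_lag="$MAX_LAG_SECONDS" '
    NF == 5 && $2 == "yes" && $3 == "yes" &&
    $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ &&
    $4 + 0 < max_pct + 0 && $5 + 0 <= max_lag + 0 { print $1 }
  '
}

# Example with made-up records: only replica-a qualifies.
printf '%s\n' \
  'replica-a yes yes 55 3' \
  'replica-b yes yes 99 1' \
  'replica-c yes yes 40 NULL' \
  'replica-d no yes 10 0' | safe_candidates
```

An empty result corresponds to the case where every reachable replica is full, stale, or unhealthy, so no automated reparent should follow.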
Turn one disk-full primary incident into a recovery policy.
The $99 policy is for Vitess, MySQL, and database platform teams that need a safe/review/do-not-touch recovery boundary for one representative primary disk-full incident. You get root-cause signals, reparent eligibility checks, do-not-delete paths, and operator-facing runbook text.
Do Not Delete First
- Binary logs, relay logs, or tablet state before replication and PITR impact are known.
- The first log lines that prove the PRIMARY failed because of ENOSPC or inode exhaustion.
- Replica evidence before recording whether each candidate has free disk and acceptable freshness.
- Database directories, tablet aliases, or topology records while orchestration is still deciding ownership.
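One way to enforce the list above is a guard that labels a path before anyone proposes deleting it. This is a sketch under loud assumptions: the function name classify_path is hypothetical, and the filename patterns are guesses at common binary-log, relay-log, and error-log names that must be adjusted to the real log_bin and relay_log basenames. The guard never deletes anything and never labels a path as safe.

```shell
#!/bin/sh
# Label a path as "do-not-touch" or "review" based on its file name.
# The patterns are illustrative assumptions, not an authoritative list.
classify_path() {
  case $(basename "$1") in
    *bin*.[0-9][0-9][0-9][0-9][0-9][0-9]) echo "do-not-touch" ;;  # binary or relay log file
    *bin*.index)                          echo "do-not-touch" ;;  # binary or relay log index
    *relay*)                              echo "do-not-touch" ;;  # other relay-log state
    *.err|error.log*)                     echo "do-not-touch" ;;  # logs holding ENOSPC evidence
    *)                                    echo "review" ;;        # needs an operator decision
  esac
}

# Example with made-up paths:
for p in /vt/data/binlog.000042 /vt/data/relay-bin.index /vt/data/ibtmp1; do
  echo "$(classify_path "$p") $p"
done
```

Anything labeled review still needs a human decision; the guard only keeps the known-dangerous paths out of a cleanup list.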