VTOrc Primary Disk Full Recovery

When the PRIMARY tablet of a Vitess shard (or a standalone MySQL primary) fills its datadir, orchestration tends to see downstream noise first: stopped replication, hung writes, transaction errors, and lag. The useful question is sharper: is there a healthy replica with free disk that can safely become primary, or should VTOrc only surface a PrimaryDiskFull analysis until an operator frees space?

$99 primary recovery policy

Turn one disk-full PRIMARY incident into a reusable recovery boundary.

A disk-full primary should not trigger blind cleanup or topology movement. First prove ENOSPC on the current primary, then separate reachable, non-full replicas from replicas that are themselves unsafe targets.

primary ENOSPC -> replica free space -> replication position -> reparent or surface

Read-only evidence

Measure primary disk pressure and candidate replica safety.

These checks are intentionally read-only. They capture disk pressure, inode pressure, MySQL log evidence, binary-log state, and replica status before cleanup or topology changes.

df -h "$PRIMARY_DATADIR"; df -i "$PRIMARY_DATADIR"; mysql -e "SHOW REPLICA STATUS\G"
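
To compare byte and inode pressure against a policy threshold, the df output can be reduced to one number per resource. The following is a minimal sketch, not a definitive implementation. It assumes GNU coreutils df, where `df -P` and `df -P -i` both put the use percentage in the fifth column. The function names `df_use_pct` and `disk_pressure` and the default threshold of 95 are illustrative assumptions, not Vitess or MySQL settings.

```shell
#!/usr/bin/env bash

# Print the use percentage (without the % sign) from `df -P`-style output on stdin.
# Takes the fifth column of the last data line; prints "unknown" when the value
# is not a number (some filesystems report "-" for inode usage).
df_use_pct() {
  awk 'NR > 1 { v = $5 }
       END { sub(/%$/, "", v); if (v ~ /^[0-9]+$/) print v; else print "unknown" }'
}

# Report which resource is at or above the threshold: bytes, inodes, both, or ok.
# The third argument is a hypothetical policy threshold (default 95).
disk_pressure() {
  local bytes_pct="$1" inodes_pct="$2" threshold="${3:-95}"
  local out=""
  [[ "$bytes_pct" =~ ^[0-9]+$ ]] && (( bytes_pct >= threshold )) && out="bytes"
  [[ "$inodes_pct" =~ ^[0-9]+$ ]] && (( inodes_pct >= threshold )) && out="${out:+$out,}inodes"
  echo "${out:-ok}"
}

# Usage on the primary host (read-only):
#   bytes=$(df -P "$PRIMARY_DATADIR" | df_use_pct)
#   inodes=$(df -P -i "$PRIMARY_DATADIR" | df_use_pct)
#   disk_pressure "$bytes" "$inodes"
```

Keeping the parser separate from the threshold check means the same helper can be run against each candidate replica's datadir. A percentage near the threshold is only a hint, so the log evidence of ENOSPC is still needed to confirm the diagnosis.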


Runbook: Reparent Only When The Target Is Safer

Confirm the PRIMARY tablet is actually disk-full. Do not infer from replication lag alone.
Record both byte pressure and inode pressure for the datadir filesystem. Either can stop writes.
Capture the recent MySQL or tablet log lines that prove ENOSPC, errno 28, or equivalent write failure.
Classify replicas by reachability, mysqld health, datadir free space, and replication freshness.
Permit automated reparent only when a candidate replica is non-full and fresh enough for the shard policy.
If no candidate is safe, surface PrimaryDiskFull as root cause and require operator action before topology movement.
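
The last three steps can be sketched as a small decision function. This is a hedged illustration rather than VTOrc's actual logic. The function name `decide_recovery`, the input line format, and the default thresholds (80 percent disk use, 30 seconds of lag) are all assumptions made for the example. Freshness is approximated here by a lag value in seconds, whereas a real shard policy might compare GTID positions instead.

```shell
#!/usr/bin/env bash

# Decide between reparenting and surfacing PrimaryDiskFull.
# stdin: one replica per line, whitespace separated:
#   alias reachable(0/1) mysqld_ok(0/1) bytes_pct inodes_pct lag_seconds
# $1 and $2 are hypothetical shard-policy thresholds (max disk use %, max lag).
decide_recovery() {
  local max_pct="${1:-80}" max_lag="${2:-30}"
  awk -v max_pct="$max_pct" -v max_lag="$max_lag" '
    # A candidate is safe only when every check passes. A non-numeric lag
    # (for example NULL when replication is stopped) is treated as unsafe.
    $2 == 1 && $3 == 1 && $4 + 0 < max_pct + 0 && $5 + 0 < max_pct + 0 &&
    $6 ~ /^[0-9]+$/ && $6 + 0 <= max_lag + 0 {
      lag = $6 + 0
      # Prefer the freshest safe candidate.
      if (best == "" || lag < best_lag) { best = $1; best_lag = lag }
    }
    END {
      if (best != "") print "reparent " best
      else print "surface PrimaryDiskFull"
    }'
}

# Usage with hand-collected evidence:
#   printf 'replica-1 1 1 40 10 5\nreplica-2 1 1 97 9 0\n' | decide_recovery
```

The function only prints a recommendation and changes nothing, so the operator or the orchestration layer keeps control of any actual topology movement.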

Copy-ready issue reply

Use this when VTOrc sees disk-full primary symptoms.

This keeps the change focused on analysis and recovery safety, instead of turning every downstream replication symptom into a reparent attempt.

I would model this as a two-branch analysis: PrimaryDiskFullRecoverable and PrimaryDiskFullInformational.

Read-only evidence before any cleanup or topology change:

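# Placeholder: set this to the primary's MySQL datadir before running.
PRIMARY_DATADIR="<primary-datadir>"
# Byte pressure, then inode pressure, on the datadir filesystem.
df -h "$PRIMARY_DATADIR"
df -i "$PRIMARY_DATADIR"
# Largest first-level directories under the datadir.
du -h -d 1 "$PRIMARY_DATADIR" 2>/dev/null | sort -h | tail -40
# Binary-log and relay-log configuration, plus current binary logs.
mysql -e "SHOW VARIABLES WHERE Variable_name IN ('datadir','log_bin','relay_log','gtid_mode'); SHOW BINARY LOGS;"
# Replica status, falling back to the older statement name on older servers.
mysql -e "SHOW REPLICA STATUS\G" 2>/dev/null || mysql -e "SHOW SLAVE STATUS\G" 2>/dev/null || true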

Recovery rule: reparent only if a reachable replica has non-full datadir, healthy mysqld, and acceptable replication position. If every reachable replica is full, stale, or unhealthy, emit the root-cause signal without automated reparent so an operator can free space safely.


Paid scope

Turn one disk-full primary incident into a recovery policy.

The $99 policy is for Vitess, MySQL, and database platform teams that need a safe/review/do-not-touch recovery boundary for one representative primary disk-full incident. You get root-cause signals, reparent eligibility checks, do-not-delete paths, and operator-facing runbook text.

Do Not Delete First

Binary logs, relay logs, or tablet state before replication and PITR impact are known.
The first log lines that prove the PRIMARY failed because of ENOSPC or inode exhaustion.
Replica state or logs before each candidate's free disk and replication freshness have been recorded.
Database directories, tablet aliases, or topology records while orchestration is still deciding ownership.
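
One way to keep the first proof of ENOSPC before any cleanup is to extract the matching log lines and store them off the full filesystem. The sketch below is illustrative only. The function name `first_enospc_lines` is hypothetical, and the patterns are assumptions based on common MySQL and OS error text, so they may need extending for a given MySQL version or tablet log format.

```shell
#!/usr/bin/env bash

# Print the first matching lines (with line numbers) that indicate ENOSPC.
# $1: path to a MySQL error log or tablet log; $2: max lines to print (default 20).
# Patterns are assumptions drawn from typical messages such as
# "OS errno 28 - No space left on device" and "Errcode: 28".
first_enospc_lines() {
  local log_file="$1" limit="${2:-20}"
  grep -nE 'No space left on device|errno:? *28|Errcode: *28|ENOSPC' "$log_file" \
    | head -n "$limit"
}

# Usage: write the evidence to a different filesystem or host, never to the
# full datadir filesystem. Both paths below are placeholders.
#   first_enospc_lines /path/to/mysql-error.log > /other/filesystem/enospc-evidence.txt
```

The function reads the log without modifying it, which keeps it inside the read-only boundary described above as long as the output is redirected somewhere other than the full filesystem.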