File Lock Heartbeat Failure on a Full Disk and Stale Locks
When a file-lock heartbeat silently fails on ENOSPC, inode exhaustion, permissions, or a missing heartbeat path, the lock holder may still be inside the critical section while another process decides the lock is stale and enters too.
Make heartbeat failure visible before a stale lock becomes concurrent writes.
Use a small regression checklist to separate lock acquisition, heartbeat refresh, stale detection, and lock stealing. The goal is not more logging alone; it is a safe rule for what happens after heartbeat writes stop working.
df -h "$LOCK_DIR"; df -i "$LOCK_DIR"; test -w "$LOCK_DIR" && echo "writable" || echo "not writable"; stat "$HEARTBEAT_PATH"
First Response Runbook
A heartbeat failure should not be treated as a successful lock refresh. It should create an explicit lock-health state that downstream stale-lock logic can reason about.
- Log heartbeat refresh failures with the lock path, heartbeat path, operation, and filesystem error.
- Classify ENOSPC, EDQUOT, EIO, EACCES, EPERM, and missing heartbeat paths separately from normal stale timeout.
- When heartbeat refresh fails, decide whether the holder aborts protected work, releases the lock, or marks the lock as unhealthy.
- Do not let a contender steal the lock solely because the heartbeat mtime is stale when a disk or permission failure could explain the missed refresh.
- Require dead-owner evidence, an explicit fencing token, or a recovery lock before allowing a steal.
- Add a two-contender regression test: holder heartbeat fails, contender polls, and both processes never enter the critical section at once.
Use this checklist when a heartbeat error is currently swallowed.
It keeps the fix focused on preventing concurrent entry, not only printing a warning.
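The logging and classification items in the checklist can be sketched in Python as follows. This is a minimal illustration, not the implementation of any particular lock library: the names `LockHealth` and `refresh_heartbeat` and the injectable `touch` parameter are assumptions made for the example, and it assumes the heartbeat is refreshed by updating the file's mtime.

```python
import errno
import logging
import os
from enum import Enum

log = logging.getLogger("lock.heartbeat")


class LockHealth(Enum):
    HEALTHY = "healthy"
    NO_SPACE = "no_space"      # ENOSPC, EDQUOT
    IO_ERROR = "io_error"      # EIO
    PERMISSION = "permission"  # EACCES, EPERM
    MISSING = "missing"        # ENOENT: the heartbeat path is gone
    UNKNOWN = "unknown"        # any other OSError


_ERRNO_TO_HEALTH = {
    errno.ENOSPC: LockHealth.NO_SPACE,
    errno.EDQUOT: LockHealth.NO_SPACE,
    errno.EIO: LockHealth.IO_ERROR,
    errno.EACCES: LockHealth.PERMISSION,
    errno.EPERM: LockHealth.PERMISSION,
    errno.ENOENT: LockHealth.MISSING,
}


def refresh_heartbeat(lock_path, heartbeat_path, touch=os.utime):
    """Refresh the heartbeat mtime and return an explicit health state.

    `touch` is injectable (hypothetical parameter) so tests can simulate
    ENOSPC, EIO, or EACCES without filling a real disk.
    """
    try:
        touch(heartbeat_path)
    except OSError as exc:
        health = _ERRNO_TO_HEALTH.get(exc.errno, LockHealth.UNKNOWN)
        log.warning(
            "heartbeat refresh failed: lock=%s heartbeat=%s op=utime errno=%s health=%s",
            lock_path,
            heartbeat_path,
            errno.errorcode.get(exc.errno, exc.errno),
            health.value,
        )
        return health
    return LockHealth.HEALTHY
```

The caller receives a state it has to handle, so a failed refresh cannot be mistaken for a successful one.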
I would make the heartbeat failure visible, and I would also define what happens to the protected critical section once the heartbeat path becomes unhealthy.
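One way to define what happens to the critical section is a guard that latches once the heartbeat fails and is checked before every protected write. The sketch below is an assumption about how a holder could be structured; `HolderGuard` and `LockUnhealthy` are hypothetical names, and the latch deliberately does not clear itself when a later refresh succeeds.

```python
class LockUnhealthy(RuntimeError):
    """Raised when protected work must stop because the heartbeat is failing."""


class HolderGuard:
    """Latches the first heartbeat failure until ownership is resolved."""

    def __init__(self):
        self.unhealthy_reason = None

    def record_failure(self, reason):
        # Keep the first failure; later ones do not overwrite it.
        if self.unhealthy_reason is None:
            self.unhealthy_reason = reason

    def check(self):
        # Call before each protected write; raises once the lock is unhealthy.
        if self.unhealthy_reason is not None:
            raise LockUnhealthy(self.unhealthy_reason)
```

A holder would call `guard.record_failure(...)` from the heartbeat timer and `guard.check()` before each write, so protected work stops at the next write boundary instead of continuing under a lock that contenders may already consider stale.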
Acceptance checks I would add:
- Inject utimes(heartbeatPath) failure with ENOSPC/EIO/EACCES and assert a warning includes the lock path and operation.
- Assert that a heartbeat-write failure is recorded as lock-health degradation, not reported as a successful refresh.
- Assert that the holder either aborts protected work or marks the lock as non-stealable until ownership is resolved.
- The stale-lock detector should require both stale mtime and dead owner/process evidence before stealing.
- Add a two-contender regression test: holder heartbeat fails, second process polls, and concurrent entry never happens.
- Surface lock-dir filesystem and inode status so disk-full and permission failures are distinguishable.
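The steal rule and the two-contender check can be sketched together. This is a simplified, single-host illustration under stated assumptions: `may_steal` and `owner_is_dead` are hypothetical names, the owner PID is assumed to be recorded with the lock, and the liveness probe uses signal 0, which is POSIX-only (do not run it on Windows, where `os.kill` has different semantics). PID reuse can make a dead owner look alive, which errs on the side of not stealing.

```python
import os
import tempfile
import time


def owner_is_dead(pid):
    """Same-host probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False


def may_steal(heartbeat_path, owner_pid, stale_after_s, now=None):
    """Allow a steal only with a stale mtime AND dead-owner evidence."""
    now = time.time() if now is None else now
    try:
        age = now - os.stat(heartbeat_path).st_mtime
    except FileNotFoundError:
        return False  # a missing heartbeat means an unhealthy lock, not a dead owner
    return age > stale_after_s and owner_is_dead(owner_pid)


def test_live_holder_with_frozen_heartbeat_is_never_stolen():
    """Holder is alive but its heartbeat stopped refreshing; contender polls."""
    with tempfile.TemporaryDirectory() as lock_dir:
        heartbeat = os.path.join(lock_dir, "lock.heartbeat")
        with open(heartbeat, "w"):
            pass
        frozen = time.time() - 3600  # simulate refreshes that stopped an hour ago
        os.utime(heartbeat, (frozen, frozen))
        holder_pid = os.getpid()  # this process plays the live holder
        for _ in range(5):  # contender polling loop
            assert not may_steal(heartbeat, holder_pid, stale_after_s=30)
```

A fuller regression test would run the holder and contender as separate processes and count entries into the critical section; the version above only pins down the decision rule that prevents the second entry.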
Evidence To Collect
- Lock path, heartbeat path, owner PID, and stale timeout.
- Filesystem free space and inode status for the directory that stores the heartbeat.
- The exact error returned by heartbeat refresh, not just whether the timer is still running.
- The condition a contender uses before stealing the lock.
- Whether protected writes can continue after heartbeat refresh has failed.
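The filesystem part of this evidence can be captured in one call and attached to the incident log. The sketch below is illustrative and Unix-only (`os.statvfs` is not available on Windows); `lock_evidence` is a hypothetical helper name, and owner PID and stale timeout are left out because where they live depends on the lock format.

```python
import os


def lock_evidence(lock_dir, heartbeat_path):
    """Snapshot free space, free inodes, writability, and heartbeat status."""
    vfs = os.statvfs(lock_dir)
    evidence = {
        "lock_dir": lock_dir,
        "heartbeat_path": heartbeat_path,
        # Space and inodes available to unprivileged users.
        "free_bytes": vfs.f_bavail * vfs.f_frsize,
        "free_inodes": vfs.f_favail,  # some filesystems report 0 here
        "dir_writable": os.access(lock_dir, os.W_OK),
    }
    try:
        evidence["heartbeat_mtime"] = os.stat(heartbeat_path).st_mtime
    except OSError as exc:
        # Keep the exact error instead of collapsing it to "stale".
        evidence["heartbeat_error"] = f"{type(exc).__name__}: {exc}"
    return evidence
```

Recording the exact `OSError` next to the free-space and inode numbers makes it easier to tell a disk-full stall from a permission problem or a deleted heartbeat file.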
Paid Scope
The $29 incident triage reviews one lock or runner failure and returns the safest next diagnostic step. The $99 team pilot turns one representative incident into a stale-lock policy, failure taxonomy, and regression checklist for your agent, CLI, or CI tool.