File Lock Heartbeat Disk Full Stale Lock

When a file-lock heartbeat fails silently because of ENOSPC, inode exhaustion, a permission error, or a missing heartbeat path, the lock holder may still be inside the critical section while another process decides the lock is stale and enters as well.

AI tool lock safety

Make heartbeat failure visible before a stale lock becomes concurrent writes.

Use a small regression checklist to separate lock acquisition, heartbeat refresh, stale detection, and lock stealing. The goal is not more logging alone; it is a safe rule for what happens after heartbeat writes stop working.

df -h "$LOCK_DIR"; df -i "$LOCK_DIR"; test -w "$LOCK_DIR" && echo "writable" || echo "not writable"; stat "$HEARTBEAT_PATH"

First Response Runbook

A heartbeat failure should not be treated as a successful lock refresh. It should create an explicit lock-health state that downstream stale-lock logic can reason about.
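
One way to make that state explicit is to have the refresh call return a result object instead of swallowing the exception. The following is a minimal sketch in Python, not a definitive implementation: `LockHealth`, `HeartbeatResult`, `FAILURE_CLASSES`, and `refresh_heartbeat` are hypothetical names, and `os.utime` is assumed to be the refresh operation.

```python
import errno
import logging
import os
from dataclasses import dataclass
from enum import Enum
from typing import Optional

log = logging.getLogger("lock.heartbeat")


class LockHealth(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # refresh failed; the holder may still be working


# Hypothetical mapping from errno values to failure classes, so disk,
# quota, I/O, and permission problems are not confused with a normal timeout.
FAILURE_CLASSES = {
    errno.ENOSPC: "disk_full",
    errno.EDQUOT: "quota_exceeded",
    errno.EIO: "io_error",
    errno.EACCES: "permission_denied",
    errno.EPERM: "permission_denied",
    errno.ENOENT: "heartbeat_missing",
}


@dataclass
class HeartbeatResult:
    health: LockHealth
    failure_class: Optional[str] = None
    error: Optional[OSError] = None


def refresh_heartbeat(lock_path: str, heartbeat_path: str) -> HeartbeatResult:
    """Touch the heartbeat file and report an explicit health state."""
    try:
        os.utime(heartbeat_path, None)  # set atime and mtime to the current time
    except OSError as exc:
        failure_class = FAILURE_CLASSES.get(exc.errno, "other")
        log.warning(
            "heartbeat refresh failed: lock=%s heartbeat=%s op=utime class=%s error=%s",
            lock_path, heartbeat_path, failure_class, exc,
        )
        return HeartbeatResult(LockHealth.DEGRADED, failure_class, exc)
    return HeartbeatResult(LockHealth.HEALTHY)
```

The caller then has to decide what a `DEGRADED` result means for the protected work, which is the policy question the rest of this runbook addresses.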

- Log heartbeat refresh failures with the lock path, heartbeat path, operation, and filesystem error.
- Classify ENOSPC, EDQUOT, EIO, EACCES, EPERM, and missing heartbeat paths separately from normal stale timeout.
- When heartbeat refresh fails, decide whether the holder aborts protected work, releases the lock, or marks the lock as unhealthy.
- Do not let a contender steal only because mtime is stale when disk or permission failure could explain the stale heartbeat.
- Require dead-owner evidence, an explicit fencing token, or a recovery lock before allowing a steal.
- Add a two-contender regression test: holder heartbeat fails, contender polls, and both processes never enter the critical section at once.
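
The stealing rule above can be sketched as a small decision function. This is an illustration under stated assumptions, not the tool's actual logic: `StealDecision`, `owner_is_dead`, and `decide_steal` are hypothetical names, the owner is assumed to run on the same POSIX host, and the PID probe is only best-effort evidence because PIDs can be reused.

```python
import os
import time
from enum import Enum
from typing import Optional


class StealDecision(Enum):
    WAIT = "wait"      # heartbeat is still fresh
    REFUSE = "refuse"  # heartbeat is stale, but the owner may still be alive
    STEAL = "steal"    # heartbeat is stale and the owner is confirmed gone


def owner_is_dead(pid: int) -> bool:
    """Best-effort, POSIX-only probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False


def decide_steal(
    heartbeat_path: str,
    owner_pid: int,
    stale_after_s: float,
    now: Optional[float] = None,
) -> StealDecision:
    """Require both a stale heartbeat and dead-owner evidence before stealing."""
    now = time.time() if now is None else now
    try:
        age = now - os.stat(heartbeat_path).st_mtime
    except FileNotFoundError:
        age = float("inf")  # a missing heartbeat counts as stale, not as permission to steal
    if age < stale_after_s:
        return StealDecision.WAIT
    if owner_is_dead(owner_pid):
        return StealDecision.STEAL
    return StealDecision.REFUSE
```

A contender that receives `REFUSE` should surface the situation to an operator or fall back to a fencing token or recovery lock rather than keep retrying the steal.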

Copy-ready issue reply

Use this checklist when a heartbeat error is currently swallowed.

It keeps the fix focused on preventing concurrent entry, not only printing a warning.

I would make the heartbeat failure visible, and I would also define what happens to the protected critical section once the heartbeat path becomes unhealthy.

Acceptance checks I would add:

- Inject utimes(heartbeatPath) failure with ENOSPC/EIO/EACCES and assert a warning includes the lock path and operation.
- Treat heartbeat-write failure as lock-health degradation, not a silent success.
- The holder should either abort protected work or mark the lock as non-stealable until ownership is resolved.
- The stale-lock detector should require both stale mtime and dead owner/process evidence before stealing.
- Add a two-contender regression test: holder heartbeat fails, second process polls, and concurrent entry never happens.
- Surface lock-dir filesystem and inode status so disk-full and permission failures are distinguishable.
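
The failure-injection and two-contender checks could be combined into one deterministic simulation. The sketch below is illustrative only: `Holder`, `Contender`, and `max_concurrent_entrants` are hypothetical stand-ins for the real lock code, `os.utime` is assumed to be the heartbeat call, and the clock is simulated rather than slept on. The holder here marks the lock unhealthy and keeps working, which is the harder case for the contender rule.

```python
import errno
import os
import tempfile
from unittest import mock


class Holder:
    """Hypothetical lock holder that refreshes a heartbeat file."""

    def __init__(self, heartbeat_path: str) -> None:
        self.heartbeat_path = heartbeat_path
        self.in_critical_section = True
        self.lock_healthy = True

    def heartbeat(self) -> None:
        try:
            os.utime(self.heartbeat_path, None)
        except OSError:
            # Policy under test: record the failure instead of swallowing it.
            self.lock_healthy = False


class Contender:
    """Hypothetical contender that polls the heartbeat mtime."""

    def __init__(self, heartbeat_path: str, stale_after_s: float,
                 require_dead_owner: bool = True) -> None:
        self.heartbeat_path = heartbeat_path
        self.stale_after_s = stale_after_s
        self.require_dead_owner = require_dead_owner
        self.in_critical_section = False

    def poll(self, now: float, owner_alive: bool) -> None:
        stale = now - os.stat(self.heartbeat_path).st_mtime >= self.stale_after_s
        if stale and not (self.require_dead_owner and owner_alive):
            self.in_critical_section = True  # steal the lock and enter


def max_concurrent_entrants(require_dead_owner: bool = True,
                            steps: int = 5, stale_after_s: float = 10.0) -> int:
    """Inject ENOSPC into every heartbeat and return the peak number of
    parties inside the critical section at the same time."""
    with tempfile.TemporaryDirectory() as lock_dir:
        heartbeat_path = os.path.join(lock_dir, "heartbeat")
        open(heartbeat_path, "w").close()
        start = os.stat(heartbeat_path).st_mtime
        holder = Holder(heartbeat_path)
        contender = Contender(heartbeat_path, stale_after_s, require_dead_owner)
        peak = 0
        enospc = OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        with mock.patch("os.utime", side_effect=enospc):
            for step in range(1, steps + 1):
                holder.heartbeat()                     # fails with ENOSPC each time
                now = start + step * stale_after_s     # simulated clock
                contender.poll(now, owner_alive=True)  # holder process is still running
                peak = max(peak, holder.in_critical_section
                           + contender.in_critical_section)
        return peak
```

With the dead-owner requirement in place the peak should stay at one entrant; passing `require_dead_owner=False` models an mtime-only detector and lets the same scenario reach two, which is the regression the test is meant to catch.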


Evidence To Collect

- Lock path, heartbeat path, owner PID, and stale timeout.
- Filesystem free space and inode status for the directory that stores the heartbeat.
- The exact error returned by heartbeat refresh, not just whether the timer is still running.
- The condition a contender uses before stealing the lock.
- Whether protected writes can continue after heartbeat refresh has failed.
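
The filesystem part of this evidence can be gathered programmatically as well as from the shell. The helper below is a rough sketch assuming a POSIX system (`os.statvfs` is not available on Windows); `collect_lock_evidence` and its dictionary keys are hypothetical names chosen for illustration.

```python
import os
import shutil


def collect_lock_evidence(lock_dir: str, heartbeat_path: str) -> dict:
    """Snapshot free space, inode status, writability, and heartbeat state."""
    usage = shutil.disk_usage(lock_dir)
    vfs = os.statvfs(lock_dir)  # POSIX only
    evidence = {
        "lock_dir": lock_dir,
        "heartbeat_path": heartbeat_path,
        "free_bytes": usage.free,
        "free_inodes": vfs.f_favail,   # inodes available to unprivileged users
        "total_inodes": vfs.f_files,
        "dir_writable": os.access(lock_dir, os.W_OK),
        "heartbeat_exists": os.path.exists(heartbeat_path),
        "heartbeat_mtime": None,
    }
    try:
        evidence["heartbeat_mtime"] = os.stat(heartbeat_path).st_mtime
    except OSError as exc:
        # Keep the exact error; "missing" and "permission denied" need different fixes.
        evidence["heartbeat_error"] = f"{type(exc).__name__}: {exc}"
    return evidence
```

Attaching this snapshot to the heartbeat-failure log entry lets a reader tell a disk-full or inode-exhausted directory apart from a permission problem without reproducing the incident.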

Paid Scope