SafeDisk AI

Log Monitor Loop Fills Disk

A monitor that should protect the host can become the outage when it busy-polls log files, repeats the same error thousands of times per second, and fills syslog or journald before anyone sees the original permission or file-descriptor problem.

Free maintainer checklist

Stop a busy-polling monitor from turning one error into a disk-full outage.

Copy the starter fix first: preserve the first error, cap log growth, add backoff, and track the last-read offset so one permission or fd error cannot flood syslog.

preserve first error -> cap logs -> add backoff -> track offset

Read-only evidence

Measure log growth, repeat rate, and service limits.

These checks capture the blast radius without deleting logs first. They help separate the root trigger from the secondary log storm.

df -h /var; journalctl -u <service-name> -n 300 --no-pager; du -sh /var/log
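
The commands above show current usage but not the growth rate or the repeat rate. A minimal read-only sketch of both measurements follows. The helper names (bytes_per_minute, log_growth, top_repeats) and the 30-second sampling window are assumptions for illustration, and du -sb assumes GNU coreutils.

```shell
# Sketch only: helper names and the default 30-second window are assumptions.

# Convert two size samples taken $3 seconds apart into bytes per minute.
bytes_per_minute() {
  local before=$1 after=$2 seconds=$3
  echo $(( (after - before) * 60 / seconds ))
}

# Sample a directory's size twice without modifying it (du -sb is GNU coreutils).
log_growth() {
  local dir=${1:-/var/log} seconds=${2:-30} before after
  before=$(du -sb "$dir" 2>/dev/null | cut -f1)
  sleep "$seconds"
  after=$(du -sb "$dir" 2>/dev/null | cut -f1)
  bytes_per_minute "${before:-0}" "${after:-0}" "$seconds"
}

# Read messages on stdin and print the N most repeated as "count<TAB>message".
top_repeats() {
  local n=${1:-5}
  awk '{ c[$0]++ } END { for (m in c) printf "%d\t%s\n", c[m], m }' |
    sort -rn | head -n "$n"
}

# Example (read-only):
#   log_growth /var/log 30
#   journalctl -u "$SERVICE" -o cat --since "5 min ago" --no-pager | top_repeats 5
```

Passing -o cat to journalctl drops timestamps, so identical messages collapse into a single count instead of appearing as distinct lines.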

Runbook: Stop The Flood Without Hiding The Root Error

  1. Preserve the first few error lines. The repeated line that fills disk may be a symptom of an earlier permission, fd limit, or missing file state.
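  2. Put a temporary blast-radius cap in place: a journald SystemMaxUse= limit or a size cap on rsyslog output files, a service CPU quota (CPUQuota=), and log rate limiting (RateLimitIntervalSec= and RateLimitBurst= in journald.conf). This is containment, not the final fix.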
  3. Fix the loop: use inotify or a sleep-backed poll interval, keep a file offset, and avoid reopening/re-reading from byte zero on every iteration.
  4. Add error backoff: 1s, 2s, 4s up to a cap, plus "repeated N times" aggregation instead of one log line per failed loop.
  5. Bound resource spikes: file descriptor limit, concurrent lookups, and a disable-after-K-failures path for unreadable monitored files.
  6. Turn the incident into a policy: max log bytes/minute, max repeated errors/minute, alert when /var reserve is breached, and documented service permissions.
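
Steps 3 to 5 can be sketched in a few shell functions. This is a minimal illustration, not the monitor's actual code: the function names, the offset state file, the 60-second backoff cap, and the failure limit of 8 are all assumptions, and it uses a sleep-backed poll rather than inotify.

```shell
#!/usr/bin/env bash
# Sketch only: function names, the state file, and all limits are hypothetical.

# Step 4: exponential backoff, 1s, 2s, 4s ... up to a cap (default 60s).
backoff_delay() {
  local failures=$1 cap=${2:-60} delay=1 i
  for ((i = 1; i < failures; i++)); do
    delay=$((delay * 2))
    if ((delay >= cap)); then
      delay=$cap
      break
    fi
  done
  echo "$delay"
}

# Step 3: print only the bytes appended since the last call.
# The last-read offset is kept in a small state file.
read_new_bytes() {
  local file=$1 state=$2 offset=0 size
  [ -r "$file" ] || return 1
  if [ -r "$state" ]; then
    offset=$(cat "$state")
  fi
  size=$(wc -c < "$file") || return 1
  # A smaller file means truncation or rotation: start again from byte zero once.
  if ((size < offset)); then
    offset=0
  fi
  # tail starts at the first unread byte; head stops at the size sampled above.
  tail -c "+$((offset + 1))" "$file" | head -c "$((size - offset))"
  echo "$size" > "$state"
}

# Steps 3 to 5 together: sleep-backed poll, one log line for the first error,
# a "repeated N times" summary on recovery, and a disable path after max_failures.
monitor_loop() {
  local file=$1 state=$2 max_failures=${3:-8} failures=0
  while :; do
    if read_new_bytes "$file" "$state"; then
      if ((failures > 0)); then
        echo "read of $file recovered; previous error repeated $failures times" >&2
      fi
      failures=0
      sleep 1
    else
      failures=$((failures + 1))
      if ((failures == 1)); then
        echo "cannot read $file (first failure, further repeats suppressed)" >&2
      fi
      if ((failures >= max_failures)); then
        echo "disabling monitor for $file after $failures failures" >&2
        return 1
      fi
      sleep "$(backoff_delay "$failures")"
    fi
  done
}

# Example (hypothetical paths): monitor_loop /var/log/app.log /run/monitor/app.offset 8
```

With these defaults an unreadable file produces two log lines in total (the first failure and the disable notice) instead of one line per loop iteration.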

Copy-ready issue reply

Use this when a monitor fills disk with repeated errors.

This separates containment from the real product fix: sleep/inotify, offset tracking, rate-limited logging, and permission-safe packaging.

I would split this into containment and the loop fix.

Read-only evidence before cleanup:

SERVICE="<service-name>"   # replace with the systemd unit name
df -h / /var /var/log 2>/dev/null
df -i / /var /var/log 2>/dev/null
du -sh /var/log /var/log/journal 2>/dev/null
journalctl -u "$SERVICE" -n 300 --no-pager 2>/dev/null || true
systemctl show "$SERVICE" -p MainPID -p Restart -p RestartUSec -p LimitNOFILE 2>/dev/null || true

For the fix, I would require three guards: no tight poll loop (inotify or sleep + last-byte offset), rate-limited recurring errors with exponential backoff, and a bounded log/journal budget so a permission error cannot become a disk-full outage.
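
One possible way to express the bounded log/journal budget is a pair of systemd drop-ins. The sketch below only writes them to a staging directory for review and installs nothing. The function name stage_guardrails, the file names, and every numeric value are assumptions to tune per host; the option names themselves are standard journald.conf and systemd unit settings.

```shell
# Sketch only: every value and file name below is an assumption to tune per host.
# Drop-ins are written to a staging directory for review, not installed.
stage_guardrails() {
  local stage=$1 service=$2
  mkdir -p "$stage/journald.conf.d" "$stage/$service.service.d"

  # Journal-wide budget: cap total size, keep free space, rate-limit each service.
  cat > "$stage/journald.conf.d/50-log-budget.conf" <<'EOF'
[Journal]
SystemMaxUse=500M
SystemKeepFree=1G
RateLimitIntervalSec=30s
RateLimitBurst=1000
EOF

  # Per-service guardrails: restart limits, CPU quota, fd limit, log rate limit.
  cat > "$stage/$service.service.d/50-guardrails.conf" <<'EOF'
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
CPUQuota=20%
LimitNOFILE=1024
LogRateLimitIntervalSec=30s
LogRateLimitBurst=200
RestartSec=5s
EOF
}

# Example: stage_guardrails /tmp/guardrails "$SERVICE"
```

After review, the staged files would normally go under /etc/systemd/journald.conf.d/ and /etc/systemd/system/<service-name>.service.d/, followed by systemctl daemon-reload and a restart of systemd-journald and the service.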

Paid scope

Turn one log storm into a host-safe storage policy.

The $99 policy is for teams running monitors, log tailers, fail2ban-style tools, scanner bridges, or agents where repeated errors can fill shared host storage. You get the safe evidence checklist, backoff/rate-limit requirements, systemd guardrails, and recovery runbook for one representative incident.

No secrets, private logs, or customer data. A public-safe summary is enough to start.

Do Not Delete First