fix: correct ZFS pool status threshold operator and add per-metric grace

The default zfs_monitor.*.status threshold used operator '>' with warning=1, so a DEGRADED pool (status=1) never alerted (1 > 1 is false) and a FAULTED pool (status=2) only triggered WARNING instead of CRITICAL. Fix the operator to '>=' in THRESHOLD_DEFAULTS and the example config.

Also adds a per-metric grace period override (ThresholdConfig.grace) so individual thresholds can bypass or shorten the global grace delay. Alerts with grace=0 fire immediately on state change rather than waiting for a second collection cycle.

Sets grace=0 on zfs_monitor.*.status so pool degradation alerts fire on the first data report after the event.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
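
For illustration only, the operator bug and the grace=0 behavior described above can be sketched as follows. This is a minimal sketch, not the project's code: the names OPS, severity, and should_fire are hypothetical, and the grace rule (fire once the alerting state has lasted at least `grace` time units) is an assumption about how the delay is applied.

```python
import operator

# Hypothetical operator table; the real project may map operators differently.
OPS = {">": operator.gt, ">=": operator.ge}


def severity(value, warning, critical, op):
    """Classify a metric value as 'CRITICAL', 'WARNING', or 'OK'."""
    compare = OPS[op]
    # Check the critical level first so it takes precedence over warning.
    if compare(value, critical):
        return "CRITICAL"
    if compare(value, warning):
        return "WARNING"
    return "OK"


def should_fire(elapsed_in_state, grace):
    """Assumed grace rule: fire once the state has lasted at least `grace` units."""
    return elapsed_in_state >= grace


if __name__ == "__main__":
    # status=1 is DEGRADED, status=2 is FAULTED (per the commit message).
    for op in (">", ">="):
        for status in (1, 2):
            print(op, status, severity(status, warning=1, critical=2, op=op))
```

With warning=1 and critical=2, the '>' operator classifies status=1 as OK and status=2 as WARNING, while '>=' gives WARNING and CRITICAL respectively. Under the assumed grace rule, grace=0 lets an alert fire on the same report that first shows the new state.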
2026-05-13 06:33:06 -04:00
parent 4160e34a96
commit 0f90be659e
5 changed files with 49 additions and 16 deletions
@@ -146,8 +146,9 @@ thresholds:
        status:
          warning: 1           # Alert WARNING when pool is DEGRADED
          critical: 2           # Alert CRITICAL when pool is SUSPENDED/FAULTED/UNAVAIL
-          operator: ">"
-          hysteresis: 0.0       # No hysteresis — a degraded pool is always critical
+          operator: ">="
+          hysteresis: 0.0       # No hysteresis — a degraded pool is always alerting
+          grace: 0              # Fire immediately — don't wait for a second collection
          display: "ZFS pool {pool_name} is {health}"

      # Per-pool capacity thresholds (optional; add pools you care about)