- nagios_runner: remove overall_status/overall_status_code/plugin_count fields;
each command still reports its own <name>_status and <name>_status_code
- threshold: expose {output} and {status} aliases in display templates for
nagios_runner generic matches (mapped from <check_name>_output/status)
- alerts.html: fix scrolling by overriding html,body height/overflow (style.css
sets both); make hostname a link to /plugins/<hostname>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_find_threshold() now returns the stripped prefix ("check_name") alongside
the ThresholdConfig, enabling a single generic entry (e.g. nagios_runner.status_code)
to cover all per-command metrics (check_disk_root_status_code, check_load_status_code,
…). The prefix is threaded through to _format_display() as {check_name}, with
{metric_name} also available in display templates. purge_stale_alerts() now
uses generic matching so it does not incorrectly drop alerts on
generic-matched metrics. README gains Display Format Templates and Generic
Threshold Matching sections.
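
A minimal sketch of the generic match described above, assuming thresholds
are stored in a dict keyed by "<plugin>.<metric>". The ThresholdConfig
fields, the find_threshold/format_display names and the exact matching rule
are illustrative assumptions, not the project's actual code:

```python
from dataclasses import dataclass


@dataclass
class ThresholdConfig:
    """Illustrative stand-in; the real class has more fields."""
    warning: float
    critical: float
    display: str = "{check_name}: {value}"


def find_threshold(thresholds, metric_path):
    """Return (config, prefix). prefix is None for an exact match."""
    if metric_path in thresholds:
        return thresholds[metric_path], None
    plugin, _, metric = metric_path.partition(".")
    for key, config in thresholds.items():
        key_plugin, _, suffix = key.partition(".")
        if key_plugin != plugin or not suffix:
            continue
        # Generic match: the metric name ends with "_<suffix>";
        # whatever precedes it is the stripped prefix.
        if metric.endswith("_" + suffix):
            return config, metric[: -len(suffix) - 1]
    return None, None


def format_display(config, metric_path, prefix, value):
    """Fill the display template with the names the commit mentions."""
    return config.display.format(
        check_name=prefix or "", metric_name=metric_path, value=value
    )
```

With a single nagios_runner.status_code entry, a lookup for
nagios_runner.check_disk_root_status_code yields the prefix check_disk_root.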
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 10% default hysteresis created an unreasonably wide recovery band:
a 95% threshold would only clear once the value dropped below 85.5%,
causing alerts to linger long after the metric was well below the
trigger level.
Change default hysteresis to 2% across all threshold parsers (plugin
metrics, partitions, RTT). For a 95% threshold, recovery is now at
93.1% instead of 85.5%.
Add AlertState.hysteresis field (set on every check, cleared on OK) and
expose recovery_threshold in to_dict() so the Alerts dashboard can
display "recovers < 93.1" alongside the trigger threshold, making the
hysteresis band visible to the user. Pickle backward-compatible via
__setstate__.
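
The arithmetic above can be sketched as follows. This assumes a
"higher is worse" metric and a hysteresis expressed as a fraction of the
trigger threshold; the AlertState shown is a cut-down illustration, not the
real class:

```python
def recovery_threshold(threshold, hysteresis=0.02):
    """Value the metric must drop below before the alert clears."""
    return threshold * (1.0 - hysteresis)


class AlertState:
    """Illustrative subset of the real AlertState."""

    def __init__(self, threshold, hysteresis=None):
        self.threshold = threshold
        self.hysteresis = hysteresis

    def __setstate__(self, state):
        # Pickles written before the field existed have no "hysteresis"
        # key, so supply a default when restoring them.
        self.__dict__.update(state)
        self.__dict__.setdefault("hysteresis", None)

    def to_dict(self):
        recovers = None
        if self.hysteresis is not None:
            recovers = round(
                recovery_threshold(self.threshold, self.hysteresis), 1
            )
        return {"threshold": self.threshold, "recovery_threshold": recovers}
```

For a threshold of 95 this gives 93.1 at 2% and 85.5 at the old 10% default.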
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
memory_monitor / hbc_mini: ZFS ARC memory is reclaimable, but the Linux
kernel does not count it in MemAvailable (it is not part of SReclaimable).
Read ARC size
from /proc/spl/kstat/zfs/arcstats and add it to available memory before
computing memory_percent and memory_used. No-op on systems without ZFS.
cpu_monitor: report uptime_seconds via psutil.boot_time() (full client)
and /proc/uptime (hbc_mini).
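
A rough sketch of the ARC adjustment, assuming all values are in bytes. The
function names are hypothetical, and clamping the adjusted figure to total
memory is an assumption of this sketch rather than something the commit
states:

```python
def arc_size_bytes(arcstats_text):
    """Return the 'size' row from arcstats text, or 0 if it is absent."""
    for line in arcstats_text.splitlines():
        fields = line.split()
        # Data rows have the form: <name> <type> <data>
        if len(fields) == 3 and fields[0] == "size":
            return int(fields[2])
    return 0


def read_arc_size(path="/proc/spl/kstat/zfs/arcstats"):
    """Read the ARC size; a missing file (no ZFS) makes this a no-op."""
    try:
        with open(path) as fh:
            return arc_size_bytes(fh.read())
    except OSError:
        return 0


def memory_percent(total, available, arc):
    """Percent of memory in use once the ARC is counted as available."""
    adjusted = min(total, available + arc)  # clamp: assumption
    return 100.0 * (total - adjusted) / total
```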
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace break-after-first-iteration with next(c for c in connections if
c.transport) so the message goes to the first connection that actually
has an open transport. Falls back to connections[0] if none are open
yet (sendto will attempt reopen), avoiding silent message loss when the
leading connection is still connecting.
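
The selection can be sketched with the two-argument form of next(), which
returns the default instead of raising StopIteration. The helper name is
hypothetical, and connections is assumed to be non-empty:

```python
def pick_connection(connections):
    """First connection with an open transport, else the leading one."""
    # connections[0] is evaluated eagerly, so an empty list raises
    # IndexError; callers are assumed to hold at least one connection.
    return next((c for c in connections if c.transport), connections[0])
```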
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Settings page: pass threshold_checker to http.start so the Threshold
Configurations section has data. Use threshold_checker's already-parsed
ThresholdConfig objects instead of re-parsing the raw nested YAML.
Named (non-default) configs now display only their explicit overrides
via threshold_raw_configs, not the full merged set with defaults.
hbc/hbc_mini: send boot and shutdown messages on first connection only
to avoid duplicate packets when multiple servers are configured.
Replace print("Daemonizing...") with logging.info so output goes to
syslog in daemon mode.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace href navigation with fetch() so the server response is captured
and displayed in a slide-up toast at the bottom of the page. Delete also
removes the host card from the DOM on success without a page reload.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Host Overview (plugins.html): show Update and Delete buttons in the
host-right zone when the logged-in user is the host owner (or admin /
unauthenticated mode). Buttons link to /u?h=<host> and /d?h=<host>
with stopPropagation so they don't toggle the accordion; Delete prompts
for confirmation first.
ThresholdChecker.purge_stale_alerts(): removes alert states whose
metric_path has no matching threshold in the current config. Called
after startup pickle restore and after every SIGHUP config reload so
alerts orphaned by upgrades or config changes do not persist
indefinitely.
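
In outline, the purge might look like the sketch below. The dict-of-dicts
alert state and the has_threshold callback are simplifications made for
illustration only:

```python
def purge_stale_alerts(alert_states, has_threshold):
    """Delete states whose metric_path no longer matches any threshold.

    alert_states: dict mapping a key to a dict with a "metric_path" entry.
    has_threshold: callable returning True if a threshold still applies.
    Returns the list of removed keys.
    """
    stale = [
        key
        for key, state in alert_states.items()
        if not has_threshold(state["metric_path"])
    ]
    for key in stale:
        del alert_states[key]
    return stale
```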
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Override the global style.css body height/overflow that locks all pages
to the viewport height (a remnant of the old drawer-menu layout).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ZFS monitor plugin (zfs_monitor) added to plugin list and features
- nagios_runner: async execution, stderr capture, signal handling, path validation
- Threshold alerting: de-escalation suppression, short-duration suppression, ping_monitor thresholds
- Per-host watch flag and role-filtered dashboards
- HTTP API & Web UI: hostname links in Live View, Host Overview with ZFS renderer, alert pie chart in nav bar, Settings threshold viewer
- hbc connection retry: indefinite retry for IPv4; IPv6 dropped after 3 early startup failures
- hbc daemon mode: logs routed to syslog after daemonizing
- hbc_mini: note that zfs_monitor and IPv6 early-fail protection are not available
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IPv4 connections are retried forever in heartbeat_sender if open() fails,
so a temporary network outage does not terminate the sender.
IPv6 connections that have never opened successfully are dropped after
IPV6_EARLY_FAIL_LIMIT (3) consecutive failures so that a network without
IPv6 support does not keep a dead sender running.
At startup all resolved connections are added to the list regardless of
whether the initial open() succeeds; the heartbeat_sender loop handles
the first real connection attempt.
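
The keep-or-drop decision could be sketched like this. ConnState and
record_open_result are hypothetical names; only IPV6_EARLY_FAIL_LIMIT and
its value come from the description above:

```python
import socket
from dataclasses import dataclass

IPV6_EARLY_FAIL_LIMIT = 3


@dataclass
class ConnState:
    family: int
    ever_opened: bool = False
    failures: int = 0


def record_open_result(conn, ok):
    """Update conn after an open() attempt; return True to keep it."""
    if ok:
        conn.ever_opened = True
        conn.failures = 0
        return True
    conn.failures += 1
    # Only IPv6 connections that have never opened are dropped;
    # IPv4 (and previously working IPv6) are retried forever.
    if conn.family == socket.AF_INET6 and not conn.ever_opened:
        return conn.failures < IPV6_EARLY_FAIL_LIMIT
    return True
```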
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Show a colour-coded pie chart (red=critical, yellow=warning, green=ok)
to the left of the clock in the nav bar. Backed by a new
GET /api/0/alert_summary endpoint that counts hosts per alert level
for the current user's visible hosts.
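
One way the per-level count could be computed is sketched below. It assumes
each host is counted once, under its worst active level; that rule and the
data shapes are assumptions, not taken from the endpoint's code:

```python
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def alert_summary(host_alerts, visible_hosts):
    """Count visible hosts by their worst alert level.

    host_alerts: dict mapping hostname to a list of active alert levels.
    visible_hosts: iterable of hostnames the current user may see.
    """
    counts = {"ok": 0, "warning": 0, "critical": 0}
    for host in visible_hosts:
        levels = host_alerts.get(host, [])
        worst = max(levels, key=SEVERITY.__getitem__, default="ok")
        counts[worst] += 1
    return counts
```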
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- threshold.py: add _find_threshold() with suffix fallback so thresholds
like ping_monitor.rtt_avg match ping_monitor.8_8_8_8_rtt_avg etc.;
each pinged host keeps its own alert state
- hbdclass.py: format RTT as integer ms (round())
- live.html: JS RTT display rounded to nearest ms (Math.round)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Hostnames in the live dashboard table are now links to /plugins#hostname,
which expands and scrolls to that host's card in the Host Overview page.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Only notify on worsening transitions (OK→WARNING, OK→CRITICAL,
WARNING→CRITICAL) and recovery (any→OK). De-escalation within alert
states (CRITICAL→WARNING) no longer sends a duplicate notification,
since the metric has not recovered.
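
The transition rule can be expressed compactly if levels are ordered by
severity. The Level enum and should_notify name are illustrative:

```python
from enum import IntEnum


class Level(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def should_notify(old, new):
    """True for worsening transitions and for recovery to OK."""
    if new == old:
        return False
    return new > old or new == Level.OK
```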
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Add renderZfsTables() to plugins.html with health/capacity/frag/dedup
table and cumulative I/O table; colour-code health and capacity thresholds;
add zfs_monitor to plugin_order and summary/render dispatch.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- watch: true (default) per host; watch: false suppresses all notifications
for that host in udp.py and threshold.py
- Live Dashboard and Host Overview now show only hosts where the logged-in
user is owner or manager (admins see all); WebSocket broadcasts filtered
per-connection by the same rule
- Add hbd/client/plugins/zfs_monitor.py: collects per-pool health, capacity,
fragmentation, dedup ratio, and cumulative I/O ops/bandwidth via zpool(8)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
threshold_config in the hosts section now accepts a list of named
configs applied left-to-right on top of the defaults, so focused
override profiles can be mixed without duplication. Single-string
and legacy host_threshold_mapping forms are unchanged.
- Add threshold_raw_configs to store per-config overrides separately
- Normalise threshold_config to list on parse (string or list)
- get_thresholds_for_host folds the list over the default base
- Update README and docs/THRESHOLD_ALERTING.md with examples
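
The left-to-right fold can be sketched with functools.reduce. This assumes
flat dicts of overrides; the real configs are ThresholdConfig objects, and
the signature shown here is hypothetical:

```python
from functools import reduce


def get_thresholds_for_host(defaults, raw_configs, threshold_config):
    """Apply named override configs, in order, on top of the defaults."""
    # Normalise the single-string form to a one-element list.
    if isinstance(threshold_config, str):
        names = [threshold_config]
    else:
        names = list(threshold_config or [])
    # Later configs win over earlier ones and over the defaults.
    return reduce(
        lambda merged, name: {**merged, **raw_configs[name]},
        names,
        dict(defaults),
    )
```

A host with threshold_config: [strict_disk, quiet_cpu] would start from the
defaults, then take strict_disk's overrides, then quiet_cpu's (both config
names are made up for this example).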
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sets dorestart and triggers a clean shutdown; os.execv re-execs
the process with the original arguments after cleanup.
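
A minimal sketch of the re-exec step, assuming the process was started as
"python script.py args..." so that sys.executable plus sys.argv reproduces
the original command line. The helper names are hypothetical:

```python
import os
import sys


def restart_command(argv=None):
    """Build the argument vector used to re-exec the current process."""
    argv = sys.argv if argv is None else argv
    return [sys.executable, *argv]


def reexec():
    """Replace the running process; call only after cleanup is done."""
    command = restart_command()
    # os.execv does not return on success: the process image is replaced.
    os.execv(command[0], command)
```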
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- fix: matrix/sms_voipms notifications blocked the event loop on timeout;
make send_notification async, dispatch all channel drivers as non-blocking
tasks (asyncio.to_thread for sync drivers, asyncio.wait_for for async);
update all call sites to fire-and-forget via create_task
- feat: add /about page with version, runtime, uptime counter, and repo link
- fix: hbc_mini plugin data format now matches full hbc client so Host
Overview displays memory, disk, and network metrics correctly
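
The non-blocking dispatch in the first item might look roughly like the
sketch below (Python 3.9+ for asyncio.to_thread). Driver signatures, the
timeout value and the boolean results are assumptions made for illustration:

```python
import asyncio
import inspect


async def _run_driver(driver, message, timeout):
    """Run one channel driver without blocking the event loop."""
    try:
        if inspect.iscoroutinefunction(driver):
            # Async drivers are bounded by a timeout.
            await asyncio.wait_for(driver(message), timeout)
        else:
            # Sync drivers run in a worker thread.
            await asyncio.to_thread(driver, message)
        return True
    except Exception:
        # A failing or timed-out channel must not affect the others.
        return False


async def send_notification(drivers, message, timeout=10.0):
    """Dispatch the message to every driver concurrently."""
    return await asyncio.gather(
        *(_run_driver(d, message, timeout) for d in drivers)
    )
```

A call site would then schedule it with
asyncio.create_task(send_notification(drivers, message)) and keep a
reference to the task until it completes, so it is not garbage-collected
mid-flight.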
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>