Hostnames in the live dashboard table are now links to /plugins#hostname,
which expands and scrolls to that host's card in the Host Overview page.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Only notify on worsening transitions (OK→WARNING, OK→CRITICAL,
WARNING→CRITICAL) and recovery (any→OK). De-escalation within alert
states no longer sends a duplicate notification since the metric never
recovered.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Add renderZfsTables() to plugins.html with health/capacity/frag/dedup
table and cumulative I/O table; colour-code health and capacity thresholds;
add zfs_monitor to plugin_order and summary/render dispatch.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- watch: true (default) per host; watch: false suppresses all notifications
for that host in udp.py and threshold.py
- Live Dashboard and Host Overview now show only hosts where the logged-in
user is owner or manager (admins see all); WebSocket broadcasts filtered
per-connection by the same rule
- Add hbd/client/plugins/zfs_monitor.py: collects per-pool health, capacity,
fragmentation, dedup ratio, and cumulative I/O ops/bandwidth via zpool(8)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
threshold_config in the hosts section now accepts a list of named
configs applied left-to-right on top of the defaults, so focused
override profiles can be mixed without duplication. Single-string
and legacy host_threshold_mapping forms are unchanged.
- Add threshold_raw_configs to store per-config overrides separately
- Normalise threshold_config to list on parse (string or list)
- get_thresholds_for_host folds the list over the default base
- Update README and docs/THRESHOLD_ALERTING.md with examples
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- fix: matrix/sms_voipms notifications blocked the event loop on timeout;
make send_notification async, dispatch all channel drivers as non-blocking
tasks (asyncio.to_thread for sync drivers, asyncio.wait_for for async);
update all call sites to fire-and-forget via create_task
- feat: add /about page with version, runtime, uptime counter, and repo link
- fix: hbc_mini plugin data format now matches full hbc client so Host
Overview displays memory, disk, and network metrics correctly
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Server now sends a bare UPD command; client runs hb_install.sh to
reinstall from the package registry, then restarts. hb_install.sh
also copies itself alongside hbc on client installs.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace pill-tab plugin view with an accordion layout that shows key
metrics (CPU%, MEM%, top disk%, net delta, nagios status) at a glance
in each host card header. Plugin sections expand as structured tables.
- Rename page to "Host Overview" (URL /plugins unchanged)
- Three-wave parallel data loading: glance plugins on host expand,
on-demand fetch for filesystem_info and extras
- Per-plugin table renderers with inline percent bars and threshold
colour coding
- Add escHtml() for XSS-safe rendering of all field values
- Remove stale planning docs (REFACTORING.md, hbd/Plan.md)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Threshold alerts (plugin metrics, RTT) were firing immediately on the
first breach. Now every state transition to WARNING/CRITICAL starts a
grace-period timer (grace_seconds from the 'grace' config key). The
notification is deferred until the next heartbeat after grace_seconds
have elapsed. If the metric recovers within the grace window, both the
alert and the recovery are suppressed — no spurious pages for transient
spikes.
Two helper methods added to ThresholdChecker:
- _apply_grace: handles the state-change path (defer or suppress)
- _check_pending_or_renotify: handles the stable-alert path (fire
deferred notification once grace expires, or fall through to reminders)
The overdue case is unchanged — on_overdue already fires only after
interval+grace seconds of silence, which is equivalent behaviour.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
threshold.py was emitting level="RECOVERED" for metric recoveries, which
failed the is_recover check in send_notification (which only matched "RECOVER"),
bypassing _alerted_channels routing and the min_level bypass added in the
previous commit. Changed to "RECOVER" so all recovery paths are consistent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- AlertState.update() now resets last_notification when the alert level
changes, so a WARNING→CRITICAL escalation restarts the reminder interval
rather than inheriting a nearly-expired timer.
- _dispatch_to_channel() bypasses min_level for RECOVER, so recovery
notifications are delivered even after a server restart when
_alerted_channels is empty and the fallback dispatch path is used.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>