heartbeat

Author	SHA1	Message	Date
andreas	aef9e7769b	fix: zfs_monitor alerts dropped on restart with wildcard pool thresholds purge_stale_alerts used _find_threshold to validate alert state keys, but _find_threshold has no wildcard matching. A threshold configured as "zfs_monitor.*.status" never matched the concrete alert state key "zfs_monitor.tank.status", so every restart silently purged active ZFS pool alert states and reset the grace period from scratch. Also fix _check_pending_or_renotify to set last_notification after the grace-period notification fires, so the re-notification interval is anchored to when the alert was actually sent rather than the next PLG cycle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-09 07:42:09 -04:00
andreas	2e8bcb630d	fix: show human-readable duration in re-notification messages Replace raw seconds with d h m s format in "ongoing for ..." strings. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-09 06:53:41 -04:00
andreas	b95f1a5bb7	fix: agree: zpool ONLINE=OK, DEGRADED=WARNING, all else is CRITICAL	2026-05-08 17:18:41 -04:00
andreas	217bba1b76	fix: change health_ok to status	2026-05-08 16:57:45 -04:00
andreas	967e05ed74	threshold: synthesize health_ok server-side for older ZFS clients Older hbd clients send zfs_monitor data with a `health` string but no `health_ok` numeric field (added in a recent plugin update). Without health_ok in the data, the wildcard threshold check found nothing and no CRITICAL alert was raised for DEGRADED/SUSPENDED pools. Synthesize health_ok from the health string in the server's nested- metric loop so alerts fire regardless of client version. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-08 16:39:16 -04:00
andreas	b9db0c552e	feat: alert CRITICAL on degraded or suspended ZFS pools	2026-05-08 16:23:49 -04:00
andreas	1ddc4b8132	threshold/alerts: strip _status_code suffix from displayed metric names Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-08 06:19:16 -04:00
andreas	28f5fa951c	ui: show metric name inline with hostname in alerts and notifications Alerts page: move metric name into the header row alongside hostname. Notifications: include metric name in title (hostname metric) and strip the metric prefix from the body so it contains only value/detail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-07 06:26:27 -04:00
andreas	1e4263b793	fix: threshold and logging improvements - threshold: fix crash when display is None (_format_display now falls back to default format string instead of calling None.format()) - threshold: shorten notification messages by stripping plugin-name prefix from metric_path (cpu_percent instead of cpu_monitor.cpu_percent) - main: demote aiohttp.access log records from INFO to DEBUG - udp: replace debug print with proper logger.info for new host sign-on Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:06:56 -04:00
andreas	1824f637b4	fix: always show THRESHOLD_DEFAULTS in Settings threshold config Seed threshold_configs["default"] from THRESHOLD_DEFAULTS at the start of _parse_config() so the Settings page displays built-in defaults regardless of whether the server config uses the multi-config format, the legacy thresholds: format, or has no threshold config at all. _parse_multi_config() overwrites the seed with the fully-merged effective defaults when a threshold_configs section is present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 13:02:28 -04:00
andreas	a534c06b26	feat: nagios operator for direct exit-code severity mapping Add ComparisonOperator.NAGIOS ("nagios") that maps Nagios exit codes directly to alert levels (0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN) without requiring numeric warning/critical thresholds. Hysteresis is bypassed for discrete codes. Display template defaults to "{check_name}: {output}". _format_display() handles None threshold_value gracefully. Add nagios_runner.status_code as a built-in default threshold config so nagios checks alert out of the box. Also: fix alerts.html scrolling (override html,body), make hostname a link to /plugins#<hostname>, remove overall_status/overall_status_code/plugin_count from nagios_runner and hbc_mini, replace with computed worst-status in plugins.html via nagiosWorstStatus() helper. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 12:26:56 -04:00
andreas	ae447ac4a6	feat: nagios_runner improvements and alerts page fixes - nagios_runner: remove overall_status/overall_status_code/plugin_count fields; each command still reports its own <name>_status and <name>_status_code - threshold: expose {output} and {status} aliases in display templates for nagios_runner generic matches (mapped from <check_name>_output/status) - alerts.html: fix scrolling by overriding html,body height/overflow (style.css sets both); make hostname a link to /plugins/<hostname> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 11:05:45 -04:00
andreas	b1985d0eb2	feat: generic threshold matching for nagios_runner with {check_name} display support _find_threshold() now returns the stripped prefix ("check_name") alongside the ThresholdConfig, enabling a single generic entry (e.g. nagios_runner.status_code) to cover all per-command metrics (check_disk_root_status_code, check_load_status_code, …). The prefix is threaded through to _format_display() as {check_name}, with {metric_name} also available in display templates. purge_stale_alerts() updated to use generic matching so it does not incorrectly drop alerts on generic-matched metrics. README updated with Display Format Templates and Generic Threshold Matching sections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 10:48:17 -04:00
andreas	de778f680f	fix: reduce default hysteresis 10%→2%; show recovery threshold in alerts UI The 10% default hysteresis created an unreasonably wide recovery band: a 95% threshold would only clear once the value dropped below 85.5%, causing alerts to linger long after the metric was well below the trigger level. Change default hysteresis to 2% across all threshold parsers (plugin metrics, partitions, RTT). For a 95% threshold, recovery is now at 93.1% instead of 85.5%. Add AlertState.hysteresis field (set on every check, cleared on OK) and expose recovery_threshold in to_dict() so the Alerts dashboard can display "recovers < 93.1" alongside the trigger threshold, making the hysteresis band visible to the user. Pickle backward-compatible via __setstate__. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-04 14:47:50 -04:00
andreas	3301dbfe34	feat: owner Update/Delete buttons on Host Overview; purge stale alerts on reload Host Overview (plugins.html): show Update and Delete buttons in the host-right zone when the logged-in user is the host owner (or admin / unauthenticated mode). Buttons link to /u?h=<host> and /d?h=<host> with stopPropagation so they don't toggle the accordion; Delete prompts for confirmation first. ThresholdChecker.purge_stale_alerts(): removes alert states whose metric_path has no matching threshold in the current config. Called after startup pickle restore and after every SIGHUP config reload so alerts orphaned by upgrades or config changes do not persist indefinitely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-04 08:03:46 -04:00
Andreas Wrede	a76d0fc840	feat: generic ping_monitor thresholds; round RTT to nearest ms - threshold.py: add _find_threshold() with suffix fallback so thresholds like ping_monitor.rtt_avg match ping_monitor.8_8_8_8_rtt_avg etc.; each pinged host keeps its own alert state - hbdclass.py: format RTT as integer ms (round()) - live.html: JS RTT display rounded to nearest ms (Math.round) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-05-03 06:08:11 -04:00
Andreas Wrede	28e2180f7b	fix: suppress notifications on alert de-escalation (e.g. CRITICAL→WARNING) Only notify on worsening transitions (OK→WARNING, OK→CRITICAL, WARNING→CRITICAL) and recovery (any→OK). De-escalation within alert states no longer sends a duplicate notification since the metric never recovered. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 14:27:18 -04:00
Andreas Wrede	691f62aa69	feat: host-level watch flag suppresses notifications; filter dashboard/overview by owner/manager; add ZFS monitor plugin - watch: true (default) per host; watch: false suppresses all notifications for that host in udp.py and threshold.py - Live Dashboard and Host Overview now show only hosts where the logged-in user is owner or manager (admins see all); WebSocket broadcasts filtered per-connection by the same rule - Add hbd/client/plugins/zfs_monitor.py: collects per-pool health, capacity, fragmentation, dedup ratio, and cumulative I/O ops/bandwidth via zpool(8) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 12:42:35 -04:00
Andreas Wrede	cffc9805f9	fix: mask api_password and access_token in settings page; add List to threshold imports Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 11:51:55 -04:00
Andreas Wrede	917d6a401b	feat: composable threshold_config list for per-host threshold layering threshold_config in the hosts section now accepts a list of named configs applied left-to-right on top of the defaults, so focused override profiles can be mixed without duplication. Single-string and legacy host_threshold_mapping forms are unchanged. - Add threshold_raw_configs to store per-config overrides separately - Normalise threshold_config to list on parse (string or list) - get_thresholds_for_host folds the list over the default base - Update README and docs/THRESHOLD_ALERTING.md with examples Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 10:35:23 -04:00
Andreas Wrede	c4f09e9ced	version 5.1.8 - fix: matrix/sms_voipms notifications blocked the event loop on timeout; make send_notification async, dispatch all channel drivers as non-blocking tasks (asyncio.to_thread for sync drivers, asyncio.wait_for for async); update all call sites to fire-and-forget via create_task - feat: add /about page with version, runtime, uptime counter, and repo link - fix: hbc_mini plugin data format now matches full hbc client so Host Overview displays memory, disk, and network metrics correctly Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 05:33:27 -04:00
andreas	990c658e65	Apply grace period to all threshold alerts before logging/notifying Threshold alerts (plugin metrics, RTT) were firing immediately on the first breach. Now every state transition to WARNING/CRITICAL starts a grace-period timer (grace_seconds from the 'grace' config key). The notification is deferred until the next heartbeat after grace_seconds have elapsed. If the metric recovers within the grace window, both the alert and the recovery are suppressed — no spurious pages for transient spikes. Two helper methods added to ThresholdChecker: - _apply_grace: handles the state-change path (defer or suppress) - _check_pending_or_renotify: handles the stable-alert path (fire deferred notification once grace expires, or fall through to reminders) The overdue case is unchanged — on_overdue already fires only after interval+grace seconds of silence, which is equivalent behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 12:00:40 +02:00
andreas	b78d6ac0fe	Fix RECOVER routing: use consistent level name and route via alerted channel threshold.py was emitting level="RECOVERED" for metric recoveries, which failed the is_recover check in send_notification (which only matched "RECOVER"), bypassing _alerted_channels routing and the min_level bypass added in the previous commit. Changed to "RECOVER" so all recovery paths are consistent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 11:29:04 +02:00
andreas	afd5060f59	Fix early reminder notifications and lost recovery notifications - AlertState.update() now resets last_notification when the alert level changes, so a WARNING→CRITICAL escalation restarts the reminder interval rather than inheriting a nearly-expired timer. - _dispatch_to_channel() bypasses min_level for RECOVER, so recovery notifications are delivered even after a server restart when _alerted_channels is empty and the fallback dispatch path is used. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 18:11:22 +02:00
Andreas Wrede	0199ca4693	re-factor notifications, add sms and matrix as channels	2026-04-12 11:21:21 -04:00
Andreas Wrede	2468386f24	adjust default log, pick and config locations. renotify on critical only, make user sessions persistent	2026-04-10 13:24:57 -04:00
Andreas Wrede	9eedbafe97	Show overdue in alerts instead of null	2026-04-10 09:20:28 -04:00
Andreas Wrede	a5f31c5cb5	update pickled data structures	2026-04-10 09:18:38 -04:00
Andreas Wrede	ba27d2e300	Add count to rtt threshold	2026-04-10 08:07:50 -04:00
Andreas Wrede	d281ac5a70	provide defaults for threshold_configs	2026-04-10 07:47:39 -04:00
Andreas Wrede	73aa89f8f4	fix web page issues	2026-04-04 12:43:30 -04:00
Andreas Wrede	941f3ea4b0	display and acknowledge alerts	2026-04-03 06:35:45 -04:00
Andreas Wrede	c5770006f7	hbc proper termination, hbd config reloadable	2026-04-02 07:17:00 -04:00
Andreas Wrede	460d2be9e9	Fix rtt, including bug in time compute	2026-04-01 19:41:53 -04:00
Andreas Wrede	090d341244	per-client threshold config	2026-04-01 15:22:42 -04:00
Andreas Wrede	079e84f729	display tag for alerts, cleanup udp	2026-04-01 11:49:55 -04:00
Andreas Wrede	dd23d9d163	refactor monitor, add threshold testing	2026-03-31 12:22:03 -04:00
Andreas Wrede	ad7178ebcb	Move threshold to server, move eventlog to notify	2026-03-29 20:29:33 -04:00
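
The recovery figures quoted in commit de778f680f (a 95% trigger clearing at 93.1% with 2% hysteresis, versus 85.5% with the former 10%) can be checked with a minimal sketch. The function name `recovery_threshold` and its signature are hypothetical and are not taken from threshold.py; the sketch assumes an upper-bound threshold whose hysteresis is a fraction of the trigger value.

```python
def recovery_threshold(trigger: float, hysteresis: float = 0.02) -> float:
    """Value the metric must drop below before an upper-bound alert clears.

    Assumption: hysteresis is expressed as a fraction of the trigger value,
    which is consistent with the numbers in the commit message.
    """
    return trigger * (1.0 - hysteresis)


# Rounded to one decimal, as the Alerts dashboard example "recovers < 93.1" suggests.
print(round(recovery_threshold(95.0), 1))        # 93.1 (2% default)
print(round(recovery_threshold(95.0, 0.10), 1))  # 85.5 (former 10% default)
```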

38 Commits
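
Commit aef9e7769b describes a wildcard threshold such as "zfs_monitor.*.status" failing to match the concrete alert state key "zfs_monitor.tank.status". The sketch below shows one way such a match could work, using the standard-library `fnmatch` module. The helper name `find_threshold_key` is hypothetical, and the actual matching logic in threshold.py may differ.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional


def find_threshold_key(metric_path: str, configured: Iterable[str]) -> Optional[str]:
    """Return the configured threshold key covering metric_path, or None.

    Hypothetical helper: exact keys win, then shell-style wildcard patterns.
    Note that fnmatch's "*" also matches dots, so it is looser than a
    per-segment wildcard would be.
    """
    keys = list(configured)
    if metric_path in keys:
        return metric_path
    for pattern in keys:
        if fnmatchcase(metric_path, pattern):
            return pattern
    return None


thresholds = ["zfs_monitor.*.status", "cpu_monitor.cpu_percent"]
print(find_threshold_key("zfs_monitor.tank.status", thresholds))
print(find_threshold_key("disk_monitor.root.used", thresholds))
```

With a lookup like this, a purge step would keep any alert state whose key returns a non-None match and drop only those with no covering threshold.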