From babb5d61aa0c42b257e7b1616542f11ee6b5f8dc Mon Sep 17 00:00:00 2001
From: Andreas Wrede
Date: Mon, 4 May 2026 12:46:35 +0200
Subject: [PATCH] docs: update README with changes since 917d6a4

- ZFS monitor plugin (zfs_monitor) added to plugin list and features
- nagios_runner: async execution, stderr capture, signal handling, path validation
- Threshold alerting: de-escalation suppression, short-duration suppression, ping_monitor thresholds
- Per-host watch flag and role-filtered dashboards
- HTTP API & Web UI: hostname links in Live View, Host Overview with ZFS renderer, alert pie chart in nav bar, Settings threshold viewer
- hbc connection retry: indefinite retry for IPv4; IPv6 dropped after 3 early startup failures
- hbc daemon mode: logs routed to syslog after daemonizing
- hbc_mini: noted zfs_monitor and IPv6 early-fail protection not available

Co-Authored-By: Claude Sonnet 4.6
---
 README.md | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 1194aa2..4c698bd 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,7 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
   - Configurable retention and backup management
 - **Plugin system for extensible monitoring** ✅
   - Collect system metrics (CPU, memory, disk, network)
+  - Monitor ZFS pool health, capacity, and I/O via `zpool(8)`
   - Execute existing Nagios monitoring plugins
   - Create custom plugins with simple Python classes
 - **Threshold alerting system** ✅
@@ -34,6 +35,8 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
   - Hysteresis to prevent alert flapping
   - Automatic notifications on state changes
   - Re-notification for ongoing alerts
+- **Per-host watch flag** — set `watch: false` on any host to silence all notifications for that host without removing its configuration ✅
+- **Role-filtered dashboards** — Live Dashboard and Host Overview show only hosts where the logged-in user is owner or manager (admins see all) ✅
 - Modular codebase suitable for unit testing and CI ✅
 
 ---
@@ -61,12 +64,16 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
 - `network_monitor`: Monitors network interface statistics, bandwidth, and connections
 - `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
 - `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
+- `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`
 
 ### Nagios Integration
 
 The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:
 
-- Executes plugins via subprocess with timeout protection
+- Executes plugins asynchronously (non-blocking) with timeout protection
+- Captures both stdout and stderr; if stdout is empty, stderr is used as the status message
+- Handles signal-killed processes (negative exit code → UNKNOWN status)
+- Validates absolute command paths at startup and warns on missing or non-executable files
 - Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
 - Extracts performance data with thresholds
 - Reports aggregated status across all configured checks
@@ -147,9 +154,11 @@ Heartbeat includes a sophisticated threshold alerting system that monitors plugi
 - **Multi-level alerts**: WARNING and CRITICAL severity levels
 - **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
 - **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
-- **Smart notifications**: Alerts only on state changes, not every check
+- **Smart notifications**: Alerts only on state changes, not every check; de-escalations (e.g. CRITICAL → WARNING) do not generate a notification
 - **Re-notifications**: Periodic reminders for ongoing alerts
+- **Short-duration suppression**: Recovery notifications are suppressed for down events under 4 seconds (avoids noise from transient blips)
 - **Journal integration**: All threshold events logged for audit trail
+- **`ping_monitor` thresholds**: Latency and packet-loss thresholds use the same format as all other plugin metrics
 
 ### Configuration
 
@@ -363,9 +372,10 @@ Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST AP
 ### Web Dashboards
 
 - **Login** (`/login`): Browser login form (shown automatically when auth is configured)
-- **Live View** (`/live`): Real-time host connectivity, latency, and messages
-- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins
-- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering
+- **Live View** (`/live`): Real-time host connectivity, latency, and messages; hostnames link directly to the Host Overview page
+- **Host Overview** (`/plugins/`): Per-host plugin metrics with ZFS pool visualization; filtered to hosts where the logged-in user is owner or manager (admins see all)
+- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering; alert count pie chart shown in the navigation bar
+- **Settings** (`/settings`): Server configuration, user management, and threshold configuration viewer
 
 ### API Endpoints
 
@@ -476,6 +486,10 @@ plugins:
 
 All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
 
+**Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
+
+**Daemon logging:** When running with `-d`, `hbc` routes all log output to syslog (`LOG_DAEMON` facility) after daemonizing. Without `-d`, logs go to stderr as usual.
+
 ### hbc_mini — single-file client (no external dependencies)
 
 `scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly — no virtualenv, no `pip install`.
@@ -531,8 +545,10 @@ python3 hbc_mini.py -m "maintenance starting" your-server.example.com
 
 - No YAML config (use JSON instead)
 - No `filesystem_info` plugin
+- No `zfs_monitor` plugin (requires `zpool(8)` and the full plugin loader)
 - `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
 - Plugins cannot be loaded from external `.py` files — all plugins are compiled in
+- No IPv6 early-fail protection — connections that fail to open at startup are silently skipped rather than retried
 
 Everything else — heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog — is identical to the full client.
 