docs: correct README inaccuracies found during code audit

- Add ping_monitor to built-in plugins list - Update cpu_monitor (uptime) and memory_monitor (ZFS ARC) descriptions - Replace "aggregated status" bullet with accurate per-check reporting note - Fix RTT hysteresis default: 0.1 → 0.02 - Fix client YAML config: remove non-existent server:/port: keys, use hb_port: - Fix nagios_runner commands format: plain strings → {name:, command:} dicts - Fix Supported Metrics: exit_code → actual <name>_status_code/<name>_status/<name>_output fields Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 13:45:43 -04:00
parent 1824f637b4
commit 018409e71d
1 changed files with 15 additions and 11 deletions
@@ -58,10 +58,11 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
 ### Built-in Plugins
 - `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
+- `cpu_monitor`: Monitors CPU usage, load average, frequency, process counts, and uptime
- `memory_monitor`: Monitors RAM and swap usage, available memory
+- `memory_monitor`: Monitors RAM and swap usage, available memory (ZFS ARC-aware)
 - `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
 - `network_monitor`: Monitors network interface statistics, bandwidth, and connections
 - `ping_monitor`: Measures round-trip latency to configured hosts
 - `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
 - `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
 - `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`
@@ -76,7 +77,7 @@ The `nagios_runner` plugin provides seamless integration with the vast Nagios pl
 - Validates absolute command paths at startup and warns on missing or non-executable files
 - Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
 - Extracts performance data with thresholds
- Reports aggregated status across all configured checks
+- Reports per-check status, exit code, and output; no aggregate rollup field
 See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for complete integration guide including configuration examples and custom plugin development.
@@ -224,7 +225,7 @@ thresholds:
    <hostname>:
      warning: <milliseconds>   # Warn when RTT > this value
      critical: <milliseconds>  # Critical when RTT > this value
-      hysteresis: 0.1           # Optional: 10% hysteresis (default)
+      hysteresis: 0.02          # Optional: 2% hysteresis (default)
 ```
 **Example alerts:**
@@ -275,7 +276,7 @@ All plugin metrics can be thresholded:
 - **Memory**: percent, available_mb, swap_percent
 - **Disk**: Per-partition percent, free_gb, free_mb
 - **Network**: errors_total, dropped packets, connection counts
- **Nagios**: Any field emitted by `nagios_runner` (status_code, exit_code, performance data, …)
+- **Nagios**: Any field emitted by `nagios_runner` (`<name>_status_code`, `<name>_status`, `<name>_output`, performance data fields)
 ### Display Format Templates
@@ -514,12 +515,11 @@ You can also run it via the module entrypoint:
 python -m hbd.client.main your-server.example.com
 ```
-Client configuration can also be specified in YAML:
+Client configuration can also be specified in YAML (`~/.hbc.yaml`):
 ```yaml
-server: hbd.example.com
+hb_port: 50003        # Server port (default: 50003)
-port: 50003
+interval: 30          # Heartbeat interval in seconds
 interval: 30
 plugins:
  cpu_monitor:
    interval: 300      # Check every 5 minutes (default)
@@ -533,10 +533,14 @@ plugins:
  nagios_runner:
    interval: 300      # Check every 5 minutes (default)
    commands:
-      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
+      - name: check_load
-      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
+        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - name: check_disk
        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
 ```
 The server hostname is always passed as a positional command-line argument; there is no `server:` config key.
 All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
 **Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.