docs: correct README inaccuracies found during code audit
- Add ping_monitor to built-in plugins list
- Update cpu_monitor (uptime) and memory_monitor (ZFS ARC) descriptions
- Replace "aggregated status" bullet with accurate per-check reporting note
- Fix RTT hysteresis default: 0.1 → 0.02
- Fix client YAML config: remove non-existent server:/port: keys, use hb_port:
- Fix nagios_runner commands format: plain strings → {name:, command:} dicts
- Fix Supported Metrics: exit_code → actual <name>_status_code/<name>_status/<name>_output fields
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -58,10 +58,11 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
 ### Built-in Plugins
 
 - `os_info`: Collects OS, kernel, distribution, and architecture information
-- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
-- `memory_monitor`: Monitors RAM and swap usage, available memory
+- `cpu_monitor`: Monitors CPU usage, load average, frequency, process counts, and uptime
+- `memory_monitor`: Monitors RAM and swap usage, available memory (ZFS ARC-aware)
 - `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
 - `network_monitor`: Monitors network interface statistics, bandwidth, and connections
+- `ping_monitor`: Measures round-trip latency to configured hosts
 - `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
 - `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
 - `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`
@@ -76,7 +77,7 @@ The `nagios_runner` plugin provides seamless integration with the vast Nagios pl
 - Validates absolute command paths at startup and warns on missing or non-executable files
 - Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
 - Extracts performance data with thresholds
-- Reports aggregated status across all configured checks
+- Reports per-check status, exit code, and output; no aggregate rollup field
 
 See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for complete integration guide including configuration examples and custom plugin development.
 
@@ -224,7 +225,7 @@ thresholds:
 <hostname>:
   warning: <milliseconds> # Warn when RTT > this value
   critical: <milliseconds> # Critical when RTT > this value
-  hysteresis: 0.1 # Optional: 10% hysteresis (default)
+  hysteresis: 0.02 # Optional: 2% hysteresis (default)
 ```
 
 **Example alerts:**
@@ -275,7 +276,7 @@ All plugin metrics can be thresholded:
 - **Memory**: percent, available_mb, swap_percent
 - **Disk**: Per-partition percent, free_gb, free_mb
 - **Network**: errors_total, dropped packets, connection counts
-- **Nagios**: Any field emitted by `nagios_runner` (status_code, exit_code, performance data, …)
+- **Nagios**: Any field emitted by `nagios_runner` (`<name>_status_code`, `<name>_status`, `<name>_output`, performance data fields)
 
 ### Display Format Templates
 
@@ -514,12 +515,11 @@ You can also run it via the module entrypoint:
 python -m hbd.client.main your-server.example.com
 ```
 
-Client configuration can also be specified in YAML:
+Client configuration can also be specified in YAML (`~/.hbc.yaml`):
 
 ```yaml
-server: hbd.example.com
-port: 50003
-interval: 30
+hb_port: 50003 # Server port (default: 50003)
+interval: 30 # Heartbeat interval in seconds
 plugins:
   cpu_monitor:
     interval: 300 # Check every 5 minutes (default)
@@ -533,10 +533,14 @@ plugins:
   nagios_runner:
     interval: 300 # Check every 5 minutes (default)
     commands:
-      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
-      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
+      - name: check_load
+        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
+      - name: check_disk
+        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
 ```
 
+The server hostname is always passed as a positional command-line argument; there is no `server:` config key.
+
 All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
 
 **Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.