refactor monitor, add threshold rtesting
@@ -150,6 +150,18 @@ Heartbeat includes a sophisticated threshold alerting system that monitors plugi

```yaml
thresholds:
  # RTT (Round-Trip Time) thresholds for heartbeat monitoring
  # These are checked on every HTB message arrival
  rtt:
    webserver01:
      warning: 100.0   # Warn when RTT > 100ms
      critical: 500.0  # Critical when RTT > 500ms

    database01:
      warning: 50.0
      critical: 200.0

  # Plugin metric thresholds
  cpu_monitor:
    cpu_percent:
      warning: 80.0  # Warn when CPU > 80%
@@ -177,6 +189,38 @@ thresholds:
threshold_renotify_interval: 3600 # Re-notify every hour for ongoing alerts
```

### RTT Monitoring

Heartbeat monitors network latency (Round-Trip Time) for each host's heartbeat messages. RTT thresholds are **fully integrated with the threshold alerting system**:

- **Per-host configuration**: Set different thresholds for each monitored host
- **Real-time checking**: Thresholds evaluated on every HTB message arrival
- **Alert state tracking**: RTT alerts use the same state management as plugin metrics
- **Hysteresis support**: Configurable hysteresis prevents rapid state transitions
- **Alerts dashboard**: RTT alerts visible on the `/alerts` web page alongside plugin alerts
- **Smart notifications**: Only triggers on state changes (OK → WARNING → CRITICAL)
- **Re-notification**: Periodic reminders for ongoing RTT issues
- **Event & journal logging**: All RTT events logged for audit trail

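The notification rules in the list above (notify on state changes, remind periodically while an alert is ongoing) can be sketched roughly as follows. This is a minimal illustration, not Heartbeat's actual implementation: the `AlertTracker` class, its `should_notify` method, and the injectable `clock` parameter are hypothetical names; only the 3600-second interval comes from `threshold_renotify_interval` in the configuration.

```python
import time


class AlertTracker:
    """Hypothetical helper: remembers the last state per metric path."""

    def __init__(self, renotify_interval=3600.0, clock=time.monotonic):
        self.renotify_interval = renotify_interval
        self._clock = clock
        # metric_path -> (state, time of the last notification)
        self._last = {}

    def should_notify(self, metric_path, state):
        """Return True if this sample should produce a notification."""
        now = self._clock()
        prev_state, last_sent = self._last.get(metric_path, ("OK", None))

        if state != prev_state:
            # Any state change notifies, including recovery back to OK.
            self._last[metric_path] = (state, now)
            return True

        ongoing = state != "OK" and last_sent is not None
        if ongoing and now - last_sent >= self.renotify_interval:
            # Unchanged but still alerting: remind once per interval.
            self._last[metric_path] = (state, now)
            return True

        return False
```

Passing the clock in as a parameter keeps the sketch testable without waiting an hour; a real monitor would likely persist this state alongside its event log.
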
**Configuration format:**
```yaml
thresholds:
  rtt:
    <hostname>:
      warning: <milliseconds>   # Warn when RTT > this value
      critical: <milliseconds>  # Critical when RTT > this value
      hysteresis: 0.1           # Optional: 10% hysteresis (default)
```

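To make the effect of `hysteresis` concrete, here is a small sketch of how one RTT sample might be classified. The exact rule Heartbeat applies is not shown in this change, so the sketch assumes one common interpretation: an active alert only clears once the value drops below `threshold * (1 - hysteresis)`, while escalation uses the raw threshold. The `State` enum and `classify_rtt` function are hypothetical names.

```python
from enum import IntEnum


class State(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify_rtt(rtt_ms, warning, critical, previous=State.OK, hysteresis=0.1):
    """Return the new alert state for one RTT sample (in milliseconds).

    Assumption: a level that is already active uses a lowered exit
    threshold, so values hovering near the limit do not flap.
    """
    crit_limit = critical * (1 - hysteresis) if previous >= State.CRITICAL else critical
    warn_limit = warning * (1 - hysteresis) if previous >= State.WARNING else warning

    if rtt_ms > crit_limit:
        return State.CRITICAL
    if rtt_ms > warn_limit:
        return State.WARNING
    return State.OK
```

With `warning: 100.0` and the default 10% hysteresis, a host already in WARNING would stay there at 95 ms under this assumption and recover only after dropping to 90 ms or below.
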
**Example alerts:**
```
WARNING: webserver01 - rtt.webserver01 = 125.3
CRITICAL: database01 - rtt.database01 = 520.1
RECOVERED: webserver01 - rtt.webserver01 = 45.2 (WARNING -> OK)
```

RTT alerts appear on the Alerts dashboard and can be filtered by severity level. The `metric_path` format is `rtt.<hostname>`, making it easy to distinguish from plugin metrics.

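As an illustration of how the `rtt.<hostname>` convention lets RTT alerts be separated from plugin alerts, the sketch below filters a list of alerts by prefix and severity. The alert dictionaries and their `severity` and `metric_path` keys are assumptions made for this example, as are the helper names; the dashboard's real data model is not shown here.

```python
RTT_PREFIX = "rtt."


def rtt_metric_path(hostname):
    """Build the metric path for a host's RTT alert, e.g. 'rtt.webserver01'."""
    return f"{RTT_PREFIX}{hostname}"


def is_rtt_alert(metric_path):
    """True if the metric path follows the rtt.<hostname> convention."""
    return metric_path.startswith(RTT_PREFIX)


def filter_alerts(alerts, severity=None, rtt_only=False):
    """Return alerts matching an optional severity and/or the RTT prefix."""
    selected = []
    for alert in alerts:
        if severity is not None and alert["severity"] != severity:
            continue
        if rtt_only and not is_rtt_alert(alert["metric_path"]):
            continue
        selected.append(alert)
    return selected


# Hypothetical alert records, mirroring the example alerts above.
alerts = [
    {"severity": "WARNING", "metric_path": rtt_metric_path("webserver01"), "value": 125.3},
    {"severity": "CRITICAL", "metric_path": rtt_metric_path("database01"), "value": 520.1},
    {"severity": "WARNING", "metric_path": "cpu_monitor.cpu_percent", "value": 85.0},
]
rtt_warnings = filter_alerts(alerts, severity="WARNING", rtt_only=True)
```
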
### Alert Behavior

1. **State Changes**: Notifications sent when crossing thresholds
