# Threshold Alerting System

## Overview

The Heartbeat Monitoring System includes a threshold alerting system that monitors plugin metrics and sends notifications when values cross configured thresholds. It is designed to:

- **Detect anomalies**: Automatically identify when system metrics leave their safe operating range
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
- **Track state**: Maintain alert history and state transitions per host
- **Integrate seamlessly**: Work with the existing notification infrastructure (email, Pushover, etc.)

## Architecture

### Components

1. **ThresholdChecker** (`hbd/threshold.py`)
   - Main threshold checking engine
   - Parses configuration
   - Evaluates metrics against thresholds
   - Triggers notifications on state changes

2. **ThresholdConfig**
   - Individual threshold configuration
   - Supports multiple comparison operators
   - Implements hysteresis logic

3. **AlertState**
   - Tracks current alert state per metric
   - Records state transitions
   - Manages notification timing

4. **Integration Points**
   - UDP handler: Checks thresholds when plugin data arrives
   - Host objects: Store alert states per host
   - Notification system: Sends alerts via configured channels

### Alert Levels

- **OK**: Metric is within normal range
- **WARNING**: Metric has crossed the warning threshold (first-level concern)
- **CRITICAL**: Metric has crossed the critical threshold (requires immediate attention)
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)

## Configuration

### Basic Structure

Thresholds are configured in the YAML configuration file under the `thresholds` section:

```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      enabled: true
```

### Configuration Parameters

#### Required Parameters

- **warning**: Warning threshold value (numeric)
- **critical**: Critical threshold value (numeric)

Note: At least one of `warning` or `critical` must be specified.

#### Optional Parameters

- **operator**: Comparison operator (default: `">"`)
  - `">"` - Greater than
  - `">="` - Greater than or equal
  - `"<"` - Less than
  - `"<="` - Less than or equal
  - `"=="` - Equal to
  - `"!="` - Not equal to

- **hysteresis**: Fraction of the threshold value used as a recovery margin to prevent flapping (default: `0.1` = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when the value hovers near the threshold

- **enabled**: Whether this threshold is active (default: `true`)

### Comparison Operators

#### Greater Than (`>`, `>=`)

Used for metrics where **higher values are problematic**:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0   # Alert when CPU > 80%
    critical: 90.0  # Alert when CPU > 90%
    operator: ">"
```

Examples:
- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters

#### Less Than (`<`, `<=`)

Used for metrics where **lower values are problematic**:

```yaml
memory_monitor:
  available_mb:
    warning: 1000  # Alert when available memory < 1GB
    critical: 500  # Alert when available memory < 500MB
    operator: "<"
```

Examples:
- Available memory
- Free disk space
- Connection pool availability
- Battery level

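The operator semantics above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in `hbd/threshold.py`: the names `AlertLevel`, `OPERATORS`, and `evaluate` are hypothetical, hysteresis is ignored, and checking `critical` before `warning` is an assumption about evaluation order.

```python
import operator
from enum import Enum


class AlertLevel(Enum):
    """Hypothetical enum mirroring the four documented alert levels."""
    OK = "OK"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"
    UNKNOWN = "UNKNOWN"


# Map the documented operator strings to Python comparison functions.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate(value, warning=None, critical=None, op=">"):
    """Return the alert level for one metric value (hysteresis not applied)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Non-numeric data cannot be compared against a threshold.
        return AlertLevel.UNKNOWN
    compare = OPERATORS[op]
    # Assumption: the more severe threshold is checked first.
    if critical is not None and compare(number, critical):
        return AlertLevel.CRITICAL
    if warning is not None and compare(number, warning):
        return AlertLevel.WARNING
    return AlertLevel.OK
```

With the `available_mb` example above, `evaluate(750, warning=1000, critical=500, op="<")` yields `AlertLevel.WARNING`, while the same call with the default `">"` operator would report CRITICAL, which is why the operator has to match the direction in which the metric becomes problematic.
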
## Hysteresis

Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.

### How It Works

Once a metric has crossed a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), the alert is held until the value improves past a recovery threshold derived from the hysteresis setting:

```
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81

Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
```

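The walk-through above can be reproduced with a small Python sketch. The function names are hypothetical and this is not the project's implementation. The `>`/`>=` branch follows the documented formula; mirroring it for `<`/`<=` (recovery threshold above the threshold) is an assumption, and the equality operators are left out because the source does not describe how hysteresis applies to them.

```python
def recovery_threshold(threshold, hysteresis, op=">"):
    """Value the metric must pass before an active alert is released."""
    margin = abs(threshold) * hysteresis
    if op in (">", ">="):
        return threshold - margin   # documented: 90 - (90 * 0.1) = 81
    if op in ("<", "<="):
        return threshold + margin   # assumption: mirrored for "low is bad"
    raise ValueError(f"hysteresis not sketched for operator {op!r}")


def crossed(value, threshold, op=">"):
    """Plain threshold comparison without hysteresis."""
    if op == ">":
        return value > threshold
    if op == ">=":
        return value >= threshold
    if op == "<":
        return value < threshold
    if op == "<=":
        return value <= threshold
    raise ValueError(f"unsupported operator {op!r}")


def alert_active(value, was_active, threshold, hysteresis=0.1, op=">"):
    """Decide whether one threshold is (still) in its alert state."""
    if crossed(value, threshold, op):
        return True
    if not was_active:
        return False
    limit = recovery_threshold(threshold, hysteresis, op)
    # The alert is held until the value moves strictly past the limit.
    return value >= limit if op in (">", ">=") else value <= limit


# Replay the documented sequence against the CRITICAL threshold of 90.
active = False
history = []
for cpu in (91, 89, 85, 80):
    active = alert_active(cpu, active, threshold=90, hysteresis=0.1)
    history.append(active)
```

After the loop, `history` is `[True, True, True, False]`: the alert is held at 89 and 85 and released only at 80, which is below the recovery threshold of 81.
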
### Configuration Recommendations

- **Stable metrics** (CPU, memory): 10-15% hysteresis
  ```yaml
  hysteresis: 0.1
  ```

- **Very stable metrics** (disk usage): 5% hysteresis
  ```yaml
  hysteresis: 0.05
  ```

- **Counter metrics** (errors, packets): 20% hysteresis
  ```yaml
  hysteresis: 0.2
  ```

- **Binary states** (exit codes): No hysteresis
  ```yaml
  hysteresis: 0.0
  ```

## Plugin-Specific Configuration

### CPU Monitor

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1

  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15

  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"

  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```

### Memory Monitor

```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"

  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"

  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```

### Disk Monitor

Disk thresholds support **partition-specific configuration**:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05

      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"

    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"

    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"

      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```

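Elsewhere in this document, per-partition alert states are keyed by paths such as `disk_monitor./.percent`. One way such a nested section could be flattened into those keys is sketched below. This is a guess at the shape of the logic, not the actual parser: `flatten_thresholds` is a hypothetical name, and dropping the `partitions` level from the path is inferred only from the example keys.

```python
def flatten_thresholds(plugin, section, skip=("partitions",)):
    """Flatten a nested threshold section into {metric_path: settings}."""
    paths = {}

    def walk(prefix, node):
        for key, child in node.items():
            if not isinstance(child, dict):
                continue
            if "warning" in child or "critical" in child:
                # A dict holding threshold values is treated as a leaf.
                paths[f"{prefix}.{key}"] = child
            elif key in skip:
                # Assumption: grouping keys do not appear in the path.
                walk(prefix, child)
            else:
                walk(f"{prefix}.{key}", child)

    walk(plugin, section)
    return paths


disk_section = {
    "partitions": {
        "/": {
            "percent": {"warning": 80.0, "critical": 90.0, "operator": ">"},
            "free_gb": {"warning": 10.0, "critical": 5.0, "operator": "<"},
        },
        "/var": {
            "percent": {"warning": 80.0, "critical": 90.0, "operator": ">"},
        },
    }
}
flat = flatten_thresholds("disk_monitor", disk_section)
```

Here `flat` has the keys `disk_monitor./.percent`, `disk_monitor./.free_gb`, and `disk_monitor./var.percent`, each pointing at its own threshold settings.
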
### Network Monitor

```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2

  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"

  dropout_total:
    warning: 50
    critical: 200
    operator: ">"

  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"

  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```

### Nagios Runner

The Nagios plugin runner reports exit codes that can be thresholded:

```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```

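To see what this configuration implies for each standard Nagios exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN), the `>=` comparison can be traced by hand. The helper below is a hypothetical illustration of that comparison, not part of the runner.

```python
def level_for_exit_code(code, warning=1, critical=2):
    """Apply the ">=" thresholds from the example config to an exit code."""
    if code >= critical:
        return "CRITICAL"
    if code >= warning:
        return "WARNING"
    return "OK"


# Trace all four standard Nagios exit codes through the thresholds.
mapping = {code: level_for_exit_code(code) for code in range(4)}
```

With a plain `>=` comparison, `mapping[3]` comes out as `"CRITICAL"`, so a Nagios UNKNOWN result would be reported as CRITICAL. Whether the real checker treats exit code 3 specially is not stated here, so it is worth verifying if the distinction matters.
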
## Notification Behavior

### When Notifications Are Sent

Notifications are triggered on **state changes**:

1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
   ```
   WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
   ```

2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
   ```
   RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   ```
   REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
   ```

### Notification Frequency

- **State changes**: Immediate notification
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)

```yaml
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```

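The notification rules can be summarised as a small decision function. This is a sketch under assumptions: `notification_kind` and `SEVERITY` are hypothetical names, levels are plain strings, and UNKNOWN is left out because the source does not describe how transitions to or from it are notified.

```python
# Ordering used to tell an escalation from a recovery.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def notification_kind(old_level, new_level, now, last_notification,
                      renotify_interval=3600):
    """Return "ESCALATION", "RECOVERY", "REMINDER", or None."""
    if new_level != old_level:
        # State changes notify immediately, in either direction.
        if SEVERITY[new_level] > SEVERITY[old_level]:
            return "ESCALATION"
        return "RECOVERY"
    if new_level != "OK" and now - last_notification >= renotify_interval:
        # Ongoing alert: remind once the interval has elapsed.
        return "REMINDER"
    return None
```

For example, `notification_kind("CRITICAL", "CRITICAL", now=3600, last_notification=0)` returns `"REMINDER"`, while the same call with `now=3599` returns `None` because the interval has not yet elapsed.
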
### Notification Channels

Thresholds use the same notification infrastructure as heartbeat monitoring:

- **Email** (via SMTP)
- **Pushover** (mobile notifications)
- **Mattermost** (team chat)
- **Custom webhooks**

Configuration:

```yaml
# Email
toemail:
  - admin@example.com
  - oncall@example.com
fromemail: heartbeat@example.com
smtpserver: smtp.example.com
smtpport: 587
smtpuser: heartbeat@example.com
smtppassword: your-password

# Pushover
pushover_token: your-app-token
pushover_user: your-user-key
```

### Watched Hosts

Only hosts in the `watchhosts` list will trigger notifications:

```yaml
watchhosts:
  - webserver01
  - database01
  - mailserver
```

Hosts not in this list will still have thresholds checked and alert states tracked, but won't send notifications.

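The split between tracking and notifying might look roughly like the following. It is a simplified sketch with hypothetical names (`process_transition`, `send`), using plain strings for levels: the state is always recorded, and only the notification is gated on `watchhosts`.

```python
def process_transition(host, metric, new_level, alert_states, watchhosts, send):
    """Record the new level for a metric; notify only for watched hosts."""
    old_level = alert_states.get(metric, "OK")
    alert_states[metric] = new_level  # state is tracked for every host
    changed = new_level != old_level
    if changed and host in watchhosts:
        send(f"{new_level}: {host} - {metric}")
    return changed


sent = []
watched_states, unwatched_states = {}, {}
watch = ["webserver01", "database01", "mailserver"]
process_transition("webserver01", "cpu_monitor.cpu_percent", "WARNING",
                   watched_states, watch, sent.append)
process_transition("devbox", "cpu_monitor.cpu_percent", "WARNING",
                   unwatched_states, watch, sent.append)
```

Both hosts end up with a WARNING state recorded, but only the message for `webserver01` is appended to `sent`.
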
## Alert State Tracking

Each host maintains alert states for all monitored metrics:

```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```

Alert states are held in memory and are saved together with the host data (pickle).

### Alert State Information

Each `AlertState` tracks:

- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- **since**: Timestamp when current state started
- **last_value**: Most recent metric value
- **last_check**: Timestamp of last threshold check
- **notification_count**: Number of notifications sent for this alert
- **last_notification**: Timestamp of last notification

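As a rough picture of how these fields fit together, here is a dataclass sketch. It is not the actual `AlertState` class: the field names follow the list above, but the string-typed level, the default values, the `update` and `record_notification` methods, and the choice to reset `notification_count` on a state change are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Illustrative stand-in for the per-metric alert state."""
    level: str = "OK"
    since: float = 0.0
    last_value: Optional[float] = None
    last_check: float = 0.0
    notification_count: int = 0
    last_notification: float = 0.0

    def update(self, level, value, now):
        """Record a check result; return True if the level changed."""
        changed = level != self.level
        if changed:
            self.level = level
            self.since = now
            self.notification_count = 0  # assumption: counter restarts
        self.last_value = value
        self.last_check = now
        return changed

    def record_notification(self, now):
        """Note that a notification was sent for the current state."""
        self.notification_count += 1
        self.last_notification = now
```

Calling `update("WARNING", 85.0, now)` on a fresh state returns `True` and sets `since`; a later `update("WARNING", 86.0, later)` returns `False` and leaves `since` untouched while refreshing `last_value` and `last_check`.
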
### Querying Alert States

Via HTTP API (future enhancement):

```bash
GET /api/hosts/webserver01/alerts
```

Response:
```json
{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}
```

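Since the endpoint is still a planned feature, the following is only a sketch of how such a response body could be assembled from tracked states. `alerts_payload` is a hypothetical helper, states are represented as plain dictionaries for simplicity, and `duration` is assumed to be the number of whole seconds since the state began.

```python
import time


def alerts_payload(states, now=None):
    """Build a response shaped like the example above from alert states."""
    now = time.time() if now is None else now
    active = []
    summary = {"ok": 0, "warning": 0, "critical": 0}
    for metric, state in sorted(states.items()):
        level = state["level"]
        if level.lower() in summary:
            summary[level.lower()] += 1
        if level in ("WARNING", "CRITICAL"):
            active.append({
                "metric": metric,
                "level": level,
                "since": state["since"],
                "value": state["last_value"],
                "duration": int(now - state["since"]),
            })
    return {"active_alerts": active, "summary": summary}


example_states = {
    "cpu_monitor.cpu_percent": {"level": "WARNING", "since": 1234567890, "last_value": 85.0},
    "disk_monitor./.percent": {"level": "OK", "since": 1234567700, "last_value": 42.0},
}
payload = alerts_payload(example_states, now=1234568190)
```

Here `payload["active_alerts"]` contains a single WARNING entry with a `duration` of 300, and `payload["summary"]` counts one `ok` and one `warning` metric.
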
## Testing

A comprehensive test suite is provided in `test_threshold.py`:

```bash
python test_threshold.py
```

Tests cover:
- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries

## Best Practices

### 1. Start Conservative

Begin with higher thresholds to avoid alert fatigue:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0   # Start higher
    critical: 95.0  # Very high for critical
```

Adjust downward based on observed behavior.

### 2. Consider Workload Patterns

Different systems have different normal ranges:

**Web servers** (bursty traffic):
```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```

**Database servers** (steady load):
```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1  # Lower hysteresis for steady metrics
```

### 3. Use Appropriate Operators

Match the operator to the metric:

| Metric Type | Example | Operator | Reason |
|-------------|---------|----------|--------|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when the count is high |
| Health checks | Nagios exit code | `>=` | Map to standard codes |

### 4. Align with Monitoring Intervals

Ensure threshold checks align with plugin collection intervals:

```yaml
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes
```

### 5. Test Before Production

1. **Start with disabled thresholds**:
   ```yaml
   enabled: false
   ```

2. **Observe metric ranges** over a week

3. **Set thresholds** based on observed data

4. **Enable gradually**:
   ```yaml
   enabled: true
   ```

5. **Monitor for false positives**

### 6. Document Baseline Values

Keep a record of normal operating ranges:

```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month

cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```

### 7. Layer Alerts

Use WARNING for early notification, CRITICAL for immediate action:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0   # Early warning: "check in next few days"
        critical: 90.0  # Critical: "act now before outage"
```

## Troubleshooting

### No Notifications Being Sent

1. **Check if host is watched**:
   ```yaml
   watchhosts:
     - your-host-name
   ```

2. **Verify notification configuration**:
   ```yaml
   toemail:
     - admin@example.com
   smtpserver: smtp.example.com
   ```

3. **Check threshold configuration**:
   ```bash
   # Look for parsing errors in server logs
   grep "threshold" /var/log/heartbeat/hbd.log
   ```

4. **Verify metric names**:
   - Metric names must match exactly (case-sensitive)
   - Check journal or logs for actual metric names

### Too Many Alerts (Flapping)

1. **Increase hysteresis**:
   ```yaml
   hysteresis: 0.2  # Increase from 0.1 to 0.2 (20%)
   ```

2. **Adjust thresholds**:
   ```yaml
   warning: 85.0  # Increase from 80.0
   ```

3. **Increase renotification interval**:
   ```yaml
   threshold_renotify_interval: 7200  # 2 hours instead of 1
   ```

### Alerts Not Triggering

1. **Check threshold operator**:
   ```yaml
   # For available memory (alert when LOW):
   operator: "<"  # NOT ">"
   ```

2. **Verify numeric values**:
   - Ensure metric values are numeric
   - Check for unit mismatches (MB vs GB)

3. **Check if threshold is enabled**:
   ```yaml
   enabled: true  # NOT false
   ```

4. **Review hysteresis settings**:
   - Very high hysteresis may prevent state changes
   - Try reducing or disabling temporarily

### Alert State Not Recovering

1. **Check recovery threshold calculation**:
   ```
   Threshold: 90
   Hysteresis: 0.1
   Recovery: 90 - (90 * 0.1) = 81

   Value must drop below 81 to recover
   ```

2. **Temporarily disable hysteresis**:
   ```yaml
   hysteresis: 0.0
   ```

3. **Monitor actual metric values**:
   ```bash
   # Check journal for actual values
   grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
   ```

## Advanced Topics

### Custom Notification Callbacks

The ThresholdChecker supports custom notification functions:

```python
import logging

logger = logging.getLogger(__name__)

# `pagerduty` and `metrics` below are placeholders for your own client
# objects; `config` is the parsed threshold configuration.

def custom_notifier(message):
    # Send to incident management system
    pagerduty.trigger(message)

    # Log to custom system
    logger.critical(message)

    # Update dashboard
    metrics.alert_count.inc()

checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)
```

### Programmatic Access

Query alert states programmatically:

```python
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)

for alert in active:
    print(f"{alert.metric_path}: {alert.level.name} for {time.time() - alert.since}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```

### Integration with External Systems

Threshold violations can be integrated with:

- **PagerDuty**: Incident creation and escalation
- **OpsGenie**: On-call scheduling and routing
- **ServiceNow**: Ticket creation
- **Grafana**: Dashboard annotations
- **Elasticsearch**: Alert indexing and analysis

## Future Enhancements

Planned features:

1. **Composite thresholds**: Alert based on multiple metrics
   ```yaml
   composite:
     high_load_with_low_memory:
       conditions:
         - cpu_monitor.load_1min > 8.0
         - memory_monitor.available_mb < 500
   ```

2. **Time-based thresholds**: Different thresholds by time of day
   ```yaml
   schedule:
     business_hours:
       warning: 70.0
     off_hours:
       warning: 85.0
   ```

3. **Rate-of-change thresholds**: Alert on rapid changes
   ```yaml
   rate_of_change:
     metric: cpu_percent
     period: 300
     threshold: 30.0  # Alert if changes >30% in 5 minutes
   ```

4. **Alert grouping**: Combine related alerts
   ```yaml
   groups:
     disk_critical:
       metrics:
         - disk_monitor./.percent
         - disk_monitor./var.percent
       action: single_notification
   ```

5. **Maintenance windows**: Suppress alerts during planned maintenance
   ```yaml
   maintenance:
     - host: webserver01
       start: 2024-01-15T02:00:00Z
       end: 2024-01-15T04:00:00Z
   ```

## See Also

- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_thresholds_example.yaml`
- Test suite: `test_threshold.py`