# Threshold Alerting System
## Overview
The Heartbeat Monitoring System includes a comprehensive threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured thresholds. This system is designed to:
- **Detect anomalies**: Automatically identify when system metrics exceed safe operating ranges
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
- **Track state**: Maintain alert history and state transitions per host
- **Integrate seamlessly**: Work with existing notification infrastructure (email, pushover, etc.)
## Architecture
### Components
1. **ThresholdChecker** (`hbd/threshold.py`)
   - Main threshold checking engine
   - Parses configuration
   - Evaluates metrics against thresholds
   - Triggers notifications on state changes
2. **ThresholdConfig**
   - Individual threshold configuration
   - Supports multiple comparison operators
   - Implements hysteresis logic
3. **AlertState**
   - Tracks current alert state per metric
   - Records state transitions
   - Manages notification timing
4. **Integration Points**
   - UDP handler: Checks thresholds when plugin data arrives
   - Host objects: Store alert states per host
   - Notification system: Sends alerts via configured channels
### Alert Levels
- **OK**: Metric is within normal range
- **WARNING**: Metric has exceeded warning threshold (first-level concern)
- **CRITICAL**: Metric has exceeded critical threshold (requires immediate attention)
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)
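As an illustration of the UNKNOWN level, a checker has to decide whether a raw plugin value is numeric before comparing it. The following is a minimal sketch only; `parse_metric` and `level_if_unparseable` are hypothetical names and are not taken from `hbd/threshold.py`.

```python
from typing import Optional


def parse_metric(raw: object) -> Optional[float]:
    """Return the metric as a float, or None when it cannot be evaluated."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        # Non-numeric data such as "n/a" or None cannot be compared.
        return None


def level_if_unparseable(raw: object) -> Optional[str]:
    """Return "UNKNOWN" for non-numeric input, or None if evaluation can proceed."""
    return "UNKNOWN" if parse_metric(raw) is None else None


print(level_if_unparseable("n/a"))   # prints UNKNOWN
print(level_if_unparseable("85.5"))  # prints None
```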
## Configuration
### Basic Structure
Thresholds are configured in the YAML configuration file under the `thresholds` section:
```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      display: "display format"
      enabled: true
```
### Configuration Parameters
#### Required Parameters
- **warning**: Warning threshold value (numeric)
- **critical**: Critical threshold value (numeric)
Note: Neither value is mandatory on its own, but at least one of `warning` or `critical` must be specified.
#### Optional Parameters
- **operator**: Comparison operator (default: `">"`)
  - `">"` - Greater than
  - `">="` - Greater than or equal
  - `"<"` - Less than
  - `"<="` - Less than or equal
  - `"=="` - Equal to
  - `"!="` - Not equal to
- **hysteresis**: Hysteresis as a fraction of the threshold value, used to prevent flapping (default: `0.1` = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when a value hovers near the threshold
- **display**: Format string (f-string style placeholders) that controls how the threshold is displayed in alert messages
  - Defaults to `"(threshold: {op_symbol} {threshold_value})"`
- **enabled**: Whether this threshold is active (default: `true`)
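To make the default `display` template concrete, here is a minimal sketch that fills its two placeholders. It assumes the template is rendered with `str.format`-style substitution; the helper name `render_display` is hypothetical and the real checker may expose additional placeholders.

```python
DEFAULT_DISPLAY = "(threshold: {op_symbol} {threshold_value})"


def render_display(template: str, op_symbol: str, threshold_value: float) -> str:
    """Fill the two placeholders named in the default template."""
    return template.format(op_symbol=op_symbol, threshold_value=threshold_value)


print(render_display(DEFAULT_DISPLAY, ">", 80.0))  # prints (threshold: > 80.0)
```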
### Comparison Operators
#### Greater Than (`>`, `>=`)
Used for metrics where **higher values are problematic**:
```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0   # Alert when CPU > 80%
    critical: 90.0  # Alert when CPU > 90%
    operator: ">"
```
Examples:
- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters
#### Less Than (`<`, `<=`)
Used for metrics where **lower values are problematic**:
```yaml
memory_monitor:
  available_mb:
    warning: 1000  # Alert when available memory < 1GB
    critical: 500  # Alert when available memory < 500MB
    operator: "<"
```
Examples:
- Available memory
- Free disk space
- Connection pool availability
- Battery level
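The operator strings above map naturally onto Python's `operator` module. The sketch below shows one way a value could be classified; it ignores hysteresis, and the function name `evaluate` and the critical-before-warning ordering are assumptions rather than the documented behaviour of `ThresholdChecker`.

```python
import operator
from typing import Optional

# Operator strings from the configuration mapped to comparison functions.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate(value: float, op: str = ">",
             warning: Optional[float] = None,
             critical: Optional[float] = None) -> str:
    """Return "CRITICAL", "WARNING", or "OK" for a single value."""
    compare = OPERATORS[op]
    # Check the more severe level first so it wins when both thresholds match.
    if critical is not None and compare(value, critical):
        return "CRITICAL"
    if warning is not None and compare(value, warning):
        return "WARNING"
    return "OK"


print(evaluate(85.0, ">", warning=80.0, critical=90.0))  # prints WARNING
print(evaluate(400, "<", warning=1000, critical=500))    # prints CRITICAL
```

Checking `critical` before `warning` matters for both directions: with `<`, a value of 400 satisfies both thresholds and should be reported at the more severe level.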
## Hysteresis
Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.
### How It Works
When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), hysteresis is applied when the value improves:
```
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81
Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
```
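The recovery arithmetic in this walkthrough can be sketched in a few lines. The "alert when high" branch follows the formula shown above; mirroring it for `<` and `<=` by adding the margin is an assumption, and both function names are hypothetical.

```python
HIGH_OPS = (">", ">=")
LOW_OPS = ("<", "<=")


def recovery_threshold(threshold: float, hysteresis: float, op: str = ">") -> float:
    """Value the metric must pass before the alert level is released."""
    margin = abs(threshold) * hysteresis
    if op in HIGH_OPS:
        return threshold - margin
    if op in LOW_OPS:
        # ASSUMPTION: "alert when low" thresholds recover above threshold + margin.
        return threshold + margin
    raise ValueError(f"hysteresis is not defined for operator {op!r} in this sketch")


def still_alerting(value: float, threshold: float, hysteresis: float, op: str = ">") -> bool:
    """True while the value has not yet passed the recovery threshold."""
    recovery = recovery_threshold(threshold, hysteresis, op)
    return value >= recovery if op in HIGH_OPS else value <= recovery


for value in (91, 89, 85, 80):
    print(value, still_alerting(value, threshold=90, hysteresis=0.1))
```

With a threshold of 90 and hysteresis of 0.1, the loop reports the alert as held for 91, 89, and 85, and released at 80.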
### Configuration Recommendations
- **Stable metrics** (CPU, memory): 10-15% hysteresis
```yaml
hysteresis: 0.1
```
- **Very stable metrics** (disk usage): 5% hysteresis
```yaml
hysteresis: 0.05
```
- **Counter metrics** (errors, packets): 20% hysteresis
```yaml
hysteresis: 0.2
```
- **Binary states** (exit codes): No hysteresis
```yaml
hysteresis: 0.0
```
## Plugin-Specific Configuration
### CPU Monitor
```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1
  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15
  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"
  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```
### Memory Monitor
```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"
  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"
  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```
### Disk Monitor
Disk thresholds support **partition-specific configuration**:
```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"
    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"
    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```
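Elsewhere in this document, partition metrics appear under dotted paths such as `disk_monitor./.percent`. The sketch below shows one way nested configuration could be flattened into such paths. It is an illustration only: the function name `flatten_thresholds` is hypothetical, and dropping the `partitions` key from the path is an assumption inferred from those example paths.

```python
from typing import Dict, Tuple


def flatten_thresholds(plugin: str, node: dict,
                       prefix: Tuple[str, ...] = ()) -> Dict[str, dict]:
    """Map dotted metric paths to their threshold settings."""
    paths: Dict[str, dict] = {}
    for key, child in node.items():
        if not isinstance(child, dict):
            continue
        if "warning" in child or "critical" in child:
            # Leaf: this mapping holds the settings for one metric.
            paths[".".join((plugin, *prefix, key))] = child
        else:
            # Container level. ASSUMPTION: the "partitions" key itself is
            # left out of the path, matching "disk_monitor./.percent".
            next_prefix = prefix if key == "partitions" else prefix + (key,)
            paths.update(flatten_thresholds(plugin, child, next_prefix))
    return paths


disk_config = {
    "partitions": {
        "/": {"percent": {"warning": 80.0, "critical": 90.0},
              "free_gb": {"warning": 10.0, "critical": 5.0, "operator": "<"}},
        "/var": {"percent": {"warning": 80.0, "critical": 90.0}},
    }
}
for path in flatten_thresholds("disk_monitor", disk_config):
    print(path)
```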
### Network Monitor
```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2
  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"
  dropout_total:
    warning: 50
    critical: 200
    operator: ">"
  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"
  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```
### Nagios Runner
The Nagios plugin runner reports exit codes that can be thresholded:
```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```
## Notification Behavior
### When Notifications Are Sent
Notifications are triggered on **state changes**:
1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
```
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
```
2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
```
RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
```
3. **Re-notifications**: Periodic reminders for ongoing alerts
```
REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
```
### Notification Frequency
- **State changes**: Immediate notification
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)
```yaml
threshold_renotify_interval: 3600 # Re-notify every hour for ongoing alerts
```
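The three notification cases described in this section can be summarised as a small decision function. This is a sketch under stated assumptions: the name `notification_kind` is hypothetical, the UNKNOWN level is left out, and the real checker may order its checks differently.

```python
from typing import Optional

SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def notification_kind(previous: str, current: str,
                      seconds_since_last: float,
                      renotify_interval: float = 3600) -> Optional[str]:
    """Classify a check result as ESCALATION, RECOVERY, REMINDER, or None."""
    if SEVERITY[current] > SEVERITY[previous]:
        return "ESCALATION"
    if SEVERITY[current] < SEVERITY[previous]:
        return "RECOVERY"
    # Same level: remind only while an alert is active and the interval elapsed.
    if current != "OK" and seconds_since_last >= renotify_interval:
        return "REMINDER"
    return None


print(notification_kind("OK", "WARNING", 0))             # prints ESCALATION
print(notification_kind("CRITICAL", "OK", 0))            # prints RECOVERY
print(notification_kind("CRITICAL", "CRITICAL", 3600))   # prints REMINDER
```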
### Notification Channels
Thresholds use the same notification infrastructure as heartbeat monitoring:
- **Email** (via SMTP)
- **Pushover** (mobile notifications)
- **Mattermost** (team chat)
- **Custom webhooks**
Configuration:
```yaml
# Email
toemail:
- admin@example.com
- oncall@example.com
fromemail: heartbeat@example.com
smtpserver: smtp.example.com
smtpport: 587
smtpuser: heartbeat@example.com
smtppassword: your-password
# Pushover
pushover_token: your-app-token
pushover_user: your-user-key
```
### Watched Hosts
Only hosts in the `watchhosts` list will trigger notifications:
```yaml
watchhosts:
- webserver01
- database01
- mailserver
```
Hosts not in this list will still have thresholds checked and alert states tracked, but won't send notifications.
## Alert State Tracking
Each host maintains alert states for all monitored metrics:
```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```
Alert states persist in memory and are saved with host data (pickle).
### Alert State Information
Each `AlertState` tracks:
- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- **since**: Timestamp when current state started
- **last_value**: Most recent metric value
- **last_check**: Timestamp of last threshold check
- **notification_count**: Number of notifications sent for this alert
- **last_notification**: Timestamp of last notification
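The fields listed above could be modelled as a small dataclass. The following is a sketch for orientation, not the actual `AlertState` class: the names `AlertLevel` and `AlertStateSketch`, the `update` method, and the choice to reset `notification_count` on a level change are all assumptions.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class AlertLevel(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


@dataclass
class AlertStateSketch:
    level: AlertLevel = AlertLevel.OK
    since: float = field(default_factory=time.time)
    last_value: Optional[float] = None
    last_check: Optional[float] = None
    notification_count: int = 0
    last_notification: Optional[float] = None

    def update(self, level: AlertLevel, value: float, now: float) -> bool:
        """Record a check; return True when the level changed."""
        changed = level != self.level
        if changed:
            self.level = level
            self.since = now              # a new state starts now
            self.notification_count = 0   # ASSUMPTION: counter restarts per state
        self.last_value = value
        self.last_check = now
        return changed


state = AlertStateSketch(since=100.0)
print(state.update(AlertLevel.WARNING, 85.0, now=200.0))  # prints True
print(state.update(AlertLevel.WARNING, 86.0, now=260.0))  # prints False
```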
### Querying Alert States
Via HTTP API (future enhancement):
```bash
GET /api/hosts/webserver01/alerts
```
Response:
```json
{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}
```
## Testing
A comprehensive test suite is provided in `test_threshold.py`:
```bash
python test_threshold.py
```
Tests cover:
- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries
## Best Practices
### 1. Start Conservative
Begin with higher thresholds to avoid alert fatigue:
```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0   # Start higher
    critical: 95.0  # Very high for critical
```
Adjust downward based on observed behavior.
### 2. Consider Workload Patterns
Different systems have different normal ranges:
**Web servers** (bursty traffic):
```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```
**Database servers** (steady load):
```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1  # Lower hysteresis for steady metrics
```
### 3. Use Appropriate Operators
Match the operator to the metric:
| Metric Type | Example | Operator | Reason |
|-------------|---------|----------|--------|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when increasing |
| Health checks | Nagios exit code | `>=` | Map to standard codes |
### 4. Align with Monitoring Intervals
Ensure threshold checks align with plugin collection intervals:
```yaml
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes
```
### 5. Test Before Production
1. **Start with disabled thresholds**:
```yaml
enabled: false
```
2. **Observe metric ranges** over a week
3. **Set thresholds** based on observed data
4. **Enable gradually**:
```yaml
enabled: true
```
5. **Monitor for false positives**
### 6. Document Baseline Values
Keep a record of normal operating ranges:
```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month
cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```
### 7. Layer Alerts
Use WARNING for early notification, CRITICAL for immediate action:
```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0   # Early warning: "check in next few days"
        critical: 90.0  # Critical: "act now before outage"
```
## Troubleshooting
### No Notifications Being Sent
1. **Check if host is watched**:
```yaml
watchhosts:
- your-host-name
```
2. **Verify notification configuration**:
```yaml
toemail:
- admin@example.com
smtpserver: smtp.example.com
```
3. **Check threshold configuration**:
```bash
# Look for parsing errors in server logs
grep "threshold" /var/log/heartbeat/hbd.log
```
4. **Verify metric names**:
- Metric names must match exactly (case-sensitive)
- Check journal or logs for actual metric names
### Too Many Alerts (Flapping)
1. **Increase hysteresis**:
```yaml
hysteresis: 0.2 # Increase from 0.1 to 0.2 (20%)
```
2. **Adjust thresholds**:
```yaml
warning: 85.0 # Increase from 80.0
```
3. **Increase renotification interval**:
```yaml
threshold_renotify_interval: 7200 # 2 hours instead of 1
```
### Alerts Not Triggering
1. **Check threshold operator**:
```yaml
# For available memory (alert when LOW):
operator: "<" # NOT ">"
```
2. **Verify numeric values**:
- Ensure metric values are numeric
- Check for unit mismatches (MB vs GB)
3. **Check if threshold is enabled**:
```yaml
enabled: true # NOT false
```
4. **Review hysteresis settings**:
- Very high hysteresis may prevent state changes
- Try reducing or disabling temporarily
### Alert State Not Recovering
1. **Check recovery threshold calculation**:
```
Threshold: 90
Hysteresis: 0.1
Recovery: 90 - (90 * 0.1) = 81
Value must drop below 81 to recover
```
2. **Temporarily disable hysteresis**:
```yaml
hysteresis: 0.0
```
3. **Monitor actual metric values**:
```bash
# Check journal for actual values
grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
```
## Advanced Topics
### Custom Notification Callbacks
The ThresholdChecker supports custom notification functions:
```python
# pagerduty, logger, and metrics are placeholders for your own client objects.
def custom_notifier(message):
    # Send to incident management system
    pagerduty.trigger(message)
    # Log to custom system
    logger.critical(message)
    # Update dashboard
    metrics.alert_count.inc()


checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier,
)
```
### Programmatic Access
Query alert states programmatically:
```python
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)
for alert in active:
    duration = time.time() - alert.since
    print(f"{alert.metric_path}: {alert.level.name} for {duration:.0f}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```
### Integration with External Systems
Threshold violations can be integrated with:
- **PagerDuty**: Incident creation and escalation
- **OpsGenie**: On-call scheduling and routing
- **ServiceNow**: Ticket creation
- **Grafana**: Dashboard annotations
- **Elasticsearch**: Alert indexing and analysis
## Future Enhancements
Planned features:
1. **Composite thresholds**: Alert based on multiple metrics
```yaml
composite:
  high_load_with_low_memory:
    conditions:
      - cpu_monitor.load_1min > 8.0
      - memory_monitor.available_mb < 500
```
2. **Time-based thresholds**: Different thresholds by time of day
```yaml
schedule:
  business_hours:
    warning: 70.0
  off_hours:
    warning: 85.0
```
3. **Rate-of-change thresholds**: Alert on rapid changes
```yaml
rate_of_change:
  metric: cpu_percent
  period: 300
  threshold: 30.0  # Alert if changes >30% in 5 minutes
```
4. **Alert grouping**: Combine related alerts
```yaml
groups:
  disk_critical:
    metrics:
      - disk_monitor./.percent
      - disk_monitor./var.percent
    action: single_notification
```
5. **Maintenance windows**: Suppress alerts during planned maintenance
```yaml
maintenance:
  - host: webserver01
    start: 2024-01-15T02:00:00Z
    end: 2024-01-15T04:00:00Z
```
## See Also
- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_thresholds_example.yaml`
- Test suite: `test_threshold.py`
## Multi-Threshold Configuration
**New in version 2.0**: Support for multiple named threshold configurations with per-host mapping.
### Overview
The multi-threshold feature allows you to:
- Define multiple sets of threshold configurations
- Map different hosts to different threshold sets
- Use different sensitivity levels for different environments
- Maintain a default configuration for unmapped hosts
### Configuration Structure
```yaml
# Optional: Set the default configuration name (defaults to "default")
default_threshold_config: "default"
# Define multiple named threshold configurations
threshold_configs:
  # Configuration name 1
  default:
    thresholds:
      # Standard threshold definitions
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0
  # Configuration name 2
  high_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 60.0
          critical: 75.0
  # Configuration name 3
  low_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0
          critical: 95.0
# Map specific hosts to specific configurations
host_threshold_mapping:
  prod-web-01: high_sensitivity
  prod-web-02: high_sensitivity
  dev-server-01: low_sensitivity
  # Unmapped hosts use default_threshold_config
```
### Use Cases
#### 1. Environment-Based Thresholds
Different thresholds for production vs. development:
```yaml
threshold_configs:
  production:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0  # Alert earlier in production
          critical: 85.0
  development:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0  # More relaxed for dev
          critical: 98.0
host_threshold_mapping:
  prod-web-01: production
  prod-web-02: production
  dev-web-01: development
  dev-web-02: development
```
#### 2. Server Role-Based Thresholds
Different thresholds based on server function:
```yaml
threshold_configs:
  webserver:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0
  database:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0
          critical: 85.0
      memory_monitor:
        percent:
          warning: 90.0  # Databases can use high memory
          critical: 97.0
      disk_monitor:
        partitions:
          /var/lib/mysql:
            percent:
              warning: 75.0
              critical: 85.0
  cache:
    thresholds:
      memory_monitor:
        percent:
          warning: 95.0  # Redis/Memcached can use very high memory
          critical: 99.0
host_threshold_mapping:
  web-01: webserver
  web-02: webserver
  db-01: database
  db-02: database
  redis-01: cache
  memcached-01: cache
```
#### 3. Sensitivity Levels
Different sensitivity for critical vs. non-critical systems:
```yaml
threshold_configs:
  critical:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 70.0  # Very sensitive
              critical: 80.0
              hysteresis: 0.15
  standard:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 85.0
              critical: 95.0
              hysteresis: 0.1
  relaxed:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 90.0
              critical: 98.0
              hysteresis: 0.05
host_threshold_mapping:
  payment-gateway: critical
  auth-server: critical
  web-01: standard
  web-02: standard
  test-server: relaxed
```
### Backward Compatibility
The legacy single threshold configuration is fully supported:
```yaml
# Old format - still works
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      critical: 90.0
```
This is equivalent to:
```yaml
# New format
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0
```
### Configuration Priority
1. **Host-specific mapping**: If host is in `host_threshold_mapping`, use that config
2. **Default config**: Use `default_threshold_config`
3. **First alphabetically**: If default not found, use first config alphabetically
4. **Legacy fallback**: If `threshold_configs` not present, use `thresholds`
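The priority order above can be sketched as a lookup function over the parsed configuration. This is an illustration, not the checker's actual code: the name `resolve_thresholds` is hypothetical, and treating a host mapped to an undefined configuration name as if it were unmapped is an assumption.

```python
def resolve_thresholds(config: dict, hostname: str) -> dict:
    """Pick the threshold set for a host following the documented priority."""
    named = config.get("threshold_configs")
    if not named:
        # 4. Legacy fallback: only the single `thresholds` section exists.
        return config.get("thresholds", {})

    # 1. Host-specific mapping.
    name = config.get("host_threshold_mapping", {}).get(hostname)
    if name not in named:
        # 2. Default config (ASSUMPTION: also used for unknown mapped names).
        name = config.get("default_threshold_config", "default")
    if name not in named:
        # 3. First configuration name alphabetically.
        name = sorted(named)[0]
    return named[name].get("thresholds", {})


config = {
    "threshold_configs": {
        "default": {"thresholds": {"cpu_monitor": {"cpu_percent": {"warning": 80.0}}}},
        "high_sensitivity": {"thresholds": {"cpu_monitor": {"cpu_percent": {"warning": 60.0}}}},
    },
    "host_threshold_mapping": {"prod-web-01": "high_sensitivity"},
}
print(resolve_thresholds(config, "prod-web-01")["cpu_monitor"]["cpu_percent"]["warning"])  # prints 60.0
print(resolve_thresholds(config, "other-host")["cpu_monitor"]["cpu_percent"]["warning"])   # prints 80.0
```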
### Example: Complete Multi-Threshold Setup
See `hbd/config_multi_threshold_example.yaml` for a complete example with:
- 4 named configurations (default, high_sensitivity, low_sensitivity, database)
- Host-to-config mappings for production, development, and test systems
- Specialized database server thresholds
- Custom display messages with plugin data