# Threshold Alerting System

## Overview

The Heartbeat Monitoring System includes a threshold alerting system that monitors plugin metrics and triggers notifications when values cross configured thresholds. This system is designed to:

- **Detect anomalies**: Automatically identify when system metrics leave safe operating ranges
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
- **Track state**: Maintain alert history and state transitions per host
- **Integrate seamlessly**: Work with existing notification infrastructure (email, Pushover, etc.)

## Architecture

### Components

1. **ThresholdChecker** (`hbd/threshold.py`)
   - Main threshold checking engine
   - Parses configuration
   - Evaluates metrics against thresholds
   - Triggers notifications on state changes

2. **ThresholdConfig**
   - Individual threshold configuration
   - Supports multiple comparison operators
   - Implements hysteresis logic

3. **AlertState**
   - Tracks current alert state per metric
   - Records state transitions
   - Manages notification timing

4. **Integration Points**
   - UDP handler: Checks thresholds when plugin data arrives
   - Host objects: Store alert states per host
   - Notification system: Sends alerts via configured channels

### Alert Levels

- **OK**: Metric is within normal range
- **WARNING**: Metric has exceeded warning threshold (first-level concern)
- **CRITICAL**: Metric has exceeded critical threshold (requires immediate attention)
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)

## Configuration

### Basic Structure

Thresholds are configured in the YAML configuration file under the `thresholds` section:

```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      display: "display format"
      enabled: true
```

### Configuration Parameters

#### Required Parameters

- **warning**: Warning threshold value (numeric)
- **critical**: Critical threshold value (numeric)

Note: At least one of `warning` or `critical` must be specified; the other may be omitted.

#### Optional Parameters

- **operator**: Comparison operator (default: `">"`)
  - `">"` - Greater than
  - `">="` - Greater than or equal
  - `"<"` - Less than
  - `"<="` - Less than or equal
  - `"=="` - Equal to
  - `"!="` - Not equal to

- **hysteresis**: Hysteresis, expressed as a fraction of the threshold, to prevent flapping (default: `0.1` = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when the value hovers near a threshold

- **display**: Format string for alert messages, using f-string-style `{placeholder}` fields
  - Defaults to `"(threshold: {op_symbol} {threshold_value})"`
- **enabled**: Whether this threshold is active (default: `true`)

### Comparison Operators

#### Greater Than (`>`, `>=`)

Used for metrics where **higher values are problematic**:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0   # Alert when CPU > 80%
    critical: 90.0  # Alert when CPU > 90%
    operator: ">"
```

Examples:
- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters

#### Less Than (`<`, `<=`)

Used for metrics where **lower values are problematic**:

```yaml
memory_monitor:
  available_mb:
    warning: 1000  # Alert when available memory < 1GB
    critical: 500  # Alert when available memory < 500MB
    operator: "<"
```

Examples:
- Available memory
- Free disk space
- Connection pool availability
- Battery level

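The operator strings above map naturally onto Python's standard `operator` module. The following is a minimal sketch of how a single value could be classified, not the actual code in `hbd/threshold.py`: the `evaluate` helper and the `OPERATORS` table are hypothetical names, hysteresis is ignored, and checking `critical` before `warning` is an assumption.

```python
import operator

# Map the configured operator strings to comparison functions.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate(value, warning=None, critical=None, op=">"):
    """Classify one metric value (hypothetical helper, no hysteresis)."""
    # Non-numeric data cannot be compared, so it is reported as UNKNOWN.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return "UNKNOWN"
    compare = OPERATORS[op]
    # Assumption: the more severe level is tested first.
    if critical is not None and compare(value, critical):
        return "CRITICAL"
    if warning is not None and compare(value, warning):
        return "WARNING"
    return "OK"


print(evaluate(85.0, warning=80.0, critical=90.0))        # high CPU
print(evaluate(400, warning=1000, critical=500, op="<"))  # low free memory
```

With `op="<"` the same function covers "alert when low" metrics, which is why the operator has to match the direction in which the metric becomes a problem.
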
## Hysteresis

Hysteresis prevents alert flapping by requiring a value to improve by a certain margin before an alert state is allowed to recover.

### How It Works

When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), hysteresis is applied once the value starts to improve. The example below uses the `>` operator:

```
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81

Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
```

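The arithmetic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: `recovery_threshold` and `still_alerting` are hypothetical names, and the mirrored behaviour for `<` and `<=` (recovery threshold above the threshold) is an assumption, since the example above only covers `>`.

```python
def recovery_threshold(threshold, hysteresis, op=">"):
    """Value the metric must pass before a raised alert is released."""
    margin = abs(threshold) * hysteresis
    if op in (">", ">="):
        return threshold - margin
    if op in ("<", "<="):
        # Assumption: mirrored band for "alert when low" metrics.
        return threshold + margin
    # Assumption: "==" and "!=" have no hysteresis band.
    return threshold


def still_alerting(value, threshold, hysteresis, op=">"):
    """True while an already-raised alert should be held."""
    recovery = recovery_threshold(threshold, hysteresis, op)
    if op in (">", ">="):
        return value >= recovery  # must drop below the recovery threshold
    if op in ("<", "<="):
        return value <= recovery  # must rise above the recovery threshold
    return False


# Replay the CPU example: threshold 90, hysteresis 0.1.
for value in (89, 85, 80):
    print(value, still_alerting(value, 90, 0.1))
```

Once `still_alerting` returns `False`, the value would be evaluated against the plain thresholds again, which is how a CRITICAL alert can recover to WARNING rather than straight to OK.
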
### Configuration Recommendations

- **Stable metrics** (CPU, memory): 10-15% hysteresis
  ```yaml
  hysteresis: 0.1
  ```

- **Very stable metrics** (disk usage): 5% hysteresis
  ```yaml
  hysteresis: 0.05
  ```

- **Counter metrics** (errors, packets): 20% hysteresis
  ```yaml
  hysteresis: 0.2
  ```

- **Binary states** (exit codes): No hysteresis
  ```yaml
  hysteresis: 0.0
  ```

## Plugin-Specific Configuration

### CPU Monitor

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1

  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15

  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"

  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```

### Memory Monitor

```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"

  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"

  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```

### Disk Monitor

Disk thresholds support **partition-specific configuration**:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05

      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"

    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"

    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"

      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```

### ZFS Monitor

ZFS pool health is checked automatically for every pool. A pool in any state
other than `ONLINE` (e.g. `DEGRADED`, `SUSPENDED`, `FAULTED`, `UNAVAIL`) raises
a **CRITICAL** alert by default; no configuration is required.

The default threshold is equivalent to:

```yaml
zfs_monitor:
  pools:
    '*':
      status:
        critical: 1
        operator: ">"
        hysteresis: 0.0
        display: "ZFS pool {pool_name} is {health}"
```

`'*'` matches every pool on the host. The notification message includes the pool
name and its current health string, e.g. `ZFS pool tank is DEGRADED`.

**Override for specific pools**: named pool entries take priority over `'*'`:

```yaml
zfs_monitor:
  pools:
    # Suppress health alerts for a scratch pool (not mission-critical)
    scratch:
      status:
        enabled: false

    # Capacity threshold for a specific pool
    tank:
      capacity:
        warning: 75.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
```

**Alert state paths** follow the pattern `zfs_monitor.<pool_name>.status`,
so acknowledgements and silences target individual pools:

```
zfs_monitor.tank.status
zfs_monitor.backup.status
```

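The wildcard rule for pools can be pictured as a two-step lookup. The sketch below is illustrative only: `pool_threshold` is a hypothetical helper, and falling back to `'*'` per metric (so that `tank` keeps the default `status` check even though it only defines `capacity`) is an assumption about how the override behaves.

```python
def pool_threshold(pools_config, pool_name, metric="status"):
    """Return the threshold settings for one pool metric, or None."""
    named = pools_config.get(pool_name, {})
    if metric in named:
        # A named pool entry takes priority over the wildcard.
        return named[metric]
    # Assumption: metrics the named entry omits fall back to '*'.
    return pools_config.get("*", {}).get(metric)


pools = {
    "*": {"status": {"critical": 1, "operator": ">", "hysteresis": 0.0}},
    "scratch": {"status": {"enabled": False}},
    "tank": {"capacity": {"warning": 75.0, "critical": 90.0}},
}

print(pool_threshold(pools, "scratch"))  # explicit override
print(pool_threshold(pools, "tank"))     # falls back to the wildcard
print(pool_threshold(pools, "backup"))   # pool with no named entry
```
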
### Network Monitor

```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2

  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"

  dropout_total:
    warning: 50
    critical: 200
    operator: ">"

  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"

  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```

### Nagios Runner

The Nagios plugin runner reports exit codes that can be thresholded:

```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```

## Notification Behavior

### When Notifications Are Sent

Notifications are triggered on **state changes**:

1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
   ```
   WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
   ```

2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
   ```
   RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   ```
   REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
   ```

### Notification Frequency

- **State changes**: Immediate notification
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)

```yaml
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```

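The three notification cases can be expressed as one small decision function. This is a sketch under assumptions rather than the project's code: `notification_kind` and the `SEVERITY` ranking are hypothetical, the UNKNOWN level is deliberately left out, and timestamps are plain Unix seconds.

```python
import time

# Hypothetical severity ranking used to tell escalation from recovery.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def notification_kind(old_level, new_level, last_notification,
                      now=None, renotify_interval=3600):
    """Return "ESCALATION", "RECOVERY", "REMINDER", or None."""
    now = time.time() if now is None else now
    old, new = SEVERITY[old_level], SEVERITY[new_level]
    if new > old:
        return "ESCALATION"  # state change: notify immediately
    if new < old:
        return "RECOVERY"    # state change: notify immediately
    # Unchanged level: only remind for ongoing alerts, once per interval.
    if new > 0 and now - last_notification >= renotify_interval:
        return "REMINDER"
    return None


print(notification_kind("OK", "WARNING", last_notification=0, now=10))
print(notification_kind("CRITICAL", "CRITICAL", last_notification=0, now=3600))
```

Raising `renotify_interval` only spaces out reminders; escalation and recovery messages are unaffected because they depend on the level changing, not on elapsed time.
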
### Notification Channels

The system supports centralized notification channel definitions, allowing different hosts to use different notification providers and credentials. This provides fine-grained control over who gets notified about what.

#### Supported Channel Types

- **Email** (via SMTP)
- **Pushover** (mobile notifications)
- **Signal** (via signal-cli)
- **Mattermost** (team chat webhooks)

#### Centralized Channel Configuration

Define notification channels once in the configuration file:

```yaml
notification_channels:
  # Signal notifications
  signal_ops:
    type: signal
    cli_path: /usr/local/bin/signal-cli
    user: "+1234567890"       # quoted so YAML keeps the leading "+"
    recipient: "+1234567890"

  # Email notifications
  email_ops:
    type: email
    recipients: [ops@example.com, alerts@example.com]
    sender: heartbeat@example.com
    smtp_server: smtp.example.com
    smtp_port: 587
    smtp_user: heartbeat@example.com
    smtp_password: your-smtp-password

  # Pushover notifications
  pushover_urgent:
    type: pushover
    token: your-pushover-app-token
    user: your-pushover-user-key

  # Mattermost notifications
  mattermost_devops:
    type: mattermost
    host: mattermost.example.com
    token: your-webhook-token
    channel: devops-alerts
    username: heartbeat-bot
    icon: https://example.com/heartbeat-icon.png

# Default channels for hosts that don't specify channels
default_notification_channels: [email_ops]
```

#### Per-Host Channel Assignment

Assign notification channels to specific hosts in the `hosts` section:

```yaml
hosts:
  # Critical server - multiple notification channels
  prod-web-01:
    threshold_config: high_sensitivity
    watch: true
    notification_channels: [signal_ops, pushover_urgent, email_ops]
    dyndns: false

  # Database server - ops team only
  prod-db-01:
    threshold_config: database
    watch: true
    notification_channels: [signal_ops, email_ops]
    dyndns: false

  # Development server - email only
  dev-server-01:
    threshold_config: low_sensitivity
    watch: false
    notification_channels: [email_ops]
    dyndns: false

  # Uses default_notification_channels if not specified
  test-server-01:
    threshold_config: default
    watch: false
    dyndns: false
```

### Watched Hosts

Only hosts with `watch: true` in the `hosts` section will trigger notifications:

```yaml
hosts:
  webserver01:
    watch: true
    notification_channels: [email_ops]

  database01:
    watch: true
    notification_channels: [signal_ops, email_ops]

  mailserver:
    watch: true
    notification_channels: [pushover_urgent]
```

Hosts not marked for watching will still have thresholds checked and alert states tracked, but won't send notifications.

## Alert State Tracking

Each host maintains alert states for all monitored metrics:

```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```

Alert states are kept in memory and are saved together with the host data (via pickle).

### Alert State Information

Each `AlertState` tracks:

- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- **since**: Timestamp when current state started
- **last_value**: Most recent metric value
- **last_check**: Timestamp of last threshold check
- **notification_count**: Number of notifications sent for this alert
- **last_notification**: Timestamp of last notification

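The fields listed above fit a small dataclass. The sketch below is an assumption about shape, not a copy of the class in `hbd/threshold.py`: the `AlertLevel` enum, the default values, and the `update` method are all hypothetical and only illustrate how `since` is reset on a level change while `last_value` and `last_check` move on every check.

```python
import time
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class AlertLevel(IntEnum):
    """Hypothetical enum for the four documented levels."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


@dataclass
class AlertState:
    level: AlertLevel = AlertLevel.OK
    since: float = 0.0
    last_value: Optional[float] = None
    last_check: float = 0.0
    notification_count: int = 0
    last_notification: float = 0.0

    def update(self, level, value, now=None):
        """Record one check and return True if the level changed."""
        now = time.time() if now is None else now
        changed = level != self.level
        if changed:
            self.level = level
            self.since = now             # a new state starts now
            self.notification_count = 0  # assumption: count is per state
        self.last_value = value
        self.last_check = now
        return changed


state = AlertState()
print(state.update(AlertLevel.WARNING, 85.0, now=1000.0))  # level changed
print(state.update(AlertLevel.WARNING, 86.0, now=1300.0))  # same level
print(state.level.name, state.since, state.last_check)
```
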
### Querying Alert States

Via HTTP API (future enhancement):

```http
GET /api/hosts/webserver01/alerts
```

Response:
```json
{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}
```

## Testing

A comprehensive test suite is provided in `test_threshold.py`:

```bash
python test_threshold.py
```

Tests cover:
- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries

## Best Practices

### 1. Start Conservative

Begin with higher thresholds to avoid alert fatigue:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0   # Start higher
    critical: 95.0  # Very high for critical
```

Adjust downward based on observed behavior.

### 2. Consider Workload Patterns

Different systems have different normal ranges:

**Web servers** (bursty traffic):
```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```

**Database servers** (steady load):
```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1  # Lower hysteresis for steady metrics
```

### 3. Use Appropriate Operators

Match the operator to the metric:

| Metric Type | Example | Operator | Reason |
|-------------|---------|----------|--------|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when increasing |
| Health checks | Nagios exit code | `>=` | Map to standard codes |

### 4. Align with Monitoring Intervals

Ensure threshold checks align with plugin collection intervals:

```yaml
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes
```

### 5. Test Before Production

1. **Start with disabled thresholds**:
   ```yaml
   enabled: false
   ```

2. **Observe metric ranges** over a week

3. **Set thresholds** based on observed data

4. **Enable gradually**:
   ```yaml
   enabled: true
   ```

5. **Monitor for false positives**

### 6. Document Baseline Values

Keep a record of normal operating ranges:

```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month

cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```

### 7. Layer Alerts

Use WARNING for early notification, CRITICAL for immediate action:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0   # Early warning: "check in next few days"
        critical: 90.0  # Critical: "act now before outage"
```

## Troubleshooting

### No Notifications Being Sent

1. **Check if host is watched** (only hosts with `watch: true` send notifications):
   ```yaml
   hosts:
     your-host-name:
       watch: true
   ```

2. **Verify notification configuration** (the host's channels must exist under `notification_channels`):
   ```yaml
   notification_channels:
     email_ops:
       type: email
       recipients: [admin@example.com]
       smtp_server: smtp.example.com
   ```

3. **Check threshold configuration**:
   ```bash
   # Look for parsing errors in server logs
   grep "threshold" /var/log/heartbeat/hbd.log
   ```

4. **Verify metric names**:
   - Metric names must match exactly (case-sensitive)
   - Check journal or logs for actual metric names

### Too Many Alerts (Flapping)

1. **Increase hysteresis**:
   ```yaml
   hysteresis: 0.2  # Increase from 0.1 to 0.2 (20%)
   ```

2. **Adjust thresholds**:
   ```yaml
   warning: 85.0  # Increase from 80.0
   ```

3. **Increase renotification interval**:
   ```yaml
   threshold_renotify_interval: 7200  # 2 hours instead of 1
   ```

### Alerts Not Triggering

1. **Check threshold operator**:
   ```yaml
   # For available memory (alert when LOW):
   operator: "<"  # NOT ">"
   ```

2. **Verify numeric values**:
   - Ensure metric values are numeric
   - Check for unit mismatches (MB vs GB)

3. **Check if threshold is enabled**:
   ```yaml
   enabled: true  # NOT false
   ```

4. **Review hysteresis settings**:
   - Very high hysteresis may prevent state changes
   - Try reducing or disabling temporarily

### Alert State Not Recovering

1. **Check recovery threshold calculation**:
   ```
   Threshold: 90
   Hysteresis: 0.1
   Recovery: 90 - (90 * 0.1) = 81

   Value must drop below 81 to recover
   ```

2. **Temporarily disable hysteresis**:
   ```yaml
   hysteresis: 0.0
   ```

3. **Monitor actual metric values**:
   ```bash
   # Check journal for actual values
   grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
   ```

## Advanced Topics

### Custom Notification Callbacks

The ThresholdChecker supports custom notification functions:

```python
# `pagerduty`, `logger`, and `metrics` are placeholders for your own
# incident, logging, and metrics clients.
def custom_notifier(message):
    # Send to incident management system
    pagerduty.trigger(message)

    # Log to custom system
    logger.critical(message)

    # Update dashboard
    metrics.alert_count.inc()

checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)
```

### Programmatic Access

Query alert states programmatically:

```python
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)

for alert in active:
    duration = time.time() - alert.since
    print(f"{alert.metric_path}: {alert.level.name} for {duration:.0f}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```

### Integration with External Systems

Threshold violations can be integrated with:

- **PagerDuty**: Incident creation and escalation
- **OpsGenie**: On-call scheduling and routing
- **ServiceNow**: Ticket creation
- **Grafana**: Dashboard annotations
- **Elasticsearch**: Alert indexing and analysis

## Future Enhancements

Planned features:

1. **Composite thresholds**: Alert based on multiple metrics
   ```yaml
   composite:
     high_load_with_low_memory:
       conditions:
         - cpu_monitor.load_1min > 8.0
         - memory_monitor.available_mb < 500
   ```

2. **Time-based thresholds**: Different thresholds by time of day
   ```yaml
   schedule:
     business_hours:
       warning: 70.0
     off_hours:
       warning: 85.0
   ```

3. **Rate-of-change thresholds**: Alert on rapid changes
   ```yaml
   rate_of_change:
     metric: cpu_percent
     period: 300
     threshold: 30.0  # Alert if changes >30% in 5 minutes
   ```

4. **Alert grouping**: Combine related alerts
   ```yaml
   groups:
     disk_critical:
       metrics:
         - disk_monitor./.percent
         - disk_monitor./var.percent
       action: single_notification
   ```

5. **Maintenance windows**: Suppress alerts during planned maintenance
   ```yaml
   maintenance:
     - host: webserver01
       start: 2024-01-15T02:00:00Z
       end: 2024-01-15T04:00:00Z
   ```

## See Also

- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_thresholds_example.yaml`
- Test suite: `test_threshold.py`

## Multi-Threshold Configuration

Support for multiple named threshold configurations with per-host mapping and composable layering.

### Overview

The multi-threshold feature allows you to:
- Define multiple named threshold configurations
- Assign one or more configurations to each host
- Compose configurations by layering: each named config's overrides are applied in order on top of the defaults
- Use different sensitivity levels for different environments

### Configuration Structure

Named configurations are defined under `threshold_configs`. Each host selects which ones to use via `threshold_config` in the `hosts` section (a string for a single config, or a list to layer multiple):

```yaml
# Optional: set the default configuration name (defaults to "default")
default_threshold_config: "default"

threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

  high_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 60.0
          critical: 75.0

  low_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0
          critical: 95.0

hosts:
  prod-web-01:
    threshold_config: high_sensitivity  # single config

  dev-server-01:
    threshold_config: low_sensitivity

  # Hosts with no threshold_config use default_threshold_config
```

### Composable Configurations (list form)

`threshold_config` can be a list. Configs are applied **left to right**: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.

```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}
      disk_monitor:
        partitions:
          /:
            percent: {warning: 80, critical: 90}

  # Tighter CPU limits for busy servers
  high_cpu_load:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

  # Tighter disk limits for data-heavy servers
  busy_disk:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent: {warning: 70, critical: 85}

hosts:
  # Gets default thresholds only
  web-01:
    threshold_config: default

  # Gets tighter CPU limits, default memory and disk
  build-server:
    threshold_config: high_cpu_load

  # Layers both: tighter CPU AND tighter disk, default memory
  db-01:
    threshold_config: [high_cpu_load, busy_disk]

  # Three layers: busy_disk overrides high_cpu_load if they conflict
  storage-01:
    threshold_config: [default, high_cpu_load, busy_disk]
```

**How layering works:**

Starting from the `default` thresholds:

| Layer | Applied config | Effect |
|-------|---------------|--------|
| Base | `default` | all default thresholds |
| +1 | `high_cpu_load` | cpu_percent overridden to 60/75 |
| +2 | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |

Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.

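Layering of this kind is commonly implemented as a recursive dictionary merge. The sketch below illustrates the idea and is not the project's code: `deep_merge` and `layer_configs` are hypothetical names, and merging key by key inside a metric (so an overlay that sets only `warning` would keep the base `critical`) is an assumption; the real implementation may replace a metric's settings wholesale.

```python
import copy


def deep_merge(base, override):
    """Return a new dict with `override` layered on top of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # descend
        else:
            merged[key] = copy.deepcopy(value)            # later layer wins
    return merged


def layer_configs(threshold_configs, names, default_name="default"):
    """Apply the named configs left to right on top of the defaults."""
    if isinstance(names, str):
        names = [names]
    result = threshold_configs.get(default_name, {}).get("thresholds", {})
    for name in names:
        result = deep_merge(result, threshold_configs[name]["thresholds"])
    return result


configs = {
    "default": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
        "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
        "disk_monitor": {"partitions": {"/": {"percent": {"warning": 80, "critical": 90}}}},
    }},
    "high_cpu_load": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}},
    }},
    "busy_disk": {"thresholds": {
        "disk_monitor": {"partitions": {"/": {"percent": {"warning": 70, "critical": 85}}}},
    }},
}

db01 = layer_configs(configs, ["high_cpu_load", "busy_disk"])
print(db01["cpu_monitor"]["cpu_percent"])
print(db01["memory_monitor"]["memory_percent"])
print(db01["disk_monitor"]["partitions"]["/"]["percent"])
```

Because `deep_merge` copies its inputs, the named configs themselves are never mutated, so the same `default` block can safely serve as the base for every host.
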
### Use Cases

#### 1. Environment-Based Thresholds

Different thresholds for production vs. development:

```yaml
threshold_configs:
  production:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0  # Alert earlier in production
          critical: 85.0

  development:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0  # More relaxed for dev
          critical: 98.0

hosts:
  prod-web-01:
    threshold_config: production
  prod-web-02:
    threshold_config: production
  dev-web-01:
    threshold_config: development
  dev-web-02:
    threshold_config: development
```

#### 2. Server Role-Based Thresholds

Different thresholds based on server function:

```yaml
threshold_configs:
  webserver:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

  database:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0
          critical: 85.0
      memory_monitor:
        memory_percent:
          warning: 90.0  # Databases can use high memory
          critical: 97.0
      disk_monitor:
        partitions:
          /var/lib/mysql:
            percent:
              warning: 75.0
              critical: 85.0

  cache:
    thresholds:
      memory_monitor:
        memory_percent:
          warning: 95.0  # Redis/Memcached can use very high memory
          critical: 99.0

hosts:
  web-01:
    threshold_config: webserver
  web-02:
    threshold_config: webserver
  db-01:
    threshold_config: database
  db-02:
    threshold_config: database
  redis-01:
    threshold_config: cache
  memcached-01:
    threshold_config: cache
```

#### 3. Sensitivity Levels

Different sensitivity for critical vs. non-critical systems:

```yaml
threshold_configs:
  critical:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 70.0
              critical: 80.0
              hysteresis: 0.15

  standard:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 85.0
              critical: 95.0
              hysteresis: 0.1

  relaxed:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 90.0
              critical: 98.0
              hysteresis: 0.05

hosts:
  payment-gateway:
    threshold_config: critical
  auth-server:
    threshold_config: critical
  web-01:
    threshold_config: standard
  web-02:
    threshold_config: standard
  test-server:
    threshold_config: relaxed
```

#### 4. Composable Profiles

Build host-specific thresholds by combining small, focused configs:

```yaml
threshold_configs:
  # Baseline: everything at default levels
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}

  # Overlay: tighter CPU only
  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

  # Overlay: tighter memory only
  tight_memory:
    thresholds:
      memory_monitor:
        memory_percent: {warning: 70, critical: 85}

  # Overlay: extra disk partition for database servers
  db_disk:
    thresholds:
      disk_monitor:
        partitions:
          /var/lib/postgresql:
            percent: {warning: 75, critical: 88}

hosts:
  # Plain web server
  web-01:
    threshold_config: default

  # Build server: tight CPU, default memory and disk
  build-01:
    threshold_config: tight_cpu

  # Database: tight CPU + tight memory + extra disk partition
  db-01:
    threshold_config: [tight_cpu, tight_memory, db_disk]

  # Replica database: tight memory + extra disk, normal CPU
  db-02:
    threshold_config: [tight_memory, db_disk]
```

### Configuration Priority

1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
2. **Host `threshold_config` (string)**: Use that single named config directly
3. **`host_threshold_mapping`** (legacy): Same as above, string only
4. **`default_threshold_config`**: Used for hosts with no mapping
5. **First alphabetically**: If the default config is not found, use the first config alphabetically
6. **Legacy `thresholds` section**: Used when `threshold_configs` is absent entirely

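The priority order can be read as a chain of fallbacks. The sketch below shows one way to resolve which config names apply to a host; it is an illustration, not the project's code. `resolve_config_names` is a hypothetical helper that works on the parsed YAML as a plain dict, and returning an empty list to signal "use the legacy `thresholds` section" is an assumption.

```python
def resolve_config_names(config, hostname):
    """Return the threshold config names that apply to one host."""
    host_entry = config.get("hosts", {}).get(hostname) or {}
    selected = host_entry.get("threshold_config")       # priority 1 and 2
    if selected is None:
        # Priority 3: legacy top-level mapping (string only).
        selected = config.get("host_threshold_mapping", {}).get(hostname)
    if selected is not None:
        return [selected] if isinstance(selected, str) else list(selected)

    available = config.get("threshold_configs", {})
    default_name = config.get("default_threshold_config", "default")
    if default_name in available:                       # priority 4
        return [default_name]
    if available:                                       # priority 5
        return [sorted(available)[0]]
    return []  # priority 6: caller uses the legacy `thresholds` section


config = {
    "threshold_configs": {"default": {}, "tight_cpu": {}, "db_disk": {}},
    "hosts": {
        "db-01": {"threshold_config": ["tight_cpu", "db_disk"]},
        "build-01": {"threshold_config": "tight_cpu"},
        "web-01": {},
    },
    "host_threshold_mapping": {"legacy-01": "db_disk"},
}

for host in ("db-01", "build-01", "legacy-01", "web-01"):
    print(host, resolve_config_names(config, host))
```
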
### Backward Compatibility

The legacy `host_threshold_mapping` top-level key and the flat `thresholds` section are still fully supported:

```yaml
# Still works: equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
host_threshold_mapping:
  prod-web-01: high_sensitivity

# Still works: equivalent to threshold_configs: {default: {thresholds: ...}}
thresholds:
  cpu_monitor:
    cpu_percent: {warning: 80, critical: 90}
```