# Threshold Alerting System

## Overview

The Heartbeat Monitoring System includes a threshold alerting system that monitors plugin metrics and triggers notifications when values cross configured thresholds. This system is designed to:

- **Detect anomalies**: Automatically identify when system metrics leave safe operating ranges
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
- **Track state**: Maintain alert history and state transitions per host
- **Integrate seamlessly**: Work with existing notification infrastructure (email, pushover, etc.)

## Architecture

### Components

1. **ThresholdChecker** (`hbd/threshold.py`)
   - Main threshold checking engine
   - Parses configuration
   - Evaluates metrics against thresholds
   - Triggers notifications on state changes

2. **ThresholdConfig**
   - Individual threshold configuration
   - Supports multiple comparison operators
   - Implements hysteresis logic

3. **AlertState**
   - Tracks current alert state per metric
   - Records state transitions
   - Manages notification timing

4. **Integration Points**
   - UDP handler: Checks thresholds when plugin data arrives
   - Host objects: Store alert states per host
   - Notification system: Sends alerts via configured channels

### Alert Levels

- **OK**: Metric is within normal range
- **WARNING**: Metric has crossed the warning threshold (first-level concern)
- **CRITICAL**: Metric has crossed the critical threshold (requires immediate attention)
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)

## Configuration

### Basic Structure

Thresholds are configured in the YAML configuration file under the `thresholds` section:

```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      enabled: true
```

### Configuration Parameters

#### Required Parameters

- **warning**: Warning threshold value (numeric)
- **critical**: Critical threshold value (numeric)

Note: Neither value is mandatory on its own, but at least one of `warning` or `critical` must be specified.

#### Optional Parameters

- **operator**: Comparison operator (default: `">"`)
  - `">"` - Greater than
  - `">="` - Greater than or equal
  - `"<"` - Less than
  - `"<="` - Less than or equal
  - `"=="` - Equal to
  - `"!="` - Not equal to

- **hysteresis**: Hysteresis, given as a fraction of the threshold, to prevent flapping (default: `0.1` = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when the value hovers near a threshold

- **enabled**: Whether this threshold is active (default: `true`)

### Comparison Operators

#### Greater Than (`>`, `>=`)

Used for metrics where **higher values are problematic**:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0   # Alert when CPU > 80%
    critical: 90.0  # Alert when CPU > 90%
    operator: ">"
```

Examples:
- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters

#### Less Than (`<`, `<=`)

Used for metrics where **lower values are problematic**:

```yaml
memory_monitor:
  available_mb:
    warning: 1000  # Alert when available memory < 1GB
    critical: 500  # Alert when available memory < 500MB
    operator: "<"
```

Examples:
- Available memory
- Free disk space
- Connection pool availability
- Battery level

## Hysteresis

Hysteresis prevents alert flapping by requiring a value to improve by a set margin past the threshold before the metric recovers from an alert state.

### How It Works

When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), hysteresis is applied when the value improves:

```
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81

Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
```

### Configuration Recommendations

- **Stable metrics** (CPU, memory): 10-15% hysteresis
  ```yaml
  hysteresis: 0.1
  ```

- **Very stable metrics** (disk usage): 5% hysteresis
  ```yaml
  hysteresis: 0.05
  ```

- **Counter metrics** (errors, packets): 20% hysteresis
  ```yaml
  hysteresis: 0.2
  ```

- **Binary states** (exit codes): No hysteresis
  ```yaml
  hysteresis: 0.0
  ```

## Plugin-Specific Configuration

### CPU Monitor

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1

  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15

  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"

  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```

### Memory Monitor

```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"

  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"

  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```

### Disk Monitor

Disk thresholds support **partition-specific configuration**:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"

    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"

    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```

### Network Monitor

```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2

  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"

  dropout_total:
    warning: 50
    critical: 200
    operator: ">"

  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"

  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```

### Nagios Runner

The Nagios plugin runner reports exit codes that can be thresholded:

```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```

## Notification Behavior

### When Notifications Are Sent

Notifications are triggered on **state changes**:

1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
   ```
   WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
   ```

2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
   ```
   RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   ```
   REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
   ```

### Notification Frequency

- **State changes**: Immediate notification
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)

```yaml
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```

### Notification Channels

Thresholds use the same notification infrastructure as heartbeat monitoring:

- **Email** (via SMTP)
- **Pushover** (mobile notifications)
- **Mattermost** (team chat)
- **Custom webhooks**

Configuration:

```yaml
# Email
toemail:
  - admin@example.com
  - oncall@example.com
fromemail: heartbeat@example.com
smtpserver: smtp.example.com
smtpport: 587
smtpuser: heartbeat@example.com
smtppassword: your-password

# Pushover
pushover_token: your-app-token
pushover_user: your-user-key
```

### Watched Hosts

Only hosts in the `watchhosts` list will trigger notifications:

```yaml
watchhosts:
  - webserver01
  - database01
  - mailserver
```

Hosts not in this list will still have thresholds checked and alert states tracked, but won't send notifications.

## Alert State Tracking

Each host maintains alert states for all monitored metrics:

```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```

Alert states persist in memory and are saved with host data (pickle).
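
One way the per-metric state, the hysteresis rule, and the notification timing could fit together is sketched here. This is an illustration under stated assumptions, not the code in `hbd/threshold.py`: `AlertLevel`, the simplified `AlertState`, `evaluate`, and `update` are hypothetical names, only the `>` operator is handled, and holding the alert at exactly the recovery threshold is an assumption drawn from "value must drop below" the recovery threshold.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class AlertLevel(IntEnum):
    # Hypothetical ordering so that "higher" means "more severe".
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class AlertState:
    # Simplified stand-in for the real AlertState; field set is an assumption.
    level: AlertLevel = AlertLevel.OK
    since: float = 0.0
    last_value: Optional[float] = None
    last_notification: float = 0.0
    notification_count: int = 0


def evaluate(value, warning, critical, hysteresis, current):
    """Return the level for a '>' threshold, applying hysteresis on recovery."""

    def active(threshold, level):
        if current >= level:
            # Already at this level or worse: stay there until the value
            # drops below threshold - threshold * hysteresis.
            return value >= threshold - threshold * hysteresis
        return value > threshold

    if active(critical, AlertLevel.CRITICAL):
        return AlertLevel.CRITICAL
    if active(warning, AlertLevel.WARNING):
        return AlertLevel.WARNING
    return AlertLevel.OK


def update(state, value, now, *, warning, critical,
           hysteresis=0.1, renotify_interval=3600.0):
    """Update state in place; return 'escalation', 'recovery', 'reminder' or None."""
    new_level = evaluate(value, warning, critical, hysteresis, state.level)
    state.last_value = value
    event = None
    if new_level != state.level:
        # State change: notify immediately and restart the "since" clock.
        event = "escalation" if new_level > state.level else "recovery"
        state.level, state.since = new_level, now
    elif (new_level != AlertLevel.OK
          and now - state.last_notification >= renotify_interval):
        # Ongoing alert: remind once the re-notify interval has elapsed.
        event = "reminder"
    if event is not None:
        state.last_notification = now
        state.notification_count += 1
    return event


state = AlertState()
samples = [(0, 70.0), (60, 91.0), (120, 85.0), (3720, 85.0), (3780, 70.0)]
events = [update(state, value, now, warning=80.0, critical=90.0)
          for now, value in samples]
print(events)  # [None, 'escalation', None, 'reminder', 'recovery']
```

With a critical threshold of 90 and hysteresis 0.1, the reading of 85 keeps the metric at CRITICAL because it is still above the recovery threshold of 81, and the reading of 70 falls below both recovery thresholds and returns the metric to OK. A check against `watchhosts` would sit in front of the actual send and is left out of this sketch.
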

### Alert State Information

Each `AlertState` tracks:

- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- **since**: Timestamp when current state started
- **last_value**: Most recent metric value
- **last_check**: Timestamp of last threshold check
- **notification_count**: Number of notifications sent for this alert
- **last_notification**: Timestamp of last notification

### Querying Alert States

Via HTTP API (future enhancement):

```http
GET /api/hosts/webserver01/alerts
```

Response:

```json
{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}
```

## Testing

A comprehensive test suite is provided in `test_threshold.py`:

```bash
python test_threshold.py
```

Tests cover:
- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries

## Best Practices

### 1. Start Conservative

Begin with higher thresholds to avoid alert fatigue:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0   # Start higher
    critical: 95.0  # Very high for critical
```

Adjust downward based on observed behavior.

### 2. Consider Workload Patterns

Different systems have different normal ranges:

**Web servers** (bursty traffic):
```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```

**Database servers** (steady load):
```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1  # Lower hysteresis for steady metrics
```

### 3. Use Appropriate Operators

Match the operator to the metric:

| Metric Type | Example | Operator | Reason |
|-------------|---------|----------|--------|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when increasing |
| Health checks | Nagios exit code | `>=` | Map to standard codes |

### 4. Align with Monitoring Intervals

Thresholds are evaluated when plugin data arrives, so the plugin collection interval determines how often each threshold is checked:

```yaml
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0  # Will be checked every 5 minutes
```

### 5. Test Before Production

1. **Start with disabled thresholds**:
   ```yaml
   enabled: false
   ```

2. **Observe metric ranges** over a week

3. **Set thresholds** based on observed data

4. **Enable gradually**:
   ```yaml
   enabled: true
   ```

5. **Monitor for false positives**

### 6. Document Baseline Values

Keep a record of normal operating ranges:

```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month

cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```

### 7. Layer Alerts

Use WARNING for early notification, CRITICAL for immediate action:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0   # Early warning: "check in next few days"
        critical: 90.0  # Critical: "act now before outage"
```

## Troubleshooting

### No Notifications Being Sent

1. **Check if host is watched**:
   ```yaml
   watchhosts:
     - your-host-name
   ```

2. **Verify notification configuration**:
   ```yaml
   toemail:
     - admin@example.com
   smtpserver: smtp.example.com
   ```

3. **Check threshold configuration**:
   ```bash
   # Look for parsing errors in server logs
   grep "threshold" /var/log/heartbeat/hbd.log
   ```

4. **Verify metric names**:
   - Metric names must match exactly (case-sensitive)
   - Check journal or logs for actual metric names

### Too Many Alerts (Flapping)

1. **Increase hysteresis**:
   ```yaml
   hysteresis: 0.2  # Increase from 0.1 to 0.2 (20%)
   ```

2. **Adjust thresholds**:
   ```yaml
   warning: 85.0  # Increase from 80.0
   ```

3. **Increase renotification interval**:
   ```yaml
   threshold_renotify_interval: 7200  # 2 hours instead of 1
   ```

### Alerts Not Triggering

1. **Check threshold operator**:
   ```yaml
   # For available memory (alert when LOW):
   operator: "<"  # NOT ">"
   ```

2. **Verify numeric values**:
   - Ensure metric values are numeric
   - Check for unit mismatches (MB vs GB)

3. **Check if threshold is enabled**:
   ```yaml
   enabled: true  # NOT false
   ```

4. **Review hysteresis settings**:
   - Very high hysteresis may prevent state changes
   - Try reducing or disabling temporarily

### Alert State Not Recovering

1. **Check recovery threshold calculation**:
   ```
   Threshold: 90
   Hysteresis: 0.1
   Recovery: 90 - (90 * 0.1) = 81

   Value must drop below 81 to recover
   ```

2. **Temporarily disable hysteresis**:
   ```yaml
   hysteresis: 0.0
   ```

3. **Monitor actual metric values**:
   ```bash
   # Check journal for actual values
   grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
   ```

## Advanced Topics

### Custom Notification Callbacks

The ThresholdChecker supports custom notification functions:

```python
# `pagerduty`, `logger`, `metrics`, and `config` are placeholders for
# objects supplied by your own application.
def custom_notifier(message):
    # Send to incident management system
    pagerduty.trigger(message)

    # Log to custom system
    logger.critical(message)

    # Update dashboard
    metrics.alert_count.inc()

checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)
```

### Programmatic Access

Query alert states programmatically:

```python
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)
for alert in active:
    print(f"{alert.metric_path}: {alert.level.name} for {time.time() - alert.since}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```

### Integration with External Systems

Threshold violations can be integrated with:

- **PagerDuty**: Incident creation and escalation
- **OpsGenie**: On-call scheduling and routing
- **ServiceNow**: Ticket creation
- **Grafana**: Dashboard annotations
- **Elasticsearch**: Alert indexing and analysis

## Future Enhancements

Planned features:

1. **Composite thresholds**: Alert based on multiple metrics
   ```yaml
   composite:
     high_load_with_low_memory:
       conditions:
         - cpu_monitor.load_1min > 8.0
         - memory_monitor.available_mb < 500
   ```

2. **Time-based thresholds**: Different thresholds by time of day
   ```yaml
   schedule:
     business_hours:
       warning: 70.0
     off_hours:
       warning: 85.0
   ```

3. **Rate-of-change thresholds**: Alert on rapid changes
   ```yaml
   rate_of_change:
     metric: cpu_percent
     period: 300
     threshold: 30.0  # Alert if changes >30% in 5 minutes
   ```

4. **Alert grouping**: Combine related alerts
   ```yaml
   groups:
     disk_critical:
       metrics:
         - disk_monitor./.percent
         - disk_monitor./var.percent
       action: single_notification
   ```

5. **Maintenance windows**: Suppress alerts during planned maintenance
   ```yaml
   maintenance:
     - host: webserver01
       start: 2024-01-15T02:00:00Z
       end: 2024-01-15T04:00:00Z
   ```

## See Also

- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_thresholds_example.yaml`
- Test suite: `test_threshold.py`