# Threshold Alerting System

## Overview

The Heartbeat Monitoring System includes a comprehensive threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured thresholds. This system is designed to:

- **Detect anomalies**: Automatically identify when system metrics exceed safe operating ranges
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
- **Track state**: Maintain alert history and state transitions per host
- **Integrate seamlessly**: Work with existing notification infrastructure (email, pushover, etc.)

## Architecture

### Components

1. **ThresholdChecker** (`hbd/threshold.py`)
   - Main threshold checking engine
   - Parses configuration
   - Evaluates metrics against thresholds
   - Triggers notifications on state changes

2. **ThresholdConfig**
   - Individual threshold configuration
   - Supports multiple comparison operators
   - Implements hysteresis logic

3. **AlertState**
   - Tracks current alert state per metric
   - Records state transitions
   - Manages notification timing

4. **Integration Points**
   - UDP handler: Checks thresholds when plugin data arrives
   - Host objects: Store alert states per host
   - Notification system: Sends alerts via configured channels

### Alert Levels

- **OK**: Metric is within normal range
- **WARNING**: Metric has exceeded warning threshold (first-level concern)
- **CRITICAL**: Metric has exceeded critical threshold (requires immediate attention)
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)

## Configuration

### Basic Structure

Thresholds are configured in the YAML configuration file under the `thresholds` section:

```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      display: "display format"
      enabled: true
```

### Configuration Parameters

#### Required Parameters

- **warning**: Warning threshold value (numeric)
- **critical**: Critical threshold value (numeric)

Note: At least one of `warning` or `critical` must be specified.

#### Optional Parameters

- **operator**: Comparison operator (default: `">"`)
  - `">"` - Greater than
  - `">="` - Greater than or equal
  - `"<"` - Less than
  - `"<="` - Less than or equal
  - `"=="` - Equal to
  - `"!="` - Not equal to

- **hysteresis**: Hysteresis fraction to prevent flapping (default: `0.1` = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when the value hovers near a threshold

- **display**: Format string (with `{placeholder}` fields) used for alert messages
  - Defaults to `"(threshold: {op_symbol} {threshold_value})"`

- **enabled**: Whether this threshold is active (default: `true`)

### Comparison Operators

#### Greater Than (`>`, `>=`)

Used for metrics where **higher values are problematic**:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0   # Alert when CPU > 80%
    critical: 90.0  # Alert when CPU > 90%
    operator: ">"
```

Examples:

- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters

#### Less Than (`<`, `<=`)

Used for metrics where **lower values are problematic**:

```yaml
memory_monitor:
  available_mb:
    warning: 1000  # Alert when available memory < 1GB
    critical: 500  # Alert when available memory < 500MB
    operator: "<"
```

Examples:

- Available memory
- Free disk space
- Connection pool availability
- Battery level

## Hysteresis

Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.

### How It Works

When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), hysteresis is applied when the value improves:

```
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81

Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
```

### Configuration Recommendations

- **Stable metrics** (CPU, memory): 10-15% hysteresis
  ```yaml
  hysteresis: 0.1
  ```

- **Very stable metrics** (disk usage): 5% hysteresis
  ```yaml
  hysteresis: 0.05
  ```

- **Counter metrics** (errors, packets): 20% hysteresis
  ```yaml
  hysteresis: 0.2
  ```

- **Binary states** (exit codes): No hysteresis
  ```yaml
  hysteresis: 0.0
  ```

## Plugin-Specific Configuration

### CPU Monitor

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1
  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15
  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"
  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```

### Memory Monitor

```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"
  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"
  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```

### Disk Monitor

Disk thresholds support **partition-specific configuration**:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"
    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"
    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```

### ZFS Monitor

ZFS pool health is checked automatically for every pool. A pool in any state other than `ONLINE` (e.g. `DEGRADED`, `SUSPENDED`, `FAULTED`, `UNAVAIL`) raises a **CRITICAL** alert by default; no configuration is required.

The default threshold is equivalent to:

```yaml
zfs_monitor:
  pools:
    '*':
      health_ok:
        critical: 1
        operator: "<"
        hysteresis: 0.0
        display: "ZFS pool {pool_name} is {health}"
```

`'*'` matches every pool on the host. The notification message includes the pool name and its current health string, e.g. `ZFS pool tank is DEGRADED`.

**Override for specific pools.** Named pool entries take priority over `'*'`:

```yaml
zfs_monitor:
  pools:
    # Suppress health alerts for a scratch pool (not mission-critical)
    scratch:
      health_ok:
        enabled: false
    # Capacity threshold for a specific pool
    tank:
      capacity:
        warning: 75.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
```

**Alert state paths** follow the pattern `zfs_monitor.<pool_name>.health_ok`, so acknowledgements and silences target individual pools:

```
zfs_monitor.tank.health_ok
zfs_monitor.backup.health_ok
```

### Network Monitor

```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2
  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"
  dropout_total:
    warning: 50
    critical: 200
    operator: ">"
  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"
  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```

### Nagios Runner

The Nagios plugin runner reports exit codes that can be thresholded:

```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```

## Notification Behavior

### When Notifications Are Sent

Notifications are triggered on **state changes**:

1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
   ```
   WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
   ```

2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
   ```
   RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   ```
   REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
   ```

### Notification Frequency

- **State changes**: Immediate notification
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)

```yaml
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```

### Notification Channels

The system supports centralized notification channel definitions, allowing different hosts to use different notification providers and credentials. This provides fine-grained control over who gets notified about what.
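As a rough illustration of how centrally defined channels might be resolved for one host, here is a minimal sketch. It assumes the YAML configuration has already been parsed into a plain dict using the `notification_channels`, `default_notification_channels`, and `hosts` keys from this document. `resolve_channels` is a hypothetical helper written for this example, not a function from `hbd`, and the real lookup may differ (for instance in how unknown channel names are handled).

```python
def resolve_channels(config, hostname):
    """Return the channel definitions that apply to `hostname`.

    Hypothetical helper: `config` is the parsed YAML configuration as a dict.
    """
    channels = config.get("notification_channels") or {}
    host = (config.get("hosts") or {}).get(hostname) or {}
    # A host's own list wins; otherwise fall back to the global default list.
    names = (
        host.get("notification_channels")
        or config.get("default_notification_channels")
        or []
    )
    # Assumption: unknown channel names are skipped silently in this sketch.
    return [{"name": name, **channels[name]} for name in names if name in channels]


example_config = {
    "notification_channels": {
        "signal_ops": {"type": "signal"},
        "email_ops": {"type": "email"},
    },
    "default_notification_channels": ["email_ops"],
    "hosts": {
        "prod-web-01": {"watch": True, "notification_channels": ["signal_ops", "email_ops"]},
        "test-server-01": {"watch": False},
    },
}

for host in ("prod-web-01", "test-server-01"):
    print(host, [c["name"] for c in resolve_channels(example_config, host)])
```

Keeping the lookup in one place means a host entry only ever stores channel names, so credentials live in a single section of the configuration file.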
#### Supported Channel Types

- **Email** (via SMTP)
- **Pushover** (mobile notifications)
- **Signal** (via signal-cli)
- **Mattermost** (team chat webhooks)

#### Centralized Channel Configuration

Define notification channels once in the configuration file:

```yaml
notification_channels:
  # Signal notifications
  signal_ops:
    type: signal
    cli_path: /usr/local/bin/signal-cli
    user: "+1234567890"       # Quoted so YAML keeps the leading "+"
    recipient: "+1234567890"

  # Email notifications
  email_ops:
    type: email
    recipients: [ops@example.com, alerts@example.com]
    sender: heartbeat@example.com
    smtp_server: smtp.example.com
    smtp_port: 587
    smtp_user: heartbeat@example.com
    smtp_password: your-smtp-password

  # Pushover notifications
  pushover_urgent:
    type: pushover
    token: your-pushover-app-token
    user: your-pushover-user-key

  # Mattermost notifications
  mattermost_devops:
    type: mattermost
    host: mattermost.example.com
    token: your-webhook-token
    channel: devops-alerts
    username: heartbeat-bot
    icon: https://example.com/heartbeat-icon.png

# Default channels for hosts that don't specify channels
default_notification_channels: [email_ops]
```

#### Per-Host Channel Assignment

Assign notification channels to specific hosts in the `hosts` section:

```yaml
hosts:
  # Critical server - multiple notification channels
  prod-web-01:
    threshold_config: high_sensitivity
    watch: true
    notification_channels: [signal_ops, pushover_urgent, email_ops]
    dyndns: false

  # Database server - ops team only
  prod-db-01:
    threshold_config: database
    watch: true
    notification_channels: [signal_ops, email_ops]
    dyndns: false

  # Development server - email only
  dev-server-01:
    threshold_config: low_sensitivity
    watch: false
    notification_channels: [email_ops]
    dyndns: false

  # Uses default_notification_channels if not specified
  test-server-01:
    threshold_config: default
    watch: false
    dyndns: false
```

### Watched Hosts

Only hosts with `watch: true` in the `hosts` section will trigger notifications:

```yaml
hosts:
  webserver01:
    watch: true
    notification_channels: [email_ops]
  database01:
    watch: true
    notification_channels: [signal_ops, email_ops]
  mailserver:
    watch: true
    notification_channels: [pushover_urgent]
```

Hosts not marked for watching will still have thresholds checked and alert states tracked, but won't send notifications.

## Alert State Tracking

Each host maintains alert states for all monitored metrics:

```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```

Alert states persist in memory and are saved with host data (pickle).

### Alert State Information

Each `AlertState` tracks:

- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- **since**: Timestamp when current state started
- **last_value**: Most recent metric value
- **last_check**: Timestamp of last threshold check
- **notification_count**: Number of notifications sent for this alert
- **last_notification**: Timestamp of last notification

### Querying Alert States

Via HTTP API (future enhancement):

```bash
GET /api/hosts/webserver01/alerts
```

Response:

```json
{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}
```

## Testing

A comprehensive test suite is provided in `test_threshold.py`:

```bash
python test_threshold.py
```

Tests cover:

- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries

## Best Practices

### 1. Start Conservative

Begin with higher thresholds to avoid alert fatigue:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0   # Start higher
    critical: 95.0  # Very high for critical
```

Adjust downward based on observed behavior.

### 2. Consider Workload Patterns

Different systems have different normal ranges:

**Web servers** (bursty traffic):

```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```

**Database servers** (steady load):

```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1  # Lower hysteresis for steady metrics
```

### 3. Use Appropriate Operators

Match the operator to the metric:

| Metric Type | Example | Operator | Reason |
|-------------|---------|----------|--------|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when increasing |
| Health checks | Nagios exit code | `>=` | Map to standard codes |

### 4. Align with Monitoring Intervals

Ensure threshold checks align with plugin collection intervals:

```yaml
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0  # Will be checked every 5 minutes
```

### 5. Test Before Production

1. **Start with disabled thresholds**:
   ```yaml
   enabled: false
   ```

2. **Observe metric ranges** over a week

3. **Set thresholds** based on observed data

4. **Enable gradually**:
   ```yaml
   enabled: true
   ```

5. **Monitor for false positives**

### 6. Document Baseline Values

Keep a record of normal operating ranges:

```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month
cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```

### 7. Layer Alerts

Use WARNING for early notification, CRITICAL for immediate action:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0   # Early warning: "check in next few days"
        critical: 90.0  # Critical: "act now before outage"
```

## Troubleshooting

### No Notifications Being Sent

1. **Check if host is watched**:
   ```yaml
   watchhosts:
     - your-host-name
   ```

2. **Verify notification configuration**:
   ```yaml
   toemail:
     - admin@example.com
   smtpserver: smtp.example.com
   ```

3. **Check threshold configuration**:
   ```bash
   # Look for parsing errors in server logs
   grep "threshold" /var/log/heartbeat/hbd.log
   ```

4. **Verify metric names**:
   - Metric names must match exactly (case-sensitive)
   - Check journal or logs for actual metric names

### Too Many Alerts (Flapping)

1. **Increase hysteresis**:
   ```yaml
   hysteresis: 0.2  # Increase from 0.1 to 0.2 (20%)
   ```

2. **Adjust thresholds**:
   ```yaml
   warning: 85.0  # Increase from 80.0
   ```

3. **Increase renotification interval**:
   ```yaml
   threshold_renotify_interval: 7200  # 2 hours instead of 1
   ```

### Alerts Not Triggering

1. **Check threshold operator**:
   ```yaml
   # For available memory (alert when LOW):
   operator: "<"  # NOT ">"
   ```

2. **Verify numeric values**:
   - Ensure metric values are numeric
   - Check for unit mismatches (MB vs GB)

3. **Check if threshold is enabled**:
   ```yaml
   enabled: true  # NOT false
   ```

4. **Review hysteresis settings**:
   - Very high hysteresis may prevent state changes
   - Try reducing or disabling temporarily

### Alert State Not Recovering

1. **Check recovery threshold calculation**:
   ```
   Threshold: 90
   Hysteresis: 0.1
   Recovery: 90 - (90 * 0.1) = 81
   Value must drop below 81 to recover
   ```

2. **Temporarily disable hysteresis**:
   ```yaml
   hysteresis: 0.0
   ```

3. **Monitor actual metric values**:
   ```bash
   # Check journal for actual values
   grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
   ```

## Advanced Topics

### Custom Notification Callbacks

The ThresholdChecker supports custom notification functions:

```python
# `pagerduty`, `logger`, and `metrics` stand in for your own client objects.
def custom_notifier(message):
    # Send to incident management system
    pagerduty.trigger(message)
    # Log to custom system
    logger.critical(message)
    # Update dashboard
    metrics.alert_count.inc()

checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)
```

### Programmatic Access

Query alert states programmatically:

```python
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)
for alert in active:
    print(f"{alert.metric_path}: {alert.level.name} for {time.time() - alert.since}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```

### Integration with External Systems

Threshold violations can be integrated with:

- **PagerDuty**: Incident creation and escalation
- **OpsGenie**: On-call scheduling and routing
- **ServiceNow**: Ticket creation
- **Grafana**: Dashboard annotations
- **Elasticsearch**: Alert indexing and analysis

## Future Enhancements

Planned features:

1. **Composite thresholds**: Alert based on multiple metrics
   ```yaml
   composite:
     high_load_with_low_memory:
       conditions:
         - cpu_monitor.load_1min > 8.0
         - memory_monitor.available_mb < 500
   ```

2. **Time-based thresholds**: Different thresholds by time of day
   ```yaml
   schedule:
     business_hours:
       warning: 70.0
     off_hours:
       warning: 85.0
   ```

3. **Rate-of-change thresholds**: Alert on rapid changes
   ```yaml
   rate_of_change:
     metric: cpu_percent
     period: 300
     threshold: 30.0  # Alert if changes >30% in 5 minutes
   ```

4. **Alert grouping**: Combine related alerts
   ```yaml
   groups:
     disk_critical:
       metrics:
         - disk_monitor./.percent
         - disk_monitor./var.percent
       action: single_notification
   ```

5. **Maintenance windows**: Suppress alerts during planned maintenance
   ```yaml
   maintenance:
     - host: webserver01
       start: 2024-01-15T02:00:00Z
       end: 2024-01-15T04:00:00Z
   ```

## See Also

- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_thresholds_example.yaml`
- Test suite: `test_threshold.py`

## Multi-Threshold Configuration

Support for multiple named threshold configurations with per-host mapping and composable layering.

### Overview

The multi-threshold feature allows you to:

- Define multiple named threshold configurations
- Assign one or more configurations to each host
- Compose configurations by layering: each named config's overrides are applied in order on top of the defaults
- Use different sensitivity levels for different environments

### Configuration Structure

Named configurations are defined under `threshold_configs`. Each host selects which ones to use via `threshold_config` in the `hosts` section (a string for a single config, or a list to layer multiple):

```yaml
# Optional: set the default configuration name (defaults to "default")
default_threshold_config: "default"

threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

  high_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 60.0
          critical: 75.0

  low_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0
          critical: 95.0

hosts:
  prod-web-01:
    threshold_config: high_sensitivity  # single config
  dev-server-01:
    threshold_config: low_sensitivity
  # Hosts with no threshold_config use default_threshold_config
```

### Composable Configurations (list form)

`threshold_config` can be a list. Configs are applied **left to right**: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.
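The left-to-right layering rule can be sketched as a recursive dictionary merge. This is only an illustration under stated assumptions: `deep_merge` and `layer_thresholds` are hypothetical names, not functions from `hbd/threshold.py`, and the sketch merges field by field, whereas the real implementation may replace a metric's entry wholesale. A string value is treated as a one-element list.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where values in `override` win over `base`, recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def layer_thresholds(threshold_configs, selection, default_name="default"):
    """Layer the selected named configs, left to right, on top of the default config."""
    names = [selection] if isinstance(selection, str) else list(selection or [])
    result = copy.deepcopy(threshold_configs.get(default_name, {}).get("thresholds", {}))
    for name in names:
        result = deep_merge(result, threshold_configs[name]["thresholds"])
    return result


configs = {
    "default": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
        "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
    }},
    "high_cpu_load": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}},
    }},
}

effective = layer_thresholds(configs, ["high_cpu_load"])
print(effective["cpu_monitor"]["cpu_percent"])        # overridden by high_cpu_load
print(effective["memory_monitor"]["memory_percent"])  # inherited from default
```

Copying at each step keeps the named configs themselves untouched, so the same overlay can be reused by many hosts.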
```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}
      disk_monitor:
        partitions:
          /:
            percent: {warning: 80, critical: 90}

  # Tighter CPU limits for busy servers
  high_cpu_load:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

  # Tighter disk limits for data-heavy servers
  busy_disk:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent: {warning: 70, critical: 85}

hosts:
  # Gets default thresholds only
  web-01:
    threshold_config: default

  # Gets tighter CPU limits, default memory and disk
  build-server:
    threshold_config: high_cpu_load

  # Layers both: tighter CPU AND tighter disk, default memory
  db-01:
    threshold_config: [high_cpu_load, busy_disk]

  # Three layers: busy_disk overrides high_cpu_load if they conflict
  storage-01:
    threshold_config: [default, high_cpu_load, busy_disk]
```

**How layering works:**

Starting from the `default` thresholds:

| Layer | Applied config | Effect |
|-------|----------------|--------|
| Base | `default` | all default thresholds |
| +1 | `high_cpu_load` | cpu_percent overridden to 60/75 |
| +2 | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |

Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.

### Use Cases

#### 1. Environment-Based Thresholds

Different thresholds for production vs. development:

```yaml
threshold_configs:
  production:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0   # Alert earlier in production
          critical: 85.0

  development:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0   # More relaxed for dev
          critical: 98.0

hosts:
  prod-web-01:
    threshold_config: production
  prod-web-02:
    threshold_config: production
  dev-web-01:
    threshold_config: development
  dev-web-02:
    threshold_config: development
```

#### 2. Server Role-Based Thresholds

Different thresholds based on server function:

```yaml
threshold_configs:
  webserver:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

  database:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0
          critical: 85.0
      memory_monitor:
        memory_percent:
          warning: 90.0   # Databases can use high memory
          critical: 97.0
      disk_monitor:
        partitions:
          /var/lib/mysql:
            percent:
              warning: 75.0
              critical: 85.0

  cache:
    thresholds:
      memory_monitor:
        memory_percent:
          warning: 95.0   # Redis/Memcached can use very high memory
          critical: 99.0

hosts:
  web-01:
    threshold_config: webserver
  web-02:
    threshold_config: webserver
  db-01:
    threshold_config: database
  db-02:
    threshold_config: database
  redis-01:
    threshold_config: cache
  memcached-01:
    threshold_config: cache
```

#### 3. Sensitivity Levels

Different sensitivity for critical vs. non-critical systems:

```yaml
threshold_configs:
  critical:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 70.0
              critical: 80.0
              hysteresis: 0.15

  standard:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 85.0
              critical: 95.0
              hysteresis: 0.1

  relaxed:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 90.0
              critical: 98.0
              hysteresis: 0.05

hosts:
  payment-gateway:
    threshold_config: critical
  auth-server:
    threshold_config: critical
  web-01:
    threshold_config: standard
  web-02:
    threshold_config: standard
  test-server:
    threshold_config: relaxed
```

#### 4. Composable Profiles

Build host-specific thresholds by combining small, focused configs:

```yaml
threshold_configs:
  # Baseline: everything at default levels
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}

  # Overlay: tighter CPU only
  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

  # Overlay: tighter memory only
  tight_memory:
    thresholds:
      memory_monitor:
        memory_percent: {warning: 70, critical: 85}

  # Overlay: extra disk partition for database servers
  db_disk:
    thresholds:
      disk_monitor:
        partitions:
          /var/lib/postgresql:
            percent: {warning: 75, critical: 88}

hosts:
  # Plain web server
  web-01:
    threshold_config: default

  # Build server: tight CPU, default memory and disk
  build-01:
    threshold_config: tight_cpu

  # Database: tight CPU + tight memory + extra disk partition
  db-01:
    threshold_config: [tight_cpu, tight_memory, db_disk]

  # Replica database: tight memory + extra disk, normal CPU
  db-02:
    threshold_config: [tight_memory, db_disk]
```

### Configuration Priority

1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
2. **Host `threshold_config` (string)**: Use that single named config directly
3. **`host_threshold_mapping`** (legacy): Same as above, string only
4. **`default_threshold_config`**: Used for hosts with no mapping
5. **First alphabetically**: If the default config is not found, use the first config alphabetically
6. **Legacy `thresholds` section**: Used when `threshold_configs` is absent entirely

### Backward Compatibility

The legacy `host_threshold_mapping` top-level key and the flat `thresholds` section are still fully supported:

```yaml
# Still works; equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
host_threshold_mapping:
  prod-web-01: high_sensitivity

# Still works; equivalent to threshold_configs: {default: {thresholds: ...}}
thresholds:
  cpu_monitor:
    cpu_percent: {warning: 80, critical: 90}
```
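The two equivalences stated in the comments above can be expressed as a small normalization step. The sketch below is an assumption about how such a translation could look, not the actual `hbd` loader: `normalize_legacy` is a hypothetical helper, and it lets an explicit `threshold_config` on a host take precedence over the legacy mapping, in line with the priority list.

```python
import copy


def normalize_legacy(config):
    """Rewrite legacy keys into the `threshold_configs` / `hosts` form (hypothetical helper)."""
    cfg = copy.deepcopy(config)

    # A flat `thresholds` section becomes the "default" named config,
    # but only when `threshold_configs` is absent entirely.
    if "threshold_configs" not in cfg and "thresholds" in cfg:
        cfg["threshold_configs"] = {"default": {"thresholds": cfg.pop("thresholds")}}

    # Each legacy mapping entry becomes a per-host `threshold_config`,
    # unless the host already sets one explicitly.
    hosts = cfg.get("hosts") or {}
    for hostname, name in (cfg.pop("host_threshold_mapping", None) or {}).items():
        entry = hosts.get(hostname) or {}
        entry.setdefault("threshold_config", name)
        hosts[hostname] = entry
    cfg["hosts"] = hosts
    return cfg


legacy = {
    "host_threshold_mapping": {"prod-web-01": "high_sensitivity"},
    "thresholds": {"cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}}},
}
normalized = normalize_legacy(legacy)
print(normalized["hosts"])
print(list(normalized["threshold_configs"]))
```

Normalizing once at load time would let the rest of the code handle a single configuration shape.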