Major refactoring of the codebase, including restructuring of files and directories, renaming of modules and classes, and improvements to the overall organization and readability of the code. This refactoring aims to enhance maintainability, scalability, and clarity of the codebase while preserving existing functionality. The changes include:

- Restructuring of the project directory into client and server components
- Renaming of modules and classes to better reflect their purpose and functionality
- Moving common utilities and configurations to a shared location
- Updating import statements to reflect the new structure
- Adding new documentation files for better clarity on various aspects of the project
- Removing deprecated or unused code to streamline the codebase
- Ensuring that all existing functionality is preserved and that the codebase remains functional after the refactoring.
2026-03-29 11:13:40 -04:00
parent 7e2038ecac
commit 0543266c92
65 changed files with 11371 additions and 140 deletions
@@ -0,0 +1,742 @@
+# Threshold Alerting System
+
+## Overview
+
+The Heartbeat Monitoring System includes a threshold alerting system that monitors plugin metrics and triggers notifications when values cross configured thresholds. This system is designed to:
+
+- **Detect anomalies**: Automatically identify when system metrics exceed safe operating ranges
+- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
+- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
+- **Track state**: Maintain alert history and state transitions per host
+- **Integrate seamlessly**: Work with existing notification infrastructure (email, pushover, etc.)
+
+## Architecture
+
+### Components
+
+1. **ThresholdChecker** (`hbd/threshold.py`)
+   - Main threshold checking engine
+   - Parses configuration
+   - Evaluates metrics against thresholds
+   - Triggers notifications on state changes
+
+2. **ThresholdConfig**
+   - Individual threshold configuration
+   - Supports multiple comparison operators
+   - Implements hysteresis logic
+
+3. **AlertState**
+   - Tracks current alert state per metric
+   - Records state transitions
+   - Manages notification timing
+
+4. **Integration Points**
+   - UDP handler: Checks thresholds when plugin data arrives
+   - Host objects: Store alert states per host
+   - Notification system: Sends alerts via configured channels
+
+### Alert Levels
+
+- **OK**: Metric is within normal range
+- **WARNING**: Metric has crossed the warning threshold (first-level concern)
+- **CRITICAL**: Metric has crossed the critical threshold (requires immediate attention)
+- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)
+
+## Configuration
+
+### Basic Structure
+
+Thresholds are configured in the YAML configuration file under the `thresholds` section:
+
+```yaml
+thresholds:
+  plugin_name:
+    metric_name:
+      warning: 80.0
+      critical: 90.0
+      operator: ">"
+      hysteresis: 0.1
+      enabled: true
+```
+
+### Configuration Parameters
+
+#### Required Parameters
+
+- **warning**: Warning threshold value (numeric)
+- **critical**: Critical threshold value (numeric)
+
+Note: Either value may be omitted, but at least one of `warning` or `critical` must be specified.
+
+#### Optional Parameters
+
+- **operator**: Comparison operator (default: `">"`)
+  - `">"` - Greater than
+  - `">="` - Greater than or equal
+  - `"<"` - Less than
+  - `"<="` - Less than or equal
+  - `"=="` - Equal to
+  - `"!="` - Not equal to
+
+- **hysteresis**: Hysteresis margin, expressed as a fraction of the threshold value, used to prevent flapping (default: `0.1` = 10%)
+  - Range: 0.0 to 1.0
+  - Prevents rapid state transitions when value hovers near threshold
+
+- **enabled**: Whether this threshold is active (default: `true`)
+
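
The defaults and constraints listed above can be sketched as a small validation helper. `ThresholdSpec` and `parse_threshold` are hypothetical names used only for illustration and are not the actual `ThresholdConfig` API. The sketch also assumes that an invalid entry should raise `ValueError`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

VALID_OPERATORS = {">", ">=", "<", "<=", "==", "!="}


@dataclass(frozen=True)
class ThresholdSpec:
    """Hypothetical container for one metric's threshold settings."""
    warning: Optional[float] = None
    critical: Optional[float] = None
    operator: str = ">"
    hysteresis: float = 0.1
    enabled: bool = True


def parse_threshold(raw: Dict[str, Any]) -> ThresholdSpec:
    """Apply the documented defaults and reject invalid settings."""
    warning = raw.get("warning")
    critical = raw.get("critical")
    if warning is None and critical is None:
        raise ValueError("at least one of 'warning' or 'critical' is required")

    op = raw.get("operator", ">")
    if op not in VALID_OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")

    hysteresis = float(raw.get("hysteresis", 0.1))
    if not 0.0 <= hysteresis <= 1.0:
        raise ValueError("hysteresis must be between 0.0 and 1.0")

    return ThresholdSpec(
        warning=None if warning is None else float(warning),
        critical=None if critical is None else float(critical),
        operator=op,
        hysteresis=hysteresis,
        enabled=bool(raw.get("enabled", True)),
    )
```

Calling `parse_threshold({"warning": 80.0, "critical": 90.0})` yields a spec with the default operator `">"`, hysteresis `0.1`, and `enabled=True`.
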
+### Comparison Operators
+
+#### Greater Than (`>`, `>=`)
+
+Used for metrics where **higher values are problematic**:
+
+```yaml
+cpu_monitor:
+  cpu_percent:
+    warning: 80.0      # Alert when CPU > 80%
+    critical: 90.0     # Alert when CPU > 90%
+    operator: ">"
+```
+
+Examples:
+- CPU usage percentage
+- Memory usage percentage
+- Disk usage percentage
+- Load average
+- Error counters
+
+#### Less Than (`<`, `<=`)
+
+Used for metrics where **lower values are problematic**:
+
+```yaml
+memory_monitor:
+  available_mb:
+    warning: 1000      # Alert when available memory < 1GB
+    critical: 500      # Alert when available memory < 500MB
+    operator: "<"
+```
+
+Examples:
+- Available memory
+- Free disk space
+- Connection pool availability
+- Battery level
+
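
As a rough illustration of how an operator string can drive evaluation, the sketch below maps each documented operator to a function from Python's standard `operator` module and checks the critical threshold before the warning one. `evaluate_level` is a hypothetical helper, hysteresis is ignored here, and the order of checks is an assumption rather than confirmed `ThresholdChecker` behavior.

```python
import operator
from typing import Callable, Dict, Optional

# Operator strings from the configuration mapped to standard comparison functions.
COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate_level(value, op: str,
                   warning: Optional[float] = None,
                   critical: Optional[float] = None) -> str:
    """Return "OK", "WARNING", "CRITICAL", or "UNKNOWN" for one metric value."""
    # Non-numeric data cannot be compared, so it is reported as UNKNOWN.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return "UNKNOWN"
    compare = COMPARATORS[op]
    # Check the more severe level first so it wins when both conditions hold.
    if critical is not None and compare(value, critical):
        return "CRITICAL"
    if warning is not None and compare(value, warning):
        return "WARNING"
    return "OK"
```

With the memory example above, `evaluate_level(750, "<", warning=1000, critical=500)` returns `"WARNING"`, because 750 MB is below the warning threshold but not below the critical one.
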
+## Hysteresis
+
+Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.
+
+### How It Works
+
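+Once a metric has crossed a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), the alert does not clear as soon as the value dips back under that threshold. The value must first improve past a recovery threshold, which is offset from the alert threshold by the hysteresis fraction: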
+
+```
+Threshold: 90
+Hysteresis: 0.1 (10%)
+Recovery threshold: 90 - (90 * 0.1) = 81
+
+Value 91 -> CRITICAL (threshold crossed)
+Value 89 -> CRITICAL (still above recovery threshold of 81)
+Value 85 -> CRITICAL (still above recovery threshold)
+Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
+```
+
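
The recovery rule above can be sketched in a few lines. `recovery_threshold` and `still_alerting` are hypothetical helpers. The mirrored formula for `<` and `<=` (threshold plus the hysteresis margin) is an assumption, since the worked example only covers `>`.

```python
def recovery_threshold(threshold: float, hysteresis: float, op: str = ">") -> float:
    """Value the metric must pass before an alert at `threshold` clears."""
    margin = abs(threshold) * hysteresis
    if op in (">", ">="):
        return threshold - margin   # must fall below this to recover
    if op in ("<", "<="):
        return threshold + margin   # assumed mirror case: must rise above this
    return threshold                # "==" / "!=": no margin in this sketch


def still_alerting(value: float, threshold: float, hysteresis: float, op: str = ">") -> bool:
    """True while a metric that is already alerting has not yet recovered."""
    limit = recovery_threshold(threshold, hysteresis, op)
    if op in (">", ">="):
        return value >= limit
    if op in ("<", "<="):
        return value <= limit
    return False


# Replay the worked example: threshold 90, hysteresis 0.1.
for value in (91, 89, 85, 80):
    state = "CRITICAL held" if still_alerting(value, 90, 0.1) else "re-evaluate"
    print(value, "->", state)
```

Only the last value (80) falls below the recovery threshold of 81, so it is the only one handed back to normal evaluation.
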
+### Configuration Recommendations
+
+- **Stable metrics** (CPU, memory): 10-15% hysteresis
+  ```yaml
+  hysteresis: 0.1
+  ```
+
+- **Very stable metrics** (disk usage): 5% hysteresis
+  ```yaml
+  hysteresis: 0.05
+  ```
+
+- **Counter metrics** (errors, packets): 20% hysteresis
+  ```yaml
+  hysteresis: 0.2
+  ```
+
+- **Binary states** (exit codes): No hysteresis
+  ```yaml
+  hysteresis: 0.0
+  ```
+
+## Plugin-Specific Configuration
+
+### CPU Monitor
+
+```yaml
+cpu_monitor:
+  cpu_percent:
+    warning: 80.0
+    critical: 90.0
+    operator: ">"
+    hysteresis: 0.1
+  
+  load_1min:
+    warning: 4.0
+    critical: 8.0
+    operator: ">"
+    hysteresis: 0.15
+  
+  load_5min:
+    warning: 3.0
+    critical: 6.0
+    operator: ">"
+  
+  load_15min:
+    warning: 2.0
+    critical: 4.0
+    operator: ">"
+```
+
+### Memory Monitor
+
+```yaml
+memory_monitor:
+  # Percentage-based threshold
+  percent:
+    warning: 85.0
+    critical: 95.0
+    operator: ">"
+  
+  # Absolute value threshold (inverse - alert when LOW)
+  available_mb:
+    warning: 1000
+    critical: 500
+    operator: "<"
+  
+  # Swap usage
+  swap_percent:
+    warning: 50.0
+    critical: 80.0
+    operator: ">"
+```
+
+### Disk Monitor
+
+Disk thresholds support **partition-specific configuration**:
+
+```yaml
+disk_monitor:
+  partitions:
+    /:
+      percent:
+        warning: 80.0
+        critical: 90.0
+        operator: ">"
+        hysteresis: 0.05
+      
+      free_gb:
+        warning: 10.0
+        critical: 5.0
+        operator: "<"
+    
+    /home:
+      percent:
+        warning: 85.0
+        critical: 95.0
+        operator: ">"
+    
+    /var:
+      percent:
+        warning: 80.0
+        critical: 90.0
+        operator: ">"
+      
+      free_gb:
+        warning: 5.0
+        critical: 2.0
+        operator: "<"
+```
+
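
Alert states elsewhere in this document are keyed by paths such as `disk_monitor./.percent`. One way a nested partition configuration could be flattened into such paths is sketched below. `iter_thresholds` is a hypothetical helper, and dropping the `partitions` grouping key from the path is an assumption inferred from those example keys.

```python
from typing import Any, Dict, Iterator, Tuple

LEAF_KEYS = {"warning", "critical"}


def iter_thresholds(plugin: str, node: Dict[str, Any],
                    prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (metric_path, settings) pairs for every threshold under a plugin."""
    for key, child in node.items():
        if not isinstance(child, dict):
            continue
        if not LEAF_KEYS.isdisjoint(child):
            # A mapping with warning/critical is an actual threshold definition.
            yield ".".join((plugin, *prefix, key)), child
        elif key == "partitions":
            # Assumption: the grouping key itself is not part of the metric path.
            yield from iter_thresholds(plugin, child, prefix)
        else:
            yield from iter_thresholds(plugin, child, prefix + (key,))


disk_config = {
    "partitions": {
        "/": {
            "percent": {"warning": 80.0, "critical": 90.0, "operator": ">"},
            "free_gb": {"warning": 10.0, "critical": 5.0, "operator": "<"},
        },
        "/var": {
            "percent": {"warning": 80.0, "critical": 90.0, "operator": ">"},
        },
    }
}

paths = [path for path, _ in iter_thresholds("disk_monitor", disk_config)]
print(paths)
```

The same walk handles flat plugins such as `cpu_monitor`, where each metric sits directly under the plugin name.
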
+### Network Monitor
+
+```yaml
+network_monitor:
+  # Error counters
+  errors_total:
+    warning: 100
+    critical: 1000
+    operator: ">"
+    hysteresis: 0.2
+  
+  # Dropped packets
+  dropin_total:
+    warning: 50
+    critical: 200
+    operator: ">"
+  
+  dropout_total:
+    warning: 50
+    critical: 200
+    operator: ">"
+  
+  # Connection states
+  connections_TIME_WAIT:
+    warning: 1000
+    critical: 5000
+    operator: ">"
+  
+  connections_ESTABLISHED:
+    warning: 500
+    critical: 1000
+    operator: ">"
+```
+
+### Nagios Runner
+
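+The Nagios plugin runner reports exit codes that can be thresholded. Because the operator is `>=`, a Nagios UNKNOWN result (exit code 3) also evaluates as CRITICAL with the configuration below: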
+
+```yaml
+nagios_runner:
+  exit_code:
+    warning: 1       # Map Nagios WARNING to our WARNING
+    critical: 2      # Map Nagios CRITICAL to our CRITICAL
+    operator: ">="
+    hysteresis: 0.0  # No hysteresis for exit codes
+```
+
+## Notification Behavior
+
+### When Notifications Are Sent
+
+Notifications are triggered on **state changes**:
+
+1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
+   ```
+   WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
+   ```
+
+2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
+   ```
+   RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
+   ```
+
+3. **Re-notifications**: Periodic reminders for ongoing alerts
+   ```
+   REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
+   ```
+
+### Notification Frequency
+
+- **State changes**: Immediate notification
+- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)
+
+```yaml
+threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
+```
+
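
The notification rules in this section can be sketched as a single decision function. `build_notification` is a hypothetical helper, the message formats are copied from the examples above, and the sketch only covers the OK, WARNING, and CRITICAL levels (UNKNOWN is left out).

```python
from typing import Optional

RENOTIFY_INTERVAL = 3600  # seconds; mirrors the documented default
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def build_notification(host: str, metric: str, value: float,
                       old: str, new: str, since: float,
                       last_notification: Optional[float], now: float,
                       interval: int = RENOTIFY_INTERVAL) -> Optional[str]:
    """Return the message to send, or None when nothing should be sent."""
    if new != old:
        # State changes are always reported immediately.
        if SEVERITY[new] > SEVERITY[old]:
            return f"{new}: {host} - {metric} = {value}"
        return f"RECOVERED: {host} - {metric} = {value} ({old} -> {new})"
    # Unchanged, still alerting: remind only once the interval has elapsed.
    if (new != "OK" and last_notification is not None
            and now - last_notification >= interval):
        return (f"REMINDER ({new}): {host} - {metric} = {value} "
                f"(ongoing for {int(now - since)}s)")
    return None
```

A caller would pass the previous and newly evaluated levels for each check and send the returned message only when it is not `None`.
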
+### Notification Channels
+
+Thresholds use the same notification infrastructure as heartbeat monitoring:
+
+- **Email** (via SMTP)
+- **Pushover** (mobile notifications)
+- **Mattermost** (team chat)
+- **Custom webhooks**
+
+Configuration:
+
+```yaml
+# Email
+toemail:
+  - admin@example.com
+  - oncall@example.com
+fromemail: heartbeat@example.com
+smtpserver: smtp.example.com
+smtpport: 587
+smtpuser: heartbeat@example.com
+smtppassword: your-password
+
+# Pushover
+pushover_token: your-app-token
+pushover_user: your-user-key
+```
+
+### Watched Hosts
+
+Only hosts in the `watchhosts` list will trigger notifications:
+
+```yaml
+watchhosts:
+  - webserver01
+  - database01
+  - mailserver
+```
+
+Hosts not in this list will still have thresholds checked and alert states tracked, but won't send notifications.
+
+## Alert State Tracking
+
+Each host maintains alert states for all monitored metrics:
+
+```python
+host.alert_states = {
+    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
+    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
+    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
+}
+```
+
+Alert states are held in memory and persisted together with the host data (via pickle).
+
+### Alert State Information
+
+Each `AlertState` tracks:
+
+- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
+- **since**: Timestamp when current state started
+- **last_value**: Most recent metric value
+- **last_check**: Timestamp of last threshold check
+- **notification_count**: Number of notifications sent for this alert
+- **last_notification**: Timestamp of last notification
+
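
For orientation, the fields listed above could be modeled as a dataclass like the one below. This is a minimal sketch, not the real `AlertState` class: the names `AlertLevel` and `AlertStateSketch`, the `update` method, and the choice to reset `notification_count` on a level change are all assumptions.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class AlertLevel(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


@dataclass
class AlertStateSketch:
    level: AlertLevel = AlertLevel.OK
    since: float = field(default_factory=time.time)
    last_value: Optional[float] = None
    last_check: Optional[float] = None
    notification_count: int = 0
    last_notification: Optional[float] = None

    def update(self, level: AlertLevel, value: float, now: float) -> bool:
        """Record one threshold check; return True when the level changed."""
        changed = level is not self.level
        if changed:
            self.level = level
            self.since = now              # a new state starts now
            self.notification_count = 0   # assumed: counter restarts per state
        self.last_value = value
        self.last_check = now
        return changed
```

Because `since` only moves on a level change, `now - since` gives the duration of the current state.
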
+### Querying Alert States
+
+The HTTP API below is a planned enhancement and is not yet implemented. Proposed request:
+
+```http
+GET /api/hosts/webserver01/alerts
+```
+
+Proposed response:
+```json
+{
+  "active_alerts": [
+    {
+      "metric": "cpu_monitor.cpu_percent",
+      "level": "WARNING",
+      "since": 1234567890,
+      "value": 85.0,
+      "duration": 300
+    }
+  ],
+  "summary": {
+    "ok": 15,
+    "warning": 1,
+    "critical": 0
+  }
+}
+```
+
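
Since the endpoint is not implemented yet, the following is only a sketch of how the response shape above might be assembled. `build_alert_response` and its input format (metric path mapped to a `(level, since, last_value)` tuple) are hypothetical.

```python
from typing import Any, Dict, Tuple


def build_alert_response(states: Dict[str, Tuple[str, float, float]],
                         now: float) -> Dict[str, Any]:
    """Build the proposed response from {metric_path: (level, since, last_value)}."""
    summary = {"ok": 0, "warning": 0, "critical": 0}
    active = []
    for metric, (level, since, value) in states.items():
        key = level.lower()
        if key in summary:
            summary[key] += 1
        # Only WARNING and CRITICAL states count as active alerts.
        if level in ("WARNING", "CRITICAL"):
            active.append({
                "metric": metric,
                "level": level,
                "since": since,
                "value": value,
                "duration": int(now - since),
            })
    return {"active_alerts": active, "summary": summary}
```

The returned dictionary can be serialized directly with `json.dumps`.
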
+## Testing
+
+A comprehensive test suite is provided in `test_threshold.py`:
+
+```bash
+python test_threshold.py
+```
+
+Tests cover:
+- Threshold configuration and parsing
+- All comparison operators
+- Hysteresis functionality
+- Alert state tracking
+- State change detection
+- Notification triggering
+- Nested metrics (partitions)
+- Alert summaries
+
+## Best Practices
+
+### 1. Start Conservative
+
+Begin with higher thresholds to avoid alert fatigue:
+
+```yaml
+cpu_monitor:
+  cpu_percent:
+    warning: 85.0    # Start higher
+    critical: 95.0   # Very high for critical
+```
+
+Adjust downward based on observed behavior.
+
+### 2. Consider Workload Patterns
+
+Different systems have different normal ranges:
+
+**Web servers** (bursty traffic):
+```yaml
+cpu_percent:
+  warning: 80.0
+  critical: 90.0
+  hysteresis: 0.15  # Higher hysteresis for burstiness
+```
+
+**Database servers** (steady load):
+```yaml
+cpu_percent:
+  warning: 70.0
+  critical: 85.0
+  hysteresis: 0.1   # Lower hysteresis for steady metrics
+```
+
+### 3. Use Appropriate Operators
+
+Match the operator to the metric:
+
+| Metric Type | Example | Operator | Reason |
+|-------------|---------|----------|--------|
+| Resource usage | CPU%, Memory% | `>` | Alert when high |
+| Available resources | Free memory, Free disk | `<` | Alert when low |
+| Error counters | Network errors | `>` | Alert when increasing |
+| Health checks | Nagios exit code | `>=` | Map to standard codes |
+
+### 4. Align with Monitoring Intervals
+
+Ensure threshold checks align with plugin collection intervals:
+
+```yaml
+plugins:
+  cpu_monitor:
+    interval: 300    # Check every 5 minutes
+
+thresholds:
+  cpu_monitor:
+    cpu_percent:
+      warning: 80.0
+      # Will be checked every 5 minutes
+```
+
+### 5. Test Before Production
+
+1. **Start with disabled thresholds**:
+   ```yaml
+   enabled: false
+   ```
+
+2. **Observe metric ranges** over a week
+
+3. **Set thresholds** based on observed data
+
+4. **Enable gradually**:
+   ```yaml
+   enabled: true
+   ```
+
+5. **Monitor for false positives**
+
+### 6. Document Baseline Values
+
+Keep a record of normal operating ranges:
+
+```yaml
+# Production web server baseline (observed over 30 days):
+# CPU: 20-40% normal, 60% peak
+# Memory: 60-70% normal, 80% peak
+# Disk /: 40-50% usage, growing 2%/month
+
+cpu_monitor:
+  cpu_percent:
+    warning: 75.0   # Above peak + margin
+    critical: 90.0  # Danger zone
+```
+
+### 7. Layer Alerts
+
+Use WARNING for early notification, CRITICAL for immediate action:
+
+```yaml
+disk_monitor:
+  partitions:
+    /:
+      percent:
+        warning: 75.0    # Early warning: "check in next few days"
+        critical: 90.0   # Critical: "act now before outage"
+```
+
+## Troubleshooting
+
+### No Notifications Being Sent
+
+1. **Check if host is watched**:
+   ```yaml
+   watchhosts:
+     - your-host-name
+   ```
+
+2. **Verify notification configuration**:
+   ```yaml
+   toemail:
+     - admin@example.com
+   smtpserver: smtp.example.com
+   ```
+
+3. **Check threshold configuration**:
+   ```bash
+   # Look for parsing errors in server logs
+   grep "threshold" /var/log/heartbeat/hbd.log
+   ```
+
+4. **Verify metric names**:
+   - Metric names must match exactly (case-sensitive)
+   - Check journal or logs for actual metric names
+
+### Too Many Alerts (Flapping)
+
+1. **Increase hysteresis**:
+   ```yaml
+   hysteresis: 0.2  # Increase from 0.1 to 0.2 (20%)
+   ```
+
+2. **Adjust thresholds**:
+   ```yaml
+   warning: 85.0  # Increase from 80.0
+   ```
+
+3. **Increase renotification interval**:
+   ```yaml
+   threshold_renotify_interval: 7200  # 2 hours instead of 1
+   ```
+
+### Alerts Not Triggering
+
+1. **Check threshold operator**:
+   ```yaml
+   # For available memory (alert when LOW):
+   operator: "<"   # NOT ">"
+   ```
+
+2. **Verify numeric values**:
+   - Ensure metric values are numeric
+   - Check for unit mismatches (MB vs GB)
+
+3. **Check if threshold is enabled**:
+   ```yaml
+   enabled: true  # NOT false
+   ```
+
+4. **Review hysteresis settings**:
+   - Very high hysteresis may prevent state changes
+   - Try reducing or disabling temporarily
+
+### Alert State Not Recovering
+
+1. **Check recovery threshold calculation**:
+   ```
+   Threshold: 90
+   Hysteresis: 0.1
+   Recovery: 90 - (90 * 0.1) = 81
+   
+   Value must drop below 81 to recover
+   ```
+
+2. **Temporarily disable hysteresis**:
+   ```yaml
+   hysteresis: 0.0
+   ```
+
+3. **Monitor actual metric values**:
+   ```bash
+   # Check journal for actual values
+   grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
+   ```
+
+## Advanced Topics
+
+### Custom Notification Callbacks
+
+The ThresholdChecker supports custom notification functions:
+
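+```python
+import logging
+
+# Module path taken from the Components section (`hbd/threshold.py`)
+from hbd.threshold import ThresholdChecker
+
+logger = logging.getLogger(__name__)
+
+
+def custom_notifier(message):
+    # `pagerduty` and `metrics` are placeholders for your own client objects;
+    # they are not provided by the threshold module.
+
+    # Send to incident management system
+    pagerduty.trigger(message)
+
+    # Log to custom system
+    logger.critical(message)
+
+    # Update dashboard
+    metrics.alert_count.inc()
+
+
+# `config` is the already-loaded server configuration
+checker = ThresholdChecker(
+    config=config,
+    notification_callback=custom_notifier
+)
+```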
+
+### Programmatic Access
+
+Query alert states programmatically:
+
+```python
+import time
+
+# Get all active alerts for a host
+active = threshold_checker.get_active_alerts(host.alert_states)
+
+now = time.time()
+for alert in active:
+    print(f"{alert.metric_path}: {alert.level.name} for {now - alert.since:.0f}s")
+
+# Get alert summary
+summary = threshold_checker.get_alert_summary(host.alert_states)
+print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
+```
+
+### Integration with External Systems
+
+Threshold violations can be integrated with:
+
+- **PagerDuty**: Incident creation and escalation
+- **OpsGenie**: On-call scheduling and routing
+- **ServiceNow**: Ticket creation
+- **Grafana**: Dashboard annotations
+- **Elasticsearch**: Alert indexing and analysis
+
+## Future Enhancements
+
+Planned features:
+
+1. **Composite thresholds**: Alert based on multiple metrics
+   ```yaml
+   composite:
+     high_load_with_low_memory:
+       conditions:
+         - cpu_monitor.load_1min > 8.0
+         - memory_monitor.available_mb < 500
+   ```
+
+2. **Time-based thresholds**: Different thresholds by time of day
+   ```yaml
+   schedule:
+     business_hours:
+       warning: 70.0
+     off_hours:
+       warning: 85.0
+   ```
+
+3. **Rate-of-change thresholds**: Alert on rapid changes
+   ```yaml
+   rate_of_change:
+     metric: cpu_percent
+     period: 300
+     threshold: 30.0  # Alert if changes >30% in 5 minutes
+   ```
+
+4. **Alert grouping**: Combine related alerts
+   ```yaml
+   groups:
+     disk_critical:
+       metrics:
+         - disk_monitor./.percent
+         - disk_monitor./var.percent
+       action: single_notification
+   ```
+
+5. **Maintenance windows**: Suppress alerts during planned maintenance
+   ```yaml
+   maintenance:
+     - host: webserver01
+       start: 2024-01-15T02:00:00Z
+       end: 2024-01-15T04:00:00Z
+   ```
+
+## See Also
+
+- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
+- [Message Journal Documentation](MESSAGE_JOURNAL.md)
+- Configuration examples: `hbd/config_thresholds_example.yaml`
+- Test suite: `test_threshold.py`