# Threshold Alerting System

## Overview

The Heartbeat Monitoring System includes a threshold alerting system that monitors plugin metrics and triggers notifications when values cross configured thresholds. This system is designed to:

- **Detect anomalies**: Automatically identify when system metrics leave safe operating ranges
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
- **Track state**: Maintain alert history and state transitions per host
- **Integrate seamlessly**: Work with existing notification infrastructure (email, Pushover, etc.)

## Architecture

### Components

1. **ThresholdChecker** (`hbd/threshold.py`)
   - Main threshold checking engine
   - Parses configuration
   - Evaluates metrics against thresholds
   - Triggers notifications on state changes

2. **ThresholdConfig**
   - Individual threshold configuration
   - Supports multiple comparison operators
   - Implements hysteresis logic

3. **AlertState**
   - Tracks current alert state per metric
   - Records state transitions
   - Manages notification timing

4. **Integration Points**
   - UDP handler: Checks thresholds when plugin data arrives
   - Host objects: Store alert states per host
   - Notification system: Sends alerts via configured channels

### Alert Levels

- **OK**: Metric is within normal range
- **WARNING**: Metric has crossed the warning threshold (first-level concern)
- **CRITICAL**: Metric has crossed the critical threshold (requires immediate attention)
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)
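
As a rough illustration of the UNKNOWN case, the sketch below shows one way a raw plugin value could be checked before it is compared against a threshold. The names `AlertLevel` and `coerce_metric` are hypothetical and are not taken from `hbd/threshold.py`; treating booleans and NaN as unevaluable is also an assumption.

```python
import math
from enum import Enum
from typing import Optional


class AlertLevel(Enum):
    """Hypothetical enum mirroring the four documented levels."""
    OK = "OK"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"
    UNKNOWN = "UNKNOWN"


def coerce_metric(raw: object) -> Optional[float]:
    """Return the value as a float, or None when it cannot be evaluated."""
    if isinstance(raw, bool):
        # bool is a subclass of int; treated as non-numeric here (assumption).
        return None
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(value) else value


def level_if_unevaluable(raw: object) -> Optional[AlertLevel]:
    """Return UNKNOWN for unusable data, or None when normal evaluation can proceed."""
    return AlertLevel.UNKNOWN if coerce_metric(raw) is None else None
```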

## Configuration

### Basic Structure

Thresholds are configured in the YAML configuration file under the `thresholds` section:

```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      enabled: true
```

### Configuration Parameters

#### Required Parameters

At least one of the following must be specified:

- **warning**: Warning threshold value (numeric)
- **critical**: Critical threshold value (numeric)

#### Optional Parameters

- **operator**: Comparison operator (default: `">"`)
  - `">"` - Greater than
  - `">="` - Greater than or equal
  - `"<"` - Less than
  - `"<="` - Less than or equal
  - `"=="` - Equal to
  - `"!="` - Not equal to

- **hysteresis**: Hysteresis margin, expressed as a fraction of the threshold value, used to prevent flapping (default: `0.1` = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when the value hovers near a threshold

- **enabled**: Whether this threshold is active (default: `true`)
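
The parameter rules above can be sketched as a small validation step. This is a minimal illustration, not the actual `ThresholdConfig` implementation: `ThresholdSpec` and `parse_threshold` are hypothetical names, and raising `ValueError` on invalid input is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

VALID_OPERATORS = {">", ">=", "<", "<=", "==", "!="}


@dataclass(frozen=True)
class ThresholdSpec:
    """Hypothetical container for one metric's threshold settings."""
    warning: Optional[float] = None
    critical: Optional[float] = None
    operator: str = ">"
    hysteresis: float = 0.1
    enabled: bool = True


def parse_threshold(entry: Mapping[str, Any]) -> ThresholdSpec:
    """Validate one metric entry and apply the documented defaults."""
    warning = entry.get("warning")
    critical = entry.get("critical")
    if warning is None and critical is None:
        raise ValueError("at least one of 'warning' or 'critical' is required")

    op = entry.get("operator", ">")
    if op not in VALID_OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")

    hysteresis = float(entry.get("hysteresis", 0.1))
    if not 0.0 <= hysteresis <= 1.0:
        raise ValueError("hysteresis must be between 0.0 and 1.0")

    return ThresholdSpec(
        warning=None if warning is None else float(warning),
        critical=None if critical is None else float(critical),
        operator=op,
        hysteresis=hysteresis,
        enabled=bool(entry.get("enabled", True)),
    )
```

An entry that sets only `warning: 80` would come back with `operator=">"`, `hysteresis=0.1`, and `enabled=True` filled in.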

### Comparison Operators

#### Greater Than (`>`, `>=`)

Used for metrics where **higher values are problematic**:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0      # Alert when CPU > 80%
    critical: 90.0     # Alert when CPU > 90%
    operator: ">"
```

Examples:
- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters

#### Less Than (`<`, `<=`)

Used for metrics where **lower values are problematic**:

```yaml
memory_monitor:
  available_mb:
    warning: 1000      # Alert when available memory < 1GB
    critical: 500      # Alert when available memory < 500MB
    operator: "<"
```

Examples:
- Available memory
- Free disk space
- Connection pool availability
- Battery level
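
To make the operator semantics concrete, here is a minimal sketch that maps each operator string to a function from Python's standard `operator` module and checks the critical threshold before the warning threshold. The function name `evaluate` and the critical-first ordering are assumptions for illustration; hysteresis is deliberately left out.

```python
import operator
from typing import Optional

# Operator strings from the configuration mapped to standard comparisons.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate(value: float, op: str,
             warning: Optional[float] = None,
             critical: Optional[float] = None) -> str:
    """Return "CRITICAL", "WARNING", or "OK" for a single reading."""
    compare = COMPARATORS[op]
    # Check the more severe threshold first so it takes precedence.
    if critical is not None and compare(value, critical):
        return "CRITICAL"
    if warning is not None and compare(value, warning):
        return "WARNING"
    return "OK"


print(evaluate(85.0, ">", warning=80.0, critical=90.0))  # CPU example
print(evaluate(400, "<", warning=1000, critical=500))    # available_mb example
```

With the `<` operator the critical value sits below the warning value, which is why `available_mb` uses `warning: 1000` and `critical: 500`.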

## Hysteresis

Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.

### How It Works

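When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), the alert does not clear as soon as the value moves back across the threshold. It clears only once the value passes the recovery threshold, shown here for the `>` operator: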

```
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81

Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
```
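
The arithmetic above can be replayed in a few lines. This is a sketch of the idea only: `recovery_threshold` and `stays_critical` are hypothetical names, the example tracks a single CRITICAL threshold with the `>` operator, and applying the margin upward for `<` and `<=` is an assumption that the text above does not confirm.

```python
def recovery_threshold(threshold: float, hysteresis: float, op: str = ">") -> float:
    """Return the value a metric must pass before an alert at `threshold` clears."""
    margin = abs(threshold) * hysteresis
    if op in (">", ">="):
        return threshold - margin  # high is bad: value must fall below this
    # Assumption: for "<" and "<=" the margin is applied upward instead.
    return threshold + margin


def stays_critical(value: float, was_critical: bool,
                   threshold: float = 90.0, hysteresis: float = 0.1) -> bool:
    """Single CRITICAL threshold with the `>` operator."""
    if value > threshold:
        return True
    # Back under the threshold: an existing alert holds until the value
    # drops below the recovery threshold.
    return was_critical and value >= recovery_threshold(threshold, hysteresis)


critical = False
trace = []
for value in (91, 89, 85, 80):
    critical = stays_critical(value, critical)
    trace.append((value, "CRITICAL" if critical else "cleared"))

print(trace)
```

The trace holds CRITICAL for 91, 89, and 85 and clears at 80, at which point the value would be re-evaluated against the warning threshold.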

### Configuration Recommendations

- **Stable metrics** (CPU, memory): 10-15% hysteresis
  ```yaml
  hysteresis: 0.1
  ```

- **Very stable metrics** (disk usage): 5% hysteresis
  ```yaml
  hysteresis: 0.05
  ```

- **Counter metrics** (errors, packets): 20% hysteresis
  ```yaml
  hysteresis: 0.2
  ```

- **Binary states** (exit codes): No hysteresis
  ```yaml
  hysteresis: 0.0
  ```

## Plugin-Specific Configuration

### CPU Monitor

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1

  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15

  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"

  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```

### Memory Monitor

```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"

  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"

  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```

### Disk Monitor

Disk thresholds support **partition-specific configuration**:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05

      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"

    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"

    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"

      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```
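
One way to picture how nested partition settings relate to flat metric paths such as `disk_monitor./.percent` is the sketch below. It is an assumption-laden illustration: `iter_thresholds` is a hypothetical helper, and dropping the `partitions` level from the path is inferred from the path format rather than taken from the implementation.

```python
from collections.abc import Mapping
from typing import Any, Iterator, Tuple

THRESHOLD_KEYS = {"warning", "critical", "operator", "hysteresis", "enabled"}


def iter_thresholds(plugin: str, node: Mapping,
                    prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, Any]]:
    """Yield (metric_path, settings) for every threshold under one plugin."""
    for key, child in node.items():
        if not isinstance(child, Mapping):
            continue
        if THRESHOLD_KEYS.intersection(child):
            # Leaf: this mapping holds the settings for one metric.
            yield ".".join((plugin, *prefix, key)), child
        elif key == "partitions":
            # Assumption: the `partitions` level does not appear in the path.
            yield from iter_thresholds(plugin, child, prefix)
        else:
            yield from iter_thresholds(plugin, child, prefix + (key,))


disk_config = {
    "partitions": {
        "/": {
            "percent": {"warning": 80.0, "critical": 90.0},
            "free_gb": {"warning": 10.0, "critical": 5.0, "operator": "<"},
        },
        "/var": {"percent": {"warning": 80.0, "critical": 90.0}},
    }
}

for path, settings in iter_thresholds("disk_monitor", disk_config):
    print(path, settings)
```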

### Network Monitor

```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2

  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"

  dropout_total:
    warning: 50
    critical: 200
    operator: ">"

  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"

  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```

### Nagios Runner

The Nagios plugin runner reports exit codes that can be thresholded:

```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```
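
The effect of the `>=` operator on exit codes can be checked with a short sketch. The function name `nagios_level` is hypothetical. Note that under this configuration the standard Nagios UNKNOWN exit code (3) also satisfies `>= 2` and is therefore reported as CRITICAL.

```python
def nagios_level(exit_code: int, warning: int = 1, critical: int = 2) -> str:
    """Apply the `>=` thresholds from the example configuration to an exit code."""
    if exit_code >= critical:
        return "CRITICAL"
    if exit_code >= warning:
        return "WARNING"
    return "OK"


for code in (0, 1, 2, 3):
    print(code, nagios_level(code))
```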

## Notification Behavior

### When Notifications Are Sent

Notifications are triggered on **state changes**:

1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
   ```
   WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
   ```

2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
   ```
   RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   ```
   REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
   ```
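
The three message shapes above could be produced by formatting helpers along these lines. The function names are hypothetical; only the output format is taken from the examples.

```python
def format_alert(host: str, metric: str, value: float, level: str) -> str:
    """Message for an escalation to WARNING or CRITICAL."""
    return f"{level}: {host} - {metric} = {value}"


def format_recovery(host: str, metric: str, value: float,
                    old_level: str, new_level: str) -> str:
    """Message for a transition to a less severe level."""
    return f"RECOVERED: {host} - {metric} = {value} ({old_level} -> {new_level})"


def format_reminder(host: str, metric: str, value: float, level: str,
                    since: float, now: float) -> str:
    """Message for a periodic reminder about an ongoing alert."""
    return (f"REMINDER ({level}): {host} - {metric} = {value} "
            f"(ongoing for {int(now - since)}s)")


print(format_alert("webserver01", "cpu_monitor.cpu_percent", 85.0, "WARNING"))
```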

### Notification Frequency

- **State changes**: Immediate notification
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)

```yaml
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```
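
The timing rules can be summarized in a small decision function. This is a sketch under assumptions: `should_notify` is a hypothetical name, levels are plain strings, and sending reminders only for WARNING and CRITICAL is assumed rather than documented.

```python
from typing import Optional


def should_notify(previous_level: str, new_level: str,
                  last_notification: Optional[float], now: float,
                  renotify_interval: float = 3600.0) -> bool:
    """Decide whether a notification is due after a threshold check."""
    if new_level != previous_level:
        return True  # state change: notify immediately
    if new_level not in ("WARNING", "CRITICAL"):
        return False  # nothing ongoing to remind about (assumption)
    if last_notification is None:
        return True
    # Ongoing alert: remind once the interval has elapsed.
    return now - last_notification >= renotify_interval
```

For example, an alert that has stayed CRITICAL is silent 1800 seconds after the last notification and due again at 3600 seconds with the default interval.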

### Notification Channels

Thresholds use the same notification infrastructure as heartbeat monitoring:

- **Email** (via SMTP)
- **Pushover** (mobile notifications)
- **Mattermost** (team chat)
- **Custom webhooks**

Configuration:

```yaml
# Email
toemail:
  - admin@example.com
  - oncall@example.com
fromemail: heartbeat@example.com
smtpserver: smtp.example.com
smtpport: 587
smtpuser: heartbeat@example.com
smtppassword: your-password

# Pushover
pushover_token: your-app-token
pushover_user: your-user-key
```

### Watched Hosts

Only hosts in the `watchhosts` list will trigger notifications:

```yaml
watchhosts:
  - webserver01
  - database01
  - mailserver
```

Hosts not in this list will still have thresholds checked and alert states tracked, but won't send notifications.
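
In other words, the watch list gates delivery only, not evaluation. A minimal sketch of that gate, with `notify_if_watched` and the `send` callback as hypothetical names:

```python
from typing import Callable, Iterable


def notify_if_watched(host: str, message: str,
                      watchhosts: Iterable[str],
                      send: Callable[[str], None]) -> bool:
    """Deliver the message only for watched hosts; return True when sent."""
    if host not in set(watchhosts):
        # State has already been tracked by this point; only delivery is skipped.
        return False
    send(message)
    return True


sent = []
watch = ["webserver01", "database01", "mailserver"]
notify_if_watched("webserver01", "WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0", watch, sent.append)
notify_if_watched("testbox", "WARNING: testbox - cpu_monitor.cpu_percent = 85.0", watch, sent.append)
print(sent)
```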

## Alert State Tracking

Each host maintains alert states for all monitored metrics:

```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```

Alert states are held in memory and are pickled together with the host data when it is saved.

### Alert State Information

Each `AlertState` tracks:

- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- **since**: Timestamp when current state started
- **last_value**: Most recent metric value
- **last_check**: Timestamp of last threshold check
- **notification_count**: Number of notifications sent for this alert
- **last_notification**: Timestamp of last notification
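
The fields listed above map naturally onto a dataclass. The sketch below is illustrative only: `AlertStateSketch` is a hypothetical stand-in for the real `AlertState`, levels are plain strings, and resetting `notification_count` on a level change is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertStateSketch:
    """Hypothetical stand-in carrying the documented AlertState fields."""
    level: str = "OK"
    since: float = 0.0
    last_value: Optional[float] = None
    last_check: Optional[float] = None
    notification_count: int = 0
    last_notification: Optional[float] = None

    def update(self, level: str, value: float, now: float) -> bool:
        """Record one threshold check; return True when the level changed."""
        changed = level != self.level
        if changed:
            self.level = level
            self.since = now
            self.notification_count = 0  # assumption: count restarts per alert
        self.last_value = value
        self.last_check = now
        return changed

    def mark_notified(self, now: float) -> None:
        """Record that a notification was sent for the current alert."""
        self.notification_count += 1
        self.last_notification = now
```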

### Querying Alert States

An HTTP API for querying alert states is a planned enhancement and is not yet available. The intended request is:

```http
GET /api/hosts/webserver01/alerts
```

Response:
```json
{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}
```
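
Since the endpoint is not implemented yet, the following is only a sketch of how such a response could be assembled from per-metric state. `build_alert_payload` is a hypothetical helper, states are represented as plain dictionaries, and computing `duration` as the time elapsed since `since` is an assumption based on the field names.

```python
import time
from typing import Any, Dict, Mapping, Optional


def build_alert_payload(states: Mapping[str, Mapping[str, Any]],
                        now: Optional[float] = None) -> Dict[str, Any]:
    """Build a response shaped like the example from per-metric state."""
    now = time.time() if now is None else now
    summary = {"ok": 0, "warning": 0, "critical": 0}
    active = []
    for metric, state in states.items():
        level = state["level"]
        if level.lower() in summary:
            summary[level.lower()] += 1
        if level in ("WARNING", "CRITICAL"):
            active.append({
                "metric": metric,
                "level": level,
                "since": state["since"],
                "value": state["last_value"],
                "duration": int(now - state["since"]),
            })
    return {"active_alerts": active, "summary": summary}


states = {
    "cpu_monitor.cpu_percent": {"level": "WARNING", "since": 1234567890, "last_value": 85.0},
    "memory_monitor.percent": {"level": "OK", "since": 1234567800, "last_value": 40.0},
}
payload = build_alert_payload(states, now=1234568190)
print(payload)
```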

## Testing

A comprehensive test suite is provided in `test_threshold.py`:

```bash
python test_threshold.py
```

Tests cover:
- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries

## Best Practices

### 1. Start Conservative

Begin with higher thresholds to avoid alert fatigue:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0    # Start higher
    critical: 95.0   # Very high for critical
```

Adjust downward based on observed behavior.

### 2. Consider Workload Patterns

Different systems have different normal ranges:

**Web servers** (bursty traffic):
```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```

**Database servers** (steady load):
```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1   # Lower hysteresis for steady metrics
```

### 3. Use Appropriate Operators

Match the operator to the metric:

| Metric Type | Example | Operator | Reason |
|-------------|---------|----------|--------|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when the count exceeds the threshold |
| Health checks | Nagios exit code | `>=` | Map to standard codes |

### 4. Align with Monitoring Intervals

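Thresholds are evaluated when plugin data arrives, so the plugin collection interval determines how often a threshold is checked and how quickly an alert can fire: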

```yaml
plugins:
  cpu_monitor:
    interval: 300    # Check every 5 minutes

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes
```

### 5. Test Before Production

1. **Start with disabled thresholds**:
   ```yaml
   enabled: false
   ```

2. **Observe metric ranges** over a week

3. **Set thresholds** based on observed data

4. **Enable gradually**:
   ```yaml
   enabled: true
   ```

5. **Monitor for false positives**

### 6. Document Baseline Values

Keep a record of normal operating ranges:

```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month

cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```

### 7. Layer Alerts

Use WARNING for early notification, CRITICAL for immediate action:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0    # Early warning: "check in next few days"
        critical: 90.0   # Critical: "act now before outage"
```

## Troubleshooting

### No Notifications Being Sent

1. **Check if host is watched**:
   ```yaml
   watchhosts:
     - your-host-name
   ```

2. **Verify notification configuration**:
   ```yaml
   toemail:
     - admin@example.com
   smtpserver: smtp.example.com
   ```

3. **Check threshold configuration**:
   ```bash
   # Look for parsing errors in server logs
   grep "threshold" /var/log/heartbeat/hbd.log
   ```

4. **Verify metric names**:
   - Metric names must match exactly (case-sensitive)
   - Check journal or logs for actual metric names
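
Because names are matched case-sensitively, a quick comparison of configured paths against the names a host actually reports can reveal mismatches. The helper below is a hypothetical sketch; how you collect the observed names (journal, logs) depends on your setup.

```python
from typing import Dict, Iterable, Optional


def find_unmatched(configured: Iterable[str],
                   observed: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each configured path with no exact match to a case-insensitive
    near match, or to None when nothing similar was observed."""
    observed = list(observed)
    exact = set(observed)
    by_lower = {name.lower(): name for name in observed}
    problems = {}
    for path in configured:
        if path not in exact:
            problems[path] = by_lower.get(path.lower())
    return problems


problems = find_unmatched(
    configured=["cpu_monitor.CPU_Percent", "memory_monitor.percent", "disk_monitor./.used"],
    observed=["cpu_monitor.cpu_percent", "memory_monitor.percent"],
)
print(problems)
```

A non-`None` value points at a likely capitalization mistake; `None` means the metric was not reported at all.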

### Too Many Alerts (Flapping)

1. **Increase hysteresis**:
   ```yaml
   hysteresis: 0.2  # Increase from 0.1 to 0.2 (20%)
   ```

2. **Adjust thresholds**:
   ```yaml
   warning: 85.0  # Increase from 80.0
   ```

3. **Increase the re-notification interval** (reduces reminders for ongoing alerts):
   ```yaml
   threshold_renotify_interval: 7200  # 2 hours instead of 1
   ```

### Alerts Not Triggering

1. **Check threshold operator**:
   ```yaml
   # For available memory (alert when LOW):
   operator: "<"   # NOT ">"
   ```

2. **Verify numeric values**:
   - Ensure metric values are numeric
   - Check for unit mismatches (MB vs GB)

3. **Check if threshold is enabled**:
   ```yaml
   enabled: true  # NOT false
   ```

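4. **Review hysteresis settings**:
   - Hysteresis delays recovery rather than escalation, so a very high value can hold an alert at its current level and suppress further state changes
   - Try reducing or disabling it temporarily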

### Alert State Not Recovering

1. **Check recovery threshold calculation**:
   ```
   Threshold: 90
   Hysteresis: 0.1
   Recovery: 90 - (90 * 0.1) = 81

   Value must drop below 81 to recover
   ```

2. **Temporarily disable hysteresis**:
   ```yaml
   hysteresis: 0.0
   ```

3. **Monitor actual metric values**:
   ```bash
   # Check journal for actual values
   grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
   ```

## Advanced Topics

### Custom Notification Callbacks

The ThresholdChecker supports custom notification functions:

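```python
import logging

from hbd.threshold import ThresholdChecker

logger = logging.getLogger(__name__)

def custom_notifier(message):
    # NOTE: `pagerduty` and `metrics` are placeholders for your own
    # clients; they are not provided by Heartbeat.

    # Send to incident management system
    pagerduty.trigger(message)

    # Log to custom system
    logger.critical(message)

    # Update dashboard
    metrics.alert_count.inc()

# `config` is the parsed server configuration
checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)
```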

### Programmatic Access

Query alert states programmatically:

```python
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)

now = time.time()
for alert in active:
    print(f"{alert.metric_path}: {alert.level.name} for {now - alert.since:.0f}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```

### Integration with External Systems

Threshold violations can be integrated with:

- **PagerDuty**: Incident creation and escalation
- **OpsGenie**: On-call scheduling and routing
- **ServiceNow**: Ticket creation
- **Grafana**: Dashboard annotations
- **Elasticsearch**: Alert indexing and analysis

## Future Enhancements

Planned features:

1. **Composite thresholds**: Alert based on multiple metrics
   ```yaml
   composite:
     high_load_with_low_memory:
       conditions:
         - cpu_monitor.load_1min > 8.0
         - memory_monitor.available_mb < 500
   ```

2. **Time-based thresholds**: Different thresholds by time of day
   ```yaml
   schedule:
     business_hours:
       warning: 70.0
     off_hours:
       warning: 85.0
   ```

3. **Rate-of-change thresholds**: Alert on rapid changes
   ```yaml
   rate_of_change:
     metric: cpu_percent
     period: 300
     threshold: 30.0  # Alert if changes >30% in 5 minutes
   ```

4. **Alert grouping**: Combine related alerts
   ```yaml
   groups:
     disk_critical:
       metrics:
         - disk_monitor./.percent
         - disk_monitor./var.percent
       action: single_notification
   ```

5. **Maintenance windows**: Suppress alerts during planned maintenance
   ```yaml
   maintenance:
     - host: webserver01
       start: 2024-01-15T02:00:00Z
       end: 2024-01-15T04:00:00Z
   ```

## See Also

- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_thresholds_example.yaml`
- Test suite: `test_threshold.py`