
# Threshold Alerting System

## Overview

The Heartbeat Monitoring System includes a threshold alerting system that monitors plugin metrics and triggers notifications when values cross configured thresholds. This system is designed to:

- **Detect anomalies**: Automatically identify when system metrics leave safe operating ranges
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
- **Track state**: Maintain alert history and state transitions per host
- **Integrate seamlessly**: Work with existing notification infrastructure (email, Pushover, Signal, Mattermost)

## Architecture

### Components

1. **ThresholdChecker** (`hbd/threshold.py`)
   - Main threshold checking engine
   - Parses configuration
   - Evaluates metrics against thresholds
   - Triggers notifications on state changes

2. **ThresholdConfig**
   - Individual threshold configuration
   - Supports multiple comparison operators
   - Implements hysteresis logic

3. **AlertState**
   - Tracks current alert state per metric
   - Records state transitions
   - Manages notification timing

4. **Integration Points**
   - UDP handler: Checks thresholds when plugin data arrives
   - Host objects: Store alert states per host
   - Notification system: Sends alerts via configured channels

### Alert Levels

- **OK**: Metric is within normal range
- **WARNING**: Metric has exceeded warning threshold (first-level concern)
- **CRITICAL**: Metric has exceeded critical threshold (requires immediate attention)
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)

## Configuration

### Basic Structure

Thresholds are configured in the YAML configuration file under the `thresholds` section:

```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      display: "display format"
      enabled: true
```

### Configuration Parameters

#### Required Parameters

- **warning**: Warning threshold value (numeric)
- **critical**: Critical threshold value (numeric)

Note: At least one of `warning` or `critical` must be specified.

#### Optional Parameters

- **operator**: Comparison operator (default: `">"`)
  - `">"` - Greater than
  - `">="` - Greater than or equal
  - `"<"` - Less than
  - `"<="` - Less than or equal
  - `"=="` - Equal to
  - `"!="` - Not equal to

- **hysteresis**: Hysteresis as a fraction of the threshold value, used to prevent flapping (default: `0.1` = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when the value hovers near the threshold

- **display**: Format string (f-string-style `{placeholder}` syntax) for the threshold details shown in alert messages
  - Default: `"(threshold: {op_symbol} {threshold_value})"`

- **enabled**: Whether this threshold is active (default: `true`)
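
The parameter rules above can be sketched as a small validator. This is a minimal illustration, not the actual `ThresholdConfig` implementation: `ThresholdSpec` and `parse_threshold` are hypothetical names, and only the defaults and ranges stated above are taken from this document.

```python
from dataclasses import dataclass
from typing import Optional

VALID_OPERATORS = {">", ">=", "<", "<=", "==", "!="}
DEFAULT_DISPLAY = "(threshold: {op_symbol} {threshold_value})"


@dataclass
class ThresholdSpec:
    """Hypothetical container for one metric's threshold settings."""
    warning: Optional[float] = None
    critical: Optional[float] = None
    operator: str = ">"
    hysteresis: float = 0.1
    display: str = DEFAULT_DISPLAY
    enabled: bool = True


def parse_threshold(raw: dict) -> ThresholdSpec:
    """Validate one metric entry from the `thresholds` section."""
    spec = ThresholdSpec(**raw)  # unknown keys raise TypeError
    if spec.warning is None and spec.critical is None:
        raise ValueError("at least one of 'warning' or 'critical' is required")
    if spec.operator not in VALID_OPERATORS:
        raise ValueError(f"unsupported operator: {spec.operator!r}")
    if not 0.0 <= spec.hysteresis <= 1.0:
        raise ValueError("hysteresis must be between 0.0 and 1.0")
    return spec


spec = parse_threshold({"warning": 80.0, "critical": 90.0})
print(spec.operator, spec.hysteresis)
```

Rejecting bad entries at load time keeps a typo in `operator` or an out-of-range `hysteresis` from silently disabling an alert.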

### Comparison Operators

#### Greater Than (`>`, `>=`)

Used for metrics where **higher values are problematic**:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0      # Alert when CPU > 80%
    critical: 90.0     # Alert when CPU > 90%
    operator: ">"
```

Examples:
- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters

#### Less Than (`<`, `<=`)

Used for metrics where **lower values are problematic**:

```yaml
memory_monitor:
  available_mb:
    warning: 1000      # Alert when available memory < 1GB
    critical: 500      # Alert when available memory < 500MB
    operator: "<"
```

Examples:
- Available memory
- Free disk space
- Connection pool availability
- Battery level
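
As a rough sketch of how an operator string could be applied to a reading, the following maps each configured operator to a function from Python's standard `operator` module. The `evaluate` function is a hypothetical helper, not part of `hbd/threshold.py`, and it ignores hysteresis for simplicity.

```python
import operator

# Map the configured operator strings to Python comparison functions.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate(value, warning=None, critical=None, op=">"):
    """Return "CRITICAL", "WARNING", "OK", or "UNKNOWN" for one reading."""
    if not isinstance(value, (int, float)):
        return "UNKNOWN"  # non-numeric data cannot be evaluated
    compare = OPERATORS[op]
    # Check the more severe level first so it wins when both match.
    if critical is not None and compare(value, critical):
        return "CRITICAL"
    if warning is not None and compare(value, warning):
        return "WARNING"
    return "OK"


print(evaluate(85.0, warning=80.0, critical=90.0, op=">"))  # WARNING
print(evaluate(400, warning=1000, critical=500, op="<"))    # CRITICAL
```

Note that for `<` thresholds the critical value is the smaller number, which is why the critical comparison has to run before the warning comparison.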

## Hysteresis

Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.

### How It Works

When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), hysteresis is applied when the value improves:

```
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81

Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
```
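
The arithmetic above can be expressed as a short sketch. The function names are hypothetical, and the handling of `<` operators (recovery threshold above the alert threshold) is an assumption that mirrors the `>` case; this document only spells out the `>` case.

```python
def recovery_threshold(threshold, hysteresis, op=">"):
    """Value the metric must pass before an alert level is released."""
    margin = abs(threshold) * hysteresis
    if op in (">", ">="):
        return threshold - margin
    if op in ("<", "<="):
        return threshold + margin  # assumption: mirrored for "alert when low"
    return threshold  # "==" and "!=": no hysteresis band assumed


def holds_alert(value, threshold, hysteresis, op=">"):
    """True while an already-raised alert should stay raised."""
    limit = recovery_threshold(threshold, hysteresis, op)
    if op in (">", ">="):
        return value >= limit
    if op in ("<", "<="):
        return value <= limit
    return False


# Replay the values after the CRITICAL alert at 91 (threshold 90, hysteresis 0.1).
for value in (89, 85, 80):
    state = "held" if holds_alert(value, 90, 0.1) else "released"
    print(value, state)
```

With these inputs the alert is held at 89 and 85 and released at 80, matching the walkthrough above.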

### Configuration Recommendations

- **Stable metrics** (CPU, memory): 10-15% hysteresis
  ```yaml
  hysteresis: 0.1
  ```

- **Very stable metrics** (disk usage): 5% hysteresis
  ```yaml
  hysteresis: 0.05
  ```

- **Counter metrics** (errors, packets): 20% hysteresis
  ```yaml
  hysteresis: 0.2
  ```

- **Binary states** (exit codes): No hysteresis
  ```yaml
  hysteresis: 0.0
  ```

## Plugin-Specific Configuration

### CPU Monitor

```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1

  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15

  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"

  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```

### Memory Monitor

```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"

  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"

  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```

### Disk Monitor

Disk thresholds support **partition-specific configuration**:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05

      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"

    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"

    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"

      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```

### ZFS Monitor

ZFS pool health is checked automatically for every pool. A pool in any state
other than `ONLINE` (e.g. `DEGRADED`, `SUSPENDED`, `FAULTED`, `UNAVAIL`) raises
a **CRITICAL** alert by default — no configuration required.

The default threshold is equivalent to:

```yaml
zfs_monitor:
  pools:
    '*':
      status:
        warning: 1
        critical: 2
        operator: ">"
        hysteresis: 0.0
        display: "ZFS pool {pool_name} is {health}"
```

`'*'` matches every pool on the host. The notification message includes the pool
name and its current health string, e.g. `ZFS pool tank is DEGRADED`.

**Override for specific pools** — named pool entries take priority over `'*'`:

```yaml
zfs_monitor:
  pools:
    # Suppress health alerts for a scratch pool (not mission-critical)
    scratch:
      status:
        enabled: false

    # Capacity threshold for a specific pool
    tank:
      capacity:
        warning: 75.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
```
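
One way to picture the priority rule is a lookup that checks the named pool first and falls back to `'*'`. This is a sketch only: `resolve_pool_threshold` is a hypothetical helper, and it assumes the fallback happens per metric, so that `tank` above still inherits the wildcard `status` check.

```python
def resolve_pool_threshold(pools, pool_name, metric):
    """Pick the threshold entry for one pool metric, or None if unset."""
    named = pools.get(pool_name, {})
    if metric in named:
        return named[metric]  # named pool entry takes priority
    return pools.get("*", {}).get(metric)  # otherwise use the wildcard


pools = {
    "*": {"status": {"warning": 1, "critical": 2, "operator": ">"}},
    "scratch": {"status": {"enabled": False}},
    "tank": {"capacity": {"warning": 75.0, "critical": 90.0, "operator": ">"}},
}

print(resolve_pool_threshold(pools, "scratch", "status"))   # {'enabled': False}
print(resolve_pool_threshold(pools, "tank", "status"))      # wildcard entry
print(resolve_pool_threshold(pools, "backup", "capacity"))  # None
```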

**Alert state paths** follow the pattern `zfs_monitor.<pool_name>.status`,
so acknowledgements and silences target individual pools:

```
zfs_monitor.tank.status
zfs_monitor.backup.status
```

### Network Monitor

```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2

  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"

  dropout_total:
    warning: 50
    critical: 200
    operator: ">"

  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"

  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```

### Nagios Runner

The Nagios plugin runner reports exit codes that can be thresholded:

```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```

## Notification Behavior

### When Notifications Are Sent

Notifications are triggered on **state changes**:

1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
   ```
   WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
   ```

2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
   ```
   RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   ```
   REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
   ```

### Notification Frequency

- **State changes**: Immediate notification
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)

```yaml
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```
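
The rules above can be sketched as a small decision function. The `Level` enum values and the `notification_kind` helper are assumptions made for illustration; `UNKNOWN` is left out because this document does not say how it is ordered relative to the other levels.

```python
from enum import IntEnum


class Level(IntEnum):
    """Assumed ordering: a higher number means a more severe level."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def notification_kind(old, new, now, last_notification, renotify_interval=3600):
    """Return "escalation", "recovery", "reminder", or None (stay silent)."""
    if new > old:
        return "escalation"
    if new < old:
        return "recovery"
    # Level unchanged: remind only for ongoing alerts once the interval elapsed.
    if new != Level.OK and now - last_notification >= renotify_interval:
        return "reminder"
    return None


print(notification_kind(Level.OK, Level.WARNING, now=100, last_notification=0))
print(notification_kind(Level.CRITICAL, Level.CRITICAL, now=3600, last_notification=0))
print(notification_kind(Level.WARNING, Level.WARNING, now=1800, last_notification=0))
```

The three calls print `escalation`, `reminder`, and `None`: a state change notifies immediately, while an unchanged alert waits for the re-notification interval.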

### Notification Channels

The system supports centralized notification channel definitions, allowing different hosts to use different notification providers and credentials. This provides fine-grained control over who gets notified about what.

#### Supported Channel Types

- **Email** (via SMTP)
- **Pushover** (mobile notifications)
- **Signal** (via signal-cli)
- **Mattermost** (team chat webhooks)

#### Centralized Channel Configuration

Define notification channels once in the configuration file:

```yaml
notification_channels:
  # Signal notifications
  signal_ops:
    type: signal
    cli_path: /usr/local/bin/signal-cli
    user: +1234567890
    recipient: +1234567890

  # Email notifications
  email_ops:
    type: email
    recipients: [ops@example.com, alerts@example.com]
    sender: heartbeat@example.com
    smtp_server: smtp.example.com
    smtp_port: 587
    smtp_user: heartbeat@example.com
    smtp_password: your-smtp-password

  # Pushover notifications
  pushover_urgent:
    type: pushover
    token: your-pushover-app-token
    user: your-pushover-user-key

  # Mattermost notifications
  mattermost_devops:
    type: mattermost
    host: mattermost.example.com
    token: your-webhook-token
    channel: devops-alerts
    username: heartbeat-bot
    icon: https://example.com/heartbeat-icon.png

# Default channels for hosts that don't specify channels
default_notification_channels: [email_ops]
```

#### Per-Host Channel Assignment

Assign notification channels to specific hosts in the `hosts` section:

```yaml
hosts:
  # Critical server - multiple notification channels
  prod-web-01:
    threshold_config: high_sensitivity
    watch: true
    notification_channels: [signal_ops, pushover_urgent, email_ops]
    dyndns: false

  # Database server - ops team only
  prod-db-01:
    threshold_config: database
    watch: true
    notification_channels: [signal_ops, email_ops]
    dyndns: false

  # Development server - email only
  dev-server-01:
    threshold_config: low_sensitivity
    watch: false
    notification_channels: [email_ops]
    dyndns: false

  # Uses default_notification_channels if not specified
  test-server-01:
    threshold_config: default
    watch: false
    dyndns: false
```
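
The fallback to `default_notification_channels` could be resolved along these lines. `channels_for_host` is a hypothetical helper operating on the parsed YAML as a plain dict; silently skipping undefined channel names is an assumption, not documented behavior.

```python
def channels_for_host(config, host_name):
    """Return the channel definitions that apply to one host."""
    host = config.get("hosts", {}).get(host_name, {})
    names = host.get(
        "notification_channels",
        config.get("default_notification_channels", []),
    )
    defined = config.get("notification_channels", {})
    # Skip names that have no definition rather than failing (an assumption).
    return {name: defined[name] for name in names if name in defined}


config = {
    "notification_channels": {
        "signal_ops": {"type": "signal"},
        "email_ops": {"type": "email"},
    },
    "default_notification_channels": ["email_ops"],
    "hosts": {
        "prod-db-01": {"notification_channels": ["signal_ops", "email_ops"]},
        "test-server-01": {"watch": False},
    },
}

print(list(channels_for_host(config, "prod-db-01")))      # ['signal_ops', 'email_ops']
print(list(channels_for_host(config, "test-server-01")))  # ['email_ops']
```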

### Watched Hosts

Only hosts with `watch: true` in the `hosts` section will trigger notifications:

```yaml
hosts:
  webserver01:
    watch: true
    notification_channels: [email_ops]

  database01:
    watch: true
    notification_channels: [signal_ops, email_ops]

  mailserver:
    watch: true
    notification_channels: [pushover_urgent]
```

Hosts not marked for watching will still have thresholds checked and alert states tracked, but won't send notifications.

## Alert State Tracking

Each host maintains alert states for all monitored metrics:

```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```

Alert states are held in memory and pickled together with the host data.

### Alert State Information

Each `AlertState` tracks:

- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- **since**: Timestamp when current state started
- **last_value**: Most recent metric value
- **last_check**: Timestamp of last threshold check
- **notification_count**: Number of notifications sent for this alert
- **last_notification**: Timestamp of last notification
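
To make the fields concrete, here is an illustrative stand-in for `AlertState`. The class name `AlertStateSketch`, its methods, and the choice to reset `notification_count` on a level change are assumptions; the real class in `hbd/threshold.py` may differ.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AlertStateSketch:
    """Illustrative stand-in for AlertState with the fields listed above."""
    level: str = "OK"
    since: float = field(default_factory=time.time)
    last_value: Optional[float] = None
    last_check: Optional[float] = None
    notification_count: int = 0
    last_notification: Optional[float] = None

    def update(self, level, value, now):
        """Record a check; return True when the level changed."""
        changed = level != self.level
        if changed:
            self.level = level
            self.since = now             # a new state starts now
            self.notification_count = 0  # assumption: count restarts per alert
        self.last_value = value
        self.last_check = now
        return changed

    def mark_notified(self, now):
        """Record that a notification was sent for the current alert."""
        self.notification_count += 1
        self.last_notification = now


state = AlertStateSketch(since=1000.0)
changed = state.update("WARNING", 85.0, now=1060.0)
state.mark_notified(now=1060.0)
print(changed, state.level, state.since, state.notification_count)
```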

### Querying Alert States

Via HTTP API (future enhancement):

```http
GET /api/hosts/webserver01/alerts
```

Response:
```json
{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}
```
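
Because the endpoint is a future enhancement, the following is purely illustrative: a sketch of how a response with this shape could be assembled from per-metric states. The `alerts_response` helper is hypothetical, and plain dicts stand in for `AlertState` objects.

```python
def alerts_response(alert_states, now):
    """Build a dict shaped like the example response above."""
    active = []
    summary = {"ok": 0, "warning": 0, "critical": 0}
    for metric, state in alert_states.items():
        level = state["level"]
        key = level.lower()
        if key in summary:
            summary[key] += 1
        if level in ("WARNING", "CRITICAL"):
            active.append({
                "metric": metric,
                "level": level,
                "since": state["since"],
                "value": state["last_value"],
                "duration": int(now - state["since"]),
            })
    return {"active_alerts": active, "summary": summary}


states = {
    "cpu_monitor.cpu_percent": {"level": "WARNING", "since": 1234567890, "last_value": 85.0},
    "memory_monitor.percent": {"level": "OK", "since": 1234567800, "last_value": 40.0},
}
response = alerts_response(states, now=1234568190)
print(response["active_alerts"][0]["duration"])  # 300
print(response["summary"])
```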

## Testing

A comprehensive test suite is provided in `test_threshold.py`:

```bash
python test_threshold.py
```

Tests cover:
- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries

## Best Practices

### 1. Start Conservative

Begin with higher thresholds to avoid alert fatigue:

```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0    # Start higher
    critical: 95.0   # Very high for critical
```

Adjust downward based on observed behavior.

### 2. Consider Workload Patterns

Different systems have different normal ranges:

**Web servers** (bursty traffic):
```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```

**Database servers** (steady load):
```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1   # Lower hysteresis for steady metrics
```

### 3. Use Appropriate Operators

Match the operator to the metric:

| Metric Type | Example | Operator | Reason |
|-------------|---------|----------|--------|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when the count is high |
| Health checks | Nagios exit code | `>=` | Map to standard codes |

### 4. Align with Monitoring Intervals

Thresholds are checked when plugin data arrives, so each plugin's collection interval determines how often its thresholds are evaluated:

```yaml
plugins:
  cpu_monitor:
    interval: 300    # Check every 5 minutes

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes
```

### 5. Test Before Production

1. **Start with disabled thresholds**:
   ```yaml
   enabled: false
   ```

2. **Observe metric ranges** over a week

3. **Set thresholds** based on observed data

4. **Enable gradually**:
   ```yaml
   enabled: true
   ```

5. **Monitor for false positives**

### 6. Document Baseline Values

Keep a record of normal operating ranges:

```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month

cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```

### 7. Layer Alerts

Use WARNING for early notification, CRITICAL for immediate action:

```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0    # Early warning: "check in next few days"
        critical: 90.0   # Critical: "act now before outage"
```

## Troubleshooting

### No Notifications Being Sent

1. **Check if host is watched**:
   ```yaml
   hosts:
     your-host-name:
       watch: true
   ```

2. **Verify notification configuration**:
   ```yaml
   notification_channels:
     email_ops:
       type: email
       recipients: [admin@example.com]
       smtp_server: smtp.example.com
   ```

3. **Check threshold configuration**:
   ```bash
   # Look for parsing errors in server logs
   grep "threshold" /var/log/heartbeat/hbd.log
   ```

4. **Verify metric names**:
   - Metric names must match exactly (case-sensitive)
   - Check journal or logs for actual metric names

### Too Many Alerts (Flapping)

1. **Increase hysteresis**:
   ```yaml
   hysteresis: 0.2  # Increase from 0.1 to 0.2 (20%)
   ```

2. **Adjust thresholds**:
   ```yaml
   warning: 85.0  # Increase from 80.0
   ```

3. **Increase renotification interval**:
   ```yaml
   threshold_renotify_interval: 7200  # 2 hours instead of 1
   ```

### Alerts Not Triggering

1. **Check threshold operator**:
   ```yaml
   # For available memory (alert when LOW):
   operator: "<"   # NOT ">"
   ```

2. **Verify numeric values**:
   - Ensure metric values are numeric
   - Check for unit mismatches (MB vs GB)

3. **Check if threshold is enabled**:
   ```yaml
   enabled: true  # NOT false
   ```

4. **Review hysteresis settings**:
   - Very high hysteresis may prevent state changes
   - Try reducing or disabling temporarily

### Alert State Not Recovering

1. **Check recovery threshold calculation**:
   ```
   Threshold: 90
   Hysteresis: 0.1
   Recovery: 90 - (90 * 0.1) = 81

   Value must drop below 81 to recover
   ```

2. **Temporarily disable hysteresis**:
   ```yaml
   hysteresis: 0.0
   ```

3. **Monitor actual metric values**:
   ```bash
   # Check journal for actual values
   grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
   ```

## Advanced Topics

### Custom Notification Callbacks

The ThresholdChecker supports custom notification functions:

```python
# `pagerduty`, `logger`, and `metrics` stand in for clients you provide.
def custom_notifier(message):
    # Send to incident management system
    pagerduty.trigger(message)

    # Log to custom system
    logger.critical(message)

    # Update dashboard
    metrics.alert_count.inc()

checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)
```

### Programmatic Access

Query alert states programmatically:

```python
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)

for alert in active:
    print(f"{alert.metric_path}: {alert.level.name} for {time.time() - alert.since}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```

### Integration with External Systems

Threshold violations can be integrated with:

- **PagerDuty**: Incident creation and escalation
- **OpsGenie**: On-call scheduling and routing
- **ServiceNow**: Ticket creation
- **Grafana**: Dashboard annotations
- **Elasticsearch**: Alert indexing and analysis

## Future Enhancements

Planned features:

1. **Composite thresholds**: Alert based on multiple metrics
   ```yaml
   composite:
     high_load_with_low_memory:
       conditions:
         - cpu_monitor.load_1min > 8.0
         - memory_monitor.available_mb < 500
   ```

2. **Time-based thresholds**: Different thresholds by time of day
   ```yaml
   schedule:
     business_hours:
       warning: 70.0
     off_hours:
       warning: 85.0
   ```

3. **Rate-of-change thresholds**: Alert on rapid changes
   ```yaml
   rate_of_change:
     metric: cpu_percent
     period: 300
     threshold: 30.0  # Alert if changes >30% in 5 minutes
   ```

4. **Alert grouping**: Combine related alerts
   ```yaml
   groups:
     disk_critical:
       metrics:
         - disk_monitor./.percent
         - disk_monitor./var.percent
       action: single_notification
   ```

5. **Maintenance windows**: Suppress alerts during planned maintenance
   ```yaml
   maintenance:
     - host: webserver01
       start: 2024-01-15T02:00:00Z
       end: 2024-01-15T04:00:00Z
   ```

## See Also

- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_thresholds_example.yaml`
- Test suite: `test_threshold.py`

## Multi-Threshold Configuration

The system supports multiple named threshold configurations, with per-host mapping and composable layering.

### Overview

The multi-threshold feature allows you to:
- Define multiple named threshold configurations
- Assign one or more configurations to each host
- Compose configurations by layering — each named config's overrides are applied in order on top of the defaults
- Use different sensitivity levels for different environments

### Configuration Structure

Named configurations are defined under `threshold_configs`. Each host selects which ones to use via `threshold_config` in the `hosts` section (a string for a single config, or a list to layer multiple):

```yaml
# Optional: set the default configuration name (defaults to "default")
default_threshold_config: "default"

threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

  high_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 60.0
          critical: 75.0

  low_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0
          critical: 95.0

hosts:
  prod-web-01:
    threshold_config: high_sensitivity   # single config

  dev-server-01:
    threshold_config: low_sensitivity

  # Hosts with no threshold_config use default_threshold_config
```

### Composable Configurations (list form)

`threshold_config` can be a list. Configs are applied **left to right**: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.

```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}
      disk_monitor:
        partitions:
          /:
            percent: {warning: 80, critical: 90}

  # Tighter CPU limits for busy servers
  high_cpu_load:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

  # Tighter disk limits for data-heavy servers
  busy_disk:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent: {warning: 70, critical: 85}

hosts:
  # Gets default thresholds only
  web-01:
    threshold_config: default

  # Gets tighter CPU limits, default memory and disk
  build-server:
    threshold_config: high_cpu_load

  # Layers both: tighter CPU AND tighter disk, default memory
  db-01:
    threshold_config: [high_cpu_load, busy_disk]

  # Three layers: busy_disk overrides high_cpu_load if they conflict
  storage-01:
    threshold_config: [default, high_cpu_load, busy_disk]
```

**How layering works:**

Starting from the `default` thresholds:

| Layer | Applied config | Effect |
|-------|---------------|--------|
| Base  | `default` | all default thresholds |
| +1    | `high_cpu_load` | cpu_percent overridden to 60/75 |
| +2    | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |

Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.
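
The layering can be sketched as a recursive dictionary merge applied left to right. `deep_merge` and `layer_thresholds` are hypothetical helpers. This sketch merges down to individual fields (for example, an override that sets only `warning` would keep the base `critical`), which is an assumption; the actual implementation may replace whole metric entries instead.

```python
import copy


def deep_merge(base, override):
    """Return a copy of base with override's values layered on top."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def layer_thresholds(threshold_configs, names, default_name="default"):
    """Apply the named configs left to right on top of the default config."""
    result = copy.deepcopy(threshold_configs[default_name]["thresholds"])
    for name in names:
        result = deep_merge(result, threshold_configs[name]["thresholds"])
    return result


threshold_configs = {
    "default": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
        "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
        "disk_monitor": {"partitions": {"/": {"percent": {"warning": 80, "critical": 90}}}},
    }},
    "high_cpu_load": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}},
    }},
    "busy_disk": {"thresholds": {
        "disk_monitor": {"partitions": {"/": {"percent": {"warning": 70, "critical": 85}}}},
    }},
}

db_01 = layer_thresholds(threshold_configs, ["high_cpu_load", "busy_disk"])
print(db_01["cpu_monitor"]["cpu_percent"])                  # {'warning': 60, 'critical': 75}
print(db_01["memory_monitor"]["memory_percent"])            # {'warning': 85, 'critical': 95}
print(db_01["disk_monitor"]["partitions"]["/"]["percent"])  # {'warning': 70, 'critical': 85}
```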

### Use Cases

#### 1. Environment-Based Thresholds

Different thresholds for production vs. development:

```yaml
threshold_configs:
  production:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0   # Alert earlier in production
          critical: 85.0

  development:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0   # More relaxed for dev
          critical: 98.0

hosts:
  prod-web-01:
    threshold_config: production
  prod-web-02:
    threshold_config: production
  dev-web-01:
    threshold_config: development
  dev-web-02:
    threshold_config: development
```

#### 2. Server Role-Based Thresholds

Different thresholds based on server function:

```yaml
threshold_configs:
  webserver:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

  database:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0
          critical: 85.0
      memory_monitor:
        memory_percent:
          warning: 90.0   # Databases can use high memory
          critical: 97.0
      disk_monitor:
        partitions:
          /var/lib/mysql:
            percent:
              warning: 75.0
              critical: 85.0

  cache:
    thresholds:
      memory_monitor:
        memory_percent:
          warning: 95.0   # Redis/Memcached can use very high memory
          critical: 99.0

hosts:
  web-01:
    threshold_config: webserver
  web-02:
    threshold_config: webserver
  db-01:
    threshold_config: database
  db-02:
    threshold_config: database
  redis-01:
    threshold_config: cache
  memcached-01:
    threshold_config: cache
```

#### 3. Sensitivity Levels

Different sensitivity for critical vs. non-critical systems:

```yaml
threshold_configs:
  critical:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 70.0
              critical: 80.0
              hysteresis: 0.15

  standard:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 85.0
              critical: 95.0
              hysteresis: 0.1

  relaxed:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 90.0
              critical: 98.0
              hysteresis: 0.05

hosts:
  payment-gateway:
    threshold_config: critical
  auth-server:
    threshold_config: critical
  web-01:
    threshold_config: standard
  web-02:
    threshold_config: standard
  test-server:
    threshold_config: relaxed
```

#### 4. Composable Profiles

Build host-specific thresholds by combining small, focused configs:

```yaml
threshold_configs:
  # Baseline — everything at default levels
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}

  # Overlay: tighter CPU only
  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

  # Overlay: tighter memory only
  tight_memory:
    thresholds:
      memory_monitor:
        memory_percent: {warning: 70, critical: 85}

  # Overlay: extra disk partition for database servers
  db_disk:
    thresholds:
      disk_monitor:
        partitions:
          /var/lib/postgresql:
            percent: {warning: 75, critical: 88}

hosts:
  # Plain web server
  web-01:
    threshold_config: default

  # Build server: tight CPU, default memory and disk
  build-01:
    threshold_config: tight_cpu

  # Database: tight CPU + tight memory + extra disk partition
  db-01:
    threshold_config: [tight_cpu, tight_memory, db_disk]

  # Replica database: tight memory + extra disk, normal CPU
  db-02:
    threshold_config: [tight_memory, db_disk]
```

### Configuration Priority

1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
2. **Host `threshold_config` (string)**: Use that single named config directly
3. **`host_threshold_mapping`** (legacy): Treated like a string `threshold_config`; the list form is not supported
4. **`default_threshold_config`**: Used for hosts with no mapping
5. **First alphabetically**: If the default config is not found, use the first config alphabetically
6. **Legacy `thresholds` section**: Used when `threshold_configs` is absent entirely
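
The priority order above could be resolved roughly as follows. `resolve_config_names` is a hypothetical helper working on the parsed configuration as a plain dict; it returns an empty list when `threshold_configs` is absent, leaving the legacy `thresholds` fallback to the caller.

```python
def resolve_config_names(config, host_name):
    """Return the list of named configs to apply for a host, in order."""
    configs = config.get("threshold_configs")
    if not configs:
        return []  # caller falls back to the legacy flat `thresholds` section
    host = config.get("hosts", {}).get(host_name, {})
    selected = host.get("threshold_config")
    if selected is None:
        # Legacy per-host mapping (string only).
        selected = config.get("host_threshold_mapping", {}).get(host_name)
    if selected is None:
        default = config.get("default_threshold_config", "default")
        # If the default is missing, use the first config alphabetically.
        selected = default if default in configs else sorted(configs)[0]
    return [selected] if isinstance(selected, str) else list(selected)


config = {
    "threshold_configs": {"default": {}, "high_cpu_load": {}, "busy_disk": {}},
    "hosts": {
        "build-server": {"threshold_config": "high_cpu_load"},
        "db-01": {"threshold_config": ["high_cpu_load", "busy_disk"]},
        "web-01": {},
    },
    "host_threshold_mapping": {"legacy-01": "busy_disk"},
}

for name in ("build-server", "db-01", "legacy-01", "web-01"):
    print(name, resolve_config_names(config, name))
```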

### Backward Compatibility

The legacy `host_threshold_mapping` top-level key and the flat `thresholds` section are still fully supported:

```yaml
# Still works — equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
host_threshold_mapping:
  prod-web-01: high_sensitivity

# Still works — equivalent to threshold_configs: {default: {thresholds: ...}}
thresholds:
  cpu_monitor:
    cpu_percent: {warning: 80, critical: 90}
```
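
The two equivalences in the comments above can be sketched as a normalization step that rewrites legacy keys into the current layout. `normalize_legacy` is a hypothetical helper, and letting an explicit host-level `threshold_config` win over `host_threshold_mapping` is an assumption based on the priority order in this document.

```python
def normalize_legacy(config):
    """Return a copy of config with legacy keys mapped to the current layout."""
    result = dict(config)  # shallow copy so the caller's dict is untouched
    if "thresholds" in result and "threshold_configs" not in result:
        result["threshold_configs"] = {
            "default": {"thresholds": result.pop("thresholds")}
        }
    hosts = {name: dict(entry) for name, entry in result.get("hosts", {}).items()}
    for host, name in result.pop("host_threshold_mapping", {}).items():
        # Keep an explicit host-level setting if one already exists.
        hosts.setdefault(host, {}).setdefault("threshold_config", name)
    if hosts:
        result["hosts"] = hosts
    return result


legacy = {
    "host_threshold_mapping": {"prod-web-01": "high_sensitivity"},
    "thresholds": {"cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}}},
}
current = normalize_legacy(legacy)
print(current["hosts"])
print(list(current["threshold_configs"]))
```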