# Threshold Alerting System
## Overview
The Heartbeat Monitoring System includes a threshold alerting system that monitors plugin metrics and triggers notifications when values cross configured thresholds. This system is designed to:
- **Detect anomalies**: Automatically identify when system metrics leave safe operating ranges
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
- **Track state**: Maintain alert history and state transitions per host
- **Integrate seamlessly**: Work with existing notification infrastructure (email, pushover, etc.)
## Architecture
### Components
1. **ThresholdChecker** (`hbd/threshold.py`)
   - Main threshold checking engine
   - Parses configuration
   - Evaluates metrics against thresholds
   - Triggers notifications on state changes
2. **ThresholdConfig**
   - Individual threshold configuration
   - Supports multiple comparison operators
   - Implements hysteresis logic
3. **AlertState**
   - Tracks current alert state per metric
   - Records state transitions
   - Manages notification timing
4. **Integration Points**
   - UDP handler: Checks thresholds when plugin data arrives
   - Host objects: Store alert states per host
   - Notification system: Sends alerts via configured channels
### Alert Levels
- **OK**: Metric is within normal range
- **WARNING**: Metric has crossed the warning threshold (first-level concern)
- **CRITICAL**: Metric has crossed the critical threshold (requires immediate attention)
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)
## Configuration
### Basic Structure
Thresholds are configured in the YAML configuration file under the `thresholds` section:
```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      display: "display format"
      enabled: true
```
### Configuration Parameters
#### Required Parameters
- **warning**: Warning threshold value (numeric)
- **critical**: Critical threshold value (numeric)
Note: At least one of `warning` or `critical` must be specified.
#### Optional Parameters
- **operator**: Comparison operator (default: `">"`)
  - `">"` - Greater than
  - `">="` - Greater than or equal
  - `"<"` - Less than
  - `"<="` - Less than or equal
  - `"=="` - Equal to
  - `"!="` - Not equal to
- **hysteresis**: Hysteresis, expressed as a fraction of the threshold, to prevent flapping (default: `0.1` = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when the value hovers near the threshold
- **display**: Format string with `{placeholder}` fields, used as the display format for alert messages
  - Default: `"(threshold: {op_symbol} {threshold_value})"`
- **enabled**: Whether this threshold is active (default: `true`)
### Comparison Operators
#### Greater Than (`>`, `>=`)
Used for metrics where **higher values are problematic**:
```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0   # Alert when CPU > 80%
    critical: 90.0  # Alert when CPU > 90%
    operator: ">"
```
Examples:
- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters
#### Less Than (`<`, `<=`)
Used for metrics where **lower values are problematic**:
```yaml
memory_monitor:
  available_mb:
    warning: 1000  # Alert when available memory < 1GB
    critical: 500  # Alert when available memory < 500MB
    operator: "<"
```
Examples:
- Available memory
- Free disk space
- Connection pool availability
- Battery level
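The operator semantics described above can be sketched in Python. This is a minimal illustration, not the code in `hbd/threshold.py`: the names `COMPARATORS`, `breaches`, and `classify` are hypothetical, and hysteresis is deliberately left out.

```python
import operator

# Hypothetical lookup table: maps the operator strings accepted in the
# YAML config to the matching functions from Python's `operator` module.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def breaches(value, threshold, op=">"):
    """Return True when `value <op> threshold` holds, i.e. the threshold is violated."""
    return COMPARATORS[op](value, threshold)


def classify(value, warning=None, critical=None, op=">"):
    """Return "CRITICAL", "WARNING", or "OK" for a value (hysteresis ignored)."""
    # Check the critical threshold first so the most severe level wins.
    if critical is not None and breaches(value, critical, op):
        return "CRITICAL"
    if warning is not None and breaches(value, warning, op):
        return "WARNING"
    return "OK"
```

With `op="<"` the critical value is numerically lower than the warning value, which is why the `available_mb` example uses `warning: 1000` and `critical: 500`.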
## Hysteresis
Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.
### How It Works
When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), hysteresis is applied when the value improves:
```
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81
Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
```
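The arithmetic above can be sketched in Python. This is a minimal illustration, not the code in `hbd/threshold.py`: the function names are hypothetical, and the mirrored handling of `<` operators (recovery above the threshold) is an assumption that this document does not spell out.

```python
def recovery_threshold(threshold, hysteresis=0.1, op=">"):
    """Value the metric must pass before an active alert level may clear."""
    margin = abs(threshold) * hysteresis
    if op in (">", ">="):
        return threshold - margin  # must drop below this to recover
    if op in ("<", "<="):
        return threshold + margin  # assumption: mirrored for "alert when low"
    return threshold               # "==" / "!=": no hysteresis band


def still_alerting(value, threshold, hysteresis=0.1, op=">"):
    """True while an already-raised alert has not yet passed the recovery threshold."""
    recover_at = recovery_threshold(threshold, hysteresis, op)
    if op in (">", ">="):
        return value >= recover_at
    if op in ("<", "<="):
        return value <= recover_at
    return False
```

For the example above, `recovery_threshold(90, 0.1)` gives 81, so values of 89 and 85 keep the alert active while 80 lets it be re-evaluated.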
### Configuration Recommendations
- **Stable metrics** (CPU, memory): 10-15% hysteresis
```yaml
hysteresis: 0.1
```
- **Very stable metrics** (disk usage): 5% hysteresis
```yaml
hysteresis: 0.05
```
- **Counter metrics** (errors, packets): 20% hysteresis
```yaml
hysteresis: 0.2
```
- **Binary states** (exit codes): No hysteresis
```yaml
hysteresis: 0.0
```
## Plugin-Specific Configuration
### CPU Monitor
```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1
  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15
  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"
  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```
### Memory Monitor
```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"
  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"
  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```
### Disk Monitor
Disk thresholds support **partition-specific configuration**:
```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"
    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"
    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```
### ZFS Monitor
ZFS pool health is checked automatically for every pool. A pool in any state
other than `ONLINE` (e.g. `DEGRADED`, `SUSPENDED`, `FAULTED`, `UNAVAIL`) raises
a **CRITICAL** alert by default, with no configuration required.
The default threshold is equivalent to:
```yaml
zfs_monitor:
  pools:
    '*':
      status:
        warning: 1
        critical: 2
        operator: ">"
        hysteresis: 0.0
        display: "ZFS pool {pool_name} is {health}"
```
`'*'` matches every pool on the host. The notification message includes the pool
name and its current health string, e.g. `ZFS pool tank is DEGRADED`.
**Override for specific pools**: named pool entries take priority over `'*'`:
```yaml
zfs_monitor:
  pools:
    # Suppress health alerts for a scratch pool (not mission-critical)
    scratch:
      status:
        enabled: false
    # Capacity threshold for a specific pool
    tank:
      capacity:
        warning: 75.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
```
**Alert state paths** follow the pattern `zfs_monitor.<pool_name>.status`,
so acknowledgements and silences target individual pools:
```
zfs_monitor.tank.status
zfs_monitor.backup.status
```
### Network Monitor
```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2
  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"
  dropout_total:
    warning: 50
    critical: 200
    operator: ">"
  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"
  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```
### Nagios Runner
The Nagios plugin runner reports exit codes that can be thresholded:
```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```
## Notification Behavior
### When Notifications Are Sent
Notifications are triggered on **state changes**:
1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
```
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
```
2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
```
RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
```
3. **Re-notifications**: Periodic reminders for ongoing alerts
```
REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
```
### Notification Frequency
- **State changes**: Immediate notification
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)
```yaml
threshold_renotify_interval: 3600 # Re-notify every hour for ongoing alerts
```
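The notification rules above can be sketched as a small decision function. This is an illustration under stated assumptions rather than the actual `ThresholdChecker` logic: `notification_kind` and `LEVELS` are hypothetical names, timestamps are plain seconds, and the `UNKNOWN` level is left out because its notification behavior is not described here.

```python
# Severity ranking used only by this sketch.
LEVELS = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def notification_kind(old_level, new_level, last_notification, now,
                      renotify_interval=3600):
    """Return "ESCALATION", "RECOVERY", "REMINDER", or None (stay quiet)."""
    old_rank, new_rank = LEVELS[old_level], LEVELS[new_level]
    if new_rank > old_rank:
        return "ESCALATION"  # e.g. OK -> WARNING, WARNING -> CRITICAL
    if new_rank < old_rank:
        return "RECOVERY"    # e.g. CRITICAL -> WARNING, WARNING -> OK
    # Same level: remind only for ongoing alerts, and only after
    # renotify_interval seconds have passed since the last notification.
    if new_rank > 0 and now - last_notification >= renotify_interval:
        return "REMINDER"
    return None
```

State changes return a result immediately, while an unchanged `OK` state never produces a reminder.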
### Notification Channels
The system supports centralized notification channel definitions, allowing different hosts to use different notification providers and credentials. This provides fine-grained control over who gets notified about what.
#### Supported Channel Types
- **Email** (via SMTP)
- **Pushover** (mobile notifications)
- **Signal** (via signal-cli)
- **Mattermost** (team chat webhooks)
#### Centralized Channel Configuration
Define notification channels once in the configuration file:
```yaml
notification_channels:
  # Signal notifications
  signal_ops:
    type: signal
    cli_path: /usr/local/bin/signal-cli
    user: "+1234567890"       # quoted so YAML keeps the leading "+"
    recipient: "+1234567890"
  # Email notifications
  email_ops:
    type: email
    recipients: [ops@example.com, alerts@example.com]
    sender: heartbeat@example.com
    smtp_server: smtp.example.com
    smtp_port: 587
    smtp_user: heartbeat@example.com
    smtp_password: your-smtp-password
  # Pushover notifications
  pushover_urgent:
    type: pushover
    token: your-pushover-app-token
    user: your-pushover-user-key
  # Mattermost notifications
  mattermost_devops:
    type: mattermost
    host: mattermost.example.com
    token: your-webhook-token
    channel: devops-alerts
    username: heartbeat-bot
    icon: https://example.com/heartbeat-icon.png
# Default channels for hosts that don't specify channels
default_notification_channels: [email_ops]
```
#### Per-Host Channel Assignment
Assign notification channels to specific hosts in the `hosts` section:
```yaml
hosts:
  # Critical server - multiple notification channels
  prod-web-01:
    threshold_config: high_sensitivity
    watch: true
    notification_channels: [signal_ops, pushover_urgent, email_ops]
    dyndns: false
  # Database server - ops team only
  prod-db-01:
    threshold_config: database
    watch: true
    notification_channels: [signal_ops, email_ops]
    dyndns: false
  # Development server - email only
  dev-server-01:
    threshold_config: low_sensitivity
    watch: false
    notification_channels: [email_ops]
    dyndns: false
  # Uses default_notification_channels if not specified
  test-server-01:
    threshold_config: default
    watch: false
    dyndns: false
```
### Watched Hosts
Only hosts with `watch: true` in the `hosts` section will trigger notifications:
```yaml
hosts:
  webserver01:
    watch: true
    notification_channels: [email_ops]
  database01:
    watch: true
    notification_channels: [signal_ops, email_ops]
  mailserver:
    watch: true
    notification_channels: [pushover_urgent]
```
Hosts not marked for watching will still have thresholds checked and alert states tracked, but won't send notifications.
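The interaction between `watch`, per-host `notification_channels`, and `default_notification_channels` can be sketched as follows. The helper `channels_for_host` is hypothetical, `config` is assumed to be the parsed YAML as a plain dict, and silently skipping undefined channel names is an assumption of this sketch, not documented behavior.

```python
def channels_for_host(hostname, config):
    """Return the channel names that should receive alerts for a host."""
    host_entry = config.get("hosts", {}).get(hostname) or {}
    # Unwatched hosts still have thresholds checked, but notify nobody.
    if not host_entry.get("watch", False):
        return []
    names = (host_entry.get("notification_channels")
             or config.get("default_notification_channels", []))
    defined = config.get("notification_channels", {})
    # Assumption: names without a matching channel definition are skipped.
    return [name for name in names if name in defined]
```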
## Alert State Tracking
Each host maintains alert states for all monitored metrics:
```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```
Alert states persist in memory and are saved with host data (pickle).
### Alert State Information
Each `AlertState` tracks:
- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- **since**: Timestamp when current state started
- **last_value**: Most recent metric value
- **last_check**: Timestamp of last threshold check
- **notification_count**: Number of notifications sent for this alert
- **last_notification**: Timestamp of last notification
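The fields listed above could be modelled roughly as the dataclass below. This is a sketch, not the real `AlertState` class: it is named `AlertStateSketch` to avoid confusion, the `update` method is hypothetical, and resetting `notification_count` on a level change is an assumption.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlertLevel(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


@dataclass
class AlertStateSketch:
    level: AlertLevel = AlertLevel.OK
    since: float = 0.0                  # when the current level started
    last_value: Optional[float] = None  # most recent metric value
    last_check: float = 0.0             # time of the last threshold check
    notification_count: int = 0         # notifications sent for this alert
    last_notification: float = 0.0      # time of the last notification

    def update(self, level, value, now=None):
        """Record a check; return True if the alert level changed."""
        now = time.time() if now is None else now
        changed = level != self.level
        if changed:
            self.level = level
            self.since = now
            self.notification_count = 0  # assumption: counter restarts per state
        self.last_value = value
        self.last_check = now
        return changed
```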
### Querying Alert States
Via HTTP API (future enhancement):
```bash
GET /api/hosts/webserver01/alerts
```
Response:
```json
{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}
```
## Testing
A comprehensive test suite is provided in `test_threshold.py`:
```bash
python test_threshold.py
```
Tests cover:
- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries
## Best Practices
### 1. Start Conservative
Begin with higher thresholds to avoid alert fatigue:
```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0   # Start higher
    critical: 95.0  # Very high for critical
```
Adjust downward based on observed behavior.
### 2. Consider Workload Patterns
Different systems have different normal ranges:
**Web servers** (bursty traffic):
```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```
**Database servers** (steady load):
```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1  # Lower hysteresis for steady metrics
```
### 3. Use Appropriate Operators
Match the operator to the metric:
| Metric Type | Example | Operator | Reason |
|-------------|---------|----------|--------|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when increasing |
| Health checks | Nagios exit code | `>=` | Map to standard codes |
### 4. Align with Monitoring Intervals
Ensure threshold checks align with plugin collection intervals:
```yaml
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes
```
### 5. Test Before Production
1. **Start with disabled thresholds**:
```yaml
enabled: false
```
2. **Observe metric ranges** over a week
3. **Set thresholds** based on observed data
4. **Enable gradually**:
```yaml
enabled: true
```
5. **Monitor for false positives**
### 6. Document Baseline Values
Keep a record of normal operating ranges:
```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month
cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```
### 7. Layer Alerts
Use WARNING for early notification, CRITICAL for immediate action:
```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0   # Early warning: "check in next few days"
        critical: 90.0  # Critical: "act now before outage"
```
## Troubleshooting
### No Notifications Being Sent
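1. **Check if host is watched** (`watch: true` in the `hosts` section):
```yaml
hosts:
  your-host-name:
    watch: true
```
2. **Verify notification configuration** (the host's channels must be defined under `notification_channels`):
```yaml
notification_channels:
  email_ops:
    type: email
    recipients: [admin@example.com]
    smtp_server: smtp.example.com
```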
3. **Check threshold configuration**:
```bash
# Look for parsing errors in server logs
grep "threshold" /var/log/heartbeat/hbd.log
```
4. **Verify metric names**:
- Metric names must match exactly (case-sensitive)
- Check journal or logs for actual metric names
### Too Many Alerts (Flapping)
1. **Increase hysteresis**:
```yaml
hysteresis: 0.2 # Increase from 0.1 to 0.2 (20%)
```
2. **Adjust thresholds**:
```yaml
warning: 85.0 # Increase from 80.0
```
3. **Increase renotification interval**:
```yaml
threshold_renotify_interval: 7200 # 2 hours instead of 1
```
### Alerts Not Triggering
1. **Check threshold operator**:
```yaml
# For available memory (alert when LOW):
operator: "<" # NOT ">"
```
2. **Verify numeric values**:
- Ensure metric values are numeric
- Check for unit mismatches (MB vs GB)
3. **Check if threshold is enabled**:
```yaml
enabled: true # NOT false
```
4. **Review hysteresis settings**:
- Very high hysteresis may prevent state changes
- Try reducing or disabling temporarily
### Alert State Not Recovering
1. **Check recovery threshold calculation**:
```
Threshold: 90
Hysteresis: 0.1
Recovery: 90 - (90 * 0.1) = 81
Value must drop below 81 to recover
```
2. **Temporarily disable hysteresis**:
```yaml
hysteresis: 0.0
```
3. **Monitor actual metric values**:
```bash
# Check journal for actual values
grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
```
## Advanced Topics
### Custom Notification Callbacks
The ThresholdChecker supports custom notification functions:
```python
# pagerduty, logger, and metrics are placeholders for your own clients.
def custom_notifier(message):
    # Send to incident management system
    pagerduty.trigger(message)
    # Log to custom system
    logger.critical(message)
    # Update dashboard
    metrics.alert_count.inc()


checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier,
)
```
### Programmatic Access
Query alert states programmatically:
```python
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)
for alert in active:
    duration = time.time() - alert.since
    print(f"{alert.metric_path}: {alert.level.name} for {duration:.0f}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```
### Integration with External Systems
Threshold violations can be integrated with:
- **PagerDuty**: Incident creation and escalation
- **OpsGenie**: On-call scheduling and routing
- **ServiceNow**: Ticket creation
- **Grafana**: Dashboard annotations
- **Elasticsearch**: Alert indexing and analysis
## Future Enhancements
Planned features:
1. **Composite thresholds**: Alert based on multiple metrics
```yaml
composite:
  high_load_with_low_memory:
    conditions:
      - cpu_monitor.load_1min > 8.0
      - memory_monitor.available_mb < 500
```
2. **Time-based thresholds**: Different thresholds by time of day
```yaml
schedule:
  business_hours:
    warning: 70.0
  off_hours:
    warning: 85.0
```
3. **Rate-of-change thresholds**: Alert on rapid changes
```yaml
rate_of_change:
  metric: cpu_percent
  period: 300
  threshold: 30.0  # Alert if changes >30% in 5 minutes
```
4. **Alert grouping**: Combine related alerts
```yaml
groups:
  disk_critical:
    metrics:
      - disk_monitor./.percent
      - disk_monitor./var.percent
    action: single_notification
```
5. **Maintenance windows**: Suppress alerts during planned maintenance
```yaml
maintenance:
  - host: webserver01
    start: 2024-01-15T02:00:00Z
    end: 2024-01-15T04:00:00Z
```
## See Also
- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_thresholds_example.yaml`
- Test suite: `test_threshold.py`
## Multi-Threshold Configuration
The system supports multiple named threshold configurations, with per-host mapping and composable layering.
### Overview
The multi-threshold feature allows you to:
- Define multiple named threshold configurations
- Assign one or more configurations to each host
- Compose configurations by layering: each named config's overrides are applied in order on top of the defaults
- Use different sensitivity levels for different environments
### Configuration Structure
Named configurations are defined under `threshold_configs`. Each host selects which ones to use via `threshold_config` in the `hosts` section (a string for a single config, or a list to layer multiple):
```yaml
# Optional: set the default configuration name (defaults to "default")
default_threshold_config: "default"
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0
  high_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 60.0
          critical: 75.0
  low_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0
          critical: 95.0
hosts:
  prod-web-01:
    threshold_config: high_sensitivity  # single config
  dev-server-01:
    threshold_config: low_sensitivity
  # Hosts with no threshold_config use default_threshold_config
```
### Composable Configurations (list form)
`threshold_config` can be a list. Configs are applied **left to right**: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.
```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}
      disk_monitor:
        partitions:
          /:
            percent: {warning: 80, critical: 90}
  # Tighter CPU limits for busy servers
  high_cpu_load:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}
  # Tighter disk limits for data-heavy servers
  busy_disk:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent: {warning: 70, critical: 85}
hosts:
  # Gets default thresholds only
  web-01:
    threshold_config: default
  # Gets tighter CPU limits, default memory and disk
  build-server:
    threshold_config: high_cpu_load
  # Layers both: tighter CPU AND tighter disk, default memory
  db-01:
    threshold_config: [high_cpu_load, busy_disk]
  # Three layers: busy_disk overrides high_cpu_load if they conflict
  storage-01:
    threshold_config: [default, high_cpu_load, busy_disk]
```
**How layering works:**
Starting from the `default` thresholds:
| Layer | Applied config | Effect |
|-------|---------------|--------|
| Base | `default` | all default thresholds |
| +1 | `high_cpu_load` | cpu_percent overridden to 60/75 |
| +2 | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |
Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.
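The layering described above behaves like a recursive dictionary merge. The sketch below is illustrative only: `deep_merge` and `layer_thresholds` are hypothetical helpers operating on the parsed YAML as plain dicts, a single string is treated as a one-element list layered on the defaults (matching the `build-server` example), and merging down to individual keys such as `warning` is an assumption.

```python
import copy


def deep_merge(base, override):
    """Return a new dict with `override` layered on top of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into sections
        else:
            merged[key] = copy.deepcopy(value)            # later layer wins
    return merged


def layer_thresholds(threshold_configs, names, default_name="default"):
    """Apply named configs left to right on top of the default thresholds."""
    if isinstance(names, str):
        names = [names]
    base = threshold_configs.get(default_name, {}).get("thresholds", {})
    result = copy.deepcopy(base)
    for name in names:
        result = deep_merge(result, threshold_configs[name]["thresholds"])
    return result
```

Calling `layer_thresholds(configs, ["high_cpu_load", "busy_disk"])` on the example above would yield CPU limits of 60/75, disk limits of 70/85, and the default memory limits.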
### Use Cases
#### 1. Environment-Based Thresholds
Different thresholds for production vs. development:
```yaml
threshold_configs:
  production:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0  # Alert earlier in production
          critical: 85.0
  development:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0  # More relaxed for dev
          critical: 98.0
hosts:
  prod-web-01:
    threshold_config: production
  prod-web-02:
    threshold_config: production
  dev-web-01:
    threshold_config: development
  dev-web-02:
    threshold_config: development
```
#### 2. Server Role-Based Thresholds
Different thresholds based on server function:
```yaml
threshold_configs:
  webserver:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0
  database:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0
          critical: 85.0
      memory_monitor:
        memory_percent:
          warning: 90.0  # Databases can use high memory
          critical: 97.0
      disk_monitor:
        partitions:
          /var/lib/mysql:
            percent:
              warning: 75.0
              critical: 85.0
  cache:
    thresholds:
      memory_monitor:
        memory_percent:
          warning: 95.0  # Redis/Memcached can use very high memory
          critical: 99.0
hosts:
  web-01:
    threshold_config: webserver
  web-02:
    threshold_config: webserver
  db-01:
    threshold_config: database
  db-02:
    threshold_config: database
  redis-01:
    threshold_config: cache
  memcached-01:
    threshold_config: cache
```
#### 3. Sensitivity Levels
Different sensitivity for critical vs. non-critical systems:
```yaml
threshold_configs:
  critical:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 70.0
              critical: 80.0
              hysteresis: 0.15
  standard:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 85.0
              critical: 95.0
              hysteresis: 0.1
  relaxed:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 90.0
              critical: 98.0
              hysteresis: 0.05
hosts:
  payment-gateway:
    threshold_config: critical
  auth-server:
    threshold_config: critical
  web-01:
    threshold_config: standard
  web-02:
    threshold_config: standard
  test-server:
    threshold_config: relaxed
```
#### 4. Composable Profiles
Build host-specific thresholds by combining small, focused configs:
```yaml
threshold_configs:
  # Baseline: everything at default levels
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}
  # Overlay: tighter CPU only
  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}
  # Overlay: tighter memory only
  tight_memory:
    thresholds:
      memory_monitor:
        memory_percent: {warning: 70, critical: 85}
  # Overlay: extra disk partition for database servers
  db_disk:
    thresholds:
      disk_monitor:
        partitions:
          /var/lib/postgresql:
            percent: {warning: 75, critical: 88}
hosts:
  # Plain web server
  web-01:
    threshold_config: default
  # Build server: tight CPU, default memory and disk
  build-01:
    threshold_config: tight_cpu
  # Database: tight CPU + tight memory + extra disk partition
  db-01:
    threshold_config: [tight_cpu, tight_memory, db_disk]
  # Replica database: tight memory + extra disk, normal CPU
  db-02:
    threshold_config: [tight_memory, db_disk]
```
### Configuration Priority
1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
2. **Host `threshold_config` (string)**: Use that single named config directly
3. **`host_threshold_mapping`** (legacy): Same as above, string only
4. **`default_threshold_config`**: Used for hosts with no mapping
5. **First alphabetically**: If the default config is not found, use the first config alphabetically
6. **Legacy `thresholds` section**: Used when `threshold_configs` is absent entirely
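The lookup order above can be sketched as a function that returns the config names to apply for a host. This is a hedged illustration, not the actual resolution code: `resolve_config_names` is a hypothetical helper, `config` is assumed to be the parsed YAML as a plain dict, and an empty result signals that the caller should fall back to the legacy flat `thresholds` section.

```python
def resolve_config_names(hostname, config):
    """Return the ordered list of threshold config names for a host."""
    configs = config.get("threshold_configs")
    if not configs:
        return []  # step 6: caller uses the legacy flat `thresholds` section

    # Steps 1-2: per-host `threshold_config`, either a list or a string.
    host_entry = config.get("hosts", {}).get(hostname) or {}
    selected = host_entry.get("threshold_config")
    # Step 3: legacy top-level mapping (string only).
    if selected is None:
        selected = config.get("host_threshold_mapping", {}).get(hostname)
    if selected is not None:
        return [selected] if isinstance(selected, str) else list(selected)

    # Step 4: configured default name, itself defaulting to "default".
    default_name = config.get("default_threshold_config", "default")
    if default_name in configs:
        return [default_name]
    # Step 5: fall back to the first config name alphabetically.
    return [sorted(configs)[0]]
```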
### Backward Compatibility
The legacy `host_threshold_mapping` top-level key and the flat `thresholds` section are still fully supported:
```yaml
# Still works: equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
host_threshold_mapping:
  prod-web-01: high_sensitivity
# Still works: equivalent to threshold_configs: {default: {thresholds: ...}}
thresholds:
  cpu_monitor:
    cpu_percent: {warning: 80, critical: 90}
```