Threshold Alerting System
Overview
The Heartbeat Monitoring System includes a comprehensive threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured thresholds. This system is designed to:
- Detect anomalies: Automatically identify when system metrics exceed safe operating ranges
- Prevent alert fatigue: Use hysteresis to prevent notification flapping
- Escalate appropriately: Support WARNING and CRITICAL severity levels
- Track state: Maintain alert history and state transitions per host
- Integrate seamlessly: Work with existing notification infrastructure (email, pushover, etc.)
Architecture
Components
- ThresholdChecker (hbd/threshold.py)
  - Main threshold checking engine
  - Parses configuration
  - Evaluates metrics against thresholds
  - Triggers notifications on state changes
- ThresholdConfig
  - Individual threshold configuration
  - Supports multiple comparison operators
  - Implements hysteresis logic
- AlertState
  - Tracks current alert state per metric
  - Records state transitions
  - Manages notification timing
- Integration Points
  - UDP handler: Checks thresholds when plugin data arrives
  - Host objects: Store alert states per host
  - Notification system: Sends alerts via configured channels
Alert Levels
- OK: Metric is within normal range
- WARNING: Metric has exceeded warning threshold (first-level concern)
- CRITICAL: Metric has exceeded critical threshold (requires immediate attention)
- UNKNOWN: Metric value cannot be evaluated (e.g., non-numeric data)
Configuration
Basic Structure
Thresholds are configured in the YAML configuration file under the thresholds section:
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      display: "display format"
      enabled: true
Configuration Parameters
Required Parameters
- warning: Warning threshold value (numeric)
- critical: Critical threshold value (numeric)
Note: At least one of warning or critical must be specified.
Optional Parameters
- operator: Comparison operator (default: ">")
  - ">" - Greater than
  - ">=" - Greater than or equal
  - "<" - Less than
  - "<=" - Less than or equal
  - "==" - Equal to
  - "!=" - Not equal to
- hysteresis: Fraction of the threshold used as a recovery margin to prevent flapping (default: 0.1 = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when a value hovers near the threshold
- display: Format string (f-string style placeholders) for the threshold detail in alert messages
  - Defaults to "(threshold: {op_symbol} {threshold_value})"
- enabled: Whether this threshold is active (default: true)
Comparison Operators
Greater Than (>, >=)
Used for metrics where higher values are problematic:
cpu_monitor:
  cpu_percent:
    warning: 80.0   # Alert when CPU > 80%
    critical: 90.0  # Alert when CPU > 90%
    operator: ">"
Examples:
- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters
Less Than (<, <=)
Used for metrics where lower values are problematic:
memory_monitor:
  available_mb:
    warning: 1000  # Alert when available memory < 1GB
    critical: 500  # Alert when available memory < 500MB
    operator: "<"
Examples:
- Available memory
- Free disk space
- Connection pool availability
- Battery level
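The operator choice above can be illustrated with a minimal sketch. It is not the code in hbd/threshold.py: the names `OPERATORS` and `evaluate_level` are hypothetical, and checking the critical threshold before the warning threshold is an assumption about evaluation order.

```python
import operator

# Hypothetical lookup table mapping the configured operator strings
# to Python comparison functions.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate_level(value, warning=None, critical=None, op=">"):
    """Return "CRITICAL", "WARNING", "OK" or "UNKNOWN" for one metric value."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return "UNKNOWN"
    compare = OPERATORS[op]
    # Assumption: the more severe level is checked first so it wins
    # when both thresholds match.
    if critical is not None and compare(value, critical):
        return "CRITICAL"
    if warning is not None and compare(value, warning):
        return "WARNING"
    return "OK"


print(evaluate_level(85.0, warning=80.0, critical=90.0))        # WARNING
print(evaluate_level(400, warning=1000, critical=500, op="<"))  # CRITICAL
```

With ">" the critical value sits above the warning value, and with "<" it sits below, which is why the two examples above order their thresholds differently.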
Hysteresis
Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.
How It Works
When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), hysteresis is applied when the value improves:
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81
Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
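The recovery arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and two details are assumptions rather than documented behaviour: that "<"-style thresholds mirror the rule by adding the margin, and that a value exactly at the recovery threshold still counts as alerting.

```python
def recovery_threshold(threshold, hysteresis, op=">"):
    """Value the metric must pass before the alert is re-evaluated."""
    margin = abs(threshold) * hysteresis
    if op in (">", ">="):
        return threshold - margin
    # Assumption: for "<" and "<=" the margin is added instead.
    return threshold + margin


def still_alerting(value, threshold, hysteresis, op=">"):
    """True while the value has not yet improved past the recovery threshold."""
    limit = recovery_threshold(threshold, hysteresis, op)
    if op in (">", ">="):
        return value >= limit
    return value <= limit


# Replays the sequence from the example: threshold 90, hysteresis 0.1.
for value in (91, 89, 85, 80):
    print(value, still_alerting(value, threshold=90, hysteresis=0.1))
```

Only the last value (80) falls below the recovery threshold of 81, so it is the only one for which the sketch reports that the alert is no longer held.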
Configuration Recommendations
- Stable metrics (CPU, memory): 10-15% hysteresis
      hysteresis: 0.1
- Very stable metrics (disk usage): 5% hysteresis
      hysteresis: 0.05
- Counter metrics (errors, packets): 20% hysteresis
      hysteresis: 0.2
- Binary states (exit codes): No hysteresis
      hysteresis: 0.0
Plugin-Specific Configuration
CPU Monitor
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1
  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15
  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"
  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
Memory Monitor
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"
  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"
  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
Disk Monitor
Disk thresholds support partition-specific configuration:
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"
    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"
    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
Network Monitor
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2
  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"
  dropout_total:
    warning: 50
    critical: 200
    operator: ">"
  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"
  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
Nagios Runner
The Nagios plugin runner reports exit codes that can be thresholded:
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
Notification Behavior
When Notifications Are Sent
Notifications are triggered on state changes:
- Escalation: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
      WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
- Recovery: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
      RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
- Re-notifications: Periodic reminders for ongoing alerts
      REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
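As a rough illustration of the escalation and recovery cases listed above, the following sketch classifies a state change and builds a message in the same shape as the examples. The names `SEVERITY`, `classify_transition` and `format_message` are hypothetical, the UNKNOWN level is left out, and the wording for a CRITICAL → WARNING recovery is an assumption.

```python
# Hypothetical severity ranking used to compare two alert levels.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def classify_transition(old_level, new_level):
    """Return "escalation", "recovery", or None when the level is unchanged."""
    if old_level == new_level:
        return None
    if SEVERITY[new_level] > SEVERITY[old_level]:
        return "escalation"
    return "recovery"


def format_message(host, metric, value, old_level, new_level):
    """Build a notification line, or None if no state change occurred."""
    kind = classify_transition(old_level, new_level)
    if kind == "escalation":
        return f"{new_level}: {host} - {metric} = {value}"
    if kind == "recovery":
        # Assumption: every downward transition uses the RECOVERED prefix.
        return f"RECOVERED: {host} - {metric} = {value} ({old_level} -> {new_level})"
    return None


print(format_message("webserver01", "cpu_monitor.cpu_percent", 85.0, "OK", "WARNING"))
print(format_message("webserver01", "cpu_monitor.cpu_percent", 70.0, "CRITICAL", "OK"))
```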
Notification Frequency
- State changes: Immediate notification
- Re-notifications: Controlled by threshold_renotify_interval (default: 3600 seconds = 1 hour)
threshold_renotify_interval: 3600 # Re-notify every hour for ongoing alerts
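The reminder timing can be pictured as a simple elapsed-time check. This is a sketch only: `should_renotify` is a hypothetical helper, and comparing against the time of the last notification (rather than the start of the alert) is an assumption.

```python
import time


def should_renotify(last_notification, interval=3600, now=None):
    """True when at least `interval` seconds have passed since the last notification."""
    # `now` can be injected for testing; it defaults to the current time.
    now = time.time() if now is None else now
    return (now - last_notification) >= interval


# An alert last notified at t=1000 is due for a reminder at t=4600.
print(should_renotify(1000, interval=3600, now=4600))  # True
print(should_renotify(1000, interval=3600, now=4599))  # False
```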
Notification Channels
The system supports centralized notification channel definitions, allowing different hosts to use different notification providers and credentials. This provides fine-grained control over who gets notified about what.
Supported Channel Types
- Email (via SMTP)
- Pushover (mobile notifications)
- Signal (via signal-cli)
- Mattermost (team chat webhooks)
Centralized Channel Configuration
Define notification channels once in the configuration file:
notification_channels:
  # Signal notifications
  signal_ops:
    type: signal
    cli_path: /usr/local/bin/signal-cli
    user: +1234567890
    recipient: +1234567890
  # Email notifications
  email_ops:
    type: email
    recipients: [ops@example.com, alerts@example.com]
    sender: heartbeat@example.com
    smtp_server: smtp.example.com
    smtp_port: 587
    smtp_user: heartbeat@example.com
    smtp_password: your-smtp-password
  # Pushover notifications
  pushover_urgent:
    type: pushover
    token: your-pushover-app-token
    user: your-pushover-user-key
  # Mattermost notifications
  mattermost_devops:
    type: mattermost
    host: mattermost.example.com
    token: your-webhook-token
    channel: devops-alerts
    username: heartbeat-bot
    icon: https://example.com/heartbeat-icon.png

# Default channels for hosts that don't specify channels
default_notification_channels: [email_ops]
Per-Host Channel Assignment
Assign notification channels to specific hosts in the hosts section:
hosts:
  # Critical server - multiple notification channels
  prod-web-01:
    threshold_config: high_sensitivity
    watch: true
    notification_channels: [signal_ops, pushover_urgent, email_ops]
    dyndns: false
  # Database server - ops team only
  prod-db-01:
    threshold_config: database
    watch: true
    notification_channels: [signal_ops, email_ops]
    dyndns: false
  # Development server - email only
  dev-server-01:
    threshold_config: low_sensitivity
    watch: false
    notification_channels: [email_ops]
    dyndns: false
  # Uses default_notification_channels if not specified
  test-server-01:
    threshold_config: default
    watch: false
    dyndns: false
Watched Hosts
Only hosts with watch: true in the hosts section will trigger notifications:
hosts:
  webserver01:
    watch: true
    notification_channels: [email_ops]
  database01:
    watch: true
    notification_channels: [signal_ops, email_ops]
  mailserver:
    watch: true
    notification_channels: [pushover_urgent]
Hosts not marked for watching will still have thresholds checked and alert states tracked, but won't send notifications.
Alert State Tracking
Each host maintains alert states for all monitored metrics:
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
Alert states persist in memory and are saved with host data (pickle).
Alert State Information
Each AlertState tracks:
- level: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- since: Timestamp when current state started
- last_value: Most recent metric value
- last_check: Timestamp of last threshold check
- notification_count: Number of notifications sent for this alert
- last_notification: Timestamp of last notification
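The fields listed above map naturally onto a small dataclass. The sketch below is illustrative and is not the AlertState class from hbd/threshold.py: the class names `AlertLevel` and `AlertStateSketch`, the default values, and the `update` method (including resetting the notification count on a state change) are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlertLevel(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


@dataclass
class AlertStateSketch:
    level: AlertLevel = AlertLevel.OK
    since: float = 0.0                 # when the current level started
    last_value: Optional[float] = None
    last_check: float = 0.0
    notification_count: int = 0
    last_notification: float = 0.0

    def update(self, level, value, now):
        """Record a check result; return True if the level changed."""
        changed = level != self.level
        if changed:
            self.level = level
            self.since = now
            # Assumption: the counter restarts with each new state.
            self.notification_count = 0
        self.last_value = value
        self.last_check = now
        return changed


state = AlertStateSketch()
print(state.update(AlertLevel.WARNING, 85.0, now=100.0))  # True
print(state.update(AlertLevel.WARNING, 86.0, now=160.0))  # False
```

Because `since` only moves on a level change, the alert duration can be derived as the current time minus `since`.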
Querying Alert States
Via HTTP API (future enhancement):
GET /api/hosts/webserver01/alerts
Response:
{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}
Testing
A comprehensive test suite is provided in test_threshold.py:
python test_threshold.py
Tests cover:
- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries
Best Practices
1. Start Conservative
Begin with higher thresholds to avoid alert fatigue:
cpu_monitor:
  cpu_percent:
    warning: 85.0   # Start higher
    critical: 95.0  # Very high for critical
Adjust downward based on observed behavior.
2. Consider Workload Patterns
Different systems have different normal ranges:
Web servers (bursty traffic):
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
Database servers (steady load):
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1  # Lower hysteresis for steady metrics
3. Use Appropriate Operators
Match the operator to the metric:
| Metric Type | Example | Operator | Reason |
|---|---|---|---|
| Resource usage | CPU%, Memory% | > | Alert when high |
| Available resources | Free memory, Free disk | < | Alert when low |
| Error counters | Network errors | > | Alert when increasing |
| Health checks | Nagios exit code | >= | Map to standard codes |
4. Align with Monitoring Intervals
Ensure threshold checks align with plugin collection intervals:
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes
5. Test Before Production
1. Start with disabled thresholds:
       enabled: false
2. Observe metric ranges over a week
3. Set thresholds based on observed data
4. Enable gradually:
       enabled: true
5. Monitor for false positives
6. Document Baseline Values
Keep a record of normal operating ranges:
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month
cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
7. Layer Alerts
Use WARNING for early notification, CRITICAL for immediate action:
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0   # Early warning: "check in next few days"
        critical: 90.0  # Critical: "act now before outage"
Troubleshooting
No Notifications Being Sent
- Check if host is watched:
      watchhosts:
        - your-host-name
- Verify notification configuration:
      toemail:
        - admin@example.com
      smtpserver: smtp.example.com
- Check threshold configuration:
      # Look for parsing errors in server logs
      grep "threshold" /var/log/heartbeat/hbd.log
- Verify metric names:
  - Metric names must match exactly (case-sensitive)
  - Check journal or logs for actual metric names
Too Many Alerts (Flapping)
- Increase hysteresis:
      hysteresis: 0.2  # Increase from 0.1 to 0.2 (20%)
- Adjust thresholds:
      warning: 85.0  # Increase from 80.0
- Increase renotification interval:
      threshold_renotify_interval: 7200  # 2 hours instead of 1
Alerts Not Triggering
- Check threshold operator:
      # For available memory (alert when LOW):
      operator: "<"  # NOT ">"
- Verify numeric values:
  - Ensure metric values are numeric
  - Check for unit mismatches (MB vs GB)
- Check if threshold is enabled:
      enabled: true  # NOT false
- Review hysteresis settings:
  - Very high hysteresis may prevent state changes
  - Try reducing or disabling temporarily
Alert State Not Recovering
- Check recovery threshold calculation:
      Threshold: 90
      Hysteresis: 0.1
      Recovery: 90 - (90 * 0.1) = 81
      Value must drop below 81 to recover
- Temporarily disable hysteresis:
      hysteresis: 0.0
- Monitor actual metric values:
      # Check journal for actual values
      grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
Advanced Topics
Custom Notification Callbacks
The ThresholdChecker supports custom notification functions:
def custom_notifier(message):
    # Send to incident management system
    pagerduty.trigger(message)
    # Log to custom system
    logger.critical(message)
    # Update dashboard
    metrics.alert_count.inc()

checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)
Programmatic Access
Query alert states programmatically:
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)
for alert in active:
    print(f"{alert.metric_path}: {alert.level.name} for {time.time() - alert.since}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
Integration with External Systems
Threshold violations can be integrated with:
- PagerDuty: Incident creation and escalation
- OpsGenie: On-call scheduling and routing
- ServiceNow: Ticket creation
- Grafana: Dashboard annotations
- Elasticsearch: Alert indexing and analysis
Future Enhancements
Planned features:
- Composite thresholds: Alert based on multiple metrics
      composite:
        high_load_with_low_memory:
          conditions:
            - cpu_monitor.load_1min > 8.0
            - memory_monitor.available_mb < 500
- Time-based thresholds: Different thresholds by time of day
      schedule:
        business_hours:
          warning: 70.0
        off_hours:
          warning: 85.0
- Rate-of-change thresholds: Alert on rapid changes
      rate_of_change:
        metric: cpu_percent
        period: 300
        threshold: 30.0  # Alert if changes >30% in 5 minutes
- Alert grouping: Combine related alerts
      groups:
        disk_critical:
          metrics:
            - disk_monitor./.percent
            - disk_monitor./var.percent
          action: single_notification
- Maintenance windows: Suppress alerts during planned maintenance
      maintenance:
        - host: webserver01
          start: 2024-01-15T02:00:00Z
          end: 2024-01-15T04:00:00Z
See Also
- Plugin Development Guide
- Message Journal Documentation
- Configuration examples: hbd/config_thresholds_example.yaml
- Test suite: test_threshold.py
Multi-Threshold Configuration
Support for multiple named threshold configurations with per-host mapping and composable layering.
Overview
The multi-threshold feature allows you to:
- Define multiple named threshold configurations
- Assign one or more configurations to each host
- Compose configurations by layering — each named config's overrides are applied in order on top of the defaults
- Use different sensitivity levels for different environments
Configuration Structure
Named configurations are defined under threshold_configs. Each host selects which ones to use via threshold_config in the hosts section (a string for a single config, or a list to layer multiple):
# Optional: set the default configuration name (defaults to "default")
default_threshold_config: "default"

threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0
  high_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 60.0
          critical: 75.0
  low_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0
          critical: 95.0

hosts:
  prod-web-01:
    threshold_config: high_sensitivity  # single config
  dev-server-01:
    threshold_config: low_sensitivity
  # Hosts with no threshold_config use default_threshold_config
Composable Configurations (list form)
threshold_config can be a list. Configs are applied left to right: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}
      disk_monitor:
        partitions:
          /:
            percent: {warning: 80, critical: 90}
  # Tighter CPU limits for busy servers
  high_cpu_load:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}
  # Tighter disk limits for data-heavy servers
  busy_disk:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent: {warning: 70, critical: 85}

hosts:
  # Gets default thresholds only
  web-01:
    threshold_config: default
  # Gets tighter CPU limits, default memory and disk
  build-server:
    threshold_config: high_cpu_load
  # Layers both: tighter CPU AND tighter disk, default memory
  db-01:
    threshold_config: [high_cpu_load, busy_disk]
  # Three layers: busy_disk overrides high_cpu_load if they conflict
  storage-01:
    threshold_config: [default, high_cpu_load, busy_disk]
How layering works:
Starting from the default thresholds:
| Layer | Applied config | Effect |
|---|---|---|
| Base | default | all default thresholds |
| +1 | high_cpu_load | cpu_percent overridden to 60/75 |
| +2 | busy_disk | disk percent overridden to 70/85; cpu_percent stays at 60/75 |
Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.
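The layering rule can be sketched as a left-to-right fold over the defaults. This is an illustration under stated assumptions, not the project's implementation: the helper names are hypothetical, and treating any dict that contains `warning` or `critical` as a whole-metric override (replaced rather than merged key by key) is an assumption drawn from the sentence above.

```python
import copy
from functools import reduce


def is_threshold_spec(node):
    """Assumption: a metric's settings are the dict holding warning/critical."""
    return isinstance(node, dict) and ("warning" in node or "critical" in node)


def overlay(base, override):
    """Return a copy of base with override applied on top of it."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        current = merged.get(key)
        if isinstance(value, dict) and isinstance(current, dict) and not is_threshold_spec(value):
            # Descend through plugin and partition levels.
            merged[key] = overlay(current, value)
        else:
            # A metric definition replaces the inherited one as a unit.
            merged[key] = copy.deepcopy(value)
    return merged


def thresholds_for(names, raw_configs, defaults):
    """Fold the named overrides left to right over the default thresholds."""
    return reduce(lambda acc, name: overlay(acc, raw_configs[name]),
                  names, copy.deepcopy(defaults))


defaults = {
    "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
    "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
    "disk_monitor": {"partitions": {"/": {"percent": {"warning": 80, "critical": 90}}}},
}
raw_configs = {
    "high_cpu_load": {"cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}}},
    "busy_disk": {"disk_monitor": {"partitions": {"/": {"percent": {"warning": 70, "critical": 85}}}}},
}

result = thresholds_for(["high_cpu_load", "busy_disk"], raw_configs, defaults)
print(result["cpu_monitor"]["cpu_percent"])
print(result["memory_monitor"]["memory_percent"])
```

For the db-01 example this yields the tighter CPU and disk limits while memory keeps its default values, and the `defaults` dict itself is left unmodified.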
Use Cases
1. Environment-Based Thresholds
Different thresholds for production vs. development:
threshold_configs:
  production:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0  # Alert earlier in production
          critical: 85.0
  development:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0  # More relaxed for dev
          critical: 98.0

hosts:
  prod-web-01:
    threshold_config: production
  prod-web-02:
    threshold_config: production
  dev-web-01:
    threshold_config: development
  dev-web-02:
    threshold_config: development
2. Server Role-Based Thresholds
Different thresholds based on server function:
threshold_configs:
  webserver:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0
  database:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0
          critical: 85.0
      memory_monitor:
        memory_percent:
          warning: 90.0  # Databases can use high memory
          critical: 97.0
      disk_monitor:
        partitions:
          /var/lib/mysql:
            percent:
              warning: 75.0
              critical: 85.0
  cache:
    thresholds:
      memory_monitor:
        memory_percent:
          warning: 95.0  # Redis/Memcached can use very high memory
          critical: 99.0

hosts:
  web-01:
    threshold_config: webserver
  web-02:
    threshold_config: webserver
  db-01:
    threshold_config: database
  db-02:
    threshold_config: database
  redis-01:
    threshold_config: cache
  memcached-01:
    threshold_config: cache
3. Sensitivity Levels
Different sensitivity for critical vs. non-critical systems:
threshold_configs:
  critical:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 70.0
              critical: 80.0
              hysteresis: 0.15
  standard:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 85.0
              critical: 95.0
              hysteresis: 0.1
  relaxed:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 90.0
              critical: 98.0
              hysteresis: 0.05

hosts:
  payment-gateway:
    threshold_config: critical
  auth-server:
    threshold_config: critical
  web-01:
    threshold_config: standard
  web-02:
    threshold_config: standard
  test-server:
    threshold_config: relaxed
4. Composable Profiles
Build host-specific thresholds by combining small, focused configs:
threshold_configs:
  # Baseline: everything at default levels
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}
  # Overlay: tighter CPU only
  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}
  # Overlay: tighter memory only
  tight_memory:
    thresholds:
      memory_monitor:
        memory_percent: {warning: 70, critical: 85}
  # Overlay: extra disk partition for database servers
  db_disk:
    thresholds:
      disk_monitor:
        partitions:
          /var/lib/postgresql:
            percent: {warning: 75, critical: 88}

hosts:
  # Plain web server
  web-01:
    threshold_config: default
  # Build server: tight CPU, default memory and disk
  build-01:
    threshold_config: tight_cpu
  # Database: tight CPU + tight memory + extra disk partition
  db-01:
    threshold_config: [tight_cpu, tight_memory, db_disk]
  # Replica database: tight memory + extra disk, normal CPU
  db-02:
    threshold_config: [tight_memory, db_disk]
Backward Compatibility
The legacy single threshold configuration is fully supported:
# Old format - still works
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      critical: 90.0
This is equivalent to:
# New format
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0
Configuration Priority
1. Host threshold_config (list): Layer each named config's overrides left-to-right on top of the defaults
2. Host threshold_config (string): Apply that single named config, equivalent to a one-element list
3. host_threshold_mapping (legacy): Same as the string form, string only
4. default_threshold_config: Used for hosts with no mapping
5. First alphabetically: If the default config is not found, use the first config alphabetically
6. Legacy thresholds section: Used when threshold_configs is absent entirely
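The lookup order above can be sketched as a small resolver that returns the list of config names to layer for a host. It is a hypothetical illustration: `normalise_threshold_config` and `config_names_for_host` are invented names, `config` stands for the parsed YAML as a plain dict, and the legacy flat thresholds case (no threshold_configs at all) is not handled here.

```python
def normalise_threshold_config(value):
    """Accept None, a single name, or a list of names; always return a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def config_names_for_host(hostname, config):
    """Resolve which named configs apply to a host, in layering order."""
    # 1-2. threshold_config on the host entry (string or list).
    host_entry = config.get("hosts", {}).get(hostname) or {}
    names = normalise_threshold_config(host_entry.get("threshold_config"))
    if names:
        return names
    # 3. Legacy top-level mapping, string only.
    legacy = config.get("host_threshold_mapping", {}).get(hostname)
    if legacy:
        return [legacy]
    # 4. Configured default name, falling back to "default".
    available = config.get("threshold_configs", {})
    default_name = config.get("default_threshold_config", "default")
    if default_name in available:
        return [default_name]
    # 5. First config alphabetically (empty list if none are defined).
    return sorted(available)[:1]


config = {
    "threshold_configs": {"default": {}, "high_cpu_load": {}, "busy_disk": {}},
    "hosts": {
        "db-01": {"threshold_config": ["high_cpu_load", "busy_disk"]},
        "build-server": {"threshold_config": "high_cpu_load"},
        "plain-host": None,
    },
    "host_threshold_mapping": {"legacy-01": "busy_disk"},
}
for host in ("db-01", "build-server", "legacy-01", "plain-host"):
    print(host, config_names_for_host(host, config))
```

Normalising the string form to a one-element list up front means the rest of the code only ever deals with lists.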
Backward Compatibility
The legacy host_threshold_mapping top-level key and the flat thresholds section are still fully supported:
# Still works: equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
host_threshold_mapping:
  prod-web-01: high_sensitivity

# Still works: equivalent to threshold_configs: {default: {thresholds: ...}}
thresholds:
  cpu_monitor:
    cpu_percent: {warning: 80, critical: 90}