feat: generic threshold matching for nagios_runner with {check_name} display support
_find_threshold() now returns the stripped prefix ("check_name") alongside
the ThresholdConfig, enabling a single generic entry (e.g. nagios_runner.status_code)
to cover all per-command metrics (check_disk_root_status_code, check_load_status_code,
…). The prefix is threaded through to _format_display() as {check_name}, with
{metric_name} also available in display templates. purge_stale_alerts() updated
to use generic matching so it does not incorrectly drop alerts on generic-matched
metrics. README updated with Display Format Templates and Generic Threshold
Matching sections.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -181,7 +181,8 @@ thresholds:
|
||||
warning: 80.0 # Warn when CPU > 80%
|
||||
critical: 90.0 # Critical when CPU > 90%
|
||||
operator: ">"
|
||||
hysteresis: 0.1 # 10% hysteresis to prevent flapping
|
||||
hysteresis: 0.02 # 2% hysteresis to prevent flapping
|
||||
display: "(threshold: {op_symbol} {threshold_value}%)" # optional
|
||||
|
||||
memory_monitor:
|
||||
percent:
|
||||
@@ -274,7 +275,59 @@ All plugin metrics can be thresholded:
|
||||
- **Memory**: percent, available_mb, swap_percent
|
||||
- **Disk**: Per-partition percent, free_gb, free_mb
|
||||
- **Network**: errors_total, dropped packets, connection counts
|
||||
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
|
||||
- **Nagios**: Any field emitted by `nagios_runner` (status_code, exit_code, performance data, …)
|
||||
|
||||
### Display Format Templates
|
||||
|
||||
Each threshold entry accepts an optional `display` field — a Python format string shown in notifications and on the Alerts dashboard:
|
||||
|
||||
```yaml
|
||||
nagios_runner:
|
||||
status_code:
|
||||
warning: 1
|
||||
critical: 2
|
||||
operator: ">="
|
||||
display: "{check_name}: exit {value} (expected < {threshold_value})"
|
||||
```
|
||||
|
||||
Available variables:
|
||||
|
||||
| Variable | Description |
|
||||
|---|---|
|
||||
| `{value}` | Current metric value |
|
||||
| `{threshold_value}` | Threshold that was crossed |
|
||||
| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …) |
|
||||
| `{check_name}` | Prefix stripped by generic matching (see below) |
|
||||
| `{metric_name}` | Full field name within the plugin data |
|
||||
| any plugin field | Any other field present in the plugin's data |
|
||||
|
||||
### Generic Threshold Matching
|
||||
|
||||
When a metric name has no exact threshold entry, the server progressively strips leading underscore-separated segments and re-tries the lookup. This lets a single generic entry cover an entire family of metrics.
|
||||
|
||||
The classic use case is `nagios_runner`, which names each metric after the command that produced it:
|
||||
|
||||
```
|
||||
nagios_runner.check_disk_root_status_code → no exact match
|
||||
nagios_runner.disk_root_status_code → no match
|
||||
nagios_runner.root_status_code → no match
|
||||
nagios_runner.status_code → matched ✓
|
||||
```
|
||||
|
||||
Configure the generic threshold once:
|
||||
|
||||
```yaml
|
||||
nagios_runner:
|
||||
status_code:
|
||||
warning: 1
|
||||
critical: 2
|
||||
operator: ">="
|
||||
display: "{check_name}: exit {value}"
|
||||
```
|
||||
|
||||
The stripped prefix (`check_disk_root` in the example above) is available as `{check_name}` in the display template, so you can identify which check triggered the alert without writing a separate threshold entry per command.
|
||||
|
||||
Exact matches always take priority. A generic entry only applies when no specific one is defined.
|
||||
|
||||
### Per-Host Threshold Profiles
|
||||
|
||||
|
||||
Reference in New Issue
Block a user