Compare commits
8 Commits
| SHA1 |
|---|
| f3d08d1c9e |
| 1e4263b793 |
| e931acb9f5 |
| 018409e71d |
| 1824f637b4 |
| a534c06b26 |
| d7b5c97a4e |
| ae447ac4a6 |
@@ -58,10 +58,11 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
 ### Built-in Plugins

 - `os_info`: Collects OS, kernel, distribution, and architecture information
-- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
-- `memory_monitor`: Monitors RAM and swap usage, available memory
+- `cpu_monitor`: Monitors CPU usage, load average, frequency, process counts, and uptime
+- `memory_monitor`: Monitors RAM and swap usage, available memory (ZFS ARC-aware)
 - `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
 - `network_monitor`: Monitors network interface statistics, bandwidth, and connections
+- `ping_monitor`: Measures round-trip latency to configured hosts
 - `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
 - `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
 - `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`
@@ -76,7 +77,7 @@ The `nagios_runner` plugin provides seamless integration with the vast Nagios pl
 - Validates absolute command paths at startup and warns on missing or non-executable files
 - Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
 - Extracts performance data with thresholds
-- Reports aggregated status across all configured checks
+- Reports per-check status, exit code, and output; no aggregate rollup field

 See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for complete integration guide including configuration examples and custom plugin development.

@@ -224,7 +225,7 @@ thresholds:
     <hostname>:
       warning: <milliseconds> # Warn when RTT > this value
       critical: <milliseconds> # Critical when RTT > this value
-      hysteresis: 0.1 # Optional: 10% hysteresis (default)
+      hysteresis: 0.02 # Optional: 2% hysteresis (default)
 ```

 **Example alerts:**
@@ -275,7 +276,7 @@ All plugin metrics can be thresholded:
 - **Memory**: percent, available_mb, swap_percent
 - **Disk**: Per-partition percent, free_gb, free_mb
 - **Network**: errors_total, dropped packets, connection counts
-- **Nagios**: Any field emitted by `nagios_runner` (status_code, exit_code, performance data, …)
+- **Nagios**: Any field emitted by `nagios_runner` (`<name>_status_code`, `<name>_status`, `<name>_output`, performance data fields)

 ### Display Format Templates

@@ -296,9 +297,11 @@ Available variables:
 |---|---|
 | `{value}` | Current metric value |
 | `{threshold_value}` | Threshold that was crossed |
-| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …) |
+| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …); `"nagios"` for the nagios operator |
 | `{check_name}` | Prefix stripped by generic matching (see below) |
 | `{metric_name}` | Full field name within the plugin data |
+| `{output}` | For `nagios_runner` generic matches: the matched check's status text (alias for `{check_name}_output`) |
+| `{status}` | For `nagios_runner` generic matches: the matched check's status name — OK/WARNING/CRITICAL/UNKNOWN (alias for `{check_name}_status`) |
 | any plugin field | Any other field present in the plugin's data |

 ### Generic Threshold Matching
@@ -314,15 +317,13 @@ nagios_runner.root_status_code → no match
 nagios_runner.status_code → matched ✓
 ```

-Configure the generic threshold once:
+Configure the generic threshold once using the `nagios` operator, which maps exit codes directly to alert severity without requiring numeric warning/critical values:

 ```yaml
 nagios_runner:
   status_code:
-    warning: 1
-    critical: 2
-    operator: ">="
-    display: "{check_name}: exit {value}"
+    operator: "nagios" # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
+    display: "{check_name}: {output}"
 ```

 The stripped prefix (`check_disk_root` in the example above) is available as `{check_name}` in the display template, so you can identify which check triggered the alert without writing a separate threshold entry per command.
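The lookup trace in this hunk (`root_status_code` → no match, `status_code` → matched) suggests that generic matching drops leading underscore-separated tokens from a field name until a configured key is found. The sketch below illustrates that idea only; it is not the project's implementation, and `generic_match` and its return shape are hypothetical.

```python
def generic_match(field, configured):
    """Try progressively shorter suffixes of `field` against `configured`.

    Returns (check_name, matched_key), or None when nothing matches.
    check_name is the stripped prefix, or None for an exact match.
    Hypothetical helper: the real matcher is not shown in this diff.
    """
    if field in configured:
        return None, field  # exact match, nothing stripped
    parts = field.split("_")
    for i in range(1, len(parts)):
        candidate = "_".join(parts[i:])
        if candidate in configured:
            return "_".join(parts[:i]), candidate
    return None


# One generic "status_code" entry covers every configured check.
print(generic_match("check_disk_root_status_code", {"status_code"}))
```

With a single `status_code` threshold configured, the field `check_disk_root_status_code` resolves to the prefix `check_disk_root`, which is what the README exposes as `{check_name}`.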
@@ -514,12 +515,11 @@ You can also run it via the module entrypoint:
 python -m hbd.client.main your-server.example.com
 ```

-Client configuration can also be specified in YAML:
+Client configuration can also be specified in YAML (`~/.hbc.yaml`):

 ```yaml
-server: hbd.example.com
-port: 50003
-interval: 30
+hb_port: 50003 # Server port (default: 50003)
+interval: 30 # Heartbeat interval in seconds
 plugins:
   cpu_monitor:
     interval: 300 # Check every 5 minutes (default)
@@ -533,10 +533,14 @@ plugins:
   nagios_runner:
     interval: 300 # Check every 5 minutes (default)
     commands:
-      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
-      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
+      - name: check_load
+        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
+      - name: check_disk
+        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
 ```

+The server hostname is always passed as a positional command-line argument; there is no `server:` config key.
+
 All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.

 **Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
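The new `commands` format pairs each command with a `name`, and the README states that absolute command paths are validated at startup. A rough sketch of such a startup check follows, assuming the YAML has already been loaded into a list of dicts; `check_commands` and the warning strings are hypothetical and are not taken from the plugin.

```python
import os
import shlex


def check_commands(commands):
    """Return warnings for entries that look misconfigured.

    Hypothetical helper: mirrors the name/command shape in the YAML
    above, not the plugin's actual validation code.
    """
    warnings = []
    for entry in commands:
        name = entry.get("name")
        command = entry.get("command")
        if not name or not command:
            warnings.append("skipping entry with missing name or command")
            continue
        executable = shlex.split(command)[0]
        # Only absolute paths are checked; bare names are left to PATH lookup.
        if os.path.isabs(executable) and not os.access(executable, os.X_OK):
            warnings.append(f"{name}: {executable} is missing or not executable")
    return warnings


commands = [
    {"name": "check_load",
     "command": "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6"},
    {"name": "check_disk"},  # no command: reported and skipped
]
for warning in check_commands(commands):
    print(warning)
```

Each surviving entry then yields per-check fields such as `check_load_status_code`, named after its `name` key.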
@@ -104,11 +104,6 @@ The `nagios_runner` plugin collects:
 - `{name}_{metric}_min` - Minimum value (if present)
 - `{name}_{metric}_max` - Maximum value (if present)

-**Overall:**
-- `overall_status` - Worst status from all commands
-- `overall_status_code` - Worst status code
-- `plugin_count` - Number of Nagios plugins executed
-
 ## Configuration Options

 ```yaml
@@ -1110,33 +1110,6 @@ hosts:
   db-02:
     threshold_config: [tight_memory, db_disk]
 ```
-
-### Backward Compatibility
-
-The legacy single threshold configuration is fully supported:
-
-```yaml
-# Old format - still works
-thresholds:
-  cpu_monitor:
-    cpu_percent:
-      warning: 80.0
-      critical: 90.0
-```
-
-This is equivalent to:
-
-```yaml
-# New format
-threshold_configs:
-  default:
-    thresholds:
-      cpu_monitor:
-        cpu_percent:
-          warning: 80.0
-          critical: 90.0
-```
-
 ### Configuration Priority

 1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
+1
-1
@@ -14,4 +14,4 @@ Install options:
 """

 __all__ = ["__version__"]
-__version__ = "5.1.20"
+__version__ = "5.2.1"
@@ -31,16 +31,13 @@ from hbd.client.plugin import MonitorPlugin


 # Nagios exit codes
-NAGIOS_OK = 0
-NAGIOS_WARNING = 1
-NAGIOS_CRITICAL = 2
 NAGIOS_UNKNOWN = 3

 STATUS_NAMES = {
-    NAGIOS_OK: "OK",
-    NAGIOS_WARNING: "WARNING",
-    NAGIOS_CRITICAL: "CRITICAL",
-    NAGIOS_UNKNOWN: "UNKNOWN"
+    0: "OK",
+    1: "WARNING",
+    2: "CRITICAL",
+    3: "UNKNOWN",
 }


@@ -128,52 +125,39 @@ class NagiosRunnerPlugin(MonitorPlugin):
             Dictionary with results from all plugins
         """
         results = {}

-        # Track overall status (worst status wins)
-        worst_status = NAGIOS_OK
-
         for cmd_config in self.commands:
             name = cmd_config.get("name")
             command = cmd_config.get("command")

             if not name or not command:
                 self.logger.warning("Skipping command with missing name or command")
                 continue

             # Execute plugin
             try:
                 status_code, output, perfdata = await self._run_nagios_plugin(command)

                 # Store results
                 results[f"{name}_status"] = STATUS_NAMES.get(status_code, "UNKNOWN")
                 results[f"{name}_status_code"] = status_code
                 results[f"{name}_output"] = output

-                # Track worst status
-                if status_code > worst_status:
-                    worst_status = status_code
-
                 # Parse and add performance data
                 if perfdata:
                     for metric_name, metric_value in perfdata.items():
                         results[f"{name}_{metric_name}"] = metric_value

                 self.logger.info(
                     f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
                 )

             except Exception as e:
                 self.logger.error(f"Error running {name}: {e}", exc_info=True)
                 results[f"{name}_status"] = "ERROR"
                 results[f"{name}_status_code"] = NAGIOS_UNKNOWN
                 results[f"{name}_output"] = str(e)
-                worst_status = NAGIOS_UNKNOWN

-        # Add overall status
-        results["overall_status"] = STATUS_NAMES.get(worst_status, "UNKNOWN")
-        results["overall_status_code"] = worst_status
-        results["plugin_count"] = len(self.commands)
-
         return results

     async def _run_nagios_plugin(
@@ -95,6 +95,12 @@ THRESHOLD_DEFAULTS = {
             'warning': 200,
             'critical': 250.0,
             'count': 3 # Optional: number of consecutive breaches before alerting
+        },
+        'nagios_runner': {
+            'status_code': {
+                'display': '{check_name} {output}',
+                'operator': "nagios"
+            }
         }
     }
 }
@@ -475,6 +475,7 @@ def run(config, config_path=None):
     if config.get("debug", 0) > 0:
         log_level = logging.DEBUG
     logging.basicConfig(level=log_level)
+    logging.getLogger("aiohttp.access").setLevel(logging.DEBUG)
     load_pickled_hosts(config, hbdclass)

     notify_mod.initlog(logfile=config.get("logfile", "messages.log"))
@@ -4,7 +4,7 @@

     <style>

-        body {
+        html, body {
             height: auto;
             overflow-y: auto;
         }
@@ -175,8 +175,12 @@

         .alert-hostname {
             font-weight: bold;
-            color: #333;
+            color: #0066cc;
             font-size: 1.1em;
+            text-decoration: none;
+        }
+        .alert-hostname:hover {
+            text-decoration: underline;
         }

         .alert-metric {
@@ -433,7 +437,7 @@
                     <div class="alert-main">
                         <div class="alert-header">
                             <span class="alert-level ${level}">${alert.level}</span>
-                            <span class="alert-hostname">${alert.hostname}</span>
+                            <a class="alert-hostname" href="/plugins#${alert.hostname}">${alert.hostname}</a>
                         </div>
                         <div class="alert-metric">${alert.metric_path}</div>
                         <div class="alert-details">
@@ -499,6 +499,17 @@
         return pluginCache[hostname]?.[pluginName] ?? null;
     }

+    // Return worst nagios exit code (0-3) found in a nagios_runner data object.
+    function nagiosWorstStatus(data) {
+        let worst = 0;
+        for (const [k, v] of Object.entries(data || {})) {
+            if (k.endsWith('_status_code') && typeof v === 'number' && v > worst) {
+                worst = v;
+            }
+        }
+        return worst;
+    }
+
     // ── Fetch helpers ───────────────────────────────────────────────────────

     async function fetchPlugin(hostname, pluginName) {
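Because the plugin no longer emits `overall_status_code`, any consumer that wants a rollup has to derive it from the per-check fields, as `nagiosWorstStatus` does in the dashboard. For server-side use, the same logic might look like the Python sketch below; `nagios_worst_status` is a hypothetical name and does not appear in the diff.

```python
def nagios_worst_status(data):
    """Return the highest numeric *_status_code value in `data` (0 if none).

    Hypothetical Python counterpart of the dashboard's nagiosWorstStatus().
    """
    worst = 0
    for key, value in (data or {}).items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if key.endswith("_status_code") and is_number and value > worst:
            worst = value
    return worst


sample = {
    "check_load_status_code": 1,
    "check_disk_status_code": 2,
    "check_disk_output": "DISK CRITICAL",
}
print(nagios_worst_status(sample))
```

Note that this ordering ranks UNKNOWN (3) above CRITICAL (2), which matches the dashboard code treating any value of 2 or more as critical.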
@@ -600,13 +611,13 @@
             ? chips.join('')
             : '<span class="glance-loading">—</span>';

-        // Nagios badge
+        // Nagios badge — derive worst status from individual check codes
         const nagios = getCache(hostname, 'nagios_runner');
         if (nagosBadge && nagios) {
-            const status = (nagios.data.overall_status || '—').toUpperCase();
-            const cls = status === 'OK' ? 'ok'
-                : status === 'WARNING' ? 'warning'
-                : status === 'CRITICAL' ? 'critical' : '';
+            const worst = nagiosWorstStatus(nagios.data);
+            const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
+            const status = names[worst] || '—';
+            const cls = worst === 0 ? 'ok' : worst === 1 ? 'warning' : worst >= 2 ? 'critical' : '';
             nagosBadge.className = `nagios-badge ${cls}`;
             nagosBadge.textContent = status;
         }
@@ -715,9 +726,10 @@
                 break;
             }
             case 'nagios_runner': {
-                const status = (d.overall_status || '?').toUpperCase();
-                const count = d.plugin_count;
-                text = status + (count != null ? ` — ${count} checks` : '');
+                const worst = nagiosWorstStatus(d);
+                const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
+                const codes = Object.keys(d).filter(k => k.endsWith('_status_code'));
+                text = (names[worst] || '?') + (codes.length ? ` — ${codes.length} checks` : '');
                 break;
             }
             case 'filesystem_info': {
+108
-66
@@ -30,12 +30,13 @@ class AlertLevel(Enum):

 class ComparisonOperator(Enum):
     """Supported comparison operators for threshold checks."""
     GT = ">" # Greater than
     GTE = ">=" # Greater than or equal
     LT = "<" # Less than
     LTE = "<=" # Less than or equal
     EQ = "==" # Equal to
     NEQ = "!=" # Not equal to
+    NAGIOS = "nagios" # Nagios exit-code semantics: 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN


 class AlertState:
@@ -229,33 +230,43 @@ class ThresholdConfig:
     def evaluate(self, value: float) -> AlertLevel:
         """
         Evaluate a value against this threshold.

         Args:
             value: Metric value to check

         Returns:
             AlertLevel indicating the severity
         """
         if not self.enabled:
             return AlertLevel.OK

+        # Nagios exit-code semantics: value IS the severity
+        if self.operator == ComparisonOperator.NAGIOS:
+            try:
+                code = int(value)
+            except (TypeError, ValueError):
+                return AlertLevel.UNKNOWN
+            return {0: AlertLevel.OK, 1: AlertLevel.WARNING, 2: AlertLevel.CRITICAL}.get(
+                code, AlertLevel.UNKNOWN
+            )
+
         try:
             # Convert value to float for comparison
             value = float(value)
         except (TypeError, ValueError):
             logger.warning("Cannot convert value %s to float for %s", value, self.metric_path)
             return AlertLevel.UNKNOWN

         # Check critical threshold first
         if self.critical is not None:
             if self._compare(value, self.critical):
                 return AlertLevel.CRITICAL

         # Then check warning threshold
         if self.warning is not None:
             if self._compare(value, self.warning):
                 return AlertLevel.WARNING

         return AlertLevel.OK

     def evaluate_with_hysteresis(
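The `nagios` branch added in this hunk treats the metric value as the severity itself rather than comparing it against thresholds. A standalone sketch of that mapping follows; the `AlertLevel` enum here is a simplified stand-in (its member values are assumptions) and `evaluate_nagios` is a hypothetical free function, not a method from the project.

```python
from enum import Enum


class AlertLevel(Enum):
    # Simplified stand-in for the project's enum; values are illustrative.
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


_NAGIOS_LEVELS = {
    0: AlertLevel.OK,
    1: AlertLevel.WARNING,
    2: AlertLevel.CRITICAL,
}


def evaluate_nagios(value):
    """Map a Nagios exit code to an alert level; anything else is UNKNOWN."""
    try:
        code = int(value)
    except (TypeError, ValueError):
        return AlertLevel.UNKNOWN
    return _NAGIOS_LEVELS.get(code, AlertLevel.UNKNOWN)


for raw in (0, "2", 3, None, "not-a-code"):
    print(repr(raw), "->", evaluate_nagios(raw).name)
```

Exit code 3, out-of-range integers, and values that cannot be converted all fall through to `UNKNOWN`, so no `warning` or `critical` number is needed in the threshold entry.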
@@ -274,7 +285,11 @@ class ThresholdConfig:
             New alert level considering hysteresis
         """
         new_level = self.evaluate(value)

+        # Nagios exit codes are discrete integers — hysteresis doesn't apply
+        if self.operator == ComparisonOperator.NAGIOS:
+            return new_level
+
         # If no hysteresis, return new level
         if self.hysteresis == 0.0:
             return new_level
@@ -404,14 +419,28 @@ class ThresholdChecker:

     def _parse_config(self, config: Dict[str, Any]):
         """Parse threshold configuration from YAML structure.

         Supports two formats:
         1. Legacy format with direct 'thresholds' section
         2. New format with 'threshold_configs' and 'host_threshold_mapping'
+
+        In all cases, THRESHOLD_DEFAULTS are seeded into threshold_configs["default"]
+        so the Settings page always shows the built-in defaults.
+        _parse_multi_config() overwrites this with the fully-merged effective defaults.
         """
+        # Always expose built-in defaults through threshold_configs["default"] so
+        # the Settings page has something to display even in legacy/no-config mode.
+        seed: Dict[str, ThresholdConfig] = {}
+        for plugin_name, plugin_thresholds in THRESHOLD_DEFAULTS.get("thresholds", {}).items():
+            if isinstance(plugin_thresholds, dict):
+                self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=seed)
+        if seed:
+            self.threshold_configs["default"] = seed
+            self.threshold_raw_configs["default"] = {}
+
         # Check for new multi-config format
         if "threshold_configs" in config:
-            self._parse_multi_config(config)
+            self._parse_multi_config(config) # overwrites threshold_configs["default"]
         elif "thresholds" in config:
             # Legacy single threshold configuration
             self._parse_legacy_config(config)
@@ -557,11 +586,14 @@ class ThresholdChecker:
             warning = threshold_config.get("warning")
             critical = threshold_config.get("critical")
             operator = threshold_config.get("operator", ">")
-            display = threshold_config.get("display", "(threshold: {op_symbol} {threshold_value})")
-            hysteresis = threshold_config.get("hysteresis", 0.02) # 2% default
+            # Nagios operator maps exit codes directly; no numeric thresholds needed
+            is_nagios_op = (operator == "nagios")
+            default_display = "{check_name}: {output}" if is_nagios_op else "(threshold: {op_symbol} {threshold_value})"
+            display = threshold_config.get("display", default_display)
+            hysteresis = threshold_config.get("hysteresis", 0.0 if is_nagios_op else 0.02)
             enabled = threshold_config.get("enabled", True)

-            if warning is None and critical is None:
+            if warning is None and critical is None and not is_nagios_op:
                 logger.warning("No thresholds defined for %s, skipping", metric_path)
                 continue

@@ -1011,53 +1043,20 @@ class ThresholdChecker:

         # Format operator symbol
         op_symbol = threshold.operator.value
+
+        # Short metric label: strip the plugin-name prefix for readability
+        short_path = metric_path.partition(".")[2] or metric_path

         # Use a display-friendly value (inf is the sentinel for "overdue")
         import math
         display_value = "overdue" if isinstance(value, float) and math.isinf(value) else value

-        # Format message
-        if new_level == AlertLevel.OK:
-            lvl = "RECOVER"
-            message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
-        elif new_level == AlertLevel.WARNING:
-            lvl = "WARNING"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data,
-                    check_name=check_name,
-                    metric_name=metric_name,
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
-            else:
-                message = f"{metric_path} = {display_value}"
-        elif new_level == AlertLevel.CRITICAL:
-            lvl = "CRITICAL"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data,
-                    check_name=check_name,
-                    metric_name=metric_name,
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
-            else:
-                message = f"{metric_path} = {display_value}"
-        else:
-            lvl = "UNKNOWN"
-            message = f"{metric_path} = {display_value}"
+        # Format message — for the nagios operator there is no numeric threshold_value;
+        # render the display template whenever one is available.
+        has_display = threshold_value is not None or threshold.operator == ComparisonOperator.NAGIOS

-        # Return the formatted threshold info for storing in AlertState
-        formatted_threshold_msg = None
-        if threshold_value is not None and new_level != AlertLevel.OK:
-            formatted_threshold_msg = self._format_display(
+        def _fmt():
+            return self._format_display(
                 threshold.display,
                 value=display_value,
                 threshold_value=threshold_value,
@@ -1067,6 +1066,31 @@ class ThresholdChecker:
                 metric_name=metric_name,
             )

+        if new_level == AlertLevel.OK:
+            lvl = "RECOVER"
+            message = f"{short_path} = {display_value} ({old_level.name} -> OK)"
+        elif new_level == AlertLevel.WARNING:
+            lvl = "WARNING"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+        elif new_level == AlertLevel.CRITICAL:
+            lvl = "CRITICAL"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+        else:
+            lvl = "UNKNOWN"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+
+        # Formatted threshold info stored on AlertState for the UI
+        formatted_threshold_msg = _fmt() if has_display and new_level != AlertLevel.OK else None
+
         return lvl, message, formatted_threshold_msg

     def _send_notification(
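Alert messages in this hunk now use `short_path`, which is the metric path with the leading plugin name removed via `str.partition`. A small sketch of that expression in isolation is shown here; `short_metric_label` is a hypothetical wrapper, and the sample paths are made up for illustration.

```python
def short_metric_label(metric_path):
    """Drop everything up to the first dot; keep the path if there is no dot.

    Hypothetical wrapper around the expression used in the diff.
    """
    return metric_path.partition(".")[2] or metric_path


for path in ("nagios_runner.check_disk_root_status_code", "uptime", "a.b.c"):
    print(path, "->", short_metric_label(path))
```

Only the first dot is consumed, so a path with several segments keeps everything after the plugin name.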
@@ -1114,7 +1138,7 @@ class ThresholdChecker:
         self,
         display_format: str,
         value: Any,
-        threshold_value: float,
+        threshold_value: Optional[float],
         op_symbol: str,
         plugin_data: Optional[Dict[str, Any]] = None,
         check_name: Optional[str] = None,
@@ -1136,12 +1160,16 @@ class ThresholdChecker:
         Returns:
             Formatted display string
         """
+        if not display_format:
+            display_format = "(threshold: {op_symbol} {threshold_value})" if threshold_value is not None else ""
+
         # Build format context with standard variables
         format_context = {
             'value': value,
-            'threshold_value': threshold_value,
             'op_symbol': op_symbol,
         }
+        if threshold_value is not None:
+            format_context['threshold_value'] = threshold_value

         # Add generic-match context variables when available
         if check_name is not None:
@@ -1152,6 +1180,19 @@ class ThresholdChecker:
         # Add all plugin data fields if available
         if plugin_data:
             format_context.update(plugin_data)
 
+        # For nagios_runner generic matches, expose the matched check's output
+        # and status as short aliases {output} and {status} so display templates
+        # don't need to use the full {check_disk_root_output} form.
+        if check_name and plugin_data:
+            if 'output' not in format_context:
+                output = plugin_data.get(f"{check_name}_output")
+                if output is not None:
+                    format_context['output'] = output
+            if 'status' not in format_context:
+                status = plugin_data.get(f"{check_name}_status")
+                if status is not None:
+                    format_context['status'] = status
+
         try:
             # Format the display string
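The two hunks above change how the display string's format context is built: an empty template falls back to a generic one only when a threshold exists, `threshold_value` is added to the context only when it is not `None`, and `{output}` / `{status}` become short aliases for the matched check. The following is a minimal standalone sketch of that behaviour, not the project's method: the function name `build_display` is hypothetical, the generic-match variables and the `try` block that the diff elides are omitted, and the sample plugin data is invented for illustration.

```python
from typing import Any, Dict, Optional


def build_display(
    display_format: str,
    value: Any,
    threshold_value: Optional[float],
    op_symbol: str,
    plugin_data: Optional[Dict[str, Any]] = None,
    check_name: Optional[str] = None,
) -> str:
    """Hypothetical stand-in for the patched formatting logic."""
    # Fall back to a generic template only when there is a threshold to show.
    if not display_format:
        display_format = (
            "(threshold: {op_symbol} {threshold_value})"
            if threshold_value is not None
            else ""
        )

    context: Dict[str, Any] = {"value": value, "op_symbol": op_symbol}
    if threshold_value is not None:
        context["threshold_value"] = threshold_value
    if plugin_data:
        context.update(plugin_data)

    # Short aliases for the matched check; keys already present are kept.
    if check_name and plugin_data:
        for alias in ("output", "status"):
            if alias not in context:
                found = plugin_data.get(f"{check_name}_{alias}")
                if found is not None:
                    context[alias] = found

    return display_format.format(**context)


# Invented sample data in the nagios_runner key style shown in the diff.
sample = {
    "check_disk_root_output": "DISK CRITICAL - free space: / 1 GB",
    "check_disk_root_status": "CRITICAL",
}
print(build_display("", 91.5, 90.0, ">"))
print(build_display("{status}: {output}", 2, None, "==", sample, "check_disk_root"))
```

Because the aliases are added only when the key is absent, a plugin that already reports its own `output` or `status` field keeps that value.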
@@ -1303,7 +1344,8 @@ class ThresholdChecker:
 
                 # Format operator symbol
                 op_symbol = threshold.operator.value
+                short_path = metric_path.partition(".")[2] or metric_path
 
                 # Time to re-notify
                 if threshold_value is not None:
                     # Use display format string
@@ -1316,9 +1358,9 @@ class ThresholdChecker:
                         check_name=check_name,
                         metric_name=metric_name,
                     )
-                    message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
+                    message = f"REMINDER ({alert_state.level.name}): {host_name} - {short_path} = {value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
                 else:
-                    message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
+                    message = f"REMINDER ({alert_state.level.name}): {host_name} - {short_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
 
                 from . import hbdclass
                 host = hbdclass.Host.hosts.get(host_name)
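The reminder messages now print `short_path` instead of the full `metric_path`. A small sketch of what the `partition(".")[2] or metric_path` expression does is shown here; the helper name and the sample paths are hypothetical, and reading the first component as the plugin name is an assumption.

```python
def short_metric_path(metric_path: str) -> str:
    """Drop the first dotted component (assumed to be the plugin name)."""
    # partition returns (head, sep, tail); tail is "" when there is no "."
    # or nothing after it, in which case the original path is kept.
    return metric_path.partition(".")[2] or metric_path


print(short_metric_path("disk_monitor.root.percent_used"))  # root.percent_used
print(short_metric_path("uptime"))                          # uptime
```

The `or metric_path` fallback matters for single-component paths, which would otherwise be shortened to an empty string.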
+1
-2
@@ -336,8 +336,7 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
         # Apply user-access settings from config
         access = config_mod.get_host_access(cfg, uname)
         host.apply_access(access["owner"], access["managers"], access["monitors"])
-        if verbose:
-            print(("XX: New host, num now %s" % (len(hbdcls.Host.hosts))))
+        logger.info("New host signed on: %s (dyn=%s, access=%s)", uname, host.dyn, access)
         newh = True
     else:
         host = hbdcls.Host.hosts[uname]
+1
-1
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "hbd"
-version = "5.1.20"
+version = "5.2.1"
 description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
 readme = "README.md"
 requires-python = ">=3.11"
+1
-6
@@ -41,7 +41,7 @@ from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
 
 # updated by scripts/bumpminor.sh
-__version__ = "5.1.20"
+__version__ = "5.2.1"
 
 # ---------------------------------------------------------------------------
 # Protocol (mirrors hbd/common/proto.py)
@@ -388,7 +388,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
 
     async def _collect_metrics(self) -> Dict[str, Any]:
         results: Dict[str, Any] = {}
-        worst = 0
         for cmd_cfg in self.commands:
             name = cmd_cfg.get("name")
             command = cmd_cfg.get("command")
@@ -399,10 +398,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
             results[f"{name}_status_code"] = rc
             results[f"{name}_output"] = msg
             results.update({f"{name}_{k}": v for k, v in perf.items()})
-            worst = max(worst, rc)
-        results["overall_status"] = _NAGIOS_STATUS.get(worst, "UNKNOWN")
-        results["overall_status_code"] = worst
-        results["plugin_count"] = len(self.commands)
         return results
 
 
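With `overall_status`, `overall_status_code`, and `plugin_count` removed from the plugin's results, a consumer that still wants an aggregate would have to derive it from the per-check `*_status_code` keys. The sketch below shows one way to do that under stated assumptions: `summarize` and `NAGIOS_STATUS` are hypothetical names, the mapping uses the standard Nagios exit codes, and the `max` rule mirrors the removed code, so UNKNOWN (3) outranks CRITICAL (2).

```python
from typing import Any, Dict, Tuple

# Standard Nagios plugin exit codes.
NAGIOS_STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def summarize(results: Dict[str, Any]) -> Tuple[str, int]:
    """Return (status name, worst code) across all per-check status codes."""
    codes = [
        value
        for key, value in results.items()
        if key.endswith("_status_code") and isinstance(value, int)
    ]
    worst = max(codes, default=0)  # no checks at all counts as OK
    return NAGIOS_STATUS.get(worst, "UNKNOWN"), worst


# Invented sample results in the key style the plugin emits.
sample_results = {
    "check_load_status_code": 0,
    "check_load_output": "OK - load average: 0.10",
    "check_disk_status_code": 2,
    "check_disk_output": "DISK CRITICAL",
}
print(summarize(sample_results))  # ('CRITICAL', 2)
```

A perfdata label that happens to end in `_status_code` would also be picked up, so a stricter caller could pass the list of configured check names instead of scanning keys.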
+1
-2
@@ -68,8 +68,7 @@ async def test_nagios_runner():
     print(f" ✓ Collected {len(data)} data points")
 
     print(f"\n4. Results:")
-    print(f" Overall Status: {data.get('overall_status')} (code: {data.get('overall_status_code')})")
-    print(f" Plugins Executed: {data.get('plugin_count')}")
+    print(f" Data points collected: {len(data)}")
 
     # Show individual plugin results
     print(f"\n5. Individual Plugin Results:")