Compare commits
26 Commits
| SHA1 |
|---|
| 54fbd8d73d |
| 7ab17e26e2 |
| 28f5fa951c |
| 37f1c58969 |
| f006077a71 |
| d9fc8d632f |
| f640574e4f |
| 9a19424279 |
| ca8ba84e65 |
| f3d08d1c9e |
| 1e4263b793 |
| e931acb9f5 |
| 018409e71d |
| 1824f637b4 |
| a534c06b26 |
| d7b5c97a4e |
| ae447ac4a6 |
| d44ce3d124 |
| b1985d0eb2 |
| de778f680f |
| d7b368c7c6 |
| e790663f9f |
| 475319e248 |
| ca5ef384a8 |
| c93dbdc0f4 |
| 3a546a1e5c |
@@ -58,10 +58,11 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
|
||||
### Built-in Plugins
|
||||
|
||||
- `os_info`: Collects OS, kernel, distribution, and architecture information
|
||||
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
|
||||
- `memory_monitor`: Monitors RAM and swap usage, available memory
|
||||
- `cpu_monitor`: Monitors CPU usage, load average, frequency, process counts, and uptime
|
||||
- `memory_monitor`: Monitors RAM and swap usage, available memory (ZFS ARC-aware)
|
||||
- `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
|
||||
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
|
||||
- `ping_monitor`: Measures round-trip latency to configured hosts
|
||||
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
|
||||
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
|
||||
- `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`
|
||||
@@ -76,7 +77,7 @@ The `nagios_runner` plugin provides seamless integration with the vast Nagios pl
|
||||
- Validates absolute command paths at startup and warns on missing or non-executable files
|
||||
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
|
||||
- Extracts performance data with thresholds
|
||||
- Reports aggregated status across all configured checks
|
||||
- Reports per-check status, exit code, and output; no aggregate rollup field
|
||||
|
||||
See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for the complete integration guide, including configuration examples and custom plugin development.
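
Because `nagios_runner` reports per-check fields with no aggregate rollup, a consumer that wants a single overall status has to derive it from the `<name>_status_code` values. The following is a minimal sketch of that derivation, assuming the plugin data arrives as a flat dictionary; the function name `worst_nagios_status` and the sample data are illustrative and not part of Heartbeat.

```python
from typing import Any, Mapping

# Nagios exit codes and their conventional names.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def worst_nagios_status(data: Mapping[str, Any]) -> int:
    """Return the highest exit code found in any '<name>_status_code' field."""
    worst = 0
    for key, value in data.items():
        # Only integer status codes count; bool is excluded because it subclasses int.
        if (
            key.endswith("_status_code")
            and isinstance(value, int)
            and not isinstance(value, bool)
            and value > worst
        ):
            worst = value
    return worst


# Hypothetical plugin payload for two checks.
sample = {
    "check_load_status": "OK",
    "check_load_status_code": 0,
    "check_disk_status": "WARNING",
    "check_disk_status_code": 1,
    "check_disk_output": "DISK WARNING - free space: / 15%",
}

print(STATUS_NAMES.get(worst_nagios_status(sample), "UNKNOWN"))
```

Note that this treats the numerically largest code as the worst, so UNKNOWN (3) outranks CRITICAL (2); a different ordering is a design choice the caller would need to make explicitly.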
|
||||
|
||||
@@ -181,7 +182,8 @@ thresholds:
|
||||
warning: 80.0 # Warn when CPU > 80%
|
||||
critical: 90.0 # Critical when CPU > 90%
|
||||
operator: ">"
|
||||
hysteresis: 0.1 # 10% hysteresis to prevent flapping
|
||||
hysteresis: 0.02 # 2% hysteresis to prevent flapping
|
||||
display: "(threshold: {op_symbol} {threshold_value}%)" # optional
|
||||
|
||||
memory_monitor:
|
||||
percent:
|
||||
@@ -223,7 +225,7 @@ thresholds:
|
||||
<hostname>:
|
||||
warning: <milliseconds> # Warn when RTT > this value
|
||||
critical: <milliseconds> # Critical when RTT > this value
|
||||
hysteresis: 0.1 # Optional: 10% hysteresis (default)
|
||||
hysteresis: 0.02 # Optional: 2% hysteresis (default)
|
||||
```
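
To make the hysteresis fraction concrete, the sketch below computes the value a metric has to cross back over before an alert clears. It assumes the recovery point is offset from the threshold by `abs(threshold * hysteresis)`, on the side opposite the alert direction; the function name `recovery_threshold` is hypothetical.

```python
from typing import Optional


def recovery_threshold(threshold: float, hysteresis: float, operator: str) -> Optional[float]:
    """Return the recovery point for a threshold, or None if none applies."""
    if not hysteresis:
        # A hysteresis of 0 means the alert clears at the threshold itself.
        return None
    margin = abs(threshold * hysteresis)
    if operator in (">", ">="):
        # Alert fires above the threshold, so recovery sits below it.
        return round(threshold - margin, 4)
    if operator in ("<", "<="):
        # Alert fires below the threshold, so recovery sits above it.
        return round(threshold + margin, 4)
    # Equality-style operators have no meaningful recovery band.
    return None


# A CPU warning at 80% with 2% hysteresis clears once usage drops below 78.4%.
print(recovery_threshold(80.0, 0.02, ">"))
```

With the larger 0.1 fraction, the same 80% threshold would not clear until usage fell below 72%, which is why a smaller fraction makes alerts recover sooner.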
|
||||
|
||||
**Example alerts:**
|
||||
@@ -274,7 +276,59 @@ All plugin metrics can be thresholded:
|
||||
- **Memory**: percent, available_mb, swap_percent
|
||||
- **Disk**: Per-partition percent, free_gb, free_mb
|
||||
- **Network**: errors_total, dropped packets, connection counts
|
||||
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
|
||||
- **Nagios**: Any field emitted by `nagios_runner` (`<name>_status_code`, `<name>_status`, `<name>_output`, performance data fields)
|
||||
|
||||
### Display Format Templates

Each threshold entry accepts an optional `display` field, a Python format string shown in notifications and on the Alerts dashboard:

```yaml
nagios_runner:
  status_code:
    warning: 1
    critical: 2
    operator: ">="
    display: "{check_name}: exit {value} (expected < {threshold_value})"
```
|
||||
|
||||
Available variables:

| Variable | Description |
|---|---|
| `{value}` | Current metric value |
| `{threshold_value}` | Threshold that was crossed |
| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …); `"nagios"` for the nagios operator |
| `{check_name}` | Prefix stripped by generic matching (see below) |
| `{metric_name}` | Full field name within the plugin data |
| `{output}` | For `nagios_runner` generic matches: the matched check's status text (alias for `{check_name}_output`) |
| `{status}` | For `nagios_runner` generic matches: the matched check's status name, one of OK/WARNING/CRITICAL/UNKNOWN (alias for `{check_name}_status`) |
| any plugin field | Any other field present in the plugin's data |
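
Since `display` is a Python format string, the variables listed above can be thought of as keyword arguments to `str.format`. The sketch below shows that substitution under stated assumptions: the helper name `render_display`, the sample field values, and the fallback of returning the raw template on a formatting error are all illustrative and not confirmed server behavior.

```python
from typing import Any, Dict


def render_display(template: str, fields: Dict[str, Any]) -> str:
    """Fill a display template from a dict of variables."""
    try:
        return template.format(**fields)
    except (KeyError, IndexError, ValueError):
        # Assumption: on an unknown variable or malformed template,
        # fall back to showing the template text unchanged.
        return template


# Hypothetical values for a failing disk check.
fields = {
    "value": 2,
    "threshold_value": 2,
    "op_symbol": ">=",
    "check_name": "check_disk_root",
    "metric_name": "check_disk_root_status_code",
    "output": "DISK CRITICAL - free space: / 5%",
    "status": "CRITICAL",
}

print(render_display("{check_name}: exit {value} (expected < {threshold_value})", fields))
print(render_display("{check_name}: {output}", fields))
```

Extra keys in `fields` are harmless, which matches the idea that any other plugin field may be referenced by a template.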
|
||||
|
||||
### Generic Threshold Matching

When a metric name has no exact threshold entry, the server progressively strips leading underscore-separated segments and re-tries the lookup. This lets a single generic entry cover an entire family of metrics.

The classic use case is `nagios_runner`, which names each metric after the command that produced it:

```
nagios_runner.check_disk_root_status_code  → no exact match
nagios_runner.disk_root_status_code        → no match
nagios_runner.root_status_code             → no match
nagios_runner.status_code                  → matched ✓
```
|
||||
|
||||
Configure the generic threshold once using the `nagios` operator, which maps exit codes directly to alert severity without requiring numeric warning/critical values:

```yaml
nagios_runner:
  status_code:
    operator: "nagios"   # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
    display: "{check_name}: {output}"
```

The stripped prefix (`check_disk_root` in the example above) is available as `{check_name}` in the display template, so you can identify which check triggered the alert without writing a separate threshold entry per command.

Exact matches always take priority. A generic entry only applies when no specific one is defined.
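
The lookup described in this section can be sketched as a loop that drops one leading segment per attempt. This is an illustration under assumptions, not the server's code: the function name `find_threshold` and the nested-dict layout of `thresholds` are hypothetical.

```python
from typing import Dict, Optional, Tuple


def find_threshold(
    thresholds: Dict[str, Dict[str, dict]], plugin: str, metric: str
) -> Optional[Tuple[dict, str]]:
    """Return (threshold entry, stripped prefix), or None if nothing matches."""
    plugin_entries = thresholds.get(plugin, {})
    segments = metric.split("_")
    # start == 0 is the exact name, so exact matches are always tried first.
    for start in range(len(segments)):
        candidate = "_".join(segments[start:])
        entry = plugin_entries.get(candidate)
        if entry is not None:
            check_name = "_".join(segments[:start])  # empty for an exact match
            return entry, check_name
    return None


thresholds = {"nagios_runner": {"status_code": {"operator": "nagios"}}}
entry, check_name = find_threshold(thresholds, "nagios_runner", "check_disk_root_status_code")
print(check_name, entry)
```

For the metric in the trace above, the loop tries `check_disk_root_status_code`, `disk_root_status_code`, and `root_status_code` before matching `status_code`, leaving `check_disk_root` as the stripped prefix.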
|
||||
|
||||
### Per-Host Threshold Profiles
|
||||
|
||||
@@ -453,6 +507,9 @@ hbc --boot your-server.example.com
|
||||
|
||||
# Verbose output
|
||||
hbc -v your-server.example.com
|
||||
|
||||
# Send 'boot' and 'shutdown' messages on start and exit
|
||||
hbc -b your-server.example.com
|
||||
```
|
||||
|
||||
You can also run it via the module entrypoint:
|
||||
@@ -461,12 +518,11 @@ You can also run it via the module entrypoint:
|
||||
python -m hbd.client.main your-server.example.com
|
||||
```
|
||||
|
||||
Client configuration can also be specified in YAML:
|
||||
Client configuration can also be specified in YAML (`~/.hbc.yaml`):
|
||||
|
||||
```yaml
|
||||
server: hbd.example.com
|
||||
port: 50003
|
||||
interval: 30
|
||||
hb_port: 50003 # Server port (default: 50003)
|
||||
interval: 30 # Heartbeat interval in seconds
|
||||
plugins:
|
||||
cpu_monitor:
|
||||
interval: 300 # Check every 5 minutes (default)
|
||||
@@ -480,10 +536,14 @@ plugins:
|
||||
nagios_runner:
|
||||
interval: 300 # Check every 5 minutes (default)
|
||||
commands:
|
||||
- /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
|
||||
- /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
|
||||
- name: check_load
|
||||
command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
|
||||
- name: check_disk
|
||||
command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
|
||||
```
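
The `commands` example moves from bare command strings to entries with explicit `name` and `command` keys. For an existing list in the older string form, a conversion along the following lines may help. This is a hypothetical helper, not something shipped with `hbc`, and deriving the name from the executable's basename is an assumption.

```python
import os
import shlex
from typing import Dict, List


def to_named_commands(commands: List[str]) -> List[Dict[str, str]]:
    """Convert bare command strings into name/command entries."""
    named = []
    for command in commands:
        # The first shell token is the executable path.
        executable = shlex.split(command)[0]
        # Assumption: use the executable's basename as the check name.
        named.append({"name": os.path.basename(executable), "command": command})
    return named


old_style = [
    "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6",
    "/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /",
]
for item in to_named_commands(old_style):
    print(item)
```

If the same plugin is run twice with different arguments, the derived names collide, so those entries would need distinct names chosen by hand.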
|
||||
|
||||
The server hostname is always passed as a positional command-line argument; there is no `server:` config key.
|
||||
|
||||
All monitoring plugins default to 5-minute (300-second) intervals but can be customized as needed.
|
||||
|
||||
**Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
|
||||
|
||||
@@ -104,11 +104,6 @@ The `nagios_runner` plugin collects:
|
||||
- `{name}_{metric}_min` - Minimum value (if present)
|
||||
- `{name}_{metric}_max` - Maximum value (if present)
|
||||
|
||||
**Overall:**
|
||||
- `overall_status` - Worst status from all commands
|
||||
- `overall_status_code` - Worst status code
|
||||
- `plugin_count` - Number of Nagios plugins executed
|
||||
|
||||
## Configuration Options
|
||||
|
||||
```yaml
|
||||
|
||||
@@ -1110,33 +1110,6 @@ hosts:
|
||||
db-02:
|
||||
threshold_config: [tight_memory, db_disk]
|
||||
```
|
||||
|
||||
### Backward Compatibility
|
||||
|
||||
The legacy single threshold configuration is fully supported:
|
||||
|
||||
```yaml
|
||||
# Old format - still works
|
||||
thresholds:
|
||||
cpu_monitor:
|
||||
cpu_percent:
|
||||
warning: 80.0
|
||||
critical: 90.0
|
||||
```
|
||||
|
||||
This is equivalent to:
|
||||
|
||||
```yaml
|
||||
# New format
|
||||
threshold_configs:
|
||||
default:
|
||||
thresholds:
|
||||
cpu_monitor:
|
||||
cpu_percent:
|
||||
warning: 80.0
|
||||
critical: 90.0
|
||||
```
|
||||
|
||||
### Configuration Priority
|
||||
|
||||
1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
|
||||
|
||||
+1
-1
@@ -14,4 +14,4 @@ Install options:
|
||||
"""
|
||||
|
||||
__all__ = ["__version__"]
|
||||
__version__ = "5.1.17"
|
||||
__version__ = "5.2.3"
|
||||
|
||||
+15
-14
@@ -21,6 +21,7 @@ from typing import Dict, List, Optional
|
||||
# Import protocol and config
|
||||
from .config import load_config
|
||||
from ..common.proto import dicttos, stodict
|
||||
from .. import __version__
|
||||
|
||||
# Import plugin system
|
||||
from .plugin import PluginRegistry, PluginLoader, InfoPlugin, MonitorPlugin
|
||||
@@ -172,9 +173,8 @@ class HeartbeatProtocol(asyncio.DatagramProtocol):
|
||||
self.logger.error(f"Error processing datagram: {e}", exc_info=True)
|
||||
|
||||
def error_received(self, exc):
|
||||
"""Handle protocol errors."""
|
||||
self.logger.warning(f"Protocol error on {self.connection.addr}: {exc} — dropping connection")
|
||||
self.connection._dead = True
|
||||
"""Handle protocol errors — close transport so the heartbeat sender retries."""
|
||||
self.logger.warning(f"Protocol error on {self.connection.addr}: {exc} — will retry")
|
||||
self.connection.close()
|
||||
|
||||
|
||||
@@ -463,16 +463,13 @@ async def cleanup(connections: List[AsyncConnection]):
|
||||
logger = logging.getLogger("hbc.cleanup")
|
||||
logger.info("Cleaning up connections")
|
||||
|
||||
for conn in connections:
|
||||
target = next((c for c in connections if c.transport), connections[0] if connections else None)
|
||||
if target and send_shutdown:
|
||||
try:
|
||||
msg = {
|
||||
"shutdown": 1,
|
||||
"acks": conn.ackcount
|
||||
}
|
||||
await conn.sendto(msg)
|
||||
await target.sendto({"shutdown": 1, "acks": target.ackcount})
|
||||
except Exception as e:
|
||||
logger.error(f"Error sending shutdown: {e}")
|
||||
|
||||
for conn in connections:
|
||||
conn.close()
|
||||
|
||||
# Give messages time to send
|
||||
@@ -481,7 +478,7 @@ async def cleanup(connections: List[AsyncConnection]):
|
||||
|
||||
async def async_main(args, config):
|
||||
"""Async main function."""
|
||||
global running, shutdown_event, active_tasks
|
||||
global running, shutdown_event, active_tasks, send_shutdown
|
||||
|
||||
# Create shutdown event
|
||||
shutdown_event = asyncio.Event()
|
||||
@@ -498,6 +495,7 @@ async def async_main(args, config):
|
||||
hb_port = config.get("hb_port", PORT)
|
||||
interval = config.get("interval", INTERVAL)
|
||||
|
||||
logger.info(f"hbc {__version__} starting on {iam}")
|
||||
logger.info(f"Starting hbc for {iam} -> {hb_hosts}")
|
||||
logger.info(f"Port: {hb_port}, Interval: {interval}s")
|
||||
|
||||
@@ -529,17 +527,20 @@ async def async_main(args, config):
|
||||
logger.info(f"Created {len(connections)} connections")
|
||||
|
||||
# Send boot/message if requested
|
||||
send_shutdown = False
|
||||
if args.boot or args.message:
|
||||
boot_msg = {}
|
||||
if args.boot:
|
||||
boot_msg["boot"] = 1
|
||||
args.boot = False # Clear boot flag so we don't send it again in main loop
|
||||
send_shutdown = True
|
||||
if args.message:
|
||||
boot_msg["service"] = "service"
|
||||
boot_msg["msg"] = args.message
|
||||
|
||||
boot_msg["acks"] = 0
|
||||
for conn in connections:
|
||||
await conn.sendto(boot_msg)
|
||||
target = next((c for c in connections if c.transport), connections[0])
|
||||
await target.sendto(boot_msg)
|
||||
|
||||
if args.message and not args.daemon:
|
||||
# Message-only mode
|
||||
@@ -739,7 +740,7 @@ def main(argv=None):
|
||||
|
||||
# Daemonize if requested
|
||||
if args.daemon:
|
||||
print("Daemonizing...")
|
||||
logging.info("Daemonizing...")
|
||||
daemonize()
|
||||
_reconfigure_logging_for_daemon(log_level)
|
||||
logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
|
||||
|
||||
@@ -119,6 +119,13 @@ class CPUMonitorPlugin(MonitorPlugin):
|
||||
except Exception as e:
|
||||
self.logger.debug(f"Could not get CPU times: {e}")
|
||||
|
||||
# Uptime in seconds
|
||||
try:
|
||||
import time
|
||||
data["uptime_seconds"] = int(time.time() - self.psutil.boot_time())
|
||||
except Exception as e:
|
||||
self.logger.debug(f"Could not get uptime: {e}")
|
||||
|
||||
self.logger.debug(
|
||||
f"Collected CPU metrics: {data.get('cpu_percent', 'N/A')}% usage"
|
||||
)
|
||||
|
||||
@@ -14,6 +14,24 @@ except ImportError:
|
||||
|
||||
from hbd.client.plugin import MonitorPlugin
|
||||
|
||||
|
||||
def _zfs_arc_bytes() -> int:
|
||||
"""Return current ZFS ARC size in bytes, or 0 if ZFS is not present.
|
||||
|
||||
ZFS ARC is reclaimable but is not included in MemAvailable by the Linux
|
||||
kernel (it is not in SReclaimable), so it would otherwise be counted as
|
||||
used memory.
|
||||
"""
|
||||
try:
|
||||
with open("/proc/spl/kstat/zfs/arcstats") as fh:
|
||||
for line in fh:
|
||||
parts = line.split()
|
||||
if len(parts) >= 3 and parts[0] == "size":
|
||||
return int(parts[2])
|
||||
except (OSError, ValueError):
|
||||
pass
|
||||
return 0
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@@ -101,11 +119,21 @@ class MemoryMonitorPlugin(MonitorPlugin):
|
||||
|
||||
# Virtual (physical) memory statistics
|
||||
vmem = psutil.virtual_memory()
|
||||
|
||||
# psutil's available already excludes page cache / file buffers
|
||||
# (uses MemAvailable on Linux). Add ZFS ARC on top because the kernel
|
||||
# does not include it in SReclaimable / MemAvailable even though it is
|
||||
# reclaimable.
|
||||
arc_bytes = _zfs_arc_bytes()
|
||||
available = min(vmem.available + arc_bytes, vmem.total)
|
||||
used = vmem.total - available
|
||||
percent = round(used / vmem.total * 100, 1) if vmem.total else 0.0
|
||||
|
||||
metrics['memory_total'] = vmem.total
|
||||
metrics['memory_available'] = vmem.available
|
||||
metrics['memory_used'] = vmem.used
|
||||
metrics['memory_available'] = available
|
||||
metrics['memory_used'] = used
|
||||
metrics['memory_free'] = vmem.free
|
||||
metrics['memory_percent'] = vmem.percent
|
||||
metrics['memory_percent'] = percent
|
||||
|
||||
# Platform-specific memory details
|
||||
if hasattr(vmem, 'active'):
|
||||
|
||||
@@ -31,16 +31,13 @@ from hbd.client.plugin import MonitorPlugin
|
||||
|
||||
|
||||
# Nagios exit codes
|
||||
NAGIOS_OK = 0
|
||||
NAGIOS_WARNING = 1
|
||||
NAGIOS_CRITICAL = 2
|
||||
NAGIOS_UNKNOWN = 3
|
||||
|
||||
STATUS_NAMES = {
|
||||
NAGIOS_OK: "OK",
|
||||
NAGIOS_WARNING: "WARNING",
|
||||
NAGIOS_CRITICAL: "CRITICAL",
|
||||
NAGIOS_UNKNOWN: "UNKNOWN"
|
||||
0: "OK",
|
||||
1: "WARNING",
|
||||
2: "CRITICAL",
|
||||
3: "UNKNOWN",
|
||||
}
|
||||
|
||||
|
||||
@@ -129,9 +126,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
|
||||
"""
|
||||
results = {}
|
||||
|
||||
# Track overall status (worst status wins)
|
||||
worst_status = NAGIOS_OK
|
||||
|
||||
for cmd_config in self.commands:
|
||||
name = cmd_config.get("name")
|
||||
command = cmd_config.get("command")
|
||||
@@ -149,10 +143,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
|
||||
results[f"{name}_status_code"] = status_code
|
||||
results[f"{name}_output"] = output
|
||||
|
||||
# Track worst status
|
||||
if status_code > worst_status:
|
||||
worst_status = status_code
|
||||
|
||||
# Parse and add performance data
|
||||
if perfdata:
|
||||
for metric_name, metric_value in perfdata.items():
|
||||
@@ -167,12 +157,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
|
||||
results[f"{name}_status"] = "ERROR"
|
||||
results[f"{name}_status_code"] = NAGIOS_UNKNOWN
|
||||
results[f"{name}_output"] = str(e)
|
||||
worst_status = NAGIOS_UNKNOWN
|
||||
|
||||
# Add overall status
|
||||
results["overall_status"] = STATUS_NAMES.get(worst_status, "UNKNOWN")
|
||||
results["overall_status_code"] = worst_status
|
||||
results["plugin_count"] = len(self.commands)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
@@ -13,12 +13,8 @@ plugins:
|
||||
count: 3 # ICMP packets per ping run (default 3)
|
||||
timeout: 5 # seconds before a host is considered unreachable (default 5)
|
||||
hosts:
|
||||
8.8.8.8:
|
||||
warning: 20.0 # ms
|
||||
critical: 100.0 # ms
|
||||
192.168.1.1:
|
||||
warning: 5.0
|
||||
critical: 20.0
|
||||
- 8.8.8.8
|
||||
- 192.168.1.1
|
||||
```
|
||||
|
||||
Reported metrics per host (metric key uses the hostname with dots/colons replaced
|
||||
|
||||
@@ -95,6 +95,12 @@ THRESHOLD_DEFAULTS = {
|
||||
'warning': 200,
|
||||
'critical': 250.0,
|
||||
'count': 3 # Optional: number of consecutive breaches before alerting
|
||||
},
|
||||
'nagios_runner': {
|
||||
'status_code': {
|
||||
'display': '{check_name} {output}',
|
||||
'operator': "nagios"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
+1
-1
@@ -890,7 +890,7 @@ async def start(
|
||||
tmpl = env.get_template("settings.html")
|
||||
body = tmpl.render(
|
||||
title="Settings - Heartbeat",
|
||||
sections=settings_mod.get_settings_sections(config),
|
||||
sections=settings_mod.get_settings_sections(config, threshold_checker=threshold_checker),
|
||||
current_user=current_user.to_dict() if current_user else None,
|
||||
active_page="settings",
|
||||
)
|
||||
|
||||
@@ -255,6 +255,7 @@ async def _run_async(config, config_path=None):
|
||||
config=config,
|
||||
hbdclass=hbdclass,
|
||||
tcss=None,
|
||||
threshold_checker=threshold_checker,
|
||||
verbose=config.get("verbose", False),
|
||||
get_now=lambda: time.time(),
|
||||
VER="",
|
||||
@@ -474,6 +475,8 @@ def run(config, config_path=None):
|
||||
if config.get("debug", 0) > 0:
|
||||
log_level = logging.DEBUG
|
||||
logging.basicConfig(level=log_level)
|
||||
if not config.get("debug", 0):
|
||||
logging.getLogger("aiohttp.access").propagate = False
|
||||
load_pickled_hosts(config, hbdclass)
|
||||
|
||||
notify_mod.initlog(logfile=config.get("logfile", "messages.log"))
|
||||
|
||||
+30
-37
@@ -88,7 +88,7 @@ def _sanitize_channel(name, cfg):
|
||||
# Public API
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def get_settings_sections(config: dict) -> list:
|
||||
def get_settings_sections(config: dict, threshold_checker=None) -> list:
|
||||
"""Return ordered list of setting sections for the settings page.
|
||||
|
||||
Each section:
|
||||
@@ -182,46 +182,39 @@ def get_settings_sections(config: dict) -> list:
|
||||
})
|
||||
|
||||
# ---- Threshold configurations -----------------------------------------
|
||||
def _parse_metric_row(metric_path, metric_cfg):
|
||||
if not isinstance(metric_cfg, dict):
|
||||
return None
|
||||
def _tc_to_row(tc):
|
||||
return {
|
||||
"metric": metric_path,
|
||||
"operator": metric_cfg.get("operator", ">"),
|
||||
"warning": metric_cfg.get("warning"),
|
||||
"critical": metric_cfg.get("critical"),
|
||||
"hysteresis": metric_cfg.get("hysteresis"),
|
||||
"count": metric_cfg.get("count", 1),
|
||||
"enabled": metric_cfg.get("enabled", True),
|
||||
"metric": tc.metric_path,
|
||||
"operator": tc.operator.value,
|
||||
"warning": tc.warning,
|
||||
"critical": tc.critical,
|
||||
"hysteresis": tc.hysteresis,
|
||||
"count": tc.count,
|
||||
"enabled": tc.enabled,
|
||||
}
|
||||
|
||||
threshold_config_list = []
|
||||
raw_tconfigs = config.get("threshold_configs") or {}
|
||||
if raw_tconfigs:
|
||||
for cfg_name, cfg_data in sorted(raw_tconfigs.items()):
|
||||
if not isinstance(cfg_data, dict):
|
||||
continue
|
||||
metrics = [
|
||||
r for r in (
|
||||
_parse_metric_row(mp, mc)
|
||||
for mp, mc in (cfg_data.get("thresholds") or {}).items()
|
||||
) if r
|
||||
]
|
||||
threshold_config_list.append({
|
||||
"name": cfg_name,
|
||||
"metrics": sorted(metrics, key=lambda m: m["metric"]),
|
||||
})
|
||||
elif config.get("thresholds"):
|
||||
metrics = [
|
||||
r for r in (
|
||||
_parse_metric_row(mp, mc)
|
||||
for mp, mc in config["thresholds"].items()
|
||||
) if r
|
||||
]
|
||||
threshold_config_list.append({
|
||||
"name": "default",
|
||||
"metrics": sorted(metrics, key=lambda m: m["metric"]),
|
||||
})
|
||||
if threshold_checker is not None:
|
||||
if threshold_checker.threshold_configs:
|
||||
for cfg_name, cfg_metrics in sorted(threshold_checker.threshold_configs.items()):
|
||||
# For the default config use the merged effective set;
|
||||
# for named overrides use only the explicitly defined metrics
|
||||
# (threshold_raw_configs) so inherited defaults are not repeated.
|
||||
if cfg_name == "default":
|
||||
display_metrics = cfg_metrics
|
||||
else:
|
||||
display_metrics = threshold_checker.threshold_raw_configs.get(cfg_name, cfg_metrics)
|
||||
metrics = sorted(
|
||||
[_tc_to_row(tc) for tc in display_metrics.values()],
|
||||
key=lambda m: m["metric"],
|
||||
)
|
||||
threshold_config_list.append({"name": cfg_name, "metrics": metrics})
|
||||
elif threshold_checker.thresholds:
|
||||
metrics = sorted(
|
||||
[_tc_to_row(tc) for tc in threshold_checker.thresholds.values()],
|
||||
key=lambda m: m["metric"],
|
||||
)
|
||||
threshold_config_list.append({"name": "default", "metrics": metrics})
|
||||
|
||||
# ---- Hosts summary ----------------------------------------------------
|
||||
hosts_list = []
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
|
||||
<style>
|
||||
|
||||
body {
|
||||
html, body {
|
||||
height: auto;
|
||||
overflow-y: auto;
|
||||
}
|
||||
@@ -175,14 +175,18 @@
|
||||
|
||||
.alert-hostname {
|
||||
font-weight: bold;
|
||||
color: #333;
|
||||
color: #0066cc;
|
||||
font-size: 1.1em;
|
||||
text-decoration: none;
|
||||
}
|
||||
.alert-hostname:hover {
|
||||
text-decoration: underline;
|
||||
}
|
||||
|
||||
.alert-metric {
|
||||
color: #666;
|
||||
font-family: 'Courier New', monospace;
|
||||
font-size: 0.9em;
|
||||
color: #0066cc;
|
||||
font-size: 1.1em;
|
||||
font-weight: normal;
|
||||
}
|
||||
|
||||
.alert-details {
|
||||
@@ -405,6 +409,10 @@
|
||||
} else if (alert.threshold_value !== undefined && alert.threshold_value !== null && alert.operator) {
|
||||
valueText += ` <span class="threshold-info">(threshold: ${alert.operator} ${formatValue(alert.threshold_value)})</span>`;
|
||||
}
|
||||
if (alert.recovery_threshold !== undefined && alert.recovery_threshold !== null) {
|
||||
const recOp = (alert.operator === '>' || alert.operator === '>=') ? '<' : '>';
|
||||
valueText += ` <span class="threshold-info" style="color:#888">(recovers ${recOp} ${formatValue(alert.recovery_threshold)})</span>`;
|
||||
}
|
||||
|
||||
// Build actions section
|
||||
let actionsHtml = '';
|
||||
@@ -429,9 +437,9 @@
|
||||
<div class="alert-main">
|
||||
<div class="alert-header">
|
||||
<span class="alert-level ${level}">${alert.level}</span>
|
||||
<span class="alert-hostname">${alert.hostname}</span>
|
||||
<a class="alert-hostname" href="/plugins#${alert.hostname}">${alert.hostname}</a>
|
||||
<span class="alert-metric">${alert.metric_path.includes('.') ? alert.metric_path.slice(alert.metric_path.indexOf('.') + 1) : alert.metric_path}</span>
|
||||
</div>
|
||||
<div class="alert-metric">${alert.metric_path}</div>
|
||||
<div class="alert-details">
|
||||
<span>${valueText}</span>
|
||||
<span class="alert-duration">Active for ${duration}</span>
|
||||
|
||||
@@ -152,6 +152,31 @@
|
||||
}
|
||||
.host-action-btn.delete-btn:hover { background: #ffcdd2; }
|
||||
|
||||
/* ── Action result toast ───────────────────────────────────── */
|
||||
#action-toast {
|
||||
position: fixed;
|
||||
bottom: 24px;
|
||||
left: 50%;
|
||||
transform: translateX(-50%) translateY(20px);
|
||||
background: #323232;
|
||||
color: #fff;
|
||||
padding: 12px 22px;
|
||||
border-radius: 6px;
|
||||
font-size: 0.9em;
|
||||
max-width: 480px;
|
||||
text-align: center;
|
||||
opacity: 0;
|
||||
pointer-events: none;
|
||||
transition: opacity 0.25s, transform 0.25s;
|
||||
z-index: 9000;
|
||||
white-space: pre-wrap;
|
||||
}
|
||||
#action-toast.show {
|
||||
opacity: 1;
|
||||
transform: translateX(-50%) translateY(0);
|
||||
}
|
||||
#action-toast.error { background: #c62828; }
|
||||
|
||||
/* ── Host body ──────────────────────────────────────────────── */
|
||||
|
||||
.host-body {
|
||||
@@ -401,12 +426,10 @@
|
||||
{% endif %}
|
||||
<span class="os-label" id="os-label-{{ host.name }}"></span>
|
||||
{% if host.is_owner %}
|
||||
<a class="host-action-btn update-btn"
|
||||
href="/u?h={{ host.name }}"
|
||||
onclick="event.stopPropagation()">Update</a>
|
||||
<a class="host-action-btn delete-btn"
|
||||
href="/d?h={{ host.name }}"
|
||||
onclick="event.stopPropagation(); return confirm('Delete host {{ host.name }}?')">Delete</a>
|
||||
<button class="host-action-btn update-btn"
|
||||
onclick="event.stopPropagation(); hostAction(this, '/u?h={{ host.name }}')">Update</button>
|
||||
<button class="host-action-btn delete-btn"
|
||||
onclick="event.stopPropagation(); hostDelete(this, '{{ host.name }}')">Delete</button>
|
||||
{% endif %}
|
||||
</div>
|
||||
</div>
|
||||
@@ -476,6 +499,17 @@
|
||||
return pluginCache[hostname]?.[pluginName] ?? null;
|
||||
}
|
||||
|
||||
// Return worst nagios exit code (0-3) found in a nagios_runner data object.
|
||||
function nagiosWorstStatus(data) {
|
||||
let worst = 0;
|
||||
for (const [k, v] of Object.entries(data || {})) {
|
||||
if (k.endsWith('_status_code') && typeof v === 'number' && v > worst) {
|
||||
worst = v;
|
||||
}
|
||||
}
|
||||
return worst;
|
||||
}
|
||||
|
||||
// ── Fetch helpers ───────────────────────────────────────────────────────
|
||||
|
||||
async function fetchPlugin(hostname, pluginName) {
|
||||
@@ -577,13 +611,13 @@
|
||||
? chips.join('')
|
||||
: '<span class="glance-loading">—</span>';
|
||||
|
||||
// Nagios badge
|
||||
// Nagios badge — derive worst status from individual check codes
|
||||
const nagios = getCache(hostname, 'nagios_runner');
|
||||
if (nagosBadge && nagios) {
|
||||
const status = (nagios.data.overall_status || '—').toUpperCase();
|
||||
const cls = status === 'OK' ? 'ok'
|
||||
: status === 'WARNING' ? 'warning'
|
||||
: status === 'CRITICAL' ? 'critical' : '';
|
||||
const worst = nagiosWorstStatus(nagios.data);
|
||||
const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
|
||||
const status = names[worst] || '—';
|
||||
const cls = worst === 0 ? 'ok' : worst === 1 ? 'warning' : worst >= 2 ? 'critical' : '';
|
||||
nagosBadge.className = `nagios-badge ${cls}`;
|
||||
nagosBadge.textContent = status;
|
||||
}
|
||||
@@ -692,9 +726,10 @@
|
||||
break;
|
||||
}
|
||||
case 'nagios_runner': {
|
||||
const status = (d.overall_status || '?').toUpperCase();
|
||||
const count = d.plugin_count;
|
||||
text = status + (count != null ? ` — ${count} checks` : '');
|
||||
const worst = nagiosWorstStatus(d);
|
||||
const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
|
||||
const codes = Object.keys(d).filter(k => k.endsWith('_status_code'));
|
||||
text = (names[worst] || '?') + (codes.length ? ` — ${codes.length} checks` : '');
|
||||
break;
|
||||
}
|
||||
case 'filesystem_info': {
|
||||
@@ -1204,6 +1239,49 @@
|
||||
fetchHostGlance(first.dataset.hostname);
|
||||
}
|
||||
});
|
||||
// ── Host action helpers ──────────────────────────────────────
|
||||
|
||||
let _toastTimer = null;
|
||||
function showToast(msg, isError) {
|
||||
const t = document.getElementById('action-toast');
|
||||
t.textContent = msg;
|
||||
t.classList.toggle('error', !!isError);
|
||||
t.classList.add('show');
|
||||
clearTimeout(_toastTimer);
|
||||
_toastTimer = setTimeout(() => t.classList.remove('show'), 4000);
|
||||
}
|
||||
|
||||
async function hostAction(btn, url) {
|
||||
btn.disabled = true;
|
||||
try {
|
||||
const res = await fetch(url);
|
||||
const text = await res.text();
|
||||
showToast(text, !res.ok);
|
||||
} catch (e) {
|
||||
showToast('Request failed: ' + e.message, true);
|
||||
} finally {
|
||||
btn.disabled = false;
|
||||
}
|
||||
}
|
||||
|
||||
async function hostDelete(btn, hostname) {
|
||||
if (!confirm('Delete host ' + hostname + '?')) return;
|
||||
btn.disabled = true;
|
||||
try {
|
||||
const res = await fetch('/d?h=' + encodeURIComponent(hostname));
|
||||
const text = await res.text();
|
||||
showToast(text, !res.ok);
|
||||
if (res.ok) {
|
||||
const card = document.querySelector(`.host-card[data-hostname="${hostname}"]`);
|
||||
if (card) card.remove();
|
||||
}
|
||||
} catch (e) {
|
||||
showToast('Request failed: ' + e.message, true);
|
||||
btn.disabled = false;
|
||||
}
|
||||
}
|
||||
</script>
|
||||
|
||||
<div id="action-toast"></div>
|
||||
</body>
|
||||
</html>
|
||||
|
||||
+186
-79
@@ -30,12 +30,13 @@ class AlertLevel(Enum):
|
||||
|
||||
class ComparisonOperator(Enum):
|
||||
"""Supported comparison operators for threshold checks."""
|
||||
GT = ">" # Greater than
|
||||
GTE = ">=" # Greater than or equal
|
||||
LT = "<" # Less than
|
||||
LTE = "<=" # Less than or equal
|
||||
EQ = "==" # Equal to
|
||||
NEQ = "!=" # Not equal to
|
||||
GT = ">" # Greater than
|
||||
GTE = ">=" # Greater than or equal
|
||||
LT = "<" # Less than
|
||||
LTE = "<=" # Less than or equal
|
||||
EQ = "==" # Equal to
|
||||
NEQ = "!=" # Not equal to
|
||||
NAGIOS = "nagios" # Nagios exit-code semantics: 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
|
||||
|
||||
|
||||
class AlertState:
|
||||
@@ -57,6 +58,7 @@ class AlertState:
|
||||
self.last_notification = None
|
||||
self.threshold_value = None # The threshold value that triggered alert
|
||||
self.operator = None # The comparison operator (>, <, >=, etc.)
|
||||
self.hysteresis: Optional[float] = None # Hysteresis fraction used for recovery
|
||||
self.formatted_message = None # Formatted display message for UI
|
||||
self.acknowledged = False # Whether alert has been acknowledged
|
||||
self.acknowledged_at = None # Timestamp when acknowledged
|
||||
@@ -152,6 +154,15 @@ class AlertState:
|
||||
if self.formatted_message is not None:
|
||||
result["formatted_message"] = self.formatted_message
|
||||
|
||||
# Compute and expose the recovery threshold so the UI can display it
|
||||
if (self.hysteresis and self.threshold_value is not None
|
||||
and self.operator is not None):
|
||||
ha = abs(self.threshold_value * self.hysteresis)
|
||||
if self.operator in ('>', '>='):
|
||||
result["recovery_threshold"] = round(self.threshold_value - ha, 4)
|
||||
elif self.operator in ('<', '<='):
|
||||
result["recovery_threshold"] = round(self.threshold_value + ha, 4)
|
||||
|
||||
return result
|
||||
|
||||
def __setstate__(self, state):
|
||||
@@ -159,6 +170,8 @@ class AlertState:
|
||||
self.__dict__.update(state)
|
||||
if not hasattr(self, 'consecutive_count'):
|
||||
self.consecutive_count = 0
|
||||
if not hasattr(self, 'hysteresis'):
|
||||
self.hysteresis = None
|
||||
|
||||
def acknowledge(self):
|
||||
"""Acknowledge this alert to stop reminder notifications."""
|
||||
@@ -227,6 +240,16 @@ class ThresholdConfig:
         if not self.enabled:
             return AlertLevel.OK

+        # Nagios exit-code semantics: value IS the severity
+        if self.operator == ComparisonOperator.NAGIOS:
+            try:
+                code = int(value)
+            except (TypeError, ValueError):
+                return AlertLevel.UNKNOWN
+            return {0: AlertLevel.OK, 1: AlertLevel.WARNING, 2: AlertLevel.CRITICAL}.get(
+                code, AlertLevel.UNKNOWN
+            )
+
         try:
             # Convert value to float for comparison
             value = float(value)
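The exit-code mapping above is small enough to exercise on its own. This is a sketch only: the `AlertLevel` enum below is a stand-in defined for the example (its numeric values are an assumption, not taken from the project), and `nagios_level` is a hypothetical name.

```python
from enum import Enum
from typing import Any


class AlertLevel(Enum):
    """Stand-in for the project's AlertLevel; member values are assumed."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def nagios_level(value: Any) -> AlertLevel:
    """Map a Nagios plugin exit code to an alert level.

    Anything that is not an integer-like 0, 1 or 2 is treated as UNKNOWN,
    which matches the fallback shown in the hunk above.
    """
    try:
        code = int(value)
    except (TypeError, ValueError):
        return AlertLevel.UNKNOWN
    return {
        0: AlertLevel.OK,
        1: AlertLevel.WARNING,
        2: AlertLevel.CRITICAL,
    }.get(code, AlertLevel.UNKNOWN)
```

Because the value itself carries the severity, no `warning` or `critical` numbers are needed for this operator.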
@@ -263,6 +286,10 @@ class ThresholdConfig:
         """
         new_level = self.evaluate(value)

+        # Nagios exit codes are discrete integers — hysteresis doesn't apply
+        if self.operator == ComparisonOperator.NAGIOS:
+            return new_level
+
         # If no hysteresis, return new level
         if self.hysteresis == 0.0:
             return new_level
@@ -396,10 +423,24 @@ class ThresholdChecker:
         Supports two formats:
         1. Legacy format with direct 'thresholds' section
         2. New format with 'threshold_configs' and 'host_threshold_mapping'
+
+        In all cases, THRESHOLD_DEFAULTS are seeded into threshold_configs["default"]
+        so the Settings page always shows the built-in defaults.
+        _parse_multi_config() overwrites this with the fully-merged effective defaults.
         """
+        # Always expose built-in defaults through threshold_configs["default"] so
+        # the Settings page has something to display even in legacy/no-config mode.
+        seed: Dict[str, ThresholdConfig] = {}
+        for plugin_name, plugin_thresholds in THRESHOLD_DEFAULTS.get("thresholds", {}).items():
+            if isinstance(plugin_thresholds, dict):
+                self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=seed)
+        if seed:
+            self.threshold_configs["default"] = seed
+            self.threshold_raw_configs["default"] = {}
+
         # Check for new multi-config format
         if "threshold_configs" in config:
-            self._parse_multi_config(config)
+            self._parse_multi_config(config) # overwrites threshold_configs["default"]
         elif "thresholds" in config:
             # Legacy single threshold configuration
             self._parse_legacy_config(config)
@@ -545,11 +586,14 @@ class ThresholdChecker:
             warning = threshold_config.get("warning")
             critical = threshold_config.get("critical")
             operator = threshold_config.get("operator", ">")
-            display = threshold_config.get("display", "(threshold: {op_symbol} {threshold_value})")
-            hysteresis = threshold_config.get("hysteresis", 0.1) # 10% default
+            # Nagios operator maps exit codes directly; no numeric thresholds needed
+            is_nagios_op = (operator == "nagios")
+            default_display = "{check_name}: {output}" if is_nagios_op else "(threshold: {op_symbol} {threshold_value})"
+            display = threshold_config.get("display", default_display)
+            hysteresis = threshold_config.get("hysteresis", 0.0 if is_nagios_op else 0.02)
             enabled = threshold_config.get("enabled", True)

-            if warning is None and critical is None:
+            if warning is None and critical is None and not is_nagios_op:
                 logger.warning("No thresholds defined for %s, skipping", metric_path)
                 continue
@@ -649,7 +693,7 @@ class ThresholdChecker:
         warning = rtt_thresholds.get("warning")
         critical = rtt_thresholds.get("critical")
         operator = rtt_thresholds.get("operator", ">")
-        hysteresis = rtt_thresholds.get("hysteresis", 0.1) # 10% default
+        hysteresis = rtt_thresholds.get("hysteresis", 0.02) # 2% default
         enabled = rtt_thresholds.get("enabled", True)
         display = rtt_thresholds.get("display")
         count = rtt_thresholds.get("count", 1)
@@ -794,6 +838,12 @@ class ThresholdChecker:
         elif new_level == AlertLevel.WARNING and threshold.warning is not None:
             threshold_value = threshold.warning

+        # Keep hysteresis on the state so the UI can show the recovery threshold
+        if new_level != AlertLevel.OK:
+            alert_state.hysteresis = threshold.hysteresis
+        else:
+            alert_state.hysteresis = None
+
         # Update state and check for changes
         old_level = alert_state.level
         if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
@@ -805,26 +855,33 @@ class ThresholdChecker:
         return None
     def _find_threshold(
         self, thresholds: Dict[str, "ThresholdConfig"], metric_path: str
-    ) -> Optional["ThresholdConfig"]:
-        """Return the threshold for *metric_path*, falling back to suffix matches.
+    ) -> Tuple[Optional["ThresholdConfig"], Optional[str]]:
+        """Return (threshold, check_name) for *metric_path*, falling back to suffix matches.

-        Allows generic thresholds like ``ping_monitor.rtt_avg`` to match
-        fully-qualified paths like ``ping_monitor.8_8_8_8_rtt_avg``.
+        Allows generic thresholds like ``nagios_runner.status_code`` to match
+        fully-qualified paths like ``nagios_runner.check_disk_root_status_code``.
         The exact match is always tried first; then successive leading
         underscore-delimited segments are stripped from the field name until
         a match is found or no segments remain.
+
+        Returns:
+            (ThresholdConfig, None) for an exact match.
+            (ThresholdConfig, "check_disk_root") for a suffix match — the second
+            element is the stripped prefix, available as ``{check_name}`` in
+            display format templates.
+            (None, None) when no threshold is found.
         """
         if metric_path in thresholds:
-            return thresholds[metric_path]
+            return thresholds[metric_path], None
         plugin, sep, field = metric_path.partition(".")
         if not sep:
-            return None
+            return None, None
         parts = field.split("_")
         for i in range(1, len(parts)):
             candidate = plugin + "." + "_".join(parts[i:])
             if candidate in thresholds:
-                return thresholds[candidate]
-        return None
+                return thresholds[candidate], "_".join(parts[:i])
+        return None, None

     def check_plugin_data(
         self,
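The suffix-matching lookup changed in this hunk can be demonstrated without the rest of the checker. The sketch below copies the lookup logic into a free function (hypothetical name `find_threshold`) and uses plain strings in place of `ThresholdConfig` objects, purely for illustration.

```python
from typing import Dict, Optional, Tuple


def find_threshold(
    thresholds: Dict[str, str], metric_path: str
) -> Tuple[Optional[str], Optional[str]]:
    """Return (threshold, check_name) for metric_path.

    The exact key is tried first. Otherwise leading underscore-delimited
    segments of the field name are stripped one at a time, and the stripped
    prefix is returned as check_name on the first hit.
    """
    if metric_path in thresholds:
        return thresholds[metric_path], None
    plugin, sep, field = metric_path.partition(".")
    if not sep:
        return None, None
    parts = field.split("_")
    for i in range(1, len(parts)):
        candidate = plugin + "." + "_".join(parts[i:])
        if candidate in thresholds:
            return thresholds[candidate], "_".join(parts[:i])
    return None, None


# Illustrative configuration: one generic nagios_runner threshold.
configured = {"nagios_runner.status_code": "generic-status-threshold"}
print(find_threshold(configured, "nagios_runner.check_disk_root_status_code"))
```

Because the longest suffix is tried first, a more specific key such as `nagios_runner.root_status_code` would win over `nagios_runner.status_code` if both were configured.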
@@ -854,7 +911,7 @@ class ThresholdChecker:
         for metric_name, value in data.items():
             metric_path = f"{plugin_name}.{metric_name}"

-            threshold = self._find_threshold(thresholds, metric_path)
+            threshold, check_name = self._find_threshold(thresholds, metric_path)
             if threshold is None:
                 continue
@@ -877,13 +934,15 @@ class ThresholdChecker:
             elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                 threshold_value = threshold.warning

+            alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
             # Update state and check for changes
             old_level = alert_state.level
             if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                 state_changes.append((metric_path, old_level, new_level, value))
-                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
+                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data, check_name=check_name, metric_name=metric_name)
             elif new_level != AlertLevel.OK:
-                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
+                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data, check_name=check_name, metric_name=metric_name)

         # Check nested metrics (e.g., partition data in disk_monitor)
         self._check_nested_metrics(
@@ -943,6 +1002,8 @@ class ThresholdChecker:
                 elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                     threshold_value = threshold.warning

+                alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
                 old_level = alert_state.level
                 if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                     state_changes.append((metric_path, old_level, new_level, value))
@@ -959,6 +1020,8 @@ class ThresholdChecker:
         value: Any,
         threshold: ThresholdConfig,
         plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
     ):
         """Trigger a notification for an alert state change.
@@ -981,55 +1044,53 @@ class ThresholdChecker:
         # Format operator symbol
         op_symbol = threshold.operator.value

+        # Short metric label: strip the plugin-name prefix for readability
+        short_path = metric_path.partition(".")[2] or metric_path
+
         # Use a display-friendly value (inf is the sentinel for "overdue")
         import math
         display_value = "overdue" if isinstance(value, float) and math.isinf(value) else value

-        # Format message
-        if new_level == AlertLevel.OK:
-            lvl = "RECOVER"
-            message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
-        elif new_level == AlertLevel.WARNING:
-            lvl = "WARNING"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
-            else:
-                message = f"{metric_path} = {display_value}"
-        elif new_level == AlertLevel.CRITICAL:
-            lvl = "CRITICAL"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
-            else:
-                message = f"{metric_path} = {display_value}"
-        else:
-            lvl = "UNKNOWN"
-            message = f"{metric_path} = {display_value}"
+        # Format message — for the nagios operator there is no numeric threshold_value;
+        # render the display template whenever one is available.
+        has_display = threshold_value is not None or threshold.operator == ComparisonOperator.NAGIOS

-        # Return the formatted threshold info for storing in AlertState
-        formatted_threshold_msg = None
-        if threshold_value is not None and new_level != AlertLevel.OK:
-            formatted_threshold_msg = self._format_display(
+        def _fmt():
+            return self._format_display(
                 threshold.display,
                 value=display_value,
                 threshold_value=threshold_value,
                 op_symbol=op_symbol,
-                plugin_data=plugin_data
+                plugin_data=plugin_data,
+                check_name=check_name,
+                metric_name=metric_name,
             )

+        if new_level == AlertLevel.OK:
+            lvl = "RECOVER"
+            message = f"{short_path} = {display_value} ({old_level.name} -> OK)"
+        elif new_level == AlertLevel.WARNING:
+            lvl = "WARNING"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+        elif new_level == AlertLevel.CRITICAL:
+            lvl = "CRITICAL"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+        else:
+            lvl = "UNKNOWN"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+
+        # Formatted threshold info stored on AlertState for the UI
+        formatted_threshold_msg = _fmt() if has_display and new_level != AlertLevel.OK else None
+
         return lvl, message, formatted_threshold_msg

     def _send_notification(
@@ -1048,11 +1109,16 @@ class ThresholdChecker:
         if host is not None and not host.watched:
             eventlog(host_name, lvl, message, service="threshold")
             return
+        short_path = metric_path.partition(".")[2] or metric_path
+        title = f"[{lvl}] {host_name} {short_path}"
+        # Strip the "metric = " prefix from message so body is just the value/detail
+        prefix = short_path + " = "
+        body = message[len(prefix):] if message.startswith(prefix) else message
         asyncio.get_event_loop().create_task(notify_mod.send_notification(
             host_name,
             notify_mod.Notification(
-                title=f"[{lvl}] {host_name}",
-                body=message,
+                title=title,
+                body=body,
                 level=lvl,
             ),
         ))
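The title/body split introduced in this hunk is plain string handling and can be checked standalone. A minimal sketch, assuming messages of the form `"<short_path> = <detail>"` as built elsewhere in this diff; the function name `split_notification` and the sample host and metric are hypothetical.

```python
from typing import Tuple


def split_notification(
    lvl: str, host_name: str, metric_path: str, message: str
) -> Tuple[str, str]:
    """Build (title, body) the way the hunk above does.

    The metric label moves into the title, and the redundant
    "label = " prefix is stripped from the body when present.
    """
    # "cpu_monitor.cpu_percent" -> "cpu_percent"; paths without a dot are kept whole.
    short_path = metric_path.partition(".")[2] or metric_path
    title = f"[{lvl}] {host_name} {short_path}"
    prefix = short_path + " = "
    body = message[len(prefix):] if message.startswith(prefix) else message
    return title, body


title, body = split_notification(
    "WARNING", "web1", "cpu_monitor.cpu_percent",
    "cpu_percent = 93.5 (threshold: > 90)",
)
print(title)
print(body)
```

A message that does not start with the expected prefix is passed through unchanged, so the body is never truncated by mistake.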
@@ -1077,33 +1143,62 @@ class ThresholdChecker:
         self,
         display_format: str,
         value: Any,
-        threshold_value: float,
+        threshold_value: Optional[float],
         op_symbol: str,
         plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
     ) -> str:
         """Format the display string using available data.

-        Args:
-            display_format: Format string from threshold config
-            value: Current metric value
-            threshold_value: Threshold value that was exceeded
-            op_symbol: Comparison operator symbol
-            plugin_data: Optional dictionary of plugin data fields
+        Available template variables:
+            {value} - current metric value
+            {threshold_value} - threshold that was exceeded
+            {op_symbol} - comparison operator (>, <, >=, <=, ==, !=)
+            {check_name} - prefix stripped for generic threshold match
+                (e.g. "check_disk_root" when metric
+                "check_disk_root_status_code" matched generic
+                threshold "status_code")
+            {metric_name} - field name within the plugin data dict
+            Any key from plugin_data is also available.

         Returns:
             Formatted display string
         """
+        if not display_format:
+            display_format = "(threshold: {op_symbol} {threshold_value})" if threshold_value is not None else ""
+
         # Build format context with standard variables
         format_context = {
             'value': value,
-            'threshold_value': threshold_value,
             'op_symbol': op_symbol,
         }
+        if threshold_value is not None:
+            format_context['threshold_value'] = threshold_value
+
+        # Add generic-match context variables when available
+        if check_name is not None:
+            format_context['check_name'] = check_name
+        if metric_name is not None:
+            format_context['metric_name'] = metric_name

         # Add all plugin data fields if available
         if plugin_data:
             format_context.update(plugin_data)

+        # For nagios_runner generic matches, expose the matched check's output
+        # and status as short aliases {output} and {status} so display templates
+        # don't need to use the full {check_disk_root_output} form.
+        if check_name and plugin_data:
+            if 'output' not in format_context:
+                output = plugin_data.get(f"{check_name}_output")
+                if output is not None:
+                    format_context['output'] = output
+            if 'status' not in format_context:
+                status = plugin_data.get(f"{check_name}_status")
+                if status is not None:
+                    format_context['status'] = status
+
         try:
             # Format the display string
             return display_format.format(**format_context)
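To see how the `{check_name}`, `{output}` and `{status}` template variables resolve, the context-building steps above can be reproduced in a small free function. This is a sketch under stated assumptions: `build_format_context` is a hypothetical name, the sample plugin data is invented for illustration, and error handling for missing template keys (not visible in this hunk) is left out.

```python
from typing import Any, Dict, Optional


def build_format_context(
    value: Any,
    op_symbol: str,
    threshold_value: Optional[float] = None,
    plugin_data: Optional[Dict[str, Any]] = None,
    check_name: Optional[str] = None,
    metric_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble the variables available to a display template."""
    ctx: Dict[str, Any] = {"value": value, "op_symbol": op_symbol}
    if threshold_value is not None:
        ctx["threshold_value"] = threshold_value
    if check_name is not None:
        ctx["check_name"] = check_name
    if metric_name is not None:
        ctx["metric_name"] = metric_name
    if plugin_data:
        ctx.update(plugin_data)
    # Short aliases: {output} and {status} resolve to "<check_name>_output"
    # and "<check_name>_status" unless the plugin data already defines them.
    if check_name and plugin_data:
        for alias in ("output", "status"):
            if alias not in ctx:
                aliased = plugin_data.get(f"{check_name}_{alias}")
                if aliased is not None:
                    ctx[alias] = aliased
    return ctx


# Invented sample data in the shape nagios_runner produces in this diff.
data = {
    "check_disk_root_status_code": 2,
    "check_disk_root_output": "DISK CRITICAL - free space: / 1 GB",
}
ctx = build_format_context(
    2, "nagios", plugin_data=data,
    check_name="check_disk_root", metric_name="check_disk_root_status_code",
)
rendered = "{check_name}: {output}".format(**ctx)
print(rendered)
```

The template used here is the default display string that this diff assigns to the `nagios` operator.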
@@ -1133,6 +1228,8 @@ class ThresholdChecker:
         value: Any,
         threshold: ThresholdConfig,
         plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
     ) -> None:
         """Handle a state-change transition with grace-period logic.
@@ -1145,7 +1242,8 @@ class ThresholdChecker:
         - Past grace: fires the RECOVER notification normally.
         """
         lvl, message, formatted_msg = self._trigger_notification(
-            host_name, metric_path, old_level, new_level, value, threshold, plugin_data
+            host_name, metric_path, old_level, new_level, value, threshold, plugin_data,
+            check_name=check_name, metric_name=metric_name,
         )
         alert_state.formatted_message = formatted_msg
@@ -1181,6 +1279,8 @@ class ThresholdChecker:
         value: Any,
         threshold: ThresholdConfig,
         plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
     ) -> None:
         """Called when alert level is unchanged and non-OK.
@@ -1190,7 +1290,8 @@ class ThresholdChecker:
         if alert_state.pending_since is not None:
             if time.time() - alert_state.pending_since >= self.grace_seconds:
                 lvl, message, formatted_msg = self._trigger_notification(
-                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data
+                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data,
+                    check_name=check_name, metric_name=metric_name,
                 )
                 alert_state.formatted_message = formatted_msg
                 self._send_notification(
@@ -1199,7 +1300,7 @@ class ThresholdChecker:
                 alert_state.pending_since = None
             # else: still within grace window, do nothing
         else:
-            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data)
+            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data, check_name=check_name, metric_name=metric_name)

     def _check_renotify(
         self,
@@ -1209,6 +1310,8 @@ class ThresholdChecker:
         value: Any,
         threshold: ThresholdConfig,
         plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
     ):
         """Check if we should send a repeat notification.
@@ -1246,6 +1349,7 @@ class ThresholdChecker:

         # Format operator symbol
         op_symbol = threshold.operator.value
+        short_path = metric_path.partition(".")[2] or metric_path

         # Time to re-notify
         if threshold_value is not None:
@@ -1255,11 +1359,14 @@ class ThresholdChecker:
                 value=value,
                 threshold_value=threshold_value,
                 op_symbol=op_symbol,
-                plugin_data=plugin_data
+                plugin_data=plugin_data,
+                check_name=check_name,
+                metric_name=metric_name,
             )
-            message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
+            body = f"{value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
         else:
-            message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
+            body = f"{value} (ongoing for {int(now - alert_state.since)}s)"
+        message = f"REMINDER ({alert_state.level.name}): {host_name} - {short_path} = {body}"

         from . import hbdclass
         host = hbdclass.Host.hosts.get(host_name)
@@ -1267,8 +1374,8 @@ class ThresholdChecker:
         asyncio.get_event_loop().create_task(notify_mod.send_notification(
             host_name,
             notify_mod.Notification(
-                title=f"[REMINDER/{alert_state.level.name}] {host_name}",
-                body=message,
+                title=f"[REMINDER/{alert_state.level.name}] {host_name} {short_path}",
+                body=body,
                 level=alert_state.level.name,
             ),
         ))
@@ -1288,7 +1395,7 @@ class ThresholdChecker:
             if not host.alert_states:
                 continue
             configured = self.get_thresholds_for_host(hostname)
-            stale = [mp for mp in host.alert_states if mp not in configured]
+            stale = [mp for mp in host.alert_states if self._find_threshold(configured, mp)[0] is None]
             for mp in stale:
                 logger.info(
                     "Purging stale alert state for %s / %s (no threshold configured)",
+1
-2
@@ -336,8 +336,7 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
         # Apply user-access settings from config
         access = config_mod.get_host_access(cfg, uname)
         host.apply_access(access["owner"], access["managers"], access["monitors"])
-        if verbose:
-            print(("XX: New host, num now %s" % (len(hbdcls.Host.hosts))))
+        logger.info("New host signed on: %s (dyn=%s, access=%s)", uname, host.dyn, access)
         newh = True
     else:
         host = hbdcls.Host.hosts[uname]
+1
-1
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "hbd"
-version = "5.1.17"
+version = "5.2.3"
 description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
 readme = "README.md"
 requires-python = ">=3.11"
+33
-14
@@ -41,7 +41,7 @@ from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple

 # updated by scripts/bumpminor.sh
-__version__ = "5.1.17"
+__version__ = "5.2.3"

 # ---------------------------------------------------------------------------
 # Protocol (mirrors hbd/common/proto.py)
@@ -388,7 +388,6 @@ class NagiosRunnerPlugin(MonitorPlugin):

     async def _collect_metrics(self) -> Dict[str, Any]:
         results: Dict[str, Any] = {}
-        worst = 0
         for cmd_cfg in self.commands:
             name = cmd_cfg.get("name")
             command = cmd_cfg.get("command")
@@ -399,10 +398,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
             results[f"{name}_status_code"] = rc
             results[f"{name}_output"] = msg
             results.update({f"{name}_{k}": v for k, v in perf.items()})
-            worst = max(worst, rc)
-        results["overall_status"] = _NAGIOS_STATUS.get(worst, "UNKNOWN")
-        results["overall_status_code"] = worst
-        results["plugin_count"] = len(self.commands)
         return results
@@ -487,6 +482,12 @@ class CPUMonitorPlugin(MonitorPlugin):
         except Exception:
             pass

+        try:
+            with open("/proc/uptime") as fh:
+                data["uptime_seconds"] = int(float(fh.read().split()[0]))
+        except Exception:
+            pass
+
         return data
@@ -535,6 +536,20 @@ class MemoryMonitorPlugin(MonitorPlugin):
         total = mi.get("MemTotal", 0)
         avail = mi.get("MemAvailable", mi.get("MemFree", 0))
         free = mi.get("MemFree", 0)
+
+        # ZFS ARC is reclaimable but not included in MemAvailable; add it.
+        arc_kb = 0
+        try:
+            with open("/proc/spl/kstat/zfs/arcstats") as _f:
+                for _line in _f:
+                    _p = _line.split()
+                    if len(_p) >= 3 and _p[0] == "size":
+                        arc_kb = int(_p[2]) // 1024
+                        break
+        except (OSError, ValueError):
+            pass
+
+        avail = min(avail + arc_kb, total)
         used = total - avail
         data: Dict[str, Any] = {
             "memory_total": total * 1024,
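The ARC adjustment above can be tested without a ZFS host by feeding it text instead of reading `/proc/spl/kstat/zfs/arcstats`. The sketch below assumes the same three-column `name type data` row layout that the hunk parses; the function names and the sample text are hypothetical and only illustrative.

```python
def arc_size_kb(arcstats_text: str) -> int:
    """Return the ARC size in KiB from arcstats-style text, or 0 if absent."""
    for line in arcstats_text.splitlines():
        fields = line.split()
        # Rows look like "<name> <type> <data>"; the "size" row holds bytes.
        if len(fields) >= 3 and fields[0] == "size":
            return int(fields[2]) // 1024
    return 0


def adjusted_available_kb(total_kb: int, avail_kb: int, arc_kb: int) -> int:
    """Count the ARC as available memory, capped at total memory."""
    return min(avail_kb + arc_kb, total_kb)


# Invented sample in the assumed layout (2 GiB ARC).
SAMPLE = """name type data
hits 4 12345
size 4 2147483648
"""

arc_kb = arc_size_kb(SAMPLE)
print(arc_kb, adjusted_available_kb(8 * 1024 * 1024, 1024 * 1024, arc_kb))
```

The cap at `total_kb` mirrors the `min(...)` in the hunk and keeps the derived "used" figure from going negative.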
@@ -782,8 +797,7 @@ class _HeartbeatProtocol(asyncio.DatagramProtocol):
             self._log.error("datagram error: %s", e)

     def error_received(self, exc):
-        self._log.warning("protocol error on %s: %s — dropping connection", self._conn.addr, exc)
-        self._conn._dead = True
+        self._log.warning("protocol error on %s: %s — will retry", self._conn.addr, exc)
         self._conn.close()
@@ -1014,7 +1028,7 @@ def _reconfigure_syslog(level: int):
 # ---------------------------------------------------------------------------

 async def _async_main(args, cfg: Dict[str, Any]) -> int:
-    global _running, _shutdown_event, _active_tasks
+    global _running, _shutdown_event, _active_tasks, send_shutdown
     _running = True
     _shutdown_event = asyncio.Event()
     _active_tasks = []
@@ -1024,7 +1038,7 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
     port = cfg.get("hb_port", PORT)
     interval = cfg.get("interval", INTERVAL)

-    log.info("starting: %s -> %s port=%d interval=%ds", iam, args.hosts, port, interval)
+    log.info("starting hbc_mini %s on %s -> %s port=%d interval=%ds",__version__, iam, args.hosts, port, interval)

     connections: List[AsyncConnection] = []
     conn_id = 1
@@ -1045,15 +1059,18 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
         return 1

     # Boot / one-shot message
+    send_shutdown = False
     if args.boot or args.message:
         bmsg: Dict[str, Any] = {"acks": 0}
         if args.boot:
             bmsg["boot"] = 1
+            args.boot = False # don't repeat on restart
+            send_shutdown = True
         if args.message:
             bmsg["service"] = "service"
             bmsg["msg"] = args.message
-        for c in connections:
-            await c.sendto(bmsg)
+        target = next((c for c in connections if c._transport), connections[0])
+        await target.sendto(bmsg)
         if args.message and not args.daemon:
             await asyncio.sleep(0.3)
             for c in connections:
@@ -1085,11 +1102,13 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
         pass

     log.info("shutting down")
-    for conn in connections:
+    target = next((c for c in connections if c._transport), connections[0] if connections else None)
+    if target and send_shutdown:
         try:
-            await conn.sendto({"shutdown": 1, "acks": conn.ackcount})
+            await target.sendto({"shutdown": 1, "acks": target.ackcount})
         except Exception:
             pass
+    for conn in connections:
         conn.close()
     await asyncio.sleep(0.3)
     for plugin in plugins:
+1
-2
@@ -68,8 +68,7 @@ async def test_nagios_runner():
     print(f" ✓ Collected {len(data)} data points")

     print(f"\n4. Results:")
-    print(f" Overall Status: {data.get('overall_status')} (code: {data.get('overall_status_code')})")
-    print(f" Plugins Executed: {data.get('plugin_count')}")
+    print(f" Data points collected: {len(data)}")

     # Show individual plugin results
     print(f"\n5. Individual Plugin Results:")