Compare commits

..

28 Commits

Author SHA1 Message Date
andreas 4349ae217a version 5.2.4
Release / release (push) Successful in 5s
2026-05-08 08:50:06 -04:00
andreas b3aa7b585f udp/config: fall back to default_owner when os_info has no owner; log debug
- When os_info arrives with no owner field, apply default_owner from server config
- Stop applying default_owner unconditionally in get_host_access (now deferred to os_info handling)
- os_info plugin logs debug message when injecting owner from client config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 08:49:42 -04:00
andreas 88a3c09b51 hbc/server: request InfoPlugin refresh when host has no plugin data; update docs
- Server sets request_update=1 in ACK when host.plugin_data is empty
- hbc: AsyncConnection.request_info_event; handle_ack sets it on request_update
- hbc: _info_plugin_refresh_loop clears InfoPlugin caches and resends on demand
- hbc_mini: same via _request_info event and _info_refresh_loop
- docs/USERS.md: document client-declared owner config key
- docs/PLUGIN_DEVELOPMENT.md: document server-initiated InfoPlugin refresh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 07:37:41 -04:00
andreas 0504402a8a hbc/hbc_mini: add owner config; include in os_info; server applies to host
- owner: optional top-level config key in ~/.hbc.yaml / ~/.hbc.json
- Propagated into plugin configs at load time so os_info can include it
- os_info PLG data carries owner field when set
- udp: sets host.owner from os_info if not already configured server-side
- live.html: format event log timestamps as YYYY-MM-DD HH:MM:SS (24-hour)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 07:25:47 -04:00
andreas ca58c18802 eventlog: store structured dicts; filter by user; clock: fix minute hand step
- eventlog() now stores {ts, host, level, service, message} dicts instead of strings
- WebSocket sends/broadcasts filter event log messages by the user's managed hosts
- live.html renders structured log entries with level-coloured spans
- Swiss railway clock minute hand now holds until second hand reaches 12, then steps

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 07:00:17 -04:00
andreas 1ddc4b8132 threshold/alerts: strip _status_code suffix from displayed metric names
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 06:19:16 -04:00
andreas 5e1720ed32 notify: use plain URL in Mattermost plugin metrics link
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:43:18 -04:00
andreas 77f127fe60 hbc/hbc_mini: consolidate startup log into single line
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:33:31 -04:00
andreas 54fbd8d73d version 5.2.3
Release / release (push) Successful in 5s
2026-05-07 10:15:11 -04:00
andreas 7ab17e26e2 hbc/hbc_mini: log name and version at startup; ui: bump alert-metric font size
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:15:03 -04:00
andreas 28f5fa951c ui: show metric name inline with hostname in alerts and notifications
Alerts page: move metric name into the header row alongside hostname.
Notifications: include metric name in title (hostname  metric) and
strip the metric prefix from the body so it contains only value/detail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 06:26:27 -04:00
andreas 37f1c58969 docs: remove dead warning/critical keys from ping_monitor config example
These fields were never read by the plugin; thresholds are configured
server-side. Also document the -b flag in README.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 06:12:15 -04:00
andreas f006077a71 send shutdown msg only if we sent a boot msg. Don't send eithe when restarting. 2026-05-06 11:57:43 -04:00
andreas d9fc8d632f send shutdown msg only if we sent a boot msg. Don't send eithe when restarting. 2026-05-06 11:54:09 -04:00
andreas f640574e4f version 5.2.2
Release / release (push) Successful in 5s
2026-05-06 09:57:43 -04:00
andreas 9a19424279 fix: retry connection on network error instead of permanently dropping it
error_received() no longer sets _dead=True; it just closes the transport
so the existing retry loop in heartbeat_sender (hbc) and sendto (hbc_mini)
reopens the connection on the next interval. This allows hbc to recover
when it starts before network connectivity is established.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 09:57:32 -04:00
andreas ca8ba84e65 fix: silence aiohttp.access log and strip plugin prefix in alerts UI
- main: disable aiohttp.access propagation unless --debug is active
- alerts.html: strip plugin-name prefix from metric_path display
  (nagios_runner.check_disk_root_status_code → check_disk_root_status_code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:39:55 -04:00
andreas f3d08d1c9e version 5.2.1
Release / release (push) Successful in 5s
2026-05-06 07:07:01 -04:00
andreas 1e4263b793 fix: threshold and logging improvements
- threshold: fix crash when display is None (_format_display now falls
  back to default format string instead of calling None.format())
- threshold: shorten notification messages by stripping plugin-name prefix
  from metric_path (cpu_percent instead of cpu_monitor.cpu_percent)
- main: demote aiohttp.access log records from INFO to DEBUG
- udp: replace debug print with proper logger.info for new host sign-on

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:06:56 -04:00
andreas e931acb9f5 version 5.2.0
Release / release (push) Successful in 5s
2026-05-05 13:47:46 -04:00
andreas 018409e71d docs: correct README inaccuracies found during code audit
- Add ping_monitor to built-in plugins list
- Update cpu_monitor (uptime) and memory_monitor (ZFS ARC) descriptions
- Replace "aggregated status" bullet with accurate per-check reporting note
- Fix RTT hysteresis default: 0.1 → 0.02
- Fix client YAML config: remove non-existent server:/port: keys, use hb_port:
- Fix nagios_runner commands format: plain strings → {name:, command:} dicts
- Fix Supported Metrics: exit_code → actual <name>_status_code/<name>_status/<name>_output fields

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 13:45:43 -04:00
andreas 1824f637b4 fix: always show THRESHOLD_DEFAULTS in Settings threshold config
Seed threshold_configs["default"] from THRESHOLD_DEFAULTS at the start
of _parse_config() so the Settings page displays built-in defaults
regardless of whether the server config uses the multi-config format,
the legacy thresholds: format, or has no threshold config at all.
_parse_multi_config() overwrites the seed with the fully-merged
effective defaults when a threshold_configs section is present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 13:02:28 -04:00
andreas a534c06b26 feat: nagios operator for direct exit-code severity mapping
Add ComparisonOperator.NAGIOS ("nagios") that maps Nagios exit codes
directly to alert levels (0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN) without
requiring numeric warning/critical thresholds. Hysteresis is bypassed for
discrete codes. Display template defaults to "{check_name}: {output}".
_format_display() handles None threshold_value gracefully.

Add nagios_runner.status_code as a built-in default threshold config so
nagios checks alert out of the box.

Also: fix alerts.html scrolling (override html,body), make hostname a link
to /plugins#<hostname>, remove overall_status/overall_status_code/plugin_count
from nagios_runner and hbc_mini, replace with computed worst-status in
plugins.html via nagiosWorstStatus() helper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:26:56 -04:00
andreas d7b5c97a4e version 5.1.21
Release / release (push) Successful in 6s
2026-05-05 11:05:48 -04:00
andreas ae447ac4a6 feat: nagios_runner improvements and alerts page fixes
- nagios_runner: remove overall_status/overall_status_code/plugin_count fields;
  each command still reports its own <name>_status and <name>_status_code
- threshold: expose {output} and {status} aliases in display templates for
  nagios_runner generic matches (mapped from <check_name>_output/status)
- alerts.html: fix scrolling by overriding html,body height/overflow (style.css
  sets both); make hostname a link to /plugins/<hostname>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:05:45 -04:00
andreas d44ce3d124 version 5.1.20
Release / release (push) Successful in 6s
2026-05-05 10:48:24 -04:00
andreas b1985d0eb2 feat: generic threshold matching for nagios_runner with {check_name} display support
_find_threshold() now returns the stripped prefix ("check_name") alongside
the ThresholdConfig, enabling a single generic entry (e.g. nagios_runner.status_code)
to cover all per-command metrics (check_disk_root_status_code, check_load_status_code,
…). The prefix is threaded through to _format_display() as {check_name}, with
{metric_name} also available in display templates. purge_stale_alerts() updated
to use generic matching so it does not incorrectly drop alerts on generic-matched
metrics. README updated with Display Format Templates and Generic Threshold
Matching sections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 10:48:17 -04:00
andreas de778f680f fix: reduce default hysteresis 10%→2%; show recovery threshold in alerts UI
The 10% default hysteresis created an unreasonably wide recovery band:
a 95% threshold would only clear once the value dropped below 85.5%,
causing alerts to linger long after the metric was well below the
trigger level.

Change default hysteresis to 2% across all threshold parsers (plugin
metrics, partitions, RTT). For a 95% threshold, recovery is now at
93.1% instead of 85.5%.

Add AlertState.hysteresis field (set on every check, cleared on OK) and
expose recovery_threshold in to_dict() so the Alerts dashboard can
display "recovers < 93.1" alongside the trigger threshold, making the
hysteresis band visible to the user. Pickle backward-compatible via
__setstate__.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:47:50 -04:00
25 changed files with 538 additions and 264 deletions
+72 -12
View File
@@ -58,10 +58,11 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
### Built-in Plugins
- `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
- `memory_monitor`: Monitors RAM and swap usage, available memory
- `cpu_monitor`: Monitors CPU usage, load average, frequency, process counts, and uptime
- `memory_monitor`: Monitors RAM and swap usage, available memory (ZFS ARC-aware)
- `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
- `ping_monitor`: Measures round-trip latency to configured hosts
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
- `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`
@@ -76,7 +77,7 @@ The `nagios_runner` plugin provides seamless integration with the vast Nagios pl
- Validates absolute command paths at startup and warns on missing or non-executable files
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
- Extracts performance data with thresholds
- Reports aggregated status across all configured checks
- Reports per-check status, exit code, and output; no aggregate rollup field
See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for complete integration guide including configuration examples and custom plugin development.
@@ -181,7 +182,8 @@ thresholds:
warning: 80.0 # Warn when CPU > 80%
critical: 90.0 # Critical when CPU > 90%
operator: ">"
hysteresis: 0.1 # 10% hysteresis to prevent flapping
hysteresis: 0.02 # 2% hysteresis to prevent flapping
display: "(threshold: {op_symbol} {threshold_value}%)" # optional
memory_monitor:
percent:
@@ -223,7 +225,7 @@ thresholds:
<hostname>:
warning: <milliseconds> # Warn when RTT > this value
critical: <milliseconds> # Critical when RTT > this value
hysteresis: 0.1 # Optional: 10% hysteresis (default)
hysteresis: 0.02 # Optional: 2% hysteresis (default)
```
**Example alerts:**
@@ -274,7 +276,59 @@ All plugin metrics can be thresholded:
- **Memory**: percent, available_mb, swap_percent
- **Disk**: Per-partition percent, free_gb, free_mb
- **Network**: errors_total, dropped packets, connection counts
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
- **Nagios**: Any field emitted by `nagios_runner` (`<name>_status_code`, `<name>_status`, `<name>_output`, performance data fields)
### Display Format Templates
Each threshold entry accepts an optional `display` field — a Python format string shown in notifications and on the Alerts dashboard:
```yaml
nagios_runner:
status_code:
warning: 1
critical: 2
operator: ">="
display: "{check_name}: exit {value} (expected < {threshold_value})"
```
Available variables:
| Variable | Description |
|---|---|
| `{value}` | Current metric value |
| `{threshold_value}` | Threshold that was crossed |
| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …); `"nagios"` for the nagios operator |
| `{check_name}` | Prefix stripped by generic matching (see below) |
| `{metric_name}` | Full field name within the plugin data |
| `{output}` | For `nagios_runner` generic matches: the matched check's status text (alias for `{check_name}_output`) |
| `{status}` | For `nagios_runner` generic matches: the matched check's status name — OK/WARNING/CRITICAL/UNKNOWN (alias for `{check_name}_status`) |
| any plugin field | Any other field present in the plugin's data |
### Generic Threshold Matching
When a metric name has no exact threshold entry, the server progressively strips leading underscore-separated segments and re-tries the lookup. This lets a single generic entry cover an entire family of metrics.
The classic use case is `nagios_runner`, which names each metric after the command that produced it:
```
nagios_runner.check_disk_root_status_code → no exact match
nagios_runner.disk_root_status_code → no match
nagios_runner.root_status_code → no match
nagios_runner.status_code → matched ✓
```
Configure the generic threshold once using the `nagios` operator, which maps exit codes directly to alert severity without requiring numeric warning/critical values:
```yaml
nagios_runner:
status_code:
operator: "nagios" # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
display: "{check_name}: {output}"
```
The stripped prefix (`check_disk_root` in the example above) is available as `{check_name}` in the display template, so you can identify which check triggered the alert without writing a separate threshold entry per command.
Exact matches always take priority. A generic entry only applies when no specific one is defined.
### Per-Host Threshold Profiles
@@ -453,6 +507,9 @@ hbc --boot your-server.example.com
# Verbose output
hbc -v your-server.example.com
# Send 'boot' and 'shutdown' messages on start and exit
hbc -b your-server.example.com
```
You can also run it via the module entrypoint:
@@ -461,12 +518,11 @@ You can also run it via the module entrypoint:
python -m hbd.client.main your-server.example.com
```
Client configuration can also be specified in YAML:
Client configuration can also be specified in YAML (`~/.hbc.yaml`):
```yaml
server: hbd.example.com
port: 50003
interval: 30
hb_port: 50003 # Server port (default: 50003)
interval: 30 # Heartbeat interval in seconds
plugins:
cpu_monitor:
interval: 300 # Check every 5 minutes (default)
@@ -480,10 +536,14 @@ plugins:
nagios_runner:
interval: 300 # Check every 5 minutes (default)
commands:
- /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
- /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
- name: check_load
command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
- name: check_disk
command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```
The server hostname is always passed as a positional command-line argument; there is no `server:` config key.
All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
**Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
-5
View File
@@ -104,11 +104,6 @@ The `nagios_runner` plugin collects:
- `{name}_{metric}_min` - Minimum value (if present)
- `{name}_{metric}_max` - Maximum value (if present)
**Overall:**
- `overall_status` - Worst status from all commands
- `overall_status_code` - Worst status code
- `plugin_count` - Number of Nagios plugins executed
## Configuration Options
```yaml
+23
View File
@@ -8,6 +8,7 @@ This guide explains how to create custom plugins for the Heartbeat monitoring sy
- [Plugin Types](#plugin-types)
- [Creating a Plugin](#creating-a-plugin)
- [Plugin Lifecycle](#plugin-lifecycle)
- [Server-initiated InfoPlugin refresh](#server-initiated-infoplugin-refresh)
- [Configuration](#configuration)
- [Best Practices](#best-practices)
- [Examples](#examples)
@@ -250,6 +251,28 @@ Understanding the plugin lifecycle helps you implement plugins correctly:
└─> Plugin releases resources, closes connections
```
## Server-initiated InfoPlugin refresh
When a heartbeat packet arrives from a host the server has no plugin data for (e.g. after a server restart), the server sets `request_update = 1` in the ACK reply. The client detects this flag and immediately re-runs all InfoPlugins — clearing their cached results first — then resends the data as PLG messages.
This means InfoPlugin data will always reach the server as soon as possible without requiring a client restart. No action is needed from plugin authors: the framework handles cache invalidation and re-collection automatically.
The lifecycle for this case looks like:
```
Server restarts, host reconnects
└─> hbd receives HTB with no existing plugin_data for host
└─> hbd sets request_update=1 in ACK
Client receives ACK
└─> Detects request_update flag
└─> Clears _cache on every registered InfoPlugin
└─> Calls collect() on each InfoPlugin
└─> Sends fresh PLG messages to server
```
If you write an `InfoPlugin` with side effects in `_collect_info()` (opening connections, writing files, etc.), be aware it may be called more than once per client session when this mechanism triggers.
## Configuration
### Plugin-Specific Configuration
-27
View File
@@ -1110,33 +1110,6 @@ hosts:
db-02:
threshold_config: [tight_memory, db_disk]
```
### Backward Compatibility
The legacy single threshold configuration is fully supported:
```yaml
# Old format - still works
thresholds:
cpu_monitor:
cpu_percent:
warning: 80.0
critical: 90.0
```
This is equivalent to:
```yaml
# New format
threshold_configs:
default:
thresholds:
cpu_monitor:
cpu_percent:
warning: 80.0
critical: 90.0
```
### Configuration Priority
1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
+18
View File
@@ -46,6 +46,24 @@ default_owner: andreas # owns hosts with no explicit owner
# falls back to the first admin user if omitted
```
### Client-declared host ownership
A host can declare its own owner directly in the hbc or hbc_mini client configuration. This is useful for hosts that are not listed in the server config, or during initial setup before a server-side config entry has been created.
**`~/.hbc.yaml`** (hbc):
```yaml
owner: andreas
```
**`~/.hbc.json`** (hbc_mini):
```json
{ "owner": "andreas" }
```
When set, the value is included in the `os_info` plugin data sent to the server. The server applies it as `host.owner` the first time `os_info` arrives, provided no owner has been configured server-side for that host. Server-configured ownership always takes precedence.
---
### Assigning roles to hosts
```yaml
+1 -1
View File
@@ -14,4 +14,4 @@ Install options:
"""
__all__ = ["__version__"]
__version__ = "5.1.19"
__version__ = "5.2.4"
+5 -2
View File
@@ -15,12 +15,15 @@ CLIENT_DEFAULTS = {
# Network settings
"hb_port": 50003, # Port where hbd servers listen
"interval": 10, # Heartbeat interval in seconds
# Host identity
"owner": None, # Optional username to set as this host's owner on the server
# Runtime flags
"foreground": False,
"verbose": False,
"debug": 0,
# Plugin configuration
"plugins": {}, # Per-plugin configuration
"thresholds": {}, # Threshold configuration for monitoring
+51 -28
View File
@@ -21,6 +21,7 @@ from typing import Dict, List, Optional
# Import protocol and config
from .config import load_config
from ..common.proto import dicttos, stodict
from .. import __version__
# Import plugin system
from .plugin import PluginRegistry, PluginLoader, InfoPlugin, MonitorPlugin
@@ -58,6 +59,7 @@ class AsyncConnection:
self._dead = False
self._ever_opened = False
self._open_fail_count = 0 # consecutive failures before first success
self.request_info_event: asyncio.Event = asyncio.Event()
self.logger = logging.getLogger(f"hbc.conn.{addr}")
@@ -137,6 +139,9 @@ class AsyncConnection:
self.ackcount += 1
self.logger.debug(f"ACK received, RTT: {rtt:.1f}ms")
if msg.get("request_update"):
self.logger.info("server requested plugin info refresh")
self.request_info_event.set()
class HeartbeatProtocol(asyncio.DatagramProtocol):
@@ -172,9 +177,8 @@ class HeartbeatProtocol(asyncio.DatagramProtocol):
self.logger.error(f"Error processing datagram: {e}", exc_info=True)
def error_received(self, exc):
"""Handle protocol errors."""
self.logger.warning(f"Protocol error on {self.connection.addr}: {exc}dropping connection")
self.connection._dead = True
"""Handle protocol errors — close transport so the heartbeat sender retries."""
self.logger.warning(f"Protocol error on {self.connection.addr}: {exc}will retry")
self.connection.close()
@@ -338,15 +342,35 @@ async def heartbeat_sender(conn: AsyncConnection, interval: int):
raise
async def _info_plugin_refresh_loop(conn: AsyncConnection, info_plugins: List):
"""Wait for server requests to re-send InfoPlugin data."""
logger = logging.getLogger("hbc.plugins")
while running:
await conn.request_info_event.wait()
if not running:
break
conn.request_info_event.clear()
logger.info("refreshing InfoPlugins on server request")
for plugin in info_plugins:
plugin._cache = None
try:
data = await plugin.collect()
if data:
await conn.sendto({"plugin": plugin.name, **data}, "PLG")
logger.info(f"Resent {plugin.name} data")
except Exception as e:
logger.error(f"Error re-collecting {plugin.name}: {e}", exc_info=True)
async def plugin_collector(conn: AsyncConnection, registry: PluginRegistry):
"""Collect and send plugin data.
Args:
conn: Connection to send on
registry: Plugin registry
"""
logger = logging.getLogger("hbc.plugins")
# Collect InfoPlugins once at startup
info_plugins = registry.get_by_type(InfoPlugin)
for plugin in info_plugins:
@@ -359,34 +383,31 @@ async def plugin_collector(conn: AsyncConnection, registry: PluginRegistry):
logger.info(f"Sent {plugin.name} data")
except Exception as e:
logger.error(f"Error collecting {plugin.name}: {e}", exc_info=True)
# Schedule MonitorPlugins
# Group plugins by interval
from collections import defaultdict
by_interval = defaultdict(list)
monitor_plugins = registry.get_by_type(MonitorPlugin)
for plugin in monitor_plugins:
by_interval[plugin.interval].append(plugin)
# Create tasks for each interval
tasks = []
# Create tasks for each interval; always include the info-refresh watcher
tasks = [asyncio.create_task(_info_plugin_refresh_loop(conn, info_plugins))]
for interval, plugins in by_interval.items():
task = asyncio.create_task(
tasks.append(asyncio.create_task(
plugin_collector_interval(conn, plugins, interval)
)
tasks.append(task)
# Wait for all tasks
if tasks:
try:
await asyncio.gather(*tasks, return_exceptions=True)
except asyncio.CancelledError:
logger.debug("Plugin collector cancelled, cancelling sub-tasks")
for task in tasks:
if not task.done():
task.cancel()
raise
))
try:
await asyncio.gather(*tasks, return_exceptions=True)
except asyncio.CancelledError:
logger.debug("Plugin collector cancelled, cancelling sub-tasks")
for task in tasks:
if not task.done():
task.cancel()
raise
async def plugin_collector_interval(
@@ -464,7 +485,7 @@ async def cleanup(connections: List[AsyncConnection]):
logger.info("Cleaning up connections")
target = next((c for c in connections if c.transport), connections[0] if connections else None)
if target:
if target and send_shutdown:
try:
await target.sendto({"shutdown": 1, "acks": target.ackcount})
except Exception as e:
@@ -478,7 +499,7 @@ async def cleanup(connections: List[AsyncConnection]):
async def async_main(args, config):
"""Async main function."""
global running, shutdown_event, active_tasks
global running, shutdown_event, active_tasks, send_shutdown
# Create shutdown event
shutdown_event = asyncio.Event()
@@ -495,8 +516,7 @@ async def async_main(args, config):
hb_port = config.get("hb_port", PORT)
interval = config.get("interval", INTERVAL)
logger.info(f"Starting hbc for {iam} -> {hb_hosts}")
logger.info(f"Port: {hb_port}, Interval: {interval}s")
logger.info(f"hbc {__version__} on {iam} -> {hb_hosts} port={hb_port}, interval={interval}s")
# Create connections
connections = []
@@ -526,10 +546,13 @@ async def async_main(args, config):
logger.info(f"Created {len(connections)} connections")
# Send boot/message if requested
send_shutdown = False
if args.boot or args.message:
boot_msg = {}
if args.boot:
boot_msg["boot"] = 1
args.boot = False # Clear boot flag so we don't send it again in main loop
send_shutdown = True
if args.message:
boot_msg["service"] = "service"
boot_msg["msg"] = args.message
+4 -1
View File
@@ -364,7 +364,10 @@ class PluginLoader:
# Instantiate plugin with config — check plugins subdict first,
# then top-level keys (e.g. nagios_runner: ... at root of config).
plugin_instance_config = plugins_subconfig.get(obj.name) or raw_config.get(obj.name, {})
plugin_instance_config = dict(plugins_subconfig.get(obj.name) or raw_config.get(obj.name) or {})
# Propagate top-level owner so os_info (and any future plugin) can report it.
if "owner" in raw_config and "owner" not in plugin_instance_config:
plugin_instance_config["owner"] = raw_config["owner"]
plugin = obj(config=plugin_instance_config)
# Initialize plugin
+12 -28
View File
@@ -31,16 +31,13 @@ from hbd.client.plugin import MonitorPlugin
# Nagios exit codes
NAGIOS_OK = 0
NAGIOS_WARNING = 1
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN = 3
STATUS_NAMES = {
NAGIOS_OK: "OK",
NAGIOS_WARNING: "WARNING",
NAGIOS_CRITICAL: "CRITICAL",
NAGIOS_UNKNOWN: "UNKNOWN"
0: "OK",
1: "WARNING",
2: "CRITICAL",
3: "UNKNOWN",
}
@@ -128,52 +125,39 @@ class NagiosRunnerPlugin(MonitorPlugin):
Dictionary with results from all plugins
"""
results = {}
# Track overall status (worst status wins)
worst_status = NAGIOS_OK
for cmd_config in self.commands:
name = cmd_config.get("name")
command = cmd_config.get("command")
if not name or not command:
self.logger.warning("Skipping command with missing name or command")
continue
# Execute plugin
try:
status_code, output, perfdata = await self._run_nagios_plugin(command)
# Store results
results[f"{name}_status"] = STATUS_NAMES.get(status_code, "UNKNOWN")
results[f"{name}_status_code"] = status_code
results[f"{name}_output"] = output
# Track worst status
if status_code > worst_status:
worst_status = status_code
# Parse and add performance data
if perfdata:
for metric_name, metric_value in perfdata.items():
results[f"{name}_{metric_name}"] = metric_value
self.logger.info(
f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
)
except Exception as e:
self.logger.error(f"Error running {name}: {e}", exc_info=True)
results[f"{name}_status"] = "ERROR"
results[f"{name}_status_code"] = NAGIOS_UNKNOWN
results[f"{name}_output"] = str(e)
worst_status = NAGIOS_UNKNOWN
# Add overall status
results["overall_status"] = STATUS_NAMES.get(worst_status, "UNKNOWN")
results["overall_status_code"] = worst_status
results["plugin_count"] = len(self.commands)
return results
async def _run_nagios_plugin(
+3
View File
@@ -62,6 +62,9 @@ class OSInfoPlugin(InfoPlugin):
"hbc_version": hbc_version,
"hbc_type": "full",
}
if self.config.get("owner"):
self.logger.debug(f"Adding owner from config: {self.config['owner']}")
data["owner"] = self.config["owner"]
# Add Linux-specific distribution info
if platform.system() == "Linux":
+2 -6
View File
@@ -13,12 +13,8 @@ plugins:
count: 3 # ICMP packets per ping run (default 3)
timeout: 5 # seconds before a host is considered unreachable (default 5)
hosts:
8.8.8.8:
warning: 20.0 # ms
critical: 100.0 # ms
192.168.1.1:
warning: 5.0
critical: 20.0
- 8.8.8.8
- 192.168.1.1
```
Reported metrics per host (metric key uses the hostname with dots/colons replaced
+7 -1
View File
@@ -95,6 +95,12 @@ THRESHOLD_DEFAULTS = {
'warning': 200,
'critical': 250.0,
'count': 3 # Optional: number of consecutive breaches before alerting
},
'nagios_runner': {
'status_code': {
'display': '{check_name} {output}',
'operator': "nagios"
}
}
}
}
@@ -303,7 +309,7 @@ def get_host_access(config, hostname) -> dict:
"""
host_cfg = get_host_config(config, hostname)
owner = host_cfg.get("owner") or get_default_owner(config)
owner = host_cfg.get("owner") # or get_default_owner(config)
managers = host_cfg.get("managers", [])
if isinstance(managers, str):
+2
View File
@@ -475,6 +475,8 @@ def run(config, config_path=None):
if config.get("debug", 0) > 0:
log_level = logging.DEBUG
logging.basicConfig(level=log_level)
if not config.get("debug", 0):
logging.getLogger("aiohttp.access").propagate = False
load_pickled_hosts(config, hbdclass)
notify_mod.initlog(logfile=config.get("logfile", "messages.log"))
+10 -3
View File
@@ -106,11 +106,18 @@ def closelog():
def eventlog(host, lvl, m, service=None):
ts = time.time()
msg = {
"ts": ts,
"host": host or None,
"level": lvl,
"service": service,
"message": m,
}
data.msgs.append(msg)
s = f"{time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(ts))} {lvl} "
if host:
s += f"{host} "
s += m
data.msgs.append(s)
logger.info(s)
if logf:
try:
@@ -118,7 +125,7 @@ def eventlog(host, lvl, m, service=None):
logf.flush()
except Exception as e:
logger.warning("failed to write to logfile: %s", e)
msg_to_websockets("message", s)
msg_to_websockets("message", msg)
# ---------------------------------------------------------------------------
@@ -209,7 +216,7 @@ def _send_mattermost(channel_cfg: dict, notif: Notification) -> bool:
return False
text = f"**{notif.title}**\n{notif.body}"
if notif.url:
text += f"\n[Plugin metrics]({notif.url})"
text += f"\n[Plugin metrics] {notif.url}"
ses = {"url": host, "scheme": "http", "basepath": "/api/v4", "port": 8065}
mm = Driver(ses)
payload: dict = {"text": text, "channel": channel, "username": channel_cfg.get("username", "hbd")}
+15 -7
View File
@@ -4,7 +4,7 @@
<style>
body {
html, body {
height: auto;
overflow-y: auto;
}
@@ -175,14 +175,18 @@
.alert-hostname {
font-weight: bold;
color: #333;
color: #0066cc;
font-size: 1.1em;
text-decoration: none;
}
.alert-hostname:hover {
text-decoration: underline;
}
.alert-metric {
color: #666;
font-family: 'Courier New', monospace;
font-size: 0.9em;
color: #0066cc;
font-size: 1.1em;
font-weight: normal;
}
.alert-details {
@@ -405,6 +409,10 @@
} else if (alert.threshold_value !== undefined && alert.threshold_value !== null && alert.operator) {
valueText += ` <span class="threshold-info">(threshold: ${alert.operator} ${formatValue(alert.threshold_value)})</span>`;
}
if (alert.recovery_threshold !== undefined && alert.recovery_threshold !== null) {
const recOp = (alert.operator === '>' || alert.operator === '>=') ? '<' : '>';
valueText += ` <span class="threshold-info" style="color:#888">(recovers ${recOp} ${formatValue(alert.recovery_threshold)})</span>`;
}
// Build actions section
let actionsHtml = '';
@@ -429,9 +437,9 @@
<div class="alert-main">
<div class="alert-header">
<span class="alert-level ${level}">${alert.level}</span>
<span class="alert-hostname">${alert.hostname}</span>
<a class="alert-hostname" href="/plugins#${alert.hostname}">${alert.hostname}</a>
<span class="alert-metric">${(alert.metric_path.includes('.') ? alert.metric_path.slice(alert.metric_path.indexOf('.') + 1) : alert.metric_path).replace(/_status_code$/, '')}</span>
</div>
<div class="alert-metric">${alert.metric_path}</div>
<div class="alert-details">
<span>${valueText}</span>
<span class="alert-duration">Active for ${duration}</span>
+1 -1
View File
@@ -214,7 +214,7 @@
ctx.restore();
}
hand((m + s / 60) / 60 * Math.PI * 2 - Math.PI / 2,
hand((sFrac >= 58.5 ? m + 1 : m) / 60 * Math.PI * 2 - Math.PI / 2,
R * 0.88, -R * 0.12, SIZE * 0.027, '#222'); /* minute */
hand((h + m / 60) / 12 * Math.PI * 2 - Math.PI / 2,
R * 0.58, -R * 0.12, SIZE * 0.039, '#222'); /* hour */
+28 -2
View File
@@ -183,11 +183,24 @@
line-height: 1.0;
}
#messages div {
#messages .log-entry {
padding: 5px 0;
border-bottom: 1px solid #f0f0f0;
display: flex;
gap: 0.5em;
align-items: baseline;
}
.log-ts { color: #888; white-space: nowrap; }
.log-level { font-weight: bold; min-width: 6em; }
.log-host { font-weight: 600; }
.log-service { color: #888; }
.log-warning .log-level { color: #b8860b; }
.log-critical .log-level { color: #c00; }
.log-recover .log-level { color: #2a7a2a; }
.log-info .log-level { color: #555; }
/* Modal for connection status messages */
.connection-modal {
display: none;
@@ -460,7 +473,20 @@
update_table(state.data);
} else if (state.type == "message") {
var msgs = document.getElementById("messages");
msgs.insertAdjacentHTML("afterbegin", "<div>" + state.data + "</div>");
var msg = state.data;
var _d = new Date(msg.ts * 1000);
function _p(n) { return n < 10 ? '0' + n : '' + n; }
var ts_str = _d.getFullYear() + '-' + _p(_d.getMonth()+1) + '-' + _p(_d.getDate())
+ ' ' + _p(_d.getHours()) + ':' + _p(_d.getMinutes()) + ':' + _p(_d.getSeconds());
var lvl = (msg.level || "INFO").toLowerCase();
var html = '<div class="log-entry log-' + lvl + '">';
html += '<span class="log-ts">' + ts_str + '</span>';
html += '<span class="log-level">' + (msg.level || "") + '</span>';
if (msg.host) html += '<span class="log-host">' + msg.host + '</span>';
if (msg.service) html += '<span class="log-service">' + msg.service + '</span>';
html += '<span class="log-msg">' + msg.message + '</span>';
html += '</div>';
msgs.insertAdjacentHTML("afterbegin", html);
}
cnt++;
};
+20 -8
View File
@@ -499,6 +499,17 @@
return pluginCache[hostname]?.[pluginName] ?? null;
}
// Return worst nagios exit code (0-3) found in a nagios_runner data object.
function nagiosWorstStatus(data) {
let worst = 0;
for (const [k, v] of Object.entries(data || {})) {
if (k.endsWith('_status_code') && typeof v === 'number' && v > worst) {
worst = v;
}
}
return worst;
}
// ── Fetch helpers ───────────────────────────────────────────────────────
async function fetchPlugin(hostname, pluginName) {
@@ -600,13 +611,13 @@
? chips.join('')
: '<span class="glance-loading">—</span>';
// Nagios badge
// Nagios badge — derive worst status from individual check codes
const nagios = getCache(hostname, 'nagios_runner');
if (nagosBadge && nagios) {
const status = (nagios.data.overall_status || '—').toUpperCase();
const cls = status === 'OK' ? 'ok'
: status === 'WARNING' ? 'warning'
: status === 'CRITICAL' ? 'critical' : '';
const worst = nagiosWorstStatus(nagios.data);
const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
const status = names[worst] || '—';
const cls = worst === 0 ? 'ok' : worst === 1 ? 'warning' : worst >= 2 ? 'critical' : '';
nagosBadge.className = `nagios-badge ${cls}`;
nagosBadge.textContent = status;
}
@@ -715,9 +726,10 @@
break;
}
case 'nagios_runner': {
const status = (d.overall_status || '?').toUpperCase();
const count = d.plugin_count;
text = status + (count != null ? `${count} checks` : '');
const worst = nagiosWorstStatus(d);
const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
const codes = Object.keys(d).filter(k => k.endsWith('_status_code'));
text = (names[worst] || '?') + (codes.length ? `${codes.length} checks` : '');
break;
}
case 'filesystem_info': {
+211 -104
View File
@@ -30,12 +30,13 @@ class AlertLevel(Enum):
class ComparisonOperator(Enum):
"""Supported comparison operators for threshold checks."""
GT = ">" # Greater than
GTE = ">=" # Greater than or equal
LT = "<" # Less than
LTE = "<=" # Less than or equal
EQ = "==" # Equal to
NEQ = "!=" # Not equal to
GT = ">" # Greater than
GTE = ">=" # Greater than or equal
LT = "<" # Less than
LTE = "<=" # Less than or equal
EQ = "==" # Equal to
NEQ = "!=" # Not equal to
NAGIOS = "nagios" # Nagios exit-code semantics: 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
class AlertState:
@@ -57,6 +58,7 @@ class AlertState:
self.last_notification = None
self.threshold_value = None # The threshold value that triggered alert
self.operator = None # The comparison operator (>, <, >=, etc.)
self.hysteresis: Optional[float] = None # Hysteresis fraction used for recovery
self.formatted_message = None # Formatted display message for UI
self.acknowledged = False # Whether alert has been acknowledged
self.acknowledged_at = None # Timestamp when acknowledged
@@ -151,7 +153,16 @@ class AlertState:
result["operator"] = self.operator
if self.formatted_message is not None:
result["formatted_message"] = self.formatted_message
# Compute and expose the recovery threshold so the UI can display it
if (self.hysteresis and self.threshold_value is not None
and self.operator is not None):
ha = abs(self.threshold_value * self.hysteresis)
if self.operator in ('>', '>='):
result["recovery_threshold"] = round(self.threshold_value - ha, 4)
elif self.operator in ('<', '<='):
result["recovery_threshold"] = round(self.threshold_value + ha, 4)
return result
def __setstate__(self, state):
@@ -159,6 +170,8 @@ class AlertState:
self.__dict__.update(state)
if not hasattr(self, 'consecutive_count'):
self.consecutive_count = 0
if not hasattr(self, 'hysteresis'):
self.hysteresis = None
def acknowledge(self):
"""Acknowledge this alert to stop reminder notifications."""
@@ -217,33 +230,43 @@ class ThresholdConfig:
def evaluate(self, value: float) -> AlertLevel:
"""
Evaluate a value against this threshold.
Args:
value: Metric value to check
Returns:
AlertLevel indicating the severity
"""
if not self.enabled:
return AlertLevel.OK
# Nagios exit-code semantics: value IS the severity
if self.operator == ComparisonOperator.NAGIOS:
try:
code = int(value)
except (TypeError, ValueError):
return AlertLevel.UNKNOWN
return {0: AlertLevel.OK, 1: AlertLevel.WARNING, 2: AlertLevel.CRITICAL}.get(
code, AlertLevel.UNKNOWN
)
try:
# Convert value to float for comparison
value = float(value)
except (TypeError, ValueError):
logger.warning("Cannot convert value %s to float for %s", value, self.metric_path)
return AlertLevel.UNKNOWN
# Check critical threshold first
if self.critical is not None:
if self._compare(value, self.critical):
return AlertLevel.CRITICAL
# Then check warning threshold
if self.warning is not None:
if self._compare(value, self.warning):
return AlertLevel.WARNING
return AlertLevel.OK
def evaluate_with_hysteresis(
@@ -262,7 +285,11 @@ class ThresholdConfig:
New alert level considering hysteresis
"""
new_level = self.evaluate(value)
# Nagios exit codes are discrete integers — hysteresis doesn't apply
if self.operator == ComparisonOperator.NAGIOS:
return new_level
# If no hysteresis, return new level
if self.hysteresis == 0.0:
return new_level
@@ -392,14 +419,28 @@ class ThresholdChecker:
def _parse_config(self, config: Dict[str, Any]):
"""Parse threshold configuration from YAML structure.
Supports two formats:
1. Legacy format with direct 'thresholds' section
2. New format with 'threshold_configs' and 'host_threshold_mapping'
In all cases, THRESHOLD_DEFAULTS are seeded into threshold_configs["default"]
so the Settings page always shows the built-in defaults.
_parse_multi_config() overwrites this with the fully-merged effective defaults.
"""
# Always expose built-in defaults through threshold_configs["default"] so
# the Settings page has something to display even in legacy/no-config mode.
seed: Dict[str, ThresholdConfig] = {}
for plugin_name, plugin_thresholds in THRESHOLD_DEFAULTS.get("thresholds", {}).items():
if isinstance(plugin_thresholds, dict):
self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=seed)
if seed:
self.threshold_configs["default"] = seed
self.threshold_raw_configs["default"] = {}
# Check for new multi-config format
if "threshold_configs" in config:
self._parse_multi_config(config)
self._parse_multi_config(config) # overwrites threshold_configs["default"]
elif "thresholds" in config:
# Legacy single threshold configuration
self._parse_legacy_config(config)
@@ -545,11 +586,14 @@ class ThresholdChecker:
warning = threshold_config.get("warning")
critical = threshold_config.get("critical")
operator = threshold_config.get("operator", ">")
display = threshold_config.get("display", "(threshold: {op_symbol} {threshold_value})")
hysteresis = threshold_config.get("hysteresis", 0.1) # 10% default
# Nagios operator maps exit codes directly; no numeric thresholds needed
is_nagios_op = (operator == "nagios")
default_display = "{check_name}: {output}" if is_nagios_op else "(threshold: {op_symbol} {threshold_value})"
display = threshold_config.get("display", default_display)
hysteresis = threshold_config.get("hysteresis", 0.0 if is_nagios_op else 0.02)
enabled = threshold_config.get("enabled", True)
if warning is None and critical is None:
if warning is None and critical is None and not is_nagios_op:
logger.warning("No thresholds defined for %s, skipping", metric_path)
continue
@@ -649,7 +693,7 @@ class ThresholdChecker:
warning = rtt_thresholds.get("warning")
critical = rtt_thresholds.get("critical")
operator = rtt_thresholds.get("operator", ">")
hysteresis = rtt_thresholds.get("hysteresis", 0.1) # 10% default
hysteresis = rtt_thresholds.get("hysteresis", 0.02) # 2% default
enabled = rtt_thresholds.get("enabled", True)
display = rtt_thresholds.get("display")
count = rtt_thresholds.get("count", 1)
@@ -794,6 +838,12 @@ class ThresholdChecker:
elif new_level == AlertLevel.WARNING and threshold.warning is not None:
threshold_value = threshold.warning
# Keep hysteresis on the state so the UI can show the recovery threshold
if new_level != AlertLevel.OK:
alert_state.hysteresis = threshold.hysteresis
else:
alert_state.hysteresis = None
# Update state and check for changes
old_level = alert_state.level
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
@@ -805,26 +855,33 @@ class ThresholdChecker:
return None
def _find_threshold(
self, thresholds: Dict[str, "ThresholdConfig"], metric_path: str
) -> Optional["ThresholdConfig"]:
"""Return the threshold for *metric_path*, falling back to suffix matches.
) -> Tuple[Optional["ThresholdConfig"], Optional[str]]:
"""Return (threshold, check_name) for *metric_path*, falling back to suffix matches.
Allows generic thresholds like ``ping_monitor.rtt_avg`` to match
fully-qualified paths like ``ping_monitor.8_8_8_8_rtt_avg``.
Allows generic thresholds like ``nagios_runner.status_code`` to match
fully-qualified paths like ``nagios_runner.check_disk_root_status_code``.
The exact match is always tried first; then successive leading
underscore-delimited segments are stripped from the field name until
a match is found or no segments remain.
Returns:
(ThresholdConfig, None) for an exact match.
(ThresholdConfig, "check_disk_root") for a suffix match — the second
element is the stripped prefix, available as ``{check_name}`` in
display format templates.
(None, None) when no threshold is found.
"""
if metric_path in thresholds:
return thresholds[metric_path]
return thresholds[metric_path], None
plugin, sep, field = metric_path.partition(".")
if not sep:
return None
return None, None
parts = field.split("_")
for i in range(1, len(parts)):
candidate = plugin + "." + "_".join(parts[i:])
if candidate in thresholds:
return thresholds[candidate]
return None
return thresholds[candidate], "_".join(parts[:i])
return None, None
def check_plugin_data(
self,
@@ -853,37 +910,39 @@ class ThresholdChecker:
# Check flat metrics
for metric_name, value in data.items():
metric_path = f"{plugin_name}.{metric_name}"
threshold = self._find_threshold(thresholds, metric_path)
threshold, check_name = self._find_threshold(thresholds, metric_path)
if threshold is None:
continue
# Get or create alert state
if metric_path not in alert_states:
alert_states[metric_path] = AlertState(metric_path)
alert_state = alert_states[metric_path]
# Evaluate threshold with hysteresis
new_level = threshold.evaluate_with_hysteresis(
value,
alert_state.level
)
# Determine which threshold was exceeded
threshold_value = None
if new_level == AlertLevel.CRITICAL and threshold.critical is not None:
threshold_value = threshold.critical
elif new_level == AlertLevel.WARNING and threshold.warning is not None:
threshold_value = threshold.warning
alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
# Update state and check for changes
old_level = alert_state.level
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
state_changes.append((metric_path, old_level, new_level, value))
self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data, check_name=check_name, metric_name=metric_name)
elif new_level != AlertLevel.OK:
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data, check_name=check_name, metric_name=metric_name)
# Check nested metrics (e.g., partition data in disk_monitor)
self._check_nested_metrics(
@@ -942,7 +1001,9 @@ class ThresholdChecker:
threshold_value = threshold.critical
elif new_level == AlertLevel.WARNING and threshold.warning is not None:
threshold_value = threshold.warning
alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
old_level = alert_state.level
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
state_changes.append((metric_path, old_level, new_level, value))
@@ -959,6 +1020,8 @@ class ThresholdChecker:
value: Any,
threshold: ThresholdConfig,
plugin_data: Optional[Dict[str, Any]] = None,
check_name: Optional[str] = None,
metric_name: Optional[str] = None,
):
"""Trigger a notification for an alert state change.
@@ -980,56 +1043,54 @@ class ThresholdChecker:
# Format operator symbol
op_symbol = threshold.operator.value
# Short metric label: strip the plugin-name prefix and _status_code suffix
short_path = (metric_path.partition(".")[2] or metric_path).removesuffix("_status_code")
# Use a display-friendly value (inf is the sentinel for "overdue")
import math
display_value = "overdue" if isinstance(value, float) and math.isinf(value) else value
# Format message
if new_level == AlertLevel.OK:
lvl = "RECOVER"
message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
elif new_level == AlertLevel.WARNING:
lvl = "WARNING"
if threshold_value is not None:
threshold_info = self._format_display(
threshold.display,
value=display_value,
threshold_value=threshold_value,
op_symbol=op_symbol,
plugin_data=plugin_data
)
message = f"{metric_path} = {display_value} {threshold_info}"
else:
message = f"{metric_path} = {display_value}"
elif new_level == AlertLevel.CRITICAL:
lvl = "CRITICAL"
if threshold_value is not None:
threshold_info = self._format_display(
threshold.display,
value=display_value,
threshold_value=threshold_value,
op_symbol=op_symbol,
plugin_data=plugin_data
)
message = f"{metric_path} = {display_value} {threshold_info}"
else:
message = f"{metric_path} = {display_value}"
else:
lvl = "UNKNOWN"
message = f"{metric_path} = {display_value}"
# Return the formatted threshold info for storing in AlertState
formatted_threshold_msg = None
if threshold_value is not None and new_level != AlertLevel.OK:
formatted_threshold_msg = self._format_display(
# Format message — for the nagios operator there is no numeric threshold_value;
# render the display template whenever one is available.
has_display = threshold_value is not None or threshold.operator == ComparisonOperator.NAGIOS
def _fmt():
return self._format_display(
threshold.display,
value=display_value,
threshold_value=threshold_value,
op_symbol=op_symbol,
plugin_data=plugin_data
plugin_data=plugin_data,
check_name=check_name,
metric_name=metric_name,
)
if new_level == AlertLevel.OK:
lvl = "RECOVER"
message = f"{short_path} = {display_value} ({old_level.name} -> OK)"
elif new_level == AlertLevel.WARNING:
lvl = "WARNING"
if has_display:
message = f"{short_path} = {display_value} {_fmt()}"
else:
message = f"{short_path} = {display_value}"
elif new_level == AlertLevel.CRITICAL:
lvl = "CRITICAL"
if has_display:
message = f"{short_path} = {display_value} {_fmt()}"
else:
message = f"{short_path} = {display_value}"
else:
lvl = "UNKNOWN"
if has_display:
message = f"{short_path} = {display_value} {_fmt()}"
else:
message = f"{short_path} = {display_value}"
# Formatted threshold info stored on AlertState for the UI
formatted_threshold_msg = _fmt() if has_display and new_level != AlertLevel.OK else None
return lvl, message, formatted_threshold_msg
def _send_notification(
@@ -1048,11 +1109,16 @@ class ThresholdChecker:
if host is not None and not host.watched:
eventlog(host_name, lvl, message, service="threshold")
return
short_path = (metric_path.partition(".")[2] or metric_path).removesuffix("_status_code")
title = f"[{lvl}] {host_name} {short_path}"
# Strip the "metric = " prefix from message so body is just the value/detail
prefix = short_path + " = "
body = message[len(prefix):] if message.startswith(prefix) else message
asyncio.get_event_loop().create_task(notify_mod.send_notification(
host_name,
notify_mod.Notification(
title=f"[{lvl}] {host_name}",
body=message,
title=title,
body=body,
level=lvl,
),
))
@@ -1077,32 +1143,61 @@ class ThresholdChecker:
self,
display_format: str,
value: Any,
threshold_value: float,
threshold_value: Optional[float],
op_symbol: str,
plugin_data: Optional[Dict[str, Any]] = None,
check_name: Optional[str] = None,
metric_name: Optional[str] = None,
) -> str:
"""Format the display string using available data.
Args:
display_format: Format string from threshold config
value: Current metric value
threshold_value: Threshold value that was exceeded
op_symbol: Comparison operator symbol
plugin_data: Optional dictionary of plugin data fields
Available template variables:
{value} - current metric value
{threshold_value} - threshold that was exceeded
{op_symbol} - comparison operator (>, <, >=, <=, ==, !=)
{check_name} - prefix stripped for generic threshold match
(e.g. "check_disk_root" when metric
"check_disk_root_status_code" matched generic
threshold "status_code")
{metric_name} - field name within the plugin data dict
Any key from plugin_data is also available.
Returns:
Formatted display string
"""
if not display_format:
display_format = "(threshold: {op_symbol} {threshold_value})" if threshold_value is not None else ""
# Build format context with standard variables
format_context = {
'value': value,
'threshold_value': threshold_value,
'op_symbol': op_symbol,
}
if threshold_value is not None:
format_context['threshold_value'] = threshold_value
# Add generic-match context variables when available
if check_name is not None:
format_context['check_name'] = check_name
if metric_name is not None:
format_context['metric_name'] = metric_name
# Add all plugin data fields if available
if plugin_data:
format_context.update(plugin_data)
# For nagios_runner generic matches, expose the matched check's output
# and status as short aliases {output} and {status} so display templates
# don't need to use the full {check_disk_root_output} form.
if check_name and plugin_data:
if 'output' not in format_context:
output = plugin_data.get(f"{check_name}_output")
if output is not None:
format_context['output'] = output
if 'status' not in format_context:
status = plugin_data.get(f"{check_name}_status")
if status is not None:
format_context['status'] = status
try:
# Format the display string
@@ -1133,6 +1228,8 @@ class ThresholdChecker:
value: Any,
threshold: ThresholdConfig,
plugin_data: Optional[Dict[str, Any]],
check_name: Optional[str] = None,
metric_name: Optional[str] = None,
) -> None:
"""Handle a state-change transition with grace-period logic.
@@ -1145,7 +1242,8 @@ class ThresholdChecker:
- Past grace: fires the RECOVER notification normally.
"""
lvl, message, formatted_msg = self._trigger_notification(
host_name, metric_path, old_level, new_level, value, threshold, plugin_data
host_name, metric_path, old_level, new_level, value, threshold, plugin_data,
check_name=check_name, metric_name=metric_name,
)
alert_state.formatted_message = formatted_msg
@@ -1181,6 +1279,8 @@ class ThresholdChecker:
value: Any,
threshold: ThresholdConfig,
plugin_data: Optional[Dict[str, Any]],
check_name: Optional[str] = None,
metric_name: Optional[str] = None,
) -> None:
"""Called when alert level is unchanged and non-OK.
@@ -1190,7 +1290,8 @@ class ThresholdChecker:
if alert_state.pending_since is not None:
if time.time() - alert_state.pending_since >= self.grace_seconds:
lvl, message, formatted_msg = self._trigger_notification(
host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data
host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data,
check_name=check_name, metric_name=metric_name,
)
alert_state.formatted_message = formatted_msg
self._send_notification(
@@ -1199,7 +1300,7 @@ class ThresholdChecker:
alert_state.pending_since = None
# else: still within grace window, do nothing
else:
self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data)
self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data, check_name=check_name, metric_name=metric_name)
def _check_renotify(
self,
@@ -1209,6 +1310,8 @@ class ThresholdChecker:
value: Any,
threshold: ThresholdConfig,
plugin_data: Optional[Dict[str, Any]] = None,
check_name: Optional[str] = None,
metric_name: Optional[str] = None,
):
"""Check if we should send a repeat notification.
@@ -1246,7 +1349,8 @@ class ThresholdChecker:
# Format operator symbol
op_symbol = threshold.operator.value
short_path = (metric_path.partition(".")[2] or metric_path).removesuffix("_status_code")
# Time to re-notify
if threshold_value is not None:
# Use display format string
@@ -1255,20 +1359,23 @@ class ThresholdChecker:
value=value,
threshold_value=threshold_value,
op_symbol=op_symbol,
plugin_data=plugin_data
plugin_data=plugin_data,
check_name=check_name,
metric_name=metric_name,
)
message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
body = f"{value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
else:
message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
body = f"{value} (ongoing for {int(now - alert_state.since)}s)"
message = f"REMINDER ({alert_state.level.name}): {host_name} - {short_path} = {body}"
from . import hbdclass
host = hbdclass.Host.hosts.get(host_name)
if host is None or host.watched:
asyncio.get_event_loop().create_task(notify_mod.send_notification(
host_name,
notify_mod.Notification(
title=f"[REMINDER/{alert_state.level.name}] {host_name}",
body=message,
title=f"[REMINDER/{alert_state.level.name}] {host_name} {short_path}",
body=body,
level=alert_state.level.name,
),
))
@@ -1288,7 +1395,7 @@ class ThresholdChecker:
if not host.alert_states:
continue
configured = self.get_thresholds_for_host(hostname)
stale = [mp for mp in host.alert_states if mp not in configured]
stale = [mp for mp in host.alert_states if self._find_threshold(configured, mp)[0] is None]
for mp in stale:
logger.info(
"Purging stale alert state for %s / %s (no threshold configured)",
+9 -3
View File
@@ -336,8 +336,7 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
# Apply user-access settings from config
access = config_mod.get_host_access(cfg, uname)
host.apply_access(access["owner"], access["managers"], access["monitors"])
if verbose:
print(("XX: New host, num now %s" % (len(hbdcls.Host.hosts))))
logger.info("New host signed on: %s (dyn=%s, access=%s)", uname, host.dyn, access)
newh = True
else:
host = hbdcls.Host.hosts[uname]
@@ -351,8 +350,10 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
if msg.get("ID") == "HTB":
host.doesack = msg.get("acks", -1)
# send ACK back
# send ACK back; ask client to resend plugin info when we have none yet
rmsg = {"time": time.time()}
if not host.plugin_data:
rmsg["request_update"] = 1
opkt = dicttos("ACK", rmsg)
try:
transport.sendto(opkt, addr)
@@ -369,6 +370,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
if k not in ("ID", "plugin", "id", "name")}
# Store plugin data with timestamp
host.add_plugin_data(plugin_name, plugin_data, timestamp=now)
# If os_info reports an owner and none is configured server-side, apply it
if plugin_name == "os_info":
if not host.owner:
host.owner = plugin_data.get("owner", config_mod.get_default_owner(cfg))
if DEBUG > 1:
print(f"Stored plugin data for {uname}: {plugin_name}")
+6 -2
View File
@@ -85,11 +85,13 @@ async def handler(request):
except Exception as e:
logger.error("Error sending initial hosts: %s", e)
# Send recent messages
# Send recent messages, filtered to hosts this user may see
if data.msgs:
try:
for m in data.msgs:
await ws.send_str(json.dumps({"type": "message", "data": m}))
host_name = m.get("host") if isinstance(m, dict) else None
if not host_name or _user_can_see_host(user, host_name):
await ws.send_str(json.dumps({"type": "message", "data": m}))
except Exception as e:
logger.error("Error sending initial messages: %s", e)
@@ -128,6 +130,8 @@ def broadcast(typ: str, payload) -> bool:
host_name: Optional[str] = None
if typ in ("host", "plugin"):
host_name = payload.get("raw_name") or payload.get("host") or payload.get("name")
elif typ == "message" and isinstance(payload, dict):
host_name = payload.get("host")
jmsg = json.dumps({"type": typ, "data": payload})
+1 -1
View File
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "hbd"
version = "5.1.19"
version = "5.2.4"
description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
readme = "README.md"
requires-python = ">=3.11"
+36 -20
View File
@@ -41,7 +41,7 @@ from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
# updated by scripts/bumpminor.sh
__version__ = "5.1.19"
__version__ = "5.2.4"
# ---------------------------------------------------------------------------
# Protocol (mirrors hbd/common/proto.py)
@@ -114,6 +114,7 @@ def _stodict(data: bytes) -> Dict[str, Any]:
_DEFAULTS: Dict[str, Any] = {
"hb_port": 50003,
"interval": 10,
"owner": None,
"plugins": {},
}
@@ -239,6 +240,8 @@ class OSInfoPlugin(InfoPlugin):
"hbc_version": __version__,
"hbc_type": "mini",
}
if self.config.get("owner"):
data["owner"] = self.config["owner"]
if platform.system() == "Linux":
data.update(_linux_distro())
elif platform.system() == "Darwin":
@@ -388,7 +391,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
async def _collect_metrics(self) -> Dict[str, Any]:
results: Dict[str, Any] = {}
worst = 0
for cmd_cfg in self.commands:
name = cmd_cfg.get("name")
command = cmd_cfg.get("command")
@@ -399,10 +401,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
results[f"{name}_status_code"] = rc
results[f"{name}_output"] = msg
results.update({f"{name}_{k}": v for k, v in perf.items()})
worst = max(worst, rc)
results["overall_status"] = _NAGIOS_STATUS.get(worst, "UNKNOWN")
results["overall_status_code"] = worst
results["plugin_count"] = len(self.commands)
return results
@@ -721,7 +719,9 @@ async def _load_plugins(cfg: Dict[str, Any]) -> List[Plugin]:
plugins_cfg: Dict[str, Any] = cfg.get("plugins", {})
loaded: List[Plugin] = []
for cls in _ALL_PLUGIN_CLASSES:
plugin_cfg = plugins_cfg.get(cls.name) or cfg.get(cls.name, {})
plugin_cfg = dict(plugins_cfg.get(cls.name) or cfg.get(cls.name) or {})
if "owner" in cfg and "owner" not in plugin_cfg:
plugin_cfg["owner"] = cfg["owner"]
plugin: Plugin = cls(config=plugin_cfg)
try:
ok = await plugin.initialize()
@@ -791,7 +791,7 @@ class _HeartbeatProtocol(asyncio.DatagramProtocol):
msg_id = msg.get("ID")
now = time.time()
if msg_id == "ACK":
self._conn._handle_ack(now)
self._conn._handle_ack(msg, now)
elif msg_id == "CMD":
asyncio.create_task(_handle_command(self._conn, msg))
elif msg_id == "UPD":
@@ -802,8 +802,7 @@ class _HeartbeatProtocol(asyncio.DatagramProtocol):
self._log.error("datagram error: %s", e)
def error_received(self, exc):
self._log.warning("protocol error on %s: %sdropping connection", self._conn.addr, exc)
self._conn._dead = True
self._log.warning("protocol error on %s: %swill retry", self._conn.addr, exc)
self._conn.close()
@@ -819,6 +818,7 @@ class AsyncConnection:
self.rtts: List[float] = [0.0]
self._transport: Optional[asyncio.DatagramTransport] = None
self._dead = False
self._request_info: asyncio.Event = asyncio.Event()
self._log = logging.getLogger(f"hbc.conn.{addr}")
async def open(self) -> bool:
@@ -837,12 +837,14 @@ class AsyncConnection:
self._transport.close()
self._transport = None
def _handle_ack(self, now: float):
def _handle_ack(self, msg: Dict[str, Any], now: float):
rtt = (now - self.lastsend) * 1000.0
self.rtts.append(rtt)
if len(self.rtts) > 10:
self.rtts.pop(0)
self.ackcount += 1
if msg.get("request_update"):
self._request_info.set()
async def sendto(self, msg: Dict[str, Any], msg_id: str = "HTB"):
if self._dead:
@@ -975,6 +977,19 @@ async def _run_monitor_group(conn: AsyncConnection, plugins: List[Plugin], inter
await _sleep(interval)
async def _info_refresh_loop(conn: AsyncConnection, info: List[Plugin]):
log = logging.getLogger("hbc.plugins")
while _running:
await conn._request_info.wait()
if not _running:
break
conn._request_info.clear()
log.info("refreshing InfoPlugins on server request")
for plugin in info:
plugin._cache = None
await _run_info_plugins(conn, info)
async def _plugin_collector(conn: AsyncConnection, plugins: List[Plugin]):
info = [p for p in plugins if isinstance(p, InfoPlugin)]
monitor = [p for p in plugins if isinstance(p, MonitorPlugin)]
@@ -985,12 +1000,10 @@ async def _plugin_collector(conn: AsyncConnection, plugins: List[Plugin]):
for p in monitor:
by_interval[p.interval].append(p)
if by_interval:
await asyncio.gather(
*[asyncio.create_task(_run_monitor_group(conn, grp, iv))
for iv, grp in by_interval.items()],
return_exceptions=True,
)
tasks = [asyncio.create_task(_info_refresh_loop(conn, info))]
tasks += [asyncio.create_task(_run_monitor_group(conn, grp, iv))
for iv, grp in by_interval.items()]
await asyncio.gather(*tasks, return_exceptions=True)
# ---------------------------------------------------------------------------
@@ -1034,7 +1047,7 @@ def _reconfigure_syslog(level: int):
# ---------------------------------------------------------------------------
async def _async_main(args, cfg: Dict[str, Any]) -> int:
global _running, _shutdown_event, _active_tasks
global _running, _shutdown_event, _active_tasks, send_shutdown
_running = True
_shutdown_event = asyncio.Event()
_active_tasks = []
@@ -1044,7 +1057,7 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
port = cfg.get("hb_port", PORT)
interval = cfg.get("interval", INTERVAL)
log.info("starting: %s -> %s port=%d interval=%ds", iam, args.hosts, port, interval)
log.info("hbc_mini %s on %s -> %s port=%d interval=%ds",__version__, iam, args.hosts, port, interval)
connections: List[AsyncConnection] = []
conn_id = 1
@@ -1065,10 +1078,13 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
return 1
# Boot / one-shot message
send_shutdown = False
if args.boot or args.message:
bmsg: Dict[str, Any] = {"acks": 0}
if args.boot:
bmsg["boot"] = 1
args.boot = False # don't repeat on restart
send_shutdown = True
if args.message:
bmsg["service"] = "service"
bmsg["msg"] = args.message
@@ -1106,7 +1122,7 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
log.info("shutting down")
target = next((c for c in connections if c._transport), connections[0] if connections else None)
if target:
if target and send_shutdown:
try:
await target.sendto({"shutdown": 1, "acks": target.ackcount})
except Exception:
+1 -2
View File
@@ -68,8 +68,7 @@ async def test_nagios_runner():
print(f" ✓ Collected {len(data)} data points")
print(f"\n4. Results:")
print(f" Overall Status: {data.get('overall_status')} (code: {data.get('overall_status_code')})")
print(f" Plugins Executed: {data.get('plugin_count')}")
print(f" Data points collected: {len(data)}")
# Show individual plugin results
print(f"\n5. Individual Plugin Results:")