Compare commits

..

8 Commits

Author SHA1 Message Date
andreas 74c89d098c version 5.1.17
Release / release (push) Successful in 5s
2026-05-04 08:04:01 -04:00
andreas 3301dbfe34 feat: owner Update/Delete buttons on Host Overview; purge stale alerts on reload
Host Overview (plugins.html): show Update and Delete buttons in the
host-right zone when the logged-in user is the host owner (or admin /
unauthenticated mode). Buttons link to /u?h=<host> and /d?h=<host>
with stopPropagation so they don't toggle the accordion; Delete prompts
for confirmation first.

ThresholdChecker.purge_stale_alerts(): removes alert states whose
metric_path has no matching threshold in the current config. Called
after startup pickle restore and after every SIGHUP config reload so
alerts orphaned by upgrades or config changes do not persist
indefinitely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 08:03:46 -04:00
andreas d00d903e7d fix: make Alerts page scrollable
Override the global style.css body height/overflow that locks all pages
to the viewport height (a remnant of the old drawer-menu layout).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 13:33:08 +02:00
andreas babb5d61aa docs: update README with changes since 917d6a4
- ZFS monitor plugin (zfs_monitor) added to plugin list and features
- nagios_runner: async execution, stderr capture, signal handling, path validation
- Threshold alerting: de-escalation suppression, short-duration suppression, ping_monitor thresholds
- Per-host watch flag and role-filtered dashboards
- HTTP API & Web UI: hostname links in Live View, Host Overview with ZFS renderer, alert pie chart in nav bar, Settings threshold viewer
- hbc connection retry: indefinite retry for IPv4; IPv6 dropped after 3 early startup failures
- hbc daemon mode: logs routed to syslog after daemonizing
- hbc_mini: noted zfs_monitor and IPv6 early-fail protection not available

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 12:46:35 +02:00
andreas 11d1c718b3 feat: retry AsyncConnection.open() indefinitely; drop IPv6 only on early startup failure
IPv4 connections are retried forever in heartbeat_sender if open() fails,
so a temporary network outage does not terminate the sender.

IPv6 connections that have never opened successfully are dropped after
IPV6_EARLY_FAIL_LIMIT (3) consecutive failures so that a network without
IPv6 support does not keep a dead sender running.

At startup all resolved connections are added to the list regardless of
whether the initial open() succeeds; the heartbeat_sender loop handles
the first real connection attempt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 12:29:35 +02:00
Andreas Wrede a99b6b54c7 feat: add alert pie chart to nav bar
Show a colour-coded pie chart (red=critical, yellow=warning, green=ok)
to the left of the clock in the nav bar. Backed by a new
GET /api/0/alert_summary endpoint that counts hosts per alert level
for the current user's visible hosts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-03 13:45:15 -04:00
Andreas Wrede 8da3d550eb version 5.1.16
Release / release (push) Successful in 5s
2026-05-03 06:08:14 -04:00
Andreas Wrede a76d0fc840 feat: generic ping_monitor thresholds; round RTT to nearest ms
- threshold.py: add _find_threshold() with suffix fallback so thresholds
  like ping_monitor.rtt_avg match ping_monitor.8_8_8_8_rtt_avg etc.;
  each pinged host keeps its own alert state
- hbdclass.py: format RTT as integer ms (round())
- live.html: JS RTT display rounded to nearest ms (Math.round)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 06:08:11 -04:00
14 changed files with 248 additions and 36 deletions
+21 -5
View File
@@ -27,6 +27,7 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
- Configurable retention and backup management
- **Plugin system for extensible monitoring** ✅
- Collect system metrics (CPU, memory, disk, network)
- Monitor ZFS pool health, capacity, and I/O via `zpool(8)`
- Execute existing Nagios monitoring plugins
- Create custom plugins with simple Python classes
- **Threshold alerting system** ✅
@@ -34,6 +35,8 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
- Hysteresis to prevent alert flapping
- Automatic notifications on state changes
- Re-notification for ongoing alerts
- **Per-host watch flag** — set `watch: false` on any host to silence all notifications for that host without removing its configuration ✅
- **Role-filtered dashboards** — Live Dashboard and Host Overview show only hosts where the logged-in user is owner or manager (admins see all) ✅
- Modular codebase suitable for unit testing and CI ✅
---
@@ -61,12 +64,16 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
- `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`
### Nagios Integration
The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:
- Executes plugins via subprocess with timeout protection
- Executes plugins asynchronously (non-blocking) with timeout protection
- Captures both stdout and stderr; if stdout is empty, stderr is used as the status message
- Handles signal-killed processes (negative exit code → UNKNOWN status)
- Validates absolute command paths at startup and warns on missing or non-executable files
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
- Extracts performance data with thresholds
- Reports aggregated status across all configured checks
@@ -147,9 +154,11 @@ Heartbeat includes a sophisticated threshold alerting system that monitors plugi
- **Multi-level alerts**: WARNING and CRITICAL severity levels
- **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
- **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
- **Smart notifications**: Alerts only on state changes, not every check
- **Smart notifications**: Alerts only on state changes, not every check; de-escalations (e.g. CRITICAL → WARNING) do not generate a notification
- **Re-notifications**: Periodic reminders for ongoing alerts
- **Short-duration suppression**: Recovery notifications are suppressed for down events under 4 seconds (avoids noise from transient blips)
- **Journal integration**: All threshold events logged for audit trail
- **`ping_monitor` thresholds**: Latency and packet-loss thresholds use the same format as all other plugin metrics
### Configuration
@@ -363,9 +372,10 @@ Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST AP
### Web Dashboards
- **Login** (`/login`): Browser login form (shown automatically when auth is configured)
- **Live View** (`/live`): Real-time host connectivity, latency, and messages
- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering
- **Live View** (`/live`): Real-time host connectivity, latency, and messages; hostnames link directly to the Host Overview page
- **Host Overview** (`/plugins/<host>`): Per-host plugin metrics with ZFS pool visualization; filtered to hosts where the logged-in user is owner or manager (admins see all)
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering; alert count pie chart shown in the navigation bar
- **Settings** (`/settings`): Server configuration, user management, and threshold configuration viewer
### API Endpoints
@@ -476,6 +486,10 @@ plugins:
All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
**Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
**Daemon logging:** When running with `-d`, `hbc` routes all log output to syslog (`LOG_DAEMON` facility) after daemonizing. Without `-d`, logs go to stderr as usual.
### hbc_mini — single-file client (no external dependencies)
`scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly — no virtualenv, no `pip install`.
@@ -531,8 +545,10 @@ python3 hbc_mini.py -m "maintenance starting" your-server.example.com
- No YAML config (use JSON instead)
- No `filesystem_info` plugin
- No `zfs_monitor` plugin (requires `zpool(8)` and the full plugin loader)
- `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
- Plugins cannot be loaded from external `.py` files — all plugins are compiled in
- No IPv6 early-fail protection — connections that fail to open at startup are silently skipped rather than retried
Everything else — heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog — is identical to the full client.
+1 -1
View File
@@ -14,4 +14,4 @@ Install options:
"""
__all__ = ["__version__"]
__version__ = "5.1.15"
__version__ = "5.1.17"
+47 -10
View File
@@ -56,6 +56,8 @@ class AsyncConnection:
self.transport: Optional[asyncio.DatagramTransport] = None
self.protocol: Optional[asyncio.DatagramProtocol] = None
self._dead = False
self._ever_opened = False
self._open_fail_count = 0 # consecutive failures before first success
self.logger = logging.getLogger(f"hbc.conn.{addr}")
@@ -73,6 +75,7 @@ class AsyncConnection:
lambda: HeartbeatProtocol(self),
family=self.af
)
self._ever_opened = True
self.logger.debug(f"Opened connection to {self.addr}:{self.port}")
return True
except Exception as e:
@@ -262,15 +265,51 @@ async def handle_update(conn: AsyncConnection, _msg: dict): # pyright: ignore[r
async def heartbeat_sender(conn: AsyncConnection, interval: int):
"""Send periodic heartbeats.
"""Send periodic heartbeats, retrying the connection if it is not open.
IPv6 connections that fail to open before their first successful send are
dropped after IPV6_EARLY_FAIL_LIMIT attempts so that a network without IPv6
does not keep a dead sender alive. IPv4 connections are retried indefinitely.
Args:
conn: Connection to send on
interval: Heartbeat interval in seconds
"""
logger = logging.getLogger("hbc.heartbeat")
IPV6_EARLY_FAIL_LIMIT = 3
while running and not conn._dead:
# Ensure transport is open before attempting to send.
if not conn.transport:
opened = await conn.open()
if opened:
conn._open_fail_count = 0
else:
conn._open_fail_count += 1
# Drop an IPv6 connection that has never come up within the
# first few attempts — it is likely unavailable on this network.
if (not conn._ever_opened
and conn.af == socket.AF_INET6
and conn._open_fail_count >= IPV6_EARLY_FAIL_LIMIT):
logger.warning(
f"IPv6 connection to {conn.addr} unreachable after "
f"{conn._open_fail_count} attempts, disabling"
)
conn._dead = True
break
# Retry after the normal interval; IPv4 retries forever.
try:
if shutdown_event:
await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
break
else:
await asyncio.sleep(interval)
except asyncio.TimeoutError:
pass
except asyncio.CancelledError:
raise
continue
while running:
try:
msg = {
"acks": conn.ackcount,
@@ -279,19 +318,16 @@ async def heartbeat_sender(conn: AsyncConnection, interval: int):
}
await conn.sendto(msg, "HTB")
except Exception as e:
logger.error(f"Error sending heartbeat: {e}", exc_info=True)
except asyncio.CancelledError:
logger.debug("Heartbeat sender cancelled")
raise
except Exception as e:
logger.error(f"Error sending heartbeat: {e}", exc_info=True)
# Wait for next interval or shutdown event
try:
if shutdown_event:
await asyncio.wait_for(
shutdown_event.wait(),
timeout=interval
)
await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
break
else:
await asyncio.sleep(interval)
@@ -481,12 +517,13 @@ async def async_main(args, config):
addr = addr_info[4][0]
conn = AsyncConnection(conn_id, addr, hb_port, af, iam)
if await conn.open():
if not await conn.open():
logger.warning(f"Initial open to {addr} failed, heartbeat sender will retry")
connections.append(conn)
conn_id += 1
if not connections:
logger.error("No connections established")
logger.error("No connections established (DNS resolution failed for all hosts)")
return 1
logger.info(f"Created {len(connections)} connections")
+1 -1
View File
@@ -95,7 +95,7 @@ class Connection:
if not Null:
d["addr"] = self.addr
if self.rtts[-1]:
d["rtt"] = "%0.1f" % self.rtts[-1]
d["rtt"] = "%d" % round(self.rtts[-1])
elif self.state == Connection.UNKNOWN:
d["rtt"] = ""
else:
+21
View File
@@ -154,6 +154,25 @@ async def start(
lst = [h.jsons() for h in hosts]
return web.json_response(json.loads("[" + ",".join(lst) + "]"))
async def api_alert_summary(request):
"""GET /api/0/alert_summary — counts of ok/warning/critical hosts visible to caller."""
user, err = _require_auth(request)
if err:
return err
from .threshold import AlertLevel
critical = warning = ok = 0
for host in hbdclass.Host.hosts.values():
if not _can_operate_host(user, host):
continue
levels = {s.level for s in host.alert_states.values()}
if AlertLevel.CRITICAL in levels:
critical += 1
elif AlertLevel.WARNING in levels:
warning += 1
else:
ok += 1
return web.json_response({"critical": critical, "warning": warning, "ok": ok})
async def api_messages(request):
lst = data.msgs[-30:]
return web.json_response(lst)
@@ -518,6 +537,7 @@ async def start(
hosts_with_plugins.append({
"name": hostname,
"plugins": list(host.plugin_data.keys()),
"is_owner": _can_own_host(current_user, host),
})
tmpl = env.get_template("plugins.html")
@@ -893,6 +913,7 @@ async def start(
web.get("/api/0/users/{username}/avatar", api_user_avatar),
# Hosts
web.get("/api/0/hosts", api_hosts),
web.get("/api/0/alert_summary", api_alert_summary),
web.get("/api/0/messages", api_messages),
web.get("/api/0/hosts/{hostname}/plugins", api_host_plugins),
web.get("/api/0/hosts/{hostname}/plugins/{plugin_name}", api_host_plugin_detail),
+6 -1
View File
@@ -101,9 +101,10 @@ async def reload_configuration(config_obj, config_path, components):
access = config_mod.get_host_access(new_config, hostname)
host.apply_access(access["owner"], access["managers"], access["monitors"])
# Reload threshold checker
# Reload threshold checker and prune alerts orphaned by the new config
if 'threshold_checker' in components:
components['threshold_checker'].reload(new_config)
components['threshold_checker'].purge_stale_alerts(hbdclass)
# Note: Changes to the following require restart:
# - hb_port, hbd_port, ws_port (already bound)
@@ -241,6 +242,10 @@ async def _run_async(config, config_path=None):
)
udp.restore_connection_timers(hbdclass, restore_ctx)
# Drop alert states that no longer have a matching threshold (stale after
# upgrade or config change between runs).
threshold_checker.purge_stale_alerts(hbdclass)
# HTTP server (asyncio-based via aiohttp)
try:
http_task = asyncio.create_task(
+5
View File
@@ -4,6 +4,11 @@
<style>
body {
height: auto;
overflow-y: auto;
}
.container {
max-width: 1400px;
margin: 0 auto;
+7 -1
View File
@@ -126,11 +126,17 @@
}
/* Swiss railway clock — nav */
.nav-clock {
.nav-pie {
flex-shrink: 0;
line-height: 0;
margin-left: auto;
padding: 4px 4px 4px 0;
}
#alert-pie { display: block; cursor: default; }
.nav-clock {
flex-shrink: 0;
line-height: 0;
padding: 4px 4px 4px 0;
cursor: pointer;
}
#swiss-clock { display: block; }
+1 -1
View File
@@ -408,7 +408,7 @@
);
if (data.connections[i].state == "up") {
state = '<span class="state-up">up</span>';
latency = Number.parseFloat(data.connections[i].rtts[0]).toFixed(2);
latency = String(Math.round(Number.parseFloat(data.connections[i].rtts[0])));
} else {
if (data.connections[i].state == "unknown") {
state = "";
+51
View File
@@ -11,6 +11,9 @@
{% endif %}
<a href="/about"{% if active_page == "about" %} class="active"{% endif %}>About</a>
</div>
<div class="nav-pie" title="Host alert status">
<canvas id="alert-pie" width="44" height="44"></canvas>
</div>
<div class="nav-clock" title="Click for full-screen clock">
<canvas id="swiss-clock" width="44" height="44"></canvas>
</div>
@@ -42,4 +45,52 @@
});
}
})();
function drawAlertPie(critical, warning, ok) {
var canvas = document.getElementById('alert-pie');
if (!canvas) return;
var ctx = canvas.getContext('2d');
var SIZE = canvas.width;
var R = SIZE / 2;
ctx.clearRect(0, 0, SIZE, SIZE);
var total = critical + warning + ok;
if (total === 0) {
ctx.beginPath();
ctx.arc(R, R, R - 1, 0, Math.PI * 2);
ctx.fillStyle = '#ccc';
ctx.fill();
return;
}
var slices = [
{ value: critical, color: '#e53935' },
{ value: warning, color: '#ffb300' },
{ value: ok, color: '#43a047' }
];
var start = -Math.PI / 2;
slices.forEach(function(s) {
if (s.value === 0) return;
var sweep = (s.value / total) * Math.PI * 2;
ctx.beginPath();
ctx.moveTo(R, R);
ctx.arc(R, R, R - 1, start, start + sweep);
ctx.closePath();
ctx.fillStyle = s.color;
ctx.fill();
start += sweep;
});
}
function updateAlertPie() {
fetch('/api/0/alert_summary').then(function(r) {
if (!r.ok) return;
return r.json();
}).then(function(d) {
if (d) drawAlertPie(d.critical || 0, d.warning || 0, d.ok || 0);
}).catch(function() {});
}
document.addEventListener('DOMContentLoaded', function() {
updateAlertPie();
setInterval(updateAlertPie, 30000);
});
</script>
+29
View File
@@ -131,6 +131,27 @@
text-overflow: ellipsis;
}
.host-action-btn {
font-size: 0.75em;
font-weight: bold;
padding: 3px 10px;
border-radius: 4px;
border: none;
cursor: pointer;
text-decoration: none;
white-space: nowrap;
}
.host-action-btn.update-btn {
background: #e3f2fd;
color: #1565c0;
}
.host-action-btn.update-btn:hover { background: #bbdefb; }
.host-action-btn.delete-btn {
background: #ffebee;
color: #c62828;
}
.host-action-btn.delete-btn:hover { background: #ffcdd2; }
/* ── Host body ──────────────────────────────────────────────── */
.host-body {
@@ -379,6 +400,14 @@
<span class="nagios-badge" id="nagios-badge-{{ host.name }}"></span>
{% endif %}
<span class="os-label" id="os-label-{{ host.name }}"></span>
{% if host.is_owner %}
<a class="host-action-btn update-btn"
href="/u?h={{ host.name }}"
onclick="event.stopPropagation()">Update</a>
<a class="host-action-btn delete-btn"
href="/d?h={{ host.name }}"
onclick="event.stopPropagation(); return confirm('Delete host {{ host.name }}?')">Delete</a>
{% endif %}
</div>
</div>
+45 -3
View File
@@ -803,6 +803,29 @@ class ThresholdChecker:
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)
return None
def _find_threshold(
self, thresholds: Dict[str, "ThresholdConfig"], metric_path: str
) -> Optional["ThresholdConfig"]:
"""Return the threshold for *metric_path*, falling back to suffix matches.
Allows generic thresholds like ``ping_monitor.rtt_avg`` to match
fully-qualified paths like ``ping_monitor.8_8_8_8_rtt_avg``.
The exact match is always tried first; then successive leading
underscore-delimited segments are stripped from the field name until
a match is found or no segments remain.
"""
if metric_path in thresholds:
return thresholds[metric_path]
plugin, sep, field = metric_path.partition(".")
if not sep:
return None
parts = field.split("_")
for i in range(1, len(parts)):
candidate = plugin + "." + "_".join(parts[i:])
if candidate in thresholds:
return thresholds[candidate]
return None
def check_plugin_data(
self,
host_name: str,
@@ -831,11 +854,10 @@ class ThresholdChecker:
for metric_name, value in data.items():
metric_path = f"{plugin_name}.{metric_name}"
if metric_path not in thresholds:
threshold = self._find_threshold(thresholds, metric_path)
if threshold is None:
continue
threshold = thresholds[metric_path]
# Get or create alert state
if metric_path not in alert_states:
alert_states[metric_path] = AlertState(metric_path)
@@ -1254,6 +1276,26 @@ class ThresholdChecker:
alert_state.last_notification = now
alert_state.notification_count += 1
def purge_stale_alerts(self, hbdclass) -> None:
"""Remove alert states that have no matching threshold configuration.
Called after startup (pickle restore) and after each config reload so
that alerts orphaned by configuration changes do not linger forever.
Alerts whose metric_path is not present in the current threshold config
for that host are silently dropped.
"""
for hostname, host in hbdclass.Host.hosts.items():
if not host.alert_states:
continue
configured = self.get_thresholds_for_host(hostname)
stale = [mp for mp in host.alert_states if mp not in configured]
for mp in stale:
logger.info(
"Purging stale alert state for %s / %s (no threshold configured)",
hostname, mp,
)
del host.alert_states[mp]
def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
"""
Get all currently active (non-OK) alerts.
+1 -1
View File
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "hbd"
version = "5.1.15"
version = "5.1.17"
description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
readme = "README.md"
requires-python = ">=3.11"
+1 -1
View File
@@ -41,7 +41,7 @@ from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
# updated by scripts/bumpminor.sh
__version__ = "5.1.15"
__version__ = "5.1.17"
# ---------------------------------------------------------------------------
# Protocol (mirrors hbd/common/proto.py)