Compare commits
8 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 74c89d098c | |||
| 3301dbfe34 | |||
| d00d903e7d | |||
| babb5d61aa | |||
| 11d1c718b3 | |||
| a99b6b54c7 | |||
| 8da3d550eb | |||
| a76d0fc840 |
@@ -27,6 +27,7 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
|
|||||||
- Configurable retention and backup management
|
- Configurable retention and backup management
|
||||||
- **Plugin system for extensible monitoring** ✅
|
- **Plugin system for extensible monitoring** ✅
|
||||||
- Collect system metrics (CPU, memory, disk, network)
|
- Collect system metrics (CPU, memory, disk, network)
|
||||||
|
- Monitor ZFS pool health, capacity, and I/O via `zpool(8)`
|
||||||
- Execute existing Nagios monitoring plugins
|
- Execute existing Nagios monitoring plugins
|
||||||
- Create custom plugins with simple Python classes
|
- Create custom plugins with simple Python classes
|
||||||
- **Threshold alerting system** ✅
|
- **Threshold alerting system** ✅
|
||||||
@@ -34,6 +35,8 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
|
|||||||
- Hysteresis to prevent alert flapping
|
- Hysteresis to prevent alert flapping
|
||||||
- Automatic notifications on state changes
|
- Automatic notifications on state changes
|
||||||
- Re-notification for ongoing alerts
|
- Re-notification for ongoing alerts
|
||||||
|
- **Per-host watch flag** — set `watch: false` on any host to silence all notifications for that host without removing its configuration ✅
|
||||||
|
- **Role-filtered dashboards** — Live Dashboard and Host Overview show only hosts where the logged-in user is owner or manager (admins see all) ✅
|
||||||
- Modular codebase suitable for unit testing and CI ✅
|
- Modular codebase suitable for unit testing and CI ✅
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -61,12 +64,16 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
|
|||||||
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
|
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
|
||||||
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
|
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
|
||||||
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
|
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
|
||||||
|
- `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`
|
||||||
|
|
||||||
### Nagios Integration
|
### Nagios Integration
|
||||||
|
|
||||||
The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:
|
The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:
|
||||||
|
|
||||||
- Executes plugins via subprocess with timeout protection
|
- Executes plugins asynchronously (non-blocking) with timeout protection
|
||||||
|
- Captures both stdout and stderr; if stdout is empty, stderr is used as the status message
|
||||||
|
- Handles signal-killed processes (negative exit code → UNKNOWN status)
|
||||||
|
- Validates absolute command paths at startup and warns on missing or non-executable files
|
||||||
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
|
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
|
||||||
- Extracts performance data with thresholds
|
- Extracts performance data with thresholds
|
||||||
- Reports aggregated status across all configured checks
|
- Reports aggregated status across all configured checks
|
||||||
@@ -147,9 +154,11 @@ Heartbeat includes a sophisticated threshold alerting system that monitors plugi
|
|||||||
- **Multi-level alerts**: WARNING and CRITICAL severity levels
|
- **Multi-level alerts**: WARNING and CRITICAL severity levels
|
||||||
- **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
|
- **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
|
||||||
- **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
|
- **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
|
||||||
- **Smart notifications**: Alerts only on state changes, not every check
|
- **Smart notifications**: Alerts only on state changes, not every check; de-escalations (e.g. CRITICAL → WARNING) do not generate a notification
|
||||||
- **Re-notifications**: Periodic reminders for ongoing alerts
|
- **Re-notifications**: Periodic reminders for ongoing alerts
|
||||||
|
- **Short-duration suppression**: Recovery notifications are suppressed for down events under 4 seconds (avoids noise from transient blips)
|
||||||
- **Journal integration**: All threshold events logged for audit trail
|
- **Journal integration**: All threshold events logged for audit trail
|
||||||
|
- **`ping_monitor` thresholds**: Latency and packet-loss thresholds use the same format as all other plugin metrics
|
||||||
|
|
||||||
### Configuration
|
### Configuration
|
||||||
|
|
||||||
@@ -363,9 +372,10 @@ Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST AP
|
|||||||
### Web Dashboards
|
### Web Dashboards
|
||||||
|
|
||||||
- **Login** (`/login`): Browser login form (shown automatically when auth is configured)
|
- **Login** (`/login`): Browser login form (shown automatically when auth is configured)
|
||||||
- **Live View** (`/live`): Real-time host connectivity, latency, and messages
|
- **Live View** (`/live`): Real-time host connectivity, latency, and messages; hostnames link directly to the Host Overview page
|
||||||
- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins
|
- **Host Overview** (`/plugins/<host>`): Per-host plugin metrics with ZFS pool visualization; filtered to hosts where the logged-in user is owner or manager (admins see all)
|
||||||
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering
|
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering; alert count pie chart shown in the navigation bar
|
||||||
|
- **Settings** (`/settings`): Server configuration, user management, and threshold configuration viewer
|
||||||
|
|
||||||
### API Endpoints
|
### API Endpoints
|
||||||
|
|
||||||
@@ -476,6 +486,10 @@ plugins:
|
|||||||
|
|
||||||
All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
|
All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
|
||||||
|
|
||||||
|
**Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
|
||||||
|
|
||||||
|
**Daemon logging:** When running with `-d`, `hbc` routes all log output to syslog (`LOG_DAEMON` facility) after daemonizing. Without `-d`, logs go to stderr as usual.
|
||||||
|
|
||||||
### hbc_mini — single-file client (no external dependencies)
|
### hbc_mini — single-file client (no external dependencies)
|
||||||
|
|
||||||
`scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly — no virtualenv, no `pip install`.
|
`scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly — no virtualenv, no `pip install`.
|
||||||
@@ -531,8 +545,10 @@ python3 hbc_mini.py -m "maintenance starting" your-server.example.com
|
|||||||
|
|
||||||
- No YAML config (use JSON instead)
|
- No YAML config (use JSON instead)
|
||||||
- No `filesystem_info` plugin
|
- No `filesystem_info` plugin
|
||||||
|
- No `zfs_monitor` plugin (requires `zpool(8)` and the full plugin loader)
|
||||||
- `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
|
- `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
|
||||||
- Plugins cannot be loaded from external `.py` files — all plugins are compiled in
|
- Plugins cannot be loaded from external `.py` files — all plugins are compiled in
|
||||||
|
- No IPv6 early-fail protection — connections that fail to open at startup are silently skipped rather than retried
|
||||||
|
|
||||||
Everything else — heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog — is identical to the full client.
|
Everything else — heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog — is identical to the full client.
|
||||||
|
|
||||||
|
|||||||
+1
-1
@@ -14,4 +14,4 @@ Install options:
|
|||||||
"""
|
"""
|
||||||
|
|
||||||
__all__ = ["__version__"]
|
__all__ = ["__version__"]
|
||||||
__version__ = "5.1.15"
|
__version__ = "5.1.17"
|
||||||
|
|||||||
+47
-10
@@ -56,6 +56,8 @@ class AsyncConnection:
|
|||||||
self.transport: Optional[asyncio.DatagramTransport] = None
|
self.transport: Optional[asyncio.DatagramTransport] = None
|
||||||
self.protocol: Optional[asyncio.DatagramProtocol] = None
|
self.protocol: Optional[asyncio.DatagramProtocol] = None
|
||||||
self._dead = False
|
self._dead = False
|
||||||
|
self._ever_opened = False
|
||||||
|
self._open_fail_count = 0 # consecutive failures before first success
|
||||||
|
|
||||||
self.logger = logging.getLogger(f"hbc.conn.{addr}")
|
self.logger = logging.getLogger(f"hbc.conn.{addr}")
|
||||||
|
|
||||||
@@ -73,6 +75,7 @@ class AsyncConnection:
|
|||||||
lambda: HeartbeatProtocol(self),
|
lambda: HeartbeatProtocol(self),
|
||||||
family=self.af
|
family=self.af
|
||||||
)
|
)
|
||||||
|
self._ever_opened = True
|
||||||
self.logger.debug(f"Opened connection to {self.addr}:{self.port}")
|
self.logger.debug(f"Opened connection to {self.addr}:{self.port}")
|
||||||
return True
|
return True
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
@@ -262,15 +265,51 @@ async def handle_update(conn: AsyncConnection, _msg: dict): # pyright: ignore[r
|
|||||||
|
|
||||||
|
|
||||||
async def heartbeat_sender(conn: AsyncConnection, interval: int):
|
async def heartbeat_sender(conn: AsyncConnection, interval: int):
|
||||||
"""Send periodic heartbeats.
|
"""Send periodic heartbeats, retrying the connection if it is not open.
|
||||||
|
|
||||||
|
IPv6 connections that fail to open before their first successful send are
|
||||||
|
dropped after IPV6_EARLY_FAIL_LIMIT attempts so that a network without IPv6
|
||||||
|
does not keep a dead sender alive. IPv4 connections are retried indefinitely.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
conn: Connection to send on
|
conn: Connection to send on
|
||||||
interval: Heartbeat interval in seconds
|
interval: Heartbeat interval in seconds
|
||||||
"""
|
"""
|
||||||
logger = logging.getLogger("hbc.heartbeat")
|
logger = logging.getLogger("hbc.heartbeat")
|
||||||
|
IPV6_EARLY_FAIL_LIMIT = 3
|
||||||
|
|
||||||
|
while running and not conn._dead:
|
||||||
|
# Ensure transport is open before attempting to send.
|
||||||
|
if not conn.transport:
|
||||||
|
opened = await conn.open()
|
||||||
|
if opened:
|
||||||
|
conn._open_fail_count = 0
|
||||||
|
else:
|
||||||
|
conn._open_fail_count += 1
|
||||||
|
# Drop an IPv6 connection that has never come up within the
|
||||||
|
# first few attempts — it is likely unavailable on this network.
|
||||||
|
if (not conn._ever_opened
|
||||||
|
and conn.af == socket.AF_INET6
|
||||||
|
and conn._open_fail_count >= IPV6_EARLY_FAIL_LIMIT):
|
||||||
|
logger.warning(
|
||||||
|
f"IPv6 connection to {conn.addr} unreachable after "
|
||||||
|
f"{conn._open_fail_count} attempts, disabling"
|
||||||
|
)
|
||||||
|
conn._dead = True
|
||||||
|
break
|
||||||
|
# Retry after the normal interval; IPv4 retries forever.
|
||||||
|
try:
|
||||||
|
if shutdown_event:
|
||||||
|
await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
await asyncio.sleep(interval)
|
||||||
|
except asyncio.TimeoutError:
|
||||||
|
pass
|
||||||
|
except asyncio.CancelledError:
|
||||||
|
raise
|
||||||
|
continue
|
||||||
|
|
||||||
while running:
|
|
||||||
try:
|
try:
|
||||||
msg = {
|
msg = {
|
||||||
"acks": conn.ackcount,
|
"acks": conn.ackcount,
|
||||||
@@ -279,19 +318,16 @@ async def heartbeat_sender(conn: AsyncConnection, interval: int):
|
|||||||
}
|
}
|
||||||
await conn.sendto(msg, "HTB")
|
await conn.sendto(msg, "HTB")
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error sending heartbeat: {e}", exc_info=True)
|
|
||||||
except asyncio.CancelledError:
|
except asyncio.CancelledError:
|
||||||
logger.debug("Heartbeat sender cancelled")
|
logger.debug("Heartbeat sender cancelled")
|
||||||
raise
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error sending heartbeat: {e}", exc_info=True)
|
||||||
|
|
||||||
# Wait for next interval or shutdown event
|
# Wait for next interval or shutdown event
|
||||||
try:
|
try:
|
||||||
if shutdown_event:
|
if shutdown_event:
|
||||||
await asyncio.wait_for(
|
await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
|
||||||
shutdown_event.wait(),
|
|
||||||
timeout=interval
|
|
||||||
)
|
|
||||||
break
|
break
|
||||||
else:
|
else:
|
||||||
await asyncio.sleep(interval)
|
await asyncio.sleep(interval)
|
||||||
@@ -481,12 +517,13 @@ async def async_main(args, config):
|
|||||||
addr = addr_info[4][0]
|
addr = addr_info[4][0]
|
||||||
|
|
||||||
conn = AsyncConnection(conn_id, addr, hb_port, af, iam)
|
conn = AsyncConnection(conn_id, addr, hb_port, af, iam)
|
||||||
if await conn.open():
|
if not await conn.open():
|
||||||
|
logger.warning(f"Initial open to {addr} failed, heartbeat sender will retry")
|
||||||
connections.append(conn)
|
connections.append(conn)
|
||||||
conn_id += 1
|
conn_id += 1
|
||||||
|
|
||||||
if not connections:
|
if not connections:
|
||||||
logger.error("No connections established")
|
logger.error("No connections established (DNS resolution failed for all hosts)")
|
||||||
return 1
|
return 1
|
||||||
|
|
||||||
logger.info(f"Created {len(connections)} connections")
|
logger.info(f"Created {len(connections)} connections")
|
||||||
|
|||||||
@@ -95,7 +95,7 @@ class Connection:
|
|||||||
if not Null:
|
if not Null:
|
||||||
d["addr"] = self.addr
|
d["addr"] = self.addr
|
||||||
if self.rtts[-1]:
|
if self.rtts[-1]:
|
||||||
d["rtt"] = "%0.1f" % self.rtts[-1]
|
d["rtt"] = "%d" % round(self.rtts[-1])
|
||||||
elif self.state == Connection.UNKNOWN:
|
elif self.state == Connection.UNKNOWN:
|
||||||
d["rtt"] = ""
|
d["rtt"] = ""
|
||||||
else:
|
else:
|
||||||
|
|||||||
@@ -154,6 +154,25 @@ async def start(
|
|||||||
lst = [h.jsons() for h in hosts]
|
lst = [h.jsons() for h in hosts]
|
||||||
return web.json_response(json.loads("[" + ",".join(lst) + "]"))
|
return web.json_response(json.loads("[" + ",".join(lst) + "]"))
|
||||||
|
|
||||||
|
async def api_alert_summary(request):
|
||||||
|
"""GET /api/0/alert_summary — counts of ok/warning/critical hosts visible to caller."""
|
||||||
|
user, err = _require_auth(request)
|
||||||
|
if err:
|
||||||
|
return err
|
||||||
|
from .threshold import AlertLevel
|
||||||
|
critical = warning = ok = 0
|
||||||
|
for host in hbdclass.Host.hosts.values():
|
||||||
|
if not _can_operate_host(user, host):
|
||||||
|
continue
|
||||||
|
levels = {s.level for s in host.alert_states.values()}
|
||||||
|
if AlertLevel.CRITICAL in levels:
|
||||||
|
critical += 1
|
||||||
|
elif AlertLevel.WARNING in levels:
|
||||||
|
warning += 1
|
||||||
|
else:
|
||||||
|
ok += 1
|
||||||
|
return web.json_response({"critical": critical, "warning": warning, "ok": ok})
|
||||||
|
|
||||||
async def api_messages(request):
|
async def api_messages(request):
|
||||||
lst = data.msgs[-30:]
|
lst = data.msgs[-30:]
|
||||||
return web.json_response(lst)
|
return web.json_response(lst)
|
||||||
@@ -518,6 +537,7 @@ async def start(
|
|||||||
hosts_with_plugins.append({
|
hosts_with_plugins.append({
|
||||||
"name": hostname,
|
"name": hostname,
|
||||||
"plugins": list(host.plugin_data.keys()),
|
"plugins": list(host.plugin_data.keys()),
|
||||||
|
"is_owner": _can_own_host(current_user, host),
|
||||||
})
|
})
|
||||||
|
|
||||||
tmpl = env.get_template("plugins.html")
|
tmpl = env.get_template("plugins.html")
|
||||||
@@ -893,6 +913,7 @@ async def start(
|
|||||||
web.get("/api/0/users/{username}/avatar", api_user_avatar),
|
web.get("/api/0/users/{username}/avatar", api_user_avatar),
|
||||||
# Hosts
|
# Hosts
|
||||||
web.get("/api/0/hosts", api_hosts),
|
web.get("/api/0/hosts", api_hosts),
|
||||||
|
web.get("/api/0/alert_summary", api_alert_summary),
|
||||||
web.get("/api/0/messages", api_messages),
|
web.get("/api/0/messages", api_messages),
|
||||||
web.get("/api/0/hosts/{hostname}/plugins", api_host_plugins),
|
web.get("/api/0/hosts/{hostname}/plugins", api_host_plugins),
|
||||||
web.get("/api/0/hosts/{hostname}/plugins/{plugin_name}", api_host_plugin_detail),
|
web.get("/api/0/hosts/{hostname}/plugins/{plugin_name}", api_host_plugin_detail),
|
||||||
|
|||||||
+6
-1
@@ -101,9 +101,10 @@ async def reload_configuration(config_obj, config_path, components):
|
|||||||
access = config_mod.get_host_access(new_config, hostname)
|
access = config_mod.get_host_access(new_config, hostname)
|
||||||
host.apply_access(access["owner"], access["managers"], access["monitors"])
|
host.apply_access(access["owner"], access["managers"], access["monitors"])
|
||||||
|
|
||||||
# Reload threshold checker
|
# Reload threshold checker and prune alerts orphaned by the new config
|
||||||
if 'threshold_checker' in components:
|
if 'threshold_checker' in components:
|
||||||
components['threshold_checker'].reload(new_config)
|
components['threshold_checker'].reload(new_config)
|
||||||
|
components['threshold_checker'].purge_stale_alerts(hbdclass)
|
||||||
|
|
||||||
# Note: Changes to the following require restart:
|
# Note: Changes to the following require restart:
|
||||||
# - hb_port, hbd_port, ws_port (already bound)
|
# - hb_port, hbd_port, ws_port (already bound)
|
||||||
@@ -241,6 +242,10 @@ async def _run_async(config, config_path=None):
|
|||||||
)
|
)
|
||||||
udp.restore_connection_timers(hbdclass, restore_ctx)
|
udp.restore_connection_timers(hbdclass, restore_ctx)
|
||||||
|
|
||||||
|
# Drop alert states that no longer have a matching threshold (stale after
|
||||||
|
# upgrade or config change between runs).
|
||||||
|
threshold_checker.purge_stale_alerts(hbdclass)
|
||||||
|
|
||||||
# HTTP server (asyncio-based via aiohttp)
|
# HTTP server (asyncio-based via aiohttp)
|
||||||
try:
|
try:
|
||||||
http_task = asyncio.create_task(
|
http_task = asyncio.create_task(
|
||||||
|
|||||||
@@ -4,6 +4,11 @@
|
|||||||
|
|
||||||
<style>
|
<style>
|
||||||
|
|
||||||
|
body {
|
||||||
|
height: auto;
|
||||||
|
overflow-y: auto;
|
||||||
|
}
|
||||||
|
|
||||||
.container {
|
.container {
|
||||||
max-width: 1400px;
|
max-width: 1400px;
|
||||||
margin: 0 auto;
|
margin: 0 auto;
|
||||||
|
|||||||
@@ -126,11 +126,17 @@
|
|||||||
}
|
}
|
||||||
|
|
||||||
/* Swiss railway clock — nav */
|
/* Swiss railway clock — nav */
|
||||||
.nav-clock {
|
.nav-pie {
|
||||||
flex-shrink: 0;
|
flex-shrink: 0;
|
||||||
line-height: 0;
|
line-height: 0;
|
||||||
margin-left: auto;
|
margin-left: auto;
|
||||||
padding: 4px 4px 4px 0;
|
padding: 4px 4px 4px 0;
|
||||||
|
}
|
||||||
|
#alert-pie { display: block; cursor: default; }
|
||||||
|
.nav-clock {
|
||||||
|
flex-shrink: 0;
|
||||||
|
line-height: 0;
|
||||||
|
padding: 4px 4px 4px 0;
|
||||||
cursor: pointer;
|
cursor: pointer;
|
||||||
}
|
}
|
||||||
#swiss-clock { display: block; }
|
#swiss-clock { display: block; }
|
||||||
|
|||||||
@@ -408,7 +408,7 @@
|
|||||||
);
|
);
|
||||||
if (data.connections[i].state == "up") {
|
if (data.connections[i].state == "up") {
|
||||||
state = '<span class="state-up">up</span>';
|
state = '<span class="state-up">up</span>';
|
||||||
latency = Number.parseFloat(data.connections[i].rtts[0]).toFixed(2);
|
latency = String(Math.round(Number.parseFloat(data.connections[i].rtts[0])));
|
||||||
} else {
|
} else {
|
||||||
if (data.connections[i].state == "unknown") {
|
if (data.connections[i].state == "unknown") {
|
||||||
state = "";
|
state = "";
|
||||||
|
|||||||
@@ -11,6 +11,9 @@
|
|||||||
{% endif %}
|
{% endif %}
|
||||||
<a href="/about"{% if active_page == "about" %} class="active"{% endif %}>About</a>
|
<a href="/about"{% if active_page == "about" %} class="active"{% endif %}>About</a>
|
||||||
</div>
|
</div>
|
||||||
|
<div class="nav-pie" title="Host alert status">
|
||||||
|
<canvas id="alert-pie" width="44" height="44"></canvas>
|
||||||
|
</div>
|
||||||
<div class="nav-clock" title="Click for full-screen clock">
|
<div class="nav-clock" title="Click for full-screen clock">
|
||||||
<canvas id="swiss-clock" width="44" height="44"></canvas>
|
<canvas id="swiss-clock" width="44" height="44"></canvas>
|
||||||
</div>
|
</div>
|
||||||
@@ -42,4 +45,52 @@
|
|||||||
});
|
});
|
||||||
}
|
}
|
||||||
})();
|
})();
|
||||||
|
|
||||||
|
function drawAlertPie(critical, warning, ok) {
|
||||||
|
var canvas = document.getElementById('alert-pie');
|
||||||
|
if (!canvas) return;
|
||||||
|
var ctx = canvas.getContext('2d');
|
||||||
|
var SIZE = canvas.width;
|
||||||
|
var R = SIZE / 2;
|
||||||
|
ctx.clearRect(0, 0, SIZE, SIZE);
|
||||||
|
var total = critical + warning + ok;
|
||||||
|
if (total === 0) {
|
||||||
|
ctx.beginPath();
|
||||||
|
ctx.arc(R, R, R - 1, 0, Math.PI * 2);
|
||||||
|
ctx.fillStyle = '#ccc';
|
||||||
|
ctx.fill();
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
var slices = [
|
||||||
|
{ value: critical, color: '#e53935' },
|
||||||
|
{ value: warning, color: '#ffb300' },
|
||||||
|
{ value: ok, color: '#43a047' }
|
||||||
|
];
|
||||||
|
var start = -Math.PI / 2;
|
||||||
|
slices.forEach(function(s) {
|
||||||
|
if (s.value === 0) return;
|
||||||
|
var sweep = (s.value / total) * Math.PI * 2;
|
||||||
|
ctx.beginPath();
|
||||||
|
ctx.moveTo(R, R);
|
||||||
|
ctx.arc(R, R, R - 1, start, start + sweep);
|
||||||
|
ctx.closePath();
|
||||||
|
ctx.fillStyle = s.color;
|
||||||
|
ctx.fill();
|
||||||
|
start += sweep;
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
function updateAlertPie() {
|
||||||
|
fetch('/api/0/alert_summary').then(function(r) {
|
||||||
|
if (!r.ok) return;
|
||||||
|
return r.json();
|
||||||
|
}).then(function(d) {
|
||||||
|
if (d) drawAlertPie(d.critical || 0, d.warning || 0, d.ok || 0);
|
||||||
|
}).catch(function() {});
|
||||||
|
}
|
||||||
|
|
||||||
|
document.addEventListener('DOMContentLoaded', function() {
|
||||||
|
updateAlertPie();
|
||||||
|
setInterval(updateAlertPie, 30000);
|
||||||
|
});
|
||||||
</script>
|
</script>
|
||||||
|
|||||||
@@ -131,6 +131,27 @@
|
|||||||
text-overflow: ellipsis;
|
text-overflow: ellipsis;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
.host-action-btn {
|
||||||
|
font-size: 0.75em;
|
||||||
|
font-weight: bold;
|
||||||
|
padding: 3px 10px;
|
||||||
|
border-radius: 4px;
|
||||||
|
border: none;
|
||||||
|
cursor: pointer;
|
||||||
|
text-decoration: none;
|
||||||
|
white-space: nowrap;
|
||||||
|
}
|
||||||
|
.host-action-btn.update-btn {
|
||||||
|
background: #e3f2fd;
|
||||||
|
color: #1565c0;
|
||||||
|
}
|
||||||
|
.host-action-btn.update-btn:hover { background: #bbdefb; }
|
||||||
|
.host-action-btn.delete-btn {
|
||||||
|
background: #ffebee;
|
||||||
|
color: #c62828;
|
||||||
|
}
|
||||||
|
.host-action-btn.delete-btn:hover { background: #ffcdd2; }
|
||||||
|
|
||||||
/* ── Host body ──────────────────────────────────────────────── */
|
/* ── Host body ──────────────────────────────────────────────── */
|
||||||
|
|
||||||
.host-body {
|
.host-body {
|
||||||
@@ -379,6 +400,14 @@
|
|||||||
<span class="nagios-badge" id="nagios-badge-{{ host.name }}">—</span>
|
<span class="nagios-badge" id="nagios-badge-{{ host.name }}">—</span>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
<span class="os-label" id="os-label-{{ host.name }}"></span>
|
<span class="os-label" id="os-label-{{ host.name }}"></span>
|
||||||
|
{% if host.is_owner %}
|
||||||
|
<a class="host-action-btn update-btn"
|
||||||
|
href="/u?h={{ host.name }}"
|
||||||
|
onclick="event.stopPropagation()">Update</a>
|
||||||
|
<a class="host-action-btn delete-btn"
|
||||||
|
href="/d?h={{ host.name }}"
|
||||||
|
onclick="event.stopPropagation(); return confirm('Delete host {{ host.name }}?')">Delete</a>
|
||||||
|
{% endif %}
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
|||||||
+45
-3
@@ -803,6 +803,29 @@ class ThresholdChecker:
|
|||||||
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)
|
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)
|
||||||
|
|
||||||
return None
|
return None
|
||||||
|
def _find_threshold(
|
||||||
|
self, thresholds: Dict[str, "ThresholdConfig"], metric_path: str
|
||||||
|
) -> Optional["ThresholdConfig"]:
|
||||||
|
"""Return the threshold for *metric_path*, falling back to suffix matches.
|
||||||
|
|
||||||
|
Allows generic thresholds like ``ping_monitor.rtt_avg`` to match
|
||||||
|
fully-qualified paths like ``ping_monitor.8_8_8_8_rtt_avg``.
|
||||||
|
The exact match is always tried first; then successive leading
|
||||||
|
underscore-delimited segments are stripped from the field name until
|
||||||
|
a match is found or no segments remain.
|
||||||
|
"""
|
||||||
|
if metric_path in thresholds:
|
||||||
|
return thresholds[metric_path]
|
||||||
|
plugin, sep, field = metric_path.partition(".")
|
||||||
|
if not sep:
|
||||||
|
return None
|
||||||
|
parts = field.split("_")
|
||||||
|
for i in range(1, len(parts)):
|
||||||
|
candidate = plugin + "." + "_".join(parts[i:])
|
||||||
|
if candidate in thresholds:
|
||||||
|
return thresholds[candidate]
|
||||||
|
return None
|
||||||
|
|
||||||
def check_plugin_data(
|
def check_plugin_data(
|
||||||
self,
|
self,
|
||||||
host_name: str,
|
host_name: str,
|
||||||
@@ -831,11 +854,10 @@ class ThresholdChecker:
|
|||||||
for metric_name, value in data.items():
|
for metric_name, value in data.items():
|
||||||
metric_path = f"{plugin_name}.{metric_name}"
|
metric_path = f"{plugin_name}.{metric_name}"
|
||||||
|
|
||||||
if metric_path not in thresholds:
|
threshold = self._find_threshold(thresholds, metric_path)
|
||||||
|
if threshold is None:
|
||||||
continue
|
continue
|
||||||
|
|
||||||
threshold = thresholds[metric_path]
|
|
||||||
|
|
||||||
# Get or create alert state
|
# Get or create alert state
|
||||||
if metric_path not in alert_states:
|
if metric_path not in alert_states:
|
||||||
alert_states[metric_path] = AlertState(metric_path)
|
alert_states[metric_path] = AlertState(metric_path)
|
||||||
@@ -1254,6 +1276,26 @@ class ThresholdChecker:
|
|||||||
alert_state.last_notification = now
|
alert_state.last_notification = now
|
||||||
alert_state.notification_count += 1
|
alert_state.notification_count += 1
|
||||||
|
|
||||||
|
def purge_stale_alerts(self, hbdclass) -> None:
|
||||||
|
"""Remove alert states that have no matching threshold configuration.
|
||||||
|
|
||||||
|
Called after startup (pickle restore) and after each config reload so
|
||||||
|
that alerts orphaned by configuration changes do not linger forever.
|
||||||
|
Alerts whose metric_path is not present in the current threshold config
|
||||||
|
for that host are silently dropped.
|
||||||
|
"""
|
||||||
|
for hostname, host in hbdclass.Host.hosts.items():
|
||||||
|
if not host.alert_states:
|
||||||
|
continue
|
||||||
|
configured = self.get_thresholds_for_host(hostname)
|
||||||
|
stale = [mp for mp in host.alert_states if mp not in configured]
|
||||||
|
for mp in stale:
|
||||||
|
logger.info(
|
||||||
|
"Purging stale alert state for %s / %s (no threshold configured)",
|
||||||
|
hostname, mp,
|
||||||
|
)
|
||||||
|
del host.alert_states[mp]
|
||||||
|
|
||||||
def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
|
def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
|
||||||
"""
|
"""
|
||||||
Get all currently active (non-OK) alerts.
|
Get all currently active (non-OK) alerts.
|
||||||
|
|||||||
+1
-1
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|||||||
|
|
||||||
[project]
|
[project]
|
||||||
name = "hbd"
|
name = "hbd"
|
||||||
version = "5.1.15"
|
version = "5.1.17"
|
||||||
description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
|
description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
|
||||||
readme = "README.md"
|
readme = "README.md"
|
||||||
requires-python = ">=3.11"
|
requires-python = ">=3.11"
|
||||||
|
|||||||
+1
-1
@@ -41,7 +41,7 @@ from pathlib import Path
|
|||||||
from typing import Any, Dict, List, Optional, Tuple
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
|
||||||
# updated by scripts/bumpminor.sh
|
# updated by scripts/bumpminor.sh
|
||||||
__version__ = "5.1.15"
|
__version__ = "5.1.17"
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Protocol (mirrors hbd/common/proto.py)
|
# Protocol (mirrors hbd/common/proto.py)
|
||||||
|
|||||||
Reference in New Issue
Block a user