Compare commits

...

28 Commits

Author SHA1 Message Date
andreas 7d8ca5d8db version 5.1.3
Release / release (push) Successful in 4s
2026-04-25 16:52:56 +02:00
andreas 56037a036d fix: remove unused pytest import in test_nagios_runner 2026-04-25 16:39:56 +02:00
andreas 65ceb31d8d fix: use os.path.exists check for /dev/log instead of dead-code OSError catch 2026-04-25 16:36:00 +02:00
andreas 1c9b6c1ca9 fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig
After daemonize() redirects stderr to /dev/null, the existing StreamHandler
writes to /dev/null. logging.basicConfig() is a no-op when handlers are
already configured, so log messages are silently lost.

Replace the daemon block to:
1. Call daemonize() first
2. Explicitly remove existing handlers (pointing to /dev/null)
3. Add SysLogHandler pointing to /dev/log with fallback to UDP localhost:514
4. Log startup message to the new syslog handler

Removes redundant syslog.openlog() call which is no longer needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 16:29:54 +02:00
andreas d7e6b478e1 fix: use shlex.split() in nagios_runner path validation to handle quoted paths 2026-04-25 16:28:32 +02:00
andreas 535dbda47d feat: validate absolute command paths at nagios_runner init 2026-04-25 16:24:33 +02:00
andreas c9567dddae fix: remove stale shell config key from NagiosRunnerPlugin docstring 2026-04-25 16:23:03 +02:00
andreas b5963badd6 feat: async subprocess in nagios_runner with stderr capture and signal handling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 16:18:09 +02:00
andreas a76a39b4a0 fix: remove redundant no-commands log lines; fix skip_reason docstring style
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 16:15:44 +02:00
andreas 94e1597978 feat: set skip_reason on nagios_runner when no commands configured
When NagiosRunnerPlugin has no commands configured, set skip_reason before
returning False from initialize(). This allows PluginLoader to log INFO
(not WARNING) when the plugin is skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 16:13:03 +02:00
andreas c9c2ed772f fix: document skip_reason in Plugin docstring; remove unused import in test 2026-04-25 16:10:35 +02:00
andreas aeb78dcb8e feat: add skip_reason to Plugin; improve PluginLoader init messaging 2026-04-25 16:08:07 +02:00
andreas 77b337e4dd Add implementation plan for plugin error checking and daemon logging fixes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 16:04:13 +02:00
andreas 293461f3f6 Add design spec for plugin error checking and daemon logging fixes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 15:49:09 +02:00
andreas c70a4807dc version 5.1.2
Release / release (push) Successful in 6s
2026-04-25 07:25:06 +02:00
andreas 1a470e7cfa Fix plugin config lookup shadowed by CLIENT_DEFAULTS plugins key
CLIENT_DEFAULTS seeds "plugins": {} so raw_config.get("plugins", raw_config)
always returned the empty subdict instead of falling back to the full config.
Plugins configured at top-level (e.g. nagios_runner: ...) were therefore
never found, resulting in "No Nagios commands configured".

Now checks the plugins subdict first, then top-level keys, so both
config layouts work correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:58:42 +02:00
andreas 990c658e65 Apply grace period to all threshold alerts before logging/notifying
Threshold alerts (plugin metrics, RTT) were firing immediately on the
first breach. Now every state transition to WARNING/CRITICAL starts a
grace-period timer (grace_seconds from the 'grace' config key). The
notification is deferred until the next heartbeat after grace_seconds
have elapsed. If the metric recovers within the grace window, both the
alert and the recovery are suppressed — no spurious pages for transient
spikes.

Two helper methods added to ThresholdChecker:
- _apply_grace: handles the state-change path (defer or suppress)
- _check_pending_or_renotify: handles the stable-alert path (fire
  deferred notification once grace expires, or fall through to reminders)

The overdue case is unchanged — on_overdue already fires only after
interval+grace seconds of silence, which is equivalent behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:00:40 +02:00
andreas b78d6ac0fe Fix RECOVER routing: use consistent level name and route via alerted channel
threshold.py was emitting level="RECOVERED" for metric recoveries, which
failed the is_recover check in send_notification (which only matched "RECOVER"),
bypassing _alerted_channels routing and the min_level bypass added in the
previous commit. Changed to "RECOVER" so all recovery paths are consistent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 11:29:04 +02:00
andreas afd5060f59 Fix early reminder notifications and lost recovery notifications
- AlertState.update() now resets last_notification when the alert level
  changes, so a WARNING→CRITICAL escalation restarts the reminder interval
  rather than inheriting a nearly-expired timer.
- _dispatch_to_channel() bypasses min_level for RECOVER, so recovery
  notifications are delivered even after a server restart when
  _alerted_channels is empty and the fallback dispatch path is used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 18:11:22 +02:00
andreas f61f7aebc2 Use python3 consistently 2026-04-19 09:49:30 +02:00
Andreas Wrede 5c382d2b8d One more nit 2026-04-13 09:31:35 -04:00
Andreas Wrede 35bba451f5 Various formating nits 2026-04-13 09:27:51 -04:00
Andreas Wrede 80edfba0c0 fix inconsistencies in page layout, add swiss clock 2026-04-13 08:45:50 -04:00
Andreas Wrede 6bc8de192e fix non-alerting of overdue hosts 2026-04-12 18:44:36 -04:00
Andreas Wrede 2d8166d04a unse python3 -mpip instead of plain pip 2026-04-12 18:44:11 -04:00
Andreas Wrede ab33d81b30 catch syntax wanring when parsing version string 2026-04-12 16:39:51 -04:00
Andreas Wrede 2c0328f36d update install.sh to handle missing venv module 2026-04-12 16:39:14 -04:00
Andreas Wrede fb8e27825d make install.sh work on systems withou pip 2026-04-12 14:16:44 -04:00
22 changed files with 1380 additions and 176 deletions
+4 -4
View File
@@ -24,11 +24,11 @@ jobs:
- name: Install build tools
run: |
python -m pip install --upgrade pip
pip install build twine
python3 -m pip install --upgrade pip
python3 -m pip install build twine
- name: Build package
run: python -m build
run: python3 -m build
- name: Extract version from tag
id: get_version
@@ -39,7 +39,7 @@ jobs:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
run: |
python -m twine upload --repository-url https://git.wrede.ca/api/packages/andreas/pypi dist/*
python3 -m twine upload --repository-url https://git.wrede.ca/api/packages/andreas/pypi dist/*
- name: Create release
uses: actions/gitea-release-action@v1
@@ -0,0 +1,602 @@
# Plugin Error Checking Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode.
**Architecture:** Three focused changes across three files: (1) `hbd/client/plugin.py` gains a `skip_reason` attribute on Plugin and updated PluginLoader messaging; (2) `hbd/client/plugins/nagios_runner.py` gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) `hbd/client/main.py` gains proper post-fork logging reconfiguration to syslog.
**Tech Stack:** Python 3.11+, asyncio, `logging.handlers.SysLogHandler`, pytest
---
## File Map
| Action | Path | What changes |
|---|---|---|
| Modify | `hbd/client/plugin.py` | `Plugin.__init__` gains `skip_reason`; `PluginLoader` checks it |
| Modify | `hbd/client/plugins/nagios_runner.py` | async subprocess, stderr, signal codes, init validation, `skip_reason` |
| Modify | `hbd/client/main.py` | `_reconfigure_logging_for_daemon()` helper; remove redundant syslog calls |
| Create | `tests/test_plugin.py` | PluginLoader messaging tests |
| Create | `tests/test_nagios_runner.py` | NagiosRunnerPlugin behaviour tests |
Run tests throughout with:
```bash
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
```
---
## Task 1: Plugin.skip_reason + PluginLoader messaging
**Files:**
- Modify: `hbd/client/plugin.py:40-48` (Plugin.__init__)
- Modify: `hbd/client/plugin.py:369-381` (PluginLoader.load_from_directory)
- Create: `tests/test_plugin.py`
- [ ] **Step 1: Write failing tests**
Create `tests/test_plugin.py`:
```python
import asyncio
import logging
import textwrap
from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry
def test_plugin_skip_reason_defaults_none(tmp_path):
plugin_code = textwrap.dedent("""
from hbd.client.plugin import MonitorPlugin
class MinimalPlugin(MonitorPlugin):
name = "minimal"
version = "1.0.0"
interval = 60
async def initialize(self):
return True
async def _collect_metrics(self):
return {}
""")
(tmp_path / "minimal.py").write_text(plugin_code)
registry = PluginRegistry()
loader = PluginLoader(registry)
asyncio.run(loader.load_from_directory(tmp_path))
plugin = registry.get("minimal")
assert plugin is not None
assert plugin.skip_reason is None
def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
plugin_code = textwrap.dedent("""
from hbd.client.plugin import MonitorPlugin
class SkippablePlugin(MonitorPlugin):
name = "skippable"
version = "1.0.0"
interval = 60
async def initialize(self):
self.skip_reason = "not configured in yaml"
return False
async def _collect_metrics(self):
return {}
""")
(tmp_path / "skippable.py").write_text(plugin_code)
registry = PluginRegistry()
loader = PluginLoader(registry)
with caplog.at_level(logging.INFO, logger="plugin.loader"):
count = asyncio.run(loader.load_from_directory(tmp_path))
assert count == 0
assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
assert not any("failed initialization" in r.message for r in caplog.records)
def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
plugin_code = textwrap.dedent("""
from hbd.client.plugin import MonitorPlugin
class FailPlugin(MonitorPlugin):
name = "fail"
version = "1.0.0"
interval = 60
async def initialize(self):
return False
async def _collect_metrics(self):
return {}
""")
(tmp_path / "fail_plugin.py").write_text(plugin_code)
registry = PluginRegistry()
loader = PluginLoader(registry)
with caplog.at_level(logging.WARNING, logger="plugin.loader"):
count = asyncio.run(loader.load_from_directory(tmp_path))
assert count == 0
assert any("failed initialization" in r.message for r in caplog.records)
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
python -m pytest tests/test_plugin.py -v
```
Expected: `test_plugin_skip_reason_defaults_none` FAILS (attribute missing), others may error.
- [ ] **Step 3: Add `skip_reason` to `Plugin.__init__`**
In `hbd/client/plugin.py`, in `Plugin.__init__` (around line 46), add one line:
```python
def __init__(self, config: Optional[Dict[str, Any]] = None):
self.config = config or {}
self.logger = logging.getLogger(f"plugin.{self.name}")
self._initialized = False
self.skip_reason: Optional[str] = None
```
- [ ] **Step 4: Update PluginLoader messaging**
In `hbd/client/plugin.py`, replace the `if not initialized:` block (around line 372):
```python
if not initialized:
if plugin.skip_reason:
self.logger.info(
f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
)
else:
self.logger.warning(
f"Plugin {plugin.name} failed initialization, skipping"
)
continue
```
- [ ] **Step 5: Run tests to verify they pass**
```bash
python -m pytest tests/test_plugin.py -v
```
Expected: all 3 tests PASS.
- [ ] **Step 6: Commit**
```bash
git add hbd/client/plugin.py tests/test_plugin.py
git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging"
```
---
## Task 2: NagiosRunnerPlugin — skip_reason when no commands
**Files:**
- Modify: `hbd/client/plugins/nagios_runner.py:88-105` (initialize)
- Modify: `tests/test_nagios_runner.py` (create)
- [ ] **Step 1: Write failing test**
Create `tests/test_nagios_runner.py`:
```python
import asyncio
import logging
import os
import stat
import pytest
from hbd.client.plugins.nagios_runner import (
NagiosRunnerPlugin,
NAGIOS_OK,
NAGIOS_WARNING,
NAGIOS_CRITICAL,
NAGIOS_UNKNOWN,
)
def test_no_commands_sets_skip_reason():
plugin = NagiosRunnerPlugin(config={"commands": []})
result = asyncio.run(plugin.initialize())
assert result is False
assert plugin.skip_reason is not None
assert "nagios_runner.commands" in plugin.skip_reason
```
- [ ] **Step 2: Run test to verify it fails**
```bash
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
```
Expected: FAIL — `plugin.skip_reason` is `None`.
- [ ] **Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()**
In `hbd/client/plugins/nagios_runner.py`, replace the early-return block in `initialize()` (around line 96):
```python
if not self.commands:
self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
self.logger.info("No Nagios commands configured")
return False
```
- [ ] **Step 4: Run test to verify it passes**
```bash
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
```
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: set skip_reason on nagios_runner when no commands configured"
```
---
## Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes
**Files:**
- Modify: `hbd/client/plugins/nagios_runner.py` (imports + `_run_nagios_plugin`)
- Modify: `tests/test_nagios_runner.py`
- [ ] **Step 1: Write failing tests**
Append to `tests/test_nagios_runner.py`:
```python
def test_stderr_used_when_stdout_empty(tmp_path):
script = tmp_path / "check_err.sh"
script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
script.chmod(script.stat().st_mode | stat.S_IEXEC)
config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
plugin = NagiosRunnerPlugin(config=config)
asyncio.run(plugin.initialize())
data = asyncio.run(plugin._collect_metrics())
assert "error from stderr" in data["t_output"]
assert data["t_status_code"] == NAGIOS_CRITICAL
def test_stderr_appended_when_both_present(tmp_path):
script = tmp_path / "check_both.sh"
script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
script.chmod(script.stat().st_mode | stat.S_IEXEC)
config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
plugin = NagiosRunnerPlugin(config=config)
asyncio.run(plugin.initialize())
data = asyncio.run(plugin._collect_metrics())
assert "OK - all good" in data["t_output"]
assert "extra detail" in data["t_output"]
assert data["t_status_code"] == NAGIOS_OK
def test_negative_returncode_maps_to_unknown():
# kill -9 $$ kills the shell itself; asyncio sees returncode -9
config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
plugin = NagiosRunnerPlugin(config=config)
asyncio.run(plugin.initialize())
data = asyncio.run(plugin._collect_metrics())
assert data["t_status_code"] == NAGIOS_UNKNOWN
assert "signal" in data["t_output"].lower()
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \
tests/test_nagios_runner.py::test_stderr_appended_when_both_present \
tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v
```
Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes.
- [ ] **Step 3: Update imports in nagios_runner.py**
Replace the import block at the top of `hbd/client/plugins/nagios_runner.py`:
```python
import asyncio
import os
import re
from typing import Any, Dict, List, Optional, Tuple
from hbd.client.plugin import MonitorPlugin
```
(Remove `import subprocess`; add `import asyncio` and `import os`.)
- [ ] **Step 4: Upgrade collection log level from DEBUG to INFO**
In `hbd/client/plugins/nagios_runner.py`, in `_collect_metrics()`, change the debug log (around line 144) so results are visible at INFO level:
```python
self.logger.info(
f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
)
```
- [ ] **Step 5: Replace `_run_nagios_plugin` with async implementation**
Replace the entire `_run_nagios_plugin` method in `hbd/client/plugins/nagios_runner.py`:
```python
async def _run_nagios_plugin(
self,
command: str
) -> Tuple[int, str, Dict[str, Any]]:
"""Execute a Nagios plugin and parse its output."""
try:
proc = await asyncio.create_subprocess_shell(
command,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:
stdout_bytes, stderr_bytes = await asyncio.wait_for(
proc.communicate(), timeout=self.timeout
)
except asyncio.TimeoutError:
proc.kill()
await proc.communicate()
self.logger.error(f"Command timed out: {command}")
return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
status_code = proc.returncode
if status_code < 0:
return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}
if status_code > 3:
status_code = NAGIOS_UNKNOWN
stdout = stdout_bytes.decode(errors="replace").strip()
stderr = stderr_bytes.decode(errors="replace").strip()
# Parse perfdata from stdout before mixing in stderr
perfdata = self._parse_perfdata(stdout)
# Build status message
status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout
if not stdout and stderr:
output_msg = stderr
elif stdout and stderr:
output_msg = f"{status_part} [stderr: {stderr}]"
else:
output_msg = status_part
return status_code, output_msg, perfdata
except Exception as e:
self.logger.error(f"Error executing command: {e}")
return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
```
Also remove the now-unused `self.shell` line from `__init__` (the `shell` config key is no longer used since `create_subprocess_shell` always uses a shell):
In `NagiosRunnerPlugin.__init__`, remove:
```python
self.shell: bool = config.get("shell", True) if config else True
```
- [ ] **Step 6: Run tests to verify they pass**
```bash
python -m pytest tests/test_nagios_runner.py -v
```
Expected: all tests PASS including the 3 new ones.
- [ ] **Step 7: Commit**
```bash
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling"
```
---
## Task 4: NagiosRunnerPlugin — command path validation at init
**Files:**
- Modify: `hbd/client/plugins/nagios_runner.py` (initialize)
- Modify: `tests/test_nagios_runner.py`
- [ ] **Step 1: Write failing tests**
Append to `tests/test_nagios_runner.py`:
```python
def test_absolute_path_not_found_warns(caplog):
fake_cmd = "/nonexistent_hbc_test_path/check_something"
config = {"commands": [{"name": "t", "command": fake_cmd}]}
plugin = NagiosRunnerPlugin(config=config)
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
asyncio.run(plugin.initialize())
assert any("not found" in r.message for r in caplog.records)
def test_absolute_path_not_executable_warns(caplog, tmp_path):
non_exec = tmp_path / "check_test"
non_exec.write_text("#!/bin/sh\necho OK\n")
non_exec.chmod(0o644) # readable but not executable
config = {"commands": [{"name": "t", "command": str(non_exec)}]}
plugin = NagiosRunnerPlugin(config=config)
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
asyncio.run(plugin.initialize())
assert any("not executable" in r.message for r in caplog.records)
def test_relative_path_not_checked(caplog):
# Relative paths (resolved via PATH) must not generate warnings
config = {"commands": [{"name": "t", "command": "echo OK"}]}
plugin = NagiosRunnerPlugin(config=config)
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
asyncio.run(plugin.initialize())
assert not any(
"not found" in r.message or "not executable" in r.message
for r in caplog.records
)
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \
tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \
tests/test_nagios_runner.py::test_relative_path_not_checked -v
```
Expected: `test_absolute_path_not_found_warns` and `test_absolute_path_not_executable_warns` FAIL (no warnings logged); `test_relative_path_not_checked` may pass.
- [ ] **Step 3: Add command path validation to `initialize()`**
In `hbd/client/plugins/nagios_runner.py`, extend `initialize()` by adding validation after the existing "log each command" loop (after line 103, before `return True`):
```python
# Validate absolute command paths early
for cmd_config in self.commands:
name = cmd_config.get("name", "unnamed")
command = cmd_config.get("command", "")
if not command:
continue
exe = command.split()[0]
if os.path.isabs(exe):
if not os.path.isfile(exe):
self.logger.warning(
f"Command '{name}': executable not found: {exe}"
)
elif not os.access(exe, os.X_OK):
self.logger.warning(
f"Command '{name}': executable not executable: {exe}"
)
```
- [ ] **Step 4: Run full test suite to verify all pass**
```bash
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
```
Expected: all tests PASS.
- [ ] **Step 5: Commit**
```bash
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: validate absolute command paths at nagios_runner init"
```
---
## Task 5: Daemon mode logging — route to syslog after fork
**Files:**
- Modify: `hbd/client/main.py` (new helper + updated daemon block)
No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below.
- [ ] **Step 1: Add `_reconfigure_logging_for_daemon` helper**
In `hbd/client/main.py`, add this function just before `def build_parser()` (around line 589):
```python
def _reconfigure_logging_for_daemon(log_level: int) -> None:
"""Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
from logging.handlers import SysLogHandler
root = logging.getLogger()
for handler in root.handlers[:]:
root.removeHandler(handler)
handler.close()
try:
syslog_handler = SysLogHandler(
address="/dev/log",
facility=SysLogHandler.LOG_DAEMON,
)
except OSError:
syslog_handler = SysLogHandler(
address=("localhost", 514),
facility=SysLogHandler.LOG_DAEMON,
)
# Attach the fallback first so the warning reaches syslog
syslog_handler.setFormatter(
logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
)
root.addHandler(syslog_handler)
root.setLevel(log_level)
logging.warning("/dev/log not found, using syslog UDP localhost:514")
return
syslog_handler.setFormatter(
logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
)
root.addHandler(syslog_handler)
root.setLevel(log_level)
```
- [ ] **Step 2: Update the daemon block in `main()`**
In `hbd/client/main.py`, replace the entire `if args.daemon:` block (lines 664675):
```python
if args.daemon:
print("Daemonizing...")
daemonize()
_reconfigure_logging_for_daemon(log_level)
logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
```
This removes the `import syslog`, `syslog.openlog()`, and `syslog.syslog()` calls (now handled by the logging system) and removes the no-op second `logging.basicConfig()` call.
- [ ] **Step 3: Run existing test suite to confirm no regressions**
```bash
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
```
Expected: all tests still PASS.
- [ ] **Step 4: Manual smoke test — verify syslog output in daemon mode**
```bash
# In one terminal, tail syslog
sudo journalctl -f -t hbc
# In another terminal, start hbc in daemon mode (replace HOST with a real or dummy host)
python -m hbd.client.main -d -v localhost
# Expected in journalctl output:
# hbc[<pid>]: hbc.main INFO: Starting hbc for <hostname> -> ['localhost']
# hbc[<pid>]: hbc.main INFO: hbc starting, sending heartbeat to localhost
# hbc[<pid>]: plugin.loader INFO: ...
# Stop the daemon
pkill -f "hbd.client.main"
```
- [ ] **Step 5: Commit**
```bash
git add hbd/client/main.py
git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig"
```
@@ -0,0 +1,92 @@
# Plugin Error Checking & Daemon Logging — Design Spec
**Date:** 2026-04-25
**Scope:** hbc client — daemon mode logging, nagios_runner plugin robustness, PluginLoader messaging
**Files affected:** `hbd/client/main.py`, `hbd/client/plugins/nagios_runner.py`, `hbd/client/plugin.py`
---
## 1. Daemon Mode Logging
### Problem
In `main()`, `logging.basicConfig()` is called before `daemonize()` (establishing a StreamHandler to stderr), then called again after `daemonize()`. The second call is a no-op — Python ignores `basicConfig()` when handlers are already configured. After daemonization, stderr is redirected to `/dev/null`, so all subsequent log output is silently discarded.
The existing `syslog.openlog()` / `syslog.syslog()` calls (lines 666668) write a single startup message but do not integrate with the `logging` system, so plugin and connection log messages never reach syslog.
### Fix
After `daemonize()`, explicitly reconfigure the root logger:
1. Remove all existing handlers (they now write to `/dev/null`).
2. Add `logging.handlers.SysLogHandler(address='/dev/log', facility=LOG_DAEMON)`.
3. Set formatter: `hbc[%(process)d]: %(name)s %(levelname)s: %(message)s`
4. Preserve the `log_level` already determined from `-v`/`-x` CLI flags.
Remove the redundant `syslog.openlog()` / `syslog.syslog()` calls — the logging system handles routing.
**Fallback:** If `/dev/log` does not exist (containers, some BSDs), fall back to `SysLogHandler(address=('localhost', 514))`. Log one warning (to stderr, before handlers are replaced) so the operator knows.
---
## 2. Nagios Runner Improvements
### 2a — Async Subprocess
`_run_nagios_plugin()` is declared `async def` but calls `subprocess.run()` synchronously, blocking the event loop for the full command duration.
**Fix:** Replace with `asyncio.create_subprocess_shell()` + `await proc.communicate()`. Enforce timeout with `asyncio.wait_for(..., timeout=self.timeout)` and catch `asyncio.TimeoutError`.
### 2b — Stderr Capture
Subprocess stderr is currently discarded (`capture_output=True` only captures stdout in the sync call; stderr content is lost).
**Fix:** Pass `stderr=asyncio.subprocess.PIPE` to `create_subprocess_shell`. After `communicate()`, if stdout is empty but stderr has content, use stderr as the output message. If both have content, append stderr to the output for visibility.
### 2c — Negative Return Codes
A negative `returncode` means the process was killed by a signal (SIGKILL, OOM, etc.). The current code treats these as-is, which may produce unexpected status values.
**Fix:** If `returncode < 0`, map to `NAGIOS_UNKNOWN` with message `"Process killed by signal {-returncode}"`.
### 2d — Command Path Validation at Init
`initialize()` currently only checks that the commands list is non-empty.
**Fix:** For each command entry during `initialize()`:
- Warn and skip the entry if `name` or `command` is missing.
- Extract the executable (first whitespace-delimited token of the command string).
- If the executable is an absolute path, check `os.path.isfile()` and `os.access(..., os.X_OK)`. Log a `WARNING` if either check fails.
- Commands with relative paths or shell builtins are not checked (they may be on PATH) — just noted.
- Validation warns only; all original entries in `self.commands` are retained and still attempted at collection time (where the existing missing-name/command guard already skips them). The plugin initializes successfully as long as the commands list is non-empty.
---
## 3. PluginLoader Messaging
### Problem
When `initialize()` returns `False`, the loader always logs:
> `WARNING: Plugin X failed initialization, skipping`
This is alarming when the real reason is simply "no commands configured". There is no API to distinguish "not configured" from "genuinely broken".
### Fix
Add an optional `skip_reason` attribute to `Plugin.__init__()` (defaults to `None`).
In `PluginLoader.load_from_directory()`, after `initialize()` returns `False`:
- If `plugin.skip_reason` is set → `logger.info(f"Plugin {plugin.name} skipped: {plugin.skip_reason}")`
- If `plugin.skip_reason` is `None``logger.warning(f"Plugin {plugin.name} failed initialization, skipping")` (existing behaviour)
In `NagiosRunnerPlugin.initialize()`, when no commands are configured:
```python
self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
return False
```
Genuine failures (exceptions) continue to go through the existing `except` block in the loader, logging at `ERROR` with traceback — unchanged.
---
## Decisions
| Topic | Decision |
|---|---|
| Daemon log destination | syslog only (LOG_DAEMON facility) |
| Syslog fallback | localhost:514 UDP if `/dev/log` absent |
| Nagios result log level | INFO for all statuses (OK/WARNING/CRITICAL/UNKNOWN) |
| Invalid command handling at init | Warn and continue; still attempt at collection time |
| PluginLoader API change | `skip_reason` attribute on Plugin base class, checked by loader |
+1 -1
View File
@@ -14,4 +14,4 @@ Install options:
"""
__all__ = ["__version__"]
__version__ = "5.1.1"
__version__ = "5.1.3"
+33 -9
View File
@@ -15,6 +15,7 @@ import socket
import sys
import time
from hashlib import md5
from logging.handlers import SysLogHandler
from pathlib import Path
from typing import Dict, List, Optional
@@ -586,6 +587,36 @@ def daemonize(
os.dup2(se.fileno(), sys.stderr.fileno())
def _reconfigure_logging_for_daemon(log_level: int) -> None:
"""Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
root = logging.getLogger()
for handler in root.handlers[:]:
root.removeHandler(handler)
handler.close()
use_udp_fallback = not os.path.exists("/dev/log")
if use_udp_fallback:
syslog_handler = SysLogHandler(
address=("localhost", 514),
facility=SysLogHandler.LOG_DAEMON,
)
else:
syslog_handler = SysLogHandler(
address="/dev/log",
facility=SysLogHandler.LOG_DAEMON,
)
syslog_handler.setFormatter(
logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
)
root.addHandler(syslog_handler)
root.setLevel(log_level)
if use_udp_fallback:
logging.warning("/dev/log not found, using syslog UDP localhost:514")
def build_parser():
"""Build argument parser."""
parser = argparse.ArgumentParser(
@@ -663,16 +694,9 @@ def main(argv=None):
# Daemonize if requested
if args.daemon:
print("Daemonizing...")
import syslog
syslog.openlog("hbc", syslog.LOG_PID, syslog.LOG_DAEMON)
syslog.syslog(syslog.LOG_INFO, f"Starting heartbeat to {', '.join(args.hosts)}")
daemonize()
# Reconfigure logging for syslog
logging.basicConfig(
level=log_level,
format="hbc[%(process)d]: %(name)s %(levelname)s: %(message)s"
)
_reconfigure_logging_for_daemon(log_level)
logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
# Run async main
try:
+19 -10
View File
@@ -22,13 +22,14 @@ from typing import Any, Dict, List, Optional, Type
class Plugin(ABC):
"""Base class for all plugins.
Attributes:
name: Unique plugin identifier (e.g., "os_info", "cpu_monitor")
version: Plugin version string
description: Human-readable description
interval: Collection interval in seconds (0 for InfoPlugin = collect once)
enabled: Whether plugin is active (can be disabled via config)
skip_reason: Set by plugin before returning False from initialize(); causes loader to log INFO instead of WARNING.
"""
name: str = ""
@@ -39,13 +40,14 @@ class Plugin(ABC):
def __init__(self, config: Optional[Dict[str, Any]] = None):
"""Initialize plugin with optional configuration.
Args:
config: Plugin-specific configuration from YAML (e.g., thresholds, paths)
"""
self.config = config or {}
self.logger = logging.getLogger(f"plugin.{self.name}")
self._initialized = False
self.skip_reason: Optional[str] = None
@abstractmethod
async def initialize(self) -> bool:
@@ -312,9 +314,10 @@ class PluginLoader:
loaded_count = 0
raw_config = config or {}
# Per-plugin config lives under the 'plugins' key; fall back to top-level
# for backwards compatibility.
plugin_config = raw_config.get("plugins", raw_config)
# Per-plugin config lives under the 'plugins' key or at top-level.
# CLIENT_DEFAULTS seeds "plugins": {} so the key always exists; check
# both the subdict and top-level so that either layout in .hbc.yaml works.
plugins_subconfig = raw_config.get("plugins", {})
# Scan for Python files
for plugin_file in directory.glob("*.py"):
@@ -359,17 +362,23 @@ class PluginLoader:
self.logger.debug(f"Found plugin class: {name}")
# Instantiate plugin with config
plugin_instance_config = plugin_config.get(obj.name, {})
# Instantiate plugin with config — check plugins subdict first,
# then top-level keys (e.g. nagios_runner: ... at root of config).
plugin_instance_config = plugins_subconfig.get(obj.name) or raw_config.get(obj.name, {})
plugin = obj(config=plugin_instance_config)
# Initialize plugin
try:
initialized = await plugin.initialize()
if not initialized:
self.logger.warning(
f"Plugin {plugin.name} failed initialization, skipping"
)
if plugin.skip_reason:
self.logger.info(
f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
)
else:
self.logger.warning(
f"Plugin {plugin.name} failed initialization, skipping"
)
continue
except Exception as e:
self.logger.error(
+69 -49
View File
@@ -21,8 +21,10 @@ nagios_runner:
```
"""
import asyncio
import os
import re
import subprocess
import shlex
from typing import Any, Dict, List, Optional, Tuple
from hbd.client.plugin import MonitorPlugin
@@ -52,8 +54,7 @@ class NagiosRunnerPlugin(MonitorPlugin):
interval: Collection interval in seconds (default: 300)
commands: List of command definitions with 'name' and 'command' keys
timeout: Command execution timeout in seconds (default: 30)
shell: Whether to execute commands via shell (default: True)
Example:
nagios_runner:
interval: 300 # Check every 5 minutes
@@ -76,32 +77,48 @@ class NagiosRunnerPlugin(MonitorPlugin):
# Extract configuration
self.commands: List[Dict[str, str]] = config.get("commands", []) if config else []
self.timeout: int = config.get("timeout", 30) if config else 30
self.shell: bool = config.get("shell", True) if config else True
self.interval = config.get("interval", 300) if config else 300
# Validate commands
if not self.commands:
self.logger.info(
"No Nagios commands configured. Add 'nagios_runner.commands' to config."
)
async def initialize(self) -> bool:
"""Initialize the Nagios runner plugin.
Returns:
True if at least one command is configured, False otherwise
"""
self.logger.info(f"Initializing {self.name} plugin")
if not self.commands:
self.logger.info("No Nagios commands configured")
self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
return False
self.logger.info(f"Configured to run {len(self.commands)} Nagios plugin(s)")
for cmd_config in self.commands:
name = cmd_config.get("name", "unnamed")
self.logger.info(f" - {name}: {cmd_config.get('command', 'N/A')}")
# Validate absolute command paths early
for cmd_config in self.commands:
name = cmd_config.get("name", "unnamed")
command = cmd_config.get("command", "")
if not command:
continue
try:
tokens = shlex.split(command)
except ValueError:
continue # malformed command string; skip validation
if not tokens:
continue
exe = tokens[0]
if os.path.isabs(exe):
if not os.path.isfile(exe):
self.logger.warning(
f"Command '{name}': executable not found: {exe}"
)
elif not os.access(exe, os.X_OK):
self.logger.warning(
f"Command '{name}': executable not executable: {exe}"
)
return True
async def _collect_metrics(self) -> Dict[str, Any]:
@@ -141,7 +158,7 @@ class NagiosRunnerPlugin(MonitorPlugin):
for metric_name, metric_value in perfdata.items():
results[f"{name}_{metric_name}"] = metric_value
self.logger.debug(
self.logger.info(
f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
)
@@ -163,46 +180,49 @@ class NagiosRunnerPlugin(MonitorPlugin):
self,
command: str
) -> Tuple[int, str, Dict[str, Any]]:
"""Execute a Nagios plugin and parse its output.
Args:
command: Command string to execute
Returns:
Tuple of (status_code, output_message, performance_data_dict)
"""
"""Execute a Nagios plugin and parse its output."""
try:
# Run command
result = subprocess.run(
proc = await asyncio.create_subprocess_shell(
command,
shell=self.shell,
capture_output=True,
timeout=self.timeout,
text=True
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
status_code = result.returncode
output = result.stdout.strip()
# Nagios plugins can return codes > 3, treat as UNKNOWN
try:
stdout_bytes, stderr_bytes = await asyncio.wait_for(
proc.communicate(), timeout=self.timeout
)
except asyncio.TimeoutError:
proc.kill()
await proc.communicate()
self.logger.error(f"Command timed out: {command}")
return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
status_code = proc.returncode
if status_code < 0:
return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}
if status_code > 3:
status_code = NAGIOS_UNKNOWN
# Parse performance data
perfdata = self._parse_perfdata(output)
# Extract just the status message (before the pipe if present)
if '|' in output:
output_msg = output.split('|')[0].strip()
stdout = stdout_bytes.decode(errors="replace").strip()
stderr = stderr_bytes.decode(errors="replace").strip()
# Parse perfdata from stdout before mixing in stderr
perfdata = self._parse_perfdata(stdout)
# Build status message
status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout
if not stdout and stderr:
output_msg = stderr
elif stdout and stderr:
output_msg = f"{status_part} [stderr: {stderr}]"
else:
output_msg = output
output_msg = status_part
return status_code, output_msg, perfdata
except subprocess.TimeoutExpired:
self.logger.error(f"Command timed out: {command}")
return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
except Exception as e:
self.logger.error(f"Error executing command: {e}")
return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
+9 -4
View File
@@ -52,12 +52,17 @@ def decode_value(val: str) -> Any:
except Exception:
return val[1:] # Return as string without @
# Try numeric evaluation (original behavior)
# Try numeric conversion (avoid eval to prevent SyntaxWarnings on version strings)
if val[0].isdigit() or (val[0] == '-' and len(val) > 1 and val[1].isdigit()):
try:
return eval(val)
except Exception:
return val
return int(val)
except ValueError:
pass
try:
return float(val)
except ValueError:
pass
return val
return val
+14 -7
View File
@@ -385,13 +385,20 @@ _DRIVERS = {
def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
"""Send *notif* to a single named channel, honouring min_level."""
min_level = channel_cfg.get("min_level", "WARNING").upper()
if _level_value(notif.level) < _level_value(min_level):
logger.debug(
"channel '%s': skipping level %s (min_level=%s)", channel_name, notif.level, min_level
)
return True # not an error — filtered intentionally
"""Send *notif* to a single named channel, honouring min_level.
RECOVER always bypasses min_level — a recovery is always relevant if the
channel was configured for any alerting (handles the restart-then-recover case
where _alerted_channels is empty and we fall through to the normal loop).
"""
level = notif.level.upper()
if level != "RECOVER":
min_level = channel_cfg.get("min_level", "WARNING").upper()
if _level_value(level) < _level_value(min_level):
logger.debug(
"channel '%s': skipping level %s (min_level=%s)", channel_name, level, min_level
)
return True # not an error — filtered intentionally
ch_type = channel_cfg.get("type", "")
driver = _DRIVERS.get(ch_type)
+4 -11
View File
@@ -3,20 +3,13 @@
{% include 'head.html' %}
<style>
body {
margin: 20px;
background: #f5f5f5;
}
.container {
max-width: 1400px;
margin: 0 auto;
}
h1 {
color: #333;
margin-bottom: 10px;
}
h1 { color: #333; margin-bottom: 10px; font-size: 1.5em; }
.subtitle {
color: #666;
@@ -41,7 +34,7 @@
border-left: 4px solid #ddd;
}
.summary-card.critical { border-left-color: #f44336; }
.summary-card.critical { border-left-color: #ea1e0f; }
.summary-card.warning { border-left-color: #ff9800; }
.summary-card.ok { border-left-color: #4caf50; }
@@ -51,7 +44,7 @@
line-height: 1;
}
.summary-number.critical { color: #f44336; }
.summary-number.critical { color: #ea1e0f; }
.summary-number.warning { color: #ff9800; }
.summary-number.ok { color: #4caf50; }
@@ -116,7 +109,7 @@
}
.alert-item.acknowledged {
opacity: 0.6;
opacity: 0.8;
background: #f0f0f0;
}
+180 -2
View File
@@ -6,10 +6,25 @@
<title>{{ title }}</title>
{% if extra_scripts %}<script src="{{ extra_scripts }}"></script>{% endif %}
<style>
/* ── Reset / shared baseline ── */
*, *::before, *::after { box-sizing: border-box; }
html {
font-family: 'Segoe UI', system-ui, -apple-system, sans-serif;
font-size: 14px;
}
body {
margin: 0;
padding: 10px;
background: #f5f5f5;
}
h1 { font-size: 1.5em; color: #333; margin: 0 0 5px; }
h2 { font-size: 1.1em; color: #333; margin: 0 0 8px; }
p { margin: 0; }
/* Navigation bar — shared across all pages */
.nav {
background: #fff;
padding: 10px 15px;
padding: 6px 12px;
margin-bottom: 10px;
box-shadow: 0 2px 4px rgba(0,0,0,.1);
border-radius: 4px;
@@ -42,6 +57,17 @@
transition: background 0.15s;
}
.nav-user:hover { background: #f0f4ff; text-decoration: none; }
.nav-username {
max-width: 0;
overflow: hidden;
white-space: nowrap;
opacity: 0;
transition: max-width 0.2s ease, opacity 0.2s ease;
}
.nav-user:hover .nav-username {
max-width: 160px;
opacity: 1;
}
.nav-avatar {
width: 28px; height: 28px;
border-radius: 50%;
@@ -94,6 +120,158 @@
.nav-links.nav-open { display: flex; }
.nav-links a { margin-right: 0; padding: 6px 0; font-size: 1em; }
}
/* Swiss railway clock — nav */
.nav-clock {
flex-shrink: 0;
line-height: 0;
margin-left: auto;
padding: 4px 4px 4px 0;
cursor: pointer;
}
#swiss-clock { display: block; }
/* Swiss railway clock — full-page overlay */
#clock-overlay {
display: none;
position: fixed;
inset: 0;
z-index: 9999;
background: #1a1a1a;
align-items: center;
justify-content: center;
cursor: pointer;
}
#clock-overlay.visible { display: flex; }
#swiss-clock-overlay { display: block; }
</style>
<script src="static/sorttable.js"></script>
<script>
/* ── Swiss Federal Railway (SBB) clock ── */
/* Draw one frame of the clock onto any canvas element. */
function drawSwissClock(canvas) {
var SIZE = canvas.width;
var R = SIZE / 2;
var ctx = canvas.getContext('2d');
var now = new Date();
var h = now.getHours() % 12;
var m = now.getMinutes();
var s = now.getSeconds();
var ms = now.getMilliseconds();
/* Seconds hand idles ~1.5 s at 12 before advancing (SBB behaviour) */
var sFrac = s + ms / 1000;
var sAngle = sFrac >= 58.5 ? 0 : (sFrac / 58.5) * Math.PI * 2;
ctx.clearRect(0, 0, SIZE, SIZE);
/* face */
ctx.beginPath();
ctx.arc(R, R, R - 1, 0, Math.PI * 2);
ctx.fillStyle = '#fff';
ctx.fill();
ctx.strokeStyle = '#333';
ctx.lineWidth = SIZE * 0.018;
ctx.stroke();
/* tick marks */
for (var i = 0; i < 60; i++) {
var a = (i / 60) * Math.PI * 2 - Math.PI / 2;
var isHour = (i % 5 === 0);
ctx.beginPath();
ctx.moveTo(R + Math.cos(a) * (isHour ? R * 0.72 : R * 0.88),
R + Math.sin(a) * (isHour ? R * 0.72 : R * 0.88));
ctx.lineTo(R + Math.cos(a) * R * 0.94,
R + Math.sin(a) * R * 0.94);
ctx.strokeStyle = '#222';
ctx.lineWidth = isHour ? SIZE * 0.027 : SIZE * 0.011;
ctx.lineCap = 'butt';
ctx.stroke();
}
/* hands */
function hand(angle, tip, tail, width, color) {
ctx.save();
ctx.translate(R, R);
ctx.rotate(angle);
ctx.beginPath();
ctx.moveTo(tail, 0);
ctx.lineTo(tip, 0);
ctx.strokeStyle = color;
ctx.lineWidth = width;
ctx.lineCap = 'square';
ctx.stroke();
ctx.restore();
}
hand((m + s / 60) / 60 * Math.PI * 2 - Math.PI / 2,
R * 0.88, -R * 0.12, SIZE * 0.027, '#222'); /* minute */
hand((h + m / 60) / 12 * Math.PI * 2 - Math.PI / 2,
R * 0.58, -R * 0.12, SIZE * 0.039, '#222'); /* hour */
hand(sAngle - Math.PI / 2, R * 0.78, -R * 0.22,
SIZE * 0.013, '#e00'); /* second tail+tip */
/* round dot at tip of second hand */
var dotR = SIZE * 0.028;
ctx.save();
ctx.translate(R, R);
ctx.rotate(sAngle - Math.PI / 2);
ctx.beginPath();
ctx.arc(R * 0.78, 0, dotR, 0, Math.PI * 2);
ctx.fillStyle = '#e00';
ctx.fill();
ctx.restore();
/* centre cap */
ctx.beginPath();
ctx.arc(R, R, R * 0.04, 0, Math.PI * 2);
ctx.fillStyle = '#222';
ctx.fill();
}
/* Resize the overlay canvas to fit the viewport, keeping it square. */
function resizeOverlayClock() {
var oc = document.getElementById('swiss-clock-overlay');
if (!oc) return;
var size = Math.min(window.innerWidth, window.innerHeight) * 0.88;
size = Math.floor(size);
oc.width = size;
oc.height = size;
}
/* Main tick — redraws both nav clock and (if visible) overlay clock. */
function clockTick() {
var nav = document.getElementById('swiss-clock');
if (nav) drawSwissClock(nav);
var overlay = document.getElementById('clock-overlay');
if (overlay && overlay.classList.contains('visible')) {
var oc = document.getElementById('swiss-clock-overlay');
if (oc) drawSwissClock(oc);
}
var delay = 100 - (Date.now() % 100);
setTimeout(clockTick, delay);
}
document.addEventListener('DOMContentLoaded', function() {
/* Start the shared tick loop */
clockTick();
/* Overlay toggle — clicking the nav clock opens it */
var navClock = document.querySelector('.nav-clock');
var overlay = document.getElementById('clock-overlay');
if (navClock && overlay) {
navClock.addEventListener('click', function() {
resizeOverlayClock();
overlay.classList.add('visible');
});
overlay.addEventListener('click', function() {
overlay.classList.remove('visible');
});
window.addEventListener('resize', function() {
if (overlay.classList.contains('visible')) resizeOverlayClock();
});
}
});
</script>
<script src="static/sorttable.js"></script>
</head>
+4 -6
View File
@@ -7,10 +7,6 @@
display: flex;
flex-direction: column;
height: 100vh;
box-sizing: border-box;
padding: 10px;
margin: 0;
background: #f5f5f5;
overflow: hidden;
}
@@ -489,8 +485,10 @@
{% include 'menu.html' %}
<div class="container">
<h1>{{ header }}</h1>
<p class="subtitle">Real-time host monitoring and event log</p>
<div>
<h1>{{ header }}</h1>
<p class="subtitle">Real-time host monitoring and event log</p>
</div>
<div class="table-section">
<table id="ntable" class="sortable">
+9
View File
@@ -10,6 +10,9 @@
<a href="/settings"{% if active_page == "settings" %} class="active"{% endif %}>Settings</a>
{% endif %}
</div>
<div class="nav-clock" title="Click for full-screen clock">
<canvas id="swiss-clock" width="44" height="44"></canvas>
</div>
{% if current_user %}
<a href="/profile" class="nav-user{% if active_page == 'profile' %} active{% endif %}" title="{{ current_user.full_name or current_user.username }}">
{% if current_user.avatar %}
@@ -21,6 +24,12 @@
</a>
{% endif %}
</div>
<!-- Full-page clock overlay (click anywhere to dismiss) -->
<div id="clock-overlay">
<canvas id="swiss-clock-overlay" width="400" height="400"></canvas>
</div>
<script>
(function() {
var btn = document.getElementById('nav-hamburger-btn');
+1 -5
View File
@@ -3,11 +3,7 @@
{% include 'head.html' %}
<style>
body {
margin: 10px;
background: #f5f5f5;
overflow: hidden;
}
body { overflow: hidden; }
.container {
max-width: 1400px;
+1 -9
View File
@@ -3,15 +3,7 @@
{% include 'head.html' %}
<style>
html, body {
overflow: visible;
}
body {
margin: 20px;
background: #f5f5f5;
font-family: 'Segoe UI', system-ui, sans-serif;
}
html, body { overflow: visible; }
.container {
max-width: 900px;
+1 -10
View File
@@ -3,19 +3,10 @@
{% include 'head.html' %}
<style>
html, body {
overflow: visible;
}
body {
margin: 20px;
background: #f5f5f5;
font-family: 'Segoe UI', system-ui, sans-serif;
}
html, body { overflow: visible; }
.container {
max-width: 960px;
margin: 0 auto;
}
h1 { color: #333; margin-bottom: 4px; font-size: 1.5em; }
+82 -30
View File
@@ -60,6 +60,7 @@ class AlertState:
self.acknowledged = False # Whether alert has been acknowledged
self.acknowledged_at = None # Timestamp when acknowledged
self.consecutive_count = 0 # Consecutive exceedances while still OK (for count gating)
self.pending_since: Optional[float] = None # non-None while waiting out grace period before notifying
def update(
self,
@@ -105,6 +106,7 @@ class AlertState:
self.level = level
self.since = now
self.notification_count = 0
self.last_notification = None # restart reminder interval on level change
# Reset acknowledgment on state change
if level != AlertLevel.OK:
# Only reset if changing to a different alert level
@@ -339,8 +341,9 @@ class ThresholdChecker:
self.default_config = "default"
self.renotify_interval = renotify_interval
self.grace_seconds: float = float(config.get("grace", 2))
self.journal = journal
# Parse configuration
self._parse_config(config)
@@ -371,7 +374,8 @@ class ThresholdChecker:
self.threshold_configs.clear()
self.thresholds.clear()
self.host_config_mapping.clear()
self.grace_seconds = float(config.get("grace", 2))
# Parse new configuration
self._parse_config(config)
@@ -759,15 +763,10 @@ class ThresholdChecker:
# Update state and check for changes
old_level = alert_state.level
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
# For check_value, we don't have full plugin data, pass None
lvl, message, formatted_msg = self._trigger_notification(host_name, metric_path, old_level, new_level, value, threshold, None)
# Update alert state with formatted message
alert_state.formatted_message = formatted_msg
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, None)
return (old_level, new_level)
elif new_level != AlertLevel.OK:
# Check if we should re-notify
self._check_renotify(host_name, alert_state, metric_path, value, threshold, None)
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)
return None
def check_plugin_data(
@@ -826,14 +825,10 @@ class ThresholdChecker:
old_level = alert_state.level
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
state_changes.append((metric_path, old_level, new_level, value))
lvl, message, formatted_msg = self._trigger_notification(host_name, metric_path, old_level, new_level, value, threshold, data)
# Update alert state with formatted message
alert_state.formatted_message = formatted_msg
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
elif new_level != AlertLevel.OK:
# Check if we should re-notify
self._check_renotify(host_name, alert_state, metric_path, value, threshold, data)
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
# Check nested metrics (e.g., partition data in disk_monitor)
self._check_nested_metrics(
host_name,
@@ -895,20 +890,9 @@ class ThresholdChecker:
old_level = alert_state.level
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
state_changes.append((metric_path, old_level, new_level, value))
lvl, message, formatted_msg = self._trigger_notification(
host_name,
metric_path,
old_level,
new_level,
value,
threshold,
data # Pass full plugin data for format string
)
# Update alert state with formatted message
alert_state.formatted_message = formatted_msg
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
elif new_level != AlertLevel.OK:
self._check_renotify(host_name, alert_state, metric_path, value, threshold, data)
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
def _trigger_notification(
self,
@@ -947,7 +931,7 @@ class ThresholdChecker:
# Format message
if new_level == AlertLevel.OK:
lvl = "RECOVERED"
lvl = "RECOVER"
message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
elif new_level == AlertLevel.WARNING:
lvl = "WARNING"
@@ -1083,6 +1067,74 @@ class ThresholdChecker:
)
return f"(threshold: {op_symbol} {threshold_value})"
def _apply_grace(
self,
host_name: str,
alert_state: AlertState,
metric_path: str,
old_level: AlertLevel,
new_level: AlertLevel,
value: Any,
threshold: ThresholdConfig,
plugin_data: Optional[Dict[str, Any]],
) -> None:
"""Handle a state-change transition with grace-period logic.
Transitioning INTO alert: defers the notification for grace_seconds.
Transitioning TO OK:
- Still in grace window (pending_since set): suppresses both the alert
and the recovery — the spike never warranted a page.
- Past grace: fires the RECOVER notification normally.
"""
lvl, message, formatted_msg = self._trigger_notification(
host_name, metric_path, old_level, new_level, value, threshold, plugin_data
)
alert_state.formatted_message = formatted_msg
if new_level == AlertLevel.OK:
if alert_state.pending_since is not None:
logger.info(
"Alert suppressed (recovered within %.0fs grace): %s on %s",
self.grace_seconds, metric_path, host_name,
)
alert_state.pending_since = None
else:
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
else:
alert_state.pending_since = time.time()
logger.debug(
"Alert deferred (%.0fs grace): %s on %s = %s",
self.grace_seconds, metric_path, host_name, value,
)
def _check_pending_or_renotify(
self,
host_name: str,
alert_state: AlertState,
metric_path: str,
value: Any,
threshold: ThresholdConfig,
plugin_data: Optional[Dict[str, Any]],
) -> None:
"""Called when alert level is unchanged and non-OK.
If a deferred notification is pending and grace_seconds have elapsed,
fires it now. Otherwise falls through to normal reminder logic.
"""
if alert_state.pending_since is not None:
if time.time() - alert_state.pending_since >= self.grace_seconds:
lvl, message, formatted_msg = self._trigger_notification(
host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data
)
alert_state.formatted_message = formatted_msg
self._send_notification(
host_name, lvl, message, metric_path, AlertLevel.OK, alert_state.level, value
)
alert_state.pending_since = None
# else: still within grace window, do nothing
else:
self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data)
def _check_renotify(
self,
host_name: str,
+24
View File
@@ -171,6 +171,24 @@ def dicttos(ID, d):
DROPOVERDUE = 7 * 24 * 3600 # seconds before an overdue host becomes UNKNOWN
def _set_connectivity_alert(host, afam, level_name):
"""Update (or clear) a connectivity alert_state entry for a host/address-family.
level_name is "CRITICAL", "WARNING", or "OK". "OK" removes the entry so
that recovered hosts don't clutter the Alerts Dashboard.
"""
from .threshold import AlertState, AlertLevel
metric_path = f"connectivity.{afam}"
level = getattr(AlertLevel, level_name, AlertLevel.OK)
if level == AlertLevel.OK:
host.alert_states.pop(metric_path, None)
return
if metric_path not in host.alert_states:
host.alert_states[metric_path] = AlertState(metric_path)
state = host.alert_states[metric_path]
state.update(level, level_name)
def _make_timer_callbacks(uname, host, ctx):
"""Return (on_overdue, on_unknown) async callbacks for connection timer logic.
@@ -182,6 +200,7 @@ def _make_timer_callbacks(uname, host, ctx):
async def on_unknown(connection):
connection.newstate(connection.__class__.UNKNOWN, connection.lastbeat)
# Keep connectivity alert active when host transitions to unknown
if msg_to_websockets:
msg_to_websockets("host", host.stateinfo())
@@ -196,6 +215,8 @@ def _make_timer_callbacks(uname, host, ctx):
uname,
notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
)
# Track in alert_states so the Alerts Dashboard shows this
_set_connectivity_alert(host, connection.afam, "CRITICAL")
if threshold_checker:
threshold_checker.check_value(
host_name=uname,
@@ -410,6 +431,8 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
if conn.getstate() != hbdcls.Connection.UP:
lasts = conn.state
d = conn.newstate(hbdcls.Connection.UP, now)
# Clear connectivity alert now that the host is back up
_set_connectivity_alert(host, conn.afam, "OK")
# Don't log/notify RECOVER for a brand-new host seen for the first time —
# it was never down, it just hasn't been seen before.
if not newh:
@@ -436,6 +459,7 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
)
conn.newstate(hbdcls.Connection.DOWN, now)
_set_connectivity_alert(host, conn.afam, "CRITICAL")
if interval > 0:
host.interval = interval
+1 -1
View File
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "hbd"
version = "5.1.1"
version = "5.1.3"
description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
readme = "README.md"
requires-python = ">=3.11"
+48 -18
View File
@@ -14,6 +14,8 @@
set -e
what=$1
on_ha=0
[ -z "$what" ] && what="client"
if [ -d /homeassistant ]; then
echo "cannot install in HA, run \"docker exec -it homeassistant $0 $@\""
@@ -23,36 +25,64 @@ if [ -d /config ]; then
echo "Installing on HA"
where="/config/bin"
venv="/config/venvs"
on_ha=1
else
if [ ! -d ~/.local/bin ] && [ ! -d ~/bin ]; then
echo "No suitable bin directory found in PATH, please add either ~/.local/bin or ~/bin to your PATH"
if [ ! -d $HOME/.local/bin ] && [ ! -d $HOME/bin ]; then
echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
exit 1
fi
for where in ~/bin ~/.local/bin; do
for where in $HOME/bin $HOME/.local/bin notset ; do
if echo ":$PATH:" | grep -q ":$where:" ; then
break
fi
done
venv="~/venvs"
if [ "$where" = "notset" ]; then
echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
exit 1
fi
venv="$HOME/venvs"
fi
python3 -m pip --version > /dev/null 2>&1 || { echo "pip is not installed, please install pip for python3"; exit 1; }
echo "Installing heartbeat $what"
if [ ! -d $venv/hbd ]; then
python3 -m pip --version > /dev/null 2>&1
if [ $? -ne 0 ]; then
# truenas does not have pip installed by default, so we need to fetch get-pip.py and install pip
echo "pip is not installed, fetching get-pip.py and installing pip"
arg="--without-pip"
fi
mkdir -p $venv
have_venv=$(python3 -c "import venv" &> /dev/null && echo "Installed" || echo "Not Installed")
if [ "$have_venv" = "Not Installed" ]; then
echo "python venv module not found, installing virtualenv"
python3 -m pip install --user virtualenv
python3 -m virtualenv $venv/hbd --system-site-packages $arg
else
python3 -m venv $venv/hbd --system-site-packages $arg
fi
. $venv/hbd/bin/activate
if [ -n "$arg" ]; then
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py
fi
deactivate
fi
. $venv/hbd/bin/activate
python3 -mpip install --upgrade --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]
if [ "$what" = "server" ]; then
echo "Installing heartbeat server (hbd)"
else
what="client"
echo "Installing heartbeat client (hbc)"
fi
if [ ! -d $venv/hbd ]; then
mkdir -p $venv
python3 -m venv $venv/hbd --system-site-packages
fi
. $venv/hbd/bin/activate
pip install --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]
if [ "$what" = "server" ]; then
rm -f ~$where/hbd
rm -f $where/hbd
ln -sf $(which hbd) $where/hbd
echo "hbd installed, you can run it with \"$where/hbd\" or \"hbd\" if $where is in your PATH"
else
rm -f $where/hbc
ln -sf $(which hbc) $where/hbc
if [ $on_ha -eq 1 ]; then
echo "restarting hbc "
job=$(grep run_hbc configuration.yaml | sed 's/run_hbc://')
$job
else
echo "hbc installed, you can run it with \"$where/hbc\" or \"hbc\" if $where is in your PATH"
fi
fi
+99
View File
@@ -0,0 +1,99 @@
import asyncio
import logging
import os
import stat
from hbd.client.plugins.nagios_runner import (
NagiosRunnerPlugin,
NAGIOS_OK,
NAGIOS_WARNING,
NAGIOS_CRITICAL,
NAGIOS_UNKNOWN,
)
def test_no_commands_sets_skip_reason():
plugin = NagiosRunnerPlugin(config={"commands": []})
result = asyncio.run(plugin.initialize())
assert result is False
assert plugin.skip_reason is not None
assert "nagios_runner.commands" in plugin.skip_reason
def test_stderr_used_when_stdout_empty(tmp_path):
script = tmp_path / "check_err.sh"
script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
script.chmod(script.stat().st_mode | stat.S_IEXEC)
config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
plugin = NagiosRunnerPlugin(config=config)
asyncio.run(plugin.initialize())
data = asyncio.run(plugin._collect_metrics())
assert "error from stderr" in data["t_output"]
assert data["t_status_code"] == NAGIOS_CRITICAL
def test_stderr_appended_when_both_present(tmp_path):
script = tmp_path / "check_both.sh"
script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
script.chmod(script.stat().st_mode | stat.S_IEXEC)
config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
plugin = NagiosRunnerPlugin(config=config)
asyncio.run(plugin.initialize())
data = asyncio.run(plugin._collect_metrics())
assert "OK - all good" in data["t_output"]
assert "extra detail" in data["t_output"]
assert data["t_status_code"] == NAGIOS_OK
def test_negative_returncode_maps_to_unknown():
# kill -9 $$ kills the shell itself; asyncio sees returncode -9
config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
plugin = NagiosRunnerPlugin(config=config)
asyncio.run(plugin.initialize())
data = asyncio.run(plugin._collect_metrics())
assert data["t_status_code"] == NAGIOS_UNKNOWN
assert "signal" in data["t_output"].lower()
def test_absolute_path_not_found_warns(caplog):
fake_cmd = "/nonexistent_hbc_test_path/check_something"
config = {"commands": [{"name": "t", "command": fake_cmd}]}
plugin = NagiosRunnerPlugin(config=config)
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
asyncio.run(plugin.initialize())
assert any("not found" in r.message for r in caplog.records)
def test_absolute_path_not_executable_warns(caplog, tmp_path):
non_exec = tmp_path / "check_test"
non_exec.write_text("#!/bin/sh\necho OK\n")
non_exec.chmod(0o644) # readable but not executable
config = {"commands": [{"name": "t", "command": str(non_exec)}]}
plugin = NagiosRunnerPlugin(config=config)
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
asyncio.run(plugin.initialize())
assert any("not executable" in r.message for r in caplog.records)
def test_relative_path_not_checked(caplog):
# Relative paths (resolved via PATH) must not generate warnings
config = {"commands": [{"name": "t", "command": "echo OK"}]}
plugin = NagiosRunnerPlugin(config=config)
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
asyncio.run(plugin.initialize())
assert not any(
"not found" in r.message or "not executable" in r.message
for r in caplog.records
)
+83
View File
@@ -0,0 +1,83 @@
import asyncio
import logging
import textwrap
from hbd.client.plugin import PluginLoader, PluginRegistry
def test_plugin_skip_reason_defaults_none(tmp_path):
plugin_code = textwrap.dedent("""
from hbd.client.plugin import MonitorPlugin
class MinimalPlugin(MonitorPlugin):
name = "minimal"
version = "1.0.0"
interval = 60
async def initialize(self):
return True
async def _collect_metrics(self):
return {}
""")
(tmp_path / "minimal.py").write_text(plugin_code)
registry = PluginRegistry()
loader = PluginLoader(registry)
asyncio.run(loader.load_from_directory(tmp_path))
plugin = registry.get("minimal")
assert plugin is not None
assert plugin.skip_reason is None
def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
plugin_code = textwrap.dedent("""
from hbd.client.plugin import MonitorPlugin
class SkippablePlugin(MonitorPlugin):
name = "skippable"
version = "1.0.0"
interval = 60
async def initialize(self):
self.skip_reason = "not configured in yaml"
return False
async def _collect_metrics(self):
return {}
""")
(tmp_path / "skippable.py").write_text(plugin_code)
registry = PluginRegistry()
loader = PluginLoader(registry)
with caplog.at_level(logging.INFO, logger="plugin.loader"):
count = asyncio.run(loader.load_from_directory(tmp_path))
assert count == 0
assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
assert not any("failed initialization" in r.message for r in caplog.records)
def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
plugin_code = textwrap.dedent("""
from hbd.client.plugin import MonitorPlugin
class FailPlugin(MonitorPlugin):
name = "fail"
version = "1.0.0"
interval = 60
async def initialize(self):
return False
async def _collect_metrics(self):
return {}
""")
(tmp_path / "fail_plugin.py").write_text(plugin_code)
registry = PluginRegistry()
loader = PluginLoader(registry)
with caplog.at_level(logging.WARNING, logger="plugin.loader"):
count = asyncio.run(loader.load_from_directory(tmp_path))
assert count == 0
assert any("failed initialization" in r.message for r in caplog.records)