version 5.1.3

fix: remove unused pytest import in test_nagios_runner
fix: use os.path.exists check for /dev/log instead of dead-code OSError catch
2026-04-25 16:52:56 +02:00 · 2026-04-25 16:39:56 +02:00 · 2026-04-25 16:36:00 +02:00 · 2026-04-25 16:29:54 +02:00 · 2026-04-25 16:28:32 +02:00 · 2026-04-25 16:24:33 +02:00
22 changed files with 1380 additions and 176 deletions
@@ -24,11 +24,11 @@ jobs:
          
      - name: Install build tools
        run: |
-          python -m pip install --upgrade pip
-          pip install build twine
+          python3 -m pip install --upgrade pip
+          python3 -m pip install build twine
          
      - name: Build package
-        run: python -m build
+        run: python3 -m build
        
      - name: Extract version from tag
        id: get_version
@@ -39,7 +39,7 @@ jobs:
          TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
        run: |
-          python -m twine upload --repository-url https://git.wrede.ca/api/packages/andreas/pypi dist/*
+          python3 -m twine upload --repository-url https://git.wrede.ca/api/packages/andreas/pypi dist/*

      - name: Create release
        uses: actions/gitea-release-action@v1
@@ -0,0 +1,602 @@
+# Plugin Error Checking Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode.
+
+**Architecture:** Three focused changes across three files: (1) `hbd/client/plugin.py` gains a `skip_reason` attribute on Plugin and updated PluginLoader messaging; (2) `hbd/client/plugins/nagios_runner.py` gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) `hbd/client/main.py` gains proper post-fork logging reconfiguration to syslog.
+
+**Tech Stack:** Python 3.11+, asyncio, `logging.handlers.SysLogHandler`, pytest
+
+---
+
+## File Map
+
+| Action | Path | What changes |
+|---|---|---|
+| Modify | `hbd/client/plugin.py` | `Plugin.__init__` gains `skip_reason`; `PluginLoader` checks it |
+| Modify | `hbd/client/plugins/nagios_runner.py` | async subprocess, stderr, signal codes, init validation, `skip_reason` |
+| Modify | `hbd/client/main.py` | `_reconfigure_logging_for_daemon()` helper; remove redundant syslog calls |
+| Create | `tests/test_plugin.py` | PluginLoader messaging tests |
+| Create | `tests/test_nagios_runner.py` | NagiosRunnerPlugin behaviour tests |
+
+Run tests throughout with:
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+
+---
+
+## Task 1: Plugin.skip_reason + PluginLoader messaging
+
+**Files:**
+- Modify: `hbd/client/plugin.py:40-48` (Plugin.__init__)
+- Modify: `hbd/client/plugin.py:369-381` (PluginLoader.load_from_directory)
+- Create: `tests/test_plugin.py`
+
+- [ ] **Step 1: Write failing tests**
+
+Create `tests/test_plugin.py`:
+
+```python
+import asyncio
+import logging
+import textwrap
+
+from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry
+
+
+def test_plugin_skip_reason_defaults_none(tmp_path):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class MinimalPlugin(MonitorPlugin):
+            name = "minimal"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return True
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "minimal.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+    asyncio.run(loader.load_from_directory(tmp_path))
+    plugin = registry.get("minimal")
+    assert plugin is not None
+    assert plugin.skip_reason is None
+
+
+def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class SkippablePlugin(MonitorPlugin):
+            name = "skippable"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                self.skip_reason = "not configured in yaml"
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "skippable.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.INFO, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
+    assert not any("failed initialization" in r.message for r in caplog.records)
+
+
+def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class FailPlugin(MonitorPlugin):
+            name = "fail"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "fail_plugin.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("failed initialization" in r.message for r in caplog.records)
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+python -m pytest tests/test_plugin.py -v
+```
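+Expected: `test_plugin_skip_reason_defaults_none` FAILS (attribute missing) and `test_loader_logs_info_when_skip_reason_set` FAILS (the loader still logs the "failed initialization" warning); `test_loader_logs_warning_when_no_skip_reason` should already pass, since it exercises existing behaviour.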
+
+- [ ] **Step 3: Add `skip_reason` to `Plugin.__init__`**
+
+In `hbd/client/plugin.py`, in `Plugin.__init__` (around line 46), add one line:
+
+```python
+def __init__(self, config: Optional[Dict[str, Any]] = None):
+    self.config = config or {}
+    self.logger = logging.getLogger(f"plugin.{self.name}")
+    self._initialized = False
+    self.skip_reason: Optional[str] = None
+```
+
+- [ ] **Step 4: Update PluginLoader messaging**
+
+In `hbd/client/plugin.py`, replace the `if not initialized:` block (around line 372):
+
+```python
+                    if not initialized:
+                        if plugin.skip_reason:
+                            self.logger.info(
+                                f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
+                            )
+                        else:
+                            self.logger.warning(
+                                f"Plugin {plugin.name} failed initialization, skipping"
+                            )
+                        continue
+```
+
+- [ ] **Step 5: Run tests to verify they pass**
+
+```bash
+python -m pytest tests/test_plugin.py -v
+```
+Expected: all 3 tests PASS.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add hbd/client/plugin.py tests/test_plugin.py
+git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging"
+```
+
+---
+
+## Task 2: NagiosRunnerPlugin — skip_reason when no commands
+
+**Files:**
+- Modify: `hbd/client/plugins/nagios_runner.py:88-105` (initialize)
+- Create: `tests/test_nagios_runner.py`
+
+- [ ] **Step 1: Write failing test**
+
+Create `tests/test_nagios_runner.py`:
+
+```python
+import asyncio
+import logging
+import os
+import stat
+
+import pytest
+
+from hbd.client.plugins.nagios_runner import (
+    NagiosRunnerPlugin,
+    NAGIOS_OK,
+    NAGIOS_WARNING,
+    NAGIOS_CRITICAL,
+    NAGIOS_UNKNOWN,
+)
+
+
+def test_no_commands_sets_skip_reason():
+    plugin = NagiosRunnerPlugin(config={"commands": []})
+    result = asyncio.run(plugin.initialize())
+    assert result is False
+    assert plugin.skip_reason is not None
+    assert "nagios_runner.commands" in plugin.skip_reason
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
+```
+Expected: FAIL — `plugin.skip_reason` is `None`.
+
+- [ ] **Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()**
+
+In `hbd/client/plugins/nagios_runner.py`, replace the early-return block in `initialize()` (around line 96):
+
+```python
+        if not self.commands:
+            self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
+            self.logger.info("No Nagios commands configured")
+            return False
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
+```
+Expected: PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
+git commit -m "feat: set skip_reason on nagios_runner when no commands configured"
+```
+
+---
+
+## Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes
+
+**Files:**
+- Modify: `hbd/client/plugins/nagios_runner.py` (imports + `_run_nagios_plugin`)
+- Modify: `tests/test_nagios_runner.py`
+
+- [ ] **Step 1: Write failing tests**
+
+Append to `tests/test_nagios_runner.py`:
+
+```python
+def test_stderr_used_when_stdout_empty(tmp_path):
+    script = tmp_path / "check_err.sh"
+    script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "error from stderr" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_CRITICAL
+
+
+def test_stderr_appended_when_both_present(tmp_path):
+    script = tmp_path / "check_both.sh"
+    script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "OK - all good" in data["t_output"]
+    assert "extra detail" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_OK
+
+
+def test_negative_returncode_maps_to_unknown():
+    # kill -9 $$ kills the shell itself; asyncio sees returncode -9
+    config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert data["t_status_code"] == NAGIOS_UNKNOWN
+    assert "signal" in data["t_output"].lower()
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \
+    tests/test_nagios_runner.py::test_stderr_appended_when_both_present \
+    tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v
+```
+Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes.
+
+- [ ] **Step 3: Update imports in nagios_runner.py**
+
+Replace the import block at the top of `hbd/client/plugins/nagios_runner.py`:
+
+```python
+import asyncio
+import os
+import re
+from typing import Any, Dict, List, Optional, Tuple
+
+from hbd.client.plugin import MonitorPlugin
+```
+
+(Remove `import subprocess`; add `import asyncio` and `import os`.)
+
+- [ ] **Step 4: Upgrade collection log level from DEBUG to INFO**
+
+In `hbd/client/plugins/nagios_runner.py`, in `_collect_metrics()`, change the debug log (around line 144) so results are visible at INFO level:
+
+```python
+                self.logger.info(
+                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
+                )
+```
+
+- [ ] **Step 5: Replace `_run_nagios_plugin` with async implementation**
+
+Replace the entire `_run_nagios_plugin` method in `hbd/client/plugins/nagios_runner.py`:
+
+```python
+    async def _run_nagios_plugin(
+        self,
+        command: str
+    ) -> Tuple[int, str, Dict[str, Any]]:
+        """Execute a Nagios plugin and parse its output."""
+        try:
+            proc = await asyncio.create_subprocess_shell(
+                command,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+            try:
+                stdout_bytes, stderr_bytes = await asyncio.wait_for(
+                    proc.communicate(), timeout=self.timeout
+                )
+            except asyncio.TimeoutError:
+                proc.kill()
+                await proc.communicate()
+                self.logger.error(f"Command timed out: {command}")
+                return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
+
+            status_code = proc.returncode
+
+            if status_code < 0:
+                return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}
+
+            if status_code > 3:
+                status_code = NAGIOS_UNKNOWN
+
+            stdout = stdout_bytes.decode(errors="replace").strip()
+            stderr = stderr_bytes.decode(errors="replace").strip()
+
+            # Parse perfdata from stdout before mixing in stderr
+            perfdata = self._parse_perfdata(stdout)
+
+            # Build status message
+            status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout
+
+            if not stdout and stderr:
+                output_msg = stderr
+            elif stdout and stderr:
+                output_msg = f"{status_part} [stderr: {stderr}]"
+            else:
+                output_msg = status_part
+
+            return status_code, output_msg, perfdata
+
+        except Exception as e:
+            self.logger.error(f"Error executing command: {e}")
+            return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
+```
+
+Also remove the now-unused `self.shell` line from `NagiosRunnerPlugin.__init__`; the `shell` config key no longer has any effect because `create_subprocess_shell` always runs the command through a shell:
+```python
+        self.shell: bool = config.get("shell", True) if config else True
+```
+
+- [ ] **Step 6: Run tests to verify they pass**
+
+```bash
+python -m pytest tests/test_nagios_runner.py -v
+```
+Expected: all tests PASS including the 3 new ones.
+
+- [ ] **Step 7: Commit**
+
+```bash
+git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
+git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling"
+```
+
+---
+
+## Task 4: NagiosRunnerPlugin — command path validation at init
+
+**Files:**
+- Modify: `hbd/client/plugins/nagios_runner.py` (initialize)
+- Modify: `tests/test_nagios_runner.py`
+
+- [ ] **Step 1: Write failing tests**
+
+Append to `tests/test_nagios_runner.py`:
+
+```python
+def test_absolute_path_not_found_warns(caplog):
+    fake_cmd = "/nonexistent_hbc_test_path/check_something"
+    config = {"commands": [{"name": "t", "command": fake_cmd}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not found" in r.message for r in caplog.records)
+
+
+def test_absolute_path_not_executable_warns(caplog, tmp_path):
+    non_exec = tmp_path / "check_test"
+    non_exec.write_text("#!/bin/sh\necho OK\n")
+    non_exec.chmod(0o644)  # readable but not executable
+
+    config = {"commands": [{"name": "t", "command": str(non_exec)}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not executable" in r.message for r in caplog.records)
+
+
+def test_relative_path_not_checked(caplog):
+    # Relative paths (resolved via PATH) must not generate warnings
+    config = {"commands": [{"name": "t", "command": "echo OK"}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert not any(
+        "not found" in r.message or "not executable" in r.message
+        for r in caplog.records
+    )
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \
+    tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \
+    tests/test_nagios_runner.py::test_relative_path_not_checked -v
+```
+Expected: `test_absolute_path_not_found_warns` and `test_absolute_path_not_executable_warns` FAIL (no warnings logged); `test_relative_path_not_checked` may pass.
+
+- [ ] **Step 3: Add command path validation to `initialize()`**
+
+In `hbd/client/plugins/nagios_runner.py`, extend `initialize()` by adding validation after the existing "log each command" loop (after line 103, before `return True`):
+
+```python
+        # Validate absolute command paths early
+        for cmd_config in self.commands:
+            name = cmd_config.get("name", "unnamed")
+            command = cmd_config.get("command", "")
+            tokens = command.split()
+            if not tokens:  # empty or whitespace-only command
+                continue
+            exe = tokens[0]
+            if os.path.isabs(exe):
+                if not os.path.isfile(exe):
+                    self.logger.warning(
+                        f"Command '{name}': executable not found: {exe}"
+                    )
+                elif not os.access(exe, os.X_OK):
+                    self.logger.warning(
+                        f"Command '{name}': executable not executable: {exe}"
+                    )
+```
+
+- [ ] **Step 4: Run full test suite to verify all pass**
+
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+Expected: all tests PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
+git commit -m "feat: validate absolute command paths at nagios_runner init"
+```
+
+---
+
+## Task 5: Daemon mode logging — route to syslog after fork
+
+**Files:**
+- Modify: `hbd/client/main.py` (new helper + updated daemon block)
+
+No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below.
+
+- [ ] **Step 1: Add `_reconfigure_logging_for_daemon` helper**
+
+In `hbd/client/main.py`, add this function just before `def build_parser()` (around line 589):
+
+```python
+def _reconfigure_logging_for_daemon(log_level: int) -> None:
+    """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
+    from logging.handlers import SysLogHandler
+
+    root = logging.getLogger()
+    for handler in root.handlers[:]:
+        root.removeHandler(handler)
+        handler.close()
+
+    # Check for the socket path explicitly rather than relying on an OSError
+    # from the SysLogHandler constructor, which turned out to be dead code.
+    use_udp_fallback = not os.path.exists("/dev/log")
+    address = ("localhost", 514) if use_udp_fallback else "/dev/log"
+    syslog_handler = SysLogHandler(
+        address=address,
+        facility=SysLogHandler.LOG_DAEMON,
+    )
+    syslog_handler.setFormatter(
+        logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
+    )
+    root.addHandler(syslog_handler)
+    root.setLevel(log_level)
+
+    if use_udp_fallback:
+        # The handler is attached first so the warning itself reaches syslog
+        logging.warning("/dev/log not found, using syslog UDP localhost:514")
+```
+
+- [ ] **Step 2: Update the daemon block in `main()`**
+
+In `hbd/client/main.py`, replace the entire `if args.daemon:` block (lines 664–675):
+
+```python
+    if args.daemon:
+        print("Daemonizing...")
+        daemonize()
+        _reconfigure_logging_for_daemon(log_level)
+        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
+```
+
+This removes the `import syslog`, `syslog.openlog()`, and `syslog.syslog()` calls (now handled by the logging system) and removes the no-op second `logging.basicConfig()` call.
+
+- [ ] **Step 3: Run existing test suite to confirm no regressions**
+
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+Expected: all tests still PASS.
+
+- [ ] **Step 4: Manual smoke test — verify syslog output in daemon mode**
+
+```bash
+# In one terminal, tail syslog
+sudo journalctl -f -t hbc
+
+# In another terminal, start hbc in daemon mode (replace localhost with a real or dummy host)
+python -m hbd.client.main -d -v localhost
+
+# Expected in journalctl output:
+#   hbc[<pid>]: hbc.main INFO: Starting hbc for <hostname> -> ['localhost']
+#   hbc[<pid>]: hbc.main INFO: hbc starting, sending heartbeat to localhost
+#   hbc[<pid>]: plugin.loader INFO: ...
+
+# Stop the daemon
+pkill -f "hbd.client.main"
+```
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/client/main.py
+git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig"
+```
@@ -0,0 +1,92 @@
+# Plugin Error Checking & Daemon Logging — Design Spec
+
+**Date:** 2026-04-25  
+**Scope:** hbc client — daemon mode logging, nagios_runner plugin robustness, PluginLoader messaging  
+**Files affected:** `hbd/client/main.py`, `hbd/client/plugins/nagios_runner.py`, `hbd/client/plugin.py`
+
+---
+
+## 1. Daemon Mode Logging
+
+### Problem
+In `main()`, `logging.basicConfig()` is called before `daemonize()` (establishing a StreamHandler to stderr), then called again after `daemonize()`. The second call is a no-op — Python ignores `basicConfig()` when handlers are already configured. After daemonization, stderr is redirected to `/dev/null`, so all subsequent log output is silently discarded.
+
+The existing `syslog.openlog()` / `syslog.syslog()` calls (lines 666–668) write a single startup message but do not integrate with the `logging` system, so plugin and connection log messages never reach syslog.
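
The no-op described above can be reproduced outside hbc. The following is a minimal standalone sketch, not code from this change; the function name `demo_basicconfig_noop` is hypothetical. It calls `logging.basicConfig()` twice and reports which stream actually received the record.

```python
import io
import logging


def demo_basicconfig_noop():
    """Call basicConfig() twice; return what each stream captured."""
    root = logging.getLogger()
    # Start from a clean root logger, remembering its state for restoration.
    saved_handlers, saved_level = root.handlers[:], root.level
    for handler in saved_handlers:
        root.removeHandler(handler)

    first, second = io.StringIO(), io.StringIO()
    try:
        logging.basicConfig(stream=first, level=logging.INFO)
        # The root logger already has a handler, so this call changes nothing.
        logging.basicConfig(stream=second, level=logging.DEBUG)
        logging.getLogger("demo").info("hello")
        return first.getvalue(), second.getvalue(), len(root.handlers)
    finally:
        # Put the root logger back the way it was found.
        for handler in root.handlers[:]:
            root.removeHandler(handler)
        for handler in saved_handlers:
            root.addHandler(handler)
        root.setLevel(saved_level)
```

Passing `force=True` (Python 3.8+) makes `basicConfig()` replace existing handlers, but it would still install a `StreamHandler`, which is presumably why the fix attaches a `SysLogHandler` by hand.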
+
+### Fix
+After `daemonize()`, explicitly reconfigure the root logger:
+
+1. Remove all existing handlers (they now write to `/dev/null`).
+2. Add `logging.handlers.SysLogHandler(address='/dev/log', facility=LOG_DAEMON)`.
+3. Set formatter: `hbc[%(process)d]: %(name)s %(levelname)s: %(message)s`
+4. Preserve the `log_level` already determined from `-v`/`-x` CLI flags.
+
+Remove the redundant `syslog.openlog()` / `syslog.syslog()` calls — the logging system handles routing.
+
+**Fallback:** If `/dev/log` does not exist (containers, some BSDs), fall back to `SysLogHandler(address=('localhost', 514))`. Log one warning through the fallback handler once it is attached, so the operator knows; stderr already points at `/dev/null` by then.
+
+---
+
+## 2. Nagios Runner Improvements
+
+### 2a — Async Subprocess
+`_run_nagios_plugin()` is declared `async def` but calls `subprocess.run()` synchronously, blocking the event loop for the full command duration.
+
+**Fix:** Replace with `asyncio.create_subprocess_shell()` + `await proc.communicate()`. Enforce timeout with `asyncio.wait_for(..., timeout=self.timeout)` and catch `asyncio.TimeoutError`.
+
+### 2b — Stderr Capture
+Subprocess stderr is currently discarded: `capture_output=True` captures both stdout and stderr in the sync call, but only stdout is used, so stderr content is lost.
+
+**Fix:** Pass `stderr=asyncio.subprocess.PIPE` to `create_subprocess_shell`. After `communicate()`, if stdout is empty but stderr has content, use stderr as the output message. If both have content, append stderr to the output for visibility.
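
The merging rule above can be isolated as a pure function. This is a small sketch under that reading; `build_output` is a hypothetical helper name, and the `[stderr: ...]` suffix follows the format used in the implementation plan.

```python
def build_output(stdout: str, stderr: str) -> str:
    """Combine plugin stdout and stderr into one status message."""
    stdout, stderr = stdout.strip(), stderr.strip()
    # Nagios perfdata follows the first '|'; keep only the status text.
    status_part = stdout.split("|")[0].strip()
    if not stdout and stderr:
        return stderr  # nothing on stdout: stderr is the message
    if stdout and stderr:
        return f"{status_part} [stderr: {stderr}]"
    return status_part
```

Keeping this logic free of I/O makes the three cases (stderr only, both, stdout only) easy to test without spawning a process.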
+
+### 2c — Negative Return Codes
+A negative `returncode` means the process was killed by a signal (SIGKILL, for example from the OOM killer). The current code passes the value through unchanged, which can yield a status outside the Nagios range 0-3.
+
+**Fix:** If `returncode < 0`, map to `NAGIOS_UNKNOWN` with message `"Process killed by signal {-returncode}"`.
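
As an illustration of the mapping, here is a hedged sketch that separates the pure status normalization from the subprocess call. The names `normalize_status` and `run_and_normalize` are hypothetical, and `NAGIOS_UNKNOWN = 3` is assumed to match the constant in `nagios_runner.py`.

```python
import asyncio

NAGIOS_UNKNOWN = 3  # assumed to match the plugin's constant


def normalize_status(returncode: int) -> tuple[int, str]:
    """Map a raw exit status to a (nagios_status, note) pair."""
    if returncode < 0:
        # asyncio reports death by signal N as returncode -N on POSIX.
        return NAGIOS_UNKNOWN, f"Process killed by signal {-returncode}"
    if returncode > 3:
        return NAGIOS_UNKNOWN, ""
    return returncode, ""


async def run_and_normalize(command: str) -> tuple[int, str]:
    """Run a shell command and normalize its exit status."""
    proc = await asyncio.create_subprocess_shell(command)
    await proc.wait()
    return normalize_status(proc.returncode)
```

On a POSIX system, `asyncio.run(run_and_normalize("kill -9 $$"))` exercises the same path as `test_negative_returncode_maps_to_unknown` in the plan.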
+
+### 2d — Command Path Validation at Init
+`initialize()` currently only checks that the commands list is non-empty.
+
+**Fix:** For each command entry during `initialize()`:
+- Warn if `name` or `command` is missing, and skip path validation for that entry.
+- Extract the executable (first whitespace-delimited token of the command string).
+- If the executable is an absolute path, check `os.path.isfile()` and `os.access(..., os.X_OK)`. Log a `WARNING` if either check fails.
+- Commands with relative paths or shell builtins are not checked and produce no warning; they are resolved via PATH or by the shell at run time.
+- Validation warns only; all original entries in `self.commands` are retained and still attempted at collection time (where the existing missing-name/command guard already skips them). The plugin initializes successfully as long as the commands list is non-empty.
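
The checks listed above could look roughly like the sketch below. It is an assumption-laden illustration, not the shipped code: `check_executable` is a hypothetical helper, and tokenizing with `shlex.split` is a guess based on the `shlex` import in this diff (the spec itself only says "first whitespace-delimited token").

```python
import os
import shlex
from typing import Optional


def check_executable(command: str) -> Optional[str]:
    """Return a warning string for a bad absolute path, else None."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: leave the command to the shell at collection time.
        return None
    if not tokens:
        return None  # empty or whitespace-only command
    exe = tokens[0]
    if not os.path.isabs(exe):
        return None  # resolved via PATH or by the shell; not checked
    if not os.path.isfile(exe):
        return f"executable not found: {exe}"
    if not os.access(exe, os.X_OK):
        return f"executable not executable: {exe}"
    return None
```

Returning the message instead of logging it keeps the check side-effect free; the caller can decide whether to log at `WARNING`, which matches the "warn only, still attempt at collection time" decision.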
+
+---
+
+## 3. PluginLoader Messaging
+
+### Problem
+When `initialize()` returns `False`, the loader always logs:
+> `WARNING: Plugin X failed initialization, skipping`
+
+This is alarming when the real reason is simply "no commands configured". There is no API to distinguish "not configured" from "genuinely broken".
+
+### Fix
+Add an optional `skip_reason` attribute to `Plugin.__init__()` (defaults to `None`).
+
+In `PluginLoader.load_from_directory()`, after `initialize()` returns `False`:
+- If `plugin.skip_reason` is set → `logger.info(f"Plugin {plugin.name} skipped: {plugin.skip_reason}")`
+- If `plugin.skip_reason` is `None` → `logger.warning(f"Plugin {plugin.name} failed initialization, skipping")` (existing behaviour)
+
+In `NagiosRunnerPlugin.initialize()`, when no commands are configured:
+```python
+self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
+return False
+```
+
+Genuine failures (exceptions) continue to go through the existing `except` block in the loader, logging at `ERROR` with traceback — unchanged.
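
The loader's decision can be summarized in a few lines. This sketch uses a hypothetical `StubPlugin` stand-in rather than the real `Plugin` base class, and `init_failure_record` is an illustrative name, not a function in `plugin.py`.

```python
import logging
from typing import Optional, Tuple


class StubPlugin:
    """Hypothetical stand-in for a plugin whose initialize() returned False."""

    def __init__(self, name: str, skip_reason: Optional[str] = None):
        self.name = name
        self.skip_reason = skip_reason


def init_failure_record(plugin: StubPlugin) -> Tuple[int, str]:
    """Pick the log level and message for a plugin that did not initialize."""
    if plugin.skip_reason:
        return logging.INFO, f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
    return logging.WARNING, f"Plugin {plugin.name} failed initialization, skipping"
```

Because the check is a truthiness test, an empty-string `skip_reason` behaves like `None` and still produces the `WARNING`.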
+
+---
+
+## Decisions
+
+| Topic | Decision |
+|---|---|
+| Daemon log destination | syslog only (LOG_DAEMON facility) |
+| Syslog fallback | localhost:514 UDP if `/dev/log` absent |
+| Nagios result log level | INFO for all statuses (OK/WARNING/CRITICAL/UNKNOWN) |
+| Invalid command handling at init | Warn and continue; still attempt at collection time |
+| PluginLoader API change | `skip_reason` attribute on Plugin base class, checked by loader |
@@ -14,4 +14,4 @@ Install options:
 """

 __all__ = ["__version__"]
-__version__ = "5.1.1"
+__version__ = "5.1.3"
@@ -15,6 +15,7 @@ import socket
 import sys
 import time
 from hashlib import md5
+from logging.handlers import SysLogHandler
 from pathlib import Path
 from typing import Dict, List, Optional

@@ -586,6 +587,36 @@ def daemonize(
    os.dup2(se.fileno(), sys.stderr.fileno())


+def _reconfigure_logging_for_daemon(log_level: int) -> None:
+    """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
+    root = logging.getLogger()
+    for handler in root.handlers[:]:
+        root.removeHandler(handler)
+        handler.close()
+
+    use_udp_fallback = not os.path.exists("/dev/log")
+
+    if use_udp_fallback:
+        syslog_handler = SysLogHandler(
+            address=("localhost", 514),
+            facility=SysLogHandler.LOG_DAEMON,
+        )
+    else:
+        syslog_handler = SysLogHandler(
+            address="/dev/log",
+            facility=SysLogHandler.LOG_DAEMON,
+        )
+
+    syslog_handler.setFormatter(
+        logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
+    )
+    root.addHandler(syslog_handler)
+    root.setLevel(log_level)
+
+    if use_udp_fallback:
+        logging.warning("/dev/log not found, using syslog UDP localhost:514")
+
+
 def build_parser():
    """Build argument parser."""
    parser = argparse.ArgumentParser(
@@ -663,16 +694,9 @@ def main(argv=None):
    # Daemonize if requested
    if args.daemon:
        print("Daemonizing...")
-        import syslog
-        syslog.openlog("hbc", syslog.LOG_PID, syslog.LOG_DAEMON)
-        syslog.syslog(syslog.LOG_INFO, f"Starting heartbeat to {', '.join(args.hosts)}")
        daemonize()
-        
-        # Reconfigure logging for syslog
-        logging.basicConfig(
-            level=log_level,
-            format="hbc[%(process)d]: %(name)s %(levelname)s: %(message)s"
-        )
+        _reconfigure_logging_for_daemon(log_level)
+        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
    
    # Run async main
    try:
@@ -22,13 +22,14 @@ from typing import Any, Dict, List, Optional, Type

 class Plugin(ABC):
    """Base class for all plugins.
-    
+
    Attributes:
        name: Unique plugin identifier (e.g., "os_info", "cpu_monitor")
        version: Plugin version string
        description: Human-readable description
        interval: Collection interval in seconds (0 for InfoPlugin = collect once)
        enabled: Whether plugin is active (can be disabled via config)
+        skip_reason: Set by plugin before returning False from initialize(); causes loader to log INFO instead of WARNING.
    """
    
    name: str = ""
@@ -39,13 +40,14 @@ class Plugin(ABC):
    
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize plugin with optional configuration.
-        
+
        Args:
            config: Plugin-specific configuration from YAML (e.g., thresholds, paths)
        """
        self.config = config or {}
        self.logger = logging.getLogger(f"plugin.{self.name}")
        self._initialized = False
+        self.skip_reason: Optional[str] = None
        
    @abstractmethod
    async def initialize(self) -> bool:
@@ -312,9 +314,10 @@ class PluginLoader:
        
        loaded_count = 0
        raw_config = config or {}
-        # Per-plugin config lives under the 'plugins' key; fall back to top-level
-        # for backwards compatibility.
-        plugin_config = raw_config.get("plugins", raw_config)
+        # Per-plugin config lives under the 'plugins' key or at top-level.
+        # CLIENT_DEFAULTS seeds "plugins": {} so the key always exists; check
+        # both the subdict and top-level so that either layout in .hbc.yaml works.
+        plugins_subconfig = raw_config.get("plugins", {})
        
        # Scan for Python files
        for plugin_file in directory.glob("*.py"):
@@ -359,17 +362,23 @@ class PluginLoader:
                    
                    self.logger.debug(f"Found plugin class: {name}")
                    
-                    # Instantiate plugin with config
-                    plugin_instance_config = plugin_config.get(obj.name, {})
+                    # Instantiate plugin with config — check plugins subdict first,
+                    # then top-level keys (e.g. nagios_runner: ... at root of config).
+                    plugin_instance_config = plugins_subconfig.get(obj.name) or raw_config.get(obj.name, {})
                    plugin = obj(config=plugin_instance_config)
                    
                    # Initialize plugin
                    try:
                        initialized = await plugin.initialize()
                        if not initialized:
-                            self.logger.warning(
-                                f"Plugin {plugin.name} failed initialization, skipping"
-                            )
+                            if plugin.skip_reason:
+                                self.logger.info(
+                                    f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
+                                )
+                            else:
+                                self.logger.warning(
+                                    f"Plugin {plugin.name} failed initialization, skipping"
+                                )
                            continue
                    except Exception as e:
                        self.logger.error(
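The branch above separates a deliberate skip (logged at INFO with the plugin's own reason) from a genuine initialization failure (logged at WARNING). A minimal sketch of that decision, assuming a stand-in `FakePlugin` class and a helper `describe_init_failure`, both hypothetical names that are not part of hbd:

```python
import logging
from typing import Optional


class FakePlugin:
    """Hypothetical stand-in exposing only the fields the loader reads."""

    def __init__(self, name: str, skip_reason: Optional[str] = None):
        self.name = name
        self.skip_reason = skip_reason


def describe_init_failure(plugin: FakePlugin) -> tuple[int, str]:
    """Return the (log level, message) to emit when initialize() returned False."""
    if plugin.skip_reason:
        # The plugin opted out on purpose, so this is informational only.
        return logging.INFO, f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
    # No reason given: treat it as an unexpected failure.
    return logging.WARNING, f"Plugin {plugin.name} failed initialization, skipping"
```

Returning the level and message instead of logging directly is only a convenience for this sketch; it keeps the decision easy to check in isolation.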
@@ -21,8 +21,10 @@ nagios_runner:
 ```
 """

+import asyncio
+import os
 import re
-import subprocess
+import shlex
 from typing import Any, Dict, List, Optional, Tuple

 from hbd.client.plugin import MonitorPlugin
@@ -52,8 +54,7 @@ class NagiosRunnerPlugin(MonitorPlugin):
        interval: Collection interval in seconds (default: 300)
        commands: List of command definitions with 'name' and 'command' keys
        timeout: Command execution timeout in seconds (default: 30)
-        shell: Whether to execute commands via shell (default: True)
-    
+
    Example:
        nagios_runner:
          interval: 300  # Check every 5 minutes
@@ -76,32 +77,48 @@ class NagiosRunnerPlugin(MonitorPlugin):
        # Extract configuration
        self.commands: List[Dict[str, str]] = config.get("commands", []) if config else []
        self.timeout: int = config.get("timeout", 30) if config else 30
-        self.shell: bool = config.get("shell", True) if config else True
        self.interval = config.get("interval", 300) if config else 300
-        
-        # Validate commands
-        if not self.commands:
-            self.logger.info(
-                "No Nagios commands configured. Add 'nagios_runner.commands' to config."
-            )
    
    async def initialize(self) -> bool:
        """Initialize the Nagios runner plugin.
-        
+
        Returns:
            True if at least one command is configured, False otherwise
        """
        self.logger.info(f"Initializing {self.name} plugin")
-        
+
        if not self.commands:
-            self.logger.info("No Nagios commands configured")
+            self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
            return False
-        
+
        self.logger.info(f"Configured to run {len(self.commands)} Nagios plugin(s)")
        for cmd_config in self.commands:
            name = cmd_config.get("name", "unnamed")
            self.logger.info(f"  - {name}: {cmd_config.get('command', 'N/A')}")
-        
+
+        # Validate absolute command paths early
+        for cmd_config in self.commands:
+            name = cmd_config.get("name", "unnamed")
+            command = cmd_config.get("command", "")
+            if not command:
+                continue
+            try:
+                tokens = shlex.split(command)
+            except ValueError:
+                continue  # malformed command string; skip validation
+            if not tokens:
+                continue
+            exe = tokens[0]
+            if os.path.isabs(exe):
+                if not os.path.isfile(exe):
+                    self.logger.warning(
+                        f"Command '{name}': executable not found: {exe}"
+                    )
+                elif not os.access(exe, os.X_OK):
+                    self.logger.warning(
+                        f"Command '{name}': file is not executable: {exe}"
+                    )
+
        return True
    
    async def _collect_metrics(self) -> Dict[str, Any]:
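The init-time validation in this hunk only inspects commands whose first token is an absolute path; anything else is left for the shell to resolve via `PATH`. The same check can be sketched as a standalone function, assuming a hypothetical helper name `executable_problem` that returns a message instead of logging:

```python
import os
import shlex
from typing import Optional


def executable_problem(command: str) -> Optional[str]:
    """Describe what is wrong with an absolute executable path, or return None."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return None  # malformed quoting; left for execution time to report
    if not tokens:
        return None
    exe = tokens[0]
    if not os.path.isabs(exe):
        return None  # relative names are resolved via PATH by the shell
    if not os.path.isfile(exe):
        return f"executable not found: {exe}"
    if not os.access(exe, os.X_OK):
        return f"file is not executable: {exe}"
    return None
```

As in the diff, a problem here is advisory: the plugin still loads, and the command is attempted at collection time.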
@@ -141,7 +158,7 @@ class NagiosRunnerPlugin(MonitorPlugin):
                    for metric_name, metric_value in perfdata.items():
                        results[f"{name}_{metric_name}"] = metric_value
                
-                self.logger.debug(
+                self.logger.info(
                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
                )
                
@@ -163,46 +180,49 @@ class NagiosRunnerPlugin(MonitorPlugin):
        self,
        command: str
    ) -> Tuple[int, str, Dict[str, Any]]:
-        """Execute a Nagios plugin and parse its output.
-        
-        Args:
-            command: Command string to execute
-            
-        Returns:
-            Tuple of (status_code, output_message, performance_data_dict)
-        """
+        """Execute a Nagios plugin and parse its output."""
        try:
-            # Run command
-            result = subprocess.run(
+            proc = await asyncio.create_subprocess_shell(
                command,
-                shell=self.shell,
-                capture_output=True,
-                timeout=self.timeout,
-                text=True
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
            )
-            
-            status_code = result.returncode
-            output = result.stdout.strip()
-            
-            # Nagios plugins can return codes > 3, treat as UNKNOWN
+            try:
+                stdout_bytes, stderr_bytes = await asyncio.wait_for(
+                    proc.communicate(), timeout=self.timeout
+                )
+            except asyncio.TimeoutError:
+                proc.kill()
+                await proc.communicate()
+                self.logger.error(f"Command timed out: {command}")
+                return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
+
+            status_code = proc.returncode
+
+            if status_code < 0:
+                return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}
+
            if status_code > 3:
                status_code = NAGIOS_UNKNOWN
-            
-            # Parse performance data
-            perfdata = self._parse_perfdata(output)
-            
-            # Extract just the status message (before the pipe if present)
-            if '|' in output:
-                output_msg = output.split('|')[0].strip()
+
+            stdout = stdout_bytes.decode(errors="replace").strip()
+            stderr = stderr_bytes.decode(errors="replace").strip()
+
+            # Parse perfdata from stdout before mixing in stderr
+            perfdata = self._parse_perfdata(stdout)
+
+            # Build status message
+            status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout
+
+            if not stdout and stderr:
+                output_msg = stderr
+            elif stdout and stderr:
+                output_msg = f"{status_part} [stderr: {stderr}]"
            else:
-                output_msg = output
-            
+                output_msg = status_part
+
            return status_code, output_msg, perfdata
-            
-        except subprocess.TimeoutExpired:
-            self.logger.error(f"Command timed out: {command}")
-            return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
-        
+
        except Exception as e:
            self.logger.error(f"Error executing command: {e}")
            return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
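The rewritten `_execute_nagios_plugin` swaps the blocking `subprocess.run` for `asyncio.create_subprocess_shell` with `asyncio.wait_for` supplying the timeout. A stripped-down sketch of that pattern, assuming a hypothetical helper `run_with_timeout` that reports a timeout as a `None` exit code rather than a Nagios status:

```python
import asyncio
from typing import Optional, Tuple


async def run_with_timeout(command: str, timeout: float) -> Tuple[Optional[int], str, str]:
    """Run a shell command; return (exit code or None on timeout, stdout, stderr)."""
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.communicate()  # reap the child so it does not linger as a zombie
        return None, "", ""
    return (
        proc.returncode,
        out.decode(errors="replace").strip(),
        err.decode(errors="replace").strip(),
    )
```

One caveat worth keeping in mind: `proc.kill()` signals the shell process only. If the shell spawned a grandchild that is still holding the pipes open, the second `communicate()` may wait until that grandchild exits.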
@@ -52,12 +52,17 @@ def decode_value(val: str) -> Any:
        except Exception:
            return val[1:]  # Return as string without @
    
-    # Try numeric evaluation (original behavior)
+    # Try numeric conversion (avoid eval to prevent SyntaxWarnings on version strings)
    if val[0].isdigit() or (val[0] == '-' and len(val) > 1 and val[1].isdigit()):
        try:
-            return eval(val)
-        except Exception:
-            return val
+            return int(val)
+        except ValueError:
+            pass
+        try:
+            return float(val)
+        except ValueError:
+            pass
+        return val
    
    return val
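The replacement for `eval` tries `int` first, then `float`, and otherwise returns the original string. A compact sketch of that fallback chain, using a hypothetical function name `parse_number`:

```python
from typing import Union


def parse_number(val: str) -> Union[int, float, str]:
    """Convert val to int, then float; return it unchanged if neither applies."""
    for convert in (int, float):
        try:
            return convert(val)
        except ValueError:
            continue
    return val
```

The result is not identical to `eval` for every input. For example, `int("010")` yields `10`, whereas `eval("010")` raises a `SyntaxError` in Python 3 and so previously fell back to the string.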

@@ -385,13 +385,20 @@ _DRIVERS = {


 def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
-    """Send *notif* to a single named channel, honouring min_level."""
-    min_level = channel_cfg.get("min_level", "WARNING").upper()
-    if _level_value(notif.level) < _level_value(min_level):
-        logger.debug(
-            "channel '%s': skipping level %s (min_level=%s)", channel_name, notif.level, min_level
-        )
-        return True  # not an error — filtered intentionally
+    """Send *notif* to a single named channel, honouring min_level.
+
+    RECOVER always bypasses min_level — a recovery is always relevant if the
+    channel was configured for any alerting (handles the restart-then-recover case
+    where _alerted_channels is empty and we fall through to the normal loop).
+    """
+    level = notif.level.upper()
+    if level != "RECOVER":
+        min_level = channel_cfg.get("min_level", "WARNING").upper()
+        if _level_value(level) < _level_value(min_level):
+            logger.debug(
+                "channel '%s': skipping level %s (min_level=%s)", channel_name, level, min_level
+            )
+            return True  # not an error — filtered intentionally

    ch_type = channel_cfg.get("type", "")
    driver = _DRIVERS.get(ch_type)
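The filter above lets `RECOVER` through regardless of `min_level` and applies the threshold to everything else. A minimal sketch of that rule, assuming a made-up `LEVEL_RANK` table; the real `_level_value` mapping is not shown in this diff:

```python
# Hypothetical ranking for illustration only; hbd's _level_value may differ.
LEVEL_RANK = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "CRITICAL": 3}


def should_send(level: str, min_level: str = "WARNING") -> bool:
    """Return True if a notification at `level` passes the channel's filter."""
    level = level.upper()
    if level == "RECOVER":
        return True  # recoveries always bypass min_level
    return LEVEL_RANK.get(level, 0) >= LEVEL_RANK.get(min_level.upper(), 0)
```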
@@ -3,20 +3,13 @@
  {% include 'head.html' %}

  <style>
-    body {
-      margin: 20px;
-      background: #f5f5f5;
-    }

    .container {
      max-width: 1400px;
      margin: 0 auto;
    }

-    h1 {
-      color: #333;
-      margin-bottom: 10px;
-    }
+    h1 { color: #333; margin-bottom: 10px; font-size: 1.5em; }

    .subtitle {
      color: #666;
@@ -41,7 +34,7 @@
      border-left: 4px solid #ddd;
    }

-    .summary-card.critical { border-left-color: #f44336; }
+    .summary-card.critical { border-left-color: #ea1e0f; }
    .summary-card.warning  { border-left-color: #ff9800; }
    .summary-card.ok       { border-left-color: #4caf50; }

@@ -51,7 +44,7 @@
      line-height: 1;
    }

-    .summary-number.critical { color: #f44336; }
+    .summary-number.critical { color: #ea1e0f; }
    .summary-number.warning  { color: #ff9800; }
    .summary-number.ok       { color: #4caf50; }

@@ -116,7 +109,7 @@
    }
    
    .alert-item.acknowledged {
-      opacity: 0.6;
+      opacity: 0.8;
      background: #f0f0f0;
    }

@@ -6,10 +6,25 @@
    <title>{{ title }}</title>
    {% if extra_scripts %}<script src="{{ extra_scripts }}"></script>{% endif %}
    <style>
+      /* ── Reset / shared baseline ── */
+      *, *::before, *::after { box-sizing: border-box; }
+      html {
+        font-family: 'Segoe UI', system-ui, -apple-system, sans-serif;
+        font-size: 14px;
+      }
+      body {
+        margin: 0;
+        padding: 10px;
+        background: #f5f5f5;
+      }
+      h1 { font-size: 1.5em; color: #333; margin: 0 0 5px; }
+      h2 { font-size: 1.1em; color: #333; margin: 0 0 8px; }
+      p  { margin: 0; }
+
      /* Navigation bar — shared across all pages */
      .nav {
        background: #fff;
-        padding: 10px 15px;
+        padding: 6px 12px;
        margin-bottom: 10px;
        box-shadow: 0 2px 4px rgba(0,0,0,.1);
        border-radius: 4px;
@@ -42,6 +57,17 @@
        transition: background 0.15s;
      }
      .nav-user:hover { background: #f0f4ff; text-decoration: none; }
+      .nav-username {
+        max-width: 0;
+        overflow: hidden;
+        white-space: nowrap;
+        opacity: 0;
+        transition: max-width 0.2s ease, opacity 0.2s ease;
+      }
+      .nav-user:hover .nav-username {
+        max-width: 160px;
+        opacity: 1;
+      }
      .nav-avatar {
        width: 28px; height: 28px;
        border-radius: 50%;
@@ -94,6 +120,158 @@
        .nav-links.nav-open { display: flex; }
        .nav-links a { margin-right: 0; padding: 6px 0; font-size: 1em; }
      }
+
+      /* Swiss railway clock — nav */
+      .nav-clock {
+        flex-shrink: 0;
+        line-height: 0;
+        margin-left: auto;
+        padding: 4px 4px 4px 0;
+        cursor: pointer;
+      }
+      #swiss-clock { display: block; }
+
+      /* Swiss railway clock — full-page overlay */
+      #clock-overlay {
+        display: none;
+        position: fixed;
+        inset: 0;
+        z-index: 9999;
+        background: #1a1a1a;
+        align-items: center;
+        justify-content: center;
+        cursor: pointer;
+      }
+      #clock-overlay.visible { display: flex; }
+      #swiss-clock-overlay { display: block; }
    </style>
-    <script src="static/sorttable.js"></script> 
+    <script>
+    /* ── Swiss Federal Railway (SBB) clock ── */
+
+    /* Draw one frame of the clock onto any canvas element. */
+    function drawSwissClock(canvas) {
+      var SIZE = canvas.width;
+      var R = SIZE / 2;
+      var ctx = canvas.getContext('2d');
+      var now = new Date();
+      var h  = now.getHours() % 12;
+      var m  = now.getMinutes();
+      var s  = now.getSeconds();
+      var ms = now.getMilliseconds();
+
+      /* Seconds hand idles ~1.5 s at 12 before advancing (SBB behaviour) */
+      var sFrac = s + ms / 1000;
+      var sAngle = sFrac >= 58.5 ? 0 : (sFrac / 58.5) * Math.PI * 2;
+
+      ctx.clearRect(0, 0, SIZE, SIZE);
+
+      /* face */
+      ctx.beginPath();
+      ctx.arc(R, R, R - 1, 0, Math.PI * 2);
+      ctx.fillStyle = '#fff';
+      ctx.fill();
+      ctx.strokeStyle = '#333';
+      ctx.lineWidth = SIZE * 0.018;
+      ctx.stroke();
+
+      /* tick marks */
+      for (var i = 0; i < 60; i++) {
+        var a = (i / 60) * Math.PI * 2 - Math.PI / 2;
+        var isHour = (i % 5 === 0);
+        ctx.beginPath();
+        ctx.moveTo(R + Math.cos(a) * (isHour ? R * 0.72 : R * 0.88),
+                   R + Math.sin(a) * (isHour ? R * 0.72 : R * 0.88));
+        ctx.lineTo(R + Math.cos(a) * R * 0.94,
+                   R + Math.sin(a) * R * 0.94);
+        ctx.strokeStyle = '#222';
+        ctx.lineWidth = isHour ? SIZE * 0.027 : SIZE * 0.011;
+        ctx.lineCap = 'butt';
+        ctx.stroke();
+      }
+
+      /* hands */
+      function hand(angle, tip, tail, width, color) {
+        ctx.save();
+        ctx.translate(R, R);
+        ctx.rotate(angle);
+        ctx.beginPath();
+        ctx.moveTo(tail, 0);
+        ctx.lineTo(tip,  0);
+        ctx.strokeStyle = color;
+        ctx.lineWidth = width;
+        ctx.lineCap = 'square';
+        ctx.stroke();
+        ctx.restore();
+      }
+
+      hand((m + s / 60) / 60 * Math.PI * 2 - Math.PI / 2,
+           R * 0.88, -R * 0.12, SIZE * 0.027, '#222');           /* minute */
+      hand((h + m / 60) / 12 * Math.PI * 2 - Math.PI / 2,
+           R * 0.58, -R * 0.12, SIZE * 0.039, '#222');           /* hour   */
+      hand(sAngle - Math.PI / 2, R * 0.78, -R * 0.22,
+           SIZE * 0.013, '#e00');                                 /* second tail+tip */
+
+      /* round dot at tip of second hand */
+      var dotR = SIZE * 0.028;
+      ctx.save();
+      ctx.translate(R, R);
+      ctx.rotate(sAngle - Math.PI / 2);
+      ctx.beginPath();
+      ctx.arc(R * 0.78, 0, dotR, 0, Math.PI * 2);
+      ctx.fillStyle = '#e00';
+      ctx.fill();
+      ctx.restore();
+
+      /* centre cap */
+      ctx.beginPath();
+      ctx.arc(R, R, R * 0.04, 0, Math.PI * 2);
+      ctx.fillStyle = '#222';
+      ctx.fill();
+    }
+
+    /* Resize the overlay canvas to fit the viewport, keeping it square. */
+    function resizeOverlayClock() {
+      var oc = document.getElementById('swiss-clock-overlay');
+      if (!oc) return;
+      var size = Math.min(window.innerWidth, window.innerHeight) * 0.88;
+      size = Math.floor(size);
+      oc.width  = size;
+      oc.height = size;
+    }
+
+    /* Main tick — redraws both nav clock and (if visible) overlay clock. */
+    function clockTick() {
+      var nav = document.getElementById('swiss-clock');
+      if (nav) drawSwissClock(nav);
+      var overlay = document.getElementById('clock-overlay');
+      if (overlay && overlay.classList.contains('visible')) {
+        var oc = document.getElementById('swiss-clock-overlay');
+        if (oc) drawSwissClock(oc);
+      }
+      var delay = 100 - (Date.now() % 100);
+      setTimeout(clockTick, delay);
+    }
+
+    document.addEventListener('DOMContentLoaded', function() {
+      /* Start the shared tick loop */
+      clockTick();
+
+      /* Overlay toggle — clicking the nav clock opens it */
+      var navClock = document.querySelector('.nav-clock');
+      var overlay  = document.getElementById('clock-overlay');
+      if (navClock && overlay) {
+        navClock.addEventListener('click', function() {
+          resizeOverlayClock();
+          overlay.classList.add('visible');
+        });
+        overlay.addEventListener('click', function() {
+          overlay.classList.remove('visible');
+        });
+        window.addEventListener('resize', function() {
+          if (overlay.classList.contains('visible')) resizeOverlayClock();
+        });
+      }
+    });
+    </script>
+    <script src="static/sorttable.js"></script>
 </head>
@@ -7,10 +7,6 @@
      display: flex;
      flex-direction: column;
      height: 100vh;
-      box-sizing: border-box;
-      padding: 10px;
-      margin: 0;
-      background: #f5f5f5;
      overflow: hidden;
    }

@@ -489,8 +485,10 @@
    {% include 'menu.html' %}

    <div class="container">
-      <h1>{{ header }}</h1>
-      <p class="subtitle">Real-time host monitoring and event log</p>
+      <div>
+        <h1>{{ header }}</h1>
+        <p class="subtitle">Real-time host monitoring and event log</p>
+      </div>
      
      <div class="table-section">
        <table id="ntable" class="sortable">
@@ -10,6 +10,9 @@
    <a href="/settings"{% if active_page == "settings" %} class="active"{% endif %}>Settings</a>
    {% endif %}
  </div>
+  <div class="nav-clock" title="Click for full-screen clock">
+    <canvas id="swiss-clock" width="44" height="44"></canvas>
+  </div>
  {% if current_user %}
  <a href="/profile" class="nav-user{% if active_page == 'profile' %} active{% endif %}" title="{{ current_user.full_name or current_user.username }}">
    {% if current_user.avatar %}
@@ -21,6 +24,12 @@
  </a>
  {% endif %}
 </div>
+
+<!-- Full-page clock overlay (click anywhere to dismiss) -->
+<div id="clock-overlay">
+  <canvas id="swiss-clock-overlay" width="400" height="400"></canvas>
+</div>
+
 <script>
  (function() {
    var btn = document.getElementById('nav-hamburger-btn');
@@ -3,11 +3,7 @@
  {% include 'head.html' %}

  <style>
-    body {
-      margin: 10px;
-      background: #f5f5f5;
-      overflow: hidden;
-    }
+    body { overflow: hidden; }

    .container {
      max-width: 1400px;
@@ -3,15 +3,7 @@
  {% include 'head.html' %}

  <style>
-    html, body {
-      overflow: visible;
-    }
-
-    body {
-      margin: 20px;
-      background: #f5f5f5;
-      font-family: 'Segoe UI', system-ui, sans-serif;
-    }
+    html, body { overflow: visible; }

    .container {
      max-width: 900px;
@@ -3,19 +3,10 @@
  {% include 'head.html' %}

  <style>
-    html, body {
-      overflow: visible;
-    }
-
-    body {
-      margin: 20px;
-      background: #f5f5f5;
-      font-family: 'Segoe UI', system-ui, sans-serif;
-    }
+    html, body { overflow: visible; }

    .container {
      max-width: 960px;
-      margin: 0 auto;
    }

    h1 { color: #333; margin-bottom: 4px; font-size: 1.5em; }
@@ -60,6 +60,7 @@ class AlertState:
        self.acknowledged = False  # Whether alert has been acknowledged
        self.acknowledged_at = None  # Timestamp when acknowledged
        self.consecutive_count = 0  # Consecutive exceedances while still OK (for count gating)
+        self.pending_since: Optional[float] = None  # non-None while waiting out grace period before notifying
    
    def update(
        self, 
@@ -105,6 +106,7 @@ class AlertState:
            self.level = level
            self.since = now
            self.notification_count = 0
+            self.last_notification = None  # restart reminder interval on level change
            # Reset acknowledgment on state change
            if level != AlertLevel.OK:
                # Only reset if changing to a different alert level
@@ -339,8 +341,9 @@ class ThresholdChecker:
        self.default_config = "default"
        
        self.renotify_interval = renotify_interval
+        self.grace_seconds: float = float(config.get("grace", 2))
        self.journal = journal
-        
+
        # Parse configuration
        self._parse_config(config)
        
@@ -371,7 +374,8 @@ class ThresholdChecker:
        self.threshold_configs.clear()
        self.thresholds.clear()
        self.host_config_mapping.clear()
-        
+        self.grace_seconds = float(config.get("grace", 2))
+
        # Parse new configuration
        self._parse_config(config)
        
@@ -759,15 +763,10 @@ class ThresholdChecker:
        # Update state and check for changes
        old_level = alert_state.level
        if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
-            # For check_value, we don't have full plugin data, pass None
-            lvl, message, formatted_msg = self._trigger_notification(host_name, metric_path, old_level, new_level, value, threshold, None)
-            # Update alert state with formatted message
-            alert_state.formatted_message = formatted_msg
-            self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
+            self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, None)
            return (old_level, new_level)
        elif new_level != AlertLevel.OK:
-            # Check if we should re-notify
-            self._check_renotify(host_name, alert_state, metric_path, value, threshold, None)
+            self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)

        return None
    def check_plugin_data(
@@ -826,14 +825,10 @@ class ThresholdChecker:
            old_level = alert_state.level
            if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                state_changes.append((metric_path, old_level, new_level, value))
-                lvl, message, formatted_msg = self._trigger_notification(host_name, metric_path, old_level, new_level, value, threshold, data)
-                # Update alert state with formatted message
-                alert_state.formatted_message = formatted_msg
-                self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
+                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
            elif new_level != AlertLevel.OK:
-                # Check if we should re-notify
-                self._check_renotify(host_name, alert_state, metric_path, value, threshold, data)
-        
+                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
+
        # Check nested metrics (e.g., partition data in disk_monitor)
        self._check_nested_metrics(
            host_name,
@@ -895,20 +890,9 @@ class ThresholdChecker:
                    old_level = alert_state.level
                    if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                        state_changes.append((metric_path, old_level, new_level, value))
-                        lvl, message, formatted_msg = self._trigger_notification(
-                            host_name,
-                            metric_path,
-                            old_level,
-                            new_level,
-                            value,
-                            threshold,
-                            data  # Pass full plugin data for format string
-                        )
-                        # Update alert state with formatted message
-                        alert_state.formatted_message = formatted_msg
-                        self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
+                        self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
                    elif new_level != AlertLevel.OK:
-                        self._check_renotify(host_name, alert_state, metric_path, value, threshold, data)
+                        self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
    
    def _trigger_notification(
        self,
@@ -947,7 +931,7 @@ class ThresholdChecker:

        # Format message
        if new_level == AlertLevel.OK:
-            lvl = "RECOVERED"
+            lvl = "RECOVER"
            message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
        elif new_level == AlertLevel.WARNING:
            lvl = "WARNING"
@@ -1083,6 +1067,74 @@ class ThresholdChecker:
            )
            return f"(threshold: {op_symbol} {threshold_value})"
    
+    def _apply_grace(
+        self,
+        host_name: str,
+        alert_state: AlertState,
+        metric_path: str,
+        old_level: AlertLevel,
+        new_level: AlertLevel,
+        value: Any,
+        threshold: ThresholdConfig,
+        plugin_data: Optional[Dict[str, Any]],
+    ) -> None:
+        """Handle a state-change transition with grace-period logic.
+
+        Transitioning INTO alert: defers the notification for grace_seconds.
+        Transitioning TO OK:
+          - Still in grace window (pending_since set): suppresses both the alert
+            and the recovery — the spike never warranted a page.
+          - Past grace: fires the RECOVER notification normally.
+        """
+        lvl, message, formatted_msg = self._trigger_notification(
+            host_name, metric_path, old_level, new_level, value, threshold, plugin_data
+        )
+        alert_state.formatted_message = formatted_msg
+
+        if new_level == AlertLevel.OK:
+            if alert_state.pending_since is not None:
+                logger.info(
+                    "Alert suppressed (recovered within %.0fs grace): %s on %s",
+                    self.grace_seconds, metric_path, host_name,
+                )
+                alert_state.pending_since = None
+            else:
+                self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
+        else:
+            alert_state.pending_since = time.time()
+            logger.debug(
+                "Alert deferred (%.0fs grace): %s on %s = %s",
+                self.grace_seconds, metric_path, host_name, value,
+            )
+
+    def _check_pending_or_renotify(
+        self,
+        host_name: str,
+        alert_state: AlertState,
+        metric_path: str,
+        value: Any,
+        threshold: ThresholdConfig,
+        plugin_data: Optional[Dict[str, Any]],
+    ) -> None:
+        """Called when alert level is unchanged and non-OK.
+
+        If a deferred notification is pending and grace_seconds have elapsed,
+        fires it now. Otherwise falls through to normal reminder logic.
+        """
+        if alert_state.pending_since is not None:
+            if time.time() - alert_state.pending_since >= self.grace_seconds:
+                lvl, message, formatted_msg = self._trigger_notification(
+                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data
+                )
+                alert_state.formatted_message = formatted_msg
+                self._send_notification(
+                    host_name, lvl, message, metric_path, AlertLevel.OK, alert_state.level, value
+                )
+                alert_state.pending_since = None
+            # else: still within grace window, do nothing
+        else:
+            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data)
+
    def _check_renotify(
        self,
        host_name: str,
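Taken together, `_apply_grace` and `_check_pending_or_renotify` form a small state machine: entering an alert starts a timer, a later sample past the grace window sends the deferred notification, and a recovery inside the window suppresses both messages. A toy model of that flow for a single metric, assuming a hypothetical `GraceGate` class with an injected clock and no distinction between WARNING and CRITICAL:

```python
from typing import List, Optional


class GraceGate:
    """Toy model of the deferral logic: one metric, caller supplies the time."""

    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds
        self.pending_since: Optional[float] = None
        self.alerting = False
        self.sent: List[str] = []

    def observe(self, in_alert: bool, now: float) -> None:
        if in_alert and not self.alerting:
            # Transition into alert: start the grace timer instead of notifying.
            self.alerting = True
            self.pending_since = now
        elif in_alert and self.pending_since is not None:
            # Still alerting: send the deferred notification once grace has elapsed.
            if now - self.pending_since >= self.grace_seconds:
                self.sent.append("ALERT")
                self.pending_since = None
        elif not in_alert and self.alerting:
            # Recovery: only announce it if the alert itself was sent.
            if self.pending_since is None:
                self.sent.append("RECOVER")
            self.pending_since = None
            self.alerting = False
```

In this model, as in the diff, the deferred notification is only sent when another sample arrives, so the actual delay is the grace period rounded up to the next report.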
@@ -171,6 +171,24 @@ def dicttos(ID, d):
 DROPOVERDUE = 7 * 24 * 3600  # seconds before an overdue host becomes UNKNOWN


+def _set_connectivity_alert(host, afam, level_name):
+    """Update (or clear) a connectivity alert_state entry for a host/address-family.
+
+    level_name is "CRITICAL", "WARNING", or "OK".  "OK" removes the entry so
+    that recovered hosts don't clutter the Alerts Dashboard.
+    """
+    from .threshold import AlertState, AlertLevel
+    metric_path = f"connectivity.{afam}"
+    level = getattr(AlertLevel, level_name, AlertLevel.OK)
+    if level == AlertLevel.OK:
+        host.alert_states.pop(metric_path, None)
+        return
+    if metric_path not in host.alert_states:
+        host.alert_states[metric_path] = AlertState(metric_path)
+    state = host.alert_states[metric_path]
+    state.update(level, level_name)
+
+
 def _make_timer_callbacks(uname, host, ctx):
    """Return (on_overdue, on_unknown) async callbacks for connection timer logic.

@@ -182,6 +200,7 @@ def _make_timer_callbacks(uname, host, ctx):

    async def on_unknown(connection):
        connection.newstate(connection.__class__.UNKNOWN, connection.lastbeat)
+        # Keep connectivity alert active when host transitions to unknown
        if msg_to_websockets:
            msg_to_websockets("host", host.stateinfo())

@@ -196,6 +215,8 @@ def _make_timer_callbacks(uname, host, ctx):
            uname,
            notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
        )
+        # Track in alert_states so the Alerts Dashboard shows this
+        _set_connectivity_alert(host, connection.afam, "CRITICAL")
        if threshold_checker:
            threshold_checker.check_value(
                host_name=uname,
@@ -410,6 +431,8 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
    if conn.getstate() != hbdcls.Connection.UP:
        lasts = conn.state
        d = conn.newstate(hbdcls.Connection.UP, now)
+        # Clear connectivity alert now that the host is back up
+        _set_connectivity_alert(host, conn.afam, "OK")
        # Don't log/notify RECOVER for a brand-new host seen for the first time —
        # it was never down, it just hasn't been seen before.
        if not newh:
@@ -436,6 +459,7 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
            notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
        )
        conn.newstate(hbdcls.Connection.DOWN, now)
+        _set_connectivity_alert(host, conn.afam, "CRITICAL")

    if interval > 0:
        host.interval = interval
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "hbd"
-version = "5.1.1"
+version = "5.1.3"
 description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
 readme = "README.md"
 requires-python = ">=3.11"
@@ -14,6 +14,8 @@

 set -e
 what=$1
+on_ha=0
+[ -z "$what" ] && what="client"

 if [ -d /homeassistant ]; then
    echo "cannot install in HA, run \"docker exec -it homeassistant $0 $@\""
@@ -23,36 +25,64 @@ if [ -d /config ]; then
    echo "Installing on HA"
    where="/config/bin"
    venv="/config/venvs"
+    on_ha=1
 else
-    if [ ! -d ~/.local/bin ] && [ ! -d ~/bin ]; then
-        echo "No suitable bin directory found in PATH, please add either ~/.local/bin or ~/bin to your PATH"
+    if [ ! -d $HOME/.local/bin ] && [ ! -d $HOME/bin ]; then
+        echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
        exit 1
    fi
-    for where in ~/bin ~/.local/bin; do
+    for where in $HOME/bin $HOME/.local/bin notset ; do
        if echo ":$PATH:" | grep -q ":$where:" ; then
            break
        fi
    done
-    venv="~/venvs"
+    if [ "$where" = "notset" ]; then
+        echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
+        exit 1
+    fi
+    venv="$HOME/venvs"
 fi
-python3 -m pip --version > /dev/null 2>&1 || { echo "pip is not installed, please install pip for python3"; exit 1; }
+
+echo "Installing heartbeat $what"
+
+if [ ! -d  $venv/hbd ]; then
+    # Test pip inside the 'if' so that 'set -e' does not abort the script when pip is missing
+    if ! python3 -m pip --version > /dev/null 2>&1; then
+        # truenas does not have pip installed by default, so we need to fetch get-pip.py and install pip
+        echo "pip is not installed, fetching get-pip.py and installing pip"
+        arg="--without-pip"
+    fi
+    mkdir -p $venv
+    have_venv=$(python3 -c "import venv" > /dev/null 2>&1 && echo "Installed" || echo "Not Installed")
+    if [ "$have_venv" = "Not Installed" ]; then
+        echo "python venv module not found, installing virtualenv"
+        python3 -m pip install --user virtualenv
+        python3 -m virtualenv $venv/hbd --system-site-packages $arg
+    else
+        python3 -m venv $venv/hbd --system-site-packages $arg
+    fi
+    . $venv/hbd/bin/activate
+    if [ -n "$arg" ]; then  
+        curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py
+    fi
+    deactivate
+fi
+
+. $venv/hbd/bin/activate
+python3 -mpip install --upgrade --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]

 if [ "$what" = "server" ]; then
-    echo "Installing heartbeat server (hbd)"
-else
-    what="client"
-    echo "Installing heartbeat client (hbc)"
-fi
-if [ ! -d  $venv/hbd ]; then
-    mkdir -p $venv
-    python3 -m venv $venv/hbd --system-site-packages
-fi
-. $venv/hbd/bin/activate
-pip install --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]
-if [ "$what" = "server" ]; then
-    rm -f ~$where/hbd
+    rm -f $where/hbd
    ln -sf $(which hbd) $where/hbd
+    echo "hbd installed, you can run it with \"$where/hbd\" or \"hbd\" if $where is in your PATH"
 else
    rm -f $where/hbc
    ln -sf $(which hbc) $where/hbc
+    if [ $on_ha -eq 1 ]; then
+        echo "restarting hbc "
+        job=$(grep run_hbc configuration.yaml | sed 's/run_hbc://')
+        $job
+    else
+        echo "hbc installed, you can run it with \"$where/hbc\" or \"hbc\" if $where is in your PATH"
+    fi  
 fi
@@ -0,0 +1,99 @@
+import asyncio
+import logging
+import os
+import stat
+
+from hbd.client.plugins.nagios_runner import (
+    NagiosRunnerPlugin,
+    NAGIOS_OK,
+    NAGIOS_WARNING,
+    NAGIOS_CRITICAL,
+    NAGIOS_UNKNOWN,
+)
+
+
+def test_no_commands_sets_skip_reason():
+    plugin = NagiosRunnerPlugin(config={"commands": []})
+    result = asyncio.run(plugin.initialize())
+    assert result is False
+    assert plugin.skip_reason is not None
+    assert "nagios_runner.commands" in plugin.skip_reason
+
+
+def test_stderr_used_when_stdout_empty(tmp_path):
+    script = tmp_path / "check_err.sh"
+    script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "error from stderr" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_CRITICAL
+
+
+def test_stderr_appended_when_both_present(tmp_path):
+    script = tmp_path / "check_both.sh"
+    script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "OK - all good" in data["t_output"]
+    assert "extra detail" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_OK
+
+
+def test_negative_returncode_maps_to_unknown():
+    # kill -9 $$ kills the shell itself; asyncio sees returncode -9
+    config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert data["t_status_code"] == NAGIOS_UNKNOWN
+    assert "signal" in data["t_output"].lower()
+
+
+def test_absolute_path_not_found_warns(caplog):
+    fake_cmd = "/nonexistent_hbc_test_path/check_something"
+    config = {"commands": [{"name": "t", "command": fake_cmd}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not found" in r.message for r in caplog.records)
+
+
+def test_absolute_path_not_executable_warns(caplog, tmp_path):
+    non_exec = tmp_path / "check_test"
+    non_exec.write_text("#!/bin/sh\necho OK\n")
+    non_exec.chmod(0o644)  # readable but not executable
+
+    config = {"commands": [{"name": "t", "command": str(non_exec)}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not executable" in r.message for r in caplog.records)
+
+
+def test_relative_path_not_checked(caplog):
+    # Bare command names (resolved via PATH) must not generate warnings
+    config = {"commands": [{"name": "t", "command": "echo OK"}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert not any(
+        "not found" in r.message or "not executable" in r.message
+        for r in caplog.records
+    )
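The tests above pin down three behaviours: stderr stands in for the output when stdout is empty, stderr is appended when both are present, and a negative return code (process killed by a signal) maps to UNKNOWN. The plugin's own implementation is not part of this excerpt, so the following is only a minimal sketch of how such a runner could look. `run_check` and its message wording are hypothetical; the exit codes are the standard Nagios values.

```python
import asyncio

# Standard Nagios plugin exit codes.
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3


async def run_check(command: str, timeout: float = 5.0) -> tuple[int, str]:
    """Hypothetical helper: run a shell command and return (status, output)."""
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Do not leave the check running after the deadline.
        proc.kill()
        await proc.wait()
        return NAGIOS_UNKNOWN, f"timed out after {timeout}s"

    stdout = out.decode(errors="replace").strip()
    stderr = err.decode(errors="replace").strip()
    # stderr alone when stdout is empty, otherwise appended after stdout.
    output = "\n".join(part for part in (stdout, stderr) if part)

    rc = proc.returncode
    if rc < 0:
        # asyncio reports death by signal N as returncode -N.
        return NAGIOS_UNKNOWN, f"killed by signal {-rc}"
    if rc > NAGIOS_UNKNOWN:
        # Exit codes outside 0..3 have no Nagios meaning.
        return NAGIOS_UNKNOWN, output
    return rc, output
```

Calling `asyncio.run(run_check("kill -9 $$"))` exercises the signal path in the same way as `test_negative_returncode_maps_to_unknown`.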
@@ -0,0 +1,83 @@
+import asyncio
+import logging
+import textwrap
+
+from hbd.client.plugin import PluginLoader, PluginRegistry
+
+
+def test_plugin_skip_reason_defaults_none(tmp_path):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class MinimalPlugin(MonitorPlugin):
+            name = "minimal"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return True
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "minimal.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+    asyncio.run(loader.load_from_directory(tmp_path))
+    plugin = registry.get("minimal")
+    assert plugin is not None
+    assert plugin.skip_reason is None
+
+
+def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class SkippablePlugin(MonitorPlugin):
+            name = "skippable"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                self.skip_reason = "not configured in yaml"
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "skippable.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.INFO, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
+    assert not any("failed initialization" in r.message for r in caplog.records)
+
+
+def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class FailPlugin(MonitorPlugin):
+            name = "fail"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "fail_plugin.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("failed initialization" in r.message for r in caplog.records)
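The loader tests above distinguish a deliberate skip (INFO, "skipped: <reason>") from an unexplained failure (WARNING, "failed initialization"). The `PluginLoader` code itself is not shown in this excerpt, so the sketch below only illustrates that decision. `initialize_plugin` and `UnconfiguredPlugin` are hypothetical names, and the message formats are assumed from the test assertions.

```python
import asyncio
import logging

logger = logging.getLogger("plugin.loader")


async def initialize_plugin(plugin) -> bool:
    """Hypothetical helper: initialize one plugin and log why it was not loaded."""
    if await plugin.initialize():
        return True
    reason = getattr(plugin, "skip_reason", None)
    if reason is not None:
        # Deliberate skip (for example, nothing configured): informational only.
        logger.info("Plugin %s skipped: %s", plugin.name, reason)
    else:
        # No reason given: treat it as a genuine failure.
        logger.warning("Plugin %s failed initialization", plugin.name)
    return False


class UnconfiguredPlugin:
    """Stand-in plugin that opts out and says why."""

    name = "unconfigured"
    skip_reason = None

    async def initialize(self) -> bool:
        self.skip_reason = "not configured in yaml"
        return False
```

Keeping the reason on the plugin instance means `initialize()` can keep its boolean return value, which matches how the test plugins above are written.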
Author	SHA1	Message	Date
andreas	7d8ca5d8db	version 5.1.3	2026-04-25 16:52:56 +02:00
andreas	56037a036d	fix: remove unused pytest import in test_nagios_runner	2026-04-25 16:39:56 +02:00
andreas	65ceb31d8d	fix: use os.path.exists check for /dev/log instead of dead-code OSError catch	2026-04-25 16:36:00 +02:00
andreas	1c9b6c1ca9	fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig After daemonize() redirects stderr to /dev/null, the existing StreamHandler writes to /dev/null. logging.basicConfig() is a no-op when handlers are already configured, so log messages are silently lost. Replace the daemon block to: 1. Call daemonize() first 2. Explicitly remove existing handlers (pointing to /dev/null) 3. Add SysLogHandler pointing to /dev/log with fallback to UDP localhost:514 4. Log startup message to the new syslog handler Removes redundant syslog.openlog() call which is no longer needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:29:54 +02:00
andreas	d7e6b478e1	fix: use shlex.split() in nagios_runner path validation to handle quoted paths	2026-04-25 16:28:32 +02:00
andreas	535dbda47d	feat: validate absolute command paths at nagios_runner init	2026-04-25 16:24:33 +02:00
andreas	c9567dddae	fix: remove stale shell config key from NagiosRunnerPlugin docstring	2026-04-25 16:23:03 +02:00
andreas	b5963badd6	feat: async subprocess in nagios_runner with stderr capture and signal handling Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:18:09 +02:00
andreas	a76a39b4a0	fix: remove redundant no-commands log lines; fix skip_reason docstring style Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:15:44 +02:00
andreas	94e1597978	feat: set skip_reason on nagios_runner when no commands configured When NagiosRunnerPlugin has no commands configured, set skip_reason before returning False from initialize(). This allows PluginLoader to log INFO (not WARNING) when the plugin is skipped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:13:03 +02:00
andreas	c9c2ed772f	fix: document skip_reason in Plugin docstring; remove unused import in test	2026-04-25 16:10:35 +02:00
andreas	aeb78dcb8e	feat: add skip_reason to Plugin; improve PluginLoader init messaging	2026-04-25 16:08:07 +02:00
andreas	77b337e4dd	Add implementation plan for plugin error checking and daemon logging fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:04:13 +02:00
andreas	293461f3f6	Add design spec for plugin error checking and daemon logging fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 15:49:09 +02:00
andreas	c70a4807dc	version 5.1.2	2026-04-25 07:25:06 +02:00
andreas	1a470e7cfa	Fix plugin config lookup shadowed by CLIENT_DEFAULTS plugins key CLIENT_DEFAULTS seeds "plugins": {} so raw_config.get("plugins", raw_config) always returned the empty subdict instead of falling back to the full config. Plugins configured at top-level (e.g. nagios_runner: ...) were therefore never found, resulting in "No Nagios commands configured". Now checks the plugins subdict first, then top-level keys, so both config layouts work correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 12:58:42 +02:00
andreas	990c658e65	Apply grace period to all threshold alerts before logging/notifying Threshold alerts (plugin metrics, RTT) were firing immediately on the first breach. Now every state transition to WARNING/CRITICAL starts a grace-period timer (grace_seconds from the 'grace' config key). The notification is deferred until the next heartbeat after grace_seconds have elapsed. If the metric recovers within the grace window, both the alert and the recovery are suppressed — no spurious pages for transient spikes. Two helper methods added to ThresholdChecker: - _apply_grace: handles the state-change path (defer or suppress) - _check_pending_or_renotify: handles the stable-alert path (fire deferred notification once grace expires, or fall through to reminders) The overdue case is unchanged — on_overdue already fires only after interval+grace seconds of silence, which is equivalent behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 12:00:40 +02:00
andreas	b78d6ac0fe	Fix RECOVER routing: use consistent level name and route via alerted channel threshold.py was emitting level="RECOVERED" for metric recoveries, which failed the is_recover check in send_notification (which only matched "RECOVER"), bypassing _alerted_channels routing and the min_level bypass added in the previous commit. Changed to "RECOVER" so all recovery paths are consistent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 11:29:04 +02:00
andreas	afd5060f59	Fix early reminder notifications and lost recovery notifications - AlertState.update() now resets last_notification when the alert level changes, so a WARNING→CRITICAL escalation restarts the reminder interval rather than inheriting a nearly-expired timer. - _dispatch_to_channel() bypasses min_level for RECOVER, so recovery notifications are delivered even after a server restart when _alerted_channels is empty and the fallback dispatch path is used. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 18:11:22 +02:00
andreas	f61f7aebc2	Use python3 consistently	2026-04-19 09:49:30 +02:00
Andreas Wrede	5c382d2b8d	One more nit	2026-04-13 09:31:35 -04:00
Andreas Wrede	35bba451f5	Various formating nits	2026-04-13 09:27:51 -04:00
Andreas Wrede	80edfba0c0	fix inconsistencies in page layout, add swiss clock	2026-04-13 08:45:50 -04:00
Andreas Wrede	6bc8de192e	fix non-alerting of overdue hosts	2026-04-12 18:44:36 -04:00
Andreas Wrede	2d8166d04a	unse python3 -mpip instead of plain pip	2026-04-12 18:44:11 -04:00
Andreas Wrede	ab33d81b30	catch syntax wanring when parsing version string	2026-04-12 16:39:51 -04:00
Andreas Wrede	2c0328f36d	update install.sh to handle missing venv module	2026-04-12 16:39:14 -04:00
Andreas Wrede	fb8e27825d	make install.sh work on systems withou pip	2026-04-12 14:16:44 -04:00