version 5.1.3

fix: remove unused pytest import in test_nagios_runner
fix: use os.path.exists check for /dev/log instead of dead-code OSError catch
2026-04-25 16:52:56 +02:00 · 2026-04-25 16:39:56 +02:00 · 2026-04-25 16:36:00 +02:00 · 2026-04-25 16:29:54 +02:00 · 2026-04-25 16:28:32 +02:00 · 2026-04-25 16:24:33 +02:00
9 changed files with 992 additions and 65 deletions
@@ -0,0 +1,602 @@
+# Plugin Error Checking Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode.
+
+**Architecture:** Three focused changes across three files: (1) `hbd/client/plugin.py` gains a `skip_reason` attribute on Plugin and updated PluginLoader messaging; (2) `hbd/client/plugins/nagios_runner.py` gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) `hbd/client/main.py` gains proper post-fork logging reconfiguration to syslog.
+
+**Tech Stack:** Python 3.11+, asyncio, `logging.handlers.SysLogHandler`, pytest
+
+---
+
+## File Map
+
+| Action | Path | What changes |
+|---|---|---|
+| Modify | `hbd/client/plugin.py` | `Plugin.__init__` gains `skip_reason`; `PluginLoader` checks it |
+| Modify | `hbd/client/plugins/nagios_runner.py` | async subprocess, stderr, signal codes, init validation, `skip_reason` |
+| Modify | `hbd/client/main.py` | `_reconfigure_logging_for_daemon()` helper; remove redundant syslog calls |
+| Create | `tests/test_plugin.py` | PluginLoader messaging tests |
+| Create | `tests/test_nagios_runner.py` | NagiosRunnerPlugin behaviour tests |
+
+Run tests throughout with:
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+
+---
+
+## Task 1: Plugin.skip_reason + PluginLoader messaging
+
+**Files:**
+- Modify: `hbd/client/plugin.py:40-48` (Plugin.__init__)
+- Modify: `hbd/client/plugin.py:369-381` (PluginLoader.load_from_directory)
+- Create: `tests/test_plugin.py`
+
+- [ ] **Step 1: Write failing tests**
+
+Create `tests/test_plugin.py`:
+
+```python
+import asyncio
+import logging
+import textwrap
+
+from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry
+
+
+def test_plugin_skip_reason_defaults_none(tmp_path):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class MinimalPlugin(MonitorPlugin):
+            name = "minimal"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return True
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "minimal.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+    asyncio.run(loader.load_from_directory(tmp_path))
+    plugin = registry.get("minimal")
+    assert plugin is not None
+    assert plugin.skip_reason is None
+
+
+def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class SkippablePlugin(MonitorPlugin):
+            name = "skippable"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                self.skip_reason = "not configured in yaml"
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "skippable.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.INFO, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
+    assert not any("failed initialization" in r.message for r in caplog.records)
+
+
+def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class FailPlugin(MonitorPlugin):
+            name = "fail"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "fail_plugin.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("failed initialization" in r.message for r in caplog.records)
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+python -m pytest tests/test_plugin.py -v
+```
+Expected: `test_plugin_skip_reason_defaults_none` FAILS (attribute missing), others may error.
+
+- [ ] **Step 3: Add `skip_reason` to `Plugin.__init__`**
+
+In `hbd/client/plugin.py`, in `Plugin.__init__` (around line 46), add one line:
+
+```python
+def __init__(self, config: Optional[Dict[str, Any]] = None):
+    self.config = config or {}
+    self.logger = logging.getLogger(f"plugin.{self.name}")
+    self._initialized = False
+    self.skip_reason: Optional[str] = None
+```
+
+- [ ] **Step 4: Update PluginLoader messaging**
+
+In `hbd/client/plugin.py`, replace the `if not initialized:` block (around line 372):
+
+```python
+                    if not initialized:
+                        if plugin.skip_reason:
+                            self.logger.info(
+                                f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
+                            )
+                        else:
+                            self.logger.warning(
+                                f"Plugin {plugin.name} failed initialization, skipping"
+                            )
+                        continue
+```
+
+- [ ] **Step 5: Run tests to verify they pass**
+
+```bash
+python -m pytest tests/test_plugin.py -v
+```
+Expected: all 3 tests PASS.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add hbd/client/plugin.py tests/test_plugin.py
+git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging"
+```
+
+---
+
+## Task 2: NagiosRunnerPlugin — skip_reason when no commands
+
+**Files:**
+- Modify: `hbd/client/plugins/nagios_runner.py:88-105` (initialize)
+- Create: `tests/test_nagios_runner.py`
+
+- [ ] **Step 1: Write failing test**
+
+Create `tests/test_nagios_runner.py`:
+
+```python
+import asyncio
+import logging
+import os
+import stat
+
+from hbd.client.plugins.nagios_runner import (
+    NagiosRunnerPlugin,
+    NAGIOS_OK,
+    NAGIOS_WARNING,
+    NAGIOS_CRITICAL,
+    NAGIOS_UNKNOWN,
+)
+
+
+def test_no_commands_sets_skip_reason():
+    plugin = NagiosRunnerPlugin(config={"commands": []})
+    result = asyncio.run(plugin.initialize())
+    assert result is False
+    assert plugin.skip_reason is not None
+    assert "nagios_runner.commands" in plugin.skip_reason
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
+```
+Expected: FAIL — `plugin.skip_reason` is `None`.
+
+- [ ] **Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()**
+
+In `hbd/client/plugins/nagios_runner.py`, replace the early-return block in `initialize()` (around line 96):
+
+```python
+        if not self.commands:
+            self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
+            self.logger.info("No Nagios commands configured")
+            return False
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
+```
+Expected: PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
+git commit -m "feat: set skip_reason on nagios_runner when no commands configured"
+```
+
+---
+
+## Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes
+
+**Files:**
+- Modify: `hbd/client/plugins/nagios_runner.py` (imports + `_run_nagios_plugin`)
+- Modify: `tests/test_nagios_runner.py`
+
+- [ ] **Step 1: Write failing tests**
+
+Append to `tests/test_nagios_runner.py`:
+
+```python
+def test_stderr_used_when_stdout_empty(tmp_path):
+    script = tmp_path / "check_err.sh"
+    script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "error from stderr" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_CRITICAL
+
+
+def test_stderr_appended_when_both_present(tmp_path):
+    script = tmp_path / "check_both.sh"
+    script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "OK - all good" in data["t_output"]
+    assert "extra detail" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_OK
+
+
+def test_negative_returncode_maps_to_unknown():
+    # kill -9 $$ kills the shell itself; asyncio sees returncode -9
+    config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert data["t_status_code"] == NAGIOS_UNKNOWN
+    assert "signal" in data["t_output"].lower()
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \
+    tests/test_nagios_runner.py::test_stderr_appended_when_both_present \
+    tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v
+```
+Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes.
+
+- [ ] **Step 3: Update imports in nagios_runner.py**
+
+Replace the import block at the top of `hbd/client/plugins/nagios_runner.py`:
+
+```python
+import asyncio
+import os
+import re
+from typing import Any, Dict, List, Optional, Tuple
+
+from hbd.client.plugin import MonitorPlugin
+```
+
+(Remove `import subprocess`; add `import asyncio` and `import os`.)
+
+- [ ] **Step 4: Upgrade collection log level from DEBUG to INFO**
+
+In `hbd/client/plugins/nagios_runner.py`, in `_collect_metrics()`, change the debug log (around line 144) so results are visible at INFO level:
+
+```python
+                self.logger.info(
+                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
+                )
+```
+
+- [ ] **Step 5: Replace `_run_nagios_plugin` with async implementation**
+
+Replace the entire `_run_nagios_plugin` method in `hbd/client/plugins/nagios_runner.py`:
+
+```python
+    async def _run_nagios_plugin(
+        self,
+        command: str
+    ) -> Tuple[int, str, Dict[str, Any]]:
+        """Execute a Nagios plugin and parse its output."""
+        try:
+            proc = await asyncio.create_subprocess_shell(
+                command,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+            try:
+                stdout_bytes, stderr_bytes = await asyncio.wait_for(
+                    proc.communicate(), timeout=self.timeout
+                )
+            except asyncio.TimeoutError:
+                proc.kill()
+                await proc.communicate()
+                self.logger.error(f"Command timed out: {command}")
+                return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
+
+            status_code = proc.returncode
+
+            if status_code < 0:
+                return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}
+
+            if status_code > 3:
+                status_code = NAGIOS_UNKNOWN
+
+            stdout = stdout_bytes.decode(errors="replace").strip()
+            stderr = stderr_bytes.decode(errors="replace").strip()
+
+            # Parse perfdata from stdout before mixing in stderr
+            perfdata = self._parse_perfdata(stdout)
+
+            # Build status message
+            status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout
+
+            if not stdout and stderr:
+                output_msg = stderr
+            elif stdout and stderr:
+                output_msg = f"{status_part} [stderr: {stderr}]"
+            else:
+                output_msg = status_part
+
+            return status_code, output_msg, perfdata
+
+        except Exception as e:
+            self.logger.error(f"Error executing command: {e}")
+            return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
+```
+
+Also remove the now-unused `self.shell` line from `__init__` (the `shell` config key is no longer used since `create_subprocess_shell` always uses a shell):
+
+In `NagiosRunnerPlugin.__init__`, remove:
+```python
+        self.shell: bool = config.get("shell", True) if config else True
+```
+
+- [ ] **Step 6: Run tests to verify they pass**
+
+```bash
+python -m pytest tests/test_nagios_runner.py -v
+```
+Expected: all tests PASS including the 3 new ones.
+
+- [ ] **Step 7: Commit**
+
+```bash
+git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
+git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling"
+```
+
+---
+
+## Task 4: NagiosRunnerPlugin — command path validation at init
+
+**Files:**
+- Modify: `hbd/client/plugins/nagios_runner.py` (initialize)
+- Modify: `tests/test_nagios_runner.py`
+
+- [ ] **Step 1: Write failing tests**
+
+Append to `tests/test_nagios_runner.py`:
+
+```python
+def test_absolute_path_not_found_warns(caplog):
+    fake_cmd = "/nonexistent_hbc_test_path/check_something"
+    config = {"commands": [{"name": "t", "command": fake_cmd}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not found" in r.message for r in caplog.records)
+
+
+def test_absolute_path_not_executable_warns(caplog, tmp_path):
+    non_exec = tmp_path / "check_test"
+    non_exec.write_text("#!/bin/sh\necho OK\n")
+    non_exec.chmod(0o644)  # readable but not executable
+
+    config = {"commands": [{"name": "t", "command": str(non_exec)}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not executable" in r.message for r in caplog.records)
+
+
+def test_relative_path_not_checked(caplog):
+    # Bare command names and relative paths (resolved by the shell) must not generate warnings
+    config = {"commands": [{"name": "t", "command": "echo OK"}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert not any(
+        "not found" in r.message or "not executable" in r.message
+        for r in caplog.records
+    )
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \
+    tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \
+    tests/test_nagios_runner.py::test_relative_path_not_checked -v
+```
+Expected: `test_absolute_path_not_found_warns` and `test_absolute_path_not_executable_warns` FAIL (no warnings logged); `test_relative_path_not_checked` may pass.
+
+- [ ] **Step 3: Add command path validation to `initialize()`**
+
+In `hbd/client/plugins/nagios_runner.py`, extend `initialize()` by adding validation after the existing "log each command" loop (after line 103, before `return True`):
+
+```python
+        # Validate absolute command paths early
+        for cmd_config in self.commands:
+            name = cmd_config.get("name", "unnamed")
+            command = cmd_config.get("command", "")
+            if not command:
+                continue
+            exe = command.split()[0]
+            if os.path.isabs(exe):
+                if not os.path.isfile(exe):
+                    self.logger.warning(
+                        f"Command '{name}': executable not found: {exe}"
+                    )
+                elif not os.access(exe, os.X_OK):
+                    self.logger.warning(
+                        f"Command '{name}': executable not executable: {exe}"
+                    )
+```
+
+- [ ] **Step 4: Run full test suite to verify all pass**
+
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+Expected: all tests PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
+git commit -m "feat: validate absolute command paths at nagios_runner init"
+```
+
+---
+
+## Task 5: Daemon mode logging — route to syslog after fork
+
+**Files:**
+- Modify: `hbd/client/main.py` (new helper + updated daemon block)
+
+No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below.
+
+- [ ] **Step 1: Add `_reconfigure_logging_for_daemon` helper**
+
+In `hbd/client/main.py`, add this function just before `def build_parser()` (around line 589):
+
+```python
+def _reconfigure_logging_for_daemon(log_level: int) -> None:
+    """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
+    from logging.handlers import SysLogHandler
+
+    root = logging.getLogger()
+    for handler in root.handlers[:]:
+        root.removeHandler(handler)
+        handler.close()
+
+    # Check for the socket explicitly instead of catching OSError from the
+    # constructor; the follow-up fix treats that catch as dead code.
+    use_udp_fallback = not os.path.exists("/dev/log")
+
+    if use_udp_fallback:
+        syslog_handler = SysLogHandler(
+            address=("localhost", 514),
+            facility=SysLogHandler.LOG_DAEMON,
+        )
+    else:
+        syslog_handler = SysLogHandler(
+            address="/dev/log",
+            facility=SysLogHandler.LOG_DAEMON,
+        )
+
+    syslog_handler.setFormatter(
+        logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
+    )
+    root.addHandler(syslog_handler)
+    root.setLevel(log_level)
+
+    if use_udp_fallback:
+        # The handler is attached first so this warning reaches syslog
+        logging.warning("/dev/log not found, using syslog UDP localhost:514")
+
+- [ ] **Step 2: Update the daemon block in `main()`**
+
+In `hbd/client/main.py`, replace the entire `if args.daemon:` block (lines 664–675):
+
+```python
+    if args.daemon:
+        print("Daemonizing...")
+        daemonize()
+        _reconfigure_logging_for_daemon(log_level)
+        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
+```
+
+This removes the `import syslog`, `syslog.openlog()`, and `syslog.syslog()` calls (now handled by the logging system) and removes the no-op second `logging.basicConfig()` call.
+
+- [ ] **Step 3: Run existing test suite to confirm no regressions**
+
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+Expected: all tests still PASS.
+
+- [ ] **Step 4: Manual smoke test — verify syslog output in daemon mode**
+
+```bash
+# In one terminal, tail syslog
+sudo journalctl -f -t hbc
+
+# In another terminal, start hbc in daemon mode (replace localhost with a real host if needed)
+python -m hbd.client.main -d -v localhost
+
+# Expected in journalctl output:
+#   hbc[<pid>]: hbc.main INFO: Starting hbc for <hostname> -> ['localhost']
+#   hbc[<pid>]: hbc.main INFO: hbc starting, sending heartbeat to localhost
+#   hbc[<pid>]: plugin.loader INFO: ...
+
+# Stop the daemon
+pkill -f "hbd.client.main"
+```
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/client/main.py
+git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig"
+```
@@ -0,0 +1,92 @@
+# Plugin Error Checking & Daemon Logging — Design Spec
+
+**Date:** 2026-04-25  
+**Scope:** hbc client — daemon mode logging, nagios_runner plugin robustness, PluginLoader messaging  
+**Files affected:** `hbd/client/main.py`, `hbd/client/plugins/nagios_runner.py`, `hbd/client/plugin.py`
+
+---
+
+## 1. Daemon Mode Logging
+
+### Problem
+In `main()`, `logging.basicConfig()` is called before `daemonize()` (establishing a StreamHandler to stderr), then called again after `daemonize()`. The second call is a no-op: `basicConfig()` does nothing when the root logger already has handlers (unless `force=True` is passed). After daemonization, stderr is redirected to `/dev/null`, so all subsequent log output is silently discarded.
+
+The existing `syslog.openlog()` / `syslog.syslog()` calls (lines 666–668) write a single startup message but do not integrate with the `logging` system, so plugin and connection log messages never reach syslog.
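The no-op second call can be reproduced in isolation. The sketch below is illustrative only: it writes to in-memory streams instead of stderr, and the function name `second_basic_config_is_ignored` is hypothetical, not part of hbc.

```python
import io
import logging


def second_basic_config_is_ignored() -> bool:
    """Return True if a second basicConfig() call leaves the first handler in place."""
    root = logging.getLogger()
    # Start from a clean root logger so the first basicConfig() call takes effect.
    saved_handlers, saved_level = root.handlers[:], root.level
    for handler in saved_handlers:
        root.removeHandler(handler)
    try:
        first, second = io.StringIO(), io.StringIO()
        logging.basicConfig(stream=first, level=logging.INFO, format="first: %(message)s")
        # Handlers already exist, so this call changes nothing.
        logging.basicConfig(stream=second, level=logging.DEBUG, format="second: %(message)s")
        logging.getLogger("demo").info("hello")
        return first.getvalue() == "first: hello\n" and second.getvalue() == ""
    finally:
        # Restore whatever logging setup the caller had.
        for handler in root.handlers[:]:
            root.removeHandler(handler)
        for handler in saved_handlers:
            root.addHandler(handler)
        root.setLevel(saved_level)
```

All output lands in the first stream, which is the situation the daemon ends up in once stderr points at `/dev/null`.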
+
+### Fix
+After `daemonize()`, explicitly reconfigure the root logger:
+
+1. Remove all existing handlers (they now write to `/dev/null`).
+2. Add `logging.handlers.SysLogHandler(address='/dev/log', facility=LOG_DAEMON)`.
+3. Set formatter: `hbc[%(process)d]: %(name)s %(levelname)s: %(message)s`
+4. Preserve the `log_level` already determined from `-v`/`-x` CLI flags.
+
+Remove the redundant `syslog.openlog()` / `syslog.syslog()` calls — the logging system handles routing.
+
+**Fallback:** If `/dev/log` does not exist (containers, some BSDs), fall back to `SysLogHandler(address=('localhost', 514))`. Log one warning through the newly attached syslog handler so the operator knows; stderr already points at `/dev/null` after `daemonize()`, so a warning written there would be lost.
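The address selection in the fallback rule can be isolated into a small pure function. This is a minimal sketch, assuming a hypothetical helper name `pick_syslog_address`; the committed code inlines the same check.

```python
import os
from typing import Tuple, Union


def pick_syslog_address(dev_log: str = "/dev/log") -> Union[str, Tuple[str, int]]:
    """Return the Unix socket path if it exists, else the UDP fallback address."""
    if os.path.exists(dev_log):
        return dev_log
    # Fallback from the spec: plain UDP syslog on localhost.
    return ("localhost", 514)
```

The result can be passed directly as the `address` argument of `logging.handlers.SysLogHandler`, which accepts either a socket path or a `(host, port)` tuple.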
+
+---
+
+## 2. Nagios Runner Improvements
+
+### 2a — Async Subprocess
+`_run_nagios_plugin()` is declared `async def` but calls `subprocess.run()` synchronously, blocking the event loop for the full command duration.
+
+**Fix:** Replace with `asyncio.create_subprocess_shell()` + `await proc.communicate()`. Enforce timeout with `asyncio.wait_for(..., timeout=self.timeout)` and catch `asyncio.TimeoutError`.
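A stripped-down version of this pattern, without any Nagios parsing, might look like the following. The function name `run_with_timeout` and its return shape are assumptions made for illustration; the plugin's real method returns a status code, message, and perfdata.

```python
import asyncio
from typing import Optional, Tuple


async def run_with_timeout(command: str, timeout: float) -> Tuple[Optional[int], str]:
    """Run a shell command without blocking the event loop.

    Returns (returncode, stripped stdout), or (None, "timeout") if the
    command did not finish in time.
    """
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()               # stop the shell started for the command
        await proc.communicate()  # reap it so no zombie process is left behind
        return None, "timeout"
    return proc.returncode, stdout.decode(errors="replace").strip()
```

Because every wait is an `await`, other coroutines (heartbeats, other plugins) keep running while the command executes.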
+
+### 2b — Stderr Capture
+Subprocess stderr is currently lost: `capture_output=True` in the sync `subprocess.run()` call captures both stdout and stderr, but only stdout is used afterwards.
+
+**Fix:** Pass `stderr=asyncio.subprocess.PIPE` to `create_subprocess_shell`. After `communicate()`, if stdout is empty but stderr has content, use stderr as the output message. If both have content, append stderr to the output for visibility.
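The merge rule is a pure string operation, so it can be sketched and checked on its own. The helper name `merge_output` is hypothetical; the `[stderr: ...]` suffix follows the format used in the implementation plan.

```python
def merge_output(stdout: str, stderr: str) -> str:
    """Combine already-stripped stdout and stderr into one status message."""
    # Nagios perfdata follows a '|'; only the part before it is the status text.
    status_part = stdout.split("|")[0].strip()
    if not stdout and stderr:
        return stderr
    if stdout and stderr:
        return f"{status_part} [stderr: {stderr}]"
    return status_part
```

Perfdata is parsed from stdout separately, so dropping everything after the `|` here loses nothing.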
+
+### 2c — Negative Return Codes
+A negative `returncode` means the process was killed by a signal (SIGKILL, OOM, etc.). The current code treats these as-is, which may produce unexpected status values.
+
+**Fix:** If `returncode < 0`, map to `NAGIOS_UNKNOWN` with message `"Process killed by signal {-returncode}"`.
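As a rough sketch, the return-code mapping could be written as below. `normalize_returncode` is a hypothetical name, and `NAGIOS_UNKNOWN = 3` is assumed to match the constant in `nagios_runner.py` (3 is the conventional Nagios UNKNOWN code).

```python
from typing import Optional, Tuple

NAGIOS_UNKNOWN = 3  # assumed value; conventional Nagios exit code for UNKNOWN


def normalize_returncode(returncode: int) -> Tuple[int, Optional[str]]:
    """Map a raw process return code to a Nagios status plus an optional message."""
    if returncode < 0:
        # asyncio reports death by signal N as returncode -N.
        return NAGIOS_UNKNOWN, f"Process killed by signal {-returncode}"
    if returncode > 3:
        # Outside the 0-3 Nagios range, e.g. 127 for "command not found".
        return NAGIOS_UNKNOWN, None
    return returncode, None
```

A `None` message means the caller should build the message from the command's own output.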
+
+### 2d — Command Path Validation at Init
+`initialize()` currently only checks that the commands list is non-empty.
+
+**Fix:** For each command entry during `initialize()`:
+- Warn and skip the entry if `name` or `command` is missing.
+- Extract the executable (first whitespace-delimited token of the command string).
+- If the executable is an absolute path, check `os.path.isfile()` and `os.access(..., os.X_OK)`. Log a `WARNING` if either check fails.
+- Commands with relative paths or shell builtins are not checked (they may be on PATH) — just noted.
+- Validation warns only; all original entries in `self.commands` are retained and still attempted at collection time (where the existing missing-name/command guard already skips them). The plugin initializes successfully as long as the commands list is non-empty.
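The per-command check could be sketched as a function that returns the warning text, or `None` when there is nothing to report. The name `check_executable` is hypothetical, and the sketch tokenizes with `shlex.split` (as the committed code does) rather than plain whitespace splitting.

```python
import os
import shlex
from typing import Optional


def check_executable(command: str) -> Optional[str]:
    """Return a warning string for a bad absolute executable path, else None."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return None  # malformed quoting; leave it to fail at collection time
    if not tokens:
        return None
    exe = tokens[0]
    if not os.path.isabs(exe):
        return None  # bare names and relative paths are resolved by the shell
    if not os.path.isfile(exe):
        return f"executable not found: {exe}"
    if not os.access(exe, os.X_OK):
        return f"executable not executable: {exe}"
    return None
```

Returning the message instead of logging it keeps the check easy to test; the caller decides whether to log it at `WARNING`.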
+
+---
+
+## 3. PluginLoader Messaging
+
+### Problem
+When `initialize()` returns `False`, the loader always logs:
+> `WARNING: Plugin X failed initialization, skipping`
+
+This is alarming when the real reason is simply "no commands configured". There is no API to distinguish "not configured" from "genuinely broken".
+
+### Fix
+Add an optional `skip_reason` attribute to `Plugin.__init__()` (defaults to `None`).
+
+In `PluginLoader.load_from_directory()`, after `initialize()` returns `False`:
+- If `plugin.skip_reason` is set → `logger.info(f"Plugin {plugin.name} skipped: {plugin.skip_reason}")`
+- If `plugin.skip_reason` is `None` → `logger.warning(f"Plugin {plugin.name} failed initialization, skipping")` (existing behaviour)
+
+In `NagiosRunnerPlugin.initialize()`, when no commands are configured:
+```python
+self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
+return False
+```
+
+Genuine failures (exceptions) continue to go through the existing `except` block in the loader, logging at `ERROR` with traceback — unchanged.
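The loader's decision reduces to choosing a level and a message from `skip_reason`. A minimal sketch, assuming a hypothetical helper `init_failure_log` (the real loader logs inline):

```python
import logging
from typing import Optional, Tuple


def init_failure_log(name: str, skip_reason: Optional[str]) -> Tuple[int, str]:
    """Pick the level and message for a plugin whose initialize() returned False."""
    if skip_reason:
        # Deliberate opt-out: informational, not alarming.
        return logging.INFO, f"Plugin {name} skipped: {skip_reason}"
    return logging.WARNING, f"Plugin {name} failed initialization, skipping"
```

An empty string is treated like `None`, matching the truthiness test `if plugin.skip_reason:` in the loader.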
+
+---
+
+## Decisions
+
+| Topic | Decision |
+|---|---|
+| Daemon log destination | syslog only (LOG_DAEMON facility) |
+| Syslog fallback | localhost:514 UDP if `/dev/log` absent |
+| Nagios result log level | INFO for all statuses (OK/WARNING/CRITICAL/UNKNOWN) |
+| Invalid command handling at init | Warn and continue; still attempt at collection time |
+| PluginLoader API change | `skip_reason` attribute on Plugin base class, checked by loader |
@@ -14,4 +14,4 @@ Install options:
 """

 __all__ = ["__version__"]
-__version__ = "5.1.2"
+__version__ = "5.1.3"
@@ -15,6 +15,7 @@ import socket
 import sys
 import time
 from hashlib import md5
+from logging.handlers import SysLogHandler
 from pathlib import Path
 from typing import Dict, List, Optional

@@ -586,6 +587,36 @@ def daemonize(
    os.dup2(se.fileno(), sys.stderr.fileno())


+def _reconfigure_logging_for_daemon(log_level: int) -> None:
+    """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
+    root = logging.getLogger()
+    for handler in root.handlers[:]:
+        root.removeHandler(handler)
+        handler.close()
+
+    use_udp_fallback = not os.path.exists("/dev/log")
+
+    if use_udp_fallback:
+        syslog_handler = SysLogHandler(
+            address=("localhost", 514),
+            facility=SysLogHandler.LOG_DAEMON,
+        )
+    else:
+        syslog_handler = SysLogHandler(
+            address="/dev/log",
+            facility=SysLogHandler.LOG_DAEMON,
+        )
+
+    syslog_handler.setFormatter(
+        logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
+    )
+    root.addHandler(syslog_handler)
+    root.setLevel(log_level)
+
+    if use_udp_fallback:
+        logging.warning("/dev/log not found, using syslog UDP localhost:514")
+
+
 def build_parser():
    """Build argument parser."""
    parser = argparse.ArgumentParser(
@@ -663,16 +694,9 @@ def main(argv=None):
    # Daemonize if requested
    if args.daemon:
        print("Daemonizing...")
-        import syslog
-        syslog.openlog("hbc", syslog.LOG_PID, syslog.LOG_DAEMON)
-        syslog.syslog(syslog.LOG_INFO, f"Starting heartbeat to {', '.join(args.hosts)}")
        daemonize()
-        
-        # Reconfigure logging for syslog
-        logging.basicConfig(
-            level=log_level,
-            format="hbc[%(process)d]: %(name)s %(levelname)s: %(message)s"
-        )
+        _reconfigure_logging_for_daemon(log_level)
+        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
    
    # Run async main
    try:
@@ -22,13 +22,14 @@ from typing import Any, Dict, List, Optional, Type

 class Plugin(ABC):
    """Base class for all plugins.
-    
+
    Attributes:
        name: Unique plugin identifier (e.g., "os_info", "cpu_monitor")
        version: Plugin version string
        description: Human-readable description
        interval: Collection interval in seconds (0 for InfoPlugin = collect once)
        enabled: Whether plugin is active (can be disabled via config)
+        skip_reason: Set by plugin before returning False from initialize(); causes loader to log INFO instead of WARNING.
    """
    
    name: str = ""
@@ -39,13 +40,14 @@ class Plugin(ABC):
    
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize plugin with optional configuration.
-        
+
        Args:
            config: Plugin-specific configuration from YAML (e.g., thresholds, paths)
        """
        self.config = config or {}
        self.logger = logging.getLogger(f"plugin.{self.name}")
        self._initialized = False
+        self.skip_reason: Optional[str] = None
        
    @abstractmethod
    async def initialize(self) -> bool:
@@ -369,9 +371,14 @@ class PluginLoader:
                    try:
                        initialized = await plugin.initialize()
                        if not initialized:
-                            self.logger.warning(
-                                f"Plugin {plugin.name} failed initialization, skipping"
-                            )
+                            if plugin.skip_reason:
+                                self.logger.info(
+                                    f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
+                                )
+                            else:
+                                self.logger.warning(
+                                    f"Plugin {plugin.name} failed initialization, skipping"
+                                )
                            continue
                    except Exception as e:
                        self.logger.error(
@@ -21,8 +21,10 @@ nagios_runner:
 ```
 """

+import asyncio
+import os
 import re
-import subprocess
+import shlex
 from typing import Any, Dict, List, Optional, Tuple

 from hbd.client.plugin import MonitorPlugin
@@ -52,8 +54,7 @@ class NagiosRunnerPlugin(MonitorPlugin):
        interval: Collection interval in seconds (default: 300)
        commands: List of command definitions with 'name' and 'command' keys
        timeout: Command execution timeout in seconds (default: 30)
-        shell: Whether to execute commands via shell (default: True)
-    
+
    Example:
        nagios_runner:
          interval: 300  # Check every 5 minutes
@@ -76,32 +77,48 @@ class NagiosRunnerPlugin(MonitorPlugin):
        # Extract configuration
        self.commands: List[Dict[str, str]] = config.get("commands", []) if config else []
        self.timeout: int = config.get("timeout", 30) if config else 30
-        self.shell: bool = config.get("shell", True) if config else True
        self.interval = config.get("interval", 300) if config else 300
-        
-        # Validate commands
-        if not self.commands:
-            self.logger.info(
-                "No Nagios commands configured. Add 'nagios_runner.commands' to config."
-            )
    
    async def initialize(self) -> bool:
        """Initialize the Nagios runner plugin.
-        
+
        Returns:
            True if at least one command is configured, False otherwise
        """
        self.logger.info(f"Initializing {self.name} plugin")
-        
+
        if not self.commands:
-            self.logger.info("No Nagios commands configured")
+            self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
            return False
-        
+
        self.logger.info(f"Configured to run {len(self.commands)} Nagios plugin(s)")
        for cmd_config in self.commands:
            name = cmd_config.get("name", "unnamed")
            self.logger.info(f"  - {name}: {cmd_config.get('command', 'N/A')}")
-        
+
+        # Validate absolute command paths early
+        for cmd_config in self.commands:
+            name = cmd_config.get("name", "unnamed")
+            command = cmd_config.get("command", "")
+            if not command:
+                continue
+            try:
+                tokens = shlex.split(command)
+            except ValueError:
+                continue  # malformed command string; skip validation
+            if not tokens:
+                continue
+            exe = tokens[0]
+            if os.path.isabs(exe):
+                if not os.path.isfile(exe):
+                    self.logger.warning(
+                        f"Command '{name}': executable not found: {exe}"
+                    )
+                elif not os.access(exe, os.X_OK):
+                    self.logger.warning(
+                        f"Command '{name}': file is not executable: {exe}"
+                    )
+
        return True
    
    async def _collect_metrics(self) -> Dict[str, Any]:
@@ -141,7 +158,7 @@ class NagiosRunnerPlugin(MonitorPlugin):
                    for metric_name, metric_value in perfdata.items():
                        results[f"{name}_{metric_name}"] = metric_value
                
-                self.logger.debug(
+                self.logger.info(
                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
                )
                
@@ -163,46 +180,49 @@ class NagiosRunnerPlugin(MonitorPlugin):
        self,
        command: str
    ) -> Tuple[int, str, Dict[str, Any]]:
-        """Execute a Nagios plugin and parse its output.
-        
-        Args:
-            command: Command string to execute
-            
-        Returns:
-            Tuple of (status_code, output_message, performance_data_dict)
-        """
+        """Execute a Nagios plugin and parse its output."""
        try:
-            # Run command
-            result = subprocess.run(
+            proc = await asyncio.create_subprocess_shell(
                command,
-                shell=self.shell,
-                capture_output=True,
-                timeout=self.timeout,
-                text=True
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
            )
-            
-            status_code = result.returncode
-            output = result.stdout.strip()
-            
-            # Nagios plugins can return codes > 3, treat as UNKNOWN
+            try:
+                stdout_bytes, stderr_bytes = await asyncio.wait_for(
+                    proc.communicate(), timeout=self.timeout
+                )
+            except asyncio.TimeoutError:
+                proc.kill()
+                await proc.communicate()
+                self.logger.error(f"Command timed out: {command}")
+                return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
+
+            status_code = proc.returncode
+
+            if status_code < 0:
+                return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}
+
            if status_code > 3:
                status_code = NAGIOS_UNKNOWN
-            
-            # Parse performance data
-            perfdata = self._parse_perfdata(output)
-            
-            # Extract just the status message (before the pipe if present)
-            if '|' in output:
-                output_msg = output.split('|')[0].strip()
+
+            stdout = stdout_bytes.decode(errors="replace").strip()
+            stderr = stderr_bytes.decode(errors="replace").strip()
+
+            # Parse perfdata from stdout before mixing in stderr
+            perfdata = self._parse_perfdata(stdout)
+
+            # Build status message
+            status_part = stdout.split('|', 1)[0].strip()
+
+            if not stdout and stderr:
+                output_msg = stderr
+            elif stdout and stderr:
+                output_msg = f"{status_part} [stderr: {stderr}]"
            else:
-                output_msg = output
-            
+                output_msg = status_part
+
            return status_code, output_msg, perfdata
-            
-        except subprocess.TimeoutExpired:
-            self.logger.error(f"Command timed out: {command}")
-            return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
-        
+
        except Exception as e:
            self.logger.error(f"Error executing command: {e}")
            return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
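The result handling in the rewritten `_run_nagios_plugin` above can be read as a pure mapping from `(returncode, stdout, stderr)` to a Nagios status and message. The following is a minimal sketch of that mapping, not the project's code: `interpret_result` is a hypothetical helper name, `NAGIOS_UNKNOWN = 3` is assumed from the standard Nagios convention (the constant's definition is not part of this diff), and perfdata parsing is left out.

```python
from typing import Tuple

NAGIOS_UNKNOWN = 3  # assumption: the standard Nagios UNKNOWN exit code


def interpret_result(
    returncode: int, stdout_bytes: bytes, stderr_bytes: bytes
) -> Tuple[int, str]:
    """Hypothetical helper mirroring the status/message rules in the diff."""
    # asyncio reports a child killed by signal N as returncode -N.
    if returncode < 0:
        return NAGIOS_UNKNOWN, f"Process killed by signal {-returncode}"

    # Exit codes above 3 fall outside the Nagios range and map to UNKNOWN.
    status = NAGIOS_UNKNOWN if returncode > 3 else returncode

    stdout = stdout_bytes.decode(errors="replace").strip()
    stderr = stderr_bytes.decode(errors="replace").strip()

    # Text before the first '|' is the status message; the rest is perfdata.
    status_part = stdout.split("|", 1)[0].strip()

    if not stdout and stderr:
        return status, stderr
    if stdout and stderr:
        return status, f"{status_part} [stderr: {stderr}]"
    return status, status_part
```

Keeping this logic free of subprocess calls would let it be tested without spawning shells, although the diff itself keeps it inline.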
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "hbd"
-version = "5.1.2"
+version = "5.1.3"
 description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
 readme = "README.md"
 requires-python = ">=3.11"
@@ -0,0 +1,99 @@
+import asyncio
+import logging
+import os
+import stat
+
+from hbd.client.plugins.nagios_runner import (
+    NagiosRunnerPlugin,
+    NAGIOS_OK,
+    NAGIOS_WARNING,
+    NAGIOS_CRITICAL,
+    NAGIOS_UNKNOWN,
+)
+
+
+def test_no_commands_sets_skip_reason():
+    plugin = NagiosRunnerPlugin(config={"commands": []})
+    result = asyncio.run(plugin.initialize())
+    assert result is False
+    assert plugin.skip_reason is not None
+    assert "nagios_runner.commands" in plugin.skip_reason
+
+
+def test_stderr_used_when_stdout_empty(tmp_path):
+    script = tmp_path / "check_err.sh"
+    script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "error from stderr" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_CRITICAL
+
+
+def test_stderr_appended_when_both_present(tmp_path):
+    script = tmp_path / "check_both.sh"
+    script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "OK - all good" in data["t_output"]
+    assert "extra detail" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_OK
+
+
+def test_negative_returncode_maps_to_unknown():
+    # kill -9 $$ kills the shell itself; asyncio sees returncode -9
+    config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert data["t_status_code"] == NAGIOS_UNKNOWN
+    assert "signal" in data["t_output"].lower()
+
+
+def test_absolute_path_not_found_warns(caplog):
+    fake_cmd = "/nonexistent_hbc_test_path/check_something"
+    config = {"commands": [{"name": "t", "command": fake_cmd}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not found" in r.message for r in caplog.records)
+
+
+def test_absolute_path_not_executable_warns(caplog, tmp_path):
+    non_exec = tmp_path / "check_test"
+    non_exec.write_text("#!/bin/sh\necho OK\n")
+    non_exec.chmod(0o644)  # readable but not executable
+
+    config = {"commands": [{"name": "t", "command": str(non_exec)}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not executable" in r.message for r in caplog.records)
+
+
+def test_relative_path_not_checked(caplog):
+    # Relative paths (resolved via PATH) must not generate warnings
+    config = {"commands": [{"name": "t", "command": "echo OK"}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert not any(
+        "not found" in r.message or "not executable" in r.message
+        for r in caplog.records
+    )
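The three path tests above pin down when `initialize()` should warn: only for absolute paths, and only when the file is missing or lacks the execute bit. A minimal sketch of that decision as a standalone function follows. The name `path_problem` and the exact message strings are illustrative assumptions; the plugin in the diff logs warnings rather than returning strings.

```python
import os
import shlex
from typing import Optional


def path_problem(command: str) -> Optional[str]:
    """Hypothetical helper: describe a problem with an absolute executable
    path, or return None when there is nothing to report."""
    try:
        # shlex.split honours shell quoting, so quoted paths with spaces work.
        tokens = shlex.split(command)
    except ValueError:
        return None  # malformed quoting; nothing to validate
    if not tokens:
        return None
    exe = tokens[0]
    if not os.path.isabs(exe):
        return None  # resolved via PATH at run time; not checked here
    if not os.path.isfile(exe):
        return f"executable not found: {exe}"
    if not os.access(exe, os.X_OK):
        return f"not executable: {exe}"
    return None
```

For example, `path_problem("echo OK")` returns `None` because `echo` is not an absolute path, which matches the intent of `test_relative_path_not_checked`.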
@@ -0,0 +1,83 @@
+import asyncio
+import logging
+import textwrap
+
+from hbd.client.plugin import PluginLoader, PluginRegistry
+
+
+def test_plugin_skip_reason_defaults_none(tmp_path):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class MinimalPlugin(MonitorPlugin):
+            name = "minimal"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return True
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "minimal.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+    asyncio.run(loader.load_from_directory(tmp_path))
+    plugin = registry.get("minimal")
+    assert plugin is not None
+    assert plugin.skip_reason is None
+
+
+def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class SkippablePlugin(MonitorPlugin):
+            name = "skippable"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                self.skip_reason = "not configured in yaml"
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "skippable.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.INFO, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
+    assert not any("failed initialization" in r.message for r in caplog.records)
+
+
+def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class FailPlugin(MonitorPlugin):
+            name = "fail"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "fail_plugin.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("failed initialization" in r.message for r in caplog.records)
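The loader tests above describe a small contract: a plugin that returns `False` from `initialize()` with `skip_reason` set is logged at INFO as skipped, and one without a reason is logged at WARNING as a failed initialization. The sketch below illustrates that contract in isolation. `SketchPlugin`, `describe_failure`, `try_load`, and the logger name are hypothetical; only the substrings "skipped: <reason>" and "failed initialization" are taken from the test assertions, and the rest of the message format is assumed.

```python
import asyncio
import logging
from typing import Optional, Tuple

logger = logging.getLogger("plugin.loader.sketch")  # hypothetical logger name


class SketchPlugin:
    """Hypothetical stand-in for a plugin; not the project's Plugin class."""

    name = "sketch"

    def __init__(self, ok: bool, reason: Optional[str] = None) -> None:
        self._ok = ok
        self._reason = reason
        self.skip_reason: Optional[str] = None  # None until a plugin opts out

    async def initialize(self) -> bool:
        if not self._ok:
            self.skip_reason = self._reason
        return self._ok


def describe_failure(name: str, skip_reason: Optional[str]) -> Tuple[int, str]:
    """Pick a log level and message for a plugin that did not load."""
    if skip_reason:
        # Deliberate opt-out (for example, not configured): informational.
        return logging.INFO, f"Plugin {name} skipped: {skip_reason}"
    # No reason given: treat it as a genuine failure.
    return logging.WARNING, f"Plugin {name} failed initialization"


async def try_load(plugin: SketchPlugin) -> bool:
    """Initialize a plugin, logging at the level chosen above on failure."""
    if await plugin.initialize():
        return True
    level, message = describe_failure(plugin.name, plugin.skip_reason)
    logger.log(level, message)
    return False
```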
Author	SHA1	Message	Date
andreas	7d8ca5d8db	version 5.1.3	2026-04-25 16:52:56 +02:00
andreas	56037a036d	fix: remove unused pytest import in test_nagios_runner	2026-04-25 16:39:56 +02:00
andreas	65ceb31d8d	fix: use os.path.exists check for /dev/log instead of dead-code OSError catch	2026-04-25 16:36:00 +02:00
andreas	1c9b6c1ca9	fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig. After daemonize() redirects stderr to /dev/null, the existing StreamHandler writes to /dev/null. logging.basicConfig() is a no-op when handlers are already configured, so log messages are silently lost. Replace the daemon block to: (1) call daemonize() first; (2) explicitly remove existing handlers (pointing to /dev/null); (3) add SysLogHandler pointing to /dev/log with fallback to UDP localhost:514; (4) log startup message to the new syslog handler. Removes redundant syslog.openlog() call, which is no longer needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:29:54 +02:00
andreas	d7e6b478e1	fix: use shlex.split() in nagios_runner path validation to handle quoted paths	2026-04-25 16:28:32 +02:00
andreas	535dbda47d	feat: validate absolute command paths at nagios_runner init	2026-04-25 16:24:33 +02:00
andreas	c9567dddae	fix: remove stale shell config key from NagiosRunnerPlugin docstring	2026-04-25 16:23:03 +02:00
andreas	b5963badd6	feat: async subprocess in nagios_runner with stderr capture and signal handling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:18:09 +02:00
andreas	a76a39b4a0	fix: remove redundant no-commands log lines; fix skip_reason docstring style. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:15:44 +02:00
andreas	94e1597978	feat: set skip_reason on nagios_runner when no commands configured. When NagiosRunnerPlugin has no commands configured, set skip_reason before returning False from initialize(). This allows PluginLoader to log INFO (not WARNING) when the plugin is skipped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:13:03 +02:00
andreas	c9c2ed772f	fix: document skip_reason in Plugin docstring; remove unused import in test	2026-04-25 16:10:35 +02:00
andreas	aeb78dcb8e	feat: add skip_reason to Plugin; improve PluginLoader init messaging	2026-04-25 16:08:07 +02:00
andreas	77b337e4dd	Add implementation plan for plugin error checking and daemon logging fixes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:04:13 +02:00
andreas	293461f3f6	Add design spec for plugin error checking and daemon logging fixes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 15:49:09 +02:00
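
The commit message for `1c9b6c1ca9` describes the daemon logging fix as a sequence: daemonize, drop the stale handlers, then attach a `SysLogHandler` that prefers `/dev/log` and falls back to UDP `localhost:514`. The `hbd/client/main.py` diff is not part of this excerpt, so the following is only a sketch of that sequence under assumptions: the function names `build_syslog_handler` and `replace_root_handlers` and the format string are hypothetical, not the project's `_reconfigure_logging_for_daemon()`.

```python
import logging
import logging.handlers
import os


def build_syslog_handler(dev_log: str = "/dev/log") -> logging.Handler:
    """Prefer the local syslog socket; fall back to UDP localhost:514."""
    if os.path.exists(dev_log):
        handler = logging.handlers.SysLogHandler(address=dev_log)
    else:
        handler = logging.handlers.SysLogHandler(address=("localhost", 514))
    # Assumed format; the real project may use a different one.
    handler.setFormatter(
        logging.Formatter("hbc[%(process)d]: %(levelname)s %(name)s: %(message)s")
    )
    return handler


def replace_root_handlers(new_handler: logging.Handler) -> None:
    """Remove every handler on the root logger, then attach new_handler.

    logging.basicConfig() does nothing when the root logger already has
    handlers, so stale handlers must be removed explicitly.
    """
    root = logging.getLogger()
    for old in list(root.handlers):  # copy: the list is mutated while iterating
        root.removeHandler(old)
        old.close()
    root.addHandler(new_handler)
```

After the fork, a caller would run `replace_root_handlers(build_syslog_handler())`. Since Python 3.8, `logging.basicConfig(force=True, handlers=[...])` is an alternative that removes existing root handlers in a single call.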