diff --git a/docs/superpowers/plans/2026-04-25-plugin-error-checking.md b/docs/superpowers/plans/2026-04-25-plugin-error-checking.md new file mode 100644 index 0000000..300e005 --- /dev/null +++ b/docs/superpowers/plans/2026-04-25-plugin-error-checking.md @@ -0,0 +1,602 @@ +# Plugin Error Checking Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode. + +**Architecture:** Three focused changes across three files: (1) `hbd/client/plugin.py` gains a `skip_reason` attribute on Plugin and updated PluginLoader messaging; (2) `hbd/client/plugins/nagios_runner.py` gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) `hbd/client/main.py` gains proper post-fork logging reconfiguration to syslog. + +**Tech Stack:** Python 3.11+, asyncio, `logging.handlers.SysLogHandler`, pytest + +--- + +## File Map + +| Action | Path | What changes | +|---|---|---| +| Modify | `hbd/client/plugin.py` | `Plugin.__init__` gains `skip_reason`; `PluginLoader` checks it | +| Modify | `hbd/client/plugins/nagios_runner.py` | async subprocess, stderr, signal codes, init validation, `skip_reason` | +| Modify | `hbd/client/main.py` | `_reconfigure_logging_for_daemon()` helper; remove redundant syslog calls | +| Create | `tests/test_plugin.py` | PluginLoader messaging tests | +| Create | `tests/test_nagios_runner.py` | NagiosRunnerPlugin behaviour tests | + +Run tests throughout with: +```bash +python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v +``` + +--- + +## Task 1: Plugin.skip_reason + PluginLoader messaging + +**Files:** +- Modify: `hbd/client/plugin.py:40-48` (Plugin.__init__) +- Modify: `hbd/client/plugin.py:369-381` (PluginLoader.load_from_directory) +- Create: `tests/test_plugin.py` + +- [ ] **Step 1: Write failing tests** + +Create `tests/test_plugin.py`: + +```python +import asyncio +import logging +import textwrap + +from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry + + +def test_plugin_skip_reason_defaults_none(tmp_path): + plugin_code = textwrap.dedent(""" + from hbd.client.plugin import MonitorPlugin + + class MinimalPlugin(MonitorPlugin): + name = "minimal" + version = "1.0.0" + interval = 60 + + async def initialize(self): + return True + + async def _collect_metrics(self): + return {} + """) + (tmp_path / "minimal.py").write_text(plugin_code) + registry = PluginRegistry() + loader = PluginLoader(registry) + asyncio.run(loader.load_from_directory(tmp_path)) + plugin = registry.get("minimal") + assert plugin is not None + assert plugin.skip_reason is None + + +def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog): + plugin_code = textwrap.dedent(""" + from hbd.client.plugin import MonitorPlugin + + class SkippablePlugin(MonitorPlugin): + name = "skippable" + version = "1.0.0" + interval = 60 + + async def initialize(self): + self.skip_reason = "not configured in yaml" + return False + + async def _collect_metrics(self): + return {} + """) + (tmp_path / "skippable.py").write_text(plugin_code) + registry = PluginRegistry() + loader = PluginLoader(registry) + + with caplog.at_level(logging.INFO, logger="plugin.loader"): + count = asyncio.run(loader.load_from_directory(tmp_path)) + + assert count == 0 + assert any("skipped: not configured in yaml" in r.message for r in caplog.records) + assert not any("failed initialization" in r.message for r in caplog.records) + + +def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog): + plugin_code = textwrap.dedent(""" + from hbd.client.plugin import MonitorPlugin + + class FailPlugin(MonitorPlugin): + name = "fail" + version = "1.0.0" + interval = 60 + + async def initialize(self): + return False + + async def _collect_metrics(self): + return {} + """) + (tmp_path / "fail_plugin.py").write_text(plugin_code) + registry = PluginRegistry() + loader = PluginLoader(registry) + + with caplog.at_level(logging.WARNING, logger="plugin.loader"): + count = asyncio.run(loader.load_from_directory(tmp_path)) + + assert count == 0 + assert any("failed initialization" in r.message for r in caplog.records) +``` + +- [ ] **Step 2: Run tests to verify they fail** + +```bash +python -m pytest tests/test_plugin.py -v +``` +Expected: `test_plugin_skip_reason_defaults_none` FAILS (attribute missing), others may error. + +- [ ] **Step 3: Add `skip_reason` to `Plugin.__init__`** + +In `hbd/client/plugin.py`, in `Plugin.__init__` (around line 46), add one line: + +```python +def __init__(self, config: Optional[Dict[str, Any]] = None): + self.config = config or {} + self.logger = logging.getLogger(f"plugin.{self.name}") + self._initialized = False + self.skip_reason: Optional[str] = None +``` + +- [ ] **Step 4: Update PluginLoader messaging** + +In `hbd/client/plugin.py`, replace the `if not initialized:` block (around line 372): + +```python + if not initialized: + if plugin.skip_reason: + self.logger.info( + f"Plugin {plugin.name} skipped: {plugin.skip_reason}" + ) + else: + self.logger.warning( + f"Plugin {plugin.name} failed initialization, skipping" + ) + continue +``` + +- [ ] **Step 5: Run tests to verify they pass** + +```bash +python -m pytest tests/test_plugin.py -v +``` +Expected: all 3 tests PASS. + +- [ ] **Step 6: Commit** + +```bash +git add hbd/client/plugin.py tests/test_plugin.py +git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging" +``` + +--- + +## Task 2: NagiosRunnerPlugin — skip_reason when no commands + +**Files:** +- Modify: `hbd/client/plugins/nagios_runner.py:88-105` (initialize) +- Modify: `tests/test_nagios_runner.py` (create) + +- [ ] **Step 1: Write failing test** + +Create `tests/test_nagios_runner.py`: + +```python +import asyncio +import logging +import os +import stat + +import pytest + +from hbd.client.plugins.nagios_runner import ( + NagiosRunnerPlugin, + NAGIOS_OK, + NAGIOS_WARNING, + NAGIOS_CRITICAL, + NAGIOS_UNKNOWN, +) + + +def test_no_commands_sets_skip_reason(): + plugin = NagiosRunnerPlugin(config={"commands": []}) + result = asyncio.run(plugin.initialize()) + assert result is False + assert plugin.skip_reason is not None + assert "nagios_runner.commands" in plugin.skip_reason +``` + +- [ ] **Step 2: Run test to verify it fails** + +```bash +python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v +``` +Expected: FAIL — `plugin.skip_reason` is `None`. + +- [ ] **Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()** + +In `hbd/client/plugins/nagios_runner.py`, replace the early-return block in `initialize()` (around line 96): + +```python + if not self.commands: + self.skip_reason = "no commands configured (add nagios_runner.commands to config)" + self.logger.info("No Nagios commands configured") + return False +``` + +- [ ] **Step 4: Run test to verify it passes** + +```bash +python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v +``` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py +git commit -m "feat: set skip_reason on nagios_runner when no commands configured" +``` + +--- + +## Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes + +**Files:** +- Modify: `hbd/client/plugins/nagios_runner.py` (imports + `_run_nagios_plugin`) +- Modify: `tests/test_nagios_runner.py` + +- [ ] **Step 1: Write failing tests** + +Append to `tests/test_nagios_runner.py`: + +```python +def test_stderr_used_when_stdout_empty(tmp_path): + script = tmp_path / "check_err.sh" + script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n") + script.chmod(script.stat().st_mode | stat.S_IEXEC) + + config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5} + plugin = NagiosRunnerPlugin(config=config) + asyncio.run(plugin.initialize()) + data = asyncio.run(plugin._collect_metrics()) + + assert "error from stderr" in data["t_output"] + assert data["t_status_code"] == NAGIOS_CRITICAL + + +def test_stderr_appended_when_both_present(tmp_path): + script = tmp_path / "check_both.sh" + script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n") + script.chmod(script.stat().st_mode | stat.S_IEXEC) + + config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5} + plugin = NagiosRunnerPlugin(config=config) + asyncio.run(plugin.initialize()) + data = asyncio.run(plugin._collect_metrics()) + + assert "OK - all good" in data["t_output"] + assert "extra detail" in data["t_output"] + assert data["t_status_code"] == NAGIOS_OK + + +def test_negative_returncode_maps_to_unknown(): + # kill -9 $$ kills the shell itself; asyncio sees returncode -9 + config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5} + plugin = NagiosRunnerPlugin(config=config) + asyncio.run(plugin.initialize()) + data = asyncio.run(plugin._collect_metrics()) + + assert data["t_status_code"] == NAGIOS_UNKNOWN + assert "signal" in data["t_output"].lower() +``` + +- [ ] **Step 2: Run tests to verify they fail** + +```bash +python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \ + tests/test_nagios_runner.py::test_stderr_appended_when_both_present \ + tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v +``` +Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes. + +- [ ] **Step 3: Update imports in nagios_runner.py** + +Replace the import block at the top of `hbd/client/plugins/nagios_runner.py`: + +```python +import asyncio +import os +import re +from typing import Any, Dict, List, Optional, Tuple + +from hbd.client.plugin import MonitorPlugin +``` + +(Remove `import subprocess`; add `import asyncio` and `import os`.) + +- [ ] **Step 4: Upgrade collection log level from DEBUG to INFO** + +In `hbd/client/plugins/nagios_runner.py`, in `_collect_metrics()`, change the debug log (around line 144) so results are visible at INFO level: + +```python + self.logger.info( + f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}" + ) +``` + +- [ ] **Step 5: Replace `_run_nagios_plugin` with async implementation** + +Replace the entire `_run_nagios_plugin` method in `hbd/client/plugins/nagios_runner.py`: + +```python + async def _run_nagios_plugin( + self, + command: str + ) -> Tuple[int, str, Dict[str, Any]]: + """Execute a Nagios plugin and parse its output.""" + try: + proc = await asyncio.create_subprocess_shell( + command, + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE, + ) + try: + stdout_bytes, stderr_bytes = await asyncio.wait_for( + proc.communicate(), timeout=self.timeout + ) + except asyncio.TimeoutError: + proc.kill() + await proc.communicate() + self.logger.error(f"Command timed out: {command}") + return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {} + + status_code = proc.returncode + + if status_code < 0: + return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {} + + if status_code > 3: + status_code = NAGIOS_UNKNOWN + + stdout = stdout_bytes.decode(errors="replace").strip() + stderr = stderr_bytes.decode(errors="replace").strip() + + # Parse perfdata from stdout before mixing in stderr + perfdata = self._parse_perfdata(stdout) + + # Build status message + status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout + + if not stdout and stderr: + output_msg = stderr + elif stdout and stderr: + output_msg = f"{status_part} [stderr: {stderr}]" + else: + output_msg = status_part + + return status_code, output_msg, perfdata + + except Exception as e: + self.logger.error(f"Error executing command: {e}") + return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {} +``` + +Also remove the now-unused `self.shell` line from `__init__` (the `shell` config key is no longer used since `create_subprocess_shell` always uses a shell): + +In `NagiosRunnerPlugin.__init__`, remove: +```python + self.shell: bool = config.get("shell", True) if config else True +``` + +- [ ] **Step 6: Run tests to verify they pass** + +```bash +python -m pytest tests/test_nagios_runner.py -v +``` +Expected: all tests PASS including the 3 new ones. + +- [ ] **Step 7: Commit** + +```bash +git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py +git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling" +``` + +--- + +## Task 4: NagiosRunnerPlugin — command path validation at init + +**Files:** +- Modify: `hbd/client/plugins/nagios_runner.py` (initialize) +- Modify: `tests/test_nagios_runner.py` + +- [ ] **Step 1: Write failing tests** + +Append to `tests/test_nagios_runner.py`: + +```python +def test_absolute_path_not_found_warns(caplog): + fake_cmd = "/nonexistent_hbc_test_path/check_something" + config = {"commands": [{"name": "t", "command": fake_cmd}]} + plugin = NagiosRunnerPlugin(config=config) + + with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"): + asyncio.run(plugin.initialize()) + + assert any("not found" in r.message for r in caplog.records) + + +def test_absolute_path_not_executable_warns(caplog, tmp_path): + non_exec = tmp_path / "check_test" + non_exec.write_text("#!/bin/sh\necho OK\n") + non_exec.chmod(0o644) # readable but not executable + + config = {"commands": [{"name": "t", "command": str(non_exec)}]} + plugin = NagiosRunnerPlugin(config=config) + + with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"): + asyncio.run(plugin.initialize()) + + assert any("not executable" in r.message for r in caplog.records) + + +def test_relative_path_not_checked(caplog): + # Relative paths (resolved via PATH) must not generate warnings + config = {"commands": [{"name": "t", "command": "echo OK"}]} + plugin = NagiosRunnerPlugin(config=config) + + with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"): + asyncio.run(plugin.initialize()) + + assert not any( + "not found" in r.message or "not executable" in r.message + for r in caplog.records + ) +``` + +- [ ] **Step 2: Run tests to verify they fail** + +```bash +python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \ + tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \ + tests/test_nagios_runner.py::test_relative_path_not_checked -v +``` +Expected: `test_absolute_path_not_found_warns` and `test_absolute_path_not_executable_warns` FAIL (no warnings logged); `test_relative_path_not_checked` may pass. + +- [ ] **Step 3: Add command path validation to `initialize()`** + +In `hbd/client/plugins/nagios_runner.py`, extend `initialize()` by adding validation after the existing "log each command" loop (after line 103, before `return True`): + +```python + # Validate absolute command paths early + for cmd_config in self.commands: + name = cmd_config.get("name", "unnamed") + command = cmd_config.get("command", "") + if not command: + continue + exe = command.split()[0] + if os.path.isabs(exe): + if not os.path.isfile(exe): + self.logger.warning( + f"Command '{name}': executable not found: {exe}" + ) + elif not os.access(exe, os.X_OK): + self.logger.warning( + f"Command '{name}': executable not executable: {exe}" + ) +``` + +- [ ] **Step 4: Run full test suite to verify all pass** + +```bash +python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v +``` +Expected: all tests PASS. + +- [ ] **Step 5: Commit** + +```bash +git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py +git commit -m "feat: validate absolute command paths at nagios_runner init" +``` + +--- + +## Task 5: Daemon mode logging — route to syslog after fork + +**Files:** +- Modify: `hbd/client/main.py` (new helper + updated daemon block) + +No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below. + +- [ ] **Step 1: Add `_reconfigure_logging_for_daemon` helper** + +In `hbd/client/main.py`, add this function just before `def build_parser()` (around line 589): + +```python +def _reconfigure_logging_for_daemon(log_level: int) -> None: + """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler.""" + from logging.handlers import SysLogHandler + + root = logging.getLogger() + for handler in root.handlers[:]: + root.removeHandler(handler) + handler.close() + + try: + syslog_handler = SysLogHandler( + address="/dev/log", + facility=SysLogHandler.LOG_DAEMON, + ) + except OSError: + syslog_handler = SysLogHandler( + address=("localhost", 514), + facility=SysLogHandler.LOG_DAEMON, + ) + # Attach the fallback first so the warning reaches syslog + syslog_handler.setFormatter( + logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s") + ) + root.addHandler(syslog_handler) + root.setLevel(log_level) + logging.warning("/dev/log not found, using syslog UDP localhost:514") + return + + syslog_handler.setFormatter( + logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s") + ) + root.addHandler(syslog_handler) + root.setLevel(log_level) +``` + +- [ ] **Step 2: Update the daemon block in `main()`** + +In `hbd/client/main.py`, replace the entire `if args.daemon:` block (lines 664–675): + +```python + if args.daemon: + print("Daemonizing...") + daemonize() + _reconfigure_logging_for_daemon(log_level) + logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}") +``` + +This removes the `import syslog`, `syslog.openlog()`, and `syslog.syslog()` calls (now handled by the logging system) and removes the no-op second `logging.basicConfig()` call. + +- [ ] **Step 3: Run existing test suite to confirm no regressions** + +```bash +python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v +``` +Expected: all tests still PASS. + +- [ ] **Step 4: Manual smoke test — verify syslog output in daemon mode** + +```bash +# In one terminal, tail syslog +sudo journalctl -f -t hbc + +# In another terminal, start hbc in daemon mode (replace HOST with a real or dummy host) +python -m hbd.client.main -d -v localhost + +# Expected in journalctl output: +# hbc[]: hbc.main INFO: Starting hbc for -> ['localhost'] +# hbc[]: hbc.main INFO: hbc starting, sending heartbeat to localhost +# hbc[]: plugin.loader INFO: ... + +# Stop the daemon +pkill -f "hbd.client.main" +``` + +- [ ] **Step 5: Commit** + +```bash +git add hbd/client/main.py +git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig" +```