# Plugin Error Checking Implementation Plan > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. **Goal:** Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode. **Architecture:** Three focused changes across three files: (1) `hbd/client/plugin.py` gains a `skip_reason` attribute on Plugin and updated PluginLoader messaging; (2) `hbd/client/plugins/nagios_runner.py` gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) `hbd/client/main.py` gains proper post-fork logging reconfiguration to syslog. **Tech Stack:** Python 3.11+, asyncio, `logging.handlers.SysLogHandler`, pytest --- ## File Map | Action | Path | What changes | |---|---|---| | Modify | `hbd/client/plugin.py` | `Plugin.__init__` gains `skip_reason`; `PluginLoader` checks it | | Modify | `hbd/client/plugins/nagios_runner.py` | async subprocess, stderr, signal codes, init validation, `skip_reason` | | Modify | `hbd/client/main.py` | `_reconfigure_logging_for_daemon()` helper; remove redundant syslog calls | | Create | `tests/test_plugin.py` | PluginLoader messaging tests | | Create | `tests/test_nagios_runner.py` | NagiosRunnerPlugin behaviour tests | Run tests throughout with: ```bash python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v ``` --- ## Task 1: Plugin.skip_reason + PluginLoader messaging **Files:** - Modify: `hbd/client/plugin.py:40-48` (Plugin.__init__) - Modify: `hbd/client/plugin.py:369-381` (PluginLoader.load_from_directory) - Create: `tests/test_plugin.py` - [ ] **Step 1: Write failing tests** Create `tests/test_plugin.py`: ```python import asyncio import logging import textwrap from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry def test_plugin_skip_reason_defaults_none(tmp_path): plugin_code = textwrap.dedent(""" from hbd.client.plugin import MonitorPlugin class MinimalPlugin(MonitorPlugin): name = "minimal" version = "1.0.0" interval = 60 async def initialize(self): return True async def _collect_metrics(self): return {} """) (tmp_path / "minimal.py").write_text(plugin_code) registry = PluginRegistry() loader = PluginLoader(registry) asyncio.run(loader.load_from_directory(tmp_path)) plugin = registry.get("minimal") assert plugin is not None assert plugin.skip_reason is None def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog): plugin_code = textwrap.dedent(""" from hbd.client.plugin import MonitorPlugin class SkippablePlugin(MonitorPlugin): name = "skippable" version = "1.0.0" interval = 60 async def initialize(self): self.skip_reason = "not configured in yaml" return False async def _collect_metrics(self): return {} """) (tmp_path / "skippable.py").write_text(plugin_code) registry = PluginRegistry() loader = PluginLoader(registry) with caplog.at_level(logging.INFO, logger="plugin.loader"): count = asyncio.run(loader.load_from_directory(tmp_path)) assert count == 0 assert any("skipped: not configured in yaml" in r.message for r in caplog.records) assert not any("failed initialization" in r.message for r in caplog.records) def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog): plugin_code = textwrap.dedent(""" from hbd.client.plugin import MonitorPlugin class FailPlugin(MonitorPlugin): name = "fail" version = "1.0.0" interval = 60 async def initialize(self): return False async def _collect_metrics(self): return {} """) (tmp_path / "fail_plugin.py").write_text(plugin_code) registry = PluginRegistry() loader = PluginLoader(registry) with caplog.at_level(logging.WARNING, logger="plugin.loader"): count = asyncio.run(loader.load_from_directory(tmp_path)) assert count == 0 assert any("failed initialization" in r.message for r in caplog.records) ``` - [ ] **Step 2: Run tests to verify they fail** ```bash python -m pytest tests/test_plugin.py -v ``` Expected: `test_plugin_skip_reason_defaults_none` FAILS (attribute missing), others may error. - [ ] **Step 3: Add `skip_reason` to `Plugin.__init__`** In `hbd/client/plugin.py`, in `Plugin.__init__` (around line 46), add one line: ```python def __init__(self, config: Optional[Dict[str, Any]] = None): self.config = config or {} self.logger = logging.getLogger(f"plugin.{self.name}") self._initialized = False self.skip_reason: Optional[str] = None ``` - [ ] **Step 4: Update PluginLoader messaging** In `hbd/client/plugin.py`, replace the `if not initialized:` block (around line 372): ```python if not initialized: if plugin.skip_reason: self.logger.info( f"Plugin {plugin.name} skipped: {plugin.skip_reason}" ) else: self.logger.warning( f"Plugin {plugin.name} failed initialization, skipping" ) continue ``` - [ ] **Step 5: Run tests to verify they pass** ```bash python -m pytest tests/test_plugin.py -v ``` Expected: all 3 tests PASS. - [ ] **Step 6: Commit** ```bash git add hbd/client/plugin.py tests/test_plugin.py git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging" ``` --- ## Task 2: NagiosRunnerPlugin — skip_reason when no commands **Files:** - Modify: `hbd/client/plugins/nagios_runner.py:88-105` (initialize) - Modify: `tests/test_nagios_runner.py` (create) - [ ] **Step 1: Write failing test** Create `tests/test_nagios_runner.py`: ```python import asyncio import logging import os import stat import pytest from hbd.client.plugins.nagios_runner import ( NagiosRunnerPlugin, NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN, ) def test_no_commands_sets_skip_reason(): plugin = NagiosRunnerPlugin(config={"commands": []}) result = asyncio.run(plugin.initialize()) assert result is False assert plugin.skip_reason is not None assert "nagios_runner.commands" in plugin.skip_reason ``` - [ ] **Step 2: Run test to verify it fails** ```bash python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v ``` Expected: FAIL — `plugin.skip_reason` is `None`. - [ ] **Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()** In `hbd/client/plugins/nagios_runner.py`, replace the early-return block in `initialize()` (around line 96): ```python if not self.commands: self.skip_reason = "no commands configured (add nagios_runner.commands to config)" self.logger.info("No Nagios commands configured") return False ``` - [ ] **Step 4: Run test to verify it passes** ```bash python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v ``` Expected: PASS. - [ ] **Step 5: Commit** ```bash git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py git commit -m "feat: set skip_reason on nagios_runner when no commands configured" ``` --- ## Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes **Files:** - Modify: `hbd/client/plugins/nagios_runner.py` (imports + `_run_nagios_plugin`) - Modify: `tests/test_nagios_runner.py` - [ ] **Step 1: Write failing tests** Append to `tests/test_nagios_runner.py`: ```python def test_stderr_used_when_stdout_empty(tmp_path): script = tmp_path / "check_err.sh" script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n") script.chmod(script.stat().st_mode | stat.S_IEXEC) config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5} plugin = NagiosRunnerPlugin(config=config) asyncio.run(plugin.initialize()) data = asyncio.run(plugin._collect_metrics()) assert "error from stderr" in data["t_output"] assert data["t_status_code"] == NAGIOS_CRITICAL def test_stderr_appended_when_both_present(tmp_path): script = tmp_path / "check_both.sh" script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n") script.chmod(script.stat().st_mode | stat.S_IEXEC) config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5} plugin = NagiosRunnerPlugin(config=config) asyncio.run(plugin.initialize()) data = asyncio.run(plugin._collect_metrics()) assert "OK - all good" in data["t_output"] assert "extra detail" in data["t_output"] assert data["t_status_code"] == NAGIOS_OK def test_negative_returncode_maps_to_unknown(): # kill -9 $$ kills the shell itself; asyncio sees returncode -9 config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5} plugin = NagiosRunnerPlugin(config=config) asyncio.run(plugin.initialize()) data = asyncio.run(plugin._collect_metrics()) assert data["t_status_code"] == NAGIOS_UNKNOWN assert "signal" in data["t_output"].lower() ``` - [ ] **Step 2: Run tests to verify they fail** ```bash python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \ tests/test_nagios_runner.py::test_stderr_appended_when_both_present \ tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v ``` Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes. - [ ] **Step 3: Update imports in nagios_runner.py** Replace the import block at the top of `hbd/client/plugins/nagios_runner.py`: ```python import asyncio import os import re from typing import Any, Dict, List, Optional, Tuple from hbd.client.plugin import MonitorPlugin ``` (Remove `import subprocess`; add `import asyncio` and `import os`.) - [ ] **Step 4: Upgrade collection log level from DEBUG to INFO** In `hbd/client/plugins/nagios_runner.py`, in `_collect_metrics()`, change the debug log (around line 144) so results are visible at INFO level: ```python self.logger.info( f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}" ) ``` - [ ] **Step 5: Replace `_run_nagios_plugin` with async implementation** Replace the entire `_run_nagios_plugin` method in `hbd/client/plugins/nagios_runner.py`: ```python async def _run_nagios_plugin( self, command: str ) -> Tuple[int, str, Dict[str, Any]]: """Execute a Nagios plugin and parse its output.""" try: proc = await asyncio.create_subprocess_shell( command, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE, ) try: stdout_bytes, stderr_bytes = await asyncio.wait_for( proc.communicate(), timeout=self.timeout ) except asyncio.TimeoutError: proc.kill() await proc.communicate() self.logger.error(f"Command timed out: {command}") return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {} status_code = proc.returncode if status_code < 0: return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {} if status_code > 3: status_code = NAGIOS_UNKNOWN stdout = stdout_bytes.decode(errors="replace").strip() stderr = stderr_bytes.decode(errors="replace").strip() # Parse perfdata from stdout before mixing in stderr perfdata = self._parse_perfdata(stdout) # Build status message status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout if not stdout and stderr: output_msg = stderr elif stdout and stderr: output_msg = f"{status_part} [stderr: {stderr}]" else: output_msg = status_part return status_code, output_msg, perfdata except Exception as e: self.logger.error(f"Error executing command: {e}") return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {} ``` Also remove the now-unused `self.shell` line from `__init__` (the `shell` config key is no longer used since `create_subprocess_shell` always uses a shell): In `NagiosRunnerPlugin.__init__`, remove: ```python self.shell: bool = config.get("shell", True) if config else True ``` - [ ] **Step 6: Run tests to verify they pass** ```bash python -m pytest tests/test_nagios_runner.py -v ``` Expected: all tests PASS including the 3 new ones. - [ ] **Step 7: Commit** ```bash git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling" ``` --- ## Task 4: NagiosRunnerPlugin — command path validation at init **Files:** - Modify: `hbd/client/plugins/nagios_runner.py` (initialize) - Modify: `tests/test_nagios_runner.py` - [ ] **Step 1: Write failing tests** Append to `tests/test_nagios_runner.py`: ```python def test_absolute_path_not_found_warns(caplog): fake_cmd = "/nonexistent_hbc_test_path/check_something" config = {"commands": [{"name": "t", "command": fake_cmd}]} plugin = NagiosRunnerPlugin(config=config) with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"): asyncio.run(plugin.initialize()) assert any("not found" in r.message for r in caplog.records) def test_absolute_path_not_executable_warns(caplog, tmp_path): non_exec = tmp_path / "check_test" non_exec.write_text("#!/bin/sh\necho OK\n") non_exec.chmod(0o644) # readable but not executable config = {"commands": [{"name": "t", "command": str(non_exec)}]} plugin = NagiosRunnerPlugin(config=config) with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"): asyncio.run(plugin.initialize()) assert any("not executable" in r.message for r in caplog.records) def test_relative_path_not_checked(caplog): # Relative paths (resolved via PATH) must not generate warnings config = {"commands": [{"name": "t", "command": "echo OK"}]} plugin = NagiosRunnerPlugin(config=config) with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"): asyncio.run(plugin.initialize()) assert not any( "not found" in r.message or "not executable" in r.message for r in caplog.records ) ``` - [ ] **Step 2: Run tests to verify they fail** ```bash python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \ tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \ tests/test_nagios_runner.py::test_relative_path_not_checked -v ``` Expected: `test_absolute_path_not_found_warns` and `test_absolute_path_not_executable_warns` FAIL (no warnings logged); `test_relative_path_not_checked` may pass. - [ ] **Step 3: Add command path validation to `initialize()`** In `hbd/client/plugins/nagios_runner.py`, extend `initialize()` by adding validation after the existing "log each command" loop (after line 103, before `return True`): ```python # Validate absolute command paths early for cmd_config in self.commands: name = cmd_config.get("name", "unnamed") command = cmd_config.get("command", "") if not command: continue exe = command.split()[0] if os.path.isabs(exe): if not os.path.isfile(exe): self.logger.warning( f"Command '{name}': executable not found: {exe}" ) elif not os.access(exe, os.X_OK): self.logger.warning( f"Command '{name}': executable not executable: {exe}" ) ``` - [ ] **Step 4: Run full test suite to verify all pass** ```bash python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v ``` Expected: all tests PASS. - [ ] **Step 5: Commit** ```bash git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py git commit -m "feat: validate absolute command paths at nagios_runner init" ``` --- ## Task 5: Daemon mode logging — route to syslog after fork **Files:** - Modify: `hbd/client/main.py` (new helper + updated daemon block) No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below. - [ ] **Step 1: Add `_reconfigure_logging_for_daemon` helper** In `hbd/client/main.py`, add this function just before `def build_parser()` (around line 589): ```python def _reconfigure_logging_for_daemon(log_level: int) -> None: """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler.""" from logging.handlers import SysLogHandler root = logging.getLogger() for handler in root.handlers[:]: root.removeHandler(handler) handler.close() try: syslog_handler = SysLogHandler( address="/dev/log", facility=SysLogHandler.LOG_DAEMON, ) except OSError: syslog_handler = SysLogHandler( address=("localhost", 514), facility=SysLogHandler.LOG_DAEMON, ) # Attach the fallback first so the warning reaches syslog syslog_handler.setFormatter( logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s") ) root.addHandler(syslog_handler) root.setLevel(log_level) logging.warning("/dev/log not found, using syslog UDP localhost:514") return syslog_handler.setFormatter( logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s") ) root.addHandler(syslog_handler) root.setLevel(log_level) ``` - [ ] **Step 2: Update the daemon block in `main()`** In `hbd/client/main.py`, replace the entire `if args.daemon:` block (lines 664–675): ```python if args.daemon: print("Daemonizing...") daemonize() _reconfigure_logging_for_daemon(log_level) logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}") ``` This removes the `import syslog`, `syslog.openlog()`, and `syslog.syslog()` calls (now handled by the logging system) and removes the no-op second `logging.basicConfig()` call. - [ ] **Step 3: Run existing test suite to confirm no regressions** ```bash python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v ``` Expected: all tests still PASS. - [ ] **Step 4: Manual smoke test — verify syslog output in daemon mode** ```bash # In one terminal, tail syslog sudo journalctl -f -t hbc # In another terminal, start hbc in daemon mode (replace HOST with a real or dummy host) python -m hbd.client.main -d -v localhost # Expected in journalctl output: # hbc[]: hbc.main INFO: Starting hbc for -> ['localhost'] # hbc[]: hbc.main INFO: hbc starting, sending heartbeat to localhost # hbc[]: plugin.loader INFO: ... # Stop the daemon pkill -f "hbd.client.main" ``` - [ ] **Step 5: Commit** ```bash git add hbd/client/main.py git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig" ```