heartbeat/docs/superpowers/plans/2026-04-25-plugin-error-checking.md

# Plugin Error Checking Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode.

**Architecture:** Three focused changes across three files: (1) `hbd/client/plugin.py` gains a `skip_reason` attribute on Plugin and updated PluginLoader messaging; (2) `hbd/client/plugins/nagios_runner.py` gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) `hbd/client/main.py` gains proper post-fork logging reconfiguration to syslog.

**Tech Stack:** Python 3.11+, asyncio, `logging.handlers.SysLogHandler`, pytest

---

## File Map

| Action | Path | What changes |
|---|---|---|
| Modify | `hbd/client/plugin.py` | `Plugin.__init__` gains `skip_reason`; `PluginLoader` checks it |
| Modify | `hbd/client/plugins/nagios_runner.py` | async subprocess, stderr, signal codes, init validation, `skip_reason` |
| Modify | `hbd/client/main.py` | `_reconfigure_logging_for_daemon()` helper; remove redundant syslog calls |
| Create | `tests/test_plugin.py` | PluginLoader messaging tests |
| Create | `tests/test_nagios_runner.py` | NagiosRunnerPlugin behaviour tests |

Run tests throughout with:
```bash
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
```

---

## Task 1: Plugin.skip_reason + PluginLoader messaging

**Files:**
- Modify: `hbd/client/plugin.py:40-48` (Plugin.__init__)
- Modify: `hbd/client/plugin.py:369-381` (PluginLoader.load_from_directory)
- Create: `tests/test_plugin.py`

- [ ] **Step 1: Write failing tests**

Create `tests/test_plugin.py`:

```python
import asyncio
import logging
import textwrap

from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry


def test_plugin_skip_reason_defaults_none(tmp_path):
    plugin_code = textwrap.dedent("""
        from hbd.client.plugin import MonitorPlugin

        class MinimalPlugin(MonitorPlugin):
            name = "minimal"
            version = "1.0.0"
            interval = 60

            async def initialize(self):
                return True

            async def _collect_metrics(self):
                return {}
    """)
    (tmp_path / "minimal.py").write_text(plugin_code)
    registry = PluginRegistry()
    loader = PluginLoader(registry)
    asyncio.run(loader.load_from_directory(tmp_path))
    plugin = registry.get("minimal")
    assert plugin is not None
    assert plugin.skip_reason is None


def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
    plugin_code = textwrap.dedent("""
        from hbd.client.plugin import MonitorPlugin

        class SkippablePlugin(MonitorPlugin):
            name = "skippable"
            version = "1.0.0"
            interval = 60

            async def initialize(self):
                self.skip_reason = "not configured in yaml"
                return False

            async def _collect_metrics(self):
                return {}
    """)
    (tmp_path / "skippable.py").write_text(plugin_code)
    registry = PluginRegistry()
    loader = PluginLoader(registry)

    with caplog.at_level(logging.INFO, logger="plugin.loader"):
        count = asyncio.run(loader.load_from_directory(tmp_path))

    assert count == 0
    assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
    assert not any("failed initialization" in r.message for r in caplog.records)


def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
    plugin_code = textwrap.dedent("""
        from hbd.client.plugin import MonitorPlugin

        class FailPlugin(MonitorPlugin):
            name = "fail"
            version = "1.0.0"
            interval = 60

            async def initialize(self):
                return False

            async def _collect_metrics(self):
                return {}
    """)
    (tmp_path / "fail_plugin.py").write_text(plugin_code)
    registry = PluginRegistry()
    loader = PluginLoader(registry)

    with caplog.at_level(logging.WARNING, logger="plugin.loader"):
        count = asyncio.run(loader.load_from_directory(tmp_path))

    assert count == 0
    assert any("failed initialization" in r.message for r in caplog.records)
```

- [ ] **Step 2: Run tests to verify they fail**

```bash
python -m pytest tests/test_plugin.py -v
```
Expected: `test_plugin_skip_reason_defaults_none` FAILS (attribute missing), others may error.

- [ ] **Step 3: Add `skip_reason` to `Plugin.__init__`**

In `hbd/client/plugin.py`, in `Plugin.__init__` (around line 46), add one line:

```python
def __init__(self, config: Optional[Dict[str, Any]] = None):
    self.config = config or {}
    self.logger = logging.getLogger(f"plugin.{self.name}")
    self._initialized = False
    self.skip_reason: Optional[str] = None
```

- [ ] **Step 4: Update PluginLoader messaging**

In `hbd/client/plugin.py`, replace the `if not initialized:` block (around line 372):

```python
                    if not initialized:
                        if plugin.skip_reason:
                            self.logger.info(
                                f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
                            )
                        else:
                            self.logger.warning(
                                f"Plugin {plugin.name} failed initialization, skipping"
                            )
                        continue
```

- [ ] **Step 5: Run tests to verify they pass**

```bash
python -m pytest tests/test_plugin.py -v
```
Expected: all 3 tests PASS.

- [ ] **Step 6: Commit**

```bash
git add hbd/client/plugin.py tests/test_plugin.py
git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging"
```

---

## Task 2: NagiosRunnerPlugin — skip_reason when no commands

**Files:**
- Modify: `hbd/client/plugins/nagios_runner.py:88-105` (initialize)
- Modify: `tests/test_nagios_runner.py` (create)

- [ ] **Step 1: Write failing test**

Create `tests/test_nagios_runner.py`:

```python
import asyncio
import logging
import os
import stat

import pytest

from hbd.client.plugins.nagios_runner import (
    NagiosRunnerPlugin,
    NAGIOS_OK,
    NAGIOS_WARNING,
    NAGIOS_CRITICAL,
    NAGIOS_UNKNOWN,
)


def test_no_commands_sets_skip_reason():
    plugin = NagiosRunnerPlugin(config={"commands": []})
    result = asyncio.run(plugin.initialize())
    assert result is False
    assert plugin.skip_reason is not None
    assert "nagios_runner.commands" in plugin.skip_reason
```

- [ ] **Step 2: Run test to verify it fails**

```bash
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
```
Expected: FAIL — `plugin.skip_reason` is `None`.

- [ ] **Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()**

In `hbd/client/plugins/nagios_runner.py`, replace the early-return block in `initialize()` (around line 96):

```python
        if not self.commands:
            self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
            self.logger.info("No Nagios commands configured")
            return False
```

- [ ] **Step 4: Run test to verify it passes**

```bash
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
```
Expected: PASS.

- [ ] **Step 5: Commit**

```bash
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: set skip_reason on nagios_runner when no commands configured"
```

---

## Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes

**Files:**
- Modify: `hbd/client/plugins/nagios_runner.py` (imports + `_run_nagios_plugin`)
- Modify: `tests/test_nagios_runner.py`

- [ ] **Step 1: Write failing tests**

Append to `tests/test_nagios_runner.py`:

```python
def test_stderr_used_when_stdout_empty(tmp_path):
    script = tmp_path / "check_err.sh"
    script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
    script.chmod(script.stat().st_mode | stat.S_IEXEC)

    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
    plugin = NagiosRunnerPlugin(config=config)
    asyncio.run(plugin.initialize())
    data = asyncio.run(plugin._collect_metrics())

    assert "error from stderr" in data["t_output"]
    assert data["t_status_code"] == NAGIOS_CRITICAL


def test_stderr_appended_when_both_present(tmp_path):
    script = tmp_path / "check_both.sh"
    script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
    script.chmod(script.stat().st_mode | stat.S_IEXEC)

    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
    plugin = NagiosRunnerPlugin(config=config)
    asyncio.run(plugin.initialize())
    data = asyncio.run(plugin._collect_metrics())

    assert "OK - all good" in data["t_output"]
    assert "extra detail" in data["t_output"]
    assert data["t_status_code"] == NAGIOS_OK


def test_negative_returncode_maps_to_unknown():
    # kill -9 $$ kills the shell itself; asyncio sees returncode -9
    config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
    plugin = NagiosRunnerPlugin(config=config)
    asyncio.run(plugin.initialize())
    data = asyncio.run(plugin._collect_metrics())

    assert data["t_status_code"] == NAGIOS_UNKNOWN
    assert "signal" in data["t_output"].lower()
```

- [ ] **Step 2: Run tests to verify they fail**

```bash
python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \
    tests/test_nagios_runner.py::test_stderr_appended_when_both_present \
    tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v
```
Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes.

- [ ] **Step 3: Update imports in nagios_runner.py**

Replace the import block at the top of `hbd/client/plugins/nagios_runner.py`:

```python
import asyncio
import os
import re
from typing import Any, Dict, List, Optional, Tuple

from hbd.client.plugin import MonitorPlugin
```

(Remove `import subprocess`; add `import asyncio` and `import os`.)

- [ ] **Step 4: Upgrade collection log level from DEBUG to INFO**

In `hbd/client/plugins/nagios_runner.py`, in `_collect_metrics()`, change the debug log (around line 144) so results are visible at INFO level:

```python
                self.logger.info(
                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
                )
```

- [ ] **Step 5: Replace `_run_nagios_plugin` with async implementation**

Replace the entire `_run_nagios_plugin` method in `hbd/client/plugins/nagios_runner.py`:

```python
    async def _run_nagios_plugin(
        self,
        command: str
    ) -> Tuple[int, str, Dict[str, Any]]:
        """Execute a Nagios plugin and parse its output."""
        try:
            proc = await asyncio.create_subprocess_shell(
                command,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            try:
                stdout_bytes, stderr_bytes = await asyncio.wait_for(
                    proc.communicate(), timeout=self.timeout
                )
            except asyncio.TimeoutError:
                proc.kill()
                await proc.communicate()
                self.logger.error(f"Command timed out: {command}")
                return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}

            status_code = proc.returncode

            if status_code < 0:
                return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}

            if status_code > 3:
                status_code = NAGIOS_UNKNOWN

            stdout = stdout_bytes.decode(errors="replace").strip()
            stderr = stderr_bytes.decode(errors="replace").strip()

            # Parse perfdata from stdout before mixing in stderr
            perfdata = self._parse_perfdata(stdout)

            # Build status message
            status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout

            if not stdout and stderr:
                output_msg = stderr
            elif stdout and stderr:
                output_msg = f"{status_part} [stderr: {stderr}]"
            else:
                output_msg = status_part

            return status_code, output_msg, perfdata

        except Exception as e:
            self.logger.error(f"Error executing command: {e}")
            return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
```

Also remove the now-unused `self.shell` line from `__init__` (the `shell` config key is no longer used since `create_subprocess_shell` always uses a shell):

In `NagiosRunnerPlugin.__init__`, remove:
```python
        self.shell: bool = config.get("shell", True) if config else True
```

- [ ] **Step 6: Run tests to verify they pass**

```bash
python -m pytest tests/test_nagios_runner.py -v
```
Expected: all tests PASS including the 3 new ones.

- [ ] **Step 7: Commit**

```bash
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling"
```

---

## Task 4: NagiosRunnerPlugin — command path validation at init

**Files:**
- Modify: `hbd/client/plugins/nagios_runner.py` (initialize)
- Modify: `tests/test_nagios_runner.py`

- [ ] **Step 1: Write failing tests**

Append to `tests/test_nagios_runner.py`:

```python
def test_absolute_path_not_found_warns(caplog):
    fake_cmd = "/nonexistent_hbc_test_path/check_something"
    config = {"commands": [{"name": "t", "command": fake_cmd}]}
    plugin = NagiosRunnerPlugin(config=config)

    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
        asyncio.run(plugin.initialize())

    assert any("not found" in r.message for r in caplog.records)


def test_absolute_path_not_executable_warns(caplog, tmp_path):
    non_exec = tmp_path / "check_test"
    non_exec.write_text("#!/bin/sh\necho OK\n")
    non_exec.chmod(0o644)  # readable but not executable

    config = {"commands": [{"name": "t", "command": str(non_exec)}]}
    plugin = NagiosRunnerPlugin(config=config)

    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
        asyncio.run(plugin.initialize())

    assert any("not executable" in r.message for r in caplog.records)


def test_relative_path_not_checked(caplog):
    # Relative paths (resolved via PATH) must not generate warnings
    config = {"commands": [{"name": "t", "command": "echo OK"}]}
    plugin = NagiosRunnerPlugin(config=config)

    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
        asyncio.run(plugin.initialize())

    assert not any(
        "not found" in r.message or "not executable" in r.message
        for r in caplog.records
    )
```

- [ ] **Step 2: Run tests to verify they fail**

```bash
python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \
    tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \
    tests/test_nagios_runner.py::test_relative_path_not_checked -v
```
Expected: `test_absolute_path_not_found_warns` and `test_absolute_path_not_executable_warns` FAIL (no warnings logged); `test_relative_path_not_checked` may pass.

- [ ] **Step 3: Add command path validation to `initialize()`**

In `hbd/client/plugins/nagios_runner.py`, extend `initialize()` by adding validation after the existing "log each command" loop (after line 103, before `return True`):

```python
        # Validate absolute command paths early
        for cmd_config in self.commands:
            name = cmd_config.get("name", "unnamed")
            command = cmd_config.get("command", "")
            if not command:
                continue
            exe = command.split()[0]
            if os.path.isabs(exe):
                if not os.path.isfile(exe):
                    self.logger.warning(
                        f"Command '{name}': executable not found: {exe}"
                    )
                elif not os.access(exe, os.X_OK):
                    self.logger.warning(
                        f"Command '{name}': executable not executable: {exe}"
                    )
```

- [ ] **Step 4: Run full test suite to verify all pass**

```bash
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
```
Expected: all tests PASS.

- [ ] **Step 5: Commit**

```bash
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: validate absolute command paths at nagios_runner init"
```

---

## Task 5: Daemon mode logging — route to syslog after fork

**Files:**
- Modify: `hbd/client/main.py` (new helper + updated daemon block)

No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below.

- [ ] **Step 1: Add `_reconfigure_logging_for_daemon` helper**

In `hbd/client/main.py`, add this function just before `def build_parser()` (around line 589):

```python
def _reconfigure_logging_for_daemon(log_level: int) -> None:
    """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
    from logging.handlers import SysLogHandler

    root = logging.getLogger()
    for handler in root.handlers[:]:
        root.removeHandler(handler)
        handler.close()

    try:
        syslog_handler = SysLogHandler(
            address="/dev/log",
            facility=SysLogHandler.LOG_DAEMON,
        )
    except OSError:
        syslog_handler = SysLogHandler(
            address=("localhost", 514),
            facility=SysLogHandler.LOG_DAEMON,
        )
        # Attach the fallback first so the warning reaches syslog
        syslog_handler.setFormatter(
            logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
        )
        root.addHandler(syslog_handler)
        root.setLevel(log_level)
        logging.warning("/dev/log not found, using syslog UDP localhost:514")
        return

    syslog_handler.setFormatter(
        logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
    )
    root.addHandler(syslog_handler)
    root.setLevel(log_level)
```

- [ ] **Step 2: Update the daemon block in `main()`**

In `hbd/client/main.py`, replace the entire `if args.daemon:` block (lines 664–675):

```python
    if args.daemon:
        print("Daemonizing...")
        daemonize()
        _reconfigure_logging_for_daemon(log_level)
        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
```

This removes the `import syslog`, `syslog.openlog()`, and `syslog.syslog()` calls (now handled by the logging system) and removes the no-op second `logging.basicConfig()` call.

- [ ] **Step 3: Run existing test suite to confirm no regressions**

```bash
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
```
Expected: all tests still PASS.

- [ ] **Step 4: Manual smoke test — verify syslog output in daemon mode**

```bash
# In one terminal, tail syslog
sudo journalctl -f -t hbc

# In another terminal, start hbc in daemon mode (replace HOST with a real or dummy host)
python -m hbd.client.main -d -v localhost

# Expected in journalctl output:
#   hbc[<pid>]: hbc.main INFO: Starting hbc for <hostname> -> ['localhost']
#   hbc[<pid>]: hbc.main INFO: hbc starting, sending heartbeat to localhost
#   hbc[<pid>]: plugin.loader INFO: ...

# Stop the daemon
pkill -f "hbd.client.main"
```

- [ ] **Step 5: Commit**

```bash
git add hbd/client/main.py
git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig"
```