Files
heartbeat/docs/superpowers/plans/2026-04-25-plugin-error-checking.md
T

19 KiB
Raw Blame History

Plugin Error Checking Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode.

Architecture: Three focused changes across three files: (1) hbd/client/plugin.py gains a skip_reason attribute on Plugin and updated PluginLoader messaging; (2) hbd/client/plugins/nagios_runner.py gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) hbd/client/main.py gains proper post-fork logging reconfiguration to syslog.

Tech Stack: Python 3.11+, asyncio, logging.handlers.SysLogHandler, pytest


File Map

Action Path What changes
Modify hbd/client/plugin.py Plugin.__init__ gains skip_reason; PluginLoader checks it
Modify hbd/client/plugins/nagios_runner.py async subprocess, stderr, signal codes, init validation, skip_reason
Modify hbd/client/main.py _reconfigure_logging_for_daemon() helper; remove redundant syslog calls
Create tests/test_plugin.py PluginLoader messaging tests
Create tests/test_nagios_runner.py NagiosRunnerPlugin behaviour tests

Run tests throughout with:

python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v

Task 1: Plugin.skip_reason + PluginLoader messaging

Files:

  • Modify: hbd/client/plugin.py:40-48 (Plugin.init)

  • Modify: hbd/client/plugin.py:369-381 (PluginLoader.load_from_directory)

  • Create: tests/test_plugin.py

  • Step 1: Write failing tests

Create tests/test_plugin.py:

import asyncio
import logging
import textwrap

from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry


def test_plugin_skip_reason_defaults_none(tmp_path):
    plugin_code = textwrap.dedent("""
        from hbd.client.plugin import MonitorPlugin

        class MinimalPlugin(MonitorPlugin):
            name = "minimal"
            version = "1.0.0"
            interval = 60

            async def initialize(self):
                return True

            async def _collect_metrics(self):
                return {}
    """)
    (tmp_path / "minimal.py").write_text(plugin_code)
    registry = PluginRegistry()
    loader = PluginLoader(registry)
    asyncio.run(loader.load_from_directory(tmp_path))
    plugin = registry.get("minimal")
    assert plugin is not None
    assert plugin.skip_reason is None


def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
    plugin_code = textwrap.dedent("""
        from hbd.client.plugin import MonitorPlugin

        class SkippablePlugin(MonitorPlugin):
            name = "skippable"
            version = "1.0.0"
            interval = 60

            async def initialize(self):
                self.skip_reason = "not configured in yaml"
                return False

            async def _collect_metrics(self):
                return {}
    """)
    (tmp_path / "skippable.py").write_text(plugin_code)
    registry = PluginRegistry()
    loader = PluginLoader(registry)

    with caplog.at_level(logging.INFO, logger="plugin.loader"):
        count = asyncio.run(loader.load_from_directory(tmp_path))

    assert count == 0
    assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
    assert not any("failed initialization" in r.message for r in caplog.records)


def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
    plugin_code = textwrap.dedent("""
        from hbd.client.plugin import MonitorPlugin

        class FailPlugin(MonitorPlugin):
            name = "fail"
            version = "1.0.0"
            interval = 60

            async def initialize(self):
                return False

            async def _collect_metrics(self):
                return {}
    """)
    (tmp_path / "fail_plugin.py").write_text(plugin_code)
    registry = PluginRegistry()
    loader = PluginLoader(registry)

    with caplog.at_level(logging.WARNING, logger="plugin.loader"):
        count = asyncio.run(loader.load_from_directory(tmp_path))

    assert count == 0
    assert any("failed initialization" in r.message for r in caplog.records)
  • Step 2: Run tests to verify they fail
python -m pytest tests/test_plugin.py -v

Expected: test_plugin_skip_reason_defaults_none FAILS (attribute missing), others may error.

  • Step 3: Add skip_reason to Plugin.__init__

In hbd/client/plugin.py, in Plugin.__init__ (around line 46), add one line:

def __init__(self, config: Optional[Dict[str, Any]] = None):
    self.config = config or {}
    self.logger = logging.getLogger(f"plugin.{self.name}")
    self._initialized = False
    self.skip_reason: Optional[str] = None
  • Step 4: Update PluginLoader messaging

In hbd/client/plugin.py, replace the if not initialized: block (around line 372):

                    if not initialized:
                        if plugin.skip_reason:
                            self.logger.info(
                                f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
                            )
                        else:
                            self.logger.warning(
                                f"Plugin {plugin.name} failed initialization, skipping"
                            )
                        continue
  • Step 5: Run tests to verify they pass
python -m pytest tests/test_plugin.py -v

Expected: all 3 tests PASS.

  • Step 6: Commit
git add hbd/client/plugin.py tests/test_plugin.py
git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging"

Task 2: NagiosRunnerPlugin — skip_reason when no commands

Files:

  • Modify: hbd/client/plugins/nagios_runner.py:88-105 (initialize)

  • Modify: tests/test_nagios_runner.py (create)

  • Step 1: Write failing test

Create tests/test_nagios_runner.py:

import asyncio
import logging
import os
import stat

import pytest

from hbd.client.plugins.nagios_runner import (
    NagiosRunnerPlugin,
    NAGIOS_OK,
    NAGIOS_WARNING,
    NAGIOS_CRITICAL,
    NAGIOS_UNKNOWN,
)


def test_no_commands_sets_skip_reason():
    plugin = NagiosRunnerPlugin(config={"commands": []})
    result = asyncio.run(plugin.initialize())
    assert result is False
    assert plugin.skip_reason is not None
    assert "nagios_runner.commands" in plugin.skip_reason
  • Step 2: Run test to verify it fails
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v

Expected: FAIL — plugin.skip_reason is None.

  • Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()

In hbd/client/plugins/nagios_runner.py, replace the early-return block in initialize() (around line 96):

        if not self.commands:
            self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
            self.logger.info("No Nagios commands configured")
            return False
  • Step 4: Run test to verify it passes
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v

Expected: PASS.

  • Step 5: Commit
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: set skip_reason on nagios_runner when no commands configured"

Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes

Files:

  • Modify: hbd/client/plugins/nagios_runner.py (imports + _run_nagios_plugin)

  • Modify: tests/test_nagios_runner.py

  • Step 1: Write failing tests

Append to tests/test_nagios_runner.py:

def test_stderr_used_when_stdout_empty(tmp_path):
    script = tmp_path / "check_err.sh"
    script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
    script.chmod(script.stat().st_mode | stat.S_IEXEC)

    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
    plugin = NagiosRunnerPlugin(config=config)
    asyncio.run(plugin.initialize())
    data = asyncio.run(plugin._collect_metrics())

    assert "error from stderr" in data["t_output"]
    assert data["t_status_code"] == NAGIOS_CRITICAL


def test_stderr_appended_when_both_present(tmp_path):
    script = tmp_path / "check_both.sh"
    script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
    script.chmod(script.stat().st_mode | stat.S_IEXEC)

    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
    plugin = NagiosRunnerPlugin(config=config)
    asyncio.run(plugin.initialize())
    data = asyncio.run(plugin._collect_metrics())

    assert "OK - all good" in data["t_output"]
    assert "extra detail" in data["t_output"]
    assert data["t_status_code"] == NAGIOS_OK


def test_negative_returncode_maps_to_unknown():
    # kill -9 $$ kills the shell itself; asyncio sees returncode -9
    config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
    plugin = NagiosRunnerPlugin(config=config)
    asyncio.run(plugin.initialize())
    data = asyncio.run(plugin._collect_metrics())

    assert data["t_status_code"] == NAGIOS_UNKNOWN
    assert "signal" in data["t_output"].lower()
  • Step 2: Run tests to verify they fail
python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \
    tests/test_nagios_runner.py::test_stderr_appended_when_both_present \
    tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v

Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes.

  • Step 3: Update imports in nagios_runner.py

Replace the import block at the top of hbd/client/plugins/nagios_runner.py:

import asyncio
import os
import re
from typing import Any, Dict, List, Optional, Tuple

from hbd.client.plugin import MonitorPlugin

(Remove import subprocess; add import asyncio and import os.)

  • Step 4: Upgrade collection log level from DEBUG to INFO

In hbd/client/plugins/nagios_runner.py, in _collect_metrics(), change the debug log (around line 144) so results are visible at INFO level:

                self.logger.info(
                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
                )
  • Step 5: Replace _run_nagios_plugin with async implementation

Replace the entire _run_nagios_plugin method in hbd/client/plugins/nagios_runner.py:

    async def _run_nagios_plugin(
        self,
        command: str
    ) -> Tuple[int, str, Dict[str, Any]]:
        """Execute a Nagios plugin and parse its output."""
        try:
            proc = await asyncio.create_subprocess_shell(
                command,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            try:
                stdout_bytes, stderr_bytes = await asyncio.wait_for(
                    proc.communicate(), timeout=self.timeout
                )
            except asyncio.TimeoutError:
                proc.kill()
                await proc.communicate()
                self.logger.error(f"Command timed out: {command}")
                return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}

            status_code = proc.returncode

            if status_code < 0:
                return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}

            if status_code > 3:
                status_code = NAGIOS_UNKNOWN

            stdout = stdout_bytes.decode(errors="replace").strip()
            stderr = stderr_bytes.decode(errors="replace").strip()

            # Parse perfdata from stdout before mixing in stderr
            perfdata = self._parse_perfdata(stdout)

            # Build status message
            status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout

            if not stdout and stderr:
                output_msg = stderr
            elif stdout and stderr:
                output_msg = f"{status_part} [stderr: {stderr}]"
            else:
                output_msg = status_part

            return status_code, output_msg, perfdata

        except Exception as e:
            self.logger.error(f"Error executing command: {e}")
            return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}

Also remove the now-unused self.shell line from __init__ (the shell config key is no longer used since create_subprocess_shell always uses a shell):

In NagiosRunnerPlugin.__init__, remove:

        self.shell: bool = config.get("shell", True) if config else True
  • Step 6: Run tests to verify they pass
python -m pytest tests/test_nagios_runner.py -v

Expected: all tests PASS including the 3 new ones.

  • Step 7: Commit
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling"

Task 4: NagiosRunnerPlugin — command path validation at init

Files:

  • Modify: hbd/client/plugins/nagios_runner.py (initialize)

  • Modify: tests/test_nagios_runner.py

  • Step 1: Write failing tests

Append to tests/test_nagios_runner.py:

def test_absolute_path_not_found_warns(caplog):
    fake_cmd = "/nonexistent_hbc_test_path/check_something"
    config = {"commands": [{"name": "t", "command": fake_cmd}]}
    plugin = NagiosRunnerPlugin(config=config)

    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
        asyncio.run(plugin.initialize())

    assert any("not found" in r.message for r in caplog.records)


def test_absolute_path_not_executable_warns(caplog, tmp_path):
    non_exec = tmp_path / "check_test"
    non_exec.write_text("#!/bin/sh\necho OK\n")
    non_exec.chmod(0o644)  # readable but not executable

    config = {"commands": [{"name": "t", "command": str(non_exec)}]}
    plugin = NagiosRunnerPlugin(config=config)

    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
        asyncio.run(plugin.initialize())

    assert any("not executable" in r.message for r in caplog.records)


def test_relative_path_not_checked(caplog):
    # Relative paths (resolved via PATH) must not generate warnings
    config = {"commands": [{"name": "t", "command": "echo OK"}]}
    plugin = NagiosRunnerPlugin(config=config)

    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
        asyncio.run(plugin.initialize())

    assert not any(
        "not found" in r.message or "not executable" in r.message
        for r in caplog.records
    )
  • Step 2: Run tests to verify they fail
python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \
    tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \
    tests/test_nagios_runner.py::test_relative_path_not_checked -v

Expected: test_absolute_path_not_found_warns and test_absolute_path_not_executable_warns FAIL (no warnings logged); test_relative_path_not_checked may pass.

  • Step 3: Add command path validation to initialize()

In hbd/client/plugins/nagios_runner.py, extend initialize() by adding validation after the existing "log each command" loop (after line 103, before return True):

        # Validate absolute command paths early
        for cmd_config in self.commands:
            name = cmd_config.get("name", "unnamed")
            command = cmd_config.get("command", "")
            if not command:
                continue
            exe = command.split()[0]
            if os.path.isabs(exe):
                if not os.path.isfile(exe):
                    self.logger.warning(
                        f"Command '{name}': executable not found: {exe}"
                    )
                elif not os.access(exe, os.X_OK):
                    self.logger.warning(
                        f"Command '{name}': executable not executable: {exe}"
                    )
  • Step 4: Run full test suite to verify all pass
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v

Expected: all tests PASS.

  • Step 5: Commit
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: validate absolute command paths at nagios_runner init"

Task 5: Daemon mode logging — route to syslog after fork

Files:

  • Modify: hbd/client/main.py (new helper + updated daemon block)

No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below.

  • Step 1: Add _reconfigure_logging_for_daemon helper

In hbd/client/main.py, add this function just before def build_parser() (around line 589):

def _reconfigure_logging_for_daemon(log_level: int) -> None:
    """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
    from logging.handlers import SysLogHandler

    root = logging.getLogger()
    for handler in root.handlers[:]:
        root.removeHandler(handler)
        handler.close()

    try:
        syslog_handler = SysLogHandler(
            address="/dev/log",
            facility=SysLogHandler.LOG_DAEMON,
        )
    except OSError:
        syslog_handler = SysLogHandler(
            address=("localhost", 514),
            facility=SysLogHandler.LOG_DAEMON,
        )
        # Attach the fallback first so the warning reaches syslog
        syslog_handler.setFormatter(
            logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
        )
        root.addHandler(syslog_handler)
        root.setLevel(log_level)
        logging.warning("/dev/log not found, using syslog UDP localhost:514")
        return

    syslog_handler.setFormatter(
        logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
    )
    root.addHandler(syslog_handler)
    root.setLevel(log_level)
  • Step 2: Update the daemon block in main()

In hbd/client/main.py, replace the entire if args.daemon: block (lines 664675):

    if args.daemon:
        print("Daemonizing...")
        daemonize()
        _reconfigure_logging_for_daemon(log_level)
        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")

This removes the import syslog, syslog.openlog(), and syslog.syslog() calls (now handled by the logging system) and removes the no-op second logging.basicConfig() call.

  • Step 3: Run existing test suite to confirm no regressions
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v

Expected: all tests still PASS.

  • Step 4: Manual smoke test — verify syslog output in daemon mode
# In one terminal, tail syslog
sudo journalctl -f -t hbc

# In another terminal, start hbc in daemon mode (replace HOST with a real or dummy host)
python -m hbd.client.main -d -v localhost

# Expected in journalctl output:
#   hbc[<pid>]: hbc.main INFO: Starting hbc for <hostname> -> ['localhost']
#   hbc[<pid>]: hbc.main INFO: hbc starting, sending heartbeat to localhost
#   hbc[<pid>]: plugin.loader INFO: ...

# Stop the daemon
pkill -f "hbd.client.main"
  • Step 5: Commit
git add hbd/client/main.py
git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig"