Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
19 KiB
Plugin Error Checking Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.
Goal: Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode.
Architecture: Three focused changes across three files: (1) hbd/client/plugin.py gains a skip_reason attribute on Plugin and updated PluginLoader messaging; (2) hbd/client/plugins/nagios_runner.py gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) hbd/client/main.py gains proper post-fork logging reconfiguration to syslog.
Tech Stack: Python 3.11+, asyncio, logging.handlers.SysLogHandler, pytest
File Map
| Action | Path | What changes |
|---|---|---|
| Modify | hbd/client/plugin.py |
Plugin.__init__ gains skip_reason; PluginLoader checks it |
| Modify | hbd/client/plugins/nagios_runner.py |
async subprocess, stderr, signal codes, init validation, skip_reason |
| Modify | hbd/client/main.py |
_reconfigure_logging_for_daemon() helper; remove redundant syslog calls |
| Create | tests/test_plugin.py |
PluginLoader messaging tests |
| Create | tests/test_nagios_runner.py |
NagiosRunnerPlugin behaviour tests |
Run tests throughout with:
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
Task 1: Plugin.skip_reason + PluginLoader messaging
Files:
-
Modify:
hbd/client/plugin.py:40-48(Plugin.init) -
Modify:
hbd/client/plugin.py:369-381(PluginLoader.load_from_directory) -
Create:
tests/test_plugin.py -
Step 1: Write failing tests
Create tests/test_plugin.py:
import asyncio
import logging
import textwrap
from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry
def test_plugin_skip_reason_defaults_none(tmp_path):
plugin_code = textwrap.dedent("""
from hbd.client.plugin import MonitorPlugin
class MinimalPlugin(MonitorPlugin):
name = "minimal"
version = "1.0.0"
interval = 60
async def initialize(self):
return True
async def _collect_metrics(self):
return {}
""")
(tmp_path / "minimal.py").write_text(plugin_code)
registry = PluginRegistry()
loader = PluginLoader(registry)
asyncio.run(loader.load_from_directory(tmp_path))
plugin = registry.get("minimal")
assert plugin is not None
assert plugin.skip_reason is None
def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
plugin_code = textwrap.dedent("""
from hbd.client.plugin import MonitorPlugin
class SkippablePlugin(MonitorPlugin):
name = "skippable"
version = "1.0.0"
interval = 60
async def initialize(self):
self.skip_reason = "not configured in yaml"
return False
async def _collect_metrics(self):
return {}
""")
(tmp_path / "skippable.py").write_text(plugin_code)
registry = PluginRegistry()
loader = PluginLoader(registry)
with caplog.at_level(logging.INFO, logger="plugin.loader"):
count = asyncio.run(loader.load_from_directory(tmp_path))
assert count == 0
assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
assert not any("failed initialization" in r.message for r in caplog.records)
def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
plugin_code = textwrap.dedent("""
from hbd.client.plugin import MonitorPlugin
class FailPlugin(MonitorPlugin):
name = "fail"
version = "1.0.0"
interval = 60
async def initialize(self):
return False
async def _collect_metrics(self):
return {}
""")
(tmp_path / "fail_plugin.py").write_text(plugin_code)
registry = PluginRegistry()
loader = PluginLoader(registry)
with caplog.at_level(logging.WARNING, logger="plugin.loader"):
count = asyncio.run(loader.load_from_directory(tmp_path))
assert count == 0
assert any("failed initialization" in r.message for r in caplog.records)
- Step 2: Run tests to verify they fail
python -m pytest tests/test_plugin.py -v
Expected: test_plugin_skip_reason_defaults_none FAILS (attribute missing), others may error.
- Step 3: Add
skip_reasontoPlugin.__init__
In hbd/client/plugin.py, in Plugin.__init__ (around line 46), add one line:
def __init__(self, config: Optional[Dict[str, Any]] = None):
self.config = config or {}
self.logger = logging.getLogger(f"plugin.{self.name}")
self._initialized = False
self.skip_reason: Optional[str] = None
- Step 4: Update PluginLoader messaging
In hbd/client/plugin.py, replace the if not initialized: block (around line 372):
if not initialized:
if plugin.skip_reason:
self.logger.info(
f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
)
else:
self.logger.warning(
f"Plugin {plugin.name} failed initialization, skipping"
)
continue
- Step 5: Run tests to verify they pass
python -m pytest tests/test_plugin.py -v
Expected: all 3 tests PASS.
- Step 6: Commit
git add hbd/client/plugin.py tests/test_plugin.py
git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging"
Task 2: NagiosRunnerPlugin — skip_reason when no commands
Files:
-
Modify:
hbd/client/plugins/nagios_runner.py:88-105(initialize) -
Modify:
tests/test_nagios_runner.py(create) -
Step 1: Write failing test
Create tests/test_nagios_runner.py:
import asyncio
import logging
import os
import stat
import pytest
from hbd.client.plugins.nagios_runner import (
NagiosRunnerPlugin,
NAGIOS_OK,
NAGIOS_WARNING,
NAGIOS_CRITICAL,
NAGIOS_UNKNOWN,
)
def test_no_commands_sets_skip_reason():
plugin = NagiosRunnerPlugin(config={"commands": []})
result = asyncio.run(plugin.initialize())
assert result is False
assert plugin.skip_reason is not None
assert "nagios_runner.commands" in plugin.skip_reason
- Step 2: Run test to verify it fails
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
Expected: FAIL — plugin.skip_reason is None.
- Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()
In hbd/client/plugins/nagios_runner.py, replace the early-return block in initialize() (around line 96):
if not self.commands:
self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
self.logger.info("No Nagios commands configured")
return False
- Step 4: Run test to verify it passes
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
Expected: PASS.
- Step 5: Commit
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: set skip_reason on nagios_runner when no commands configured"
Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes
Files:
-
Modify:
hbd/client/plugins/nagios_runner.py(imports +_run_nagios_plugin) -
Modify:
tests/test_nagios_runner.py -
Step 1: Write failing tests
Append to tests/test_nagios_runner.py:
def test_stderr_used_when_stdout_empty(tmp_path):
script = tmp_path / "check_err.sh"
script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
script.chmod(script.stat().st_mode | stat.S_IEXEC)
config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
plugin = NagiosRunnerPlugin(config=config)
asyncio.run(plugin.initialize())
data = asyncio.run(plugin._collect_metrics())
assert "error from stderr" in data["t_output"]
assert data["t_status_code"] == NAGIOS_CRITICAL
def test_stderr_appended_when_both_present(tmp_path):
script = tmp_path / "check_both.sh"
script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
script.chmod(script.stat().st_mode | stat.S_IEXEC)
config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
plugin = NagiosRunnerPlugin(config=config)
asyncio.run(plugin.initialize())
data = asyncio.run(plugin._collect_metrics())
assert "OK - all good" in data["t_output"]
assert "extra detail" in data["t_output"]
assert data["t_status_code"] == NAGIOS_OK
def test_negative_returncode_maps_to_unknown():
# kill -9 $$ kills the shell itself; asyncio sees returncode -9
config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
plugin = NagiosRunnerPlugin(config=config)
asyncio.run(plugin.initialize())
data = asyncio.run(plugin._collect_metrics())
assert data["t_status_code"] == NAGIOS_UNKNOWN
assert "signal" in data["t_output"].lower()
- Step 2: Run tests to verify they fail
python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \
tests/test_nagios_runner.py::test_stderr_appended_when_both_present \
tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v
Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes.
- Step 3: Update imports in nagios_runner.py
Replace the import block at the top of hbd/client/plugins/nagios_runner.py:
import asyncio
import os
import re
from typing import Any, Dict, List, Optional, Tuple
from hbd.client.plugin import MonitorPlugin
(Remove import subprocess; add import asyncio and import os.)
- Step 4: Upgrade collection log level from DEBUG to INFO
In hbd/client/plugins/nagios_runner.py, in _collect_metrics(), change the debug log (around line 144) so results are visible at INFO level:
self.logger.info(
f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
)
- Step 5: Replace
_run_nagios_pluginwith async implementation
Replace the entire _run_nagios_plugin method in hbd/client/plugins/nagios_runner.py:
async def _run_nagios_plugin(
self,
command: str
) -> Tuple[int, str, Dict[str, Any]]:
"""Execute a Nagios plugin and parse its output."""
try:
proc = await asyncio.create_subprocess_shell(
command,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:
stdout_bytes, stderr_bytes = await asyncio.wait_for(
proc.communicate(), timeout=self.timeout
)
except asyncio.TimeoutError:
proc.kill()
await proc.communicate()
self.logger.error(f"Command timed out: {command}")
return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
status_code = proc.returncode
if status_code < 0:
return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}
if status_code > 3:
status_code = NAGIOS_UNKNOWN
stdout = stdout_bytes.decode(errors="replace").strip()
stderr = stderr_bytes.decode(errors="replace").strip()
# Parse perfdata from stdout before mixing in stderr
perfdata = self._parse_perfdata(stdout)
# Build status message
status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout
if not stdout and stderr:
output_msg = stderr
elif stdout and stderr:
output_msg = f"{status_part} [stderr: {stderr}]"
else:
output_msg = status_part
return status_code, output_msg, perfdata
except Exception as e:
self.logger.error(f"Error executing command: {e}")
return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
Also remove the now-unused self.shell line from __init__ (the shell config key is no longer used since create_subprocess_shell always uses a shell):
In NagiosRunnerPlugin.__init__, remove:
self.shell: bool = config.get("shell", True) if config else True
- Step 6: Run tests to verify they pass
python -m pytest tests/test_nagios_runner.py -v
Expected: all tests PASS including the 3 new ones.
- Step 7: Commit
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling"
Task 4: NagiosRunnerPlugin — command path validation at init
Files:
-
Modify:
hbd/client/plugins/nagios_runner.py(initialize) -
Modify:
tests/test_nagios_runner.py -
Step 1: Write failing tests
Append to tests/test_nagios_runner.py:
def test_absolute_path_not_found_warns(caplog):
fake_cmd = "/nonexistent_hbc_test_path/check_something"
config = {"commands": [{"name": "t", "command": fake_cmd}]}
plugin = NagiosRunnerPlugin(config=config)
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
asyncio.run(plugin.initialize())
assert any("not found" in r.message for r in caplog.records)
def test_absolute_path_not_executable_warns(caplog, tmp_path):
non_exec = tmp_path / "check_test"
non_exec.write_text("#!/bin/sh\necho OK\n")
non_exec.chmod(0o644) # readable but not executable
config = {"commands": [{"name": "t", "command": str(non_exec)}]}
plugin = NagiosRunnerPlugin(config=config)
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
asyncio.run(plugin.initialize())
assert any("not executable" in r.message for r in caplog.records)
def test_relative_path_not_checked(caplog):
# Relative paths (resolved via PATH) must not generate warnings
config = {"commands": [{"name": "t", "command": "echo OK"}]}
plugin = NagiosRunnerPlugin(config=config)
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
asyncio.run(plugin.initialize())
assert not any(
"not found" in r.message or "not executable" in r.message
for r in caplog.records
)
- Step 2: Run tests to verify they fail
python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \
tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \
tests/test_nagios_runner.py::test_relative_path_not_checked -v
Expected: test_absolute_path_not_found_warns and test_absolute_path_not_executable_warns FAIL (no warnings logged); test_relative_path_not_checked may pass.
- Step 3: Add command path validation to
initialize()
In hbd/client/plugins/nagios_runner.py, extend initialize() by adding validation after the existing "log each command" loop (after line 103, before return True):
# Validate absolute command paths early
for cmd_config in self.commands:
name = cmd_config.get("name", "unnamed")
command = cmd_config.get("command", "")
if not command:
continue
exe = command.split()[0]
if os.path.isabs(exe):
if not os.path.isfile(exe):
self.logger.warning(
f"Command '{name}': executable not found: {exe}"
)
elif not os.access(exe, os.X_OK):
self.logger.warning(
f"Command '{name}': executable not executable: {exe}"
)
- Step 4: Run full test suite to verify all pass
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
Expected: all tests PASS.
- Step 5: Commit
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: validate absolute command paths at nagios_runner init"
Task 5: Daemon mode logging — route to syslog after fork
Files:
- Modify:
hbd/client/main.py(new helper + updated daemon block)
No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below.
- Step 1: Add
_reconfigure_logging_for_daemonhelper
In hbd/client/main.py, add this function just before def build_parser() (around line 589):
def _reconfigure_logging_for_daemon(log_level: int) -> None:
"""Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
from logging.handlers import SysLogHandler
root = logging.getLogger()
for handler in root.handlers[:]:
root.removeHandler(handler)
handler.close()
try:
syslog_handler = SysLogHandler(
address="/dev/log",
facility=SysLogHandler.LOG_DAEMON,
)
except OSError:
syslog_handler = SysLogHandler(
address=("localhost", 514),
facility=SysLogHandler.LOG_DAEMON,
)
# Attach the fallback first so the warning reaches syslog
syslog_handler.setFormatter(
logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
)
root.addHandler(syslog_handler)
root.setLevel(log_level)
logging.warning("/dev/log not found, using syslog UDP localhost:514")
return
syslog_handler.setFormatter(
logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
)
root.addHandler(syslog_handler)
root.setLevel(log_level)
- Step 2: Update the daemon block in
main()
In hbd/client/main.py, replace the entire if args.daemon: block (lines 664–675):
if args.daemon:
print("Daemonizing...")
daemonize()
_reconfigure_logging_for_daemon(log_level)
logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
This removes the import syslog, syslog.openlog(), and syslog.syslog() calls (now handled by the logging system) and removes the no-op second logging.basicConfig() call.
- Step 3: Run existing test suite to confirm no regressions
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
Expected: all tests still PASS.
- Step 4: Manual smoke test — verify syslog output in daemon mode
# In one terminal, tail syslog
sudo journalctl -f -t hbc
# In another terminal, start hbc in daemon mode (replace HOST with a real or dummy host)
python -m hbd.client.main -d -v localhost
# Expected in journalctl output:
# hbc[<pid>]: hbc.main INFO: Starting hbc for <hostname> -> ['localhost']
# hbc[<pid>]: hbc.main INFO: hbc starting, sending heartbeat to localhost
# hbc[<pid>]: plugin.loader INFO: ...
# Stop the daemon
pkill -f "hbd.client.main"
- Step 5: Commit
git add hbd/client/main.py
git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig"