From 293461f3f680e8e1ae453ff4a939603a1ea6e307 Mon Sep 17 00:00:00 2001 From: Andreas Wrede Date: Sat, 25 Apr 2026 15:49:09 +0200 Subject: [PATCH] Add design spec for plugin error checking and daemon logging fixes Co-Authored-By: Claude Sonnet 4.6 --- ...2026-04-25-plugin-error-checking-design.md | 92 +++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 docs/superpowers/specs/2026-04-25-plugin-error-checking-design.md diff --git a/docs/superpowers/specs/2026-04-25-plugin-error-checking-design.md b/docs/superpowers/specs/2026-04-25-plugin-error-checking-design.md new file mode 100644 index 0000000..c12d81e --- /dev/null +++ b/docs/superpowers/specs/2026-04-25-plugin-error-checking-design.md @@ -0,0 +1,92 @@ +# Plugin Error Checking & Daemon Logging — Design Spec + +**Date:** 2026-04-25 +**Scope:** hbc client — daemon mode logging, nagios_runner plugin robustness, PluginLoader messaging +**Files affected:** `hbd/client/main.py`, `hbd/client/plugins/nagios_runner.py`, `hbd/client/plugin.py` + +--- + +## 1. Daemon Mode Logging + +### Problem +In `main()`, `logging.basicConfig()` is called before `daemonize()` (establishing a StreamHandler to stderr), then called again after `daemonize()`. The second call is a no-op — Python ignores `basicConfig()` when handlers are already configured. After daemonization, stderr is redirected to `/dev/null`, so all subsequent log output is silently discarded. + +The existing `syslog.openlog()` / `syslog.syslog()` calls (lines 666–668) write a single startup message but do not integrate with the `logging` system, so plugin and connection log messages never reach syslog. + +### Fix +After `daemonize()`, explicitly reconfigure the root logger: + +1. Remove all existing handlers (they now write to `/dev/null`). +2. Add `logging.handlers.SysLogHandler(address='/dev/log', facility=LOG_DAEMON)`. +3. Set formatter: `hbc[%(process)d]: %(name)s %(levelname)s: %(message)s` +4. Preserve the `log_level` already determined from `-v`/`-x` CLI flags. + +Remove the redundant `syslog.openlog()` / `syslog.syslog()` calls — the logging system handles routing. + +**Fallback:** If `/dev/log` does not exist (containers, some BSDs), fall back to `SysLogHandler(address=('localhost', 514))`. Log one warning (to stderr, before handlers are replaced) so the operator knows. + +--- + +## 2. Nagios Runner Improvements + +### 2a — Async Subprocess +`_run_nagios_plugin()` is declared `async def` but calls `subprocess.run()` synchronously, blocking the event loop for the full command duration. + +**Fix:** Replace with `asyncio.create_subprocess_shell()` + `await proc.communicate()`. Enforce timeout with `asyncio.wait_for(..., timeout=self.timeout)` and catch `asyncio.TimeoutError`. + +### 2b — Stderr Capture +Subprocess stderr is currently discarded (`capture_output=True` only captures stdout in the sync call; stderr content is lost). + +**Fix:** Pass `stderr=asyncio.subprocess.PIPE` to `create_subprocess_shell`. After `communicate()`, if stdout is empty but stderr has content, use stderr as the output message. If both have content, append stderr to the output for visibility. + +### 2c — Negative Return Codes +A negative `returncode` means the process was killed by a signal (SIGKILL, OOM, etc.). The current code treats these as-is, which may produce unexpected status values. + +**Fix:** If `returncode < 0`, map to `NAGIOS_UNKNOWN` with message `"Process killed by signal {-returncode}"`. + +### 2d — Command Path Validation at Init +`initialize()` currently only checks that the commands list is non-empty. + +**Fix:** For each command entry during `initialize()`: +- Warn and skip the entry if `name` or `command` is missing. +- Extract the executable (first whitespace-delimited token of the command string). +- If the executable is an absolute path, check `os.path.isfile()` and `os.access(..., os.X_OK)`. Log a `WARNING` if either check fails. +- Commands with relative paths or shell builtins are not checked (they may be on PATH) — just noted. +- Validation warns only; all original entries in `self.commands` are retained and still attempted at collection time (where the existing missing-name/command guard already skips them). The plugin initializes successfully as long as the commands list is non-empty. + +--- + +## 3. PluginLoader Messaging + +### Problem +When `initialize()` returns `False`, the loader always logs: +> `WARNING: Plugin X failed initialization, skipping` + +This is alarming when the real reason is simply "no commands configured". There is no API to distinguish "not configured" from "genuinely broken". + +### Fix +Add an optional `skip_reason` attribute to `Plugin.__init__()` (defaults to `None`). + +In `PluginLoader.load_from_directory()`, after `initialize()` returns `False`: +- If `plugin.skip_reason` is set → `logger.info(f"Plugin {plugin.name} skipped: {plugin.skip_reason}")` +- If `plugin.skip_reason` is `None` → `logger.warning(f"Plugin {plugin.name} failed initialization, skipping")` (existing behaviour) + +In `NagiosRunnerPlugin.initialize()`, when no commands are configured: +```python +self.skip_reason = "no commands configured (add nagios_runner.commands to config)" +return False +``` + +Genuine failures (exceptions) continue to go through the existing `except` block in the loader, logging at `ERROR` with traceback — unchanged. + +--- + +## Decisions + +| Topic | Decision | +|---|---| +| Daemon log destination | syslog only (LOG_DAEMON facility) | +| Syslog fallback | localhost:514 UDP if `/dev/log` absent | +| Nagios result log level | INFO for all statuses (OK/WARNING/CRITICAL/UNKNOWN) | +| Invalid command handling at init | Warn and continue; still attempt at collection time | +| PluginLoader API change | `skip_reason` attribute on Plugin base class, checked by loader |