Files
heartbeat/docs/superpowers/specs/2026-04-25-plugin-error-checking-design.md
T
2026-04-25 15:49:09 +02:00

4.9 KiB
Raw Blame History

Plugin Error Checking & Daemon Logging — Design Spec

Date: 2026-04-25
Scope: hbc client — daemon mode logging, nagios_runner plugin robustness, PluginLoader messaging
Files affected: hbd/client/main.py, hbd/client/plugins/nagios_runner.py, hbd/client/plugin.py


1. Daemon Mode Logging

Problem

In main(), logging.basicConfig() is called before daemonize() (establishing a StreamHandler to stderr), then called again after daemonize(). The second call is a no-op — Python ignores basicConfig() when handlers are already configured. After daemonization, stderr is redirected to /dev/null, so all subsequent log output is silently discarded.

The existing syslog.openlog() / syslog.syslog() calls (lines 666668) write a single startup message but do not integrate with the logging system, so plugin and connection log messages never reach syslog.

Fix

After daemonize(), explicitly reconfigure the root logger:

  1. Remove all existing handlers (they now write to /dev/null).
  2. Add logging.handlers.SysLogHandler(address='/dev/log', facility=LOG_DAEMON).
  3. Set formatter: hbc[%(process)d]: %(name)s %(levelname)s: %(message)s
  4. Preserve the log_level already determined from -v/-x CLI flags.

Remove the redundant syslog.openlog() / syslog.syslog() calls — the logging system handles routing.

Fallback: If /dev/log does not exist (containers, some BSDs), fall back to SysLogHandler(address=('localhost', 514)). Log one warning (to stderr, before handlers are replaced) so the operator knows.


2. Nagios Runner Improvements

2a — Async Subprocess

_run_nagios_plugin() is declared async def but calls subprocess.run() synchronously, blocking the event loop for the full command duration.

Fix: Replace with asyncio.create_subprocess_shell() + await proc.communicate(). Enforce timeout with asyncio.wait_for(..., timeout=self.timeout) and catch asyncio.TimeoutError.

2b — Stderr Capture

Subprocess stderr is currently discarded (capture_output=True only captures stdout in the sync call; stderr content is lost).

Fix: Pass stderr=asyncio.subprocess.PIPE to create_subprocess_shell. After communicate(), if stdout is empty but stderr has content, use stderr as the output message. If both have content, append stderr to the output for visibility.

2c — Negative Return Codes

A negative returncode means the process was killed by a signal (SIGKILL, OOM, etc.). The current code treats these as-is, which may produce unexpected status values.

Fix: If returncode < 0, map to NAGIOS_UNKNOWN with message "Process killed by signal {-returncode}".

2d — Command Path Validation at Init

initialize() currently only checks that the commands list is non-empty.

Fix: For each command entry during initialize():

  • Warn and skip the entry if name or command is missing.
  • Extract the executable (first whitespace-delimited token of the command string).
  • If the executable is an absolute path, check os.path.isfile() and os.access(..., os.X_OK). Log a WARNING if either check fails.
  • Commands with relative paths or shell builtins are not checked (they may be on PATH) — just noted.
  • Validation warns only; all original entries in self.commands are retained and still attempted at collection time (where the existing missing-name/command guard already skips them). The plugin initializes successfully as long as the commands list is non-empty.

3. PluginLoader Messaging

Problem

When initialize() returns False, the loader always logs:

WARNING: Plugin X failed initialization, skipping

This is alarming when the real reason is simply "no commands configured". There is no API to distinguish "not configured" from "genuinely broken".

Fix

Add an optional skip_reason attribute to Plugin.__init__() (defaults to None).

In PluginLoader.load_from_directory(), after initialize() returns False:

  • If plugin.skip_reason is set → logger.info(f"Plugin {plugin.name} skipped: {plugin.skip_reason}")
  • If plugin.skip_reason is Nonelogger.warning(f"Plugin {plugin.name} failed initialization, skipping") (existing behaviour)

In NagiosRunnerPlugin.initialize(), when no commands are configured:

self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
return False

Genuine failures (exceptions) continue to go through the existing except block in the loader, logging at ERROR with traceback — unchanged.


Decisions

Topic Decision
Daemon log destination syslog only (LOG_DAEMON facility)
Syslog fallback localhost:514 UDP if /dev/log absent
Nagios result log level INFO for all statuses (OK/WARNING/CRITICAL/UNKNOWN)
Invalid command handling at init Warn and continue; still attempt at collection time
PluginLoader API change skip_reason attribute on Plugin base class, checked by loader