Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4.9 KiB
Plugin Error Checking & Daemon Logging — Design Spec
Date: 2026-04-25
Scope: hbc client — daemon mode logging, nagios_runner plugin robustness, PluginLoader messaging
Files affected: hbd/client/main.py, hbd/client/plugins/nagios_runner.py, hbd/client/plugin.py
1. Daemon Mode Logging
Problem
In main(), logging.basicConfig() is called before daemonize() (establishing a StreamHandler to stderr), then called again after daemonize(). The second call is a no-op — Python ignores basicConfig() when handlers are already configured. After daemonization, stderr is redirected to /dev/null, so all subsequent log output is silently discarded.
The existing syslog.openlog() / syslog.syslog() calls (lines 666–668) write a single startup message but do not integrate with the logging system, so plugin and connection log messages never reach syslog.
Fix
After daemonize(), explicitly reconfigure the root logger:
- Remove all existing handlers (they now write to
/dev/null). - Add
logging.handlers.SysLogHandler(address='/dev/log', facility=LOG_DAEMON). - Set formatter:
hbc[%(process)d]: %(name)s %(levelname)s: %(message)s - Preserve the
log_levelalready determined from-v/-xCLI flags.
Remove the redundant syslog.openlog() / syslog.syslog() calls — the logging system handles routing.
Fallback: If /dev/log does not exist (containers, some BSDs), fall back to SysLogHandler(address=('localhost', 514)). Log one warning (to stderr, before handlers are replaced) so the operator knows.
2. Nagios Runner Improvements
2a — Async Subprocess
_run_nagios_plugin() is declared async def but calls subprocess.run() synchronously, blocking the event loop for the full command duration.
Fix: Replace with asyncio.create_subprocess_shell() + await proc.communicate(). Enforce timeout with asyncio.wait_for(..., timeout=self.timeout) and catch asyncio.TimeoutError.
2b — Stderr Capture
Subprocess stderr is currently discarded (capture_output=True only captures stdout in the sync call; stderr content is lost).
Fix: Pass stderr=asyncio.subprocess.PIPE to create_subprocess_shell. After communicate(), if stdout is empty but stderr has content, use stderr as the output message. If both have content, append stderr to the output for visibility.
2c — Negative Return Codes
A negative returncode means the process was killed by a signal (SIGKILL, OOM, etc.). The current code treats these as-is, which may produce unexpected status values.
Fix: If returncode < 0, map to NAGIOS_UNKNOWN with message "Process killed by signal {-returncode}".
2d — Command Path Validation at Init
initialize() currently only checks that the commands list is non-empty.
Fix: For each command entry during initialize():
- Warn and skip the entry if
nameorcommandis missing. - Extract the executable (first whitespace-delimited token of the command string).
- If the executable is an absolute path, check
os.path.isfile()andos.access(..., os.X_OK). Log aWARNINGif either check fails. - Commands with relative paths or shell builtins are not checked (they may be on PATH) — just noted.
- Validation warns only; all original entries in
self.commandsare retained and still attempted at collection time (where the existing missing-name/command guard already skips them). The plugin initializes successfully as long as the commands list is non-empty.
3. PluginLoader Messaging
Problem
When initialize() returns False, the loader always logs:
WARNING: Plugin X failed initialization, skipping
This is alarming when the real reason is simply "no commands configured". There is no API to distinguish "not configured" from "genuinely broken".
Fix
Add an optional skip_reason attribute to Plugin.__init__() (defaults to None).
In PluginLoader.load_from_directory(), after initialize() returns False:
- If
plugin.skip_reasonis set →logger.info(f"Plugin {plugin.name} skipped: {plugin.skip_reason}") - If
plugin.skip_reasonisNone→logger.warning(f"Plugin {plugin.name} failed initialization, skipping")(existing behaviour)
In NagiosRunnerPlugin.initialize(), when no commands are configured:
self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
return False
Genuine failures (exceptions) continue to go through the existing except block in the loader, logging at ERROR with traceback — unchanged.
Decisions
| Topic | Decision |
|---|---|
| Daemon log destination | syslog only (LOG_DAEMON facility) |
| Syslog fallback | localhost:514 UDP if /dev/log absent |
| Nagios result log level | INFO for all statuses (OK/WARNING/CRITICAL/UNKNOWN) |
| Invalid command handling at init | Warn and continue; still attempt at collection time |
| PluginLoader API change | skip_reason attribute on Plugin base class, checked by loader |