# Heartbeat Daemon (hbd) ✅

A lightweight daemon that listens for UDP heartbeat messages and acts on them: it keeps host state, optionally updates DNS records via `nsupdate`, forwards messages to WebSocket clients, and sends notifications (email, Pushover, Mattermost, Signal).

It is a refactor of a previously monolithic script into a modular Python package (`hbd`).

---

## 📌 Features

- Receive and parse heartbeat datagrams (text or zlib-compressed) ✅
- Maintain host state and detect up/down transitions ✅
- Queue DNS updates via `nsupdate` and process them in a background worker ✅
- WebSocket API for live updates (hosts & messages) ✅
- Notification pipeline (email, Pushover, Mattermost, Signal) ✅
- **User management & access control** ✅
  - Optional user accounts with PBKDF2-based password hashing (stdlib only)
  - Per-host roles: owner, manager, monitor
  - Session-based auth with cookie support (browser login page included)
  - Backwards compatible: no auth required when no users are configured
- **HTTP API & Web UI** ✅
  - REST API for plugin data, alerts, host information, and user management
  - Live dashboard with WebSocket updates
  - Interactive plugin metrics visualization
  - Alerts dashboard with filtering and summaries
- **Message journal with automatic log rotation** ✅
  - Logs all received messages in JSON format
  - Size-based automatic rotation
  - Configurable retention and backup management
- **Plugin system for extensible monitoring** ✅
  - Collect system metrics (CPU, memory, disk, network)
  - Execute existing Nagios monitoring plugins
  - Create custom plugins with simple Python classes
- **Threshold alerting system** ✅
  - Monitor metrics against configurable WARNING/CRITICAL thresholds
  - Hysteresis to prevent alert flapping
  - Automatic notifications on state changes
  - Re-notification for ongoing alerts
- Modular codebase suitable for unit testing and CI ✅

---

## 🔌 Plugin System

Heartbeat includes a comprehensive plugin architecture that extends
monitoring beyond simple heartbeats. The plugin system allows you to:

- **Collect system information**: OS details, hardware info, system configuration
- **Monitor resources**: CPU usage, memory, disk space, network statistics
- **Run Nagios plugins**: Execute thousands of existing Nagios monitoring plugins without modification
- **Create custom plugins**: Build your own monitoring logic with simple Python classes

### Plugin Types

- **InfoPlugin**: Collects static information once (e.g., OS version, hardware specs)
- **MonitorPlugin**: Collects metrics periodically (e.g., CPU usage every 30 seconds)

### Built-in Plugins

- `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
- `memory_monitor`: Monitors RAM and swap usage, available memory
- `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)

### Nagios Integration

The `nagios_runner` plugin integrates with the Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:

- Executes plugins via subprocess with timeout protection
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
- Extracts performance data with thresholds
- Reports aggregated status across all configured checks

See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for the complete integration guide, including configuration examples and custom plugin development.
### Creating Custom Plugins

```python
import shutil
import time

from hbd.client.plugin import MonitorPlugin


class DiskMonitorPlugin(MonitorPlugin):
    name = "disk_monitor"
    interval = 60  # Run every 60 seconds

    async def collect(self):
        # shutil.disk_usage stands in for the undefined get_disk_usage() helper
        usage = shutil.disk_usage("/")
        return {
            "disk_usage": round(100 * usage.used / usage.total, 1),  # percent used
            "timestamp": time.time(),
        }
```

Place plugins in `hbd/client/plugins/` and they'll be automatically discovered and loaded by the client.

---

## 📝 Message Journal

Heartbeat includes a message journal that logs all received messages with automatic rotation.

### Features

- **JSON Format**: All messages logged in JSONL (JSON Lines) format for easy parsing
- **Automatic Rotation**: Size-based rotation with configurable thresholds
- **Backup Management**: Keeps a configurable number of rotated log files
- **Non-blocking**: Async logging with minimal performance impact

### Configuration

```yaml
# Message journal settings
journal_enabled: true              # Enable/disable journaling
journal_dir: /var/log/heartbeat    # Journal directory
journal_file: messages.journal     # Base filename
journal_max_size: 104857600        # Max size (100MB default)
journal_max_backups: 10            # Number of backups to keep
```

### Example Journal Entry

```json
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver1","interval":30}}
```

### Analyzing Journal Files

```bash
# View recent messages
tail -100 /var/log/heartbeat/messages.journal | jq .

# Count messages by type
jq -r '.message.ID' /var/log/heartbeat/messages.journal | sort | uniq -c

# Filter by hostname
jq 'select(.message.name == "webserver1")' /var/log/heartbeat/messages.journal
```

See [docs/MESSAGE_JOURNAL.md](docs/MESSAGE_JOURNAL.md) for complete documentation including rotation behavior, integration with log management systems, and analysis examples.
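
If `jq` is not available, the same analysis can be done with a few lines of Python. The sketch below assumes only the entry layout shown in the example journal entry (a top-level `message` object with `ID` and `name` keys); the helper names `iter_entries`, `message_field`, `count_by_id`, and `for_host` are made up for illustration and are not part of `hbd`.

```python
import json
from collections import Counter


def iter_entries(path):
    """Yield one dict per journal line, skipping blank or malformed lines."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a partially written last line
            if isinstance(entry, dict):
                yield entry


def message_field(entry, key, default=None):
    """Read a key from the nested "message" object, tolerating odd entries."""
    message = entry.get("message")
    return message.get(key, default) if isinstance(message, dict) else default


def count_by_id(entries):
    """Count messages by type (the "ID" field), like `sort | uniq -c`."""
    return Counter(message_field(e, "ID", "?") for e in entries)


def for_host(entries, name):
    """Keep only entries whose message.name equals `name`."""
    return [e for e in entries if message_field(e, "name") == name]
```

For example, `count_by_id(iter_entries("/var/log/heartbeat/messages.journal"))` gives the per-type totals, and `for_host(iter_entries(path), "webserver1")` mirrors the hostname filter above. Skipping undecodable lines is a deliberate choice here, since a journal that is being written to may end with an incomplete line.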

---

## 🚨 Threshold Alerting

Heartbeat includes a sophisticated threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured limits.

### Features

- **Multi-level alerts**: WARNING and CRITICAL severity levels
- **Flexible operators**: Support for `>`, `>=`, `<`, `<=`, `==`, `!=` comparisons
- **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
- **Smart notifications**: Alerts only on state changes, not every check
- **Re-notifications**: Periodic reminders for ongoing alerts
- **Journal integration**: All threshold events logged for audit trail

### Configuration

```yaml
thresholds:
  # RTT (Round-Trip Time) thresholds for heartbeat monitoring
  # These are checked on every HTB message arrival
  rtt:
    webserver01:
      warning: 100.0     # Warn when RTT > 100ms
      critical: 500.0    # Critical when RTT > 500ms
    database01:
      warning: 50.0
      critical: 200.0

  # Plugin metric thresholds
  cpu_monitor:
    cpu_percent:
      warning: 80.0      # Warn when CPU > 80%
      critical: 90.0     # Critical when CPU > 90%
      operator: ">"
      hysteresis: 0.1    # 10% hysteresis to prevent flapping

  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0

  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0  # Alert when < 10GB free
          critical: 5.0
          operator: "<"  # Inverse threshold

# Global settings
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```

### RTT Monitoring

Heartbeat monitors network latency (Round-Trip Time) for each host's heartbeat messages.
RTT thresholds are **fully integrated with the threshold alerting system**:

- **Per-host configuration**: Set different thresholds for each monitored host
- **Real-time checking**: Thresholds evaluated on every HTB message arrival
- **Alert state tracking**: RTT alerts use the same state management as plugin metrics
- **Hysteresis support**: Configurable hysteresis prevents rapid state transitions
- **Alerts dashboard**: RTT alerts visible on the `/alerts` web page alongside plugin alerts
- **Smart notifications**: Only triggers on state changes (OK → WARNING → CRITICAL)
- **Re-notification**: Periodic reminders for ongoing RTT issues
- **Event & journal logging**: All RTT events logged for audit trail

**Configuration format:**

```yaml
thresholds:
  rtt:
    <hostname>:
      warning: <ms>      # Warn when RTT > this value
      critical: <ms>     # Critical when RTT > this value
      hysteresis: 0.1    # Optional: 10% hysteresis (default)
```

**Example alerts:**

```
WARNING: webserver01 - rtt.webserver01 = 125.3
CRITICAL: database01 - rtt.database01 = 520.1
RECOVERED: webserver01 - rtt.webserver01 = 45.2 (WARNING -> OK)
```

RTT alerts appear on the Alerts dashboard and can be filtered by severity level. The `metric_path` format is `rtt.<hostname>`, making it easy to distinguish from plugin metrics.

### Alert Behavior

1. **State Changes**: Notifications sent when crossing thresholds
   - OK → WARNING: Early notification
   - WARNING → CRITICAL: Escalation
   - CRITICAL → OK: Recovery

2. **Hysteresis**: Prevents rapid state transitions

   ```
   Critical threshold: 90%
   Hysteresis: 10%
   Recovery threshold: 81% (90 - 10% of 90)

   Value 91% → CRITICAL (threshold crossed)
   Value 85% → CRITICAL (still above 81%)
   Value 79% → OK (below recovery threshold)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   - Default: Every 60 minutes
   - Configurable via `threshold_renotify_interval`

### Example Notifications

```
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
CRITICAL: webserver01 - memory_monitor.percent = 96.0
RECOVERED: database01 - disk_monitor./.percent = 75.0 (WARNING -> OK)
REMINDER (CRITICAL): mailserver - cpu_monitor.load_1min = 12.5 (ongoing for 3600s)
```

### Supported Metrics

All plugin metrics can be thresholded:

- **CPU**: cpu_percent, load_1min, load_5min, load_15min
- **Memory**: percent, available_mb, swap_percent
- **Disk**: Per-partition percent, free_gb, free_mb
- **Network**: errors_total, dropped packets, connection counts
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)

### Per-Host Threshold Profiles

Named threshold configurations let different hosts use different limits. A host's `threshold_config` can be a single name or a **list**; lists are applied left-to-right so profiles compose without duplication:

```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        percent: {warning: 85, critical: 95}
  tight_cpu:                       # override CPU limits only
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}
  db_disk:                         # add a database partition check
    thresholds:
      disk_monitor:
        partitions:
          /var/lib/postgresql:
            percent: {warning: 75, critical: 88}

hosts:
  web-01:
    threshold_config: default                 # single profile
  db-01:
    threshold_config: [tight_cpu, db_disk]    # layered: CPU override + extra disk check
```

Each named config's overrides are applied in order on top of the defaults. Metrics not mentioned in a profile are inherited unchanged.

See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.

---

## 👥 User Management

Heartbeat supports optional user accounts with role-based access control per host.
### Roles

- **monitor**: view status, plugin data, alerts
- **manager**: monitor + queue commands, trigger DNS, queue upgrades
- **owner**: manager + drop host, transfer ownership, update access
- **admin** (user flag): owner-level access on every host

When no users are configured the server runs in **unauthenticated mode**, and all existing behaviour is unchanged.

### Quick setup

```yaml
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...   # hbd passwd alice
    admin: true

default_owner: alice

hosts:
  webserver01:
    owner: alice
    managers: [bob]
    monitors: [carol]
```

```bash
# Generate a password hash
hbd passwd alice
```

Browser users are redirected to `/login` automatically. The session cookie is set on login, so `fetch()` calls from dashboards work without any JavaScript changes.

See [docs/USERS.md](docs/USERS.md) for complete user management documentation.

---

## 🌐 HTTP API & Web UI

Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST API and web-based dashboards for monitoring and visualization.
### Features

- **User auth**: Optional session-based authentication with per-host role enforcement
- **REST API**: JSON endpoints for accessing plugin data, alerts, host information, and user management
- **Live Dashboard**: Real-time WebSocket-powered host status view
- **Plugin Metrics**: Interactive visualization of all plugin data with auto-refresh
- **Alerts Dashboard**: Comprehensive alert monitoring with filtering and summaries

### Web Dashboards

- **Login** (`/login`): Browser login form (shown automatically when auth is configured)
- **Live View** (`/live`): Real-time host connectivity, latency, and messages
- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering

### API Endpoints

```bash
# Log in (when auth is configured)
TOKEN=$(curl -s -X POST http://localhost:50004/api/0/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"alice","password":"secret"}' | jq -r .token)
# A bash array keeps the header as a single argument when expanded
AUTH=(-H "Authorization: Bearer $TOKEN")

# List all monitored hosts
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts

# Get all plugin data for a host
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts/webserver01/plugins

# Get detailed plugin history (last 50 samples)
curl "${AUTH[@]}" "http://localhost:50004/api/0/hosts/webserver01/plugins/cpu_monitor?limit=50"

# Get alert states for a specific host
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts/webserver01/alerts

# Get all active alerts across all hosts
curl "${AUTH[@]}" http://localhost:50004/api/0/alerts

# View/update host access roles
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts/webserver01/access
```

See [docs/HTTP_API.md](docs/HTTP_API.md) for complete API documentation including response formats, error handling, and integration examples.
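
For scripted access, the same calls can be made from Python with only the standard library. This is a sketch based on what the curl examples show (the `/api/0` prefix, a JSON login body, a `token` field in the login response, and a bearer `Authorization` header); the helper names `build_request`, `call`, and `login` are illustrative, and the response shapes are not specified here, so consult docs/HTTP_API.md before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:50004/api/0"  # port taken from the curl examples


def build_request(path, token=None, payload=None):
    """Build a GET request, or a JSON POST when `payload` is given."""
    headers = {"Accept": "application/json"}
    data = None
    if token:
        headers["Authorization"] = f"Bearer {token}"
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    # urllib switches the method to POST automatically when data is set.
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers)


def call(path, token=None, payload=None, timeout=10):
    """Send the request and decode the JSON response body."""
    request = build_request(path, token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def login(username, password):
    """Return the bearer token from the login endpoint."""
    reply = call("/auth/login", payload={"username": username, "password": password})
    return reply["token"]


# Example session against a running server (not executed here):
#   token = login("alice", "secret")
#   hosts = call("/hosts", token=token)
#   alerts = call("/hosts/webserver01/alerts", token=token)
```

When no users are configured the server needs no authentication, so `call("/hosts")` without a token should be enough in that mode.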

---

## ⚙️ Quickstart

Prerequisites:

- Python 3.11+ (project uses language features from recent Python)
- `nsupdate` (for DNS updates) if using dynamic DNS

Install dependencies (recommended into a venv):

This project declares its dependencies in `pyproject.toml`. Instead of the old `requirements.txt` flow, install the package into a virtualenv using `pip`; see `scripts/hb_install.sh` for one way to do this.

Run the daemon (example):

```bash
# run with default config lookup (~/.hb.yaml)
hbd -c .hb.yaml -f -v
```

You can also run it directly via the package entrypoint after installation:

```bash
python -m hbd.server.cli -c /path/to/config.yaml
```

### Running the Client

The heartbeat client (`hbc`) sends periodic heartbeats and plugin data to the server:

```bash
# Basic usage pointing to server (host is a positional argument)
hbc your-server.example.com

# Run as daemon with a config file
hbc -d -c /etc/hbc.yaml your-server.example.com

# Send a one-off boot message
hbc --boot your-server.example.com

# Verbose output
hbc -v your-server.example.com
```

You can also run it via the module entrypoint:

```bash
python -m hbd.client.main your-server.example.com
```

Client configuration can also be specified in YAML:

```yaml
server: hbd.example.com
port: 50003
interval: 30

plugins:
  cpu_monitor:
    interval: 300    # Check every 5 minutes (default)
    per_core: true
  memory_monitor:
    interval: 300    # Check every 5 minutes (default)
  disk_monitor:
    interval: 300    # Check every 5 minutes (default)
  network_monitor:
    interval: 300    # Check every 5 minutes (default)
  nagios_runner:
    interval: 300    # Check every 5 minutes (default)
    commands:
      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```

All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
### hbc_mini: single-file client (no external dependencies)

`scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly: no virtualenv, no `pip install`.

```bash
# Basic usage
python3 hbc_mini.py your-server.example.com

# Run as daemon
python3 hbc_mini.py -d your-server.example.com

# Send a boot message
python3 hbc_mini.py -b your-server.example.com

# Send a one-off message
python3 hbc_mini.py -m "maintenance starting" your-server.example.com
```

**Config:** `~/.hbc.json` (same keys as `~/.hbc.yaml`, JSON format). Example:

```json
{
  "hb_port": 50003,
  "interval": 30,
  "plugins": {
    "ping_monitor": {
      "interval": 60,
      "hosts": ["8.8.8.8", "192.168.1.1"]
    },
    "nagios_runner": {
      "interval": 300,
      "commands": [
        {"name": "check_load", "command": "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6"}
      ]
    }
  }
}
```

**Plugin availability:**

| Plugin | Platform | Data source |
|---|---|---|
| `os_info` | all | `platform` stdlib |
| `ping_monitor` | all | `ping` subprocess |
| `nagios_runner` | all (not Windows) | subprocess |
| `cpu_monitor` | Linux | `/proc/stat` |
| `memory_monitor` | Linux | `/proc/meminfo` |
| `disk_monitor` | Linux, macOS, BSD | `df -P` subprocess |
| `network_monitor` | Linux | `/proc/net/dev` |

**What is not available compared to the full `hbc`:**

- No YAML config (use JSON instead)
- No `filesystem_info` plugin
- `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
- Plugins cannot be loaded from external `.py` files; all plugins are built in

Everything else (heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog) is identical to the full client.

---

## 🐞 Debugging in VS Code

This repository includes a ready-to-use `.vscode/launch.json` with configurations to run or attach the VS Code debugger to `hbd`.

- Ensure the **Python** extension is installed and select the project `.venv` as the interpreter (bottom-left of VS Code).
- Use **F5** and pick one of these configurations from the Run view:
  - **Python: Run hbd (module)**: runs `hbd.server.cli` as a module and sets `PYTHONPATH` to the workspace root (recommended).
  - **Python: Run hbd with debugpy (listen)**: launches `debugpy` and `hbd` together; useful when you want the process to listen for a debugger.
  - **Python: Attach (localhost:5678)**: attaches the debugger to a running process started with `debugpy`.

To start `hbd` manually and wait for the debugger to attach, run:

```bash
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.server.cli -c .hb.yaml -f -v
```

Set breakpoints in modules such as `hbd/server/udp.py`, `hbd/server/dns.py`, or `hbd/server/main.py`, and use the **Attach** configuration to connect. Use `justMyCode: false` if you need to step into third-party code.

---

## 🛠 Configuration

`hbd` reads YAML configuration (optional). If `PyYAML` is not installed, built-in defaults are used.

Example configuration keys (see `hbd/server/config.py`):

- `hb_port`: UDP port to listen for heartbeats (default: 50003)
- `hbd_port`: internal control port (default: 50004)
- `hbd_host`: bind address for HTTP/WSS
- `pickfile`: path for persisted state
- `logfile`: path to log file
- `pushsrv`: push service (`pushover`|`mattermost`|`all`)
- `interval` / `grace`: heartbeat timing configuration
- `dyndomains`: list of dyndomains to update via `nsupdate`
- `nsupdate_bin`: path to nsupdate binary
- `ws_port`: port for plain WebSocket connections (default: 50005)
- `wss_port`: port for secure WebSocket (WSS) connections (default: none). If set, `hbd` will attempt to serve WSS on this port when `wss_pem` and `wss_key` SSL files are available under `cert_path` (see below).
- `cert_path`: directory where TLS certificate and key are looked up (default: /usr/local/etc/ssl/)
- `wss_pem`: filename for the certificate chain (default: fullchain.pem)
- `wss_key`: filename for the private key (default: privkey.pem)
- `users`: mapping of username → user attributes (full_name, avatar, password, admin, notification_channels)
- `default_owner`: username that owns hosts with no explicit owner (falls back to first admin user)

Example `.hb.yaml` (minimal):

```yaml
hbd_host: 0.0.0.0
hbd_port: 50004
dyndomains:
  - example.com
nsupdate_bin: /usr/bin/nsupdate
pushsrv: pushover
```

> Tip: `SERVER_DEFAULTS` in `hbd/server/config.py` contains the canonical defaults and accepted configuration keys.

---

## 🔧 Architecture & Modules

The package is organized into three subpackages:

**`hbd.common`**, shared code used by both client and server:

- `hbd.common.proto`: serialization/deserialization of heartbeat messages (supports compressed payloads and plugin data)
- `hbd.common.utils`: small utility helpers (`shortname`, `dur`, `initlog`)

**`hbd.server`**, the heartbeat daemon (`hbd`):

- `hbd.server.cli`: CLI entrypoint and argument parsing
- `hbd.server.main`: async orchestration to run UDP/HTTP/WSS components
- `hbd.server.udp`: UDP parsing and `handle_datagram` implementation (main state machine)
- `hbd.server.dns`: `create_nsupdate_payload`, `nsupdate`, and an asyncio DNS worker (`start_dns_worker`). The DNS worker runs as an `asyncio` task and the package exposes a small thread-safe bridge so legacy synchronous code can `put()` updates into the queue.
- `hbd.server.notify`: email and push notification helpers
- `hbd.server.ws`: WebSocket server and thread-safe broadcast helpers
- `hbd.server.http`: HTTP handler factory for the status UI/API
- `hbd.server.journal`: message journal with size-based log rotation and backup management
- `hbd.server.threshold`: threshold alerting engine
- `hbd.server.monitor`: host state monitoring
- `hbd.server.hbdclass`: `Host` class and shared server state
- `hbd.server.config`: configuration loader and defaults

**`hbd.client`**, the heartbeat client (`hbc`):

- `hbd.client.main`: client entrypoint; sends heartbeats and plugin data to the server
- `hbd.client.plugin`: plugin framework with base classes, registry, and dynamic loader
- `hbd.client.plugins/`: built-in plugins (os_info, cpu_monitor, memory_monitor, disk_monitor, network_monitor, filesystem_info, nagios_runner)
- `hbd.client.config`: client configuration loader

This modular layout makes the code easier to test and maintain.

**Runtime & Shutdown**

- The main runtime is asyncio-based. Services (UDP listener, HTTP server, WebSocket server, monitor, and DNS worker) run as asyncio tasks.
- On SIGINT/SIGTERM the server triggers a graceful shutdown: it cancels active tasks, signals the DNS worker via a sentinel, and cleans up resources before exit.
- The DNS update worker is implemented as an `asyncio` task; synchronous producers can still enqueue DNS updates via a small thread-safe bridge available at `hbd.server.hbdclass.Host.dnsQ`.

**Templates & Static Files**

- Template files are located under `hbd/server/templates`. The HTTP server resolves templates relative to the `hbd.server` package, but the path can be overridden with the `templates_dir` config key.
- Static assets (CSS/JS/images) are served from `hbd/server/static` via the `/static/` HTTP route.

---

## 🧪 Testing & Dev

Tests are written with `unittest`; additional tests rely on `pytest`, which you can also use as the test runner if you prefer.

To run tests locally without installing anything beyond the dev requirements:

```bash
# with project root on PYTHONPATH
PYTHONPATH=. python -m unittest discover -v

# or with pytest if installed
pytest -q
```

Developer tooling included:

- `pyproject.toml`: project metadata and dependencies
- `tox.ini`: convenience wrappers for running tests, lint, and mypy

To run linters and type checks locally:

```bash
# after installing dev deps
tox -e lint
tox -e mypy
```

---

## 🚀 Running in production

- Use your system service manager (systemd, launchd, etc.) to run `hbd` in the background.
- Ensure `nsupdate` and necessary credentials are available for dynamic DNS updates.
- Configure TLS for WSS if you enable secure websockets.

> Note: The project contains a small example for obtaining DNS-verified certs (certbot with RFC2136); see the earlier commit history.

---

## 🤝 Contributing

Contributions welcome! Please:

1. Open an issue to discuss larger changes.
2. Create a topic branch and a clear PR.
3. Add tests for new features and run linters.
4. Keep changes focused and documented.

---

## 📜 License

This repository is licensed under the MIT license. See `LICENSE` for details.