# Heartbeat Daemon (hbd) ✅

A lightweight daemon that listens for UDP heartbeat messages and acts on them: keeps host state, optionally updates DNS records via `nsupdate`, forwards messages to WebSocket clients, and sends notifications (email, Pushover, Mattermost, Signal).

It is a refactor of a previously monolithic script into a modular Python package (`hbd`).

---

## 📌 Features

- Receive and parse heartbeat datagrams (text or zlib-compressed) ✅
- Maintain host state and detect up/down transitions ✅
- Queue DNS updates via `nsupdate` and process them in a background `asyncio` worker ✅
- WebSocket API for live updates (hosts & messages) ✅
- Notification pipeline (email, Pushover, Mattermost, Signal) ✅
- **HTTP API & Web UI** ✅
  - REST API for plugin data, alerts, and host information
  - Live dashboard with WebSocket updates
  - Interactive plugin metrics visualization
  - Alerts dashboard with filtering and summaries
- **Message journal with automatic log rotation** ✅
  - Logs all received messages in JSON format
  - Size-based automatic rotation
  - Configurable retention and backup management
- **Plugin system for extensible monitoring** ✅
  - Collect system metrics (CPU, memory, disk, network)
  - Execute existing Nagios monitoring plugins
  - Create custom plugins with simple Python classes
- **Threshold alerting system** ✅
  - Monitor metrics against configurable WARNING/CRITICAL thresholds
  - Hysteresis to prevent alert flapping
  - Automatic notifications on state changes
  - Re-notification for ongoing alerts
- Modular codebase suitable for unit testing and CI ✅

---

## 🔌 Plugin System

Heartbeat includes a comprehensive plugin architecture that extends monitoring beyond simple heartbeats.
The plugin system allows you to:

- **Collect system information**: OS details, hardware info, system configuration
- **Monitor resources**: CPU usage, memory, disk space, network statistics
- **Run Nagios plugins**: Execute thousands of existing Nagios monitoring plugins without modification
- **Create custom plugins**: Build your own monitoring logic with simple Python classes

### Plugin Types

- **InfoPlugin**: Collects static information once (e.g., OS version, hardware specs)
- **MonitorPlugin**: Collects metrics periodically (e.g., CPU usage every 30 seconds)

### Built-in Plugins

- `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
- `memory_monitor`: Monitors RAM and swap usage, available memory
- `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)

### Nagios Integration

The `nagios_runner` plugin integrates with the existing Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:

- Executes plugins via subprocess with timeout protection
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
- Extracts performance data with thresholds
- Reports aggregated status across all configured checks

See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for the complete integration guide, including configuration examples and custom plugin development.
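
To make the exit-code and performance-data handling concrete, here is a minimal sketch of how a Nagios plugin result can be interpreted. It is not the `nagios_runner` implementation: `run_check`, `parse_perfdata`, and `worst_status` are hypothetical helper names, the severity ordering used for aggregation is an assumption, and the perfdata layout (`label=value[UOM];warn;crit;min;max` after the `|`) follows the usual Nagios plugin output convention.

```python
import re
import shlex
import subprocess

# Standard Nagios plugin exit codes
STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed ordering for aggregation, least to most severe (hypothetical choice)
SEVERITY = ["OK", "UNKNOWN", "WARNING", "CRITICAL"]

# One perfdata item: label=value[UOM];[warn];[crit];[min];[max]
PERF_RE = re.compile(
    r"(?P<label>'[^']+'|[^\s=]+)=(?P<value>-?\d+(?:\.\d+)?)"
    r"(?P<uom>[^\s;]*)(?P<rest>(?:;[^\s;]*)*)"
)


def parse_perfdata(output):
    """Return {label: {...}} parsed from the text after the first '|'."""
    _, _, perf = output.partition("|")
    metrics = {}
    for m in PERF_RE.finditer(perf):
        # Pad the optional ;warn;crit;min;max fields to exactly four entries
        fields = (m.group("rest").split(";")[1:] + [""] * 4)[:4]
        metrics[m.group("label").strip("'")] = {
            "value": float(m.group("value")),
            "uom": m.group("uom"),
            # Thresholds may be ranges such as "10:20", so keep them as strings
            "warn": fields[0] or None,
            "crit": fields[1] or None,
        }
    return metrics


def run_check(command, timeout=10.0):
    """Run one check and return (status, first output line, perfdata)."""
    try:
        proc = subprocess.run(
            shlex.split(command), capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        return "UNKNOWN", f"check failed to run: {exc}", {}
    output = proc.stdout.strip()
    first_line = output.splitlines()[0] if output else ""
    status = STATUS.get(proc.returncode, "UNKNOWN")
    return status, first_line, parse_perfdata(first_line)


def worst_status(statuses):
    """Aggregate several check results into the most severe status."""
    return max(statuses, key=SEVERITY.index, default="OK")
```

A caller would typically loop over the configured commands, call `run_check` on each, and report `worst_status` of the collected statuses alongside the per-check perfdata.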
### Creating Custom Plugins

```python
import shutil
import time

from hbd.plugin import MonitorPlugin


class DiskMonitorPlugin(MonitorPlugin):
    name = "disk_monitor"
    interval = 60  # Run every 60 seconds

    async def collect(self):
        # Percentage of the root filesystem in use (stdlib only)
        usage = shutil.disk_usage("/")
        return {
            "disk_usage": round(usage.used / usage.total * 100, 1),
            "timestamp": time.time(),
        }
```

Place plugins in `hbd/plugins/` and they'll be automatically discovered and loaded by the client.

---

## 📝 Message Journal

Heartbeat includes a message journal that logs all received messages with automatic rotation.

### Features

- **JSON Format**: All messages logged in JSONL (JSON Lines) format for easy parsing
- **Automatic Rotation**: Size-based rotation with configurable thresholds
- **Backup Management**: Keeps a configurable number of rotated log files
- **Non-blocking**: Async logging with minimal performance impact

### Configuration

```yaml
# Message journal settings
journal_enabled: true             # Enable/disable journaling
journal_dir: /var/log/heartbeat   # Journal directory
journal_file: messages.journal    # Base filename
journal_max_size: 104857600       # Max size (100MB default)
journal_max_backups: 10           # Number of backups to keep
```

### Example Journal Entry

```json
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver1","interval":30}}
```

### Analyzing Journal Files

```bash
# View recent messages
tail -100 /var/log/heartbeat/messages.journal | jq .

# Count messages by type
jq -r '.message.ID' /var/log/heartbeat/messages.journal | sort | uniq -c

# Filter by hostname
jq 'select(.message.name == "webserver1")' /var/log/heartbeat/messages.journal
```

See [docs/MESSAGE_JOURNAL.md](docs/MESSAGE_JOURNAL.md) for complete documentation including rotation behavior, integration with log management systems, and analysis examples.
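
For readers who want a feel for size-based rotation with a capped number of backups, the following is a rough sketch built on the standard library's `logging.handlers.RotatingFileHandler`. It is an illustration only and may differ from how `hbd.journal` actually rotates files; `open_journal` and `journal_message` are hypothetical names, and their parameters simply mirror the `journal_*` configuration keys.

```python
import json
import logging
import tempfile
import time
from datetime import datetime
from logging.handlers import RotatingFileHandler
from pathlib import Path


def open_journal(journal_dir, journal_file="messages.journal",
                 max_size=104857600, max_backups=10):
    """Return a logger that appends JSON lines and rotates by size."""
    path = Path(journal_dir) / journal_file
    path.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        path, maxBytes=max_size, backupCount=max_backups, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(message)s"))  # raw JSON, one per line
    logger = logging.getLogger(f"journal.{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep journal entries out of the root logger
    for old in list(logger.handlers):  # avoid duplicate handlers on re-open
        logger.removeHandler(old)
        old.close()
    logger.addHandler(handler)
    return logger


def journal_message(logger, source_ip, source_port, message):
    """Write one received message as a single JSON line."""
    now = time.time()
    entry = {
        "timestamp": now,
        "datetime": datetime.fromtimestamp(now).isoformat(timespec="seconds"),
        "source_ip": source_ip,
        "source_port": source_port,
        "message": message,
    }
    logger.info(json.dumps(entry, separators=(",", ":")))


if __name__ == "__main__":
    # Tiny limits so that rotation is visible after a handful of messages
    with tempfile.TemporaryDirectory() as tmp:
        journal = open_journal(tmp, max_size=300, max_backups=2)
        for i in range(20):
            journal_message(journal, "192.168.1.100", 50003,
                            {"ID": "HTB", "name": f"host{i}", "interval": 30})
        for handler in journal.handlers:
            handler.close()
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

With this handler the active file keeps the base name and older data moves to `messages.journal.1`, `messages.journal.2`, and so on, with the oldest backup dropped once the backup limit is reached.
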

---

## 🚨 Threshold Alerting

Heartbeat includes a threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured limits.

### Features

- **Multi-level alerts**: WARNING and CRITICAL severity levels
- **Flexible operators**: Support for `>`, `>=`, `<`, `<=`, `==`, `!=` comparisons
- **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
- **Smart notifications**: Alerts only on state changes, not every check
- **Re-notifications**: Periodic reminders for ongoing alerts
- **Journal integration**: All threshold events logged for audit trail

### Configuration

```yaml
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0      # Warn when CPU > 80%
      critical: 90.0     # Critical when CPU > 90%
      operator: ">"
      hysteresis: 0.1    # 10% hysteresis to prevent flapping
  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0
  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0    # Alert when < 10GB free
          critical: 5.0
          operator: "<"    # Inverse threshold

# Global settings
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```

### Alert Behavior

1. **State Changes**: Notifications sent when crossing thresholds
   - OK → WARNING: Early notification
   - WARNING → CRITICAL: Escalation
   - CRITICAL → OK: Recovery

2. **Hysteresis**: Prevents rapid state transitions

   ```
   Critical threshold: 90%
   Hysteresis: 10%
   Recovery threshold: 81% (90 - 10% of 90)

   Value 91% → CRITICAL (threshold crossed)
   Value 85% → CRITICAL (still above 81%)
   Value 79% → OK (below recovery threshold)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   - Default: Every 60 minutes
   - Configurable via `threshold_renotify_interval`

### Example Notifications

```
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
CRITICAL: webserver01 - memory_monitor.percent = 96.0
RECOVERED: database01 - disk_monitor./.percent = 75.0 (WARNING -> OK)
REMINDER (CRITICAL): mailserver - cpu_monitor.load_1min = 12.5 (ongoing for 3600s)
```

### Supported Metrics

All plugin metrics can be thresholded:

- **CPU**: cpu_percent, load_1min, load_5min, load_15min
- **Memory**: percent, available_mb, swap_percent
- **Disk**: Per-partition percent, free_gb, free_mb
- **Network**: errors_total, dropped packets, connection counts
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)

See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.

---

## 🌐 HTTP API & Web UI

Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST API and web-based dashboards for monitoring and visualization.

### Features

- **REST API**: JSON endpoints for accessing plugin data, alerts, and host information
- **Live Dashboard**: Real-time WebSocket-powered host status view
- **Plugin Metrics**: Interactive visualization of all plugin data with auto-refresh
- **Alerts Dashboard**: Comprehensive alert monitoring with filtering and summaries
- **CORS Support**: Configurable for integration with external applications

### Web Dashboards

- **Live View** (`/live`): Real-time host connectivity, latency, and messages
- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering

### API Endpoints

```bash
# List all monitored hosts
curl http://localhost:50004/api/0/hosts

# Get all plugin data for a host
curl http://localhost:50004/api/0/hosts/webserver01/plugins

# Get detailed plugin history (last 50 samples); quoted so the shell does not glob on "?"
curl "http://localhost:50004/api/0/hosts/webserver01/plugins/cpu_monitor?limit=50"

# Get alert states for a specific host
curl http://localhost:50004/api/0/hosts/webserver01/alerts

# Get all active alerts across all hosts
curl http://localhost:50004/api/0/alerts
```

### Integration Examples

**Python Client:**

```python
import requests

# Monitor for critical alerts
response = requests.get('http://localhost:50004/api/0/alerts')
alerts = response.json()

if alerts['summary']['critical'] > 0:
    print(f"⚠️ {alerts['summary']['critical']} CRITICAL alerts!")
    for alert in alerts['alerts']:
        if alert['level'] == 'CRITICAL':
            print(f"  {alert['hostname']}: {alert['metric_path']} = {alert['last_value']}")
```

**Bash Monitoring Script:**

```bash
#!/bin/bash
# Check for critical alerts
CRITICAL=$(curl -s http://localhost:50004/api/0/alerts | jq '.summary.critical')

if [ "$CRITICAL" -gt 0 ]; then
    echo "CRITICAL: $CRITICAL critical alerts detected!"
    # Send notification
fi
```

### Demo & Testing

Run the API demo script to test all endpoints:

```bash
python3 scripts/demo_http_api.py
```

See [docs/HTTP_API.md](docs/HTTP_API.md) for complete API documentation including response formats, error handling, and integration examples.

---

## ⚙️ Quickstart

Prerequisites:

- Python 3.10+ (project uses language features from recent Python)
- `nsupdate` (for DNS updates) if using dynamic DNS

Install dependencies (recommended into a venv):

This project now declares its dependencies in `pyproject.toml`. Instead of the old `requirements.txt` flow, install the package into a virtualenv using `pip`; see `scripts/install.sh` for one way to do this.

Run the daemon (example):

```bash
# run with default config lookup (~/.hb.yaml)
hbd -c .hb.yaml -f -v
```

You can also run it directly via the package entrypoint after installation:

```bash
python -m hbd.cli -c /path/to/config.yaml
```

### Running the Client

The heartbeat client (`hbc`) sends periodic heartbeats and plugin data to the server:

```bash
# Basic usage pointing to server
python -m hbd.hbc --server your-server.example.com

# With custom configuration
python -m hbd.hbc --server 192.168.1.100 --port 50003 --interval 30

# Run with specific plugins enabled/disabled
python -m hbd.hbc --server hbd.local --disable-plugin os_info
```

Client configuration can also be specified in YAML:

```yaml
server: hbd.example.com
port: 50003
interval: 30

plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes (default)
    per_core: true
  memory_monitor:
    interval: 300  # Check every 5 minutes (default)
  disk_monitor:
    interval: 300  # Check every 5 minutes (default)
  network_monitor:
    interval: 300  # Check every 5 minutes (default)
  nagios_runner:
    interval: 300  # Check every 5 minutes (default)
    commands:
      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```

All monitoring plugins default to 5-minute (300 second) intervals, but can be customized
as needed.

## 🐞 Debugging in VS Code

This repository includes a ready-to-use `.vscode/launch.json` with configurations to run or attach the VS Code debugger to `hbd`.

- Ensure the **Python** extension is installed and select the project `.venv` as the interpreter (bottom-left of VS Code).
- Use **F5** and pick one of these configurations from the Run view:
  - **Python: Run hbd (module)**: runs `hbd.cli` as a module and sets `PYTHONPATH` to the workspace root (recommended).
  - **Python: Run hbd with debugpy (listen)**: launches `debugpy` and `hbd` together; useful when you want the process to listen for a debugger.
  - **Python: Attach (localhost:5678)**: attaches the debugger to a running process started with `debugpy`.

To start `hbd` manually and wait for the debugger to attach, run:

```bash
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.cli -c .hb.yaml -f -v
```

Set breakpoints in modules such as `hbd/udp.py`, `hbd/dns.py`, or `hbd/server.py`, and use the **Attach** configuration to connect. Use `justMyCode: false` if you need to step into third-party code.

---

## 🛠 Configuration

`hbd` reads YAML configuration (optional). If `PyYAML` is not installed, built-in defaults are used.

Example configuration keys (see `hbd/config.py`):

- `hb_port`: UDP port to listen for heartbeats (default: 50003)
- `hbd_port`: internal control port (default: 50004)
- `hbd_host`: bind address for HTTP/WSS
- `pickfile`: path for persisted state
- `logfile`: path to log file
- `logfmt`: `text` or `msg`
- `pushsrv`: push service (`pushover`|`mattermost`|`all`)
- `interval` / `grace`: heartbeat timing configuration
- `dyndomains`: list of dyndomains to update via `nsupdate`
- `nsupdate_bin`: path to nsupdate binary
- `ws_port`: port for plain WebSocket connections (default: 50005)
- `wss_port`: port for secure WebSocket (WSS) connections (default: none).
  If set, `hbd` will attempt to serve WSS on this port when `wss_pem` and `wss_key` SSL files are available under `cert_path` (see below).
- `cert_path`: directory where TLS certificate and key are looked up (default: /usr/local/etc/ssl/)
- `wss_pem`: filename for the certificate chain (default: fullchain.pem)
- `wss_key`: filename for the private key (default: privkey.pem)

Example `.hb.yaml` (minimal):

```yaml
hbd_host: 0.0.0.0
hbd_port: 50004
dyndomains:
  - example.com
nsupdate_bin: /usr/bin/nsupdate
pushsrv: pushover
```

> Tip: `config.DEFAULTS` in `hbd/config.py` contains the canonical defaults and accepted configuration keys.

---

## 🔧 Architecture & Modules

- `hbd.proto`: serialization/deserialization of heartbeat messages (supports compressed payloads and plugin data)
- `hbd.udp`: UDP parsing and `handle_datagram` implementation (main state machine)
- `hbd.dns`: `create_nsupdate_payload`, `nsupdate`, and an asyncio DNS worker (`start_dns_worker`). The DNS worker now runs as an `asyncio` task and the package exposes a small thread-safe bridge so legacy synchronous code can `put()` updates into the queue; there is no longer a permanently-blocking background `threading.Thread`.
- `hbd.notify`: email and push notification helpers
- `hbd.ws`: WebSocket server and thread-safe broadcast helpers
- `hbd.http`: HTTP handler factory for the status UI/API
- `hbd.journal`: message journal with size-based log rotation and backup management
- `hbd.plugin`: plugin framework with base classes, registry, and dynamic loader
- `hbd.plugins/`: built-in plugins (os_info, cpu_monitor, memory_monitor, disk_monitor, network_monitor, filesystem_info, nagios_runner)
- `hbd.hbc`: heartbeat client that sends heartbeats and plugin data to the server
- `hbd.utils`: small utility helpers (`shortname`, `dur`, `initlog`)
- `hbd.cli`: CLI entrypoint and argument parsing
- `hbd.server`: async orchestration to run UDP/HTTP/WSS components

This modular layout makes the code easier to test and maintain.

**Runtime & Shutdown**

- The main runtime is asyncio-based. Services (UDP listener, HTTP server, WebSocket server, monitor, and DNS worker) run as asyncio tasks.
- On SIGINT/SIGTERM the server triggers a graceful shutdown: it cancels active tasks, signals the DNS worker via a sentinel, and cleans up resources before exit.
- The DNS update worker is implemented as an `asyncio` task; synchronous producers can still enqueue DNS updates via a small thread-safe bridge available at `hbd.hbdclass.Host.dnsQ`.

**Templates & Static Files**

- Template files are located under `hbd/templates` by default. The HTTP server resolves templates relative to the `hbd` package, but the path can be overridden with the `templates_dir` config key.
- Static assets (CSS/JS/images) are served from `hbd/static` via the `/static/` HTTP route. Place your static files in that directory or configure the HTTP server as needed.

---

## 🧪 Testing & Dev

Tests are implemented with `unittest`; you can also run them with `pytest` if you prefer, and some additional tests rely on it. To run tests locally without installing anything beyond the dev requirements:

```bash
# with project root on PYTHONPATH
PYTHONPATH=. python -m unittest discover -v

# or with pytest if installed
pytest -q
```

Developer tooling included:

- `pyproject.toml`: project metadata and dependencies
- `tox.ini`: convenience wrappers for running tests, lint, and mypy

To run linters and type checks locally:

```bash
# after installing dev deps
tox -e lint
tox -e mypy
```

---

## 🚀 Running in production

- Use your system service manager (systemd, launchd, etc.) to run `hbd` in the background.
- Ensure `nsupdate` and necessary credentials are available for dynamic DNS updates.
- Configure TLS for WSS if you enable secure websockets.

> Note: The project contains a small example for obtaining DNS-verified certs (certbot with RFC2136); see earlier commit history.

---

## 🤝 Contributing

Contributions welcome! Please:

1. Open an issue to discuss larger changes.
2. Create a topic branch and a clear PR.
3. Add tests for new features and run linters.
4. Keep changes focused and documented.

---

## 📜 License

This repository is licensed under the MIT license. See `LICENSE` for details.