# Heartbeat Daemon (hbd)

A lightweight UDP-based host monitoring system. Monitored hosts run a client (`hbc`) that sends periodic heartbeat packets and system metrics to a central server (`hbd`). The server tracks host reachability, evaluates metric thresholds, sends notifications, and serves a web dashboard.

---

## Architecture

```
  [ host running hbc ]                     [ server running hbd ]
┌────────────────────┐                 ┌────────────────────────────┐
│  heartbeat client  │    UDP 50003    │  heartbeat daemon          │
│                    │   ──────────>   │                            │
│  plugins:          │    HTB / PLG    │  host state tracking       │
│   - cpu_monitor    │                 │  threshold evaluation      │
│   - memory_monitor │   <──────────   │  DNS updates (nsupdate)    │
│   - disk_monitor   │   ACK/CMD/UPD   │  notifications             │
│   - nagios_runner  │                 │  web dashboard + REST API  │
│   - ...            │                 │  WebSocket live updates    │
└────────────────────┘                 └────────────────────────────┘
```

**Package:** `hbd` v5.3.4

**Python:** 3.11+

### Subpackages

| Package | Purpose |
|---|---|
| `hbd.common` | Protocol encoding/decoding, shared utilities |
| `hbd.server` | The `hbd` daemon |
| `hbd.client` | The `hbc` client |

---

## Installation

Dependencies are declared in `pyproject.toml`. Install into a virtualenv:

```bash
# Server + client
pip install .

# Using the install script
scripts/hb_install.sh
```

**Entry points:**

- `hbd`: server (`hbd.server.cli:main`)
- `hbc`: client (`hbd.client.main:main`)

**Runtime dependencies:**

| Component | Packages |
|---|---|
| Both | PyYAML ≥6.0 |
| Client | psutil ≥5.9.0 |
| Server | aiohttp ≥3.11, websockets ≥13.2, Jinja2 ≥3.1.6, ruamel.yaml ≥0.18, mattermostdriver ≥7.3.0, matrix-nio ≥0.24 |

---

## Server (`hbd`)

### Starting the server

```bash
# Foreground, verbose, with config file
hbd serve -c /etc/hb.yaml -f -v

# As a module
python -m hbd.server.cli serve -c /etc/hb.yaml
```

### CLI subcommands

| Command | Description |
|---|---|
| `hbd serve` | Start the daemon (default) |
| `hbd passwd` | Generate a password hash for config |
| `hbd notify` | Test notification channels |
| `hbd stop` | Stop a running daemon |
| `hbd reload` | Reload config (send SIGHUP) |
| `hbd restart` | Restart daemon |

### Configuration (`~/.hb.yaml`)

```yaml
# Network
hb_port: 50003        # UDP port for heartbeat messages
hbd_port: 50004       # HTTP API / web UI port
hbd_host: ""          # Bind address (empty = all interfaces)
ws_port: 50005        # WebSocket port (plain)
wss_port: ~           # WebSocket port (TLS; requires cert_path/wss_pem/wss_key)

# Timing
interval: 20          # Expected heartbeat interval (seconds)
grace: 2              # Extra seconds before declaring a host overdue

# Persistence
pickfile: ~/.hb.pick  # Host state persistence
pidfile: ~/.hb.pid
logfile: ~/.hb.log

# Message journal
journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600   # 100 MB
journal_max_backups: 10

# DNS
nsupdate_bin: /usr/bin/nsupdate
dyndomains:
  - example.com

# Threshold alert re-notification interval (seconds)
threshold_renotify_interval: 3600

# Notification channels
notification_channels:
  pushover_ops:
    type: pushover
    token: YOUR_APP_TOKEN
    user: YOUR_USER_KEY
  email_ops:
    type: email
    smtp_server: smtp.example.com
    port: 587
    user: alerts@example.com
    password: secret
    recipients: [ops@example.com]

# Users
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...   # generate with: hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]
  bob:
    password: pbkdf2:sha256:...
    notification_channels: [email_ops]

default_owner: alice

# Hosts
hosts:
  webserver01:
    dyndns: true      # Update DNS when address changes
    owner: alice
    managers: [bob]
    monitors: []
  database01:
    watch: false      # Suppress all notifications for this host
```

Send SIGHUP (or `hbd reload`) to reload configuration without restarting. Changes to ports, certificates, pickle path, and journal path require a full restart.

### Persistence

Host state (reachability, plugin data, alert states) is saved to `pickfile` every 5 minutes and on clean shutdown. The server loads this state on startup.

---

## Client (`hbc`)

### Usage

```bash
# Basic: send heartbeats to a server
hbc your-server.example.com

# Multiple servers
hbc server1.example.com server2.example.com

# With config file, running as a daemon
hbc -d -c /etc/hbc.yaml your-server.example.com

# Send a boot message, then heartbeat normally
hbc -b your-server.example.com

# One-off message
hbc -m "maintenance starting" your-server.example.com

# Force IPv4 or IPv6 only
hbc -4 your-server.example.com
hbc -6 your-server.example.com
```

### Options

| Flag | Description |
|---|---|
| `-b`, `--boot` | Send a boot message at startup |
| `-c`, `--config FILE` | Config file path (default: `~/.hbc.yaml`) |
| `-d`, `--daemon` | Daemonize (logs go to syslog) |
| `-m`, `--message TEXT` | Send a one-off message and exit |
| `-n`, `--name NAME` | Override reported hostname |
| `-v`, `--verbose` | Verbose output |
| `-x`, `--debug` | Debug level (repeatable) |
| `-4` / `-6` | Restrict to IPv4 or IPv6 |

### Configuration (`~/.hbc.yaml`)

```yaml
hb_port: 50003        # Server UDP port
interval: 10          # Heartbeat interval (seconds)
owner: alice          # Optional: claim ownership of this host

plugins:
  cpu_monitor:
    interval: 300     # Override collection interval
    per_core: true    # Report per-core CPU usage
  memory_monitor:
    interval: 300
  disk_monitor:
    interval: 300
  network_monitor:
    interval: 300
  ping_monitor:
    interval: 60
    hosts: [8.8.8.8, 192.168.1.1]
  nagios_runner:
    interval: 300
    commands:
      - name: check_load
        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - name: check_disk_root
        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
  zfs_monitor:
    interval: 300
```

### Connection behaviour

- The client sends heartbeats over UDP to each server address resolved from the hostname (IPv4 and IPv6).
- If a connection fails to open at startup, IPv6 connections are dropped after 3 consecutive failures. IPv4 connections retry indefinitely.
- In daemon mode (`-d`), all log output goes to syslog (`LOG_DAEMON` facility).

---

## UDP Protocol

All messages are zlib-compressed key=value pairs with an ID prefix.

```
!:
```

Payload format: `key=value;key=value;...`

| Message | Direction | Purpose |
|---|---|---|
| `HTB` | client → server | Heartbeat (name, timestamp, RTT, acks, interval) |
| `PLG` | client → server | Plugin data (plugin name + metrics) |
| `ACK` | server → client | Acknowledgment |
| `CMD` | server → client | Execute a shell command on the client |
| `UPD` | server → client | Trigger self-update via `hb_install.sh` |

Value encoding:

- Floats: 5 decimal places
- Lists/dicts: JSON prefixed with `@`
- Booleans: `1` / `0`

RTT is measured using kernel SO_TIMESTAMP when available (Linux, macOS, FreeBSD), falling back to application-layer timing.

---

## Plugin System

Plugins run on the client and collect system metrics that are sent to the server as `PLG` messages.
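
As a rough illustration of the `PLG` path, the sketch below serialises one plugin's metrics using the value-encoding rules from the UDP Protocol section (floats to 5 decimal places, lists/dicts as `@`-prefixed JSON, booleans as `1`/`0`) and zlib-compresses the result. The helper names `encode_value` and `encode_payload` and the sample field names are hypothetical, and the ID prefix framing is deliberately left out; this is not the `hbd.common` implementation.

```python
import json
import zlib


def encode_value(value):
    """Encode one value following the documented value-encoding rules."""
    # bool must be tested before numeric types: bool is a subclass of int.
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, float):
        return f"{value:.5f}"
    if isinstance(value, (list, dict)):
        # Compact separators keep the UDP payload small (an assumption).
        return "@" + json.dumps(value, separators=(",", ":"))
    return str(value)


def encode_payload(fields):
    """Join fields as key=value;key=value;... (no escaping is attempted)."""
    return ";".join(f"{key}={encode_value(val)}" for key, val in fields.items())


# Hypothetical metrics, shaped like what a cpu_monitor-style plugin might return.
metrics = {
    "plugin": "cpu_monitor",
    "cpu_percent": 12.5,
    "per_core": [10.0, 15.0],
    "throttled": False,
}

payload = encode_payload(metrics)
compressed = zlib.compress(payload.encode("utf-8"))
print(payload)
```

The naive joining assumes keys and values never contain `;` or `=`; how the real encoder deals with such characters is not covered here.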

### Plugin types

| Type | `interval` | When collected |
|---|---|---|
| `InfoPlugin` | 0 | Once at startup; re-collected on server request |
| `MonitorPlugin` | 30 (default) | Periodically on the configured interval |

### Built-in plugins

| Plugin | Type | Data collected |
|---|---|---|
| `os_info` | Info | OS, kernel, distro, architecture, Python version, hbc version |
| `cpu_monitor` | Monitor | cpu_percent, per-core usage, load averages, process count, frequency |
| `memory_monitor` | Monitor | RAM and swap usage (ZFS ARC-aware) |
| `disk_monitor` | Monitor | Per-partition usage, disk I/O stats |
| `network_monitor` | Monitor | Per-interface byte/packet counts, connection count |
| `ping_monitor` | Monitor | RTT, packet loss, jitter per configured host |
| `filesystem_info` | Info | Mounted filesystems (excludes pseudo filesystems) |
| `nagios_runner` | Monitor | Output of configured Nagios-compatible check commands |
| `zfs_monitor` | Monitor | ZFS pool health, capacity, fragmentation, dedup ratio, I/O |

### Custom plugins

Create a `.py` file in `hbd/client/plugins/`:

```python
from hbd.client.plugin import MonitorPlugin


class MyPlugin(MonitorPlugin):
    name = "my_plugin"
    interval = 60  # collection interval in seconds

    async def collect(self):
        # Return a dict of metric name -> value.
        return {"my_metric": 42}
```

`initialize()` is called once at load time; return `False` to disable the plugin (e.g., if a required binary is missing).

### Nagios integration

The `nagios_runner` plugin executes any Nagios-compatible check binary:

```yaml
plugins:
  nagios_runner:
    commands:
      - name: check_http
        command: /usr/lib/nagios/plugins/check_http -H example.com
```

- Commands are validated (absolute paths, executable) at startup.
- Exit codes map to OK / WARNING / CRITICAL / UNKNOWN.
- Performance data fields are extracted and stored individually.
- The `nagios` threshold operator maps exit codes directly to alert levels (see Threshold Alerting).
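
The exit-code mapping and performance-data extraction described above can be sketched roughly as follows. This is an illustration, not the `nagios_runner` source: the function names `parse_check_output` and `run_check` are hypothetical, the parser handles only simple unquoted `label=value[unit];...` perfdata on a single line, and treating exit codes outside 0-3 as UNKNOWN is an assumption.

```python
import re
import shlex
import subprocess

STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")


def parse_check_output(exit_code, stdout):
    """Split 'TEXT | label=value[unit];warn;crit;min;max ...' into fields."""
    text, _, perf = stdout.strip().partition("|")
    perfdata = {}
    for item in perf.split():
        label, sep, rest = item.partition("=")
        if not sep:
            continue
        # Keep only the leading number of the first ';'-separated field,
        # dropping any unit suffix such as MB, % or s.
        match = _NUMBER.match(rest.split(";")[0])
        if match:
            perfdata[label] = float(match.group())
    return {
        "status_code": exit_code,
        "status": STATES.get(exit_code, "UNKNOWN"),  # assumption for codes > 3
        "output": text.strip(),
        "perfdata": perfdata,
    }


def run_check(command, timeout=30):
    """Run one check command (no shell) and parse its result."""
    proc = subprocess.run(
        shlex.split(command), capture_output=True, text=True, timeout=timeout
    )
    return parse_check_output(proc.returncode, proc.stdout)


sample = "DISK WARNING - free space: / 1800 MB (18%) | /=8200MB;8000;9000;0;10000"
print(parse_check_output(1, sample))
```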

---

## Threshold Alerting

The server evaluates plugin metrics against configurable thresholds and fires notifications on state changes.

### Configuration

```yaml
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      critical: 90.0
      operator: ">"       # >, >=, <, <=, ==, != (default: >)
      hysteresis: 0.1     # 10%: recover at 81 when critical=90
      count: 1            # Require N consecutive breaches before alerting
      display: "CPU {cpu_percent}% (threshold: {op_symbol}{threshold_value})"
  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0
  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0
          critical: 5.0
          operator: "<"
  nagios_runner:
    status_code:
      operator: "nagios"  # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
      display: "{check_name}: {output}"
```

### Per-host threshold profiles

Named profiles let different hosts use different thresholds. A single name or a list is accepted; lists are applied left-to-right.

```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

hosts:
  web-01:
    threshold_config: default
  db-01:
    threshold_config: [default, tight_cpu]
```

### Alert states

| State | Meaning |
|---|---|
| OK | Metric within normal range |
| WARNING | Metric crossed warning threshold |
| CRITICAL | Metric crossed critical threshold |
| UNKNOWN | Cannot determine (e.g. Nagios exit code 3) |

Notifications are sent on state transitions (OK → WARNING, WARNING → CRITICAL, CRITICAL → OK). De-escalations (CRITICAL → WARNING) do not trigger a notification.

Ongoing alerts generate a re-notification every `threshold_renotify_interval` seconds (default: 3600). Alerts can be acknowledged via the web UI or API to suppress re-notifications.
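
To make the interaction between thresholds, hysteresis, and notifications more concrete, here is a minimal sketch of one possible evaluation routine. It is not the server's code. The function names `evaluate` and `should_notify` are hypothetical. Hysteresis is interpreted as a fraction of the threshold that relaxes the limit toward the normal side, matching the config comment "recover at 81 when critical=90"; applying it to the warning level in the same way is an assumption, as is notifying on WARNING → OK. The `count` option and the `==`, `!=`, and `nagios` operators are omitted.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def evaluate(value, cfg, previous="OK"):
    """Return the new alert state for one metric value."""
    op = cfg.get("operator", ">")
    breach = OPS[op]
    hysteresis = cfg.get("hysteresis", 0.0)
    # For "greater than" operators the recovery limit sits below the
    # threshold; for "less than" operators it sits above it.
    factor = 1 - hysteresis if op in (">", ">=") else 1 + hysteresis

    for level in ("critical", "warning"):
        if level not in cfg:
            continue
        state = level.upper()
        limit = cfg[level]
        if breach(value, limit):
            return state
        # Already at this level or worse: hold the state until the value
        # clears the hysteresis band.
        held = SEVERITY.get(previous, 0) >= SEVERITY[state]
        if held and breach(value, limit * factor):
            return state
    return "OK"


def should_notify(old, new):
    """Notify on state changes, except the CRITICAL -> WARNING de-escalation."""
    return old != new and not (old == "CRITICAL" and new == "WARNING")


cfg = {"warning": 80.0, "critical": 90.0, "operator": ">", "hysteresis": 0.1}
state = "OK"
for sample in (85.0, 95.0, 85.0, 80.5, 70.0):
    new_state = evaluate(sample, cfg, state)
    print(sample, state, "->", new_state, "notify" if should_notify(state, new_state) else "")
    state = new_state
```

In this sketch a host at CRITICAL stays CRITICAL at 85 because the value has not yet dropped to the recovery limit of 81.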

### RTT thresholds

The server measures heartbeat round-trip time and supports RTT thresholds using the same format:

```yaml
thresholds:
  rtt:
    webserver01:
      warning: 100.0   # ms
      critical: 500.0
```

### Generic threshold matching

When a metric has no exact threshold entry, the server strips leading segments and retries. This allows one entry to cover all Nagios checks:

```
nagios_runner.check_disk_root_status_code  → no match
nagios_runner.disk_root_status_code        → no match
nagios_runner.root_status_code             → no match
nagios_runner.status_code                  → matched ✓
```

The stripped prefix (`check_disk_root`) is available as `{check_name}` in the `display` template.

### Display template variables

| Variable | Description |
|---|---|
| `{value}` | Current metric value |
| `{threshold_value}` | Threshold that was crossed |
| `{op_symbol}` | Comparison operator |
| `{check_name}` | Prefix stripped by generic matching |
| `{metric_name}` | Full field name |
| `{output}` | Nagios check output text |
| `{status}` | Nagios status name (OK/WARNING/CRITICAL/UNKNOWN) |
| any plugin field | Any field present in the plugin's data |

---

## Notification Channels

Notifications are dispatched to the host's owner, managers, and monitors. Each user specifies which channels to use.

### Supported channel types

| Type | Required fields |
|---|---|
| `pushover` | `token`, `user` |
| `email` | `smtp_server`, `recipients`, `sender`, `user`, `password`, `port` |
| `mattermost` | `webhook_url`, `channel` |
| `matrix` | `homeserver`, `user`, `password`, `room_id` |
| `signal` | `phone_number`, `recipient` |
| `sms_voipms` | `api_key`, `recipient` |

Each channel can set a `min_level` (`WARNING` or `CRITICAL`) to filter low-severity alerts. Recovery notifications are only sent to channels that received the original alert.

---

## Web Dashboard & HTTP API

The server exposes a web UI and REST API on `hbd_port` (default 50004).

### Web pages

| Path | Description |
|---|---|
| `/login` | Login form (shown automatically when auth is configured) |
| `/live` | Real-time host connectivity, RTT, and message stream |
| `/plugins/` | Per-host plugin metrics |
| `/alerts` | Active alerts with severity filtering |
| `/settings` | Server config, users, notification channels, thresholds |

Live views use WebSocket connections for real-time updates. Non-admin users see only hosts where they have a role (monitor, manager, or owner). Admins see all hosts.

### REST API

All endpoints are under `/api/0/`. When authentication is configured, include a session token:

```bash
# Log in, get a token
TOKEN=$(curl -s -X POST http://localhost:50004/api/0/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"alice","password":"secret"}' | jq -r .token)

# Use the token
curl -H "Authorization: Bearer $TOKEN" http://localhost:50004/api/0/hosts
```

| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/0/hosts` | All visible hosts |
| GET | `/api/0/alerts` | All active alerts |
| GET | `/api/0/alert_summary` | Count of ok/warning/critical |
| GET | `/api/0/messages` | Last 30 messages |
| GET | `/api/0/hosts/{host}/plugins` | All plugin data for host |
| GET | `/api/0/hosts/{host}/plugins/{plugin}?limit=N` | Plugin samples |
| GET | `/api/0/hosts/{host}/alerts` | Alert states for host |
| GET | `/api/0/hosts/{host}/access` | Access roles |
| PUT | `/api/0/hosts/{host}/access` | Update access roles |
| GET | `/api/0/hosts/{host}/info` | Host info (hbc version, thresholds) |
| POST | `/api/0/alerts/acknowledge` | Acknowledge alert |
| GET | `/api/0/users` | All users (admin only) |
| GET | `/api/0/users/me` | Current user profile |
| PUT | `/api/0/users/me` | Update own profile |
| POST | `/api/0/auth/login` | Create session |
| POST | `/api/0/auth/logout` | Destroy session |
| GET | `/api/0/config` | Server config (secrets redacted) |
| POST | `/api/0/config` | Update config |
| GET | `/api/0/config/backups` | List config backups |
| POST | `/api/0/config/rollback` | Roll back to previous config |
| GET | `/api/0/notification_channels` | List channels |
| POST | `/api/0/notification_channels` | Create channel |
| PUT | `/api/0/notification_channels/{name}` | Update channel |
| DELETE | `/api/0/notification_channels/{name}` | Delete channel |

---

## User Management & Authentication

When no `users:` block is in config, the server runs unauthenticated; all existing behaviour is preserved.

### Roles

| Role | Capabilities |
|---|---|
| monitor | View status, plugin data, alerts |
| manager | monitor + queue commands, trigger DNS, queue upgrades |
| owner | manager + drop host, transfer ownership, update access |
| admin | Owner-level on all hosts + access to server config and users |

### Setup

```yaml
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...   # hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]

default_owner: alice   # Owns any host with no explicit owner

hosts:
  webserver01:
    owner: alice
    managers: [bob]
    monitors: [carol]
```

Password hashing uses PBKDF2-HMAC-SHA256 (260,000 iterations). Sessions expire after 24 hours.

OAuth2 login (Gitea) is supported:

```yaml
oauth:
  gitea:
    url: https://git.example.com
    client_id: xxx
    client_secret: yyy
```

---

## Dynamic DNS

When `dyndns: true` is set on a host and `dyndomains` is configured, the server updates DNS via `nsupdate` whenever the host's source address changes.

```yaml
nsupdate_bin: /usr/bin/nsupdate
dyndomains:
  - example.com

hosts:
  webserver01:
    dyndns: true
```

DNS updates run asynchronously in a background worker.

---

## Message Journal

All received messages are logged in JSONL format with automatic size-based rotation.
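
Because the journal is plain JSONL, it can be inspected with a few lines of Python. The sketch below counts messages per host and message ID in a single journal file; the function name `count_messages` is hypothetical, it assumes entries shaped like the example entry in this section (a `message` object with `ID` and `name`), and it does not look at rotated backups.

```python
import json
from collections import Counter


def count_messages(path):
    """Count journal entries per (host name, message ID)."""
    counts = Counter()
    with open(path, encoding="utf-8") as journal:
        for line in journal:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a truncated or corrupt line
            message = entry.get("message", {})
            counts[(message.get("name"), message.get("ID"))] += 1
    return counts


if __name__ == "__main__":
    # Path built from the journal_dir and journal_file values shown in the config.
    for (host, msg_id), n in count_messages("/var/log/heartbeat/messages.journal").items():
        print(host, msg_id, n)
```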

```yaml
journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600   # 100 MB
journal_max_backups: 10
```

Example entry:

```json
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver01","interval":10}}
```

---

## `hbc_mini`: Zero-dependency client

`scripts/hbc_mini.py` is a single-file client requiring only Python 3.8+ and no external packages. Copy it to any host and run directly.

```bash
python3 hbc_mini.py your-server.example.com
python3 hbc_mini.py -d your-server.example.com   # daemon mode
python3 hbc_mini.py -b your-server.example.com   # send boot message
```

Config: `~/.hbc.json` (JSON format, same keys as `~/.hbc.yaml`).

**Available plugins:**

| Plugin | Platform |
|---|---|
| `os_info` | All |
| `ping_monitor` | All |
| `nagios_runner` | All (not Windows) |
| `cpu_monitor` | Linux (`/proc/stat`; no per-core, no frequency) |
| `memory_monitor` | Linux (`/proc/meminfo`) |
| `disk_monitor` | Linux, macOS, BSD (`df -P`) |
| `network_monitor` | Linux (`/proc/net/dev`) |

Not available vs full `hbc`: no YAML config, no `filesystem_info`, no `zfs_monitor`, no IPv6 early-fail protection.

---

## `hbc_mini.c`: C client

`scripts/c/hbc_mini.c` is a single-file C port of `hbc_mini.py`. It has no runtime dependencies beyond libc, zlib, pthreads, and libm, and runs on Linux, FreeBSD, NetBSD, and DragonFly BSD.

### Build

```bash
cc -O2 -o hbc_mini scripts/c/hbc_mini.c -lz -lpthread -lm
```

### Usage

The CLI is identical to `hbc_mini.py`:

```bash
./hbc_mini your-server.example.com
./hbc_mini -d your-server.example.com          # daemon mode (logs to syslog)
./hbc_mini -b your-server.example.com          # send boot message
./hbc_mini -m "note" your-server.example.com   # send one-shot message
./hbc_mini -4 your-server.example.com          # IPv4 only
./hbc_mini -6 your-server.example.com          # IPv6 only
```

Config: `~/.hbc.json` (JSON, same keys as the Python version).

### Architecture

The C client uses two threads:

- **Main thread**: heartbeat sender loop + `select()`-based receive loop (1 s timeout). Sends `HTB` at the configured interval, receives `ACK`/`CMD` messages, and re-sends `os_info` on server request.
- **Monitor thread**: all periodic plugins in a single thread with a 1-second sleep loop. Each plugin has its own next-run timestamp tracked independently.

SIGHUP causes the process to restart itself via `execv()`. SIGTERM/SIGINT trigger a clean shutdown (sends a shutdown heartbeat if `-b` was used).

### Available plugins

| Plugin | Platform | Data source |
|---|---|---|
| `os_info` | Linux, FreeBSD, NetBSD, DragonFly | `uname(2)`, `/etc/os-release`, `kern.osrelease` sysctl |
| `cpu_monitor` | Linux | `/proc/stat` |
| `cpu_monitor` | FreeBSD, DragonFly, NetBSD | `kern.cp_time` sysctl |
| `memory_monitor` | Linux | `/proc/meminfo` (ZFS ARC-aware) |
| `memory_monitor` | FreeBSD, DragonFly | `vm.stats.vm.*` sysctl |
| `memory_monitor` | NetBSD | `VM_UVMEXP` sysctl |
| `disk_monitor` | All | `df -P` subprocess |
| `network_monitor` | Linux | `/proc/net/dev` |
| `network_monitor` | FreeBSD, NetBSD, DragonFly | `getifaddrs()` + `AF_LINK` |
| `ping_monitor` | All | `ping` subprocess |
| `nagios_runner` | All | `popen()` subprocess |

`cpu_monitor` reports: `cpu_percent`, `cpu_user`, `cpu_system`, `cpu_idle`, `cpu_iowait` (Linux only), load averages, `cpu_core_count`, `uptime_seconds`.

`memory_monitor` reports: `memory_total`, `memory_used`, `memory_available`, `memory_free`, `memory_percent`, and swap fields when swap is present.

`network_monitor` reports per-interface cumulative `bytes_recv`/`bytes_sent` and interval deltas. The loopback interface (`lo`) is skipped by default; this is configurable:

```json
{
  "plugins": {
    "network_monitor": {
      "skip_interfaces": ["lo", "docker0"]
    }
  }
}
```

`disk_monitor` reports per-mount `total`, `used`, `free`, `percent`. An optional mount filter restricts reporting to specific paths:

```json
{
  "plugins": {
    "disk_monitor": {
      "mounts": ["/", "/data"]
    }
  }
}
```

### Differences from `hbc_mini.py`

- No `filesystem_info` or `zfs_monitor` plugins
- `UPD` (self-update) messages are logged but not acted on
- No IPv6 early-fail protection
- Config is JSON only (`~/.hbc.json`), no YAML

---

## Development

### Running tests

```bash
PYTHONPATH=. python -m unittest discover -v
# or
pytest -q
```

### Linting and type checking

```bash
tox -e lint
tox -e mypy
```

### Debugging in VS Code

A `.vscode/launch.json` is included with configurations for running and attaching the debugger. Select the project `.venv` as the Python interpreter, then use F5.

To start with debugpy and wait for attach:

```bash
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.server.cli serve -c .hb.yaml -f -v
```

---

## License

MIT. See `LICENSE` for details.