Andreas Wrede 8a1f412d1d
version 5.3.9
2026-05-31 20:58:58 -04:00


# Heartbeat Daemon (hbd)
A lightweight UDP-based host monitoring system. Monitored hosts run a client (`hbc`) that sends periodic heartbeat packets and system metrics to a central server (`hbd`). The server tracks host reachability, evaluates metric thresholds, sends notifications, and serves a web dashboard.
---
## Architecture
```
 [ host running hbc ]                    [ server running hbd ]
┌────────────────────┐               ┌────────────────────────────┐
│ heartbeat client   │   UDP 50003   │ heartbeat daemon           │
│                    │  ──────────>  │                            │
│ plugins:           │   HTB / PLG   │ host state tracking        │
│  - cpu_monitor     │               │ threshold evaluation       │
│  - memory_monitor  │  <──────────  │ DNS updates (nsupdate)     │
│  - disk_monitor    │  ACK/CMD/UPD  │ notifications              │
│  - nagios_runner   │               │ web dashboard + REST API   │
│  - ...             │               │ WebSocket live updates     │
└────────────────────┘               └────────────────────────────┘
```
**Package:** `hbd` v5.3.9
**Python:** 3.11+
### Subpackages
| Package | Purpose |
|---|---|
| `hbd.common` | Protocol encoding/decoding, shared utilities |
| `hbd.server` | The `hbd` daemon |
| `hbd.client` | The `hbc` client |
---
## Installation
Dependencies are declared in `pyproject.toml`. Install into a virtualenv:
```bash
# Server + client
pip install .
# Using the install script
scripts/hb_install.sh
```
**Entry points:**
- `hbd` — server (`hbd.server.cli:main`)
- `hbc` — client (`hbd.client.main:main`)
**Runtime dependencies:**
| Component | Packages |
|---|---|
| Both | PyYAML ≥6.0 |
| Client | psutil ≥5.9.0 |
| Server | aiohttp ≥3.11, websockets ≥13.2, Jinja2 ≥3.1.6, ruamel.yaml ≥0.18, mattermostdriver ≥7.3.0, matrix-nio ≥0.24 |
---
## Server (`hbd`)
### Starting the server
```bash
# Foreground, verbose, with config file
hbd serve -c /etc/hb.yaml -f -v
# As a module
python -m hbd.server.cli serve -c /etc/hb.yaml
```
### CLI subcommands
| Command | Description |
|---|---|
| `hbd serve` | Start the daemon (default) |
| `hbd passwd <username>` | Generate a password hash for config |
| `hbd notify` | Test notification channels |
| `hbd stop` | Stop a running daemon |
| `hbd reload` | Reload config (send SIGHUP) |
| `hbd restart` | Restart daemon |
### Configuration (`~/.hb.yaml`)
```yaml
# Network
hb_port: 50003 # UDP port for heartbeat messages
hbd_port: 50004 # HTTP API / web UI port
hbd_host: "" # Bind address (empty = all interfaces)
ws_port: 50005 # WebSocket port (plain)
wss_port: ~ # WebSocket port (TLS; requires cert_path/wss_pem/wss_key)
# Timing
interval: 20 # Expected heartbeat interval (seconds)
grace: 2 # Extra seconds before declaring a host overdue
# Persistence
pickfile: ~/.hb.pick # Host state persistence
pidfile: ~/.hb.pid
logfile: ~/.hb.log
# Message journal
journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600 # 100 MB
journal_max_backups: 10
# DNS
nsupdate_bin: /usr/bin/nsupdate
dyndomains:
- example.com
# Threshold alert re-notification interval (seconds)
threshold_renotify_interval: 3600
# Notification channels
notification_channels:
  pushover_ops:
    type: pushover
    token: YOUR_APP_TOKEN
    user: YOUR_USER_KEY
  email_ops:
    type: email
    smtp_server: smtp.example.com
    port: 587
    user: alerts@example.com
    password: secret
    recipients: [ops@example.com]
# Users
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:... # generate with: hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]
  bob:
    password: pbkdf2:sha256:...
    notification_channels: [email_ops]
default_owner: alice
# Hosts
hosts:
  webserver01:
    dyndns: true # Update DNS when address changes
    owner: alice
    managers: [bob]
    monitors: []
  database01:
    watch: false # Suppress all notifications for this host
```
Send SIGHUP (or run `hbd reload`) to reload the configuration without restarting. Changes to ports, certificates, the pickle path, and the journal path require a full restart.
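The `interval` and `grace` settings together decide when a host counts as overdue. A minimal sketch of that check, assuming a host becomes overdue once more than `interval + grace` seconds have passed since its last heartbeat; the function name and the exact comparison are illustrative and not taken from the `hbd` source:

```python
import time


def is_overdue(last_seen, now=None, interval=20, grace=2):
    """Return True when no heartbeat arrived within interval + grace seconds.

    Assumption: the server compares the age of the last heartbeat against
    interval + grace. The real daemon may use a different rule.
    """
    if now is None:
        now = time.time()
    return (now - last_seen) > (interval + grace)
```

With the defaults above, a host last seen 21 seconds ago is still within the limit, while one last seen 23 seconds ago is overdue.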
### Persistence
Host state (reachability, plugin data, alert states) is saved to `pickfile` every 5 minutes and on clean shutdown. The server loads this state on startup.
---
## Client (`hbc`)
### Usage
```bash
# Basic — send heartbeats to a server
hbc your-server.example.com
# Multiple servers
hbc server1.example.com server2.example.com
# With config file, running as a daemon
hbc -d -c /etc/hbc.yaml your-server.example.com
# Send a boot message, then heartbeat normally
hbc -b your-server.example.com
# One-off message
hbc -m "maintenance starting" your-server.example.com
# Force IPv4 or IPv6 only
hbc -4 your-server.example.com
hbc -6 your-server.example.com
```
### Options
| Flag | Description |
|---|---|
| `-b`, `--boot` | Send a boot message at startup |
| `-c`, `--config FILE` | Config file path (default: `~/.hbc.yaml`) |
| `-d`, `--daemon` | Daemonize (logs go to syslog) |
| `-m`, `--message TEXT` | Send a one-off message and exit |
| `-n`, `--name NAME` | Override reported hostname |
| `-v`, `--verbose` | Verbose output |
| `-x`, `--debug` | Debug level (repeatable) |
| `-4` / `-6` | Restrict to IPv4 or IPv6 |
### Configuration (`~/.hbc.yaml`)
```yaml
hb_port: 50003 # Server UDP port
interval: 10 # Heartbeat interval (seconds)
owner: alice # Optional: claim ownership of this host
plugins:
  cpu_monitor:
    interval: 300 # Override collection interval
    per_core: true # Report per-core CPU usage
  memory_monitor:
    interval: 300
  disk_monitor:
    interval: 300
  network_monitor:
    interval: 300
  ping_monitor:
    interval: 60
    hosts: [8.8.8.8, 192.168.1.1]
  nagios_runner:
    interval: 300
    commands:
      - name: check_load
        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - name: check_disk_root
        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
  zfs_monitor:
    interval: 300
```
### Connection behaviour
- The client sends heartbeats over UDP to each server address resolved from the hostname (IPv4 and IPv6).
- If a connection fails to open at startup, the client retries it. An IPv6 connection is dropped after 3 consecutive failures; IPv4 connections retry indefinitely.
- In daemon mode (`-d`), all log output goes to syslog (`LOG_DAEMON` facility).
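The address handling described above can be sketched with the standard `socket` module. This is an illustration under assumptions, not the `hbc` implementation: `resolve_targets` and `keep_retrying` are hypothetical names, and the failure limit of 3 is taken from the list above.

```python
import socket

MAX_IPV6_FAILURES = 3  # from the behaviour described above


def resolve_targets(host, port=50003):
    """Return unique (family, sockaddr) pairs for UDP sends to host."""
    targets = []
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_DGRAM
    ):
        if family not in (socket.AF_INET, socket.AF_INET6):
            continue
        if (family, sockaddr) not in targets:
            targets.append((family, sockaddr))
    return targets


def keep_retrying(family, consecutive_failures):
    """IPv6 targets are given up after 3 failures; IPv4 targets never are."""
    if family == socket.AF_INET6:
        return consecutive_failures < MAX_IPV6_FAILURES
    return True
```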
---
## UDP Protocol
All messages are zlib-compressed key=value pairs with an ID prefix.
```
!<ID>: <zlib-compressed payload>
```
Payload format: `key=value;key=value;...`
| Message | Direction | Purpose |
|---|---|---|
| `HTB` | client → server | Heartbeat (name, timestamp, RTT, acks, interval) |
| `PLG` | client → server | Plugin data (plugin name + metrics) |
| `ACK` | server → client | Acknowledgment |
| `CMD` | server → client | Execute a shell command on the client |
| `UPD` | server → client | Trigger self-update via `hb_install.sh` |
Value encoding:
- Floats: 5 decimal places
- Lists/dicts: JSON prefixed with `@`
- Booleans: `1` / `0`
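
A rough sketch of the framing and value encoding listed above. It assumes the literal layout `!<ID>: ` followed by the compressed bytes, and it does not escape `;` or `=` inside values; the real `hbd.common` codec may differ on both points. The function names are illustrative.

```python
import json
import zlib


def encode_value(value):
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, float):
        return f"{value:.5f}"  # floats: 5 decimal places
    if isinstance(value, (list, dict)):
        return "@" + json.dumps(value, separators=(",", ":"))
    return str(value)


def encode_message(msg_id, fields):
    payload = ";".join(f"{key}={encode_value(val)}" for key, val in fields.items())
    return b"!" + msg_id.encode("ascii") + b": " + zlib.compress(payload.encode("utf-8"))


def decode_message(packet):
    header, sep, body = packet.partition(b": ")
    if not sep or not header.startswith(b"!"):
        raise ValueError("not a heartbeat packet")
    payload = zlib.decompress(body).decode("utf-8")
    fields = {}
    for pair in filter(None, payload.split(";")):
        key, _, raw = pair.partition("=")
        # Only '@' values can be typed reliably; everything else stays a string.
        fields[key] = json.loads(raw[1:]) if raw.startswith("@") else raw
    return header[1:].decode("ascii"), fields
```

Scalars come back as strings because the wire format alone does not distinguish a boolean `1` from an integer `1`.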
RTT is measured using kernel SO_TIMESTAMP when available (Linux, macOS, FreeBSD), falling back to application-layer timing.
---
## Plugin System
Plugins run on the client and collect system metrics that are sent to the server as `PLG` messages.
### Plugin types
| Type | `interval` | When collected |
|---|---|---|
| `InfoPlugin` | 0 | Once at startup; re-collected on server request |
| `MonitorPlugin` | 30 (default) | Periodically on the configured interval |
### Built-in plugins
| Plugin | Type | Data collected |
|---|---|---|
| `os_info` | Info | OS, kernel, distro, architecture, Python version, hbc version |
| `cpu_monitor` | Monitor | cpu_percent, per-core usage, load averages, process count, frequency |
| `memory_monitor` | Monitor | RAM and swap usage (ZFS ARC-aware) |
| `disk_monitor` | Monitor | Per-partition usage, disk I/O stats |
| `network_monitor` | Monitor | Per-interface byte/packet counts, connection count |
| `ping_monitor` | Monitor | RTT, packet loss, jitter per configured host |
| `filesystem_info` | Info | Mounted filesystems (excludes pseudo filesystems) |
| `nagios_runner` | Monitor | Output of configured Nagios-compatible check commands |
| `zfs_monitor` | Monitor | ZFS pool health, capacity, fragmentation, dedup ratio, I/O |
### Custom plugins
Create a `.py` file in `hbd/client/plugins/`:
```python
from hbd.client.plugin import MonitorPlugin


class MyPlugin(MonitorPlugin):
    name = "my_plugin"
    interval = 60

    async def collect(self):
        return {"my_metric": 42}
```
`initialize()` is called once at load time; return `False` to disable the plugin (e.g., if a required binary is missing).
### Nagios integration
The `nagios_runner` plugin executes any Nagios-compatible check binary:
```yaml
plugins:
  nagios_runner:
    commands:
      - name: check_http
        command: /usr/lib/nagios/plugins/check_http -H example.com
```
- Commands are validated (absolute paths, executable) at startup.
- Exit codes map to OK / WARNING / CRITICAL / UNKNOWN.
- Performance data fields are extracted and stored individually.
- The `nagios` threshold operator maps exit codes directly to alert levels (see Threshold Alerting).
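
The exit-code mapping and performance-data extraction can be illustrated as follows. This is a simplified sketch based on the standard Nagios plugin output format (`text | label=value;warn;crit;min;max`), not the `nagios_runner` source. It reads only the first output line and does not handle quoted labels that contain spaces.

```python
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def parse_check_result(exit_code, output):
    """Split one line of check output into status, text, and perfdata values."""
    lines = output.strip().splitlines()
    first_line = lines[0] if lines else ""
    text, _, perfdata = first_line.partition("|")
    fields = {}
    for item in perfdata.split():
        label, sep, rest = item.partition("=")
        if sep:
            # Keep only the value; warn/crit/min/max follow after ';'.
            fields[label.strip("'")] = rest.split(";")[0]
    return {
        "status_code": exit_code,
        "status": NAGIOS_STATES.get(exit_code, "UNKNOWN"),
        "output": text.strip(),
        "perfdata": fields,
    }
```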
---
## Threshold Alerting
The server evaluates plugin metrics against configurable thresholds and fires notifications on state changes.
### Configuration
```yaml
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      critical: 90.0
      operator: ">" # >, >=, <, <=, ==, != (default: >)
      hysteresis: 0.1 # 10%: recover at 81 when critical=90
      count: 1 # Require N consecutive breaches before alerting
      display: "CPU {cpu_percent}% (threshold: {op_symbol}{threshold_value})"
  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0
  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0
          critical: 5.0
          operator: "<"
  nagios_runner:
    status_code:
      operator: "nagios" # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
      display: "{check_name}: {output}"
```
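How `operator` and `hysteresis` might interact can be sketched as below. The recovery point for `>` follows the comment in the config (`critical=90`, `hysteresis=0.1` gives 81). Mirroring it upward for `<` operators is an assumption, and `count`, the `nagios` operator, and the UNKNOWN state are left out. None of these names come from the `hbd` source.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}
LEVELS = ["OK", "WARNING", "CRITICAL"]


def recovery_value(threshold, op, hysteresis):
    if op in (">", ">="):
        return threshold * (1 - hysteresis)
    if op in ("<", "<="):
        return threshold * (1 + hysteresis)  # assumption: mirrored for "<"
    return threshold


def evaluate(value, cfg, previous="OK"):
    """Return OK, WARNING, or CRITICAL for one metric sample."""
    op = cfg.get("operator", ">")
    hysteresis = cfg.get("hysteresis", 0.0)
    state = "OK"
    for level in ("warning", "critical"):
        threshold = cfg.get(level)
        if threshold is None:
            continue
        name = level.upper()
        # A level that is already active is held until the recovery point.
        if LEVELS.index(previous) >= LEVELS.index(name):
            threshold = recovery_value(threshold, op, hysteresis)
        if OPS[op](value, threshold):
            state = name
    return state
```

For example, a CPU reading of 85 is WARNING when the previous state was OK, but stays CRITICAL when the previous state was already CRITICAL, because it has not yet dropped to the recovery point of 81.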
### Per-host threshold profiles
Named profiles let different hosts use different thresholds. A single name or a list is accepted; lists are applied left-to-right.
```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}
hosts:
  web-01:
    threshold_config: default
  db-01:
    threshold_config: [default, tight_cpu]
```
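One plausible reading of "applied left-to-right" is a recursive merge in which later profiles override earlier ones. The sketch below implements that reading as an assumption; the actual merge rules in `hbd` are not documented here, and `merge_profiles` is a hypothetical name.

```python
import copy


def _deep_update(target, source):
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _deep_update(target[key], value)
        else:
            target[key] = copy.deepcopy(value)


def merge_profiles(names, threshold_configs):
    """Merge one profile name or a list of names, left to right."""
    if isinstance(names, str):
        names = [names]
    merged = {}
    for name in names:
        _deep_update(merged, threshold_configs[name]["thresholds"])
    return merged
```

Under this reading, `db-01` in the example above would end up with `warning: 60` and `critical: 75` for `cpu_percent`.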
### Alert states
| State | Meaning |
|---|---|
| OK | Metric within normal range |
| WARNING | Metric crossed warning threshold |
| CRITICAL | Metric crossed critical threshold |
| UNKNOWN | Cannot determine (e.g. Nagios exit code 3) |
Notifications are sent on state transitions (OK → WARNING, WARNING → CRITICAL, CRITICAL → OK). De-escalations (CRITICAL → WARNING) do not trigger a notification. Ongoing alerts generate a re-notification every `threshold_renotify_interval` seconds (default: 3600). Alerts can be acknowledged via the web UI or API to suppress re-notifications.
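
The notification rules in the paragraph above can be expressed as a small decision function. This is a sketch, not the server's code: UNKNOWN is not handled, and treating any transition to OK as a recovery notification is an assumption.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def should_notify(old, new, last_notified, now,
                  renotify_interval=3600, acknowledged=False):
    """Decide whether a sample in state `new` should produce a notification."""
    if new != old:
        if new == "OK":
            return True  # recovery
        # Escalations notify; de-escalations (CRITICAL -> WARNING) do not.
        return SEVERITY[new] > SEVERITY[old]
    if new == "OK" or acknowledged:
        return False
    # Ongoing, unacknowledged alert: re-notify once the interval has elapsed.
    return (now - last_notified) >= renotify_interval
```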
### RTT thresholds
The server measures heartbeat round-trip time and supports RTT thresholds using the same format:
```yaml
thresholds:
  rtt:
    webserver01:
      warning: 100.0 # ms
      critical: 500.0
```
### Generic threshold matching
When a metric has no exact threshold entry, the server strips leading segments and retries. This allows one entry to cover all Nagios checks:
```
nagios_runner.check_disk_root_status_code → no match
nagios_runner.disk_root_status_code → no match
nagios_runner.root_status_code → no match
nagios_runner.status_code → matched ✓
```
The stripped prefix (`check_disk_root`) is available as `{check_name}` in the `display` template.
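
The lookup shown above amounts to dropping one underscore-separated segment at a time from the front of the field name. A minimal sketch, with `match_threshold` as a hypothetical helper name:

```python
def match_threshold(metric, plugin_thresholds):
    """Return (matched_key, stripped_prefix), or (None, "") when nothing matches."""
    parts = metric.split("_")
    for i in range(len(parts)):
        candidate = "_".join(parts[i:])
        if candidate in plugin_thresholds:
            return candidate, "_".join(parts[:i])
    return None, ""
```

Called with `check_disk_root_status_code` and a threshold table containing only `status_code`, it tries the four candidates in the order listed above and returns `("status_code", "check_disk_root")`.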
### Display template variables
| Variable | Description |
|---|---|
| `{value}` | Current metric value |
| `{threshold_value}` | Threshold that was crossed |
| `{op_symbol}` | Comparison operator |
| `{check_name}` | Prefix stripped by generic matching |
| `{metric_name}` | Full field name |
| `{output}` | Nagios check output text |
| `{status}` | Nagios status name (OK/WARNING/CRITICAL/UNKNOWN) |
| any plugin field | Any field present in the plugin's data |
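
The placeholders use the same `{name}` syntax as Python's `str.format`, so rendering can be sketched as below. Whether `hbd` really uses `str.format`, and how it treats missing variables, is an assumption; here unknown placeholders are simply left in place.

```python
class _KeepMissing(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def render_display(template, context):
    return template.format_map(_KeepMissing(context))
```

For instance, rendering the CPU template from the configuration section with `cpu_percent=93.5`, `op_symbol=">"`, and `threshold_value=90.0` gives `CPU 93.5% (threshold: >90.0)`.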
---
## Notification Channels
Notifications are dispatched to the host's owner, managers, and monitors. Each user specifies which channels to use.
### Supported channel types
| Type | Required fields |
|---|---|
| `pushover` | `token`, `user` |
| `email` | `smtp_server`, `recipients`, `sender`, `user`, `password`, `port` |
| `mattermost` | `webhook_url`, `channel` |
| `matrix` | `homeserver`, `user`, `password`, `room_id` |
| `signal` | `phone_number`, `recipient` |
| `sms_voipms` | `api_key`, `recipient` |
Each channel can set a `min_level` (`WARNING` or `CRITICAL`) to filter low-severity alerts.
Recovery notifications are only sent to channels that received the original alert.
---
## Web Dashboard & HTTP API
The server exposes a web UI and REST API on `hbd_port` (default 50004).
### Web pages
| Path | Description |
|---|---|
| `/login` | Login form (shown automatically when auth is configured) |
| `/live` | Real-time host connectivity, RTT, and message stream |
| `/plugins/<host>` | Per-host plugin metrics |
| `/alerts` | Active alerts with severity filtering |
| `/settings` | Server config, users, notification channels, thresholds |
Live views use WebSocket connections for real-time updates.
Non-admin users see only hosts where they have a role (monitor, manager, or owner). Admins see all hosts.
### REST API
All endpoints are under `/api/0/`. When authentication is configured, include a session token:
```bash
# Log in, get a token
TOKEN=$(curl -s -X POST http://localhost:50004/api/0/auth/login \
-H 'Content-Type: application/json' \
-d '{"username":"alice","password":"secret"}' | jq -r .token)
# Use the token
curl -H "Authorization: Bearer $TOKEN" http://localhost:50004/api/0/hosts
```
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/0/hosts` | All visible hosts |
| GET | `/api/0/alerts` | All active alerts |
| GET | `/api/0/alert_summary` | Count of ok/warning/critical |
| GET | `/api/0/messages` | Last 30 messages |
| GET | `/api/0/hosts/{host}/plugins` | All plugin data for host |
| GET | `/api/0/hosts/{host}/plugins/{plugin}?limit=N` | Plugin samples |
| GET | `/api/0/hosts/{host}/alerts` | Alert states for host |
| GET | `/api/0/hosts/{host}/access` | Access roles |
| PUT | `/api/0/hosts/{host}/access` | Update access roles |
| GET | `/api/0/hosts/{host}/info` | Host info (hbc version, thresholds) |
| POST | `/api/0/alerts/acknowledge` | Acknowledge alert |
| GET | `/api/0/users` | All users (admin only) |
| GET | `/api/0/users/me` | Current user profile |
| PUT | `/api/0/users/me` | Update own profile |
| POST | `/api/0/auth/login` | Create session |
| POST | `/api/0/auth/logout` | Destroy session |
| GET | `/api/0/config` | Server config (secrets redacted) |
| POST | `/api/0/config` | Update config |
| GET | `/api/0/config/backups` | List config backups |
| POST | `/api/0/config/rollback` | Roll back to previous config |
| GET | `/api/0/notification_channels` | List channels |
| POST | `/api/0/notification_channels` | Create channel |
| PUT | `/api/0/notification_channels/{name}` | Update channel |
| DELETE | `/api/0/notification_channels/{name}` | Delete channel |
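
The login flow from the `curl` example can also be driven from Python with only the standard library. The sketch below builds the requests separately from sending them. The endpoint paths and the `token` response field come from the examples above; the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:50004/api/0"


def login_request(username, password, base_url=BASE_URL):
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def api_request(path, token, base_url=BASE_URL):
    return urllib.request.Request(
        f"{base_url}{path}",
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Against a running server, `fetch_json(login_request("alice", "secret"))["token"]` would yield the session token, which can then be passed to `fetch_json(api_request("/hosts", token))`.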
---
## User Management & Authentication
When no `users:` block is in config, the server runs unauthenticated — all existing behaviour is preserved.
### Roles
| Role | Capabilities |
|---|---|
| monitor | View status, plugin data, alerts |
| manager | monitor + queue commands, trigger DNS, queue upgrades |
| owner | manager + drop host, transfer ownership, update access |
| admin | Owner-level on all hosts + access to server config and users |
### Setup
```yaml
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:... # hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]
default_owner: alice # Owns any host with no explicit owner
hosts:
  webserver01:
    owner: alice
    managers: [bob]
    monitors: [carol]
```
Password hashing uses PBKDF2-HMAC-SHA256 (260,000 iterations). Sessions expire after 24 hours.
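
For illustration, PBKDF2-HMAC-SHA256 with 260,000 iterations is available through `hashlib.pbkdf2_hmac`. The stored layout used below (`pbkdf2:sha256:<iterations>$<salt>$<hex digest>`) is an assumption, since the part after `pbkdf2:sha256:` is elided in this README. Use `hbd passwd` to produce hashes for a real config.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 260_000


def hash_password(password, salt=None, iterations=ITERATIONS):
    # Hypothetical storage layout; the format hbd writes may differ.
    salt = salt or secrets.token_hex(8)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt.encode("utf-8"), iterations
    )
    return f"pbkdf2:sha256:{iterations}${salt}${digest.hex()}"


def verify_password(password, stored):
    method, salt, expected = stored.split("$", 2)
    _scheme, algorithm, iterations = method.split(":")
    digest = hashlib.pbkdf2_hmac(
        algorithm, password.encode("utf-8"), salt.encode("utf-8"), int(iterations)
    )
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(digest.hex(), expected)
```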
OAuth2 login (Gitea) is supported:
```yaml
oauth:
  gitea:
    url: https://git.example.com
    client_id: xxx
    client_secret: yyy
```
---
## Dynamic DNS
When `dyndns: true` is set on a host and `dyndomains` is configured, the server updates DNS via `nsupdate` whenever the host's source address changes.
```yaml
nsupdate_bin: /usr/bin/nsupdate
dyndomains:
  - example.com
hosts:
  webserver01:
    dyndns: true
```
DNS updates run asynchronously in a background worker.
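
As an illustration of what such an update involves, the sketch below builds the command script that `nsupdate` reads on standard input. The TTL of 300, the record name `<host>.<domain>`, and the delete-then-add sequence are assumptions; the script `hbd` actually generates is not shown in this README.

```python
import ipaddress


def build_nsupdate_script(host, domain, address, ttl=300):
    """Return nsupdate commands that replace the host's A or AAAA record."""
    ip = ipaddress.ip_address(address)
    rtype = "AAAA" if ip.version == 6 else "A"
    fqdn = f"{host}.{domain}."
    return "\n".join([
        f"zone {domain}.",
        f"update delete {fqdn} {rtype}",
        f"update add {fqdn} {ttl} {rtype} {ip}",
        "send",
        "",
    ])
```

The resulting text could be passed to the configured binary, for example with `subprocess.run([nsupdate_bin], input=script, text=True)`. Authentication options such as a TSIG key are omitted here.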
---
## Message Journal
All received messages are logged in JSONL format with automatic size-based rotation.
```yaml
journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600 # 100 MB
journal_max_backups: 10
```
Example entry:
```json
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver01","interval":10}}
```
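Because each journal line is a standalone JSON object, the file can be processed line by line. A small sketch that counts messages per host and message ID, using only the fields visible in the example entry; other message types may carry different fields, so lookups use `.get`:

```python
import json
from collections import Counter


def count_messages(lines):
    """Count journal entries per (host name, message ID)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line).get("message", {})
        counts[(message.get("name"), message.get("ID"))] += 1
    return counts
```

Since any iterable of lines works, an open file object can be passed directly, for example `count_messages(open("/var/log/heartbeat/messages.journal"))`.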
---
## `hbc_mini` — Zero-dependency client
`scripts/hbc_mini.py` is a single-file client requiring only Python 3.8+ and no external packages. Copy it to any host and run directly.
```bash
python3 hbc_mini.py your-server.example.com
python3 hbc_mini.py -d your-server.example.com # daemon mode
python3 hbc_mini.py -b your-server.example.com # send boot message
```
Config: `~/.hbc.json` (JSON format, same keys as `~/.hbc.yaml`).
**Available plugins:**
| Plugin | Platform |
|---|---|
| `os_info` | All |
| `ping_monitor` | All |
| `nagios_runner` | All (not Windows) |
| `cpu_monitor` | Linux (`/proc/stat`; no per-core, no frequency) |
| `memory_monitor` | Linux (`/proc/meminfo`) |
| `disk_monitor` | Linux, macOS, BSD (`df -P`) |
| `network_monitor` | Linux (`/proc/net/dev`) |
Compared with the full `hbc`, the following are not available: YAML config, `filesystem_info`, `zfs_monitor`, and IPv6 early-fail protection.
---
## `hbc_mini.c` — C client
`scripts/c/hbc_mini.c` is a single-file C port of `hbc_mini.py`. It has no runtime dependencies beyond libc, zlib, pthreads, and libm, and runs on Linux, FreeBSD, NetBSD, and DragonFly BSD.
### Build
```bash
cc -O2 -o hbc_mini scripts/c/hbc_mini.c -lz -lpthread -lm
```
### Usage
The CLI is identical to `hbc_mini.py`:
```bash
./hbc_mini your-server.example.com
./hbc_mini -d your-server.example.com # daemon mode (logs to syslog)
./hbc_mini -b your-server.example.com # send boot message
./hbc_mini -m "note" your-server.example.com # send one-shot message
./hbc_mini -4 your-server.example.com # IPv4 only
./hbc_mini -6 your-server.example.com # IPv6 only
```
Config: `~/.hbc.json` (JSON, same keys as the Python version).
### Architecture
The C client uses two threads:
- **Main thread** — heartbeat sender loop + `select()`-based receive loop (1 s timeout). Sends `HTB` at the configured interval, receives `ACK`/`CMD` messages, and re-sends `os_info` on server request.
- **Monitor thread** — all periodic plugins in a single thread with a 1-second sleep loop. Each plugin has its own next-run timestamp tracked independently.
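
The monitor thread's scheduling idea, shown here in Python rather than C for consistency with the rest of this README, is one loop that wakes every second and runs whichever plugins are due. The data layout and names below are illustrative and not a transcription of `hbc_mini.c`.

```python
import threading
import time


def run_due_plugins(plugins, next_run, now):
    """Run plugins whose next-run time has passed.

    plugins maps name -> (interval_seconds, collect_callable);
    next_run maps name -> timestamp and is updated in place.
    """
    results = {}
    for name, (interval, collect) in plugins.items():
        if now >= next_run.get(name, 0):
            results[name] = collect()
            next_run[name] = now + interval
    return results


def monitor_loop(plugins, stop_event, send):
    """Single monitor thread: check once per second until stop_event is set."""
    next_run = {}
    while not stop_event.is_set():
        for name, data in run_due_plugins(plugins, next_run, time.monotonic()).items():
            send(name, data)
        stop_event.wait(1.0)  # 1-second sleep that can be interrupted on shutdown
```

Here `stop_event` is a `threading.Event` and `send` is whatever callable transmits a `PLG` message.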
SIGHUP causes the process to restart itself via `execv()`. SIGTERM/SIGINT trigger a clean shutdown (sends a shutdown heartbeat if `-b` was used).
### Available plugins
| Plugin | Platform | Data source |
|---|---|---|
| `os_info` | Linux, FreeBSD, NetBSD, DragonFly | `uname(2)`, `/etc/os-release`, `kern.osrelease` sysctl |
| `cpu_monitor` | Linux | `/proc/stat` |
| `cpu_monitor` | FreeBSD, DragonFly, NetBSD | `kern.cp_time` sysctl |
| `memory_monitor` | Linux | `/proc/meminfo` (ZFS ARC-aware) |
| `memory_monitor` | FreeBSD, DragonFly | `vm.stats.vm.*` sysctl |
| `memory_monitor` | NetBSD | `VM_UVMEXP` sysctl |
| `disk_monitor` | All | `df -P` subprocess |
| `network_monitor` | Linux | `/proc/net/dev` |
| `network_monitor` | FreeBSD, NetBSD, DragonFly | `getifaddrs()` + `AF_LINK` |
| `ping_monitor` | All | `ping` subprocess |
| `nagios_runner` | All | `popen()` subprocess |
`cpu_monitor` reports: `cpu_percent`, `cpu_user`, `cpu_system`, `cpu_idle`, `cpu_iowait` (Linux only), load averages, `cpu_core_count`, `uptime_seconds`.
`memory_monitor` reports: `memory_total`, `memory_used`, `memory_available`, `memory_free`, `memory_percent`, and swap fields when swap is present.
`network_monitor` reports per-interface cumulative `bytes_recv`/`bytes_sent` and interval deltas. The loopback interface (`lo`) is skipped by default; this is configurable:
```json
{
  "plugins": {
    "network_monitor": {
      "skip_interfaces": ["lo", "docker0"]
    }
  }
}
```
`disk_monitor` reports per-mount `total`, `used`, `free`, `percent`. An optional mount filter restricts reporting to specific paths:
```json
{
  "plugins": {
    "disk_monitor": {
      "mounts": ["/", "/data"]
    }
  }
}
```
### Differences from `hbc_mini.py`
- No `filesystem_info` or `zfs_monitor` plugins
- `UPD` (self-update) messages are logged but not acted on
- No IPv6 early-fail protection
- Config is JSON only (`~/.hbc.json`), no YAML
---
## Development
### Running tests
```bash
PYTHONPATH=. python -m unittest discover -v
# or
pytest -q
```
### Linting and type checking
```bash
tox -e lint
tox -e mypy
```
### Debugging in VS Code
A `.vscode/launch.json` is included with configurations for running the server and attaching the debugger. Select the project `.venv` as the Python interpreter, then press F5.
To start with debugpy and wait for attach:
```bash
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.server.cli serve -c .hb.yaml -f -v
```
---
## License
MIT. See `LICENSE` for details.