Heartbeat Daemon (hbd)
A lightweight UDP-based host monitoring system. Monitored hosts run a client (hbc) that sends periodic heartbeat packets and system metrics to a central server (hbd). The server tracks host reachability, evaluates metric thresholds, sends notifications, and serves a web dashboard.
Architecture
 [ host running hbc ]                    [ server running hbd ]
┌────────────────────┐               ┌────────────────────────────┐
│  heartbeat client  │   UDP 50003   │  heartbeat daemon          │
│                    │  ──────────>  │                            │
│  plugins:          │   HTB / PLG   │  host state tracking       │
│  - cpu_monitor     │               │  threshold evaluation      │
│  - memory_monitor  │  <──────────  │  DNS updates (nsupdate)    │
│  - disk_monitor    │  ACK/CMD/UPD  │  notifications             │
│  - nagios_runner   │               │  web dashboard + REST API  │
│  - ...             │               │  WebSocket live updates    │
└────────────────────┘               └────────────────────────────┘
Package: hbd v5.3.10
Python: 3.11+
Subpackages
| Package | Purpose |
|---|---|
| hbd.common | Protocol encoding/decoding, shared utilities |
| hbd.server | The hbd daemon |
| hbd.client | The hbc client |
Installation
Dependencies are declared in pyproject.toml. Install into a virtualenv:
# Server + client
pip install .
# Using the install script
scripts/hb_install.sh
Entry points:
- hbd: server (hbd.server.cli:main)
- hbc: client (hbd.client.main:main)
Runtime dependencies:
| Component | Packages |
|---|---|
| Both | PyYAML ≥6.0 |
| Client | psutil ≥5.9.0 |
| Server | aiohttp ≥3.11, websockets ≥13.2, Jinja2 ≥3.1.6, ruamel.yaml ≥0.18, mattermostdriver ≥7.3.0, matrix-nio ≥0.24 |
Server (hbd)
Starting the server
# Foreground, verbose, with config file
hbd serve -c /etc/hb.yaml -f -v
# As a module
python -m hbd.server.cli serve -c /etc/hb.yaml
CLI subcommands
| Command | Description |
|---|---|
| hbd serve | Start the daemon (default) |
| hbd passwd <username> | Generate a password hash for config |
| hbd notify | Test notification channels |
| hbd stop | Stop a running daemon |
| hbd reload | Reload config (send SIGHUP) |
| hbd restart | Restart daemon |
Configuration (~/.hb.yaml)
# Network
hb_port: 50003 # UDP port for heartbeat messages
hbd_port: 50004 # HTTP API / web UI port
hbd_host: "" # Bind address (empty = all interfaces)
ws_port: 50005 # WebSocket port (plain)
wss_port: ~ # WebSocket port (TLS; requires cert_path/wss_pem/wss_key)
# Timing
interval: 20 # Expected heartbeat interval (seconds)
grace: 2 # Extra seconds before declaring a host overdue
# Persistence
pickfile: ~/.hb.pick # Host state persistence
pidfile: ~/.hb.pid
logfile: ~/.hb.log
# Message journal
journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600 # 100 MB
journal_max_backups: 10
# DNS
nsupdate_bin: /usr/bin/nsupdate
dyndomains:
- example.com
# Threshold alert re-notification interval (seconds)
threshold_renotify_interval: 3600
# Notification channels
notification_channels:
  pushover_ops:
    type: pushover
    token: YOUR_APP_TOKEN
    user: YOUR_USER_KEY
  email_ops:
    type: email
    smtp_server: smtp.example.com
    port: 587
    user: alerts@example.com
    password: secret
    recipients: [ops@example.com]
# Users
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...   # generate with: hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]
  bob:
    password: pbkdf2:sha256:...
    notification_channels: [email_ops]
default_owner: alice
# Hosts
hosts:
  webserver01:
    dyndns: true      # Update DNS when address changes
    owner: alice
    managers: [bob]
    monitors: []
  database01:
    watch: false      # Suppress all notifications for this host
Send SIGHUP (or hbd reload) to reload configuration without restarting. Changes to ports, certificates, pickle path, and journal path require a full restart.
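The interval and grace settings in the configuration above imply an overdue check along the following lines. This is a minimal sketch, not the daemon's code: the function name is_overdue and the exact comparison (strictly greater than interval + grace) are assumptions made for illustration.

```python
import time
from typing import Optional


def is_overdue(last_seen: float, interval: float = 20, grace: float = 2,
               now: Optional[float] = None) -> bool:
    """Return True when no heartbeat arrived within interval + grace seconds.

    Hypothetical helper: the defaults mirror the sample config (20 s + 2 s).
    """
    if now is None:
        now = time.time()
    return (now - last_seen) > (interval + grace)
```

With the sample values, a host last heard from 21 seconds ago is still considered on time, while one silent for 23 seconds is overdue.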
Persistence
Host state (reachability, plugin data, alert states) is saved to pickfile every 5 minutes and on clean shutdown. The server loads this state on startup.
Client (hbc)
Usage
# Basic — send heartbeats to a server
hbc your-server.example.com
# Multiple servers
hbc server1.example.com server2.example.com
# With config file, running as a daemon
hbc -d -c /etc/hbc.yaml your-server.example.com
# Send a boot message, then heartbeat normally
hbc -b your-server.example.com
# One-off message
hbc -m "maintenance starting" your-server.example.com
# Force IPv4 or IPv6 only
hbc -4 your-server.example.com
hbc -6 your-server.example.com
Options
| Flag | Description |
|---|---|
| -b, --boot | Send a boot message at startup |
| -c, --config FILE | Config file path (default: ~/.hbc.yaml) |
| -d, --daemon | Daemonize (logs go to syslog) |
| -m, --message TEXT | Send a one-off message and exit |
| -n, --name NAME | Override reported hostname |
| -v, --verbose | Verbose output |
| -x, --debug | Debug level (repeatable) |
| -4 / -6 | Restrict to IPv4 or IPv6 |
Configuration (~/.hbc.yaml)
hb_port: 50003 # Server UDP port
interval: 10 # Heartbeat interval (seconds)
owner: alice # Optional: claim ownership of this host
plugins:
  cpu_monitor:
    interval: 300     # Override collection interval
    per_core: true    # Report per-core CPU usage
  memory_monitor:
    interval: 300
  disk_monitor:
    interval: 300
  network_monitor:
    interval: 300
  ping_monitor:
    interval: 60
    hosts: [8.8.8.8, 192.168.1.1]
  nagios_runner:
    interval: 300
    commands:
      - name: check_load
        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - name: check_disk_root
        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
  zfs_monitor:
    interval: 300
Connection behaviour
- The client sends heartbeats over UDP to each server address resolved from the hostname (IPv4 and IPv6).
- If a connection fails to open at startup, the client retries it. An IPv6 connection is dropped after 3 consecutive failures; IPv4 connections retry indefinitely.
- In daemon mode (-d), all log output goes to syslog (LOG_DAEMON facility).
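The resolution and retry rules above can be sketched as follows. This is an illustration under assumptions, not the hbc source: resolve_targets, Target and MAX_V6_FAILURES are hypothetical names, and only the "3 consecutive failures" figure comes from the text.

```python
import socket
from typing import List, Tuple

MAX_V6_FAILURES = 3  # from the text: IPv6 is dropped after 3 consecutive failures


def resolve_targets(host: str, port: int = 50003,
                    family: int = socket.AF_UNSPEC) -> List[Tuple[int, tuple]]:
    """Resolve host to one (family, sockaddr) pair per UDP address."""
    infos = socket.getaddrinfo(host, port, family, socket.SOCK_DGRAM)
    return [(fam, sockaddr) for fam, _type, _proto, _canon, sockaddr in infos]


class Target:
    """Tracks consecutive open failures for one resolved address (hypothetical)."""

    def __init__(self, family: int, sockaddr: tuple) -> None:
        self.family = family
        self.sockaddr = sockaddr
        self.failures = 0

    def record_failure(self) -> bool:
        """Count a failure; return False once the target should be dropped."""
        self.failures += 1
        return not (self.family == socket.AF_INET6
                    and self.failures >= MAX_V6_FAILURES)

    def record_success(self) -> None:
        """A successful open resets the consecutive-failure counter."""
        self.failures = 0
```

Passing socket.AF_INET or socket.AF_INET6 as family corresponds to the -4 and -6 flags.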
UDP Protocol
All messages are zlib-compressed key=value pairs with an ID prefix.
!<ID>: <zlib-compressed payload>
Payload format: key=value;key=value;...
| Message | Direction | Purpose |
|---|---|---|
| HTB | client → server | Heartbeat (name, timestamp, RTT, acks, interval) |
| PLG | client → server | Plugin data (plugin name + metrics) |
| ACK | server → client | Acknowledgment |
| CMD | server → client | Execute a shell command on the client |
| UPD | server → client | Trigger self-update via hb_install.sh |
Value encoding:
- Floats: 5 decimal places
- Lists/dicts: JSON prefixed with @
- Booleans: 1/0
RTT is measured using kernel SO_TIMESTAMP when available (Linux, macOS, FreeBSD), falling back to application-layer timing.
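The wire format described in this section can be sketched with the standard library. Treat this as an approximation: the function names are hypothetical, the exact byte layout of the prefix (a single space after the colon) is read off the format line above, and the sketch assumes values never contain ";" or "=" because the text does not describe any escaping.

```python
import json
import zlib
from typing import Any, Dict, Tuple


def encode_value(value: Any) -> str:
    """Apply the value-encoding rules listed above."""
    # bool must be tested first because bool is a subclass of int
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, float):
        return f"{value:.5f}"
    if isinstance(value, (list, dict)):
        return "@" + json.dumps(value, separators=(",", ":"))
    return str(value)


def encode_message(msg_id: str, fields: Dict[str, Any]) -> bytes:
    """Build '!<ID>: ' followed by the zlib-compressed key=value payload."""
    payload = ";".join(f"{k}={encode_value(v)}" for k, v in fields.items())
    return f"!{msg_id}: ".encode("ascii") + zlib.compress(payload.encode("utf-8"))


def decode_message(packet: bytes) -> Tuple[str, Dict[str, Any]]:
    """Split the ID prefix, decompress, and parse the pairs.

    Plain values come back as strings; only @-prefixed values are JSON-decoded.
    """
    header, _, body = packet.partition(b": ")
    if not header.startswith(b"!"):
        raise ValueError("missing '!' prefix")
    fields: Dict[str, Any] = {}
    for pair in zlib.decompress(body).decode("utf-8").split(";"):
        if not pair:
            continue
        key, _, raw = pair.partition("=")
        fields[key] = json.loads(raw[1:]) if raw.startswith("@") else raw
    return header[1:].decode("ascii"), fields
```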
Plugin System
Plugins run on the client and collect system metrics that are sent to the server as PLG messages.
Plugin types
| Type | interval | When collected |
|---|---|---|
| InfoPlugin | 0 | Once at startup; re-collected on server request |
| MonitorPlugin | 30 (default) | Periodically on the configured interval |
Built-in plugins
| Plugin | Type | Data collected |
|---|---|---|
| os_info | Info | OS, kernel, distro, architecture, Python version, hbc version |
| cpu_monitor | Monitor | cpu_percent, per-core usage, load averages, process count, frequency |
| memory_monitor | Monitor | RAM and swap usage (ZFS ARC-aware) |
| disk_monitor | Monitor | Per-partition usage, disk I/O stats |
| network_monitor | Monitor | Per-interface byte/packet counts, connection count |
| ping_monitor | Monitor | RTT, packet loss, jitter per configured host |
| filesystem_info | Info | Mounted filesystems (excludes pseudo filesystems) |
| nagios_runner | Monitor | Output of configured Nagios-compatible check commands |
| zfs_monitor | Monitor | ZFS pool health, capacity, fragmentation, dedup ratio, I/O |
Custom plugins
Create a .py file in hbd/client/plugins/:
from hbd.client.plugin import MonitorPlugin

class MyPlugin(MonitorPlugin):
    name = "my_plugin"
    interval = 60  # seconds between collections

    async def collect(self):
        # Return a dict of metric name -> value; it is sent as a PLG message
        return {"my_metric": 42}
initialize() is called once at load time; return False to disable the plugin (e.g., if a required binary is missing).
Nagios integration
The nagios_runner plugin executes any Nagios-compatible check binary:
plugins:
  nagios_runner:
    commands:
      - name: check_http
        command: /usr/lib/nagios/plugins/check_http -H example.com
- Commands are validated (absolute paths, executable) at startup.
- Exit codes map to OK / WARNING / CRITICAL / UNKNOWN.
- Performance data fields are extracted and stored individually.
- The nagios threshold operator maps exit codes directly to alert levels (see Threshold Alerting).
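The exit-code mapping and performance-data extraction listed above can be sketched as below. This is not the nagios_runner implementation: parse_check_result is a hypothetical helper, it assumes the conventional Nagios output shape (text, then "|", then space-separated label=value[UOM];warn;crit;min;max items), and it does not handle quoted labels that contain spaces.

```python
import re
from typing import Dict

NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Leading numeric part of a perfdata value, before any unit or ;warn;crit tail
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_check_result(exit_code: int, stdout: str) -> Dict[str, object]:
    """Map the exit code to a state and pull out each perfdata value."""
    text, _, perf = stdout.strip().partition("|")
    result: Dict[str, object] = {
        "status_code": exit_code,
        # Exit codes outside 0-3 are treated as UNKNOWN (an assumption)
        "status": NAGIOS_STATES.get(exit_code, "UNKNOWN"),
        "output": text.strip(),
    }
    for token in perf.split():
        label, sep, rest = token.partition("=")
        match = _NUMBER.match(rest)
        if sep and match:
            result[label.strip("'")] = float(match.group(0))
    return result
```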
Threshold Alerting
The server evaluates plugin metrics against configurable thresholds and fires notifications on state changes.
Configuration
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      critical: 90.0
      operator: ">"       # >, >=, <, <=, ==, != (default: >)
      hysteresis: 0.1     # 10%: recover at 81 when critical=90
      count: 1            # Require N consecutive breaches before alerting
      display: "CPU {cpu_percent}% (threshold: {op_symbol}{threshold_value})"
  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0
  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0
          critical: 5.0
          operator: "<"
  nagios_runner:
    status_code:
      operator: "nagios"  # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
      display: "{check_name}: {output}"
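To make the operator and hysteresis settings concrete, here is a small sketch of how a value might be classified. The names classify and recovery_threshold are hypothetical, and the hysteresis formula is inferred only from the config comment ("recover at 81 when critical=90"); the mirrored rule for "<" operators is an assumption. The count setting and the nagios operator are not covered.

```python
import operator
from typing import Callable, Dict

OPS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt, ">=": operator.ge, "<": operator.lt,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}


def classify(value: float, warning: float, critical: float, op: str = ">") -> str:
    """Check the critical threshold first, then warning."""
    breached = OPS[op]
    if breached(value, critical):
        return "CRITICAL"
    if breached(value, warning):
        return "WARNING"
    return "OK"


def recovery_threshold(threshold: float, hysteresis: float, op: str = ">") -> float:
    """Value the metric must pass before the alert level clears (assumed formula)."""
    if op in (">", ">="):
        return threshold * (1 - hysteresis)
    if op in ("<", "<="):
        return threshold * (1 + hysteresis)
    return threshold
```

For the free_gb entry above, classify(4.0, 10.0, 5.0, "<") yields CRITICAL because the operator is inverted.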
Per-host threshold profiles
Named profiles let different hosts use different thresholds. A single name or a list is accepted; lists are applied left-to-right.
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}
hosts:
  web-01:
    threshold_config: default
  db-01:
    threshold_config: [default, tight_cpu]
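One plausible reading of "applied left-to-right" is that later profiles override earlier ones key by key. The sketch below implements that reading; deep_merge and resolve_thresholds are hypothetical names, and whether hbd merges nested keys or replaces whole sections is an assumption.

```python
import copy
from typing import Any, Dict, List, Union


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override wins, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_thresholds(names: Union[str, List[str]],
                       profiles: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Accept a single profile name or a list and fold them in order."""
    if isinstance(names, str):
        names = [names]
    result: Dict[str, Any] = {}
    for name in names:  # left-to-right: later profiles win on conflicts
        result = deep_merge(result, profiles[name].get("thresholds", {}))
    return result
```

Under this reading, db-01 in the example ends up with the tight_cpu values (warning 60, critical 75) and keeps anything else that default defines.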
Alert states
| State | Meaning |
|---|---|
| OK | Metric within normal range |
| WARNING | Metric crossed warning threshold |
| CRITICAL | Metric crossed critical threshold |
| UNKNOWN | Cannot determine (e.g. Nagios exit code 3) |
Notifications are sent on state transitions (OK → WARNING, WARNING → CRITICAL, CRITICAL → OK). De-escalations (CRITICAL → WARNING) do not trigger a notification. Ongoing alerts generate a re-notification every threshold_renotify_interval seconds (default: 3600). Alerts can be acknowledged via the web UI or API to suppress re-notifications.
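The notification rules in the paragraph above can be condensed into one decision function. This is a sketch with assumptions: should_notify is a hypothetical name, any rise in severity and any return to OK is treated as a notifying transition, and UNKNOWN is left out because the text does not say how it ranks.

```python
from typing import Optional

SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}  # UNKNOWN omitted in this sketch


def should_notify(previous: str, current: str, last_sent: Optional[float],
                  now: float, acknowledged: bool = False,
                  renotify_interval: float = 3600) -> bool:
    """Decide whether a state evaluation should produce a notification."""
    if current != previous:
        # Escalations and recoveries to OK notify; de-escalations stay silent
        return SEVERITY[current] > SEVERITY[previous] or current == "OK"
    if current == "OK" or acknowledged:
        # Nothing ongoing, or re-notifications suppressed by an acknowledgment
        return False
    return last_sent is None or now - last_sent >= renotify_interval
```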
RTT thresholds
The server measures heartbeat round-trip time and supports RTT thresholds using the same format:
thresholds:
  rtt:
    webserver01:
      warning: 100.0   # ms
      critical: 500.0
Generic threshold matching
When a metric has no exact threshold entry, the server strips leading segments and retries. This allows one entry to cover all Nagios checks:
nagios_runner.check_disk_root_status_code → no match
nagios_runner.disk_root_status_code → no match
nagios_runner.root_status_code → no match
nagios_runner.status_code → matched ✓
The stripped prefix (check_disk_root) is available as {check_name} in the display template.
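The lookup shown above can be sketched in a few lines. The helper name find_threshold is hypothetical, and the sketch assumes that "segments" are the underscore-separated parts of the field name, which is what the example trace suggests.

```python
from typing import Any, Dict, Optional, Tuple


def find_threshold(field: str, plugin_thresholds: Dict[str, Any]
                   ) -> Optional[Tuple[str, str, Any]]:
    """Return (matched_key, stripped_prefix, config), or None if nothing matches."""
    parts = field.split("_")
    for start in range(len(parts)):
        # start == 0 is the exact name; each later step drops one leading segment
        key = "_".join(parts[start:])
        if key in plugin_thresholds:
            return key, "_".join(parts[:start]), plugin_thresholds[key]
    return None
```

For the field check_disk_root_status_code and a single status_code entry, the second element of the result is check_disk_root, the value exposed as {check_name}.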
Display template variables
| Variable | Description |
|---|---|
| {value} | Current metric value |
| {threshold_value} | Threshold that was crossed |
| {op_symbol} | Comparison operator |
| {check_name} | Prefix stripped by generic matching |
| {metric_name} | Full field name |
| {output} | Nagios check output text |
| {status} | Nagios status name (OK/WARNING/CRITICAL/UNKNOWN) |
| any plugin field | Any field present in the plugin's data |
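The template syntax looks like Python's str.format placeholders, so rendering can be approximated as below. This is an assumption about the mechanism, not a description of the server code; render_display is a hypothetical helper, and leaving unknown placeholders untouched is a design choice of the sketch.

```python
from typing import Any, Dict


class _KeepMissing(dict):
    """Leave unknown placeholders in the output instead of raising KeyError."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def render_display(template: str, context: Dict[str, Any]) -> str:
    """Fill a display template from the variables in the table above."""
    return template.format_map(_KeepMissing(context))
```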
Notification Channels
Notifications are dispatched to the host's owner, managers, and monitors. Each user specifies which channels to use.
Supported channel types
| Type | Required fields |
|---|---|
| pushover | token, user |
| email | smtp_server, recipients, sender, user, password, port |
| mattermost | webhook_url, channel |
| matrix | homeserver, user, password, room_id |
| signal | phone_number, recipient |
| sms_voipms | api_key, recipient |
Each channel can set a min_level (WARNING or CRITICAL) to filter low-severity alerts.
Recovery notifications are only sent to channels that received the original alert.
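The two filtering rules above (min_level, and recoveries going only to channels that saw the alert) might look like this. Both function names are hypothetical, and treating a channel without min_level as accepting WARNING and above is an assumption.

```python
from typing import Dict, Iterable, List

LEVELS = {"WARNING": 1, "CRITICAL": 2}


def channels_for_alert(level: str, channels: Dict[str, dict]) -> List[str]:
    """Channels whose min_level admits an alert of the given level."""
    return [name for name, cfg in channels.items()
            if LEVELS[level] >= LEVELS[cfg.get("min_level", "WARNING")]]


def channels_for_recovery(alerted: Iterable[str],
                          channels: Dict[str, dict]) -> List[str]:
    """Only channels that received the original alert get the recovery."""
    seen = set(alerted)
    return [name for name in channels if name in seen]
```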
Web Dashboard & HTTP API
The server exposes a web UI and REST API on hbd_port (default 50004).
Web pages
| Path | Description |
|---|---|
| /login | Login form (shown automatically when auth is configured) |
| /live | Real-time host connectivity, RTT, and message stream |
| /plugins/<host> | Per-host plugin metrics |
| /alerts | Active alerts with severity filtering |
| /settings | Server config, users, notification channels, thresholds |
Live views use WebSocket connections for real-time updates.
Non-admin users see only hosts where they have a role (monitor, manager, or owner). Admins see all hosts.
REST API
All endpoints are under /api/0/. When authentication is configured, include a session token:
# Log in, get a token
TOKEN=$(curl -s -X POST http://localhost:50004/api/0/auth/login \
-H 'Content-Type: application/json' \
-d '{"username":"alice","password":"secret"}' | jq -r .token)
# Use the token
curl -H "Authorization: Bearer $TOKEN" http://localhost:50004/api/0/hosts
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/0/hosts | All visible hosts |
| GET | /api/0/alerts | All active alerts |
| GET | /api/0/alert_summary | Count of ok/warning/critical |
| GET | /api/0/messages | Last 30 messages |
| GET | /api/0/hosts/{host}/plugins | All plugin data for host |
| GET | /api/0/hosts/{host}/plugins/{plugin}?limit=N | Plugin samples |
| GET | /api/0/hosts/{host}/alerts | Alert states for host |
| GET | /api/0/hosts/{host}/access | Access roles |
| PUT | /api/0/hosts/{host}/access | Update access roles |
| GET | /api/0/hosts/{host}/info | Host info (hbc version, thresholds) |
| POST | /api/0/alerts/acknowledge | Acknowledge alert |
| GET | /api/0/users | All users (admin only) |
| GET | /api/0/users/me | Current user profile |
| PUT | /api/0/users/me | Update own profile |
| POST | /api/0/auth/login | Create session |
| POST | /api/0/auth/logout | Destroy session |
| GET | /api/0/config | Server config (secrets redacted) |
| POST | /api/0/config | Update config |
| GET | /api/0/config/backups | List config backups |
| POST | /api/0/config/rollback | Roll back to previous config |
| GET | /api/0/notification_channels | List channels |
| POST | /api/0/notification_channels | Create channel |
| PUT | /api/0/notification_channels/{name} | Update channel |
| DELETE | /api/0/notification_channels/{name} | Delete channel |
User Management & Authentication
When no users: block is in config, the server runs unauthenticated — all existing behaviour is preserved.
Roles
| Role | Capabilities |
|---|---|
| monitor | View status, plugin data, alerts |
| manager | monitor + queue commands, trigger DNS, queue upgrades |
| owner | manager + drop host, transfer ownership, update access |
| admin | Owner-level on all hosts + access to server config and users |
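Because each role includes the one below it, access checks reduce to a rank comparison. The following sketch shows one way to resolve a user's role on a host from the config structures in this README; role_on_host and allowed are hypothetical helpers, and the lookup order is an assumption.

```python
from typing import Dict, Optional

RANK = {"monitor": 1, "manager": 2, "owner": 3}


def role_on_host(user: str, host_cfg: dict, users: Dict[str, dict],
                 default_owner: Optional[str] = None) -> Optional[str]:
    """Return the user's effective role on a host, or None if they have none."""
    if users.get(user, {}).get("admin"):
        return "owner"  # admins get owner-level rights on every host
    if user == host_cfg.get("owner", default_owner):
        return "owner"
    if user in host_cfg.get("managers", []):
        return "manager"
    if user in host_cfg.get("monitors", []):
        return "monitor"
    return None


def allowed(role: Optional[str], required: str) -> bool:
    """True when role is at least as powerful as the required role."""
    return role is not None and RANK[role] >= RANK[required]
```

A user for whom role_on_host returns None would not see the host at all, matching the visibility rule for non-admin users.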
Setup
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...   # hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]
default_owner: alice              # Owns any host with no explicit owner
hosts:
  webserver01:
    owner: alice
    managers: [bob]
    monitors: [carol]
Password hashing uses PBKDF2-HMAC-SHA256 (260,000 iterations). Sessions expire after 24 hours.
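For readers unfamiliar with PBKDF2, the scheme can be reproduced with Python's hashlib. This sketch is illustrative only: the stored-string layout used here (iterations, hex salt and hex digest separated by "$") is a hypothetical format chosen for the example, and hashes produced by it should not be expected to match the output of hbd passwd.

```python
import hashlib
import hmac
import os

ITERATIONS = 260_000  # iteration count stated in the text


def hash_password(password: str, salt: bytes = b"") -> str:
    """Derive a PBKDF2-HMAC-SHA256 hash; the string layout is hypothetical."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return f"pbkdf2:sha256:{ITERATIONS}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    method, salt_hex, digest_hex = stored.split("$")
    iterations = int(method.rsplit(":", 1)[1])
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 bytes.fromhex(salt_hex), iterations)
    return hmac.compare_digest(digest.hex(), digest_hex)
```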
OAuth2 login (Gitea) is supported:
oauth:
  gitea:
    url: https://git.example.com
    client_id: xxx
    client_secret: yyy
Dynamic DNS
When dyndns: true is set on a host and dyndomains is configured, the server updates DNS via nsupdate whenever the host's source address changes.
nsupdate_bin: /usr/bin/nsupdate
dyndomains:
- example.com
hosts:
  webserver01:
    dyndns: true
DNS updates run asynchronously in a background worker.
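An update of this kind boils down to a short command script fed to nsupdate on standard input. The sketch below builds such a script; build_nsupdate_script is a hypothetical helper, the TTL of 60 seconds and the delete-then-add sequence are assumptions, and authentication (for example a TSIG key) is not covered.

```python
import ipaddress
from typing import List


def build_nsupdate_script(host: str, domains: List[str], address: str,
                          ttl: int = 60) -> str:
    """Build nsupdate commands replacing the host's A or AAAA record."""
    rtype = "AAAA" if ipaddress.ip_address(address).version == 6 else "A"
    lines = []
    for domain in domains:
        fqdn = f"{host}.{domain}."
        lines.append(f"update delete {fqdn} {rtype}")
        lines.append(f"update add {fqdn} {ttl} {rtype} {address}")
        lines.append("send")  # one transaction per domain, since zones may differ
    return "\n".join(lines) + "\n"
```

The resulting text could then be passed to the binary configured as nsupdate_bin, for instance through subprocess.run with the input argument.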
Message Journal
All received messages are logged in JSONL format with automatic size-based rotation.
journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600 # 100 MB
journal_max_backups: 10
Example entry:
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver01","interval":10}}
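Size-based rotation of a JSONL file can be sketched as follows. This is a generic illustration rather than the journal's actual code: append_entry and rotate are hypothetical names, and the backup naming scheme (messages.journal.1, .2, and so on) is an assumption.

```python
import json
import os


def rotate(path: str, max_backups: int) -> None:
    """Shift path.1 -> path.2 and so on, then move the live file to path.1."""
    for i in range(max_backups - 1, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{i + 1}")  # the oldest backup is overwritten
    os.replace(path, f"{path}.1")


def append_entry(path: str, entry: dict, max_size: int = 104857600,
                 max_backups: int = 10) -> None:
    """Append one JSON object per line, rotating first if the file would overflow."""
    line = json.dumps(entry, separators=(",", ":")) + "\n"
    if (os.path.exists(path)
            and os.path.getsize(path) + len(line.encode("utf-8")) > max_size):
        rotate(path, max_backups)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)
```

Reading the journal back is a matter of calling json.loads on each line.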
hbc_mini — Zero-dependency client
scripts/hbc_mini.py is a single-file client requiring only Python 3.8+ and no external packages. Copy it to any host and run directly.
python3 hbc_mini.py your-server.example.com
python3 hbc_mini.py -d your-server.example.com # daemon mode
python3 hbc_mini.py -b your-server.example.com # send boot message
Config: ~/.hbc.json (JSON format, same keys as ~/.hbc.yaml).
Available plugins:
| Plugin | Platform |
|---|---|
| os_info | All |
| ping_monitor | All |
| nagios_runner | All (not Windows) |
| cpu_monitor | Linux (/proc/stat; no per-core, no frequency) |
| memory_monitor | Linux (/proc/meminfo) |
| disk_monitor | Linux, macOS, BSD (df -P) |
| network_monitor | Linux (/proc/net/dev) |
Compared with the full hbc, hbc_mini lacks YAML config, the filesystem_info and zfs_monitor plugins, and IPv6 early-fail protection.
hbc_mini.c — C client
scripts/c/hbc_mini.c is a single-file C port of hbc_mini.py. It has no runtime dependencies beyond libc, zlib, pthreads, and libm, and runs on Linux, FreeBSD, NetBSD, and DragonFly BSD.
Build
cc -O2 -o hbc_mini scripts/c/hbc_mini.c -lz -lpthread -lm
Usage
The CLI is identical to hbc_mini.py:
./hbc_mini your-server.example.com
./hbc_mini -d your-server.example.com # daemon mode (logs to syslog)
./hbc_mini -b your-server.example.com # send boot message
./hbc_mini -m "note" your-server.example.com # send one-shot message
./hbc_mini -4 your-server.example.com # IPv4 only
./hbc_mini -6 your-server.example.com # IPv6 only
Config: ~/.hbc.json (JSON, same keys as the Python version).
Architecture
The C client uses two threads:
- Main thread: heartbeat sender loop + select()-based receive loop (1 s timeout). Sends HTB at the configured interval, receives ACK/CMD messages, and re-sends os_info on server request.
- Monitor thread: all periodic plugins in a single thread with a 1-second sleep loop. Each plugin has its own next-run timestamp tracked independently.
SIGHUP causes the process to restart itself via execv(). SIGTERM/SIGINT trigger a clean shutdown (sends a shutdown heartbeat if -b was used).
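The monitor thread's scheduling idea (one loop, a next-run timestamp per plugin) is easy to show in a few lines. The sketch is written in Python purely to illustrate the logic; it is not a translation of the C source, and due_plugins is a hypothetical name.

```python
from typing import Dict, List


def due_plugins(now: float, next_run: Dict[str, float],
                intervals: Dict[str, float]) -> List[str]:
    """Return plugins whose next-run time has passed and reschedule each one."""
    due = []
    for name, when in next_run.items():
        if now >= when:
            due.append(name)
            # Each plugin advances independently by its own interval
            next_run[name] = now + intervals[name]
    return due
```

A loop that sleeps one second and calls this function with the current time reproduces the behaviour described for the monitor thread.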
Available plugins
| Plugin | Platform | Data source |
|---|---|---|
| os_info | Linux, FreeBSD, NetBSD, DragonFly | uname(2), /etc/os-release, kern.osrelease sysctl |
| cpu_monitor | Linux | /proc/stat |
| cpu_monitor | FreeBSD, DragonFly, NetBSD | kern.cp_time sysctl |
| memory_monitor | Linux | /proc/meminfo (ZFS ARC-aware) |
| memory_monitor | FreeBSD, DragonFly | vm.stats.vm.* sysctl |
| memory_monitor | NetBSD | VM_UVMEXP sysctl |
| disk_monitor | All | df -P subprocess |
| network_monitor | Linux | /proc/net/dev |
| network_monitor | FreeBSD, NetBSD, DragonFly | getifaddrs() + AF_LINK |
| ping_monitor | All | ping subprocess |
| nagios_runner | All | popen() subprocess |
cpu_monitor reports: cpu_percent, cpu_user, cpu_system, cpu_idle, cpu_iowait (Linux only), load averages, cpu_core_count, uptime_seconds.
memory_monitor reports: memory_total, memory_used, memory_available, memory_free, memory_percent, and swap fields when swap is present.
network_monitor reports per-interface cumulative bytes_recv/bytes_sent and interval deltas. The loopback interface (lo) is skipped by default; this is configurable:
{
  "plugins": {
    "network_monitor": {
      "skip_interfaces": ["lo", "docker0"]
    }
  }
}
disk_monitor reports per-mount total, used, free, percent. An optional mount filter restricts reporting to specific paths:
{
  "plugins": {
    "disk_monitor": {
      "mounts": ["/", "/data"]
    }
  }
}
Differences from hbc_mini.py
- No filesystem_info or zfs_monitor plugins
- UPD (self-update) messages are logged but not acted on
- No IPv6 early-fail protection
- Config is JSON only (~/.hbc.json), no YAML
Development
Running tests
PYTHONPATH=. python -m unittest discover -v
# or
pytest -q
Linting and type checking
tox -e lint
tox -e mypy
Debugging in VS Code
A .vscode/launch.json is included with configurations for running and attaching the debugger. Select the project .venv as the Python interpreter, then use F5.
To start with debugpy and wait for attach:
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.server.cli serve -c .hb.yaml -f -v
License
MIT. See LICENSE for details.