andreas/heartbeat

Andreas Wrede 39670f4e63 version 5.3.10

2026-06-06 08:28:43 -04:00

Heartbeat Daemon (hbd)

A lightweight UDP-based host monitoring system. Monitored hosts run a client (hbc) that sends periodic heartbeat packets and system metrics to a central server (hbd). The server tracks host reachability, evaluates metric thresholds, sends notifications, and serves a web dashboard.

Architecture

  [ host running hbc ]                [ server running hbd ]
  ┌────────────────────┐              ┌────────────────────────────┐
  │  heartbeat client  │  UDP 50003   │  heartbeat daemon          │
  │                    │ ──────────>  │                            │
  │  plugins:          │  HTB / PLG   │  host state tracking       │
  │  - cpu_monitor     │              │  threshold evaluation      │
  │  - memory_monitor  │  <────────── │  DNS updates (nsupdate)    │
  │  - disk_monitor    │  ACK/CMD/UPD │  notifications             │
  │  - nagios_runner   │              │  web dashboard + REST API  │
  │  - ...             │              │  WebSocket live updates    │
  └────────────────────┘              └────────────────────────────┘

Package: hbd v5.3.10
Python: 3.11+

Subpackages

Package	Purpose
`hbd.common`	Protocol encoding/decoding, shared utilities
`hbd.server`	The `hbd` daemon
`hbd.client`	The `hbc` client

Installation

Dependencies are declared in pyproject.toml. Install into a virtualenv:

# Server + client
pip install .

# Using the install script
scripts/hb_install.sh

Entry points:

hbd — server (hbd.server.cli:main)
hbc — client (hbd.client.main:main)

Runtime dependencies:

Component	Packages
Both	PyYAML ≥6.0
Client	psutil ≥5.9.0
Server	aiohttp ≥3.11, websockets ≥13.2, Jinja2 ≥3.1.6, ruamel.yaml ≥0.18, mattermostdriver ≥7.3.0, matrix-nio ≥0.24

Server (`hbd`)

Starting the server

# Foreground, verbose, with config file
hbd serve -c /etc/hb.yaml -f -v

# As a module
python -m hbd.server.cli serve -c /etc/hb.yaml

CLI subcommands

Command	Description
`hbd serve`	Start the daemon (default)
`hbd passwd <username>`	Generate a password hash for config
`hbd notify`	Test notification channels
`hbd stop`	Stop a running daemon
`hbd reload`	Reload config (send SIGHUP)
`hbd restart`	Restart daemon

Configuration (`~/.hb.yaml`)

# Network
hb_port: 50003          # UDP port for heartbeat messages
hbd_port: 50004         # HTTP API / web UI port
hbd_host: ""            # Bind address (empty = all interfaces)
ws_port: 50005          # WebSocket port (plain)
wss_port: ~             # WebSocket port (TLS; requires cert_path/wss_pem/wss_key)

# Timing
interval: 20            # Expected heartbeat interval (seconds)
grace: 2                # Extra seconds before declaring a host overdue

# Persistence
pickfile: ~/.hb.pick    # Host state persistence
pidfile: ~/.hb.pid
logfile: ~/.hb.log

# Message journal
journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600   # 100 MB
journal_max_backups: 10

# DNS
nsupdate_bin: /usr/bin/nsupdate
dyndomains:
  - example.com

# Threshold alert re-notification interval (seconds)
threshold_renotify_interval: 3600

# Notification channels
notification_channels:
  pushover_ops:
    type: pushover
    token: YOUR_APP_TOKEN
    user: YOUR_USER_KEY
  email_ops:
    type: email
    smtp_server: smtp.example.com
    port: 587
    user: alerts@example.com
    password: secret
    recipients: [ops@example.com]

# Users
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...    # generate with: hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]
  bob:
    password: pbkdf2:sha256:...
    notification_channels: [email_ops]

default_owner: alice

# Hosts
hosts:
  webserver01:
    dyndns: true          # Update DNS when address changes
    owner: alice
    managers: [bob]
    monitors: []
  database01:
    watch: false          # Suppress all notifications for this host

Send SIGHUP (or hbd reload) to reload configuration without restarting. Changes to ports, certificates, pickle path, and journal path require a full restart.

Persistence

Host state (reachability, plugin data, alert states) is saved to pickfile every 5 minutes and on clean shutdown. The server loads this state on startup.

Client (`hbc`)

Usage

# Basic — send heartbeats to a server
hbc your-server.example.com

# Multiple servers
hbc server1.example.com server2.example.com

# With config file, running as a daemon
hbc -d -c /etc/hbc.yaml your-server.example.com

# Send a boot message, then heartbeat normally
hbc -b your-server.example.com

# One-off message
hbc -m "maintenance starting" your-server.example.com

# Force IPv4 or IPv6 only
hbc -4 your-server.example.com
hbc -6 your-server.example.com

Options

Flag	Description
`-b`, `--boot`	Send a boot message at startup
`-c`, `--config FILE`	Config file path (default: `~/.hbc.yaml`)
`-d`, `--daemon`	Daemonize (logs go to syslog)
`-m`, `--message TEXT`	Send a one-off message and exit
`-n`, `--name NAME`	Override reported hostname
`-v`, `--verbose`	Verbose output
`-x`, `--debug`	Debug level (repeatable)
`-4` / `-6`	Restrict to IPv4 or IPv6

Configuration (`~/.hbc.yaml`)

hb_port: 50003         # Server UDP port
interval: 10           # Heartbeat interval (seconds)
owner: alice           # Optional: claim ownership of this host

plugins:
  cpu_monitor:
    interval: 300      # Override collection interval
    per_core: true     # Report per-core CPU usage
  memory_monitor:
    interval: 300
  disk_monitor:
    interval: 300
  network_monitor:
    interval: 300
  ping_monitor:
    interval: 60
    hosts: [8.8.8.8, 192.168.1.1]
  nagios_runner:
    interval: 300
    commands:
      - name: check_load
        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - name: check_disk_root
        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
  zfs_monitor:
    interval: 300

Connection behaviour

The client sends heartbeats over UDP to each server address resolved from the hostname (IPv4 and IPv6).
If a connection cannot be opened at startup, the client retries it: an IPv6 connection is dropped after 3 consecutive failures, while IPv4 connections retry indefinitely.
In daemon mode (-d), all log output goes to syslog (LOG_DAEMON facility).
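
The address handling described above can be sketched with the standard library. This is a minimal illustration rather than hbc's actual code: `Target`, `resolve_targets`, and `MAX_IPV6_FAILURES` are hypothetical names, and the failure counter only mirrors the stated rule that IPv6 gives up after 3 consecutive failures while IPv4 keeps retrying.

```python
import socket
from dataclasses import dataclass

MAX_IPV6_FAILURES = 3  # from the text above: IPv6 is dropped after 3 consecutive failures


@dataclass
class Target:
    """One resolved server address (hypothetical helper type)."""
    family: socket.AddressFamily
    sockaddr: tuple
    failures: int = 0

    def record_failure(self) -> bool:
        """Count a failed open; return True if the target should be kept."""
        self.failures += 1
        if self.family == socket.AF_INET6:
            return self.failures < MAX_IPV6_FAILURES
        return True  # IPv4 retries indefinitely


def resolve_targets(host: str, port: int = 50003, family: int = socket.AF_UNSPEC):
    """Resolve every UDP address (IPv4 and IPv6) for a server hostname."""
    infos = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_DGRAM)
    return [Target(fam, sockaddr) for fam, _type, _proto, _canon, sockaddr in infos]
```

Passing `family=socket.AF_INET` or `family=socket.AF_INET6` would correspond to the `-4` and `-6` flags.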

UDP Protocol

All messages are zlib-compressed key=value pairs with an ID prefix.

!<ID>: <zlib-compressed payload>

Payload format: key=value;key=value;...

Message	Direction	Purpose
`HTB`	client → server	Heartbeat (name, timestamp, RTT, acks, interval)
`PLG`	client → server	Plugin data (plugin name + metrics)
`ACK`	server → client	Acknowledgment
`CMD`	server → client	Execute a shell command on the client
`UPD`	server → client	Trigger self-update via `hb_install.sh`

Value encoding:

Floats: 5 decimal places
Lists/dicts: JSON prefixed with @
Booleans: 1 / 0
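
As a rough illustration of the framing and value rules above, the sketch below encodes and decodes one message. The function names are hypothetical, and two details are assumptions: that the prefix is literally `!<ID>: ` followed directly by the compressed bytes, and that values never contain `;` (the document does not describe any escaping).

```python
import json
import zlib


def encode_value(value) -> str:
    """Encode one value following the rules listed above."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "1" if value else "0"
    if isinstance(value, float):
        return f"{value:.5f}"  # floats: 5 decimal places
    if isinstance(value, (list, dict)):
        return "@" + json.dumps(value, separators=(",", ":"))  # JSON prefixed with @
    return str(value)


def encode_message(msg_id: str, fields: dict) -> bytes:
    """Build '!<ID>: ' followed by the zlib-compressed key=value;... payload."""
    payload = ";".join(f"{key}={encode_value(val)}" for key, val in fields.items())
    return f"!{msg_id}: ".encode("ascii") + zlib.compress(payload.encode("utf-8"))


def decode_payload(packet: bytes) -> tuple[str, dict]:
    """Split off the ID prefix, decompress, and return the raw string fields."""
    header, _, body = packet.partition(b": ")
    text = zlib.decompress(body).decode("utf-8")
    fields = dict(item.split("=", 1) for item in text.split(";") if item)
    return header[1:].decode("ascii"), fields
```

Decoding here leaves every value as a string; turning `1.50000` or `@[1,2]` back into typed values would need knowledge of each field's type, which this sketch does not attempt.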

RTT is measured using kernel SO_TIMESTAMP when available (Linux, macOS, FreeBSD), falling back to application-layer timing.

Plugin System

Plugins run on the client and collect system metrics that are sent to the server as PLG messages.

Plugin types

Type	`interval`	When collected
`InfoPlugin`	0	Once at startup; re-collected on server request
`MonitorPlugin`	30 (default)	Periodically on the configured interval

Built-in plugins

Plugin	Type	Data collected
`os_info`	Info	OS, kernel, distro, architecture, Python version, hbc version
`cpu_monitor`	Monitor	cpu_percent, per-core usage, load averages, process count, frequency
`memory_monitor`	Monitor	RAM and swap usage (ZFS ARC-aware)
`disk_monitor`	Monitor	Per-partition usage, disk I/O stats
`network_monitor`	Monitor	Per-interface byte/packet counts, connection count
`ping_monitor`	Monitor	RTT, packet loss, jitter per configured host
`filesystem_info`	Info	Mounted filesystems (excludes pseudo filesystems)
`nagios_runner`	Monitor	Output of configured Nagios-compatible check commands
`zfs_monitor`	Monitor	ZFS pool health, capacity, fragmentation, dedup ratio, I/O

Custom plugins

Create a .py file in hbd/client/plugins/:

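from hbd.client.plugin import MonitorPlugin

class MyPlugin(MonitorPlugin):
    name = "my_plugin"   # plugin name sent with each PLG message
    interval = 60        # collection interval in seconds

    async def collect(self):
        # Return a dict mapping metric names to values
        return {"my_metric": 42}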

initialize() is called once at load time; return False to disable the plugin (e.g., if a required binary is missing).

Nagios integration

The nagios_runner plugin executes any Nagios-compatible check binary:

plugins:
  nagios_runner:
    commands:
      - name: check_http
        command: /usr/lib/nagios/plugins/check_http -H example.com

Commands are validated (absolute paths, executable) at startup.
Exit codes map to OK / WARNING / CRITICAL / UNKNOWN.
Performance data fields are extracted and stored individually.
The nagios threshold operator maps exit codes directly to alert levels (see Threshold Alerting).

Threshold Alerting

The server evaluates plugin metrics against configurable thresholds and fires notifications on state changes.

Configuration

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      critical: 90.0
      operator: ">"         # >, >=, <, <=, ==, != (default: >)
      hysteresis: 0.1       # 10%: recover at 81 when critical=90
      count: 1              # Require N consecutive breaches before alerting
      display: "CPU {cpu_percent}% (threshold: {op_symbol}{threshold_value})"

  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0

  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0
          critical: 5.0
          operator: "<"

  nagios_runner:
    status_code:
      operator: "nagios"    # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
      display: "{check_name}: {output}"
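
To make the `operator` and `hysteresis` settings concrete, here is a minimal sketch of how one sample might be evaluated. It is not the server's implementation. The recovery rule is inferred from the comment "recover at 81 when critical=90"; mirroring it upward for `<` operators is an assumption, as are all function names. The `count` setting and the `nagios` operator are left out.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}
RANK = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def recovery_point(threshold: float, hysteresis: float, op: str = ">") -> float:
    """Value the metric must cross back over before the alert clears.

    Assumption: below the threshold for '>'-style operators,
    mirrored above it for '<'-style operators.
    """
    if op in ("<", "<="):
        return threshold * (1 + hysteresis)
    return threshold * (1 - hysteresis)


def evaluate(value: float, rule: dict, current: str = "OK") -> str:
    """Return OK, WARNING or CRITICAL for one sample, applying hysteresis."""
    op = rule.get("operator", ">")
    hyst = rule.get("hysteresis", 0.0)
    breached = OPS[op]
    for level in ("CRITICAL", "WARNING"):  # most severe first
        limit = rule.get(level.lower())
        if limit is None:
            continue
        if breached(value, limit):
            return level
        # Already at or above this level: hold it until the recovery point is passed.
        if RANK[current] >= RANK[level] and breached(value, recovery_point(limit, hyst, op)):
            return level
    return "OK"
```

With the `cpu_percent` rule above, a host that is already CRITICAL stays CRITICAL at 85% and only leaves that state once the value is no longer above 81%.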

Per-host threshold profiles

Named profiles let different hosts use different thresholds. A single name or a list is accepted; lists are applied left-to-right.

threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}

  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

hosts:
  web-01:
    threshold_config: default
  db-01:
    threshold_config: [default, tight_cpu]
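
One plausible reading of "applied left-to-right" is a deep merge in which later profiles override earlier ones. The sketch below works under that assumption; `deep_merge` and `resolve_thresholds` are hypothetical names, and the server's real merge rules may differ.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override merged in, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_thresholds(threshold_config, profiles: dict) -> dict:
    """Accept a single profile name or a list and merge the profiles left-to-right."""
    if isinstance(threshold_config, str):
        names = [threshold_config]
    else:
        names = list(threshold_config)
    result: dict = {}
    for name in names:
        result = deep_merge(result, profiles[name]["thresholds"])
    return result
```

Under this reading, `db-01` in the example above would end up with the `tight_cpu` values (warning 60, critical 75) for `cpu_percent`.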

Alert states

State	Meaning
OK	Metric within normal range
WARNING	Metric crossed warning threshold
CRITICAL	Metric crossed critical threshold
UNKNOWN	Cannot determine (e.g. Nagios exit code 3)

Notifications are sent on state transitions (OK → WARNING, WARNING → CRITICAL, CRITICAL → OK). De-escalations (CRITICAL → WARNING) do not trigger a notification. Ongoing alerts generate a re-notification every threshold_renotify_interval seconds (default: 3600). Alerts can be acknowledged via the web UI or API to suppress re-notifications.
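
The notification rules in the paragraph above can be condensed into a single decision function. This is a sketch under assumptions: `should_notify` is a hypothetical name, any return to OK is treated as a recovery worth notifying, and UNKNOWN is omitted because its handling is not described here.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def should_notify(old: str, new: str, last_sent: float, now: float,
                  renotify_interval: float = 3600, acknowledged: bool = False) -> bool:
    """Decide whether a notification is due for one alert."""
    if new != old:
        if new == "OK":
            return True  # recovery
        # Escalations notify; de-escalations such as CRITICAL -> WARNING stay silent.
        return SEVERITY[new] > SEVERITY[old]
    if new == "OK" or acknowledged:
        return False  # nothing ongoing, or re-notifications suppressed by an acknowledgment
    return now - last_sent >= renotify_interval
```

Here `last_sent` and `now` are timestamps in seconds, so an unacknowledged CRITICAL alert re-notifies once `threshold_renotify_interval` seconds have passed since the previous message.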

RTT thresholds

The server measures heartbeat round-trip time and supports RTT thresholds using the same format:

thresholds:
  rtt:
    webserver01:
      warning: 100.0    # ms
      critical: 500.0

Generic threshold matching

When a metric has no exact threshold entry, the server strips leading underscore-separated segments from the metric name one at a time and retries after each. This allows one entry to cover all Nagios checks:

nagios_runner.check_disk_root_status_code → no match
nagios_runner.disk_root_status_code       → no match
nagios_runner.root_status_code            → no match
nagios_runner.status_code                 → matched ✓

The stripped prefix (check_disk_root) is available as {check_name} in the display template.
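
The lookup traced above can be sketched as a loop that drops one leading segment per attempt. `find_threshold` is a hypothetical name, and splitting on `_` is inferred from the trace rather than stated by the source.

```python
def find_threshold(plugin: str, metric: str, thresholds: dict):
    """Return (rule, check_name) for the first key that matches, else (None, None).

    The metric name is split on '_' and segments are dropped from the left
    one at a time until a configured key is found.
    """
    rules = thresholds.get(plugin, {})
    segments = metric.split("_")
    for start in range(len(segments)):
        candidate = "_".join(segments[start:])
        if candidate in rules:
            check_name = "_".join(segments[:start])  # the stripped prefix
            return rules[candidate], check_name
    return None, None
```

An exact match returns an empty `check_name`, since nothing was stripped.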

Display template variables

Variable	Description
`{value}`	Current metric value
`{threshold_value}`	Threshold that was crossed
`{op_symbol}`	Comparison operator
`{check_name}`	Prefix stripped by generic matching
`{metric_name}`	Full field name
`{output}`	Nagios check output text
`{status}`	Nagios status name (OK/WARNING/CRITICAL/UNKNOWN)
any plugin field	Any field present in the plugin's data
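
The placeholders in the table use the same brace syntax as Python's `str.format`, so rendering can be sketched with `format_map`. This is an illustration only: `render_display` is a hypothetical name, and leaving unknown placeholders untouched (instead of raising an error) is a design choice made for this sketch, not documented behaviour.

```python
class _Missing(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def render_display(template: str, plugin_data: dict, **alert_vars) -> str:
    """Fill a display template from plugin fields plus per-alert variables."""
    return template.format_map(_Missing({**plugin_data, **alert_vars}))
```

For example, the `cpu_percent` template from the configuration section could be rendered with `render_display(template, {"cpu_percent": 93.5}, op_symbol=">", threshold_value=90.0)`.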

Notification Channels

Notifications are dispatched to the host's owner, managers, and monitors. Each user specifies which channels to use.

Supported channel types

Type	Required fields
`pushover`	`token`, `user`
`email`	`smtp_server`, `recipients`, `sender`, `user`, `password`, `port`
`mattermost`	`webhook_url`, `channel`
`matrix`	`homeserver`, `user`, `password`, `room_id`
`signal`	`phone_number`, `recipient`
`sms_voipms`	`api_key`, `recipient`

Each channel can set a min_level (WARNING or CRITICAL) to filter out alerts below that severity.

Recovery notifications are only sent to channels that received the original alert.
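
A minimal sketch of the two filtering rules above follows. The function names are hypothetical, and treating a channel without `min_level` as accepting WARNING and above is an assumption.

```python
LEVELS = {"WARNING": 1, "CRITICAL": 2}


def channels_for_alert(user_channels: list, channel_config: dict, level: str) -> list:
    """Return the channel names that accept an alert of the given level."""
    selected = []
    for name in user_channels:
        # Assumption: a channel without min_level accepts WARNING and above.
        min_level = channel_config[name].get("min_level", "WARNING")
        if LEVELS[level] >= LEVELS[min_level]:
            selected.append(name)
    return selected


def channels_for_recovery(user_channels: list, alerted: set) -> list:
    """Recovery goes only to channels that received the original alert."""
    return [name for name in user_channels if name in alerted]
```

The caller would record the result of `channels_for_alert` when the alert fires and pass it as `alerted` when the metric recovers.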

Web Dashboard & HTTP API

The server exposes a web UI and REST API on hbd_port (default 50004).

Web pages

Path	Description
`/login`	Login form (shown automatically when auth is configured)
`/live`	Real-time host connectivity, RTT, and message stream
`/plugins/<host>`	Per-host plugin metrics
`/alerts`	Active alerts with severity filtering
`/settings`	Server config, users, notification channels, thresholds

Live views use WebSocket connections for real-time updates.

Non-admin users see only hosts where they have a role (monitor, manager, or owner). Admins see all hosts.

REST API

All endpoints are under /api/0/. When authentication is configured, include a session token:

# Log in, get a token
TOKEN=$(curl -s -X POST http://localhost:50004/api/0/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"alice","password":"secret"}' | jq -r .token)

# Use the token
curl -H "Authorization: Bearer $TOKEN" http://localhost:50004/api/0/hosts

Method	Endpoint	Description
GET	`/api/0/hosts`	All visible hosts
GET	`/api/0/alerts`	All active alerts
GET	`/api/0/alert_summary`	Count of ok/warning/critical
GET	`/api/0/messages`	Last 30 messages
GET	`/api/0/hosts/{host}/plugins`	All plugin data for host
GET	`/api/0/hosts/{host}/plugins/{plugin}?limit=N`	Plugin samples
GET	`/api/0/hosts/{host}/alerts`	Alert states for host
GET	`/api/0/hosts/{host}/access`	Access roles
PUT	`/api/0/hosts/{host}/access`	Update access roles
GET	`/api/0/hosts/{host}/info`	Host info (hbc version, thresholds)
POST	`/api/0/alerts/acknowledge`	Acknowledge alert
GET	`/api/0/users`	All users (admin only)
GET	`/api/0/users/me`	Current user profile
PUT	`/api/0/users/me`	Update own profile
POST	`/api/0/auth/login`	Create session
POST	`/api/0/auth/logout`	Destroy session
GET	`/api/0/config`	Server config (secrets redacted)
POST	`/api/0/config`	Update config
GET	`/api/0/config/backups`	List config backups
POST	`/api/0/config/rollback`	Roll back to previous config
GET	`/api/0/notification_channels`	List channels
POST	`/api/0/notification_channels`	Create channel
PUT	`/api/0/notification_channels/{name}`	Update channel
DELETE	`/api/0/notification_channels/{name}`	Delete channel
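
The curl calls above translate directly to Python's standard library. The sketch below assumes, as the curl example does, that the login response is JSON with a `token` field; the helper names are hypothetical and the shape of other responses is not specified in this document.

```python
import json
import urllib.request

BASE = "http://localhost:50004/api/0"


def login_request(username: str, password: str, base: str = BASE) -> urllib.request.Request:
    """Build the POST that creates a session."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(f"{base}/auth/login", data=body, method="POST",
                                  headers={"Content-Type": "application/json"})


def api_request(path: str, token: str, base: str = BASE) -> urllib.request.Request:
    """Build an authenticated GET for an endpoint from the table above."""
    return urllib.request.Request(f"{base}{path}",
                                  headers={"Authorization": f"Bearer {token}"})


def fetch_json(request: urllib.request.Request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Against a running server, `token = fetch_json(login_request("alice", "secret"))["token"]` followed by `fetch_json(api_request("/hosts", token))` would mirror the two curl commands.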

User Management & Authentication

When the config contains no users: block, the server runs unauthenticated and all existing behaviour is preserved.

Roles

Role	Capabilities
monitor	View status, plugin data, alerts
manager	monitor + queue commands, trigger DNS, queue upgrades
owner	manager + drop host, transfer ownership, update access
admin	Owner-level on all hosts + access to server config and users
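
Because each role in the table includes the one before it, access checks reduce to a rank comparison. The sketch below derives a user's role from the config keys shown in this document (`users`, `hosts`, `owner`, `managers`, `monitors`, `default_owner`); the function names are hypothetical and server-level admin rights (config and user management) are not modelled.

```python
ROLE_RANK = {"monitor": 1, "manager": 2, "owner": 3}


def role_on_host(user: str, host: str, config: dict):
    """Return the user's effective role on a host, or None if they have none."""
    if config.get("users", {}).get(user, {}).get("admin"):
        return "owner"  # admins act with owner-level rights on every host
    host_cfg = config.get("hosts", {}).get(host) or {}
    # default_owner owns any host with no explicit owner
    if user == host_cfg.get("owner", config.get("default_owner")):
        return "owner"
    if user in host_cfg.get("managers", []):
        return "manager"
    if user in host_cfg.get("monitors", []):
        return "monitor"
    return None


def has_role(user: str, host: str, required: str, config: dict) -> bool:
    """True if the user's role on the host is at least the required one."""
    role = role_on_host(user, host, config)
    return role is not None and ROLE_RANK[role] >= ROLE_RANK[required]
```

A handler for queueing a command might then guard itself with `has_role(user, host, "manager", config)`.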

Setup

users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...    # hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]

default_owner: alice    # Owns any host with no explicit owner

hosts:
  webserver01:
    owner: alice
    managers: [bob]
    monitors: [carol]

Password hashing uses PBKDF2-HMAC-SHA256 (260,000 iterations). Sessions expire after 24 hours.

OAuth2 login (Gitea) is supported:

oauth:
  gitea:
    url: https://git.example.com
    client_id: xxx
    client_secret: yyy

Dynamic DNS

When dyndns: true is set on a host and dyndomains is configured, the server updates DNS via nsupdate whenever the host's source address changes.

nsupdate_bin: /usr/bin/nsupdate
dyndomains:
  - example.com

hosts:
  webserver01:
    dyndns: true

DNS updates run asynchronously in a background worker.
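
As an illustration of what such an update involves, the sketch below builds standard `nsupdate` input and feeds it to the binary on stdin. It is not the server's code: the function names, the `<host>.<domain>` naming, and the TTL of 300 are assumptions, authentication (for example a TSIG key) is omitted, and the call is synchronous rather than running in a background worker.

```python
import ipaddress
import subprocess


def build_nsupdate_script(host: str, domain: str, address: str, ttl: int = 300) -> str:
    """Build nsupdate input that replaces the host's A or AAAA record."""
    rtype = "AAAA" if ipaddress.ip_address(address).version == 6 else "A"
    fqdn = f"{host}.{domain}."
    return (
        f"update delete {fqdn} {rtype}\n"
        f"update add {fqdn} {ttl} {rtype} {address}\n"
        "send\n"
    )


def run_nsupdate(script: str, nsupdate_bin: str = "/usr/bin/nsupdate") -> None:
    """Feed the script to nsupdate on stdin; raises CalledProcessError on failure."""
    subprocess.run([nsupdate_bin], input=script, text=True, check=True, timeout=30)
```

Deleting the old record before adding the new one keeps a single address per record type when the host's source address changes.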

Message Journal

All received messages are logged in JSONL format with automatic size-based rotation.

journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600    # 100 MB
journal_max_backups: 10

Example entry:

{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver01","interval":10}}
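
Since each journal line is a standalone JSON object, the file can be processed line by line. The sketch below counts entries per host for one message type; the function names are hypothetical, and skipping unparseable lines is a defensive choice for this example.

```python
import json
from collections import Counter


def iter_journal(lines):
    """Yield one dict per non-empty JSONL line, skipping lines that fail to parse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written last line


def count_by_host(lines, msg_id: str = "HTB") -> Counter:
    """Count journal entries of one message type per reporting host."""
    counts = Counter()
    for entry in iter_journal(lines):
        message = entry.get("message", {})
        if message.get("ID") == msg_id:
            counts[message.get("name")] += 1
    return counts
```

With the configuration above, the input could be an open file handle on `/var/log/heartbeat/messages.journal`, because iterating a text file yields its lines.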

`hbc_mini` — Zero-dependency client

scripts/hbc_mini.py is a single-file client requiring only Python 3.8+ and no external packages. Copy it to any host and run directly.

python3 hbc_mini.py your-server.example.com
python3 hbc_mini.py -d your-server.example.com     # daemon mode
python3 hbc_mini.py -b your-server.example.com     # send boot message

Config: ~/.hbc.json (JSON format, same keys as ~/.hbc.yaml).

Available plugins:

Plugin	Platform
`os_info`	All
`ping_monitor`	All
`nagios_runner`	All (not Windows)
`cpu_monitor`	Linux (`/proc/stat`; no per-core, no frequency)
`memory_monitor`	Linux (`/proc/meminfo`)
`disk_monitor`	Linux, macOS, BSD (`df -P`)
`network_monitor`	Linux (`/proc/net/dev`)

Compared with the full hbc, hbc_mini has no YAML config, no filesystem_info plugin, no zfs_monitor plugin, and no IPv6 early-fail protection.

`hbc_mini.c` — C client

scripts/c/hbc_mini.c is a single-file C port of hbc_mini.py. It has no runtime dependencies beyond libc, zlib, pthreads, and libm, and runs on Linux, FreeBSD, NetBSD, and DragonFly BSD.

Build

cc -O2 -o hbc_mini scripts/c/hbc_mini.c -lz -lpthread -lm

Usage

The CLI is identical to hbc_mini.py:

./hbc_mini your-server.example.com
./hbc_mini -d your-server.example.com      # daemon mode (logs to syslog)
./hbc_mini -b your-server.example.com      # send boot message
./hbc_mini -m "note" your-server.example.com   # send one-shot message
./hbc_mini -4 your-server.example.com      # IPv4 only
./hbc_mini -6 your-server.example.com      # IPv6 only

Config: ~/.hbc.json (JSON, same keys as the Python version).

Architecture

The C client uses two threads:

Main thread — heartbeat sender loop + select()-based receive loop (1 s timeout). Sends HTB at the configured interval, receives ACK/CMD messages, and re-sends os_info on server request.
Monitor thread — all periodic plugins in a single thread with a 1-second sleep loop. Each plugin has its own next-run timestamp tracked independently.

SIGHUP causes the process to restart itself via execv(). SIGTERM/SIGINT trigger a clean shutdown (sends a shutdown heartbeat if -b was used).

Available plugins

Plugin	Platform	Data source
`os_info`	Linux, FreeBSD, NetBSD, DragonFly	`uname(2)`, `/etc/os-release`, `kern.osrelease` sysctl
`cpu_monitor`	Linux	`/proc/stat`
`cpu_monitor`	FreeBSD, DragonFly, NetBSD	`kern.cp_time` sysctl
`memory_monitor`	Linux	`/proc/meminfo` (ZFS ARC-aware)
`memory_monitor`	FreeBSD, DragonFly	`vm.stats.vm.*` sysctl
`memory_monitor`	NetBSD	`VM_UVMEXP` sysctl
`disk_monitor`	All	`df -P` subprocess
`network_monitor`	Linux	`/proc/net/dev`
`network_monitor`	FreeBSD, NetBSD, DragonFly	`getifaddrs()` + `AF_LINK`
`ping_monitor`	All	`ping` subprocess
`nagios_runner`	All	`popen()` subprocess

cpu_monitor reports: cpu_percent, cpu_user, cpu_system, cpu_idle, cpu_iowait (Linux only), load averages, cpu_core_count, uptime_seconds.

memory_monitor reports: memory_total, memory_used, memory_available, memory_free, memory_percent, and swap fields when swap is present.

network_monitor reports per-interface cumulative bytes_recv/bytes_sent and interval deltas. The loopback interface (lo) is skipped by default; this is configurable:

{
  "plugins": {
    "network_monitor": {
      "skip_interfaces": ["lo", "docker0"]
    }
  }
}

disk_monitor reports per-mount total, used, free, percent. An optional mount filter restricts reporting to specific paths:

{
  "plugins": {
    "disk_monitor": {
      "mounts": ["/", "/data"]
    }
  }
}

Differences from `hbc_mini.py`

No filesystem_info or zfs_monitor plugins
UPD (self-update) messages are logged but not acted on
No IPv6 early-fail protection
Config is JSON only (~/.hbc.json), no YAML

Development

Running tests

PYTHONPATH=. python -m unittest discover -v
# or
pytest -q

Linting and type checking

tox -e lint
tox -e mypy

Debugging in VS Code

A .vscode/launch.json is included with configurations for running the server and for attaching the debugger. Select the project .venv as the Python interpreter, then press F5.

To start with debugpy and wait for attach:

PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.server.cli serve -c .hb.yaml -f -v

License

MIT. See LICENSE for details.
