Heartbeat Daemon (hbd) ✅
A lightweight daemon that listens for UDP heartbeat messages and acts on them: keeps host state, optionally updates DNS records via nsupdate, forwards messages to WebSocket clients, and sends notifications (email, Pushover, Mattermost, Signal). It is a refactor of a previously monolithic script into a modular Python package (hbd).
📌 Features
- Receive and parse heartbeat datagrams (text or zlib-compressed) ✅
- Maintain host state and detect up/down transitions ✅
- Queue DNS updates via nsupdate and run them in a background thread ✅
- WebSocket API for live updates (hosts & messages) ✅
- Notification pipeline (email, Pushover, Mattermost, Signal) ✅
- User management & access control ✅
  - Optional user accounts with bcrypt-style password hashing (stdlib only)
  - Per-host roles: owner, manager, monitor
  - Session-based auth with cookie support (browser login page included)
  - Backwards compatible: no auth required when no users are configured
- HTTP API & Web UI ✅
  - REST API for plugin data, alerts, host information, and user management
  - Live dashboard with WebSocket updates
  - Interactive plugin metrics visualization
  - Alerts dashboard with filtering and summaries
- Message journal with automatic log rotation ✅
  - Logs all received messages in JSON format
  - Size-based automatic rotation
  - Configurable retention and backup management
- Plugin system for extensible monitoring ✅
  - Collect system metrics (CPU, memory, disk, network)
  - Monitor ZFS pool health, capacity, and I/O via zpool(8)
  - Execute existing Nagios monitoring plugins
  - Create custom plugins with simple Python classes
- Threshold alerting system ✅
  - Monitor metrics against configurable WARNING/CRITICAL thresholds
  - Hysteresis to prevent alert flapping
  - Automatic notifications on state changes
  - Re-notification for ongoing alerts
- Per-host watch flag: set watch: false on any host to silence all notifications for that host without removing its configuration ✅
- Role-filtered dashboards: Live Dashboard and Host Overview show only hosts where the logged-in user is owner or manager (admins see all) ✅
- Modular codebase suitable for unit testing and CI ✅
🔌 Plugin System
Heartbeat includes a comprehensive plugin architecture that extends monitoring beyond simple heartbeats. The plugin system allows you to:
- Collect system information: OS details, hardware info, system configuration
- Monitor resources: CPU usage, memory, disk space, network statistics
- Run Nagios plugins: Execute thousands of existing Nagios monitoring plugins without modification
- Create custom plugins: Build your own monitoring logic with simple Python classes
Plugin Types
- InfoPlugin: Collects static information once (e.g., OS version, hardware specs)
- MonitorPlugin: Collects metrics periodically (e.g., CPU usage every 30 seconds)
Built-in Plugins
- os_info: Collects OS, kernel, distribution, and architecture information
- cpu_monitor: Monitors CPU usage, load average, frequency, and process counts
- memory_monitor: Monitors RAM and swap usage, available memory
- disk_monitor: Monitors disk usage, I/O statistics, and filesystem metrics
- network_monitor: Monitors network interface statistics, bandwidth, and connections
- filesystem_info: Collects mounted filesystem information (physical filesystems only by default)
- nagios_runner: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
- zfs_monitor: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via zpool(8)
Nagios Integration
The nagios_runner plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:
- Executes plugins asynchronously (non-blocking) with timeout protection
- Captures both stdout and stderr; if stdout is empty, stderr is used as the status message
- Handles signal-killed processes (negative exit code → UNKNOWN status)
- Validates absolute command paths at startup and warns on missing or non-executable files
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
- Extracts performance data with thresholds
- Reports aggregated status across all configured checks
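The result handling described in the list above can be sketched in a few lines of Python. This is a minimal illustration and not the nagios_runner source: the function name interpret_result and its return shape are assumptions, and the performance-data parsing is simplified (it does not handle quoted labels that contain spaces).

```python
from typing import Dict, Tuple

# Standard Nagios plugin exit codes.
NAGIOS_STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def interpret_result(returncode: int, stdout: str, stderr: str) -> Tuple[str, str, Dict[str, str]]:
    """Return (status, message, perfdata) for one finished check.

    Hypothetical helper for illustration; not part of the hbd API.
    """
    # A negative return code means the process was killed by a signal.
    # It, like any other code outside 0-3, maps to UNKNOWN.
    status = NAGIOS_STATUS.get(returncode, "UNKNOWN")

    # Fall back to stderr when the plugin printed nothing on stdout.
    output = stdout.strip() or stderr.strip()
    lines = output.splitlines()
    first_line = lines[0] if lines else ""

    # Nagios convention: "message | label=value;warn;crit;min;max ..."
    message, _, perf = first_line.partition("|")
    perfdata: Dict[str, str] = {}
    for item in perf.split():
        label, sep, value = item.partition("=")
        if sep:
            perfdata[label] = value
    return status, message.strip(), perfdata
```

The value kept for each label still carries the warn/crit/min/max fields, so a caller can split it on ";" when it needs the thresholds.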
See docs/NAGIOS_INTEGRATION.md for complete integration guide including configuration examples and custom plugin development.
Creating Custom Plugins
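import shutil
import time

from hbd.client.plugin import MonitorPlugin


class DiskMonitorPlugin(MonitorPlugin):
    name = "disk_monitor"
    interval = 60  # Run every 60 seconds

    async def collect(self):
        # shutil.disk_usage returns total, used, and free bytes for the path
        usage = shutil.disk_usage("/")
        return {
            "disk_usage": usage.used / usage.total * 100,  # percent used
            "timestamp": time.time(),
        }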
Place plugins in hbd/client/plugins/ and they'll be automatically discovered and loaded by the client.
📝 Message Journal
Heartbeat includes a message journal that logs all received messages with automatic rotation.
Features
- JSON Format: All messages logged in JSONL (JSON Lines) format for easy parsing
- Automatic Rotation: Size-based rotation with configurable thresholds
- Backup Management: Keeps configurable number of rotated log files
- Non-blocking: Async logging with minimal performance impact
Configuration
# Message journal settings
journal_enabled: true # Enable/disable journaling
journal_dir: /var/log/heartbeat # Journal directory
journal_file: messages.journal # Base filename
journal_max_size: 104857600 # Max size (100MB default)
journal_max_backups: 10 # Number of backups to keep
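To make the two size settings concrete, here is a small sketch of size-based rotation with a bounded number of backups. It is an illustration only: the numbered suffix scheme (messages.journal.1, messages.journal.2, ...) and the function name rotate_if_needed are assumptions, and the naming hbd actually uses is described in docs/MESSAGE_JOURNAL.md.

```python
import os


def rotate_if_needed(path: str, max_size: int, max_backups: int) -> bool:
    """Rotate `path` once it has reached max_size bytes.

    Returns True when a rotation happened. Hypothetical helper, assuming
    backups are named path.1 (newest) through path.<max_backups> (oldest).
    """
    if not os.path.exists(path) or os.path.getsize(path) < max_size:
        return False

    if max_backups <= 0:
        # No backups are kept: simply discard the full file.
        os.remove(path)
        return True

    # Drop the oldest backup, then shift the others up by one.
    oldest = f"{path}.{max_backups}"
    if os.path.exists(oldest):
        os.remove(oldest)
    for index in range(max_backups - 1, 0, -1):
        src = f"{path}.{index}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{index + 1}")

    # The current journal becomes the newest backup.
    os.replace(path, f"{path}.1")
    return True
```

With journal_max_size: 104857600 and journal_max_backups: 10, a scheme like this would cap disk usage at roughly eleven files of 100 MB each.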
Example Journal Entry
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver1","interval":30}}
Analyzing Journal Files
# View recent messages
tail -100 /var/log/heartbeat/messages.journal | jq .
# Count messages by type
cat /var/log/heartbeat/messages.journal | jq -r '.message.ID' | sort | uniq -c
# Filter by hostname
cat /var/log/heartbeat/messages.journal | jq 'select(.message.name == "webserver1")'
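The same analysis can be done in Python when jq is not available. The sketch below relies only on the entry layout shown in the example journal entry (a top-level message object with ID and name fields); the helper names are made up for this example.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, List


def read_journal(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSONL line, skipping blank or unparsable lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a line that was only partially written


def count_by_id(entries: Iterable[dict]) -> Counter:
    """Count messages by type, like the `jq -r '.message.ID'` pipeline."""
    return Counter(entry.get("message", {}).get("ID") for entry in entries)


def for_host(entries: Iterable[dict], name: str) -> List[dict]:
    """Keep only the entries whose message.name equals `name`."""
    return [e for e in entries if e.get("message", {}).get("name") == name]


# Usage, assuming the default journal location from the configuration:
#
#   with open("/var/log/heartbeat/messages.journal") as fh:
#       entries = list(read_journal(fh))
#   print(count_by_id(entries))
#   print(len(for_host(entries, "webserver1")))
```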
See docs/MESSAGE_JOURNAL.md for complete documentation including rotation behavior, integration with log management systems, and analysis examples.
🚨 Threshold Alerting
Heartbeat includes a sophisticated threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured limits.
Features
- Multi-level alerts: WARNING and CRITICAL severity levels
- Flexible operators: Support for >, >=, <, <=, ==, != comparisons
- Hysteresis: Prevents alert flapping with configurable recovery thresholds
- Smart notifications: Alerts only on state changes, not every check; de-escalations (e.g. CRITICAL → WARNING) do not generate a notification
- Re-notifications: Periodic reminders for ongoing alerts
- Short-duration suppression: Recovery notifications are suppressed for down events under 4 seconds (avoids noise from transient blips)
- Journal integration: All threshold events logged for audit trail
- ping_monitor thresholds: Latency and packet-loss thresholds use the same format as all other plugin metrics
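The notification rules in this list (alert on escalation and recovery, stay quiet on de-escalation, remind periodically) can be expressed as one small decision function. This is a sketch under assumptions rather than the threshold engine itself: the function name and signature are invented here, and the 4-second suppression for short down events is left out.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def should_notify(old: str, new: str, seconds_since_last: float,
                  renotify_interval: float = 3600.0) -> bool:
    """Decide whether a state evaluation should produce a notification.

    Hypothetical helper; `renotify_interval` mirrors the
    threshold_renotify_interval setting (seconds).
    """
    if new == old:
        # Unchanged state: only ongoing alerts get periodic reminders.
        return new != "OK" and seconds_since_last >= renotify_interval
    if new == "OK":
        return True  # recovery
    # Escalations notify; de-escalations (CRITICAL -> WARNING) do not.
    return SEVERITY[new] > SEVERITY[old]
```

For example, should_notify("CRITICAL", "WARNING", 0) returns False, while should_notify("WARNING", "CRITICAL", 0) returns True.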
Configuration
thresholds:
  # RTT (Round-Trip Time) thresholds for heartbeat monitoring
  # These are checked on every HTB message arrival
  rtt:
    webserver01:
      warning: 100.0   # Warn when RTT > 100ms
      critical: 500.0  # Critical when RTT > 500ms
    database01:
      warning: 50.0
      critical: 200.0
  # Plugin metric thresholds
  cpu_monitor:
    cpu_percent:
      warning: 80.0    # Warn when CPU > 80%
      critical: 90.0   # Critical when CPU > 90%
      operator: ">"
      hysteresis: 0.1  # 10% hysteresis to prevent flapping
  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0
  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0  # Alert when < 10GB free
          critical: 5.0
          operator: "<"  # Inverse threshold
# Global settings
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
RTT Monitoring
Heartbeat monitors network latency (Round-Trip Time) for each host's heartbeat messages. RTT thresholds are fully integrated with the threshold alerting system:
- Per-host configuration: Set different thresholds for each monitored host
- Real-time checking: Thresholds evaluated on every HTB message arrival
- Alert state tracking: RTT alerts use the same state management as plugin metrics
- Hysteresis support: Configurable hysteresis prevents rapid state transitions
- Alerts dashboard: RTT alerts visible on the /alerts web page alongside plugin alerts
- Smart notifications: Only triggers on state changes (OK → WARNING → CRITICAL)
- Re-notification: Periodic reminders for ongoing RTT issues
- Event & journal logging: All RTT events logged for audit trail
Configuration format:
thresholds:
  rtt:
    <hostname>:
      warning: <milliseconds>   # Warn when RTT > this value
      critical: <milliseconds>  # Critical when RTT > this value
      hysteresis: 0.1           # Optional: 10% hysteresis (default)
Example alerts:
WARNING: webserver01 - rtt.webserver01 = 125.3
CRITICAL: database01 - rtt.database01 = 520.1
RECOVERED: webserver01 - rtt.webserver01 = 45.2 (WARNING -> OK)
RTT alerts appear on the Alerts dashboard and can be filtered by severity level. The metric_path format is rtt.<hostname>, making it easy to distinguish from plugin metrics.
Alert Behavior
- State Changes: Notifications sent when crossing thresholds
  - OK → WARNING: Early notification
  - WARNING → CRITICAL: Escalation
  - CRITICAL → OK: Recovery
- Hysteresis: Prevents rapid state transitions

      Critical threshold: 90%
      Hysteresis: 10%
      Recovery threshold: 81% (90 - 10% of 90)
      Value 91% → CRITICAL (threshold crossed)
      Value 85% → CRITICAL (still above 81%)
      Value 79% → OK (below recovery threshold)

- Re-notifications: Periodic reminders for ongoing alerts
  - Default: Every 60 minutes
  - Configurable via threshold_renotify_interval
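The hysteresis arithmetic above can be checked with a short sketch. It covers a single ">" threshold only, and the function name next_state is hypothetical. Holding the alert at exactly the recovery value is also an assumption, since the text only states what happens above and below it.

```python
def next_state(current: str, value: float, threshold: float,
               hysteresis: float = 0.1) -> str:
    """Evaluate one ">" threshold with relative hysteresis.

    The alert is raised above `threshold` and is only cleared once the
    value falls below threshold * (1 - hysteresis).
    """
    recovery = threshold * (1.0 - hysteresis)  # 90 with 10% -> 81
    if value > threshold:
        return "CRITICAL"
    if current == "CRITICAL" and value >= recovery:
        return "CRITICAL"  # inside the hysteresis band: hold the alert
    return "OK"


# Replaying the values from the example with a critical threshold of 90:
state = "OK"
history = []
for reading in (91.0, 85.0, 79.0):
    state = next_state(state, reading, threshold=90.0)
    history.append(state)
```

Note that 85 only counts as CRITICAL because the previous state was CRITICAL; starting from OK, the same reading stays OK.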
Example Notifications
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
CRITICAL: webserver01 - memory_monitor.percent = 96.0
RECOVERED: database01 - disk_monitor./.percent = 75.0 (WARNING -> OK)
REMINDER (CRITICAL): mailserver - cpu_monitor.load_1min = 12.5 (ongoing for 3600s)
Supported Metrics
All plugin metrics can be thresholded:
- CPU: cpu_percent, load_1min, load_5min, load_15min
- Memory: percent, available_mb, swap_percent
- Disk: Per-partition percent, free_gb, free_mb
- Network: errors_total, dropped packets, connection counts
- Nagios: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
Per-Host Threshold Profiles
Named threshold configurations let different hosts use different limits. A host's threshold_config can be a single name or a list — lists are applied left-to-right so profiles compose without duplication:
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}
  tight_cpu:  # override CPU limits only
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}
  db_disk:    # add a database partition check
    thresholds:
      disk_monitor:
        partitions:
          /var/lib/postgresql:
            percent: {warning: 75, critical: 88}
hosts:
  web-01:
    threshold_config: default               # single profile
  db-01:
    threshold_config: [tight_cpu, db_disk]  # layered: CPU override + extra disk check
Each named config's overrides are applied in order on top of the defaults. Metrics not mentioned in a profile are inherited unchanged.
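One way to picture the layering is a recursive dictionary merge applied left to right on top of the default profile. The sketch below assumes that merge semantics; resolve_thresholds and deep_merge are illustrative names, and how hbd merges internally (for example, whether a metric's warning/critical pair is replaced as a unit) is not specified here.

```python
import copy
from typing import Dict, List, Union


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with `override` layered on top of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_thresholds(threshold_configs: Dict[str, dict],
                       profile: Union[str, List[str]]) -> dict:
    """Apply the named profiles in order, starting from the defaults."""
    names = [profile] if isinstance(profile, str) else list(profile)
    result = copy.deepcopy(threshold_configs.get("default", {}).get("thresholds", {}))
    for name in names:
        result = deep_merge(result, threshold_configs[name]["thresholds"])
    return result


# The configuration from the example above, as plain Python data.
CONFIGS = {
    "default": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
        "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
    }},
    "tight_cpu": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}},
    }},
    "db_disk": {"thresholds": {
        "disk_monitor": {"partitions": {
            "/var/lib/postgresql": {"percent": {"warning": 75, "critical": 88}},
        }},
    }},
}

db01 = resolve_thresholds(CONFIGS, ["tight_cpu", "db_disk"])
```

Here db01 ends up with the tightened CPU limits, the inherited memory limits, and the extra partition check.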
See docs/THRESHOLD_ALERTING.md for comprehensive documentation including best practices, troubleshooting, and advanced configuration.
👥 User Management
Heartbeat supports optional user accounts with role-based access control per host.
Roles
- monitor — view status, plugin data, alerts
- manager — monitor + queue commands, trigger DNS, queue upgrades
- owner — manager + drop host, transfer ownership, update access
- admin (user flag) — owner-level access on every host
When no users are configured the server runs in unauthenticated mode — all existing behaviour is unchanged.
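Because each role includes the one below it, an access check reduces to comparing ranks. The following sketch assumes the host keys owner, managers, and monitors and the user flag admin from the Quick setup configuration; the function names are hypothetical and the default_owner fallback is not modelled.

```python
from typing import Optional

# Higher rank includes every permission of the lower ranks.
ROLE_RANK = {"monitor": 1, "manager": 2, "owner": 3}


def effective_role(username: str, user: dict, host: dict) -> Optional[str]:
    """Return the strongest role `username` holds on `host`, or None."""
    if user.get("admin"):
        return "owner"  # admins get owner-level access on every host
    if host.get("owner") == username:
        return "owner"
    if username in host.get("managers", []):
        return "manager"
    if username in host.get("monitors", []):
        return "monitor"
    return None


def is_allowed(username: str, user: dict, host: dict, required: str) -> bool:
    """True when the user's role on the host is at least `required`."""
    role = effective_role(username, user, host)
    return role is not None and ROLE_RANK[role] >= ROLE_RANK[required]
```

For instance, a user listed under managers passes a check for "monitor" or "manager" but not for "owner".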
Quick setup
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...  # hbd passwd alice
    admin: true
default_owner: alice
hosts:
  webserver01:
    owner: alice
    managers: [bob]
    monitors: [carol]
# Generate a password hash
hbd passwd alice
Browser users are redirected to /login automatically. The session cookie is set on login, so fetch() calls from dashboards work without any JavaScript changes.
See docs/USERS.md for complete user management documentation.
🌐 HTTP API & Web UI
Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST API and web-based dashboards for monitoring and visualization.
Features
- User auth: Optional session-based authentication with per-host role enforcement
- REST API: JSON endpoints for accessing plugin data, alerts, host information, and user management
- Live Dashboard: Real-time WebSocket-powered host status view
- Plugin Metrics: Interactive visualization of all plugin data with auto-refresh
- Alerts Dashboard: Comprehensive alert monitoring with filtering and summaries
Web Dashboards
- Login (/login): Browser login form (shown automatically when auth is configured)
- Live View (/live): Real-time host connectivity, latency, and messages; hostnames link directly to the Host Overview page
- Host Overview (/plugins/<host>): Per-host plugin metrics with ZFS pool visualization; filtered to hosts where the logged-in user is owner or manager (admins see all)
- Alerts Dashboard (/alerts): Monitor active alerts with severity filtering; alert count pie chart shown in the navigation bar
- Settings (/settings): Server configuration, user management, and threshold configuration viewer
API Endpoints
# Log in (when auth is configured)
TOKEN=$(curl -s -X POST http://localhost:50004/api/0/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"alice","password":"secret"}' | jq -r .token)
# Keep the header in a bash array so the quoted value survives word splitting
AUTH=(-H "Authorization: Bearer $TOKEN")
# List all monitored hosts
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts
# Get all plugin data for a host
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts/webserver01/plugins
# Get detailed plugin history (last 50 samples)
curl "${AUTH[@]}" "http://localhost:50004/api/0/hosts/webserver01/plugins/cpu_monitor?limit=50"
# Get alert states for a specific host
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts/webserver01/alerts
# Get all active alerts across all hosts
curl "${AUTH[@]}" http://localhost:50004/api/0/alerts
# View/update host access roles
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts/webserver01/access
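The same calls can be made from Python with only the standard library. This sketch reuses the URLs, the JSON login body, the token field, and the Bearer header from the curl examples; the helper names are invented here, and response shapes beyond the token field are documented in docs/HTTP_API.md rather than assumed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:50004/api/0"


def login_request(username: str, password: str) -> urllib.request.Request:
    """Build the POST request for /auth/login."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def api_request(path: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for an API path such as /hosts."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_json(request: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage against a running server:
#
#   token = fetch_json(login_request("alice", "secret"))["token"]
#   hosts = fetch_json(api_request("/hosts", token))
#   alerts = fetch_json(api_request("/hosts/webserver01/alerts", token))
```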
See docs/HTTP_API.md for complete API documentation including response formats, error handling, and integration examples.
⚙️ Quickstart
Prerequisites:
- Python 3.11+ (project uses language features from recent Python)
- nsupdate (for DNS updates) if using dynamic DNS
Install dependencies (recommended into a venv):
This project declares its dependencies in pyproject.toml. Instead
of the old requirements.txt flow, install the package into a virtualenv
using pip (for example, pip install . from the project root).
See scripts/hb_install.sh for one way to install.
Run the daemon (example):
# run with default config lookup (~/.hb.yaml)
hbd -c .hb.yaml -f -v
You can also run it directly via the package entrypoint after installation:
python -m hbd.server.cli -c /path/to/config.yaml
Running the Client
The heartbeat client (hbc) sends periodic heartbeats and plugin data to the server:
# Basic usage pointing to server (host is a positional argument)
hbc your-server.example.com
# Run as daemon with a config file
hbc -d -c /etc/hbc.yaml your-server.example.com
# Send a one-off boot message
hbc --boot your-server.example.com
# Verbose output
hbc -v your-server.example.com
You can also run it via the module entrypoint:
python -m hbd.client.main your-server.example.com
Client configuration can also be specified in YAML:
server: hbd.example.com
port: 50003
interval: 30
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes (default)
    per_core: true
  memory_monitor:
    interval: 300  # Check every 5 minutes (default)
  disk_monitor:
    interval: 300  # Check every 5 minutes (default)
  network_monitor:
    interval: 300  # Check every 5 minutes (default)
  nagios_runner:
    interval: 300  # Check every 5 minutes (default)
    commands:
      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
Connection retry: If a server is temporarily unreachable, hbc retries open() indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
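The retry rule can be modelled as a small piece of per-connection state. This is a sketch of the policy as described, not the hbc implementation: the class and method names are assumptions, and "early startup" is simplified to "has never connected successfully".

```python
from dataclasses import dataclass


@dataclass
class ConnectionState:
    """Tracks retry eligibility for one server address (hypothetical)."""

    family: str  # "ipv4" or "ipv6"
    ever_connected: bool = False
    consecutive_failures: int = 0
    dropped: bool = False

    def record_success(self) -> None:
        self.ever_connected = True
        self.consecutive_failures = 0

    def record_failure(self, max_ipv6_failures: int = 3) -> None:
        self.consecutive_failures += 1
        # Only IPv6 connections that have never worked are given up on,
        # which covers hosts without IPv6 routing.
        if (self.family == "ipv6"
                and not self.ever_connected
                and self.consecutive_failures >= max_ipv6_failures):
            self.dropped = True

    def should_retry(self) -> bool:
        """Called once per heartbeat interval before attempting open()."""
        return not self.dropped
```

An IPv6 connection that succeeded at least once keeps retrying indefinitely, just like IPv4.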
Daemon logging: When running with -d, hbc routes all log output to syslog (LOG_DAEMON facility) after daemonizing. Without -d, logs go to stderr as usual.
hbc_mini — single-file client (no external dependencies)
scripts/hbc_mini.py is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly — no virtualenv, no pip install.
# Basic usage
python3 hbc_mini.py your-server.example.com
# Run as daemon
python3 hbc_mini.py -d your-server.example.com
# Send a boot message
python3 hbc_mini.py -b your-server.example.com
# Send a one-off message
python3 hbc_mini.py -m "maintenance starting" your-server.example.com
Config: ~/.hbc.json (same keys as ~/.hbc.yaml, JSON format). Example:
{
  "hb_port": 50003,
  "interval": 30,
  "plugins": {
    "ping_monitor": {
      "interval": 60,
      "hosts": ["8.8.8.8", "192.168.1.1"]
    },
    "nagios_runner": {
      "interval": 300,
      "commands": [
        {"name": "check_load", "command": "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6"}
      ]
    }
  }
}
Plugin availability:
| Plugin | Platform | Data source |
|---|---|---|
| os_info | all | platform stdlib |
| ping_monitor | all | ping subprocess |
| nagios_runner | all (not Windows) | subprocess |
| cpu_monitor | Linux | /proc/stat |
| memory_monitor | Linux | /proc/meminfo |
| disk_monitor | Linux, macOS, BSD | df -P subprocess |
| network_monitor | Linux | /proc/net/dev |
What is not available compared to the full hbc:
- No YAML config (use JSON instead)
- No filesystem_info plugin
- No zfs_monitor plugin (requires zpool(8) and the full plugin loader)
- cpu_monitor does not report per-core usage or CPU frequency (no psutil)
- Plugins cannot be loaded from external .py files; all plugins are compiled in
- No IPv6 early-fail protection: connections that fail to open at startup are silently skipped rather than retried
Everything else — heartbeat protocol, ACK/CMD/UPD handling, hb_install.sh-based self-update, daemonize, syslog — is identical to the full client.
🐞 Debugging in VS Code
This repository includes a ready-to-use .vscode/launch.json with configurations to run or attach the VS Code debugger to hbd.
- Ensure the Python extension is installed and select the project .venv as the interpreter (bottom-left of VS Code).
- Use F5 and pick one of these configurations from the Run view:
  - Python: Run hbd (module): runs hbd.server.cli as a module and sets PYTHONPATH to the workspace root (recommended).
  - Python: Run hbd with debugpy (listen): launches debugpy and hbd together; useful when you want the process to listen for a debugger.
  - Python: Attach (localhost:5678): attach the debugger to a running process started with debugpy.
To start hbd manually and wait for the debugger to attach, run:
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.server.cli -c .hb.yaml -f -v
Set breakpoints in modules such as hbd/server/udp.py, hbd/server/dns.py, or hbd/server/main.py, and use the Attach configuration to connect. Use justMyCode: false if you need to step into third-party code.
🛠 Configuration
hbd reads YAML configuration (optional). If PyYAML is not installed, built-in defaults are used. Example configuration keys (see hbd/server/config.py):
- hb_port: UDP port to listen for heartbeats (default: 50003)
- hbd_port: internal control port (default: 50004)
- hbd_host: bind address for HTTP/WSS
- pickfile: path for persisted state
- logfile: path to log file
- pushsrv: push service (pushover|mattermost|all)
- interval / grace: heartbeat timing configuration
- dyndomains: list of dyndomains to update via nsupdate
- nsupdate_bin: path to nsupdate binary
- ws_port: port for plain WebSocket connections (default: 50005)
- wss_port: port for secure WebSocket (WSS) connections (default: none). If set, hbd will attempt to serve WSS on this port when wss_pem and wss_key SSL files are available under cert_path (see below).
- cert_path: directory where TLS certificate and key are looked up (default: /usr/local/etc/ssl/)
- wss_pem: filename for the certificate chain (default: fullchain.pem)
- wss_key: filename for the private key (default: privkey.pem)
- users: mapping of username → user attributes (full_name, avatar, password, admin, notification_channels)
- default_owner: username that owns hosts with no explicit owner (falls back to first admin user)
Example .hb.yaml (minimal):
hbd_host: 0.0.0.0
hbd_port: 50004
dyndomains:
- example.com
nsupdate_bin: /usr/bin/nsupdate
pushsrv: pushover
Tip: SERVER_DEFAULTS in hbd/server/config.py contains the canonical defaults and accepted configuration keys.
🔧 Architecture & Modules
The package is organized into three subpackages:
hbd.common — shared code used by both client and server:
- hbd.common.proto: serialization/deserialization of heartbeat messages (supports compressed payloads and plugin data)
- hbd.common.utils: small utility helpers (shortname, dur, initlog)
hbd.server — the heartbeat daemon (hbd):
- hbd.server.cli: CLI entrypoint and argument parsing
- hbd.server.main: async orchestration to run UDP/HTTP/WSS components
- hbd.server.udp: UDP parsing and handle_datagram implementation (main state machine)
- hbd.server.dns: create_nsupdate_payload, nsupdate, and an asyncio DNS worker (start_dns_worker). The DNS worker runs as an asyncio task and the package exposes a small thread-safe bridge so legacy synchronous code can put() updates into the queue.
- hbd.server.notify: email and push notification helpers
- hbd.server.ws: WebSocket server and thread-safe broadcast helpers
- hbd.server.http: HTTP handler factory for the status UI/API
- hbd.server.journal: message journal with size-based log rotation and backup management
- hbd.server.threshold: threshold alerting engine
- hbd.server.monitor: host state monitoring
- hbd.server.hbdclass: Host class and shared server state
- hbd.server.config: configuration loader and defaults
hbd.client — the heartbeat client (hbc):
- hbd.client.main: client entrypoint; sends heartbeats and plugin data to the server
- hbd.client.plugin: plugin framework with base classes, registry, and dynamic loader
- hbd.client.plugins/: built-in plugins (os_info, cpu_monitor, memory_monitor, disk_monitor, network_monitor, filesystem_info, nagios_runner)
- hbd.client.config: client configuration loader
This modular layout makes the code easier to test and maintain.
Runtime & Shutdown
- The main runtime is asyncio-based. Services (UDP listener, HTTP server, WebSocket server, monitor, and DNS worker) run as asyncio tasks.
- On SIGINT/SIGTERM the server triggers a graceful shutdown: it cancels active tasks, signals the DNS worker via a sentinel, and cleans up resources before exit.
- The DNS update worker is implemented as an asyncio task; synchronous producers can still enqueue DNS updates via a small thread-safe bridge available at hbd.server.hbdclass.Host.dnsQ.
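The worker, sentinel, and thread-safe bridge pattern described above looks roughly like the following. This is a generic asyncio sketch, not the code behind Host.dnsQ: all names are hypothetical, and the worker records items in a list where a real one would run nsupdate.

```python
import asyncio
import threading

_SENTINEL = object()  # tells the worker to stop


async def dns_worker(queue: asyncio.Queue, handled: list) -> None:
    """Consume queued updates until the sentinel arrives."""
    while True:
        item = await queue.get()
        if item is _SENTINEL:
            break
        handled.append(item)  # a real worker would invoke nsupdate here


def enqueue_from_thread(loop: asyncio.AbstractEventLoop,
                        queue: asyncio.Queue, item: object) -> None:
    """Bridge for synchronous code: asyncio.Queue is not thread-safe,
    so the put is scheduled on the event loop's own thread."""
    loop.call_soon_threadsafe(queue.put_nowait, item)


async def main() -> list:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    handled: list = []
    worker = asyncio.create_task(dns_worker(queue, handled))

    # A legacy synchronous producer running in its own thread.
    producer = threading.Thread(
        target=enqueue_from_thread, args=(loop, queue, "host1.example.com")
    )
    producer.start()
    await asyncio.to_thread(producer.join)

    # Graceful shutdown: the sentinel lets queued work drain first.
    await queue.put(_SENTINEL)
    await worker
    return handled


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Using a sentinel instead of cancelling the task means updates that were already queued are still processed before the worker exits.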
Templates & Static Files
- Template files are located under hbd/server/templates. The HTTP server resolves templates relative to the hbd.server package but the path can be overridden with the templates_dir config key.
- Static assets (CSS/JS/images) are served from hbd/server/static via the /static/<path> HTTP route.
🧪 Testing & Dev
Tests are implemented with unittest; some additional tests rely on pytest. To run the tests locally without installing anything beyond the dev requirements:
# with project root on PYTHONPATH
PYTHONPATH=. python -m unittest discover -v
# or with pytest if installed
pytest -q
Developer tooling included:
- pyproject.toml: project metadata and dependencies
- tox.ini: convenience wrappers for running tests, lint, and mypy
To run linters and type checks locally:
# after installing dev deps
tox -e lint
tox -e mypy
🚀 Running in production
- Use your system service manager (systemd, launchd, etc.) to run hbd in the background.
- Ensure nsupdate and necessary credentials are available for dynamic DNS updates.
- Configure TLS for WSS if you enable secure websockets.
Note: The project previously included a small example for obtaining DNS-verified certs (certbot with RFC2136); see the earlier commit history.
🤝 Contributing
Contributions welcome! Please:
- Open an issue to discuss larger changes.
- Create a topic branch and a clear PR.
- Add tests for new features and run linters.
- Keep changes focused and documented.
📜 License
This repository is licensed under the MIT license. See LICENSE for details.