# Heartbeat Daemon (hbd) ✅

A lightweight daemon that listens for UDP heartbeat messages and acts on them: it keeps host state, optionally updates DNS records via `nsupdate`, forwards messages to WebSocket clients, and sends notifications (email, Pushover, Mattermost, Signal). It is a refactor of a previously monolithic script into a modular Python package (`hbd`).

---

## 📌 Features

- Receive and parse heartbeat datagrams (text or zlib-compressed) ✅
- Maintain host state and detect up/down transitions ✅
- Queue DNS updates via `nsupdate` and run them in a background worker ✅
- WebSocket API for live updates (hosts & messages) ✅
- Notification pipeline (email, Pushover, Mattermost, Signal) ✅
- **User management & access control** ✅
  - Optional user accounts with salted PBKDF2-SHA256 password hashing (stdlib only)
  - Per-host roles: owner, manager, monitor
  - Session-based auth with cookie support (browser login page included)
  - Backwards compatible: no auth required when no users are configured
- **HTTP API & Web UI** ✅
  - REST API for plugin data, alerts, host information, and user management
  - Live dashboard with WebSocket updates
  - Interactive plugin metrics visualization
  - Alerts dashboard with filtering and summaries
- **Message journal with automatic log rotation** ✅
  - Logs all received messages in JSON format
  - Size-based automatic rotation
  - Configurable retention and backup management
- **Plugin system for extensible monitoring** ✅
  - Collect system metrics (CPU, memory, disk, network)
  - Monitor ZFS pool health, capacity, and I/O via `zpool(8)`
  - Execute existing Nagios monitoring plugins
  - Create custom plugins with simple Python classes
- **Threshold alerting system** ✅
  - Monitor metrics against configurable WARNING/CRITICAL thresholds
  - Hysteresis to prevent alert flapping
  - Automatic notifications on state changes
  - Re-notification for ongoing alerts
- **Per-host watch flag**: set `watch: false` on any host to silence all notifications for that host without removing its configuration ✅
- **Role-filtered dashboards**: Live Dashboard and Host Overview show only hosts where the logged-in user is owner or manager (admins see all) ✅
- Modular codebase suitable for unit testing and CI ✅

---

## 🔌 Plugin System

Heartbeat includes a comprehensive plugin architecture that extends monitoring beyond simple heartbeats. The plugin system allows you to:

- **Collect system information**: OS details, hardware info, system configuration
- **Monitor resources**: CPU usage, memory, disk space, network statistics
- **Run Nagios plugins**: Execute thousands of existing Nagios monitoring plugins without modification
- **Create custom plugins**: Build your own monitoring logic with simple Python classes

### Plugin Types

- **InfoPlugin**: Collects static information once (e.g., OS version, hardware specs)
- **MonitorPlugin**: Collects metrics periodically (e.g., CPU usage every 30 seconds)

### Built-in Plugins

- `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, process counts, and uptime
- `memory_monitor`: Monitors RAM and swap usage, available memory (ZFS ARC-aware)
- `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
- `ping_monitor`: Measures round-trip latency to configured hosts
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
- `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`

### Nagios Integration

The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:

- Executes plugins asynchronously (non-blocking) with timeout protection
- Captures both stdout and stderr; if stdout is empty, stderr is used as the status message
- Handles signal-killed processes (negative exit code → UNKNOWN status)
- Validates absolute command paths at startup and warns on missing or non-executable files
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
- Extracts performance data with thresholds
- Reports per-check status, exit code, and output; no aggregate rollup field

See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for complete integration guide including configuration examples and custom plugin development.

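
To make the exit-code and performance-data handling listed above more concrete, here is a minimal sketch of how one plugin run could be interpreted. It follows the general Nagios plugin conventions (exit codes 0 to 3, performance data after a `|` on the first output line). The function name `parse_nagios_result` and the keys of the returned dictionary are assumptions made for illustration; this is not the actual `nagios_runner` code, and it ignores quoted labels that contain spaces.

```python
import re

# Hypothetical helper, not part of hbd: shown only to illustrate the conventions.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_nagios_result(exit_code, stdout, stderr=""):
    """Interpret one plugin run: status name, status text, and perfdata."""
    # Negative codes (killed by a signal) and anything outside 0-3 map to UNKNOWN.
    status = STATUS_NAMES.get(exit_code, "UNKNOWN")
    # Fall back to stderr when the plugin printed nothing on stdout.
    text = stdout.strip() or stderr.strip()
    first_line = text.splitlines()[0] if text else ""
    message, _, perf_part = first_line.partition("|")

    perfdata = {}
    for item in perf_part.split():
        label, sep, rest = item.partition("=")
        # Each item is label=value[unit];warn;crit;min;max; only value is required.
        fields = rest.split(";")
        match = _NUMBER.match(fields[0])
        if not sep or match is None:
            continue
        perfdata[label.strip("'")] = {
            "value": float(match.group()),
            "warn": fields[1] if len(fields) > 1 else "",
            "crit": fields[2] if len(fields) > 2 else "",
        }
    return {
        "status": status,
        "exit_code": exit_code,
        "output": message.strip(),
        "perfdata": perfdata,
    }
```

Keeping the warning and critical fields as strings avoids guessing at range syntax; a fuller parser would need to handle Nagios threshold ranges as well.
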
### Creating Custom Plugins

```python
import shutil
import time

from hbd.client.plugin import MonitorPlugin


def get_disk_usage(path="/"):
    """Illustrative helper (not part of hbd): used space on `path` in percent."""
    usage = shutil.disk_usage(path)
    return round(usage.used / usage.total * 100, 1)


class DiskMonitorPlugin(MonitorPlugin):
    name = "disk_monitor"
    interval = 60  # Run every 60 seconds

    async def collect(self):
        # Return a plain dict; it is sent to the server as this plugin's data
        return {
            "disk_usage": get_disk_usage(),
            "timestamp": time.time(),
        }
```

Place plugins in `hbd/client/plugins/` and they'll be automatically discovered and loaded by the client.

---

## 📝 Message Journal

Heartbeat includes a message journal that logs all received messages with automatic rotation.

### Features

- **JSON Format**: All messages logged in JSONL (JSON Lines) format for easy parsing
- **Automatic Rotation**: Size-based rotation with configurable thresholds
- **Backup Management**: Keeps configurable number of rotated log files
- **Non-blocking**: Async logging with minimal performance impact

### Configuration

```yaml
# Message journal settings
journal_enabled: true              # Enable/disable journaling
journal_dir: /var/log/heartbeat    # Journal directory
journal_file: messages.journal     # Base filename
journal_max_size: 104857600        # Max size (100MB default)
journal_max_backups: 10            # Number of backups to keep
```

### Example Journal Entry

```json
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver1","interval":30}}
```

### Analyzing Journal Files

```bash
# View recent messages
tail -100 /var/log/heartbeat/messages.journal | jq .

# Count messages by type
jq -r '.message.ID' /var/log/heartbeat/messages.journal | sort | uniq -c

# Filter by hostname
jq 'select(.message.name == "webserver1")' /var/log/heartbeat/messages.journal
```

See [docs/MESSAGE_JOURNAL.md](docs/MESSAGE_JOURNAL.md) for complete documentation including rotation behavior, integration with log management systems, and analysis examples.

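
Where `jq` is not available, the same per-type count can be done with the Python standard library. The sketch below assumes only the entry layout shown in the example journal entry (a JSON object per line with a `message.ID` field); the function name `count_message_ids` is made up for this example.

```python
import json
from collections import Counter


def count_message_ids(lines):
    """Count journal entries per message ID, skipping blank or malformed lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # for example, a partially written last line
        message = entry.get("message") if isinstance(entry, dict) else None
        if isinstance(message, dict):
            counts[message.get("ID", "?")] += 1
    return counts


# Usage (path taken from the configuration example above):
#   with open("/var/log/heartbeat/messages.journal", encoding="utf-8") as journal:
#       for message_id, count in count_message_ids(journal).most_common():
#           print(count, message_id)
```

The function takes any iterable of lines, so it works on an open file as well as on lines read from a rotated backup.
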
---

## 🚨 Threshold Alerting

Heartbeat includes a sophisticated threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured limits.

### Features

- **Multi-level alerts**: WARNING and CRITICAL severity levels
- **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
- **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
- **Smart notifications**: Alerts only on state changes, not every check; de-escalations (e.g. CRITICAL → WARNING) do not generate a notification
- **Re-notifications**: Periodic reminders for ongoing alerts
- **Short-duration suppression**: Recovery notifications are suppressed for down events under 4 seconds (avoids noise from transient blips)
- **Journal integration**: All threshold events logged for audit trail
- **`ping_monitor` thresholds**: Latency and packet-loss thresholds use the same format as all other plugin metrics

### Configuration

```yaml
thresholds:
  # RTT (Round-Trip Time) thresholds for heartbeat monitoring
  # These are checked on every HTB message arrival
  rtt:
    webserver01:
      warning: 100.0    # Warn when RTT > 100ms
      critical: 500.0   # Critical when RTT > 500ms

    database01:
      warning: 50.0
      critical: 200.0

  # Plugin metric thresholds
  cpu_monitor:
    cpu_percent:
      warning: 80.0     # Warn when CPU > 80%
      critical: 90.0    # Critical when CPU > 90%
      operator: ">"
      hysteresis: 0.02  # 2% hysteresis to prevent flapping
      display: "(threshold: {op_symbol} {threshold_value}%)"  # optional

  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0

  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0   # Alert when < 10GB free
          critical: 5.0
          operator: "<"   # Inverse threshold

# Global settings
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```

### RTT Monitoring

Heartbeat monitors network latency (Round-Trip Time) for each host's heartbeat messages. RTT thresholds are **fully integrated with the threshold alerting system**:

- **Per-host configuration**: Set different thresholds for each monitored host
- **Real-time checking**: Thresholds evaluated on every HTB message arrival
- **Alert state tracking**: RTT alerts use the same state management as plugin metrics
- **Hysteresis support**: Configurable hysteresis prevents rapid state transitions
- **Alerts dashboard**: RTT alerts visible on the `/alerts` web page alongside plugin alerts
- **Smart notifications**: Only triggers on state changes (OK → WARNING → CRITICAL)
- **Re-notification**: Periodic reminders for ongoing RTT issues
- **Event & journal logging**: All RTT events logged for audit trail

**Configuration format:**
```yaml
thresholds:
  rtt:
    <hostname>:
      warning: <milliseconds>   # Warn when RTT > this value
      critical: <milliseconds>  # Critical when RTT > this value
      hysteresis: 0.02          # Optional: 2% hysteresis (default)
```

**Example alerts:**
```
WARNING: webserver01 - rtt.webserver01 = 125.3
CRITICAL: database01 - rtt.database01 = 520.1
RECOVERED: webserver01 - rtt.webserver01 = 45.2 (WARNING -> OK)
```

RTT alerts appear on the Alerts dashboard and can be filtered by severity level. The `metric_path` format is `rtt.<hostname>`, making it easy to distinguish from plugin metrics.

### Alert Behavior

1. **State Changes**: Notifications sent when crossing thresholds
   - OK → WARNING: Early notification
   - WARNING → CRITICAL: Escalation
   - CRITICAL → OK: Recovery

2. **Hysteresis**: Prevents rapid state transitions
   ```
   Critical threshold: 90%
   Hysteresis: 10%
   Recovery threshold: 81% (90 - 10% of 90)

   Value 91% → CRITICAL (threshold crossed)
   Value 85% → CRITICAL (still above 81%)
   Value 79% → OK (below recovery threshold)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   - Default: Every 60 minutes
   - Configurable via `threshold_renotify_interval`

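
The hysteresis rule in step 2 can be sketched in a few lines of Python. This is a simplified illustration for the `>` operator only, using the relative recovery threshold from the example (`threshold * (1 - hysteresis)`); the function names and the optional `warning` argument are assumptions for this sketch, not the server's actual code.

```python
def _active(value, threshold, hysteresis, was_active):
    """True if `value` counts as above `threshold` for a ">" comparison.

    A level that is already active only clears once the value drops to or
    below the recovery threshold, threshold * (1 - hysteresis).
    """
    limit = threshold * (1.0 - hysteresis) if was_active else threshold
    return value > limit


def next_state(state, value, critical, warning=None, hysteresis=0.02):
    """Return the new alert state ("OK", "WARNING" or "CRITICAL")."""
    if _active(value, critical, hysteresis, state == "CRITICAL"):
        return "CRITICAL"
    still_elevated = state in ("WARNING", "CRITICAL")
    if warning is not None and _active(value, warning, hysteresis, still_elevated):
        return "WARNING"
    return "OK"


# Replay the three values from the hysteresis example above.
state = "OK"
for value in (91, 85, 79):
    state = next_state(state, value, critical=90, hysteresis=0.10)
    print(f"{value}% -> {state}")
```

Without hysteresis, a value hovering around 90% would flip between states on every sample; widening the gap between the trigger and recovery limits is what suppresses that flapping.
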
### Example Notifications

```
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
CRITICAL: webserver01 - memory_monitor.percent = 96.0
RECOVERED: database01 - disk_monitor./.percent = 75.0 (WARNING -> OK)
REMINDER (CRITICAL): mailserver - cpu_monitor.load_1min = 12.5 (ongoing for 3600s)
```

### Supported Metrics

All plugin metrics can be thresholded:

- **CPU**: cpu_percent, load_1min, load_5min, load_15min
- **Memory**: percent, available_mb, swap_percent
- **Disk**: Per-partition percent, free_gb, free_mb
- **Network**: errors_total, dropped packets, connection counts
- **Nagios**: Any field emitted by `nagios_runner` (`<name>_status_code`, `<name>_status`, `<name>_output`, performance data fields)

### Display Format Templates

Each threshold entry accepts an optional `display` field, a Python format string shown in notifications and on the Alerts dashboard:

```yaml
nagios_runner:
  status_code:
    warning: 1
    critical: 2
    operator: ">="
    display: "{check_name}: exit {value} (expected < {threshold_value})"
```

Available variables:

| Variable | Description |
|---|---|
| `{value}` | Current metric value |
| `{threshold_value}` | Threshold that was crossed |
| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …); `"nagios"` for the nagios operator |
| `{check_name}` | Prefix stripped by generic matching (see below) |
| `{metric_name}` | Full field name within the plugin data |
| `{output}` | For `nagios_runner` generic matches: the matched check's status text (alias for `{check_name}_output`) |
| `{status}` | For `nagios_runner` generic matches: the matched check's status name, one of OK/WARNING/CRITICAL/UNKNOWN (alias for `{check_name}_status`) |
| any plugin field | Any other field present in the plugin's data |

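
Because `display` is an ordinary Python format string, its behaviour can be tried out with `str.format`. The sketch below is illustrative: the helper name `render_display` and the fallback to the raw template on a missing variable are assumptions, not a description of how the server handles errors.

```python
def render_display(template, variables):
    """Fill a display template with the variables from the table above."""
    try:
        return template.format(**variables)
    except (KeyError, IndexError, ValueError):
        # Assumed behaviour for this sketch: show the raw template rather than fail.
        return template


variables = {
    "check_name": "check_load",
    "metric_name": "check_load_status_code",
    "value": 2,
    "threshold_value": 1,
    "op_symbol": ">=",
}
print(render_display("{check_name}: exit {value} (expected < {threshold_value})", variables))
print(render_display("(threshold: {op_symbol} {threshold_value}%)", variables))
```

Literal braces in a template have to be doubled (`{{` and `}}`), as with any Python format string.
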
### Generic Threshold Matching

When a metric name has no exact threshold entry, the server progressively strips leading underscore-separated segments and re-tries the lookup. This lets a single generic entry cover an entire family of metrics.

The classic use case is `nagios_runner`, which names each metric after the command that produced it:

```
nagios_runner.check_disk_root_status_code  → no exact match
nagios_runner.disk_root_status_code        → no match
nagios_runner.root_status_code             → no match
nagios_runner.status_code                  → matched ✓
```

Configure the generic threshold once using the `nagios` operator, which maps exit codes directly to alert severity without requiring numeric warning/critical values:

```yaml
nagios_runner:
  status_code:
    operator: "nagios"   # 0=OK  1=WARNING  2=CRITICAL  3=UNKNOWN
    display: "{check_name}: {output}"
```

The stripped prefix (`check_disk_root` in the example above) is available as `{check_name}` in the display template, so you can identify which check triggered the alert without writing a separate threshold entry per command.

Exact matches always take priority. A generic entry only applies when no specific one is defined.

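
The lookup described above can be sketched as a short loop. The function name `find_threshold` and its return shape are hypothetical; the sketch only mirrors the documented behaviour (exact match first, then strip one leading underscore-separated segment at a time).

```python
def find_threshold(plugin_thresholds, metric_name):
    """Return (entry, check_name) for a metric, or (None, "") if nothing matches.

    `plugin_thresholds` maps metric names to threshold entries for one plugin.
    `check_name` is the prefix that had to be stripped ("" for an exact match).
    """
    candidate = metric_name
    while True:
        if candidate in plugin_thresholds:
            prefix = metric_name[: len(metric_name) - len(candidate)]
            return plugin_thresholds[candidate], prefix.rstrip("_")
        # Drop the leading segment up to and including the first underscore.
        _, sep, rest = candidate.partition("_")
        if not sep:
            return None, ""
        candidate = rest


thresholds = {"status_code": {"operator": "nagios"}}
entry, check_name = find_threshold(thresholds, "check_disk_root_status_code")
print(check_name, entry)
```

Because the exact name is tried before anything is stripped, a specific entry such as `check_load_status_code` always wins over the generic `status_code` entry.
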
### Per-Host Threshold Profiles

Named threshold configurations let different hosts use different limits. A host's `threshold_config` can be a single name or a **list**; lists are applied left-to-right so profiles compose without duplication:

```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        percent: {warning: 85, critical: 95}

  tight_cpu:                          # override CPU limits only
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

  db_disk:                            # add a database partition check
    thresholds:
      disk_monitor:
        partitions:
          /var/lib/postgresql:
            percent: {warning: 75, critical: 88}

hosts:
  web-01:
    threshold_config: default         # single profile

  db-01:
    threshold_config: [tight_cpu, db_disk]   # layered: CPU override + extra disk check
```

Each named config's overrides are applied in order on top of the defaults. Metrics not mentioned in a profile are inherited unchanged.

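
One way to picture the layering is a recursive dictionary merge applied left-to-right. The sketch below is an assumption about the merge granularity (individual keys such as `warning` are overridden while sibling keys are kept); the function names are hypothetical and the real server may merge at a coarser level.

```python
import copy


def merge_thresholds(base, override):
    """Recursively overlay `override` on `base` without mutating either."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_thresholds(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_profiles(threshold_configs, names, defaults=None):
    """Apply one profile name or a list of names, in order, on top of `defaults`."""
    if isinstance(names, str):
        names = [names]
    result = copy.deepcopy(defaults or {})
    for name in names:
        result = merge_thresholds(result, threshold_configs[name]["thresholds"])
    return result


THRESHOLD_CONFIGS = {
    "tight_cpu": {"thresholds": {"cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}}}},
    "db_disk": {"thresholds": {"disk_monitor": {"partitions": {
        "/var/lib/postgresql": {"percent": {"warning": 75, "critical": 88}}}}}},
}
DEFAULTS = {
    "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
    "memory_monitor": {"percent": {"warning": 85, "critical": 95}},
}
print(resolve_profiles(THRESHOLD_CONFIGS, ["tight_cpu", "db_disk"], DEFAULTS))
```

With this approach the `db-01` host from the example ends up with the tighter CPU limits, the extra PostgreSQL partition check, and the inherited memory thresholds.
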
See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.

---

## 👥 User Management

Heartbeat supports optional user accounts with role-based access control per host.

### Roles

- **monitor**: view status, plugin data, alerts
- **manager**: monitor + queue commands, trigger DNS, queue upgrades
- **owner**: manager + drop host, transfer ownership, update access
- **admin** (user flag): owner-level access on every host

When no users are configured the server runs in **unauthenticated mode**; all existing behaviour is unchanged.

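
Since each role includes the one below it, a permission check reduces to comparing role levels. The sketch below is illustrative only: the function names are hypothetical, and the dictionaries mirror the `owner`, `managers`, `monitors`, and `admin` keys used in this README's configuration examples. It does not cover the `default_owner` fallback or unauthenticated mode.

```python
ROLE_LEVELS = {"monitor": 1, "manager": 2, "owner": 3}


def effective_role(username, user, host):
    """Return the strongest role `username` holds on `host`, or None."""
    if user.get("admin"):
        return "owner"  # admins get owner-level access on every host
    if host.get("owner") == username:
        return "owner"
    if username in host.get("managers", []):
        return "manager"
    if username in host.get("monitors", []):
        return "monitor"
    return None


def is_allowed(username, user, host, required):
    """True if the user's role on the host is at least `required`."""
    role = effective_role(username, user, host)
    return role is not None and ROLE_LEVELS[role] >= ROLE_LEVELS[required]


host = {"owner": "alice", "managers": ["bob"], "monitors": ["carol"]}
print(is_allowed("bob", {}, host, "manager"))   # bob may queue commands
print(is_allowed("carol", {}, host, "manager"))  # carol may only view
```
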
### Quick setup

```yaml
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...   # hbd passwd alice
    admin: true

default_owner: alice

hosts:
  webserver01:
    owner: alice
    managers: [bob]
    monitors: [carol]
```

```bash
# Generate a password hash
hbd passwd alice
```

Browser users are redirected to `/login` automatically. The session cookie is set on login, so `fetch()` calls from dashboards work without any JavaScript changes.

See [docs/USERS.md](docs/USERS.md) for complete user management documentation.

---

## 🌐 HTTP API & Web UI

Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST API and web-based dashboards for monitoring and visualization.

### Features

- **User auth**: Optional session-based authentication with per-host role enforcement
- **REST API**: JSON endpoints for accessing plugin data, alerts, host information, and user management
- **Live Dashboard**: Real-time WebSocket-powered host status view
- **Plugin Metrics**: Interactive visualization of all plugin data with auto-refresh
- **Alerts Dashboard**: Comprehensive alert monitoring with filtering and summaries

### Web Dashboards

- **Login** (`/login`): Browser login form (shown automatically when auth is configured)
- **Live View** (`/live`): Real-time host connectivity, latency, and messages; hostnames link directly to the Host Overview page
- **Host Overview** (`/plugins/<host>`): Per-host plugin metrics with ZFS pool visualization; filtered to hosts where the logged-in user is owner or manager (admins see all)
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering; alert count pie chart shown in the navigation bar
- **Settings** (`/settings`): Server configuration, user management, and threshold configuration viewer

### API Endpoints

```bash
# Log in (when auth is configured)
TOKEN=$(curl -s -X POST http://localhost:50004/api/0/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"alice","password":"secret"}' | jq -r .token)
# Keep the header in an array so the quoted value survives word splitting
AUTH=(-H "Authorization: Bearer $TOKEN")

# List all monitored hosts
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts

# Get all plugin data for a host
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts/webserver01/plugins

# Get detailed plugin history (last 50 samples)
curl "${AUTH[@]}" "http://localhost:50004/api/0/hosts/webserver01/plugins/cpu_monitor?limit=50"

# Get alert states for a specific host
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts/webserver01/alerts

# Get all active alerts across all hosts
curl "${AUTH[@]}" http://localhost:50004/api/0/alerts

# View/update host access roles
curl "${AUTH[@]}" http://localhost:50004/api/0/hosts/webserver01/access
```

See [docs/HTTP_API.md](docs/HTTP_API.md) for complete API documentation including response formats, error handling, and integration examples.

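
The same calls can be made from Python with only the standard library. The sketch below reuses the base URL, paths, and `token` response field from the `curl` examples above; the helper names are hypothetical, and response formats beyond `token` are not assumed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:50004/api/0"  # same host and port as the curl examples


def login_request(username, password):
    """Build the POST request for /auth/login."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def api_request(path, token=None):
    """Build a GET request, adding the bearer token when one is given."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(f"{BASE_URL}{path}", headers=headers)


def fetch_json(request, timeout=10):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage against a running server:
#   token = fetch_json(login_request("alice", "secret"))["token"]
#   hosts = fetch_json(api_request("/hosts", token))
```

Building the request separately from sending it keeps the helpers easy to unit-test without a live server.
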
---

## ⚙️ Quickstart

Prerequisites:

- Python 3.11+ (project uses language features from recent Python)
- `nsupdate` (for DNS updates) if using dynamic DNS

Install dependencies (a virtualenv is recommended):

This project now declares its dependencies in `pyproject.toml`. Instead of the old `requirements.txt` flow, install the package into a virtualenv using `pip`. See `scripts/hb_install.sh` for one way to install.

Run the daemon (example):

```bash
# run with an explicit config file (without -c, the default lookup is ~/.hb.yaml)
hbd -c .hb.yaml -f -v
```

You can also run it directly via the package entrypoint after installation:

```bash
python -m hbd.server.cli -c /path/to/config.yaml
```

### Running the Client

The heartbeat client (`hbc`) sends periodic heartbeats and plugin data to the server:

```bash
# Basic usage pointing to server (host is a positional argument)
hbc your-server.example.com

# Run as daemon with a config file
hbc -d -c /etc/hbc.yaml your-server.example.com

# Send a one-off boot message
hbc --boot your-server.example.com

# Verbose output
hbc -v your-server.example.com

# Send 'boot' and 'shutdown' messages on start and exit
hbc -b your-server.example.com
```

You can also run it via the module entrypoint:

```bash
python -m hbd.client.main your-server.example.com
```

Client configuration can also be specified in YAML (`~/.hbc.yaml`):

```yaml
hb_port: 50003        # Server port (default: 50003)
interval: 30          # Heartbeat interval in seconds
plugins:
  cpu_monitor:
    interval: 300     # Check every 5 minutes (default)
    per_core: true
  memory_monitor:
    interval: 300     # Check every 5 minutes (default)
  disk_monitor:
    interval: 300     # Check every 5 minutes (default)
  network_monitor:
    interval: 300     # Check every 5 minutes (default)
  nagios_runner:
    interval: 300     # Check every 5 minutes (default)
    commands:
      - name: check_load
        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - name: check_disk
        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```

The server hostname is always passed as a positional command-line argument; there is no `server:` config key.

All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.

**Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.

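
The retry rule above can be expressed as a small piece of per-connection bookkeeping. This is a sketch of the described policy only; the class and attribute names are hypothetical and the actual `hbc` implementation may track this state differently.

```python
import socket

MAX_EARLY_IPV6_FAILURES = 3  # "3 consecutive failures" from the paragraph above


class ConnectionState:
    """Tracks open() attempts for one server address."""

    def __init__(self, family):
        self.family = family
        self.ever_connected = False
        self.failures = 0
        self.dropped = False

    def record_attempt(self, succeeded):
        """Update state after one open() attempt; return True while retrying continues."""
        if succeeded:
            self.ever_connected = True
            self.failures = 0
            return True
        self.failures += 1
        never_worked_ipv6 = self.family == socket.AF_INET6 and not self.ever_connected
        if never_worked_ipv6 and self.failures >= MAX_EARLY_IPV6_FAILURES:
            self.dropped = True  # likely a host without IPv6 routing
        return not self.dropped


v6 = ConnectionState(socket.AF_INET6)
print([v6.record_attempt(False) for _ in range(3)])
```

An IPv6 connection that has succeeded at least once is treated like IPv4 afterwards and keeps retrying on every heartbeat interval.
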
**Daemon logging:** When running with `-d`, `hbc` routes all log output to syslog (`LOG_DAEMON` facility) after daemonizing. Without `-d`, logs go to stderr as usual.

### hbc_mini: single-file client (no external dependencies)

`scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly; no virtualenv or `pip install` is needed.

```bash
# Basic usage
python3 hbc_mini.py your-server.example.com

# Run as daemon
python3 hbc_mini.py -d your-server.example.com

# Send a boot message
python3 hbc_mini.py -b your-server.example.com

# Send a one-off message
python3 hbc_mini.py -m "maintenance starting" your-server.example.com
```

**Config:** `~/.hbc.json` (same keys as `~/.hbc.yaml`, JSON format). Example:

```json
{
  "hb_port": 50003,
  "interval": 30,
  "plugins": {
    "ping_monitor": {
      "interval": 60,
      "hosts": ["8.8.8.8", "192.168.1.1"]
    },
    "nagios_runner": {
      "interval": 300,
      "commands": [
        {"name": "check_load", "command": "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6"}
      ]
    }
  }
}
```

**Plugin availability:**

| Plugin | Platform | Data source |
|---|---|---|
| `os_info` | all | `platform` stdlib |
| `ping_monitor` | all | `ping` subprocess |
| `nagios_runner` | all (not Windows) | subprocess |
| `cpu_monitor` | Linux | `/proc/stat` |
| `memory_monitor` | Linux | `/proc/meminfo` |
| `disk_monitor` | Linux, macOS, BSD | `df -P` subprocess |
| `network_monitor` | Linux | `/proc/net/dev` |

**What is not available compared to the full `hbc`:**

- No YAML config (use JSON instead)
- No `filesystem_info` plugin
- No `zfs_monitor` plugin (requires `zpool(8)` and the full plugin loader)
- `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
- Plugins cannot be loaded from external `.py` files; all plugins are compiled in
- No IPv6 early-fail protection: connections that fail to open at startup are silently skipped rather than retried

Everything else (heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog) is identical to the full client.

---

## 🐞 Debugging in VS Code

This repository includes a ready-to-use `.vscode/launch.json` with configurations to run or attach the VS Code debugger to `hbd`.

- Ensure the **Python** extension is installed and select the project `.venv` as the interpreter (bottom-left of VS Code).
- Use **F5** and pick one of these configurations from the Run view:
  - **Python: Run hbd (module)**: runs `hbd.server.cli` as a module and sets `PYTHONPATH` to the workspace root (recommended).
  - **Python: Run hbd with debugpy (listen)**: launches `debugpy` and `hbd` together; useful when you want the process to listen for a debugger.
  - **Python: Attach (localhost:5678)**: attach the debugger to a running process started with `debugpy`.

To start `hbd` manually and wait for the debugger to attach, run:

```bash
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.server.cli -c .hb.yaml -f -v
```

Set breakpoints in modules such as `hbd/server/udp.py`, `hbd/server/dns.py`, or `hbd/server/main.py`, and use the **Attach** configuration to connect. Use `justMyCode: false` if you need to step into third-party code.

---

## 🛠 Configuration

`hbd` reads YAML configuration (optional). If `PyYAML` is not installed, built-in defaults are used. Example configuration keys (see `hbd/server/config.py`):

- `hb_port`: UDP port to listen for heartbeats (default: 50003)
- `hbd_port`: port for the HTTP API and web UI (default: 50004)
- `hbd_host`: bind address for HTTP/WSS
- `pickfile`: path for persisted state
- `logfile`: path to log file
- `pushsrv`: push service (`pushover`|`mattermost`|`all`)
- `interval` / `grace`: heartbeat timing configuration
- `dyndomains`: list of dyndomains to update via `nsupdate`
- `nsupdate_bin`: path to nsupdate binary
- `ws_port`: port for plain WebSocket connections (default: 50005)
- `wss_port`: port for secure WebSocket (WSS) connections (default: none).
  If set, `hbd` will attempt to serve WSS on this port when the `wss_pem` and
  `wss_key` SSL files are available under `cert_path` (see below).
- `cert_path`: directory where TLS certificate and key are looked up (default: /usr/local/etc/ssl/)
- `wss_pem`: filename for the certificate chain (default: fullchain.pem)
- `wss_key`: filename for the private key (default: privkey.pem)
- `users`: mapping of username → user attributes (full_name, avatar, password, admin, notification_channels)
- `default_owner`: username that owns hosts with no explicit owner (falls back to first admin user)

Example `.hb.yaml` (minimal):

```yaml
hbd_host: 0.0.0.0
hbd_port: 50004
dyndomains:
  - example.com
nsupdate_bin: /usr/bin/nsupdate
pushsrv: pushover
```

> Tip: `SERVER_DEFAULTS` in `hbd/server/config.py` contains the canonical defaults and accepted configuration keys.

---

## 🔧 Architecture & Modules

The package is organized into three subpackages:

**`hbd.common`** (shared code used by both client and server):
- `hbd.common.proto`: serialization/deserialization of heartbeat messages (supports compressed payloads and plugin data)
- `hbd.common.utils`: small utility helpers (`shortname`, `dur`, `initlog`)

**`hbd.server`** (the heartbeat daemon, `hbd`):
- `hbd.server.cli`: CLI entrypoint and argument parsing
- `hbd.server.main`: async orchestration to run UDP/HTTP/WSS components
- `hbd.server.udp`: UDP parsing and `handle_datagram` implementation (main state machine)
- `hbd.server.dns`: `create_nsupdate_payload`, `nsupdate`, and an asyncio DNS worker (`start_dns_worker`).
  The DNS worker runs as an `asyncio` task and the package exposes a small thread-safe bridge
  so legacy synchronous code can `put()` updates into the queue.
- `hbd.server.notify`: email and push notification helpers
- `hbd.server.ws`: WebSocket server and thread-safe broadcast helpers
- `hbd.server.http`: HTTP handler factory for the status UI/API
- `hbd.server.journal`: message journal with size-based log rotation and backup management
- `hbd.server.threshold`: threshold alerting engine
- `hbd.server.monitor`: host state monitoring
- `hbd.server.hbdclass`: `Host` class and shared server state
- `hbd.server.config`: configuration loader and defaults

**`hbd.client`** (the heartbeat client, `hbc`):
- `hbd.client.main`: client entrypoint; sends heartbeats and plugin data to the server
- `hbd.client.plugin`: plugin framework with base classes, registry, and dynamic loader
- `hbd.client.plugins/`: built-in plugins (os_info, cpu_monitor, memory_monitor, disk_monitor, network_monitor, ping_monitor, filesystem_info, nagios_runner, zfs_monitor)
- `hbd.client.config`: client configuration loader

This modular layout makes the code easier to test and maintain.

**Runtime & Shutdown**

- The main runtime is asyncio-based. Services (UDP listener, HTTP server, WebSocket server, monitor, and DNS worker) run as asyncio tasks.
- On SIGINT/SIGTERM the server triggers a graceful shutdown: it cancels active tasks, signals the DNS worker via a sentinel, and cleans up resources before exit.
- The DNS update worker is implemented as an `asyncio` task; synchronous producers can still enqueue DNS updates via a small thread-safe bridge available at `hbd.server.hbdclass.Host.dnsQ`.

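
The worker, sentinel, and thread-safe bridge pattern described above can be illustrated with a self-contained sketch. The names `dns_worker`, `ThreadSafeBridge`, and `_STOP` are hypothetical and chosen for this example; they are not the actual `hbd.server.dns` API, and the "update" is just a string appended to a list.

```python
import asyncio
import threading

_STOP = object()  # sentinel that tells the worker to exit


async def dns_worker(queue, apply_update):
    """Drain the queue until the sentinel arrives, applying each update in turn."""
    while True:
        item = await queue.get()
        if item is _STOP:
            break
        # Run the blocking update (for example an nsupdate call) off the event loop.
        await asyncio.to_thread(apply_update, item)


class ThreadSafeBridge:
    """Lets synchronous code running in other threads enqueue items."""

    def __init__(self, loop, queue):
        self._loop = loop
        self._queue = queue

    def put(self, item):
        # asyncio.Queue is not thread-safe, so hand the put over to the loop thread.
        self._loop.call_soon_threadsafe(self._queue.put_nowait, item)


async def main():
    queue = asyncio.Queue()
    applied = []
    worker = asyncio.create_task(dns_worker(queue, applied.append))
    bridge = ThreadSafeBridge(asyncio.get_running_loop(), queue)

    # A plain thread stands in for legacy synchronous code.
    producer = threading.Thread(target=bridge.put, args=("host1.example.com A 192.0.2.10",))
    producer.start()
    await asyncio.to_thread(producer.join)

    bridge.put(_STOP)  # graceful shutdown: queued behind the update above
    await worker
    return applied


print(asyncio.run(main()))
```

Using a sentinel rather than cancelling the task lets the worker finish any updates that were queued before shutdown was requested.
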
**Templates & Static Files**

- Template files are located under `hbd/server/templates`. The HTTP server resolves templates relative to the `hbd.server` package, but the path can be overridden with the `templates_dir` config key.
- Static assets (CSS/JS/images) are served from `hbd/server/static` via the `/static/<path>` HTTP route.

---

## 🧪 Testing & Dev

Tests are implemented using `unittest`; some additional tests rely on `pytest`. To run tests locally without installing anything beyond the dev requirements:

```bash
# with project root on PYTHONPATH
PYTHONPATH=. python -m unittest discover -v
# or with pytest if installed
pytest -q
```

Developer tooling included:

- `pyproject.toml`: project metadata and dependencies
- `tox.ini`: convenience wrappers for running tests, lint, and mypy

To run linters and type checks locally:

```bash
# after installing dev deps
tox -e lint
tox -e mypy
```

---

## 🚀 Running in production

- Use your system service manager (systemd, launchd, etc.) to run `hbd` in the background.
- Ensure `nsupdate` and necessary credentials are available for dynamic DNS updates.
- Configure TLS for WSS if you enable secure websockets.

> Note: The project contains a small example for obtaining DNS-verified certs (certbot with RFC2136); see earlier commit history.

---

## 🤝 Contributing

Contributions welcome! Please:

1. Open an issue to discuss larger changes.
2. Create a topic branch and a clear PR.
3. Add tests for new features and run linters.
4. Keep changes focused and documented.

---

## 📜 License

This repository is licensed under the MIT license. See `LICENSE` for details.
