# Heartbeat Daemon (hbd)

A lightweight UDP-based host monitoring system. Monitored hosts run a client (`hbc`) that sends periodic heartbeat packets and system metrics to a central server (`hbd`). The server tracks host reachability, evaluates metric thresholds, sends notifications, and serves a web dashboard.

---

## Architecture

```
  [ host running hbc ]                [ server running hbd ]
  ┌────────────────────┐              ┌────────────────────────────┐
  │  heartbeat client  │  UDP 50003   │  heartbeat daemon          │
  │                    │ ──────────>  │                            │
  │  plugins:          │  HTB / PLG   │  host state tracking       │
  │  - cpu_monitor     │              │  threshold evaluation      │
  │  - memory_monitor  │  <────────── │  DNS updates (nsupdate)    │
  │  - disk_monitor    │  ACK/CMD/UPD │  notifications             │
  │  - nagios_runner   │              │  web dashboard + REST API  │
  │  - ...             │              │  WebSocket live updates    │
  └────────────────────┘              └────────────────────────────┘
```

- **Package:** `hbd` v5.3.9
- **Python:** 3.11+

### Subpackages

| Package | Purpose |
|---|---|
| `hbd.common` | Protocol encoding/decoding, shared utilities |
| `hbd.server` | The `hbd` daemon |
| `hbd.client` | The `hbc` client |

---

## Installation

Dependencies are declared in `pyproject.toml`. Install into a virtualenv:

```bash
# Server + client
pip install .

# Using the install script
scripts/hb_install.sh
```

**Entry points:**
- `hbd` — server (`hbd.server.cli:main`)
- `hbc` — client (`hbd.client.main:main`)

**Runtime dependencies:**

| Component | Packages |
|---|---|
| Both | PyYAML ≥6.0 |
| Client | psutil ≥5.9.0 |
| Server | aiohttp ≥3.11, websockets ≥13.2, Jinja2 ≥3.1.6, ruamel.yaml ≥0.18, mattermostdriver ≥7.3.0, matrix-nio ≥0.24 |

---

## Server (`hbd`)

### Starting the server

```bash
# Foreground, verbose, with config file
hbd serve -c /etc/hb.yaml -f -v

# As a module
python -m hbd.server.cli serve -c /etc/hb.yaml
```

### CLI subcommands

| Command | Description |
|---|---|
| `hbd serve` | Start the daemon (default) |
| `hbd passwd <username>` | Generate a password hash for config |
| `hbd notify` | Test notification channels |
| `hbd stop` | Stop a running daemon |
| `hbd reload` | Reload config (send SIGHUP) |
| `hbd restart` | Restart daemon |

### Configuration (`~/.hb.yaml`)

```yaml
# Network
hb_port: 50003          # UDP port for heartbeat messages
hbd_port: 50004         # HTTP API / web UI port
hbd_host: ""            # Bind address (empty = all interfaces)
ws_port: 50005          # WebSocket port (plain)
wss_port: ~             # WebSocket port (TLS; requires cert_path/wss_pem/wss_key)

# Timing
interval: 20            # Expected heartbeat interval (seconds)
grace: 2                # Extra seconds before declaring a host overdue

# Persistence
pickfile: ~/.hb.pick    # Host state persistence
pidfile: ~/.hb.pid
logfile: ~/.hb.log

# Message journal
journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600   # 100 MB
journal_max_backups: 10

# DNS
nsupdate_bin: /usr/bin/nsupdate
dyndomains:
  - example.com

# Threshold alert re-notification interval (seconds)
threshold_renotify_interval: 3600

# Notification channels
notification_channels:
  pushover_ops:
    type: pushover
    token: YOUR_APP_TOKEN
    user: YOUR_USER_KEY
  email_ops:
    type: email
    smtp_server: smtp.example.com
    port: 587
    user: alerts@example.com
    password: secret
    recipients: [ops@example.com]

# Users
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...    # generate with: hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]
  bob:
    password: pbkdf2:sha256:...
    notification_channels: [email_ops]

default_owner: alice

# Hosts
hosts:
  webserver01:
    dyndns: true          # Update DNS when address changes
    owner: alice
    managers: [bob]
    monitors: []
  database01:
    watch: false          # Suppress all notifications for this host
```
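
As a rough illustration of how the two timing settings might combine, the sketch below treats a host as overdue once `interval + grace` seconds have passed since its last heartbeat. The helper name `is_overdue` and the exact comparison are assumptions for illustration, not taken from the `hbd` source.

```python
import time


def is_overdue(last_seen, interval=20, grace=2, now=None):
    """Return True if no heartbeat arrived within interval + grace seconds.

    Hypothetical helper: the real check in hbd may differ, for example by
    using the interval each client reports in its HTB messages.
    """
    if now is None:
        now = time.time()
    return (now - last_seen) > (interval + grace)
```

With the values shown above, a host last seen at `t=100` would still be on time at `t=122` and overdue at `t=123`.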

Send SIGHUP (or `hbd reload`) to reload configuration without restarting. Changes to ports, certificates, pickle path, and journal path require a full restart.

### Persistence

Host state (reachability, plugin data, alert states) is saved to `pickfile` every 5 minutes and on clean shutdown. The server loads this state on startup.

---

## Client (`hbc`)

### Usage

```bash
# Basic — send heartbeats to a server
hbc your-server.example.com

# Multiple servers
hbc server1.example.com server2.example.com

# With config file, running as a daemon
hbc -d -c /etc/hbc.yaml your-server.example.com

# Send a boot message, then heartbeat normally
hbc -b your-server.example.com

# One-off message
hbc -m "maintenance starting" your-server.example.com

# Force IPv4 or IPv6 only
hbc -4 your-server.example.com
hbc -6 your-server.example.com
```

### Options

| Flag | Description |
|---|---|
| `-b`, `--boot` | Send a boot message at startup |
| `-c`, `--config FILE` | Config file path (default: `~/.hbc.yaml`) |
| `-d`, `--daemon` | Daemonize (logs go to syslog) |
| `-m`, `--message TEXT` | Send a one-off message and exit |
| `-n`, `--name NAME` | Override reported hostname |
| `-v`, `--verbose` | Verbose output |
| `-x`, `--debug` | Debug level (repeatable) |
| `-4` / `-6` | Restrict to IPv4 or IPv6 |

### Configuration (`~/.hbc.yaml`)

```yaml
hb_port: 50003         # Server UDP port
interval: 10           # Heartbeat interval (seconds)
owner: alice           # Optional: claim ownership of this host

plugins:
  cpu_monitor:
    interval: 300      # Override collection interval
    per_core: true     # Report per-core CPU usage
  memory_monitor:
    interval: 300
  disk_monitor:
    interval: 300
  network_monitor:
    interval: 300
  ping_monitor:
    interval: 60
    hosts: [8.8.8.8, 192.168.1.1]
  nagios_runner:
    interval: 300
    commands:
      - name: check_load
        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - name: check_disk_root
        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
  zfs_monitor:
    interval: 300
```

### Connection behaviour

- The client sends heartbeats over UDP to each server address resolved from the hostname (IPv4 and IPv6).
- If connections fail to open at startup, an IPv6 connection is dropped after 3 consecutive failures, while IPv4 connections are retried indefinitely.
- In daemon mode (`-d`), all log output goes to syslog (`LOG_DAEMON` facility).
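
The address resolution described above can be sketched with the standard library. This is only an illustration of resolving one hostname to both IPv4 and IPv6 UDP targets; the function name `resolve_servers` is hypothetical and the real client may resolve addresses differently.

```python
import socket


def resolve_servers(host, port=50003, family=socket.AF_UNSPEC):
    """Return a de-duplicated list of (family, sockaddr) pairs for UDP."""
    infos = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_DGRAM)
    seen, result = set(), []
    for fam, _type, _proto, _canon, sockaddr in infos:
        if (fam, sockaddr) not in seen:
            seen.add((fam, sockaddr))
            result.append((fam, sockaddr))
    return result
```

Passing `socket.AF_INET` or `socket.AF_INET6` as `family` gives the same kind of restriction as the `-4` and `-6` flags.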

---

## UDP Protocol

All messages are zlib-compressed key=value pairs with an ID prefix.

```
!<ID>: <zlib-compressed payload>
```

Payload format: `key=value;key=value;...`

| Message | Direction | Purpose |
|---|---|---|
| `HTB` | client → server | Heartbeat (name, timestamp, RTT, acks, interval) |
| `PLG` | client → server | Plugin data (plugin name + metrics) |
| `ACK` | server → client | Acknowledgment |
| `CMD` | server → client | Execute a shell command on the client |
| `UPD` | server → client | Trigger self-update via `hb_install.sh` |

Value encoding:
- Floats: 5 decimal places
- Lists/dicts: JSON prefixed with `@`
- Booleans: `1` / `0`
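
A minimal sketch of these encoding rules follows. It is not the code in `hbd.common`: the function names are hypothetical, the exact byte layout around the `!<ID>: ` prefix is inferred from the format line above, and it assumes that keys and values never contain `;` and that plain string values never start with `@`.

```python
import json
import zlib


def encode_value(value):
    # bool is a subclass of int, so it has to be tested first
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, float):
        return f"{value:.5f}"
    if isinstance(value, (list, dict)):
        return "@" + json.dumps(value, separators=(",", ":"))
    return str(value)


def encode_message(msg_id, fields):
    payload = ";".join(f"{key}={encode_value(val)}" for key, val in fields.items())
    return f"!{msg_id}: ".encode("ascii") + zlib.compress(payload.encode("utf-8"))


def decode_message(packet):
    # The first ": " always falls in the ASCII header, before the compressed body
    header, _sep, body = packet.partition(b": ")
    msg_id = header.decode("ascii").lstrip("!")
    fields = {}
    for pair in zlib.decompress(body).decode("utf-8").split(";"):
        key, _eq, raw = pair.partition("=")
        # Only "@"-prefixed values carry their type; the rest stay strings
        fields[key] = json.loads(raw[1:]) if raw.startswith("@") else raw
    return msg_id, fields
```

Decoded scalars stay as strings in this sketch because the wire format does not distinguish `10` the integer from `10` the string; a real decoder would need per-field type knowledge.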

RTT is measured using kernel SO_TIMESTAMP when available (Linux, macOS, FreeBSD), falling back to application-layer timing.

---

## Plugin System

Plugins run on the client and collect system metrics that are sent to the server as `PLG` messages.

### Plugin types

| Type | `interval` | When collected |
|---|---|---|
| `InfoPlugin` | 0 | Once at startup; re-collected on server request |
| `MonitorPlugin` | 30 (default) | Periodically on the configured interval |

### Built-in plugins

| Plugin | Type | Data collected |
|---|---|---|
| `os_info` | Info | OS, kernel, distro, architecture, Python version, hbc version |
| `cpu_monitor` | Monitor | cpu_percent, per-core usage, load averages, process count, frequency |
| `memory_monitor` | Monitor | RAM and swap usage (ZFS ARC-aware) |
| `disk_monitor` | Monitor | Per-partition usage, disk I/O stats |
| `network_monitor` | Monitor | Per-interface byte/packet counts, connection count |
| `ping_monitor` | Monitor | RTT, packet loss, jitter per configured host |
| `filesystem_info` | Info | Mounted filesystems (excludes pseudo filesystems) |
| `nagios_runner` | Monitor | Output of configured Nagios-compatible check commands |
| `zfs_monitor` | Monitor | ZFS pool health, capacity, fragmentation, dedup ratio, I/O |

### Custom plugins

Create a `.py` file in `hbd/client/plugins/`:

```python
from hbd.client.plugin import MonitorPlugin

class MyPlugin(MonitorPlugin):
    name = "my_plugin"
    interval = 60

    async def collect(self):
        return {"my_metric": 42}
```

`initialize()` is called once at load time; return `False` to disable the plugin (e.g., if a required binary is missing).
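
As an illustration of the missing-binary case, the sketch below disables itself when `who` is not on the `PATH`. It assumes `initialize()` is a synchronous method taking no arguments, which is not confirmed here (check `hbd/client/plugin.py`). The plugin name, the `binary` attribute, and the stand-in base class are all hypothetical; the stand-in only exists so the snippet runs outside the repository.

```python
import asyncio
import shutil

try:
    from hbd.client.plugin import MonitorPlugin
except ImportError:
    class MonitorPlugin:  # stand-in so the sketch runs without hbd installed
        name = ""
        interval = 30


class WhoCountPlugin(MonitorPlugin):
    name = "who_count"
    interval = 60
    binary = "who"  # hypothetical attribute: the external tool this plugin needs

    def initialize(self):
        # Returning False tells the client to disable this plugin
        self._path = shutil.which(self.binary)
        return self._path is not None

    async def collect(self):
        proc = await asyncio.create_subprocess_exec(
            self._path, stdout=asyncio.subprocess.PIPE
        )
        out, _err = await proc.communicate()
        # One line of `who` output per login session
        return {"sessions": len(out.splitlines())}
```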

### Nagios integration

The `nagios_runner` plugin executes any Nagios-compatible check binary:

```yaml
plugins:
  nagios_runner:
    commands:
      - name: check_http
        command: /usr/lib/nagios/plugins/check_http -H example.com
```

- Commands are validated (absolute paths, executable) at startup.
- Exit codes map to OK / WARNING / CRITICAL / UNKNOWN.
- Performance data fields are extracted and stored individually.
- The `nagios` threshold operator maps exit codes directly to alert levels (see Threshold Alerting).

---

## Threshold Alerting

The server evaluates plugin metrics against configurable thresholds and fires notifications on state changes.

### Configuration

```yaml
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      critical: 90.0
      operator: ">"         # >, >=, <, <=, ==, != (default: >)
      hysteresis: 0.1       # 10%: recover at 81 when critical=90
      count: 1              # Require N consecutive breaches before alerting
      display: "CPU {cpu_percent}% (threshold: {op_symbol}{threshold_value})"

  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0

  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0
          critical: 5.0
          operator: "<"

  nagios_runner:
    status_code:
      operator: "nagios"    # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
      display: "{check_name}: {output}"
```
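
The sketch below shows one way the `operator` and `hysteresis` settings could be interpreted. The recovery formula for `>` is inferred from the `recover at 81 when critical=90` comment above; the mirrored formula for `<` is an assumption, and `count` and the `nagios` operator are not modelled. Function names are hypothetical.

```python
import operator as op

OPS = {">": op.gt, ">=": op.ge, "<": op.lt, "<=": op.le, "==": op.eq, "!=": op.ne}


def level(value, warning, critical, operator=">"):
    """Classify a metric value, checking the critical threshold first."""
    compare = OPS[operator]
    if compare(value, critical):
        return "CRITICAL"
    if compare(value, warning):
        return "WARNING"
    return "OK"


def recovery_point(threshold, hysteresis, operator=">"):
    """Value the metric must pass before the alert is considered cleared."""
    if operator in (">", ">="):
        return threshold * (1 - hysteresis)   # e.g. 90 * 0.9 = 81
    if operator in ("<", "<="):
        return threshold * (1 + hysteresis)   # assumption: mirrored for lower bounds
    return threshold
```

Under this reading, a `free_gb` value of 7 with `operator: "<"`, `warning: 10.0`, and `critical: 5.0` is a WARNING, since it is below 10 but not below 5.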

### Per-host threshold profiles

Named profiles let different hosts use different thresholds. `threshold_config` accepts a single profile name or a list of names; a list is applied left to right, so later profiles override matching entries from earlier ones.

```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}

  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

hosts:
  web-01:
    threshold_config: default
  db-01:
    threshold_config: [default, tight_cpu]
```
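
One plausible reading of applying profiles in list order is a recursive merge, sketched below. Whether `hbd` merges nested keys or replaces whole plugin sections is not stated here, so treat the deep-merge behaviour and the function names as assumptions.

```python
import copy


def deep_merge(base, override):
    """Return a copy of base with override applied on top, key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_thresholds(profile_names, threshold_configs):
    """Accept one profile name or a list and merge them in list order."""
    if isinstance(profile_names, str):
        profile_names = [profile_names]
    result = {}
    for name in profile_names:
        result = deep_merge(result, threshold_configs[name]["thresholds"])
    return result


# The two profiles from the YAML example above
threshold_configs = {
    "default": {"thresholds": {"cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}}}},
    "tight_cpu": {"thresholds": {"cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}}}},
}
db01 = resolve_thresholds(["default", "tight_cpu"], threshold_configs)
```

Here `db01` ends up with the `tight_cpu` values for `cpu_percent`, while any entries that only `default` defined would be kept.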

### Alert states

| State | Meaning |
|---|---|
| OK | Metric within normal range |
| WARNING | Metric crossed warning threshold |
| CRITICAL | Metric crossed critical threshold |
| UNKNOWN | Cannot determine (e.g. Nagios exit code 3) |

Notifications are sent on state transitions (OK → WARNING, WARNING → CRITICAL, CRITICAL → OK). De-escalations (CRITICAL → WARNING) do not trigger a notification. Ongoing alerts generate a re-notification every `threshold_renotify_interval` seconds (default: 3600). Alerts can be acknowledged via the web UI or API to suppress re-notifications.
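
The transition rules above can be summarised in a few lines. This is a sketch under assumptions: it treats any move to a higher severity as an escalation and any return to OK as a recovery, it leaves UNKNOWN out because its handling is not described here, and the function names are hypothetical.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def should_notify(old, new):
    """Notify on escalations and on recovery to OK, not on de-escalations."""
    if old == new:
        return False
    if new == "OK":
        return True
    return SEVERITY[new] > SEVERITY[old]


def should_renotify(last_sent, now, acknowledged, interval=3600):
    """Re-notify an ongoing, unacknowledged alert once the interval has elapsed."""
    return not acknowledged and (now - last_sent) >= interval
```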

### RTT thresholds

The server measures heartbeat round-trip time and supports RTT thresholds using the same format:

```yaml
thresholds:
  rtt:
    webserver01:
      warning: 100.0    # ms
      critical: 500.0
```

### Generic threshold matching

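When a metric has no exact threshold entry, the server strips the leading underscore-separated segment from the metric name and retries, repeating until an entry matches or no segments remain. This allows one entry to cover all Nagios checks: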

```
nagios_runner.check_disk_root_status_code → no match
nagios_runner.disk_root_status_code       → no match
nagios_runner.root_status_code            → no match
nagios_runner.status_code                 → matched ✓
```

The stripped prefix (`check_disk_root`) is available as `{check_name}` in the `display` template.
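
The lookup traced above can be sketched as a loop over progressively shorter suffixes. The function name and return shape are hypothetical; only the stripping order is taken from the example.

```python
def match_threshold(metric, thresholds):
    """Find a threshold entry by dropping leading '_'-separated segments.

    Returns (matched_key, stripped_prefix), or None if nothing matches.
    """
    parts = metric.split("_")
    for i in range(len(parts)):
        key = "_".join(parts[i:])
        if key in thresholds:
            return key, "_".join(parts[:i])
    return None


nagios_thresholds = {"status_code": {"operator": "nagios"}}
match = match_threshold("check_disk_root_status_code", nagios_thresholds)
```

In this run `match` is `("status_code", "check_disk_root")`; the second element corresponds to what the template receives as `{check_name}`.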

### Display template variables

| Variable | Description |
|---|---|
| `{value}` | Current metric value |
| `{threshold_value}` | Threshold that was crossed |
| `{op_symbol}` | Comparison operator |
| `{check_name}` | Prefix stripped by generic matching |
| `{metric_name}` | Full field name |
| `{output}` | Nagios check output text |
| `{status}` | Nagios status name (OK/WARNING/CRITICAL/UNKNOWN) |
| any plugin field | Any field present in the plugin's data |
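
The templates look like Python `str.format` strings, so rendering might work roughly as below. That is an assumption, as are the function name and the rule that alert variables take precedence over plugin fields of the same name.

```python
def render_display(template, plugin_data, **alert_vars):
    """Fill a display template from plugin fields plus alert variables."""
    context = dict(plugin_data)
    context.update(alert_vars)  # alert variables win on a name clash
    return template.format_map(context)


text = render_display(
    "CPU {cpu_percent}% (threshold: {op_symbol}{threshold_value})",
    {"cpu_percent": 93.5},
    op_symbol=">",
    threshold_value=90.0,
)
```

With these inputs `text` is `CPU 93.5% (threshold: >90.0)`. A template that names a field missing from the data would raise `KeyError` in this sketch.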

---

## Notification Channels

Notifications are dispatched to the host's owner, managers, and monitors. Each user specifies which channels to use.

### Supported channel types

| Type | Required fields |
|---|---|
| `pushover` | `token`, `user` |
| `email` | `smtp_server`, `recipients`, `sender`, `user`, `password`, `port` |
| `mattermost` | `webhook_url`, `channel` |
| `matrix` | `homeserver`, `user`, `password`, `room_id` |
| `signal` | `phone_number`, `recipient` |
| `sms_voipms` | `api_key`, `recipient` |

Each channel can set a `min_level` (`WARNING` or `CRITICAL`) to filter low-severity alerts.

Recovery notifications are only sent to channels that received the original alert.
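
The two rules above (severity filtering and recovery routing) can be sketched together. The class and method names are hypothetical, and treating a channel without `min_level` as accepting WARNING and above is an assumption.

```python
LEVELS = {"WARNING": 1, "CRITICAL": 2}


class AlertRouter:
    def __init__(self, channels):
        self.channels = channels
        self.alerted = {}  # alert key -> names of channels already notified

    def on_alert(self, key, level):
        """Return the channels that accept this level and remember them."""
        targets = [
            name
            for name, cfg in self.channels.items()
            if LEVELS[level] >= LEVELS[cfg.get("min_level", "WARNING")]
        ]
        self.alerted.setdefault(key, set()).update(targets)
        return targets

    def on_recovery(self, key):
        """Return only the channels that saw the original alert."""
        return sorted(self.alerted.pop(key, set()))


router = AlertRouter({
    "pushover_ops": {"type": "pushover", "min_level": "CRITICAL"},
    "email_ops": {"type": "email"},
})
```

A WARNING on this router goes to `email_ops` only, so the matching recovery also goes to `email_ops` only.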

---

## Web Dashboard & HTTP API

The server exposes a web UI and REST API on `hbd_port` (default 50004).

### Web pages

| Path | Description |
|---|---|
| `/login` | Login form (shown automatically when auth is configured) |
| `/live` | Real-time host connectivity, RTT, and message stream |
| `/plugins/<host>` | Per-host plugin metrics |
| `/alerts` | Active alerts with severity filtering |
| `/settings` | Server config, users, notification channels, thresholds |

Live views use WebSocket connections for real-time updates.

Non-admin users see only hosts where they have a role (monitor, manager, or owner). Admins see all hosts.

### REST API

All endpoints are under `/api/0/`. When authentication is configured, include a session token:

```bash
# Log in, get a token
TOKEN=$(curl -s -X POST http://localhost:50004/api/0/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"alice","password":"secret"}' | jq -r .token)

# Use the token
curl -H "Authorization: Bearer $TOKEN" http://localhost:50004/api/0/hosts
```
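
The same two calls can be made from Python with only the standard library. This is a sketch: the helper names are hypothetical, and the `token` response field is taken from the `jq -r .token` filter in the curl example above.

```python
import json
import urllib.request

BASE = "http://localhost:50004/api/0"


def login_request(username, password, base=BASE):
    """Build the POST request that creates a session."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/auth/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def hosts_request(token, base=BASE):
    """Build the authenticated GET request for the host list."""
    return urllib.request.Request(
        f"{base}/hosts", headers={"Authorization": f"Bearer {token}"}
    )


def fetch_json(request, timeout=10):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Against a running server, `token = fetch_json(login_request("alice", "secret"))["token"]` followed by `fetch_json(hosts_request(token))` would mirror the curl commands.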

| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/0/hosts` | All visible hosts |
| GET | `/api/0/alerts` | All active alerts |
| GET | `/api/0/alert_summary` | Count of ok/warning/critical |
| GET | `/api/0/messages` | Last 30 messages |
| GET | `/api/0/hosts/{host}/plugins` | All plugin data for host |
| GET | `/api/0/hosts/{host}/plugins/{plugin}?limit=N` | Plugin samples |
| GET | `/api/0/hosts/{host}/alerts` | Alert states for host |
| GET | `/api/0/hosts/{host}/access` | Access roles |
| PUT | `/api/0/hosts/{host}/access` | Update access roles |
| GET | `/api/0/hosts/{host}/info` | Host info (hbc version, thresholds) |
| POST | `/api/0/alerts/acknowledge` | Acknowledge alert |
| GET | `/api/0/users` | All users (admin only) |
| GET | `/api/0/users/me` | Current user profile |
| PUT | `/api/0/users/me` | Update own profile |
| POST | `/api/0/auth/login` | Create session |
| POST | `/api/0/auth/logout` | Destroy session |
| GET | `/api/0/config` | Server config (secrets redacted) |
| POST | `/api/0/config` | Update config |
| GET | `/api/0/config/backups` | List config backups |
| POST | `/api/0/config/rollback` | Roll back to previous config |
| GET | `/api/0/notification_channels` | List channels |
| POST | `/api/0/notification_channels` | Create channel |
| PUT | `/api/0/notification_channels/{name}` | Update channel |
| DELETE | `/api/0/notification_channels/{name}` | Delete channel |

---

## User Management & Authentication

When the config contains no `users:` block, the server runs unauthenticated, exactly as it did before user management existed.

### Roles

| Role | Capabilities |
|---|---|
| monitor | View status, plugin data, alerts |
| manager | monitor + queue commands, trigger DNS, queue upgrades |
| owner | manager + drop host, transfer ownership, update access |
| admin | Owner-level on all hosts + access to server config and users |
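
Because each role includes the one below it, a permission check reduces to comparing ranks. The sketch below is illustrative only: the function names are hypothetical, and the data shapes mirror the `users:` and `hosts:` config blocks rather than any internal `hbd` structure.

```python
ROLE_RANK = {"monitor": 1, "manager": 2, "owner": 3}


def role_on_host(username, host_cfg, users, default_owner=None):
    """Return the user's highest role on a host, or None if they have none."""
    if users.get(username, {}).get("admin"):
        return "owner"  # admins get owner-level access on every host
    if username == host_cfg.get("owner", default_owner):
        return "owner"
    if username in host_cfg.get("managers", []):
        return "manager"
    if username in host_cfg.get("monitors", []):
        return "monitor"
    return None


def can(username, required, host_cfg, users, default_owner=None):
    """True if the user's role on the host is at least `required`."""
    role = role_on_host(username, host_cfg, users, default_owner)
    return role is not None and ROLE_RANK[role] >= ROLE_RANK[required]
```

A user for whom `role_on_host` returns `None` would not see the host at all, which matches the visibility rule for non-admin users.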

### Setup

```yaml
users:
  alice:
    full_name: Alice Smith
    password: pbkdf2:sha256:...    # hbd passwd alice
    admin: true
    notification_channels: [pushover_ops]

default_owner: alice    # Owns any host with no explicit owner

hosts:
  webserver01:
    owner: alice
    managers: [bob]
    monitors: [carol]
```

Password hashing uses PBKDF2-HMAC-SHA256 (260,000 iterations). Sessions expire after 24 hours.
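
For reference, the key derivation named above is available in the Python standard library. The sketch shows only the derivation and a constant-time comparison; the salt length and the layout of the stored `pbkdf2:sha256:...` string are not specified here, so use `hbd passwd` to produce real config values.

```python
import hashlib
import hmac
import os

ITERATIONS = 260_000


def derive_key(password, salt, iterations=ITERATIONS):
    """PBKDF2-HMAC-SHA256 over the UTF-8 password; returns 32 raw bytes."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def verify(password, salt, expected, iterations=ITERATIONS):
    """Compare in constant time to avoid leaking how many bytes matched."""
    return hmac.compare_digest(derive_key(password, salt, iterations), expected)


salt = os.urandom(16)  # salt length is an assumption for this sketch
stored = derive_key("secret", salt)
```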

OAuth2 login (Gitea) is supported:

```yaml
oauth:
  gitea:
    url: https://git.example.com
    client_id: xxx
    client_secret: yyy
```

---

## Dynamic DNS

When `dyndns: true` is set on a host and `dyndomains` is configured, the server updates DNS via `nsupdate` whenever the host's source address changes.

```yaml
nsupdate_bin: /usr/bin/nsupdate
dyndomains:
  - example.com

hosts:
  webserver01:
    dyndns: true
```

DNS updates run asynchronously in a background worker.

---

## Message Journal

All received messages are logged in JSONL format with automatic size-based rotation.

```yaml
journal_enabled: true
journal_dir: /var/log/heartbeat
journal_file: messages.journal
journal_max_size: 104857600    # 100 MB
journal_max_backups: 10
```

Example entry:

```json
{"timestamp":1711234567.123,"datetime":"2024-03-23T22:56:07","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver01","interval":10}}
```
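
Since each journal line is a standalone JSON object, the file can be filtered with a short generator. This is a sketch with a hypothetical function name; the field names come from the example entry above.

```python
import json


def iter_journal(lines, host=None, msg_id=None):
    """Yield journal entries, optionally filtered by host name or message ID."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        message = entry.get("message", {})
        if host is not None and message.get("name") != host:
            continue
        if msg_id is not None and message.get("ID") != msg_id:
            continue
        yield entry
```

Any iterable of lines works, so an open file handle on `/var/log/heartbeat/messages.journal` can be passed directly, for example `iter_journal(fh, msg_id="HTB")`.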

---

## `hbc_mini` — Zero-dependency client

`scripts/hbc_mini.py` is a single-file client requiring only Python 3.8+ and no external packages. Copy it to any host and run directly.

```bash
python3 hbc_mini.py your-server.example.com
python3 hbc_mini.py -d your-server.example.com     # daemon mode
python3 hbc_mini.py -b your-server.example.com     # send boot message
```

Config: `~/.hbc.json` (JSON format, same keys as `~/.hbc.yaml`).

**Available plugins:**

| Plugin | Platform |
|---|---|
| `os_info` | All |
| `ping_monitor` | All |
| `nagios_runner` | All (not Windows) |
| `cpu_monitor` | Linux (`/proc/stat`; no per-core, no frequency) |
| `memory_monitor` | Linux (`/proc/meminfo`) |
| `disk_monitor` | Linux, macOS, BSD (`df -P`) |
| `network_monitor` | Linux (`/proc/net/dev`) |

Compared with the full `hbc`, `hbc_mini.py` has no YAML config, no `filesystem_info` or `zfs_monitor` plugins, and no IPv6 early-fail protection.

---

## `hbc_mini.c` — C client

`scripts/c/hbc_mini.c` is a single-file C port of `hbc_mini.py`. It has no runtime dependencies beyond libc, zlib, pthreads, and libm, and runs on Linux, FreeBSD, NetBSD, and DragonFly BSD.

### Build

```bash
cc -O2 -o hbc_mini scripts/c/hbc_mini.c -lz -lpthread -lm
```

### Usage

The CLI is identical to `hbc_mini.py`:

```bash
./hbc_mini your-server.example.com
./hbc_mini -d your-server.example.com      # daemon mode (logs to syslog)
./hbc_mini -b your-server.example.com      # send boot message
./hbc_mini -m "note" your-server.example.com   # send one-shot message
./hbc_mini -4 your-server.example.com      # IPv4 only
./hbc_mini -6 your-server.example.com      # IPv6 only
```

Config: `~/.hbc.json` (JSON, same keys as the Python version).

### Architecture

The C client uses two threads:

- **Main thread** — heartbeat sender loop + `select()`-based receive loop (1 s timeout). Sends `HTB` at the configured interval, receives `ACK`/`CMD` messages, and re-sends `os_info` on server request.
- **Monitor thread** — all periodic plugins in a single thread with a 1-second sleep loop. Each plugin has its own next-run timestamp tracked independently.

SIGHUP causes the process to restart itself via `execv()`. SIGTERM/SIGINT trigger a clean shutdown (sends a shutdown heartbeat if `-b` was used).

### Available plugins

| Plugin | Platform | Data source |
|---|---|---|
| `os_info` | Linux, FreeBSD, NetBSD, DragonFly | `uname(2)`, `/etc/os-release`, `kern.osrelease` sysctl |
| `cpu_monitor` | Linux | `/proc/stat` |
| `cpu_monitor` | FreeBSD, DragonFly, NetBSD | `kern.cp_time` sysctl |
| `memory_monitor` | Linux | `/proc/meminfo` (ZFS ARC-aware) |
| `memory_monitor` | FreeBSD, DragonFly | `vm.stats.vm.*` sysctl |
| `memory_monitor` | NetBSD | `VM_UVMEXP` sysctl |
| `disk_monitor` | All | `df -P` subprocess |
| `network_monitor` | Linux | `/proc/net/dev` |
| `network_monitor` | FreeBSD, NetBSD, DragonFly | `getifaddrs()` + `AF_LINK` |
| `ping_monitor` | All | `ping` subprocess |
| `nagios_runner` | All | `popen()` subprocess |

`cpu_monitor` reports: `cpu_percent`, `cpu_user`, `cpu_system`, `cpu_idle`, `cpu_iowait` (Linux only), load averages, `cpu_core_count`, `uptime_seconds`.

`memory_monitor` reports: `memory_total`, `memory_used`, `memory_available`, `memory_free`, `memory_percent`, and swap fields when swap is present.

`network_monitor` reports per-interface cumulative `bytes_recv`/`bytes_sent` and interval deltas. The loopback interface (`lo`) is skipped by default; this is configurable:

```json
{
  "plugins": {
    "network_monitor": {
      "skip_interfaces": ["lo", "docker0"]
    }
  }
}
```

`disk_monitor` reports per-mount `total`, `used`, `free`, `percent`. An optional mount filter restricts reporting to specific paths:

```json
{
  "plugins": {
    "disk_monitor": {
      "mounts": ["/", "/data"]
    }
  }
}
```

### Differences from `hbc_mini.py`

- `UPD` (self-update) messages are logged but not acted on

Like `hbc_mini.py`, the C client differs from the full `hbc` in having no `filesystem_info` or `zfs_monitor` plugins, no IPv6 early-fail protection, and JSON-only config (`~/.hbc.json`) with no YAML support.

---

## Development

### Running tests

```bash
PYTHONPATH=. python -m unittest discover -v
# or
pytest -q
```

### Linting and type checking

```bash
tox -e lint
tox -e mypy
```

### Debugging in VS Code

A `.vscode/launch.json` is included with configurations for running and attaching the debugger. Select the project `.venv` as the Python interpreter, then use F5.

To start with debugpy and wait for attach:

```bash
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.server.cli serve -c .hb.yaml -f -v
```

---

## License

MIT. See `LICENSE` for details.