Metadata-Version: 2.4
Name: hbd
Version: 5.0.11
Summary: Heartbeat monitoring system — client (hbc) and server (hbd)
Author: heartbeat contributors
License-Expression: MIT
Keywords: heartbeat,monitoring,dns,websocket,system-monitoring
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: PyYAML>=6.0
Provides-Extra: client
Requires-Dist: psutil>=5.9.0; extra == "client"
Provides-Extra: server
Requires-Dist: websockets>=13.2; extra == "server"
Requires-Dist: mattermostdriver>=7.3.0; extra == "server"
Requires-Dist: aiohttp>=3.11; extra == "server"
Requires-Dist: Jinja2>=3.1.6; extra == "server"
Provides-Extra: all
Requires-Dist: hbd[client,server]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: flake8>=5.0; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: tox>=4.0; extra == "dev"

# Heartbeat Daemon (hbd) ✅

A lightweight daemon that listens for UDP heartbeat messages and acts on them: keeps host state, optionally updates DNS records via `nsupdate`, forwards messages to WebSocket clients, and sends notifications (email, Pushover, Mattermost, Signal). It is a refactor of a previously monolithic script into a modular Python package (`hbd`).

---

## 📌 Features

- Receive and parse heartbeat datagrams (text or zlib-compressed) ✅
- Maintain host state and detect up/down transitions ✅
- Queue DNS updates via `nsupdate` and run them in a background `asyncio` worker ✅
- WebSocket API for live updates (hosts & messages) ✅
- Notification pipeline (email, Pushover, Mattermost, Signal) ✅
- **HTTP API & Web UI** ✅
  - REST API for plugin data, alerts, and host information
  - Live dashboard with WebSocket updates
  - Interactive plugin metrics visualization
  - Alerts dashboard with filtering and summaries
- **Message journal with automatic log rotation** ✅
  - Logs all received messages in JSON format
  - Size-based automatic rotation
  - Configurable retention and backup management
- **Plugin system for extensible monitoring** ✅
  - Collect system metrics (CPU, memory, disk, network)
  - Execute existing Nagios monitoring plugins
  - Create custom plugins with simple Python classes
- **Threshold alerting system** ✅
  - Monitor metrics against configurable WARNING/CRITICAL thresholds
  - Hysteresis to prevent alert flapping
  - Automatic notifications on state changes
  - Re-notification for ongoing alerts
- Modular codebase suitable for unit testing and CI ✅

---

## 🔌 Plugin System

Heartbeat includes a comprehensive plugin architecture that extends monitoring beyond simple heartbeats. The plugin system allows you to:

- **Collect system information**: OS details, hardware info, system configuration
- **Monitor resources**: CPU usage, memory, disk space, network statistics
- **Run Nagios plugins**: Execute thousands of existing Nagios monitoring plugins without modification
- **Create custom plugins**: Build your own monitoring logic with simple Python classes

### Plugin Types

- **InfoPlugin**: Collects static information once (e.g., OS version, hardware specs)
- **MonitorPlugin**: Collects metrics periodically (e.g., CPU usage every 30 seconds)

### Built-in Plugins

- `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
- `memory_monitor`: Monitors RAM and swap usage, available memory
- `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)

### Nagios Integration

The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:

- Executes plugins via subprocess with timeout protection
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
- Extracts performance data with thresholds
- Reports aggregated status across all configured checks

See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for the complete integration guide, including configuration examples and custom plugin development.
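
The steps listed above (run with a timeout, map the exit code, extract performance data) can be sketched as follows. This is a minimal illustration based on the usual Nagios plugin output convention, not the actual `nagios_runner` code; `parse_output` and `run_check` are hypothetical names, and perfdata labels containing spaces are not handled.

```python
import shlex
import subprocess

# Standard Nagios plugin exit codes.
STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def parse_output(exit_code: int, stdout: str) -> dict:
    """Split the first output line into status text and performance data."""
    first_line = stdout.strip().splitlines()[0] if stdout.strip() else ""
    text, _, perf = first_line.partition("|")
    perfdata = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        # Perfdata format: value[UOM];warn;crit;min;max (trailing fields optional)
        fields = rest.split(";")
        perfdata[label.strip("'")] = {
            "value": fields[0],
            "warn": fields[1] if len(fields) > 1 else "",
            "crit": fields[2] if len(fields) > 2 else "",
        }
    return {
        "status": STATUS.get(exit_code, "UNKNOWN"),
        "message": text.strip(),
        "perfdata": perfdata,
    }


def run_check(command: str, timeout: float = 10.0) -> dict:
    """Run one plugin command line; timeouts and launch errors become UNKNOWN."""
    try:
        proc = subprocess.run(
            shlex.split(command), capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        return {"status": "UNKNOWN", "message": str(exc), "perfdata": {}}
    return parse_output(proc.returncode, proc.stdout)
```

Keeping the parsing separate from the subprocess call makes the parser easy to unit test with canned plugin output.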

### Creating Custom Plugins

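```python
import shutil
import time

from hbd.plugin import MonitorPlugin


def get_disk_usage(path="/"):
    """Return used space on `path` as a percentage (illustrative helper)."""
    usage = shutil.disk_usage(path)
    return round(usage.used / usage.total * 100, 1)


class DiskMonitorPlugin(MonitorPlugin):
    name = "disk_monitor"
    interval = 60  # Run every 60 seconds

    async def collect(self):
        return {
            "disk_usage": get_disk_usage(),
            "timestamp": time.time(),
        }
```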

Place plugins in `hbd/plugins/` and they'll be automatically discovered and loaded by the client.

---

## 📝 Message Journal

Heartbeat includes a message journal that logs all received messages with automatic rotation.

### Features

- **JSON Format**: All messages logged in JSONL (JSON Lines) format for easy parsing
- **Automatic Rotation**: Size-based rotation with configurable thresholds
- **Backup Management**: Keeps configurable number of rotated log files
- **Non-blocking**: Async logging with minimal performance impact

### Configuration

```yaml
# Message journal settings
journal_enabled: true                    # Enable/disable journaling
journal_dir: /var/log/heartbeat         # Journal directory
journal_file: messages.journal           # Base filename
journal_max_size: 104857600             # Max size (100MB default)
journal_max_backups: 10                 # Number of backups to keep
```
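
To illustrate how `journal_max_size` and `journal_max_backups` interact, the sketch below reproduces size-based rotation with the standard library's `RotatingFileHandler`. It is a stand-in under stated assumptions, not the `hbd.journal` implementation; `make_journal` and `log_message` are hypothetical helpers, and the entry omits the `datetime` field for brevity.

```python
import json
import logging
import time
from logging.handlers import RotatingFileHandler
from pathlib import Path


def make_journal(directory: str, filename: str = "messages.journal",
                 max_size: int = 104857600, max_backups: int = 10) -> logging.Logger:
    """Return a logger that writes one JSON object per line and rotates by size."""
    Path(directory).mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        Path(directory) / filename, maxBytes=max_size, backupCount=max_backups
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger(f"journal.{directory}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep journal lines out of the root logger
    logger.addHandler(handler)
    return logger


def log_message(journal: logging.Logger, source_ip: str, source_port: int,
                message: dict) -> None:
    """Append one JSONL entry with the sender address and the parsed message."""
    entry = {
        "timestamp": time.time(),
        "source_ip": source_ip,
        "source_port": source_port,
        "message": message,
    }
    journal.info(json.dumps(entry, separators=(",", ":")))
```

With this handler, the active file is renamed to `messages.journal.1` once the next entry would push it past `max_size`, older backups shift up by one, and anything beyond `max_backups` is deleted.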

### Example Journal Entry

```json
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver1","interval":30}}
```

### Analyzing Journal Files

```bash
# View recent messages
tail -100 /var/log/heartbeat/messages.journal | jq .

# Count messages by type
jq -r '.message.ID' /var/log/heartbeat/messages.journal | sort | uniq -c

# Filter by hostname
jq 'select(.message.name == "webserver1")' /var/log/heartbeat/messages.journal
```
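
The same analysis can be done in Python when `jq` is not available. The sketch below assumes the JSONL layout shown in the example journal entry; `read_entries`, `count_by_type`, and `for_host` are hypothetical helper names.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line, skipping lines that fail to parse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written last line


def count_by_type(entries: Iterable[dict]) -> Counter:
    """Count entries by the message's ID field (for example "HTB")."""
    return Counter(entry["message"].get("ID") for entry in entries)


def for_host(entries: Iterable[dict], name: str) -> list:
    """Return the entries whose message was sent by the named host."""
    return [entry for entry in entries if entry["message"].get("name") == name]
```

Because `read_entries` accepts any iterable of lines, an open file handle can be passed directly, for instance `count_by_type(read_entries(open(path)))`.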

See [docs/MESSAGE_JOURNAL.md](docs/MESSAGE_JOURNAL.md) for complete documentation including rotation behavior, integration with log management systems, and analysis examples.

---

## 🚨 Threshold Alerting

Heartbeat includes a threshold alerting system that monitors plugin metrics and heartbeat round-trip time, and triggers notifications when values cross configured limits.

### Features

- **Multi-level alerts**: WARNING and CRITICAL severity levels
- **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
- **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
- **Smart notifications**: Alerts only on state changes, not every check
- **Re-notifications**: Periodic reminders for ongoing alerts
- **Journal integration**: All threshold events logged for audit trail

### Configuration

```yaml
thresholds:
  # RTT (Round-Trip Time) thresholds for heartbeat monitoring
  # These are checked on every HTB message arrival
  rtt:
    webserver01:
      warning: 100.0   # Warn when RTT > 100ms
      critical: 500.0  # Critical when RTT > 500ms
    
    database01:
      warning: 50.0
      critical: 200.0
  
  # Plugin metric thresholds
  cpu_monitor:
    cpu_percent:
      warning: 80.0      # Warn when CPU > 80%
      critical: 90.0     # Critical when CPU > 90%
      operator: ">"
      hysteresis: 0.1    # 10% hysteresis to prevent flapping
  
  memory_monitor:
    percent:
      warning: 85.0
      critical: 95.0
  
  disk_monitor:
    partitions:
      /:
        percent:
          warning: 80.0
          critical: 90.0
        free_gb:
          warning: 10.0   # Alert when < 10GB free
          critical: 5.0
          operator: "<"   # Inverse threshold

# Global settings
threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
```
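
As a rough illustration of how a `warning`/`critical` pair and an `operator` could be evaluated for one sample, consider the sketch below. It is not the daemon's code: `classify` is a hypothetical function, and checking `critical` before `warning` is an assumption. Hysteresis is left out here.

```python
import operator

# Comparison operators accepted in the `operator` key.
OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def classify(value: float, warning: float, critical: float, op: str = ">") -> str:
    """Return "CRITICAL", "WARNING" or "OK" for a single metric sample."""
    compare = OPERATORS[op]
    if compare(value, critical):  # the most severe level wins
        return "CRITICAL"
    if compare(value, warning):
        return "WARNING"
    return "OK"


# cpu_percent with the default ">" operator
print(classify(85.0, warning=80.0, critical=90.0))        # WARNING
# free_gb with the inverse "<" operator
print(classify(4.0, warning=10.0, critical=5.0, op="<"))  # CRITICAL
```

Testing the critical limit first matters for inverse thresholds: a `free_gb` of 4.0 is below both limits, and it should be reported at the higher severity.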

### RTT Monitoring

Heartbeat monitors network latency (Round-Trip Time) for each host's heartbeat messages. RTT thresholds are **fully integrated with the threshold alerting system**:

- **Per-host configuration**: Set different thresholds for each monitored host
- **Real-time checking**: Thresholds evaluated on every HTB message arrival
- **Alert state tracking**: RTT alerts use the same state management as plugin metrics
- **Hysteresis support**: Configurable hysteresis prevents rapid state transitions
- **Alerts dashboard**: RTT alerts visible on the `/alerts` web page alongside plugin alerts
- **Smart notifications**: Only triggers on state changes (OK → WARNING → CRITICAL)
- **Re-notification**: Periodic reminders for ongoing RTT issues
- **Event & journal logging**: All RTT events logged for audit trail

**Configuration format:**
```yaml
thresholds:
  rtt:
    <hostname>:
      warning: <milliseconds>   # Warn when RTT > this value
      critical: <milliseconds>  # Critical when RTT > this value
      hysteresis: 0.1           # Optional: 10% hysteresis (default)
```

**Example alerts:**
```
WARNING: webserver01 - rtt.webserver01 = 125.3
CRITICAL: database01 - rtt.database01 = 520.1
RECOVERED: webserver01 - rtt.webserver01 = 45.2 (WARNING -> OK)
```

RTT alerts appear on the Alerts dashboard and can be filtered by severity level. The `metric_path` format is `rtt.<hostname>`, making it easy to distinguish from plugin metrics.

### Alert Behavior

1. **State Changes**: Notifications sent when crossing thresholds
   - OK → WARNING: Early notification
   - WARNING → CRITICAL: Escalation
   - CRITICAL → OK: Recovery

2. **Hysteresis**: Prevents rapid state transitions
   ```
   Critical threshold: 90%
   Hysteresis: 10%
   Recovery threshold: 81% (90 - 10% of 90)
   
   Value 91% → CRITICAL (threshold crossed)
   Value 85% → CRITICAL (still above 81%)
   Value 79% → OK (below recovery threshold)
   ```

3. **Re-notifications**: Periodic reminders for ongoing alerts
   - Default: Every 60 minutes
   - Configurable via `threshold_renotify_interval`
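
The hysteresis arithmetic in step 2 can be written out as a short function. This sketch covers only a single threshold with the `>` operator and takes the recovery threshold to be `threshold * (1 - hysteresis)`, as in the worked example; `recovery_threshold` and `still_alerting` are hypothetical names, not part of `hbd`.

```python
def recovery_threshold(threshold: float, hysteresis: float = 0.1) -> float:
    """Value the metric must drop to (or below) before the alert clears."""
    return threshold * (1.0 - hysteresis)


def still_alerting(value: float, threshold: float, alerting: bool,
                   hysteresis: float = 0.1) -> bool:
    """Return the new alert state given the previous one."""
    if value > threshold:
        return True  # threshold crossed
    # Below the threshold: stay in alert only while above the recovery value.
    return alerting and value > recovery_threshold(threshold, hysteresis)


alerting = False
for value in (91.0, 85.0, 79.0, 85.0):
    alerting = still_alerting(value, 90.0, alerting)
    print(value, "CRITICAL" if alerting else "OK")
```

The loop prints CRITICAL for 91.0 and the first 85.0, then OK for 79.0 and the second 85.0: the same value of 85 maps to different states depending on the previous state, which is what stops an alert from flapping around the threshold.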

### Example Notifications

```
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
CRITICAL: webserver01 - memory_monitor.percent = 96.0
RECOVERED: database01 - disk_monitor./.percent = 75.0 (WARNING -> OK)
REMINDER (CRITICAL): mailserver - cpu_monitor.load_1min = 12.5 (ongoing for 3600s)
```

### Supported Metrics

All plugin metrics can be thresholded:

- **CPU**: cpu_percent, load_1min, load_5min, load_15min
- **Memory**: percent, available_mb, swap_percent
- **Disk**: Per-partition percent, free_gb, free_mb
- **Network**: errors_total, dropped packets, connection counts
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)

See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.

---

## 🌐 HTTP API & Web UI

Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST API and web-based dashboards for monitoring and visualization.

### Features

- **REST API**: JSON endpoints for accessing plugin data, alerts, and host information
- **Live Dashboard**: Real-time WebSocket-powered host status view
- **Plugin Metrics**: Interactive visualization of all plugin data with auto-refresh
- **Alerts Dashboard**: Comprehensive alert monitoring with filtering and summaries
- **CORS Support**: Configurable for integration with external applications

### Web Dashboards

- **Live View** (`/live`): Real-time host connectivity, latency, and messages  
- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins  
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering  

### API Endpoints

```bash
# List all monitored hosts
curl http://localhost:50004/api/0/hosts

# Get all plugin data for a host
curl http://localhost:50004/api/0/hosts/webserver01/plugins

# Get detailed plugin history (last 50 samples)
curl 'http://localhost:50004/api/0/hosts/webserver01/plugins/cpu_monitor?limit=50'

# Get alert states for a specific host
curl http://localhost:50004/api/0/hosts/webserver01/alerts

# Get all active alerts across all hosts
curl http://localhost:50004/api/0/alerts
```

### Integration Examples

**Python Client:**
```python
import requests

# Monitor for critical alerts
response = requests.get('http://localhost:50004/api/0/alerts', timeout=10)
response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
alerts = response.json()

if alerts['summary']['critical'] > 0:
    print(f"⚠️ {alerts['summary']['critical']} CRITICAL alerts!")
    for alert in alerts['alerts']:
        if alert['level'] == 'CRITICAL':
            print(f"  {alert['hostname']}: {alert['metric_path']} = {alert['last_value']}")
```

**Bash Monitoring Script:**
```bash
#!/bin/bash
# Check for critical alerts
# If the API is unreachable the value is empty and is treated as 0 here
CRITICAL=$(curl -sf http://localhost:50004/api/0/alerts | jq '.summary.critical // 0')
if [ "${CRITICAL:-0}" -gt 0 ]; then
    echo "CRITICAL: $CRITICAL critical alerts detected!"
    # Send notification
fi
```

### Demo & Testing

Run the API demo script to test all endpoints:

```bash
python3 scripts/demo_http_api.py
```

See [docs/HTTP_API.md](docs/HTTP_API.md) for complete API documentation including response formats, error handling, and integration examples.

---

## ⚙️ Quickstart

Prerequisites:

- Python 3.11+ (the package metadata declares `Requires-Python: >=3.11`)
- `nsupdate` (for DNS updates) if using dynamic DNS

Install the package (a virtualenv is recommended):

Dependencies are declared in `pyproject.toml`, so there is no `requirements.txt` step. Install with `pip` from a checkout and select the extras you need (`client`, `server`, `all`, or `dev`), for example `pip install ".[server]"`.

`scripts/install.sh` provides a scripted installation.

Run the daemon (example):

```bash
# run with an explicit config file (without -c, hbd looks for ~/.hb.yaml)
hbd -c .hb.yaml -f -v
```

You can also run it directly via the package entrypoint after installation:

```bash
python -m hbd.cli -c /path/to/config.yaml
```

### Running the Client

The heartbeat client (`hbc`) sends periodic heartbeats and plugin data to the server:

```bash
# Basic usage pointing to server
python -m hbd.hbc --server your-server.example.com

# With custom configuration
python -m hbd.hbc --server 192.168.1.100 --port 50003 --interval 30

# Run with a specific plugin disabled
python -m hbd.hbc --server hbd.local --disable-plugin os_info
```

Client configuration can also be specified in YAML:

```yaml
server: hbd.example.com
port: 50003
interval: 30
plugins:
  cpu_monitor:
    interval: 300      # Check every 5 minutes (default)
    per_core: true
  memory_monitor:
    interval: 300      # Check every 5 minutes (default)
  disk_monitor:
    interval: 300      # Check every 5 minutes (default)
  network_monitor:
    interval: 300      # Check every 5 minutes (default)
  nagios_runner:
    interval: 300      # Check every 5 minutes (default)
    commands:
      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```

All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.

---

## 🐞 Debugging in VS Code

This repository includes a ready-to-use `.vscode/launch.json` with configurations to run or attach the VS Code debugger to `hbd`.

- Ensure the **Python** extension is installed and select the project `.venv` as the interpreter (bottom-left of VS Code).
- Use **F5** and pick one of these configurations from the Run view:
  - **Python: Run hbd (module)** — runs `hbd.cli` as a module and sets `PYTHONPATH` to the workspace root (recommended).
  - **Python: Run hbd with debugpy (listen)** — launches `debugpy` and `hbd` together; useful when you want the process to listen for a debugger.
  - **Python: Attach (localhost:5678)** — attach the debugger to a running process started with `debugpy`.

To start `hbd` manually and wait for the debugger to attach, run:

```bash
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.cli -c .hb.yaml -f -v
```

Set breakpoints in modules such as `hbd/udp.py`, `hbd/dns.py`, or `hbd/server.py`, and use the **Attach** configuration to connect. Use `justMyCode: false` if you need to step into third-party code.

---

## 🛠 Configuration

`hbd` reads YAML configuration (optional). If `PyYAML` is not installed, built-in defaults are used. Example configuration keys (see `hbd/config.py`):

- `hb_port`: UDP port to listen for heartbeats (default: 50003)
- `hbd_port`: port for the HTTP API and web UI (default: 50004)
- `hbd_host`: bind address for HTTP/WSS
- `pickfile`: path for persisted state
- `logfile`: path to log file
- `logfmt`: `text` or `msg`
- `pushsrv`: push service (`pushover`|`mattermost`|`all`)
- `interval` / `grace`: heartbeat timing configuration
- `dyndomains`: list of dyndomains to update via `nsupdate`
- `nsupdate_bin`: path to nsupdate binary
- `ws_port`: port for plain WebSocket connections (default: 50005)
- `wss_port`: port for secure WebSocket (WSS) connections (default: none).
  If set, `hbd` will attempt to serve WSS on this port when `wss_pem` and
  `wss_key` SSL files are available under `cert_path` (see below).
- `cert_path`: directory where TLS certificate and key are looked up (default: /usr/local/etc/ssl/)
- `wss_pem`: filename for the certificate chain (default: fullchain.pem)
- `wss_key`: filename for the private key (default: privkey.pem)

Example `.hb.yaml` (minimal):

```yaml
hbd_host: 0.0.0.0
hbd_port: 50004
dyndomains:
  - example.com
nsupdate_bin: /usr/bin/nsupdate
pushsrv: pushover
```
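
How `cert_path`, `wss_pem`, and `wss_key` from the key list above combine can be illustrated with the standard `ssl` module. This is a sketch under the assumption that WSS is simply skipped when either file is missing; `build_wss_context` is a hypothetical helper, not a function exported by `hbd`.

```python
import ssl
from pathlib import Path
from typing import Optional


def build_wss_context(cert_path: str = "/usr/local/etc/ssl/",
                      wss_pem: str = "fullchain.pem",
                      wss_key: str = "privkey.pem") -> Optional[ssl.SSLContext]:
    """Return a server-side TLS context, or None if either file is missing."""
    pem = Path(cert_path) / wss_pem
    key = Path(cert_path) / wss_key
    if not (pem.is_file() and key.is_file()):
        return None  # the caller would then serve plain WebSocket only
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.load_cert_chain(certfile=str(pem), keyfile=str(key))
    return context
```

Returning `None` rather than raising lets a server start without TLS material and enable WSS later once certificates have been issued.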

> Tip: `config.DEFAULTS` in `hbd/config.py` contains the canonical defaults and accepted configuration keys.

---

## 🔧 Architecture & Modules

- `hbd.proto` — serialization/deserialization of heartbeat messages (supports compressed payloads and plugin data)
- `hbd.udp` — UDP parsing and `handle_datagram` implementation (main state machine)
- `hbd.dns` — `create_nsupdate_payload`, `nsupdate`, and an asyncio DNS worker (`start_dns_worker`).
  The DNS worker now runs as an `asyncio` task and the package exposes a
  small thread-safe bridge so legacy synchronous code can `put()` updates
  into the queue; there is no longer a permanently-blocking background
  `threading.Thread`.
- `hbd.notify` — email and push notification helpers
- `hbd.ws` — WebSocket server and thread-safe broadcast helpers
- `hbd.http` — HTTP handler factory for the status UI/API
- `hbd.journal` — message journal with size-based log rotation and backup management
- `hbd.plugin` — plugin framework with base classes, registry, and dynamic loader
- `hbd.plugins/` — built-in plugins (os_info, cpu_monitor, memory_monitor, disk_monitor, network_monitor, filesystem_info, nagios_runner)
- `hbd.hbc` — heartbeat client that sends heartbeats and plugin data to server
- `hbd.utils` — small utility helpers (`shortname`, `dur`, `initlog`)
- `hbd.cli` — CLI entrypoint and argument parsing
- `hbd.server` — async orchestration to run UDP/HTTP/WSS components

This modular layout makes the code easier to test and maintain.
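
As an illustration of the "text or zlib-compressed" handling attributed to `hbd.proto`, a decoder can try to decompress first and fall back to plain text. The sketch below assumes the payload is JSON, which this README does not confirm; `decode_datagram` is a hypothetical name.

```python
import json
import zlib


def decode_datagram(data: bytes) -> dict:
    """Decode a datagram that is either plain text or zlib-compressed text."""
    try:
        data = zlib.decompress(data)
    except zlib.error:
        pass  # not a zlib stream: treat the bytes as plain text
    # Assumption: the text payload is a JSON object.
    return json.loads(data.decode("utf-8"))
```

Attempting decompression first is safe for this assumed format because a JSON object starts with `{`, which is not a valid zlib header, so plain payloads fail fast and fall through.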

**Runtime & Shutdown**

- The main runtime is asyncio-based. Services (UDP listener, HTTP server, WebSocket server, monitor, and DNS worker) run as asyncio tasks.
- On SIGINT/SIGTERM the server triggers a graceful shutdown: it cancels active tasks, signals the DNS worker via a sentinel, and cleans up resources before exit.
- The DNS update worker is implemented as an `asyncio` task; synchronous producers can still enqueue DNS updates via a small thread-safe bridge available at `hbd.hbdclass.Host.dnsQ`.
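
The worker, sentinel, and thread-safe bridge described above follow a common `asyncio` pattern, sketched below. The names (`dns_worker`, `put_from_thread`) are hypothetical and signal handling is omitted; in a real server the sentinel would be enqueued from a SIGINT/SIGTERM handler.

```python
import asyncio
import threading

_SENTINEL = None  # tells the worker to stop


async def dns_worker(queue: asyncio.Queue, handled: list) -> None:
    """Consume updates until the sentinel arrives."""
    while True:
        item = await queue.get()
        if item is _SENTINEL:
            break
        handled.append(item)  # a real worker would run nsupdate here


def put_from_thread(loop: asyncio.AbstractEventLoop, queue: asyncio.Queue, item) -> None:
    """Thread-safe bridge: schedule the put on the event loop's own thread."""
    loop.call_soon_threadsafe(queue.put_nowait, item)


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    handled: list = []
    worker = asyncio.create_task(dns_worker(queue, handled))

    # A synchronous producer running in another thread enqueues one update.
    loop = asyncio.get_running_loop()
    producer = threading.Thread(
        target=put_from_thread, args=(loop, queue, "host1.example.com A 192.0.2.10")
    )
    producer.start()
    await asyncio.to_thread(producer.join)  # wait without blocking the loop

    # Graceful shutdown: enqueue the sentinel, then wait for the worker to drain.
    queue.put_nowait(_SENTINEL)
    await worker
    return handled
```

`asyncio.Queue` is not thread-safe on its own, which is why the producer goes through `call_soon_threadsafe` instead of calling `put_nowait` directly.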

**Templates & Static Files**

- Template files are located under `hbd/templates` by default. The HTTP server resolves templates relative to the `hbd` package, but the path can be overridden with the `templates_dir` config key.
- Static assets (CSS/JS/images) are served from `hbd/static` via the `/static/<path>` HTTP route. Place your static files in that directory or configure the HTTP server as needed.

---

## 🧪 Testing & Dev

Tests are written with `unittest`, and some additional tests use `pytest`. To run them locally with only the dev dependencies installed:

```bash
# with project root on PYTHONPATH
PYTHONPATH=. python -m unittest discover -v
# or with pytest if installed
pytest -q
```

Developer tooling included:

- `pyproject.toml` — project metadata and dependencies
- `tox.ini` — convenience wrappers for running tests, lint, and mypy

To run linters and type checks locally:

```bash
# after installing dev deps
tox -e lint
tox -e mypy
```

---

## 🚀 Running in production

- Use your system service manager (systemd, launchd, etc.) to run `hbd` in the background.
- Ensure `nsupdate` and necessary credentials are available for dynamic DNS updates.
- Configure TLS for WSS if you enable secure websockets.

> Note: A small example for obtaining DNS-verified certs (certbot with RFC2136) is available in the earlier commit history.

---

## 🤝 Contributing

Contributions welcome! Please:

1. Open an issue to discuss larger changes.
2. Create a topic branch and a clear PR.
3. Add tests for new features and run linters.
4. Keep changes focused and documented.

---

## 📜 License

This repository is licensed under the MIT license. See `LICENSE` for details.

