Files
heartbeat/README.md
T
Andreas Wrede 0543266c92 Major refactoring of the codebase, including restructuring of files and directories, renaming of modules and classes, and improvements to the overall organization and readability of the code. This refactoring aims to enhance maintainability, scalability, and clarity of the codebase while preserving existing functionality. The changes include:
- Restructuring of the project directory into client and server components
- Renaming of modules and classes to better reflect their purpose and functionality
- Moving common utilities and configurations to a shared location
- Updating import statements to reflect the new structure
- Adding new documentation files for better clarity on various aspects of the project
- Removing deprecated or unused code to streamline the codebase
- Ensuring that all existing functionality is preserved and that the codebase remains functional after the refactoring.
2026-03-29 11:13:40 -04:00

521 lines
18 KiB
Markdown

# Heartbeat Daemon (hbd) ✅
A lightweight daemon that listens for UDP heartbeat messages and acts on them: keeps host state, optionally updates DNS records via `nsupdate`, forwards messages to WebSocket clients, and sends notifications (email, Pushover, Mattermost, Signal). It is a refactor of a previously monolithic script into a modular Python package (`hbd`).
---
## 📌 Features
- Receive and parse heartbeat datagrams (text or zlib-compressed) ✅
- Maintain host state and detect up/down transitions ✅
- Queue DNS updates via `nsupdate` and run them in a background thread ✅
- WebSocket API for live updates (hosts & messages) ✅
- Notification pipeline (email, Pushover, Mattermost, Signal) ✅
- **HTTP API & Web UI** ✅
- REST API for plugin data, alerts, and host information
- Live dashboard with WebSocket updates
- Interactive plugin metrics visualization
- Alerts dashboard with filtering and summaries
- **Message journal with automatic log rotation** ✅
- Logs all received messages in JSON format
- Size-based automatic rotation
- Configurable retention and backup management
- **Plugin system for extensible monitoring** ✅
- Collect system metrics (CPU, memory, disk, network)
- Execute existing Nagios monitoring plugins
- Create custom plugins with simple Python classes
- **Threshold alerting system** ✅
- Monitor metrics against configurable WARNING/CRITICAL thresholds
- Hysteresis to prevent alert flapping
- Automatic notifications on state changes
- Re-notification for ongoing alerts
- Modular codebase suitable for unit testing and CI ✅
---
## 🔌 Plugin System
Heartbeat includes a comprehensive plugin architecture that extends monitoring beyond simple heartbeats. The plugin system allows you to:
- **Collect system information**: OS details, hardware info, system configuration
- **Monitor resources**: CPU usage, memory, disk space, network statistics
- **Run Nagios plugins**: Execute thousands of existing Nagios monitoring plugins without modification
- **Create custom plugins**: Build your own monitoring logic with simple Python classes
### Plugin Types
- **InfoPlugin**: Collects static information once (e.g., OS version, hardware specs)
- **MonitorPlugin**: Collects metrics periodically (e.g., CPU usage every 30 seconds)
### Built-in Plugins
- `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
- `memory_monitor`: Monitors RAM and swap usage, available memory
- `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
### Nagios Integration
The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:
- Executes plugins via subprocess with timeout protection
- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
- Extracts performance data with thresholds
- Reports aggregated status across all configured checks
See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for complete integration guide including configuration examples and custom plugin development.
### Creating Custom Plugins
```python
from hbd.plugin import MonitorPlugin
class DiskMonitorPlugin(MonitorPlugin):
name = "disk_monitor"
interval = 60 # Run every 60 seconds
async def collect(self):
return {
"disk_usage": get_disk_usage(),
"timestamp": time.time()
}
```
Place plugins in `hbd/plugins/` and they'll be automatically discovered and loaded by the client.
---
## 📝 Message Journal
Heartbeat includes a message journal that logs all received messages with automatic rotation.
### Features
- **JSON Format**: All messages logged in JSONL (JSON Lines) format for easy parsing
- **Automatic Rotation**: Size-based rotation with configurable thresholds
- **Backup Management**: Keeps configurable number of rotated log files
- **Non-blocking**: Async logging with minimal performance impact
### Configuration
```yaml
# Message journal settings
journal_enabled: true # Enable/disable journaling
journal_dir: /var/log/heartbeat # Journal directory
journal_file: messages.journal # Base filename
journal_max_size: 104857600 # Max size (100MB default)
journal_max_backups: 10 # Number of backups to keep
```
### Example Journal Entry
```json
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver1","interval":30}}
```
### Analyzing Journal Files
```bash
# View recent messages
tail -100 /var/log/heartbeat/messages.journal | jq .
# Count messages by type
cat /var/log/heartbeat/messages.journal | jq -r '.message.ID' | sort | uniq -c
# Filter by hostname
cat /var/log/heartbeat/messages.journal | jq 'select(.message.name == "webserver1")'
```
See [docs/MESSAGE_JOURNAL.md](docs/MESSAGE_JOURNAL.md) for complete documentation including rotation behavior, integration with log management systems, and analysis examples.
---
## 🚨 Threshold Alerting
Heartbeat includes a sophisticated threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured limits.
### Features
- **Multi-level alerts**: WARNING and CRITICAL severity levels
- **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
- **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
- **Smart notifications**: Alerts only on state changes, not every check
- **Re-notifications**: Periodic reminders for ongoing alerts
- **Journal integration**: All threshold events logged for audit trail
### Configuration
```yaml
thresholds:
cpu_monitor:
cpu_percent:
warning: 80.0 # Warn when CPU > 80%
critical: 90.0 # Critical when CPU > 90%
operator: ">"
hysteresis: 0.1 # 10% hysteresis to prevent flapping
memory_monitor:
percent:
warning: 85.0
critical: 95.0
disk_monitor:
partitions:
/:
percent:
warning: 80.0
critical: 90.0
free_gb:
warning: 10.0 # Alert when < 10GB free
critical: 5.0
operator: "<" # Inverse threshold
# Global settings
threshold_renotify_interval: 3600 # Re-notify every hour for ongoing alerts
```
### Alert Behavior
1. **State Changes**: Notifications sent when crossing thresholds
- OK → WARNING: Early notification
- WARNING → CRITICAL: Escalation
- CRITICAL → OK: Recovery
2. **Hysteresis**: Prevents rapid state transitions
```
Critical threshold: 90%
Hysteresis: 10%
Recovery threshold: 81% (90 - 10% of 90)
Value 91% → CRITICAL (threshold crossed)
Value 85% → CRITICAL (still above 81%)
Value 79% → OK (below recovery threshold)
```
3. **Re-notifications**: Periodic reminders for ongoing alerts
- Default: Every 60 minutes
- Configurable via `threshold_renotify_interval`
### Example Notifications
```
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
CRITICAL: webserver01 - memory_monitor.percent = 96.0
RECOVERED: database01 - disk_monitor./.percent = 75.0 (WARNING -> OK)
REMINDER (CRITICAL): mailserver - cpu_monitor.load_1min = 12.5 (ongoing for 3600s)
```
### Supported Metrics
All plugin metrics can be thresholded:
- **CPU**: cpu_percent, load_1min, load_5min, load_15min
- **Memory**: percent, available_mb, swap_percent
- **Disk**: Per-partition percent, free_gb, free_mb
- **Network**: errors_total, dropped packets, connection counts
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.
---
## 🌐 HTTP API & Web UI
Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST API and web-based dashboards for monitoring and visualization.
### Features
- **REST API**: JSON endpoints for accessing plugin data, alerts, and host information
- **Live Dashboard**: Real-time WebSocket-powered host status view
- **Plugin Metrics**: Interactive visualization of all plugin data with auto-refresh
- **Alerts Dashboard**: Comprehensive alert monitoring with filtering and summaries
- **CORS Support**: Configurable for integration with external applications
### Web Dashboards
- **Live View** (`/live`): Real-time host connectivity, latency, and messages
- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering
### API Endpoints
```bash
# List all monitored hosts
curl http://localhost:50004/api/0/hosts
# Get all plugin data for a host
curl http://localhost:50004/api/0/hosts/webserver01/plugins
# Get detailed plugin history (last 50 samples)
curl http://localhost:50004/api/0/hosts/webserver01/plugins/cpu_monitor?limit=50
# Get alert states for a specific host
curl http://localhost:50004/api/0/hosts/webserver01/alerts
# Get all active alerts across all hosts
curl http://localhost:50004/api/0/alerts
```
### Integration Examples
**Python Client:**
```python
import requests
# Monitor for critical alerts
response = requests.get('http://localhost:50004/api/0/alerts')
alerts = response.json()
if alerts['summary']['critical'] > 0:
print(f"⚠️ {alerts['summary']['critical']} CRITICAL alerts!")
for alert in alerts['alerts']:
if alert['level'] == 'CRITICAL':
print(f" {alert['hostname']}: {alert['metric_path']} = {alert['last_value']}")
```
**Bash Monitoring Script:**
```bash
#!/bin/bash
# Check for critical alerts
CRITICAL=$(curl -s http://localhost:50004/api/0/alerts | jq '.summary.critical')
if [ "$CRITICAL" -gt 0 ]; then
echo "CRITICAL: $CRITICAL critical alerts detected!"
# Send notification
fi
```
### Demo & Testing
Run the API demo script to test all endpoints:
```bash
python3 scripts/demo_http_api.py
```
See [docs/HTTP_API.md](docs/HTTP_API.md) for complete API documentation including response formats, error handling, and integration examples.
---
## ⚙️ Quickstart
Prerequisites:
- Python 3.10+ (project uses language features from recent Python)
- `nsupdate` (for DNS updates) if using dynamic DNS
Install dependencies (recommended into a venv):
This project now declares its dependencies in `pyproject.toml`. Instead
of the old `requirements.txt` flow, install the package into a virtualenv
using `pip`:
See `scripts/install.sh` for a way to install.
Run the daemon (example):
```bash
# run with default config lookup (~/.hb.yaml)
hbd -c .hb.yaml -f -v
```
You can also run it directly via the package entrypoint after installation:
```bash
python -m hbd.cli -c /path/to/config.yaml
```
### Running the Client
The heartbeat client (`hbc`) sends periodic heartbeats and plugin data to the server:
```bash
# Basic usage pointing to server
python -m hbd.hbc --server your-server.example.com
# With custom configuration
python -m hbd.hbc --server 192.168.1.100 --port 50003 --interval 30
# Run with specific plugins enabled/disabled
python -m hbd.hbc --server hbd.local --disable-plugin os_info
```
Client configuration can also be specified in YAML:
```yaml
server: hbd.example.com
port: 50003
interval: 30
plugins:
cpu_monitor:
interval: 300 # Check every 5 minutes (default)
per_core: true
memory_monitor:
interval: 300 # Check every 5 minutes (default)
disk_monitor:
interval: 300 # Check every 5 minutes (default)
network_monitor:
interval: 300 # Check every 5 minutes (default)
nagios_runner:
interval: 300 # Check every 5 minutes (default)
commands:
- /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
- /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```
All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
## 🐞 Debugging in VS Code
This repository includes a ready-to-use `.vscode/launch.json` with configurations to run or attach the VS Code debugger to `hbd`.
- Ensure the **Python** extension is installed and select the project `.venv` as the interpreter (bottom-left of VS Code).
- Use **F5** and pick one of these configurations from the Run view:
- **Python: Run hbd (module)** — runs `hbd.cli` as a module and sets `PYTHONPATH` to the workspace root (recommended).
- **Python: Run hbd with debugpy (listen)** — launches `debugpy` and `hbd` together; useful when you want the process to listen for a debugger.
- **Python: Attach (localhost:5678)** — attach the debugger to a running process started with `debugpy`.
To start `hbd` manually and wait for the debugger to attach, run:
```bash
PYTHONPATH=. python -m debugpy --listen 5678 --wait-for-client -m hbd.cli -c .hb.yaml -f -v
```
Set breakpoints in modules such as `hbd/udp.py`, `hbd/dns.py`, or `hbd/server.py`, and use the **Attach** configuration to connect. Use `justMyCode: false` if you need to step into third-party code.
---
## 🛠 Configuration
`hbd` reads YAML configuration (optional). If `PyYAML` is not installed, built-in defaults are used. Example configuration keys (see `hbd/config.py`):
- `hb_port`: UDP port to listen for heartbeats (default: 50003)
- `hbd_port`: internal control port (default: 50004)
- `hbd_host`: bind address for HTTP/WSS
- `pickfile`: path for persisted state
- `logfile`: path to log file
- `logfmt`: `text` or `msg`
- `pushsrv`: push service (`pushover`|`mattermost`|`all`)
- `interval` / `grace`: heartbeat timing configuration
- `dyndomains`: list of dyndomains to update via `nsupdate`
- `nsupdate_bin`: path to nsupdate binary
- `ws_port`: port for plain WebSocket connections (default: 50005)
- `wss_port`: port for secure WebSocket (WSS) connections (default: none).
If set, `hbd` will attempt to serve WSS on this port when `wss_pem` and
`wss_key` SSL files are available under `cert_path` (see below).
- `cert_path`: directory where TLS certificate and key are looked up (default: /usr/local/etc/ssl/)
- `wss_pem`: filename for the certificate chain (default: fullchain.pem)
- `wss_key`: filename for the private key (default: privkey.pem)
Example `.hb.yaml` (minimal):
```yaml
hbd_host: 0.0.0.0
hbd_port: 50004
dyndomains:
- example.com
nsupdate_bin: /usr/bin/nsupdate
pushsrv: pushover
```
> Tip: `config.DEFAULTS` in `hbd/config.py` contains the canonical defaults and accepted configuration keys.
---
## 🔧 Architecture & Modules
- `hbd.proto` — serialization/deserialization of heartbeat messages (supports compressed payloads and plugin data)
- `hbd.udp` — UDP parsing and `handle_datagram` implementation (main state machine)
- `hbd.dns` — `create_nsupdate_payload`, `nsupdate`, and an asyncio DNS worker (`start_dns_worker`).
The DNS worker now runs as an `asyncio` task and the package exposes a
small thread-safe bridge so legacy synchronous code can `put()` updates
into the queue; there is no longer a permanently-blocking background
`threading.Thread`.
- `hbd.notify` — email and push notification helpers
- `hbd.ws` — WebSocket server and thread-safe broadcast helpers
- `hbd.http` — HTTP handler factory for the status UI/API
- `hbd.journal` — message journal with size-based log rotation and backup management
- `hbd.plugin` — plugin framework with base classes, registry, and dynamic loader
- `hbd.plugins/` — built-in plugins (os_info, cpu_monitor, memory_monitor, disk_monitor, network_monitor, filesystem_info, nagios_runner)
- `hbd.hbc` — heartbeat client that sends heartbeats and plugin data to server
- `hbd.utils` — small utility helpers (`shortname`, `dur`, `initlog`)
- `hbd.cli` — CLI entrypoint and argument parsing
- `hbd.server` — async orchestration to run UDP/HTTP/WSS components
This modular layout makes the code easier to test and maintain.
**Runtime & Shutdown**
- The main runtime is asyncio-based. Services (UDP listener, HTTP server, WebSocket server, monitor, and DNS worker) run as asyncio tasks.
- On SIGINT/SIGTERM the server triggers a graceful shutdown: it cancels active tasks, signals the DNS worker via a sentinel, and cleans up resources before exit.
- The DNS update worker is implemented as an `asyncio` task; synchronous producers can still enqueue DNS updates via a small thread-safe bridge available at `hbd.hbdclass.Host.dnsQ`.
**Templates & Static Files**
- Template files are located under `hbd/templates` by default. The HTTP server resolves templates relative to the `hbd` package but the path can be overridden with the `templates_dir` config key.
- Static assets (CSS/JS/images) are served from `hbd/static` via the `/static/<path>` HTTP route. Place your static files in that directory or configure the HTTP server as needed.
---
## 🧪 Testing & Dev
Tests are implemented using `unittest` and additional tests rely on `pytest` if you prefer. To run tests locally without installing anything beyond the dev requirements:
```bash
# with project root on PYTHONPATH
PYTHONPATH=. python -m unittest discover -v
# or with pytest if installed
pytest -q
```
Developer tooling included:
- `pyproject.toml` — project metadata and dependencies
- `tox.ini` — convenience wrappers for running tests, lint, and mypy
To run linters and type checks locally:
```bash
# after installing dev deps
tox -e lint
tox -e mypy
```
---
## 🚀 Running in production
- Use your system service manager (systemd, launchd, etc.) to run `hbd` in the background.
- Ensure `nsupdate` and necessary credentials are available for dynamic DNS updates.
- Configure TLS for WSS if you enable secure websockets.
> Note: The project contains a small example for obtaining DNS-verified certs (certbot with RFC2136) — see earlier commit history or ask me to re-add the example to this README if you want it documented here.
---
## 🤝 Contributing
Contributions welcome! Please:
1. Open an issue to discuss larger changes.
2. Create a topic branch and a clear PR.
3. Add tests for new features and run linters.
4. Keep changes focused and documented.
---
## 📜 License
This repository is licensed under the MIT license. See `LICENSE` for details.
---
If you'd like, I can also:
- add a **GitHub Actions** workflow that runs tests and lint on push/PR 🔁
- add a `CONTRIBUTING.md` template for PRs and code style 💬
Which one should I do next? ✨