Major refactoring of the codebase, including restructuring of files and directories, renaming of modules and classes, and improvements to the overall organization and readability of the code. This refactoring aims to enhance maintainability, scalability, and clarity of the codebase while preserving existing functionality. The changes include:

- Restructuring of the project directory into client and server components
- Renaming of modules and classes to better reflect their purpose and functionality
- Moving common utilities and configurations to a shared location
- Updating import statements to reflect the new structure
- Adding new documentation files for better clarity on various aspects of the project
- Removing deprecated or unused code to streamline the codebase
- Ensuring that all existing functionality is preserved and that the codebase remains functional after the refactoring
2026-03-29 11:13:40 -04:00
parent 7e2038ecac
commit 0543266c92
65 changed files with 11371 additions and 140 deletions
@@ -11,10 +11,294 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
 - Queue DNS updates via `nsupdate` and run them in a background thread ✅
 - WebSocket API for live updates (hosts & messages) ✅
 - Notification pipeline (email, Pushover, Mattermost, Signal) ✅
+- **HTTP API & Web UI** ✅
+  - REST API for plugin data, alerts, and host information
+  - Live dashboard with WebSocket updates
+  - Interactive plugin metrics visualization
+  - Alerts dashboard with filtering and summaries
+- **Message journal with automatic log rotation** ✅
+  - Logs all received messages in JSON format
+  - Size-based automatic rotation
+  - Configurable retention and backup management
+- **Plugin system for extensible monitoring** ✅
+  - Collect system metrics (CPU, memory, disk, network)
+  - Execute existing Nagios monitoring plugins
+  - Create custom plugins with simple Python classes
+- **Threshold alerting system** ✅
+  - Monitor metrics against configurable WARNING/CRITICAL thresholds
+  - Hysteresis to prevent alert flapping
+  - Automatic notifications on state changes
+  - Re-notification for ongoing alerts
 - Modular codebase suitable for unit testing and CI ✅

 ---

+## 🔌 Plugin System
+
+Heartbeat includes a comprehensive plugin architecture that extends monitoring beyond simple heartbeats. The plugin system allows you to:
+
+- **Collect system information**: OS details, hardware info, system configuration
+- **Monitor resources**: CPU usage, memory, disk space, network statistics
+- **Run Nagios plugins**: Execute thousands of existing Nagios monitoring plugins without modification
+- **Create custom plugins**: Build your own monitoring logic with simple Python classes
+
+### Plugin Types
+
+- **InfoPlugin**: Collects static information once (e.g., OS version, hardware specs)
+- **MonitorPlugin**: Collects metrics periodically (e.g., CPU usage every 30 seconds)
+
+### Built-in Plugins
+
+- `os_info`: Collects OS, kernel, distribution, and architecture information
+- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
+- `memory_monitor`: Monitors RAM and swap usage, available memory
+- `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
+- `network_monitor`: Monitors network interface statistics, bandwidth, and connections
+- `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
+- `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
+
+### Nagios Integration
+
+The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:
+
+- Executes plugins via subprocess with timeout protection
+- Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
+- Extracts performance data with thresholds
+- Reports aggregated status across all configured checks
+
+See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for the complete integration guide, including configuration examples and custom plugin development.
+
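As a rough illustration of the steps listed above (run with a timeout, map the exit code, extract performance data), here is a minimal standalone sketch. It is not the `nagios_runner` implementation: `run_check`, `parse_perfdata`, and the returned dictionary layout are hypothetical names chosen for this example. It assumes the conventional Nagios plugin output format, `TEXT | label=value[UOM];warn;crit;min;max`.

```python
import shlex
import subprocess

# Conventional Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def parse_perfdata(output):
    """Parse perfdata items that follow the first '|' on the first output line."""
    metrics = {}
    first_line = output.splitlines()[0] if output else ""
    if "|" not in first_line:
        return metrics
    # shlex.split keeps single-quoted labels such as '/ used' together.
    for item in shlex.split(first_line.split("|", 1)[1]):
        label, _, fields = item.partition("=")
        if not fields:
            continue
        parts = fields.split(";")
        metrics[label] = {
            "value": parts[0],  # raw value, unit suffix included
            "warn": parts[1] if len(parts) > 1 else "",
            "crit": parts[2] if len(parts) > 2 else "",
        }
    return metrics


def run_check(command, timeout=10.0):
    """Run one check command; report UNKNOWN on timeout or launch failure."""
    try:
        proc = subprocess.run(
            shlex.split(command), capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        return {"state": "UNKNOWN", "output": str(exc), "perfdata": {}}
    return {
        "state": STATES.get(proc.returncode, "UNKNOWN"),
        "output": proc.stdout.strip(),
        "perfdata": parse_perfdata(proc.stdout),
    }
```

A call such as `run_check("/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6")` would return the mapped state together with whatever perfdata the plugin printed. An aggregated status could then be taken as the worst state across all configured commands.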
+### Creating Custom Plugins
+
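+```python
+import shutil
+import time
+
+from hbd.plugin import MonitorPlugin
+
+class DiskMonitorPlugin(MonitorPlugin):
+    name = "disk_monitor"
+    interval = 60  # Run every 60 seconds
+
+    async def collect(self):
+        # Report how full the root filesystem is, as a percentage
+        usage = shutil.disk_usage("/")
+        return {
+            "disk_usage": round(usage.used / usage.total * 100, 1),
+            "timestamp": time.time()
+        }
+```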
+
+Place plugins in `hbd/plugins/` and they'll be automatically discovered and loaded by the client.
+
+---
+
+## 📝 Message Journal
+
+Heartbeat includes a message journal that logs all received messages with automatic rotation.
+
+### Features
+
+- **JSON Format**: All messages logged in JSONL (JSON Lines) format for easy parsing
+- **Automatic Rotation**: Size-based rotation with configurable thresholds
+- **Backup Management**: Keeps configurable number of rotated log files
+- **Non-blocking**: Async logging with minimal performance impact
+
+### Configuration
+
+```yaml
+# Message journal settings
+journal_enabled: true                    # Enable/disable journaling
+journal_dir: /var/log/heartbeat         # Journal directory
+journal_file: messages.journal           # Base filename
+journal_max_size: 104857600             # Max size (100MB default)
+journal_max_backups: 10                 # Number of backups to keep
+```
+
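The settings above map closely onto size-based rotation as provided by Python's standard library. The sketch below shows that behavior with `logging.handlers.RotatingFileHandler`. It is an assumption made for illustration, not a description of how `hbd.journal` is implemented, and `make_journal` and `log_message` are hypothetical helpers.

```python
import json
import logging
import os
import time
from logging.handlers import RotatingFileHandler


def make_journal(journal_dir, journal_file="messages.journal",
                 max_size=104857600, max_backups=10):
    """Return a logger that writes one JSON line per message and rotates by size."""
    os.makedirs(journal_dir, exist_ok=True)
    handler = RotatingFileHandler(
        os.path.join(journal_dir, journal_file),
        maxBytes=max_size,        # rotate before the file would exceed this size
        backupCount=max_backups,  # keep messages.journal.1 .. .N, drop older ones
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger("journal-sketch:" + journal_dir)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep journal lines out of the root logger
    logger.addHandler(handler)
    return logger


def log_message(logger, source_ip, source_port, message):
    """Append one journal entry as a compact JSON line."""
    entry = {
        "timestamp": time.time(),
        "source_ip": source_ip,
        "source_port": source_port,
        "message": message,
    }
    logger.info(json.dumps(entry, separators=(",", ":")))
```

With this approach the active file keeps its base name and older data moves to numbered backups, so `journal_max_size` multiplied by `journal_max_backups + 1` gives an upper bound on disk usage.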
+### Example Journal Entry
+
+```json
+{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver1","interval":30}}
+```
+
+### Analyzing Journal Files
+
+```bash
+# View recent messages
+tail -100 /var/log/heartbeat/messages.journal | jq .
+
+# Count messages by type
+cat /var/log/heartbeat/messages.journal | jq -r '.message.ID' | sort | uniq -c
+
+# Filter by hostname
+cat /var/log/heartbeat/messages.journal | jq 'select(.message.name == "webserver1")'
+```
+
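For scripted analysis without `jq`, the same per-type count can be done in a few lines of Python. This is a small sketch that relies only on the entry layout shown in the example above (a `message` object with an `ID` field); `count_by_type` is a hypothetical helper name.

```python
import json
from collections import Counter


def count_by_type(lines):
    """Count journal entries by message.ID, skipping blank or malformed lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written last line
        message = entry.get("message") if isinstance(entry, dict) else None
        if isinstance(message, dict):
            counts[message.get("ID", "?")] += 1
    return counts
```

Because a file object iterates line by line, `count_by_type(open("/var/log/heartbeat/messages.journal"))` processes the journal without loading it into memory at once.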
+See [docs/MESSAGE_JOURNAL.md](docs/MESSAGE_JOURNAL.md) for complete documentation including rotation behavior, integration with log management systems, and analysis examples.
+
+---
+
+## 🚨 Threshold Alerting
+
+Heartbeat includes a threshold alerting system that monitors plugin metrics and triggers notifications when values cross configured limits.
+
+### Features
+
+- **Multi-level alerts**: WARNING and CRITICAL severity levels
+- **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
+- **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
+- **Smart notifications**: Alerts only on state changes, not every check
+- **Re-notifications**: Periodic reminders for ongoing alerts
+- **Journal integration**: All threshold events logged for audit trail
+
+### Configuration
+
+```yaml
+thresholds:
+  cpu_monitor:
+    cpu_percent:
+      warning: 80.0      # Warn when CPU > 80%
+      critical: 90.0     # Critical when CPU > 90%
+      operator: ">"
+      hysteresis: 0.1    # 10% hysteresis to prevent flapping
+  
+  memory_monitor:
+    percent:
+      warning: 85.0
+      critical: 95.0
+  
+  disk_monitor:
+    partitions:
+      /:
+        percent:
+          warning: 80.0
+          critical: 90.0
+        free_gb:
+          warning: 10.0   # Alert when < 10GB free
+          critical: 5.0
+          operator: "<"   # Inverse threshold
+
+# Global settings
+threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
+```
+
+### Alert Behavior
+
+1. **State Changes**: Notifications sent when crossing thresholds
+   - OK → WARNING: Early notification
+   - WARNING → CRITICAL: Escalation
+   - CRITICAL → OK: Recovery
+
+2. **Hysteresis**: Prevents rapid state transitions
+   ```
+   Critical threshold: 90%
+   Hysteresis: 10%
+   Recovery threshold: 81% (90 - 10% of 90)
+   
+   Value 91% → CRITICAL (threshold crossed)
+   Value 85% → CRITICAL (still above 81%)
+   Value 79% → OK (below recovery threshold)
+   ```
+
+3. **Re-notifications**: Periodic reminders for ongoing alerts
+   - Default: Every 60 minutes
+   - Configurable via `threshold_renotify_interval`
+
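The hysteresis rule described above can be sketched as a small state function. This is an illustrative reading of the documented behavior, not the project's code: `next_state` and its parameters are hypothetical. It covers only the ordering operators, and it assumes the recovery point is `limit * (1 - hysteresis)` for "greater than" checks and `limit * (1 + hysteresis)` for "less than" checks.

```python
import operator

# Ordering operators only; hysteresis has no obvious meaning for == and !=.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def next_state(value, state, critical, warning=None, op=">", hysteresis=0.0):
    """Return "OK", "WARNING", or "CRITICAL" for one new sample."""
    breached = OPS[op]
    # The recovery point sits below the limit for ">" checks, above it for "<".
    factor = 1 - hysteresis if op in (">", ">=") else 1 + hysteresis
    for level, limit in (("CRITICAL", critical), ("WARNING", warning)):
        if limit is None:
            continue
        if breached(value, limit):
            return level
        # Hold the current level until the value passes the recovery point.
        if state == level and breached(value, limit * factor):
            return level
    return "OK"
```

Feeding it the sequence from the example (91, then 85, then 79, with a critical limit of 90 and hysteresis of 0.1) yields CRITICAL, CRITICAL, OK. A notifier would compare each result with the previous state and send a message only when the two differ.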
+### Example Notifications
+
+```
+WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
+CRITICAL: webserver01 - memory_monitor.percent = 96.0
+RECOVERED: database01 - disk_monitor./.percent = 75.0 (WARNING -> OK)
+REMINDER (CRITICAL): mailserver - cpu_monitor.load_1min = 12.5 (ongoing for 3600s)
+```
+
+### Supported Metrics
+
+All plugin metrics can be thresholded:
+
+- **CPU**: cpu_percent, load_1min, load_5min, load_15min
+- **Memory**: percent, available_mb, swap_percent
+- **Disk**: Per-partition percent, free_gb, free_mb
+- **Network**: errors_total, dropped packets, connection counts
+- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
+
+See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.
+
+---
+
+## 🌐 HTTP API & Web UI
+
+Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST API and web-based dashboards for monitoring and visualization.
+
+### Features
+
+- **REST API**: JSON endpoints for accessing plugin data, alerts, and host information
+- **Live Dashboard**: Real-time WebSocket-powered host status view
+- **Plugin Metrics**: Interactive visualization of all plugin data with auto-refresh
+- **Alerts Dashboard**: Comprehensive alert monitoring with filtering and summaries
+- **CORS Support**: Configurable for integration with external applications
+
+### Web Dashboards
+
+- **Live View** (`/live`): Real-time host connectivity, latency, and messages  
+- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins  
+- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering  
+
+### API Endpoints
+
+```bash
+# List all monitored hosts
+curl http://localhost:50004/api/0/hosts
+
+# Get all plugin data for a host
+curl http://localhost:50004/api/0/hosts/webserver01/plugins
+
+# Get detailed plugin history (last 50 samples)
+curl 'http://localhost:50004/api/0/hosts/webserver01/plugins/cpu_monitor?limit=50'
+
+# Get alert states for a specific host
+curl http://localhost:50004/api/0/hosts/webserver01/alerts
+
+# Get all active alerts across all hosts
+curl http://localhost:50004/api/0/alerts
+```
+
+### Integration Examples
+
+**Python Client:**
+```python
+import requests
+
+# Monitor for critical alerts
+response = requests.get('http://localhost:50004/api/0/alerts', timeout=5)
+response.raise_for_status()  # Fail loudly on HTTP errors
+alerts = response.json()
+
+if alerts['summary']['critical'] > 0:
+    print(f"⚠️ {alerts['summary']['critical']} CRITICAL alerts!")
+    for alert in alerts['alerts']:
+        if alert['level'] == 'CRITICAL':
+            print(f"  {alert['hostname']}: {alert['metric_path']} = {alert['last_value']}")
+```
+
+**Bash Monitoring Script:**
+```bash
+#!/bin/bash
+# Check for critical alerts
+CRITICAL=$(curl -sf http://localhost:50004/api/0/alerts | jq '.summary.critical // 0')
+if [ "${CRITICAL:-0}" -gt 0 ]; then
+    echo "CRITICAL: $CRITICAL critical alerts detected!"
+    # Send notification
+fi
+```
+
+### Demo & Testing
+
+Run the API demo script to test all endpoints:
+
+```bash
+python3 scripts/demo_http_api.py
+```
+
+See [docs/HTTP_API.md](docs/HTTP_API.md) for complete API documentation including response formats, error handling, and integration examples.
+
+---
+
 ## ⚙️ Quickstart

 Prerequisites:
@@ -43,6 +327,46 @@ You can also run it directly via the package entrypoint after installation:
 python -m hbd.cli -c /path/to/config.yaml
 ```

+### Running the Client
+
+The heartbeat client (`hbc`) sends periodic heartbeats and plugin data to the server:
+
+```bash
+# Basic usage pointing to server
+python -m hbd.hbc --server your-server.example.com
+
+# With custom configuration
+python -m hbd.hbc --server 192.168.1.100 --port 50003 --interval 30
+
+# Run with specific plugins enabled/disabled
+python -m hbd.hbc --server hbd.local --disable-plugin os_info
+```
+
+Client configuration can also be specified in YAML:
+
+```yaml
+server: hbd.example.com
+port: 50003
+interval: 30
+plugins:
+  cpu_monitor:
+    interval: 300      # Check every 5 minutes (default)
+    per_core: true
+  memory_monitor:
+    interval: 300      # Check every 5 minutes (default)
+  disk_monitor:
+    interval: 300      # Check every 5 minutes (default)
+  network_monitor:
+    interval: 300      # Check every 5 minutes (default)
+  nagios_runner:
+    interval: 300      # Check every 5 minutes (default)
+    commands:
+      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
+      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
+```
+
+All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.
+
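To make the client's role concrete, the sketch below sends a single heartbeat datagram over UDP. It assumes the wire payload is plain UTF-8 JSON with the same fields as the journal's `message` object (`ID`, `name`, `interval`). The real format is defined by `hbd.proto`, which also supports compressed payloads, so treat this as an illustration rather than a compatible client. `send_heartbeat` is a hypothetical helper.

```python
import json
import socket


def send_heartbeat(server, port=50003, name="webserver1", interval=30):
    """Send one heartbeat datagram and return the number of bytes sent."""
    # Assumed payload layout; the actual hbd.proto encoding may differ.
    payload = json.dumps(
        {"ID": "HTB", "name": name, "interval": interval}
    ).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (server, port))
```

UDP is connectionless, so a successful `sendto` only means the datagram left the host. Detecting a missing host is the server's job, which is why the client repeats the message every `interval` seconds.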
 ## 🐞 Debugging in VS Code

 This repository includes a ready-to-use `.vscode/launch.json` with configurations to run or attach the VS Code debugger to `hbd`.
@@ -102,7 +426,7 @@ pushsrv: pushover

 ## 🔧 Architecture & Modules

- `hbd.proto` — serialization/deserialization of heartbeat messages (supports compressed payloads)
+- `hbd.proto` — serialization/deserialization of heartbeat messages (supports compressed payloads and plugin data)
 - `hbd.udp` — UDP parsing and `handle_datagram` implementation (main state machine)
 - `hbd.dns` — `create_nsupdate_payload`, `nsupdate`, and an asyncio DNS worker (`start_dns_worker`).
  The DNS worker now runs as an `asyncio` task and the package exposes a
@@ -112,6 +436,10 @@ pushsrv: pushover
 - `hbd.notify` — email and push notification helpers
 - `hbd.ws` — WebSocket server and thread-safe broadcast helpers
 - `hbd.http` — HTTP handler factory for the status UI/API
+- `hbd.journal` — message journal with size-based log rotation and backup management
+- `hbd.plugin` — plugin framework with base classes, registry, and dynamic loader
+- `hbd.plugins/` — built-in plugins (os_info, cpu_monitor, memory_monitor, disk_monitor, network_monitor, filesystem_info, nagios_runner)
+- `hbd.hbc` — heartbeat client that sends heartbeats and plugin data to server
 - `hbd.utils` — small utility helpers (`shortname`, `dur`, `initlog`)
 - `hbd.cli` — CLI entrypoint and argument parsing
 - `hbd.server` — async orchestration to run UDP/HTTP/WSS components