Major refactoring of the codebase, including restructuring of files and directories, renaming of modules and classes, and improvements to the overall organization and readability of the code. This refactoring aims to enhance maintainability, scalability, and clarity of the codebase while preserving existing functionality. The changes include:
- Restructuring of the project directory into client and server components
- Renaming of modules and classes to better reflect their purpose and functionality
- Moving common utilities and configurations to a shared location
- Updating import statements to reflect the new structure
- Adding new documentation files for better clarity on various aspects of the project
- Removing deprecated or unused code to streamline the codebase
- Ensuring that all existing functionality is preserved and that the codebase remains functional after the refactoring.
@@ -0,0 +1,532 @@
|
||||
# HTTP API and Web UI Documentation
|
||||
|
||||
## Overview
|
||||
|
||||
The Heartbeat Daemon provides a comprehensive HTTP API and web-based UI for monitoring plugin data and alert states. The API follows RESTful conventions and returns JSON responses.
|
||||
|
||||
## Base URL
|
||||
|
||||
All API endpoints are relative to the server base URL:
|
||||
```
|
||||
http://your-server:50004
|
||||
```
|
||||
|
||||
The default port is `50004` (configurable via `hbd_port` in the configuration file).
|
||||
|
||||
---
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### Host Management
|
||||
|
||||
#### GET /api/0/hosts
|
||||
Get list of all monitored hosts with their state information.
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
[
|
||||
{
|
||||
"name": "webserver01",
|
||||
"dyn": false,
|
||||
"ver": 6,
|
||||
"connections": [...]
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
#### GET /api/0/messages
|
||||
Get recent heartbeat messages (last 30).
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
[
|
||||
{
|
||||
"time": 1711234567.123,
|
||||
"host": "webserver01",
|
||||
"msg": "heartbeat received"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Plugin Data Endpoints
|
||||
|
||||
#### GET /api/0/hosts/{hostname}/plugins
|
||||
Get all plugin data for a specific host.
|
||||
|
||||
**Parameters:**
|
||||
- `hostname` (path): Name of the host
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"hostname": "webserver01",
|
||||
"plugins": {
|
||||
"cpu_monitor": {
|
||||
"timestamp": 1711234567.123,
|
||||
"data": {
|
||||
"cpu_percent": 45.2,
|
||||
"load_1min": 2.5,
|
||||
"load_5min": 2.1,
|
||||
"load_15min": 1.8
|
||||
},
|
||||
"sample_count": 100
|
||||
},
|
||||
"memory_monitor": {
|
||||
"timestamp": 1711234568.456,
|
||||
"data": {
|
||||
"percent": 65.4,
|
||||
"available_mb": 4096,
|
||||
"total_mb": 16384
|
||||
},
|
||||
"sample_count": 100
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Example:**
|
||||
```bash
|
||||
curl http://localhost:50004/api/0/hosts/webserver01/plugins
|
||||
```
|
||||
|
||||
#### GET /api/0/hosts/{hostname}/plugins/{plugin_name}
|
||||
Get detailed historical data for a specific plugin.
|
||||
|
||||
**Parameters:**
|
||||
- `hostname` (path): Name of the host
|
||||
- `plugin_name` (path): Name of the plugin
|
||||
- `limit` (query, optional): Number of recent samples to return (default: 10)
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"hostname": "webserver01",
|
||||
"plugin": "cpu_monitor",
|
||||
"samples": [
|
||||
{
|
||||
"timestamp": 1711234567.123,
|
||||
"data": {
|
||||
"cpu_percent": 45.2,
|
||||
"load_1min": 2.5
|
||||
}
|
||||
},
|
||||
{
|
||||
"timestamp": 1711234267.123,
|
||||
"data": {
|
||||
"cpu_percent": 42.1,
|
||||
"load_1min": 2.3
|
||||
}
|
||||
}
|
||||
],
|
||||
"sample_count": 2
|
||||
}
|
||||
```
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# Get last 1 sample (most recent)
|
||||
curl "http://localhost:50004/api/0/hosts/webserver01/plugins/cpu_monitor?limit=1"
|
||||
|
||||
# Get last 50 samples
|
||||
curl "http://localhost:50004/api/0/hosts/webserver01/plugins/memory_monitor?limit=50"
|
||||
|
||||
# Get disk monitor data
|
||||
curl http://localhost:50004/api/0/hosts/database01/plugins/disk_monitor
|
||||
```
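
The `samples` array returned by this endpoint can be summarized on the client side. The following is a minimal sketch that assumes the response shape shown above; `summarize_samples` is a hypothetical helper name and is not part of the Heartbeat API.

```python
from statistics import mean


def summarize_samples(response: dict, metric: str) -> dict:
    """Summarize one metric across the `samples` of a plugin-history response.

    Assumes the response shape documented above; samples that lack the
    metric are skipped.
    """
    values = [
        sample["data"][metric]
        for sample in response.get("samples", [])
        if metric in sample.get("data", {})
    ]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Sample data copied from the response example above
example = {
    "hostname": "webserver01",
    "plugin": "cpu_monitor",
    "samples": [
        {"timestamp": 1711234567.123, "data": {"cpu_percent": 45.2, "load_1min": 2.5}},
        {"timestamp": 1711234267.123, "data": {"cpu_percent": 42.1, "load_1min": 2.3}},
    ],
    "sample_count": 2,
}
print(summarize_samples(example, "cpu_percent"))
```

In practice the `response` dictionary would come from `requests.get(...).json()` against the endpoint above.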
|
||||
|
||||
---
|
||||
|
||||
### Alert Endpoints
|
||||
|
||||
#### GET /api/0/hosts/{hostname}/alerts
|
||||
Get alert states for a specific host.
|
||||
|
||||
**Parameters:**
|
||||
- `hostname` (path): Name of the host
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"hostname": "webserver01",
|
||||
"alerts": [
|
||||
{
|
||||
"metric_path": "cpu_monitor.cpu_percent",
|
||||
"level": "WARNING",
|
||||
"since": 1711234000.0,
|
||||
"last_value": 85.5,
|
||||
"last_check": 1711234567.123,
|
||||
"notification_count": 2
|
||||
},
|
||||
{
|
||||
"metric_path": "disk_monitor./.percent",
|
||||
"level": "OK",
|
||||
"since": 1711230000.0,
|
||||
"last_value": 65.0,
|
||||
"last_check": 1711234567.123,
|
||||
"notification_count": 0
|
||||
}
|
||||
],
|
||||
"summary": {
|
||||
"ok": 15,
|
||||
"warning": 1,
|
||||
"critical": 0,
|
||||
"unknown": 0
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Example:**
|
||||
```bash
|
||||
curl http://localhost:50004/api/0/hosts/webserver01/alerts
|
||||
```
|
||||
|
||||
#### GET /api/0/alerts
|
||||
Get all active alerts across all monitored hosts.
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"alerts": [
|
||||
{
|
||||
"hostname": "webserver01",
|
||||
"metric_path": "cpu_monitor.cpu_percent",
|
||||
"level": "CRITICAL",
|
||||
"since": 1711234000.0,
|
||||
"last_value": 95.5,
|
||||
"last_check": 1711234567.123,
|
||||
"notification_count": 3
|
||||
},
|
||||
{
|
||||
"hostname": "database01",
|
||||
"metric_path": "memory_monitor.percent",
|
||||
"level": "WARNING",
|
||||
"since": 1711233000.0,
|
||||
"last_value": 88.2,
|
||||
"last_check": 1711234567.123,
|
||||
"notification_count": 1
|
||||
}
|
||||
],
|
||||
"summary": {
|
||||
"critical": 1,
|
||||
"warning": 1,
|
||||
"unknown": 0,
|
||||
"total": 2
|
||||
},
|
||||
"host_count": 5
|
||||
}
|
||||
```
|
||||
|
||||
**Example:**
|
||||
```bash
|
||||
curl http://localhost:50004/api/0/alerts | jq .
|
||||
```
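
A client can filter the returned alerts by severity and derive how long each has been active. The sketch below assumes the response shape shown above and assumes that duration can be approximated as `last_check - since`; `active_alerts` is a hypothetical helper name.

```python
def active_alerts(response: dict, levels=("CRITICAL", "WARNING")) -> list:
    """Return alerts at the given levels, longest-running first.

    Each returned alert gains a `duration_seconds` field, computed here as
    last_check - since (an assumption about how duration is defined).
    """
    selected = []
    for alert in response.get("alerts", []):
        if alert["level"] in levels:
            duration = alert["last_check"] - alert["since"]
            selected.append({**alert, "duration_seconds": duration})
    return sorted(selected, key=lambda a: a["duration_seconds"], reverse=True)


# Trimmed copy of the response example above
example = {
    "alerts": [
        {"hostname": "webserver01", "metric_path": "cpu_monitor.cpu_percent",
         "level": "CRITICAL", "since": 1711234000.0, "last_check": 1711234567.123},
        {"hostname": "database01", "metric_path": "memory_monitor.percent",
         "level": "WARNING", "since": 1711233000.0, "last_check": 1711234567.123},
    ]
}
for alert in active_alerts(example):
    print(alert["hostname"], alert["level"], round(alert["duration_seconds"]))
```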
|
||||
|
||||
---
|
||||
|
||||
## Web UI Pages
|
||||
|
||||
### Live Dashboard
|
||||
**URL:** `/live`
|
||||
|
||||
Real-time dashboard showing:
|
||||
- Host connection states
|
||||
- IPv4/IPv6 connectivity
|
||||
- Latency metrics
|
||||
- Recent messages
|
||||
|
||||
**Features:**
|
||||
- WebSocket-powered live updates
|
||||
- Sortable columns
|
||||
- Color-coded status indicators
|
||||
|
||||
### Plugin Metrics
|
||||
**URL:** `/plugins`
|
||||
|
||||
Interactive visualization of plugin metrics:
|
||||
- Select host and plugin from dropdown
|
||||
- View current metric values
|
||||
- Automatic refresh every 30 seconds
|
||||
- Support for nested metrics (e.g., per-partition disk stats)
|
||||
|
||||
**Features:**
|
||||
- Card-based metric display
|
||||
- Unit formatting (%, MB, GB)
|
||||
- Nested object visualization
|
||||
- Auto-refresh
|
||||
|
||||
**Available data:**
|
||||
- CPU usage, load average, frequency
|
||||
- Memory usage, available memory, swap
|
||||
- Disk usage per partition, I/O statistics
|
||||
- Network interface statistics, connection counts
|
||||
- Custom plugin data
|
||||
|
||||
### Alerts Dashboard
|
||||
**URL:** `/alerts`
|
||||
|
||||
Comprehensive alert monitoring:
|
||||
- Summary cards (Critical, Warning, Total Hosts)
|
||||
- Filter by severity (All, Critical, Warning)
|
||||
- Alert details with duration
|
||||
- Auto-refresh every 15 seconds
|
||||
|
||||
**Features:**
|
||||
- Color-coded alert levels
|
||||
- Duration tracking
|
||||
- Filterable list
|
||||
- Real-time updates
|
||||
- Summary statistics
|
||||
|
||||
---
|
||||
|
||||
## Integration Examples
|
||||
|
||||
### Monitoring Script
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Check for critical alerts and send notification
|
||||
|
||||
RESPONSE=$(curl -s http://localhost:50004/api/0/alerts)
|
||||
CRITICAL_COUNT=$(echo "$RESPONSE" | jq '.summary.critical')
|
||||
|
||||
if [ "$CRITICAL_COUNT" -gt 0 ]; then
|
||||
echo "CRITICAL: $CRITICAL_COUNT critical alerts detected!"
|
||||
echo "$RESPONSE" | jq '.alerts[] | select(.level=="CRITICAL")'
|
||||
# Send notification
|
||||
# echo "$RESPONSE" | jq '.alerts[] | select(.level=="CRITICAL")' | mail -s "Critical Alerts" admin@example.com
|
||||
fi
|
||||
```
|
||||
|
||||
### Python Client
|
||||
|
||||
```python
|
||||
import requests

# Get all plugin data for a host
response = requests.get('http://localhost:50004/api/0/hosts/webserver01/plugins', timeout=10)
response.raise_for_status()
data = response.json()

print(f"Host: {data['hostname']}")
print(f"Plugins: {', '.join(data['plugins'].keys())}")

for plugin, info in data['plugins'].items():
    print(f"\n{plugin}:")
    for metric, value in info['data'].items():
        print(f"  {metric}: {value}")

# Check for alerts
response = requests.get('http://localhost:50004/api/0/alerts', timeout=10)
response.raise_for_status()
alerts = response.json()

if alerts['summary']['critical'] > 0:
    print(f"\n⚠️  {alerts['summary']['critical']} CRITICAL ALERTS!")
    for alert in alerts['alerts']:
        if alert['level'] == 'CRITICAL':
            print(f"  - {alert['hostname']}: {alert['metric_path']} = {alert['last_value']}")
|
||||
```
|
||||
|
||||
### Grafana Integration
|
||||
|
||||
The API endpoints can be used with Grafana's JSON datasource plugin:
|
||||
|
||||
1. Install the SimpleJSON datasource plugin
|
||||
2. Configure datasource URL: `http://your-server:50004`
|
||||
3. Create queries:
   - Metrics: `/api/0/hosts/webserver01/plugins/cpu_monitor?limit=100`
   - Alerts: `/api/0/alerts`
|
||||
|
||||
### Prometheus Integration
|
||||
|
||||
Native export in Prometheus format is a future enhancement. Until then, a small exporter can poll the API and republish the values:
|
||||
|
||||
```python
|
||||
# Example prometheus exporter
from prometheus_client import Gauge, generate_latest
import requests

cpu_usage = Gauge('heartbeat_cpu_percent', 'CPU usage percentage', ['hostname'])
memory_usage = Gauge('heartbeat_memory_percent', 'Memory usage percentage', ['hostname'])

def collect_metrics():
    hosts = requests.get('http://localhost:50004/api/0/hosts').json()
    for host in hosts:
        hostname = host['name']
        plugins = requests.get(f'http://localhost:50004/api/0/hosts/{hostname}/plugins').json()

        if 'cpu_monitor' in plugins['plugins']:
            cpu_data = plugins['plugins']['cpu_monitor']['data']
            cpu_usage.labels(hostname=hostname).set(cpu_data.get('cpu_percent', 0))

        if 'memory_monitor' in plugins['plugins']:
            mem_data = plugins['plugins']['memory_monitor']['data']
            memory_usage.labels(hostname=hostname).set(mem_data.get('percent', 0))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Response Formats
|
||||
|
||||
### Success Response
|
||||
All successful API calls return HTTP 200 with JSON body:
|
||||
```json
|
||||
{
|
||||
"field": "value",
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
### Error Response
|
||||
API errors return appropriate HTTP status codes with JSON:
|
||||
```json
|
||||
{
|
||||
"error": "Host 'unknown-host' not found"
|
||||
}
|
||||
```
|
||||
|
||||
**Common Status Codes:**
|
||||
- `200 OK` - Success
|
||||
- `400 Bad Request` - Invalid parameters
|
||||
- `404 Not Found` - Resource not found
|
||||
- `500 Internal Server Error` - Server error
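
Clients can turn these status codes into exceptions. The following is a minimal sketch, assuming error bodies carry an `error` field as shown above; `ApiError` and `parse_response` are hypothetical names used for illustration only.

```python
import json


class ApiError(Exception):
    """Raised for non-200 responses; carries the HTTP status and server message."""

    def __init__(self, status: int, message: str):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message


def parse_response(status: int, body: str):
    """Return decoded JSON for a 200 response, otherwise raise ApiError."""
    if status == 200:
        return json.loads(body)
    try:
        message = json.loads(body).get("error", body)
    except (ValueError, AttributeError):
        message = body  # body was not a JSON object
    raise ApiError(status, message)


try:
    parse_response(404, '{"error": "Host \'unknown-host\' not found"}')
except ApiError as exc:
    print(exc.status, exc.message)
```

With `requests`, the two arguments would be `response.status_code` and `response.text`.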
|
||||
|
||||
---
|
||||
|
||||
## WebSocket API
|
||||
|
||||
For real-time updates, connect to the WebSocket endpoint:
|
||||
|
||||
**URL:** `ws://your-server:50005/hbd` (or `wss://` on the configured `wss_port` for secure connections)
|
||||
|
||||
**Messages:**
|
||||
```json
|
||||
{
|
||||
"type": "host",
|
||||
"data": {
|
||||
"name": "webserver01",
|
||||
"state": "UP"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "plugin",
|
||||
"data": {
|
||||
"host": "webserver01",
|
||||
"plugin": "cpu_monitor",
|
||||
"data": {...},
|
||||
"timestamp": 1711234567.123
|
||||
}
|
||||
}
|
||||
```
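
A client typically routes each incoming frame by its `type` field. The sketch below shows one way to do that, assuming only the two message types shown above exist (others may be sent); `dispatch`, `on_host`, and `on_plugin` are hypothetical names, and the frames themselves would come from whichever WebSocket client library you use.

```python
import json


def dispatch(raw: str, handlers: dict) -> bool:
    """Decode one WebSocket frame and pass its `data` to the handler for its `type`.

    Returns True if a handler ran, False for unknown message types.
    """
    message = json.loads(raw)
    handler = handlers.get(message.get("type"))
    if handler is None:
        return False
    handler(message.get("data", {}))
    return True


host_states = {}
latest_plugin_data = {}


def on_host(data: dict) -> None:
    host_states[data["name"]] = data["state"]


def on_plugin(data: dict) -> None:
    latest_plugin_data[(data["host"], data["plugin"])] = data["data"]


HANDLERS = {"host": on_host, "plugin": on_plugin}

dispatch('{"type": "host", "data": {"name": "webserver01", "state": "UP"}}', HANDLERS)
print(host_states)
```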
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
### Enable HTTP Server
|
||||
|
||||
```yaml
|
||||
# In your hbd configuration file
|
||||
hbd_host: "" # Listen on all interfaces
|
||||
hbd_port: 50004 # HTTP port
|
||||
ws_port: 50005 # WebSocket port (optional)
|
||||
# wss_port: 50006 # Secure WebSocket (requires SSL)
|
||||
```
|
||||
|
||||
### SSL/TLS Configuration
|
||||
|
||||
For secure WebSocket connections:
|
||||
|
||||
```yaml
|
||||
wss_port: 50006
|
||||
cert_path: /etc/heartbeat/certs/
|
||||
wss_pem: server.pem
|
||||
wss_key: server.key
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Rate Limiting
|
||||
|
||||
The API currently does not implement rate limiting. For production use, consider:
|
||||
|
||||
- Placing behind a reverse proxy (nginx, Apache)
|
||||
- Using API gateway for rate limiting
|
||||
- Implementing caching for frequently accessed endpoints
|
||||
|
||||
---
|
||||
|
||||
## CORS Support
|
||||
|
||||
By default, CORS is not enabled. To enable for web applications:
|
||||
|
||||
```python
|
||||
# In http.py, add CORS middleware
import aiohttp_cors
from aiohttp import web

app = web.Application()
cors = aiohttp_cors.setup(app)

# Configure CORS for all routes (routes must be registered on `app` before this loop)
for route in list(app.router.routes()):
    cors.add(route, {
        "*": aiohttp_cors.ResourceOptions(
            allow_credentials=True,
            expose_headers="*",
            allow_headers="*",
        )
    })
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Caching
|
||||
- Plugin data is cached in memory (last 100 samples per plugin)
|
||||
- No database queries required
|
||||
- Responses are fast (<10ms typical)
|
||||
|
||||
### Scalability
|
||||
- Each host stores its own data independently
|
||||
- Memory usage: ~1KB per host + ~1KB per plugin sample
|
||||
- For 100 hosts with 5 plugins: ~50MB memory
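
The ~50MB figure follows from the per-host and per-sample sizes above combined with the 100-sample cache per plugin. A small sketch of the arithmetic, treating the ~1KB figures as rough assumptions rather than measured values (`estimate_memory_bytes` is a hypothetical helper):

```python
def estimate_memory_bytes(hosts: int, plugins_per_host: int,
                          samples_per_plugin: int = 100,
                          bytes_per_host: int = 1024,
                          bytes_per_sample: int = 1024) -> int:
    """Rough memory estimate: fixed per-host cost plus cached plugin samples."""
    per_host = bytes_per_host + plugins_per_host * samples_per_plugin * bytes_per_sample
    return hosts * per_host


total = estimate_memory_bytes(hosts=100, plugins_per_host=5)
print(f"{total / 1024**2:.1f} MiB")  # close to the ~50MB quoted above
```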
|
||||
|
||||
### Best Practices
|
||||
1. Use `limit` parameter to control response size
|
||||
2. Cache responses on client side when appropriate
|
||||
3. Use WebSocket for real-time updates instead of polling
|
||||
4. Consider pagination for large deployments (future enhancement)
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### API Returns 404
|
||||
- Verify hostname in URL matches actual host name
|
||||
- Check host is sending heartbeats: `curl http://localhost:50004/api/0/hosts`
|
||||
|
||||
### No Plugin Data
|
||||
- Verify client is configured with plugins
|
||||
- Check client logs for plugin errors
|
||||
- Ensure plugins are sending data (check journal logs)
|
||||
|
||||
### Empty Alerts
|
||||
- Verify thresholds are configured
|
||||
- Check host is in `watchhosts` list
|
||||
- Ensure plugins are collecting metrics
|
||||
- Review server logs for threshold checker errors
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
|
||||
- [Threshold Alerting Documentation](THRESHOLD_ALERTING.md)
|
||||
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
|
||||
- Configuration examples: `hbd/config_example.yaml`
|
||||
@@ -0,0 +1,413 @@
|
||||
# Message Journal
|
||||
|
||||
The message journal provides persistent logging of all received heartbeat messages with automatic size-based log rotation.
|
||||
|
||||
## Overview
|
||||
|
||||
The journal logs every message received by the heartbeat daemon (hbd) in JSON format, making it easy to:
|
||||
- Audit message history
|
||||
- Debug connection issues
|
||||
- Analyze traffic patterns
|
||||
- Replay messages for testing
|
||||
- Create historical reports
|
||||
|
||||
## Features
|
||||
|
||||
- **JSON Format**: Each message is logged as a single JSON line for easy parsing
|
||||
- **Size-Based Rotation**: Automatically rotates logs when size threshold is reached
|
||||
- **Automatic Cleanup**: Keeps only a configurable number of backup files
|
||||
- **Thread-Safe**: Safe for concurrent access from multiple async tasks
|
||||
- **Configurable**: All settings controllable via configuration file
|
||||
- **Performance**: Non-blocking async operation with minimal overhead
|
||||
|
||||
## Configuration
|
||||
|
||||
Add these settings to your hbd configuration file (e.g., `.hb.yaml`):
|
||||
|
||||
```yaml
|
||||
# Message journal configuration
|
||||
journal_enabled: true # Enable/disable journaling
|
||||
journal_dir: /var/log/heartbeat # Directory for journal files
|
||||
journal_file: messages.journal # Base filename
|
||||
journal_max_size: 104857600 # Max size in bytes (100MB default)
|
||||
journal_max_backups: 10 # Number of backup files to keep
|
||||
```
|
||||
|
||||
### Configuration Options
|
||||
|
||||
| Option | Default | Description |
|
||||
|--------|---------|-------------|
|
||||
| `journal_enabled` | `true` | Enable or disable message journaling |
|
||||
| `journal_dir` | `/var/log/heartbeat` | Directory where journal files are stored |
|
||||
| `journal_file` | `messages.journal` | Base filename for the journal |
|
||||
| `journal_max_size` | `104857600` (100MB) | Maximum file size before rotation |
|
||||
| `journal_max_backups` | `10` | Number of rotated backup files to keep |
|
||||
|
||||
## File Format
|
||||
|
||||
Messages are logged in JSONL (JSON Lines) format - one JSON object per line:
|
||||
|
||||
```json
|
||||
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver1","interval":30,"ver":1}}
|
||||
{"timestamp":1711234597.456,"datetime":"2026-03-28T12:35:37","source_ip":"192.168.1.101","source_port":50003,"message":{"ID":"PLG","plugin":"cpu_monitor","cpu_percent":45.2,"load_1min":1.5}}
|
||||
```
|
||||
|
||||
### Entry Structure
|
||||
|
||||
Each journal entry contains:
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `timestamp` | float | Unix timestamp (seconds since epoch) |
|
||||
| `datetime` | string | ISO 8601 formatted datetime |
|
||||
| `source_ip` | string | Source IP address |
|
||||
| `source_port` | integer | Source UDP port |
|
||||
| `message` | object | Complete parsed message dictionary |
|
||||
|
||||
## Log Rotation
|
||||
|
||||
### How Rotation Works
|
||||
|
||||
1. Journal writes messages to the current file
|
||||
2. When writing the next message would push the file size past `journal_max_size`, rotation is triggered
|
||||
3. Current file is renamed with timestamp: `messages.journal.YYYYMMDD-HHMMSS`
|
||||
4. New empty file is created as the current journal
|
||||
5. Old backup files exceeding `journal_max_backups` are deleted
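
The rotation steps above can be sketched as three small pure functions. This is an illustration under stated assumptions, not the actual `hbd` implementation: the function names are hypothetical, and the size check follows the "next message would exceed the limit" rule described under Rotation Behavior.

```python
from datetime import datetime


def needs_rotation(current_size: int, next_entry_size: int, max_size: int) -> bool:
    """True when appending the next entry would exceed the size limit."""
    return current_size + next_entry_size > max_size


def rotated_name(base: str, when: datetime) -> str:
    """Build the backup filename: <base>.YYYYMMDD-HHMMSS."""
    return f"{base}.{when.strftime('%Y%m%d-%H%M%S')}"


def backups_to_delete(backup_names: list, max_backups: int) -> list:
    """Return the oldest backups that exceed max_backups.

    The timestamp suffix sorts lexicographically in chronological order.
    """
    ordered = sorted(backup_names)
    excess = len(ordered) - max_backups
    return ordered[:excess] if excess > 0 else []


print(rotated_name("messages.journal", datetime(2026, 3, 28, 12, 0, 0)))
```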
|
||||
|
||||
### Example File Structure
|
||||
|
||||
```
|
||||
/var/log/heartbeat/
|
||||
├── messages.journal # Current active journal
|
||||
├── messages.journal.20260328-120000  # Rotated backup (oldest)
├── messages.journal.20260328-140000  # Rotated backup
└── messages.journal.20260328-160000  # Rotated backup (newest)
|
||||
```
|
||||
|
||||
### Rotation Behavior
|
||||
|
||||
- Rotation is triggered when the next message would exceed the size limit
|
||||
- Rotation is automatic and requires no manual intervention
|
||||
- Old backups are deleted in FIFO order (oldest first)
|
||||
- Rotation is thread-safe and won't lose messages
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### Reading Journal Files
|
||||
|
||||
#### Using Python
|
||||
|
||||
```python
|
||||
import json

# Read all entries from current journal
with open('/var/log/heartbeat/messages.journal', 'r') as f:
    for line in f:
        entry = json.loads(line)
        print(f"{entry['datetime']} - {entry['source_ip']} - {entry['message']['ID']}")
|
||||
```
|
||||
|
||||
#### Using jq (command line)
|
||||
|
||||
```bash
|
||||
# View all messages
|
||||
cat /var/log/heartbeat/messages.journal | jq .
|
||||
|
||||
# Filter by message type
|
||||
cat /var/log/heartbeat/messages.journal | jq 'select(.message.ID == "HTB")'
|
||||
|
||||
# Filter by hostname
|
||||
cat /var/log/heartbeat/messages.journal | jq 'select(.message.name == "webserver1")'
|
||||
|
||||
# Count messages by type
|
||||
cat /var/log/heartbeat/messages.journal | jq -r '.message.ID' | sort | uniq -c
|
||||
|
||||
# Extract timestamps and source IPs
|
||||
cat /var/log/heartbeat/messages.journal | jq -r '[.datetime, .source_ip, .message.ID] | @tsv'
|
||||
```
|
||||
|
||||
#### Using shell tools
|
||||
|
||||
```bash
|
||||
# Count total messages
|
||||
wc -l /var/log/heartbeat/messages.journal
|
||||
|
||||
# View recent messages
|
||||
tail -n 100 /var/log/heartbeat/messages.journal | jq .
|
||||
|
||||
# Search for specific host
|
||||
grep -F '"name":"webserver1"' /var/log/heartbeat/messages.journal
|
||||
|
||||
# Check journal file size
|
||||
du -h /var/log/heartbeat/messages.journal
|
||||
```
|
||||
|
||||
### Analyzing Historical Data
|
||||
|
||||
```bash
|
||||
# Combine all journal files (current + backups)
|
||||
cat /var/log/heartbeat/messages.journal* | jq . > all_messages.json
|
||||
|
||||
# Count messages per host
|
||||
cat /var/log/heartbeat/messages.journal* | jq -r '.message.name // "unknown"' | sort | uniq -c
|
||||
|
||||
# Find all plugin messages
|
||||
cat /var/log/heartbeat/messages.journal* | jq 'select(.message.ID == "PLG")'
|
||||
|
||||
# Extract CPU metrics from plugin messages
|
||||
cat /var/log/heartbeat/messages.journal* | \
|
||||
jq 'select(.message.plugin == "cpu_monitor") | {time: .datetime, host: .message.name, cpu: .message.cpu_percent}'
|
||||
```
|
||||
|
||||
## Integration with Log Management
|
||||
|
||||
### Logrotate
|
||||
|
||||
While the journal has built-in rotation, you can also use logrotate for additional management:
|
||||
|
||||
```
|
||||
/var/log/heartbeat/messages.journal.* {
|
||||
daily
|
||||
rotate 30
|
||||
compress
|
||||
delaycompress
|
||||
missingok
|
||||
notifempty
|
||||
}
|
||||
```
|
||||
|
||||
### Elasticsearch/OpenSearch
|
||||
|
||||
Import journal data into Elasticsearch for advanced analysis:
|
||||
|
||||
```python
|
||||
from elasticsearch import Elasticsearch
import json

# Recent client versions require the URL scheme in the host string
es = Elasticsearch(['http://localhost:9200'])

with open('/var/log/heartbeat/messages.journal', 'r') as f:
    for line in f:
        entry = json.loads(line)
        # `document=` replaces the older `body=` argument in recent clients
        es.index(index='heartbeat-messages', document=entry)
|
||||
```
|
||||
|
||||
### Splunk
|
||||
|
||||
Create a Splunk input for the journal:
|
||||
|
||||
```ini
|
||||
[monitor:///var/log/heartbeat/messages.journal*]
|
||||
sourcetype = heartbeat_json
|
||||
index = heartbeat
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Overhead
|
||||
|
||||
- Journal writing is async and non-blocking
|
||||
- Typical overhead: < 1ms per message
|
||||
- Minimal impact on heartbeat processing
|
||||
|
||||
### Disk Usage
|
||||
|
||||
Calculate expected disk usage:
|
||||
|
||||
```
|
||||
Messages per day = (86400 seconds / interval) * number_of_hosts
|
||||
Average message size ≈ 200-500 bytes
|
||||
Daily disk usage = Messages per day * Average message size
|
||||
|
||||
Example:
|
||||
- 100 hosts
|
||||
- 30 second interval
|
||||
- 2880 messages/day per host
|
||||
- 288,000 messages/day total
|
||||
- ~60-140 MB/day
|
||||
```
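
The same calculation can be scripted to size `journal_max_size` and `journal_max_backups` for a given deployment. This is a rough sketch: the helper names are hypothetical, and the retention estimate assumes every backup file fills up to the size limit.

```python
def daily_journal_bytes(hosts: int, interval_seconds: float, avg_message_bytes: int) -> float:
    """Expected journal growth per day for heartbeat messages only."""
    messages_per_day = (86400 / interval_seconds) * hosts
    return messages_per_day * avg_message_bytes


def retention_days(max_size: int, max_backups: int, daily_bytes: float) -> float:
    """Approximate days of history kept: current file plus backups."""
    return (max_backups + 1) * max_size / daily_bytes


low = daily_journal_bytes(hosts=100, interval_seconds=30, avg_message_bytes=200)
high = daily_journal_bytes(hosts=100, interval_seconds=30, avg_message_bytes=500)
print(f"{low / 1e6:.1f} to {high / 1e6:.1f} MB/day")
print(f"{retention_days(104857600, 10, high):.1f} to {retention_days(104857600, 10, low):.1f} days retained")
```

Plugin messages add to these totals, so treat the result as a lower bound.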
|
||||
|
||||
### Recommendations
|
||||
|
||||
- **Small deployments** (< 50 hosts): Default settings work well
|
||||
- **Medium deployments** (50-500 hosts): Increase `journal_max_size` to 500MB, `journal_max_backups` to 20
|
||||
- **Large deployments** (> 500 hosts): Consider 1GB+ journal files, 30+ backups, or external log aggregation
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Check Journal Status
|
||||
|
||||
The journal exposes statistics that can be queried:
|
||||
|
||||
```python
|
||||
from hbd.journal import get_journal
|
||||
|
||||
journal = get_journal()
|
||||
stats = journal.get_stats()
|
||||
print(f"Current size: {stats['current_size']:,} bytes")
|
||||
print(f"Rotation threshold: {stats['rotation_threshold']}")
|
||||
```
|
||||
|
||||
### Log Messages
|
||||
|
||||
Journal operations are logged at appropriate levels:
|
||||
|
||||
- `INFO`: Initialization, rotation events, cleanup
|
||||
- `DEBUG`: Individual message logging
|
||||
- `WARNING`: Non-critical issues
|
||||
- `ERROR`: Critical failures
|
||||
|
||||
Check hbd logs for journal-related messages:
|
||||
|
||||
```bash
|
||||
grep journal /var/log/heartbeat.log
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Journal Files Not Created
|
||||
|
||||
**Problem**: No journal files appear in the configured directory.
|
||||
|
||||
**Solutions**:
|
||||
- Check `journal_enabled: true` in configuration
|
||||
- Verify directory exists and hbd has write permissions
|
||||
- Check hbd logs for initialization errors
|
||||
- Verify disk space is available
|
||||
|
||||
### Rotation Not Working
|
||||
|
||||
**Problem**: Journal file grows beyond `journal_max_size`.
|
||||
|
||||
**Solutions**:
|
||||
- Check that `journal_max_size` is properly configured
|
||||
- Verify hbd has permission to rename/create files
|
||||
- Check for filesystem issues
|
||||
- Review hbd logs for rotation errors
|
||||
|
||||
### Missing Messages
|
||||
|
||||
**Problem**: Some messages don't appear in journal.
|
||||
|
||||
**Solutions**:
|
||||
- Verify `journal_enabled: true`
|
||||
- Check for write errors in hbd logs
|
||||
- Verify sufficient disk space
|
||||
- Check if filesystem is read-only
|
||||
|
||||
### Performance Issues
|
||||
|
||||
**Problem**: Journal causing slow message processing.
|
||||
|
||||
**Solutions**:
|
||||
- Use faster storage (SSD) for journal directory
|
||||
- Increase `journal_max_size` to reduce rotation frequency
|
||||
- Disable journal if not needed: `journal_enabled: false`
|
||||
- Consider async syslog forwarding instead
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### File Permissions
|
||||
|
||||
Ensure proper permissions on journal files:
|
||||
|
||||
```bash
|
||||
# Journal directory
|
||||
chmod 750 /var/log/heartbeat
|
||||
chown hbd:hbd /var/log/heartbeat
|
||||
|
||||
# Journal files
|
||||
chmod 640 /var/log/heartbeat/messages.journal*
|
||||
```
|
||||
|
||||
### Sensitive Data
|
||||
|
||||
Journal files may contain:
|
||||
- Hostnames and IP addresses
|
||||
- System metrics
|
||||
- Custom message content
|
||||
|
||||
**Recommendations**:
|
||||
- Restrict read access to authorized users only
|
||||
- Consider encryption for archived journals
|
||||
- Implement log retention policies
|
||||
- Sanitize data if sharing for debugging
|
||||
|
||||
## API Reference
|
||||
|
||||
### MessageJournal Class
|
||||
|
||||
```python
|
||||
from typing import Any, Dict

# Interface summary; method bodies omitted
class MessageJournal:
    def __init__(self, config: Dict[str, Any]): ...
    async def initialize(self) -> bool: ...
    async def log_message(self, msg: Dict, addr: tuple, timestamp: float): ...
    async def close(self): ...
    def get_stats(self) -> Dict[str, Any]: ...
|
||||
```
|
||||
|
||||
### Module Functions
|
||||
|
||||
```python
|
||||
# Interface summary; function bodies omitted
def get_journal(config: Dict = None) -> MessageJournal: ...
async def log_message(msg: Dict, addr: tuple, timestamp: float = None): ...
|
||||
```
|
||||
|
||||
## Example: Custom Message Processing
|
||||
|
||||
Process journal messages in real-time:
|
||||
|
||||
```python
|
||||
import asyncio
import json
from pathlib import Path

async def tail_journal(journal_path):
    """Follow journal file and process new messages."""
    path = Path(journal_path)

    with open(path, 'r') as f:
        # Jump to end
        f.seek(0, 2)

        while True:
            line = f.readline()
            if line:
                entry = json.loads(line)
                await process_message(entry)
            else:
                await asyncio.sleep(0.1)

async def process_message(entry):
    """Process a journal entry."""
    msg = entry['message']

    # Alert on boot messages
    if msg.get('boot'):
        print(f"ALERT: {msg['name']} rebooted at {entry['datetime']}")

    # Track CPU usage
    if msg.get('ID') == 'PLG' and msg.get('plugin') == 'cpu_monitor':
        cpu = msg.get('cpu_percent', 0)
        if cpu > 90:
            print(f"WARNING: {entry['source_ip']} CPU usage: {cpu}%")
|
||||
```
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
Potential improvements for future versions:
|
||||
|
||||
- Compression of rotated logs (gzip)
|
||||
- Time-based rotation in addition to size-based
|
||||
- Filtering to exclude certain message types
|
||||
- Structured logging output formats (CEF, GELF)
|
||||
- Remote syslog forwarding
|
||||
- Message deduplication
|
||||
- Journal file encryption
|
||||
- Signed journal entries
|
||||
|
||||
## See Also
|
||||
|
||||
- [Configuration Guide](../hbd/config.py) - Full configuration options
|
||||
- [UDP Protocol](../hbd/udp.py) - Message handling
|
||||
- [Server Architecture](../hbd/server.py) - Server initialization
|
||||
@@ -0,0 +1,331 @@
|
||||
# Nagios Plugin Integration Guide
|
||||
|
||||
The Heartbeat monitoring system now supports running existing Nagios-compatible monitoring plugins through the `nagios_runner` plugin. This allows you to leverage the thousands of existing Nagios plugins without modification.
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 1. Install Nagios Plugins
|
||||
|
||||
**Debian/Ubuntu:**
|
||||
```bash
|
||||
sudo apt-get install nagios-plugins
|
||||
```
|
||||
|
||||
**RHEL/CentOS/Fedora:**
|
||||
```bash
|
||||
sudo yum install nagios-plugins-all
|
||||
# or
|
||||
sudo dnf install nagios-plugins-all
|
||||
```
|
||||
|
||||
**Arch Linux:**
|
||||
```bash
|
||||
sudo pacman -S monitoring-plugins
|
||||
```
|
||||
|
||||
### 2. Configure Heartbeat
|
||||
|
||||
Add the `nagios_runner` section to your `~/.hb.yaml` config:
|
||||
|
||||
```yaml
|
||||
nagios_runner:
  interval: 60  # Run plugins every 60 seconds
  timeout: 30   # Command timeout in seconds
  commands:
    - name: check_disk_root
      command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /

    - name: check_load
      command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6

    - name: check_procs
      command: /usr/lib/nagios/plugins/check_procs -w 250 -c 400
|
||||
```
|
||||
|
||||
### 3. Start Heartbeat Client
|
||||
|
||||
```bash
|
||||
hbc -v localhost
|
||||
```
|
||||
|
||||
The client will now execute the configured Nagios plugins and send their results to the server.
|
||||
|
||||
## How It Works
|
||||
|
||||
### Nagios Plugin Standard
|
||||
|
||||
Nagios plugins follow a simple interface:
|
||||
|
||||
1. **Exit Codes:**
   - `0` = OK
   - `1` = WARNING
   - `2` = CRITICAL
   - `3` = UNKNOWN

2. **Output Format:**
   ```
   STATUS - Message | performance_data
   ```

3. **Performance Data Format:**
   ```
   'label'=value[UOM];[warn];[crit];[min];[max]
   ```
|
||||
|
||||
### Example Plugin Output
|
||||
|
||||
```bash
|
||||
$ /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
|
||||
DISK OK - free space: / 156 GB (78%); | /=44GB;127;142;0;159
|
||||
```
|
||||
|
||||
This output includes:
|
||||
- **Status:** `DISK OK`
|
||||
- **Message:** `free space: / 156 GB (78%)`
|
||||
- **Performance Data:** `/=44GB;127;142;0;159`
  - Current value: 44GB
  - Warning threshold: 127GB
  - Critical threshold: 142GB
  - Min: 0GB
  - Max: 159GB
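
To make the format concrete, here is a minimal parser sketch for a single line of plugin output. It is an illustration, not the `nagios_runner` implementation: `parse_plugin_output` and the returned dictionary layout are assumptions, and thresholds are kept as strings because Nagios allows range expressions such as `10:20`.

```python
import re

# One performance-data item: 'label'=value[UOM];[warn];[crit];[min];[max]
PERF_RE = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<bare>[^\s=']+))"  # label, optionally quoted
    r"=(?P<value>-?\d+(?:\.\d+)?)"                 # numeric value
    r"(?P<uom>[^;\s\d]*)"                          # unit of measurement
    r"(?:;(?P<warn>[^;\s]*))?"
    r"(?:;(?P<crit>[^;\s]*))?"
    r"(?:;(?P<min>[^;\s]*))?"
    r"(?:;(?P<max>[^;\s]*))?"
)


def parse_plugin_output(line: str) -> dict:
    """Split plugin output into its text part and parsed performance metrics."""
    text, _, perf = line.partition("|")
    metrics = {}
    for m in PERF_RE.finditer(perf):
        label = m.group("quoted") or m.group("bare")
        metrics[label] = {
            "value": float(m.group("value")),
            "uom": m.group("uom") or None,
            "warn": m.group("warn") or None,
            "crit": m.group("crit") or None,
            "min": m.group("min") or None,
            "max": m.group("max") or None,
        }
    return {"text": text.strip(), "metrics": metrics}


result = parse_plugin_output(
    "DISK OK - free space: / 156 GB (78%); | /=44GB;127;142;0;159"
)
print(result["metrics"]["/"])
```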
|
||||
|
||||
### Data Collected
|
||||
|
||||
The `nagios_runner` plugin collects:
|
||||
|
||||
**For each configured command:**
|
||||
- `{name}_status` - Status string (OK, WARNING, CRITICAL, UNKNOWN)
|
||||
- `{name}_status_code` - Numeric exit code (0-3)
|
||||
- `{name}_output` - Status message
|
||||
- `{name}_{metric}` - Each performance metric value
|
||||
- `{name}_{metric}_uom` - Unit of measurement (if present)
|
||||
- `{name}_{metric}_warn` - Warning threshold (if present)
|
||||
- `{name}_{metric}_crit` - Critical threshold (if present)
|
||||
- `{name}_{metric}_min` - Minimum value (if present)
|
||||
- `{name}_{metric}_max` - Maximum value (if present)
|
||||
|
||||
**Overall:**
|
||||
- `overall_status` - Worst status from all commands
|
||||
- `overall_status_code` - Worst status code
|
||||
- `plugin_count` - Number of Nagios plugins executed
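
The key naming above can be illustrated with a short sketch. It is not the plugin's actual code: `flatten` and `summarize` are hypothetical helpers, and treating the highest numeric exit code as the "worst" status is an assumption made for this example.

```python
from typing import Any, Dict, List

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def flatten(name: str, exit_code: int, output: str,
            metrics: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Build the {name}_* keys for one configured command."""
    data: Dict[str, Any] = {
        f"{name}_status": STATUS_NAMES.get(exit_code, "UNKNOWN"),
        f"{name}_status_code": exit_code,
        f"{name}_output": output,
    }
    for metric, fields in metrics.items():
        data[f"{name}_{metric}"] = fields["value"]
        # Optional fields are only emitted when the plugin reported them
        for key in ("uom", "warn", "crit", "min", "max"):
            if fields.get(key) not in (None, ""):
                data[f"{name}_{metric}_{key}"] = fields[key]
    return data


def summarize(codes: List[int]) -> Dict[str, Any]:
    """Build the overall_* keys (assumption: highest code is worst)."""
    worst = max(codes, default=0)
    return {
        "overall_status": STATUS_NAMES.get(worst, "UNKNOWN"),
        "overall_status_code": worst,
        "plugin_count": len(codes),
    }
```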
|
||||
|
||||
## Configuration Options
|
||||
|
||||
```yaml
nagios_runner:
  # Collection interval in seconds (default: 60)
  interval: 60

  # Command execution timeout in seconds (default: 30)
  timeout: 30

  # Execute commands via shell (default: true)
  # Set to false for direct execution (more secure but less flexible)
  shell: true

  # List of Nagios plugins to run
  commands:
    - name: unique_name                  # Required: unique identifier
      command: /path/to/plugin [args]    # Required: full command to execute
```
|
||||
|
||||
## Common Nagios Plugins
|
||||
|
||||
### System Resources
|
||||
|
||||
**Disk Space:**
```yaml
- name: check_disk_root
  command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```

**Load Average:**
```yaml
- name: check_load
  command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
```

**Swap Usage:**
```yaml
- name: check_swap
  command: /usr/lib/nagios/plugins/check_swap -w 20% -c 10%
```

**Process Count:**
```yaml
- name: check_procs
  command: /usr/lib/nagios/plugins/check_procs -w 250 -c 400
```

**Users Logged In:**
```yaml
- name: check_users
  command: /usr/lib/nagios/plugins/check_users -w 5 -c 10
```
|
||||
|
||||
### Network Services
|
||||
|
||||
**SSH:**
```yaml
- name: check_ssh
  command: /usr/lib/nagios/plugins/check_ssh localhost
```

**HTTP:**
```yaml
- name: check_http_local
  command: /usr/lib/nagios/plugins/check_http -H localhost

- name: check_http_ssl
  command: /usr/lib/nagios/plugins/check_http -H example.com --ssl
```

**DNS:**
```yaml
- name: check_dns
  command: /usr/lib/nagios/plugins/check_dns -H google.com
```

**Ping:**
```yaml
- name: check_ping_gateway
  command: /usr/lib/nagios/plugins/check_ping -H 192.168.1.1 -w 100,20% -c 500,60%
```
|
||||
|
||||
### Databases
|
||||
|
||||
**MySQL:**
```yaml
- name: check_mysql
  command: /usr/lib/nagios/plugins/check_mysql -H localhost -u user -p password
```

**PostgreSQL:**
```yaml
- name: check_pgsql
  command: /usr/lib/nagios/plugins/check_pgsql -H localhost -d database
```
|
||||
|
||||
## Writing Custom Nagios Plugins
|
||||
|
||||
You can write your own Nagios-compatible plugins in any language. Here are two simple examples:
|
||||
|
||||
**Bash:**
|
||||
```bash
#!/bin/bash
# /usr/local/bin/check_example.sh

# Get the value to check
value=$(some_command)

# Define thresholds
warn=80
crit=90

# Report UNKNOWN if the value is missing or not an integer
if ! [[ "$value" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN - Could not read value"
    exit 3
fi

# Check and output result
if [ "$value" -ge "$crit" ]; then
    echo "CRITICAL - Value is $value | value=${value};${warn};${crit};0;100"
    exit 2
elif [ "$value" -ge "$warn" ]; then
    echo "WARNING - Value is $value | value=${value};${warn};${crit};0;100"
    exit 1
else
    echo "OK - Value is $value | value=${value};${warn};${crit};0;100"
    exit 0
fi
```
|
||||
|
||||
**Python:**
|
||||
```python
#!/usr/bin/env python3
# /usr/local/bin/check_example.py

import sys


def get_value():
    """Placeholder: replace with your own check logic."""
    return 42


def check_something():
    value = get_value()
    warn = 80
    crit = 90

    perfdata = f"value={value};{warn};{crit};0;100"

    if value >= crit:
        print(f"CRITICAL - Value is {value} | {perfdata}")
        sys.exit(2)
    elif value >= warn:
        print(f"WARNING - Value is {value} | {perfdata}")
        sys.exit(1)
    else:
        print(f"OK - Value is {value} | {perfdata}")
        sys.exit(0)


if __name__ == "__main__":
    check_something()
```
|
||||
|
||||
Then configure in Heartbeat:
|
||||
```yaml
nagios_runner:
  commands:
    - name: my_custom_check
      command: /usr/local/bin/check_example.sh
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Plugin not found
|
||||
```
|
||||
Error: Command not found
|
||||
```
|
||||
**Solution:** Use the full path to the plugin. Common locations:
|
||||
- `/usr/lib/nagios/plugins/`
|
||||
- `/usr/lib64/nagios/plugins/`
|
||||
- `/usr/local/nagios/libexec/`
|
||||
|
||||
### Permission denied
|
||||
```
|
||||
Error: Permission denied
|
||||
```
|
||||
**Solution:** Ensure the plugin is executable:
|
||||
```bash
|
||||
chmod +x /path/to/plugin
|
||||
```
|
||||
|
||||
### Timeout errors
|
||||
```
|
||||
Command timed out after 30s
|
||||
```
|
||||
**Solution:** Increase the timeout in config:
|
||||
```yaml
nagios_runner:
  timeout: 60  # Increase timeout
```
|
||||
|
||||
### No performance data
|
||||
If performance data is not being parsed:
|
||||
1. Check plugin output includes `|` separator
|
||||
2. Verify performance data format: `'label'=value[UOM];...`
|
||||
3. Enable debug logging: `hbc -v -x localhost`
|
||||
|
||||
## Benefits
|
||||
|
||||
1. **Massive Plugin Library:** Thousands of existing Nagios plugins available
|
||||
2. **No Rewriting:** Use plugins as-is without modification
|
||||
3. **Community Support:** Well-documented and maintained plugins
|
||||
4. **Flexibility:** Mix Nagios plugins with native Heartbeat plugins
|
||||
5. **Standard Interface:** Consistent exit codes and output format
|
||||
6. **Performance Data:** Automatic extraction of metrics
|
||||
|
||||
## Resources
|
||||
|
||||
- [Nagios Plugin Development Guidelines](https://nagios-plugins.org/doc/guidelines.html)
|
||||
- [Monitoring Plugins Project](https://www.monitoring-plugins.org/)
|
||||
- [Nagios Exchange](https://exchange.nagios.org/) - Plugin repository
|
||||
- [Check_MK Local Checks](https://docs.checkmk.com/latest/en/localchecks.html) - Compatible format
|
||||
|
||||
## Next Steps
|
||||
|
||||
- Configure threshold alerts based on Nagios plugin status codes
|
||||
- View plugin data in the Heartbeat web UI
|
||||
- Create custom plugins for your specific monitoring needs
|
||||
- Integrate with existing Nagios/Icinga configurations
|
||||
@@ -0,0 +1,544 @@
|
||||
# Plugin Development Guide
|
||||
|
||||
This guide explains how to create custom plugins for the Heartbeat monitoring system.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Plugin Architecture](#plugin-architecture)
|
||||
- [Plugin Types](#plugin-types)
|
||||
- [Creating a Plugin](#creating-a-plugin)
|
||||
- [Plugin Lifecycle](#plugin-lifecycle)
|
||||
- [Configuration](#configuration)
|
||||
- [Best Practices](#best-practices)
|
||||
- [Examples](#examples)
|
||||
- [Testing](#testing)
|
||||
|
||||
## Plugin Architecture
|
||||
|
||||
Heartbeat's plugin system is designed to be simple yet powerful. Plugins are Python classes that inherit from one of the base plugin types and implement a few key methods.
|
||||
|
||||
### Key Concepts
|
||||
|
||||
- **Plugin Registry**: Central registry that manages all loaded plugins
|
||||
- **Plugin Loader**: Automatically discovers and loads plugins from the `hbd/plugins/` directory
|
||||
- **Plugin Types**: InfoPlugin (static data) and MonitorPlugin (periodic metrics)
|
||||
- **Async/Await**: All plugin methods are async for non-blocking operation
|
||||
|
||||
## Plugin Types
|
||||
|
||||
### InfoPlugin
|
||||
|
||||
InfoPlugins collect static information that doesn't change frequently (OS version, hardware specs, etc.).
|
||||
|
||||
- **Runs once** at startup (interval = 0)
|
||||
- **Cached** - data is collected once and reused
|
||||
- **Lightweight** - no periodic overhead
|
||||
|
||||
**Use InfoPlugin for:**
|
||||
- Operating system details
|
||||
- Hardware information
|
||||
- Software versions
|
||||
- Configuration data
|
||||
- Static inventory
|
||||
|
||||
### MonitorPlugin
|
||||
|
||||
MonitorPlugins collect metrics that change over time (CPU usage, memory, network traffic).
|
||||
|
||||
- **Runs periodically** based on configured interval
|
||||
- **Scheduled** - collected at regular intervals
|
||||
- **Dynamic** - captures changing system state
|
||||
|
||||
**Use MonitorPlugin for:**
|
||||
- Resource usage (CPU, memory, disk, network)
|
||||
- Performance metrics
|
||||
- Counters and gauges
|
||||
- Time-series data
|
||||
|
||||
## Creating a Plugin
|
||||
|
||||
### Step 1: Choose Plugin Type
|
||||
|
||||
Decide whether your plugin collects static information (InfoPlugin) or dynamic metrics (MonitorPlugin).
|
||||
|
||||
### Step 2: Create Plugin File
|
||||
|
||||
Create a new Python file in the `hbd/plugins/` directory:
|
||||
|
||||
```python
"""
My awesome plugin for Heartbeat.

Brief description of what this plugin does.
"""

import logging
from typing import Dict, Any, Optional

# Import psutil or other dependencies if needed
try:
    import psutil
except ImportError:
    psutil = None

from hbd.plugin import MonitorPlugin  # or InfoPlugin

logger = logging.getLogger(__name__)


class MyAwesomePlugin(MonitorPlugin):  # or InfoPlugin
    """
    One-line description of the plugin.

    Collects:
        - List of metrics/data collected
        - Another metric

    Configuration:
        interval: Collection interval in seconds (default: 60)
        option1: Description of option1 (default: value)
        option2: Description of option2 (default: value)
    """

    name = "my_awesome_plugin"  # Unique plugin name
    interval = 60  # For MonitorPlugin, use 0 for InfoPlugin

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize the plugin with optional configuration."""
        super().__init__(config)

        # Extract configuration options
        self.option1 = self.config.get('option1', 'default_value')
        self.option2 = self.config.get('option2', True)

        # Check dependencies
        if psutil is None:
            raise ImportError("psutil is required for my_awesome_plugin")

    async def initialize(self):
        """
        Initialize the plugin.

        This is called once when the plugin is loaded.
        Use this to verify dependencies, establish connections, etc.

        Returns:
            True if initialization successful, False otherwise
        """
        logger.info(f"My awesome plugin initialized (option1: {self.option1})")
        return True

    async def collect(self) -> Dict[str, Any]:
        """
        Collect data.

        This is called periodically (MonitorPlugin) or once (InfoPlugin).

        Returns:
            Dictionary of collected data (will be sent to server)
        """
        try:
            data = await self._collect_metrics()
            logger.debug(f"Collected {len(data)} metrics")
            return data
        except Exception as e:
            logger.error(f"Error collecting data: {e}")
            return {"error": str(e)}

    async def _collect_metrics(self) -> Dict[str, Any]:
        """Internal method to collect actual metrics."""
        metrics = {}

        # Collect your data here
        metrics['metric1'] = self._get_metric1()
        metrics['metric2'] = self._get_metric2()

        return metrics

    def _get_metric1(self):
        """Helper method for metric collection."""
        # Implementation here
        return 42

    def _get_metric2(self):
        """Helper method for metric collection."""
        # Implementation here
        return "hello"

    async def cleanup(self):
        """
        Cleanup resources.

        This is called when the plugin is unloaded or the client shuts down.
        Use this to close connections, release resources, etc.
        """
        logger.info("My awesome plugin cleanup")


# Plugin class exposed for automatic discovery
plugin = MyAwesomePlugin
```
|
||||
|
||||
### Step 3: Test Your Plugin
|
||||
|
||||
Create a test script to verify your plugin works:
|
||||
|
||||
```python
#!/usr/bin/env python3
import asyncio
import sys
from pathlib import Path

# Add this script's directory to the import path (run it from the project root)
sys.path.insert(0, str(Path(__file__).parent))

from hbd.plugins.my_awesome_plugin import MyAwesomePlugin


async def test():
    # Create plugin instance
    plugin = MyAwesomePlugin({'option1': 'test_value'})

    # Initialize
    if not await plugin.initialize():
        print("Failed to initialize")
        return False

    # Collect data
    data = await plugin.collect()
    print(f"Collected data: {data}")

    # Cleanup
    await plugin.cleanup()

    return True


if __name__ == '__main__':
    success = asyncio.run(test())
    sys.exit(0 if success else 1)
```
|
||||
|
||||
## Plugin Lifecycle
|
||||
|
||||
Understanding the plugin lifecycle helps you implement plugins correctly:
|
||||
|
||||
```
1. Plugin Discovery
   └─> Loader scans hbd/plugins/ directory
   └─> Finds Python files (except those starting with _)
   └─> Imports modules

2. Plugin Instantiation
   └─> Creates instance with configuration
   └─> __init__() is called

3. Plugin Initialization
   └─> initialize() is called
   └─> Plugin verifies dependencies, establishes connections
   └─> Returns True/False for success/failure

4. Plugin Registration
   └─> If initialization succeeds, plugin is registered
   └─> Plugin becomes active

5. Data Collection
   └─> For InfoPlugin: collect() called once after initialization
   └─> For MonitorPlugin: collect() called periodically based on interval
   └─> Data is sent to server via PLG message

6. Plugin Shutdown
   └─> cleanup() is called
   └─> Plugin releases resources, closes connections
```
|
||||
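
Steps 3 to 6 of the lifecycle can be sketched as a small asyncio driver. This is an illustration only, not the Heartbeat client's scheduler: `DemoPlugin` and `run_plugin` are hypothetical stand-ins, and the `max_runs` cap exists only so the example terminates.

```python
import asyncio
from typing import Any, Dict, List


class DemoPlugin:
    """Stand-in plugin; real plugins derive from the hbd.plugin base classes."""

    name = "demo"
    interval = 0.01  # seconds between collections; 0 means "collect once"

    def __init__(self) -> None:
        self.cleaned_up = False

    async def initialize(self) -> bool:
        return True

    async def collect(self) -> Dict[str, Any]:
        return {"value": 42}

    async def cleanup(self) -> None:
        self.cleaned_up = True


async def run_plugin(plugin: Any, max_runs: int = 3) -> List[Dict[str, Any]]:
    """Drive one plugin through initialize -> collect -> cleanup."""
    results: List[Dict[str, Any]] = []
    if not await plugin.initialize():
        return results  # failed initialization: the plugin never becomes active
    try:
        results.append(await plugin.collect())
        # interval == 0 collects once; otherwise repeat on a timer
        while plugin.interval > 0 and len(results) < max_runs:
            await asyncio.sleep(plugin.interval)
            results.append(await plugin.collect())
    finally:
        # cleanup() runs even if collect() raised
        await plugin.cleanup()
    return results
```

Running `asyncio.run(run_plugin(DemoPlugin()))` collects three samples and then calls `cleanup()`.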
|
||||
## Configuration
|
||||
|
||||
### Plugin-Specific Configuration
|
||||
|
||||
Plugins receive configuration through the `config` parameter in `__init__`:
|
||||
|
||||
```python
def __init__(self, config: Optional[Dict[str, Any]] = None):
    super().__init__(config)

    # Access configuration with defaults
    self.interval = self.config.get('interval', 60)
    self.threshold = self.config.get('threshold', 80)
    self.enabled_features = self.config.get('features', ['feature1', 'feature2'])
```
|
||||
|
||||
### Client Configuration File
|
||||
|
||||
Users configure plugins in the client configuration YAML:
|
||||
|
||||
```yaml
plugins:
  my_awesome_plugin:
    enabled: true
    interval: 120
    option1: custom_value
    option2: false
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Error Handling
|
||||
|
||||
Always handle errors gracefully:
|
||||
|
||||
```python
async def collect(self) -> Dict[str, Any]:
    try:
        return await self._collect_metrics()
    except Exception as e:
        logger.error(f"Error collecting metrics: {e}")
        return {"error": str(e)}
```
|
||||
|
||||
### 2. Logging
|
||||
|
||||
Use appropriate log levels:
|
||||
|
||||
```python
|
||||
logger.debug("Detailed information for debugging")
|
||||
logger.info("Normal operation messages")
|
||||
logger.warning("Warning messages for unusual but handled situations")
|
||||
logger.error("Error messages for failures")
|
||||
```
|
||||
|
||||
### 3. Dependencies
|
||||
|
||||
Check for optional dependencies:
|
||||
|
||||
```python
try:
    import some_optional_library
except ImportError:
    some_optional_library = None

# Later in __init__:
if some_optional_library is None:
    raise ImportError("some_optional_library is required")
```
|
||||
|
||||
### 4. Performance
|
||||
|
||||
- Keep collection methods fast (< 1 second)
|
||||
- Use async/await for I/O operations
|
||||
- Cache expensive computations
|
||||
- Don't block the event loop
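
One way to keep a blocking call off the event loop is to hand it to a worker thread. The sketch below assumes Python 3.9 or later for `asyncio.to_thread`; `slow_blocking_read` is a hypothetical placeholder for whatever synchronous call a plugin needs.

```python
import asyncio
import time
from typing import Any, Dict


def slow_blocking_read() -> int:
    """Placeholder for a blocking call (file I/O, a synchronous library, ...)."""
    time.sleep(0.05)
    return 42


async def collect() -> Dict[str, Any]:
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so the event loop stays free to service other plugins meanwhile.
    value = await asyncio.to_thread(slow_blocking_read)
    return {"metric1": value}
```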
|
||||
|
||||
### 5. Data Structure
|
||||
|
||||
Return clean, structured data:
|
||||
|
||||
```python
{
    'metric_name': value,
    'nested_data': {
        'sub_metric': value
    },
    'list_data': [item1, item2],
    'timestamp': time.time()  # Optional timestamp
}
```
|
||||
|
||||
### 6. Documentation
|
||||
|
||||
Document your plugin thoroughly:
|
||||
|
||||
- Class docstring with description and configuration
|
||||
- Method docstrings explaining purpose and return values
|
||||
- Inline comments for complex logic
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Simple InfoPlugin
|
||||
|
||||
```python
import platform
from typing import Any, Dict

from hbd.plugin import InfoPlugin


class SimpleInfoPlugin(InfoPlugin):
    """Collect basic system information."""

    name = "simple_info"
    interval = 0  # InfoPlugin

    async def initialize(self):
        return True

    async def collect(self) -> Dict[str, Any]:
        return {
            'hostname': platform.node(),
            'system': platform.system(),
            'python_version': platform.python_version()
        }

    async def cleanup(self):
        pass


plugin = SimpleInfoPlugin
```
|
||||
|
||||
### Example 2: MonitorPlugin with State
|
||||
|
||||
```python
import time
from typing import Any, Dict

from hbd.plugin import MonitorPlugin


class CounterPlugin(MonitorPlugin):
    """Track a counter over time."""

    name = "counter"
    interval = 30

    def __init__(self, config=None):
        super().__init__(config)
        self._counter = 0
        self._start_time = time.time()

    async def initialize(self):
        return True

    async def collect(self) -> Dict[str, Any]:
        self._counter += 1
        uptime = time.time() - self._start_time

        return {
            'count': self._counter,
            'uptime': uptime,
            # Guard against a zero uptime on the very first collection
            'rate': self._counter / uptime if uptime > 0 else 0.0
        }

    async def cleanup(self):
        pass


plugin = CounterPlugin
```
|
||||
|
||||
### Example 3: Plugin with External Command
|
||||
|
||||
```python
import asyncio
from typing import Any, Dict

from hbd.plugin import MonitorPlugin


class CommandPlugin(MonitorPlugin):
    """Execute external command and capture output."""

    name = "command_executor"
    interval = 60

    def __init__(self, config=None):
        super().__init__(config)
        self.command = self.config.get('command', 'echo "no command"')

    async def initialize(self):
        return True

    async def collect(self) -> Dict[str, Any]:
        try:
            process = await asyncio.create_subprocess_shell(
                self.command,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE
            )
            try:
                stdout, stderr = await asyncio.wait_for(
                    process.communicate(),
                    timeout=30
                )
            except asyncio.TimeoutError:
                # Don't leave the child process running after a timeout
                process.kill()
                await process.wait()
                return {'error': 'command timed out after 30s'}

            return {
                'exit_code': process.returncode,
                'stdout': stdout.decode('utf-8'),
                'stderr': stderr.decode('utf-8')
            }
        except Exception as e:
            return {'error': str(e)}

    async def cleanup(self):
        pass


plugin = CommandPlugin
```
|
||||
|
||||
## Testing
|
||||
|
||||
### Unit Testing
|
||||
|
||||
Create unit tests for your plugins:
|
||||
|
||||
```python
import asyncio
import unittest

from hbd.plugins.my_awesome_plugin import MyAwesomePlugin


class TestMyPlugin(unittest.TestCase):
    def setUp(self):
        self.plugin = MyAwesomePlugin({'option1': 'test'})

    def test_initialization(self):
        result = asyncio.run(self.plugin.initialize())
        self.assertTrue(result)

    def test_collection(self):
        asyncio.run(self.plugin.initialize())
        data = asyncio.run(self.plugin.collect())

        self.assertIsInstance(data, dict)
        self.assertIn('metric1', data)
        self.assertGreater(data['metric1'], 0)

    def tearDown(self):
        asyncio.run(self.plugin.cleanup())


if __name__ == '__main__':
    unittest.main()
```
|
||||
|
||||
### Integration Testing
|
||||
|
||||
Test your plugin with the actual client:
|
||||
|
||||
```bash
# Create test configuration
cat > test_config.yaml <<EOF
server: localhost
plugins:
  my_awesome_plugin:
    enabled: true
    interval: 10
    option1: test_value
EOF

# Run client in test mode
python -m hbd.hbc -c test_config.yaml --verbose
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### My plugin isn't loading
|
||||
|
||||
1. Check filename doesn't start with underscore
|
||||
2. Verify plugin class inherits from InfoPlugin or MonitorPlugin
|
||||
3. Check `initialize()` returns True
|
||||
4. Look for import errors in logs
|
||||
|
||||
### Plugin loads but doesn't collect data
|
||||
|
||||
1. Check `interval` is set correctly (0 for InfoPlugin, > 0 for MonitorPlugin)
|
||||
2. Verify `collect()` returns a dictionary
|
||||
3. Check for exceptions in `collect()` method
|
||||
4. Enable DEBUG logging to see detailed errors
|
||||
|
||||
### Data isn't appearing on server
|
||||
|
||||
1. Verify client is connected to server
|
||||
2. Check server logs for PLG message handling
|
||||
3. Verify returned data is JSON-serializable
|
||||
4. Check for large data sizes (may exceed UDP packet size)
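
Checks 3 and 4 can be done before the data ever leaves the client. The helper below is a hypothetical sketch; `check_payload` is not part of Heartbeat, and the 1400-byte budget is an assumed figure chosen to stay under a typical Ethernet MTU, not a documented limit.

```python
import json
from typing import Any, Dict, Tuple

MAX_PAYLOAD_BYTES = 1400  # assumed budget; not a value taken from Heartbeat


def check_payload(data: Dict[str, Any],
                  limit: int = MAX_PAYLOAD_BYTES) -> Tuple[bool, str]:
    """Return (ok, reason) for a dictionary returned by collect()."""
    try:
        encoded = json.dumps(data).encode("utf-8")
    except (TypeError, ValueError) as exc:
        return False, f"not JSON-serializable: {exc}"
    if len(encoded) > limit:
        return False, f"payload is {len(encoded)} bytes, over the {limit}-byte budget"
    return True, f"ok ({len(encoded)} bytes)"
```

Calling it from a unit test on the output of `collect()` catches sets, bytes, and custom objects early, since none of these serialize with the standard `json` module.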
|
||||
|
||||
## Further Reading
|
||||
|
||||
- [Plugin Framework Source](../hbd/plugin.py) - Core plugin implementation
|
||||
- [Built-in Plugins](../hbd/plugins/) - Examples of working plugins
|
||||
- [Nagios Integration](NAGIOS_INTEGRATION.md) - Running external plugins
|
||||
- [Configuration Guide](../hbd/config_example.yaml) - Full configuration reference
|
||||
@@ -0,0 +1,742 @@
|
||||
# Threshold Alerting System
|
||||
|
||||
## Overview
|
||||
|
||||
The Heartbeat Monitoring System includes a comprehensive threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured thresholds. This system is designed to:
|
||||
|
||||
- **Detect anomalies**: Automatically identify when system metrics exceed safe operating ranges
|
||||
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
|
||||
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
|
||||
- **Track state**: Maintain alert history and state transitions per host
|
||||
- **Integrate seamlessly**: Work with existing notification infrastructure (email, pushover, etc.)
|
||||
|
||||
## Architecture
|
||||
|
||||
### Components
|
||||
|
||||
1. **ThresholdChecker** (`hbd/threshold.py`)
|
||||
- Main threshold checking engine
|
||||
- Parses configuration
|
||||
- Evaluates metrics against thresholds
|
||||
- Triggers notifications on state changes
|
||||
|
||||
2. **ThresholdConfig**
|
||||
- Individual threshold configuration
|
||||
- Supports multiple comparison operators
|
||||
- Implements hysteresis logic
|
||||
|
||||
3. **AlertState**
|
||||
- Tracks current alert state per metric
|
||||
- Records state transitions
|
||||
- Manages notification timing
|
||||
|
||||
4. **Integration Points**
|
||||
- UDP handler: Checks thresholds when plugin data arrives
|
||||
- Host objects: Store alert states per host
|
||||
- Notification system: Sends alerts via configured channels
|
||||
|
||||
### Alert Levels
|
||||
|
||||
- **OK**: Metric is within normal range
|
||||
- **WARNING**: Metric has exceeded warning threshold (first-level concern)
|
||||
- **CRITICAL**: Metric has exceeded critical threshold (requires immediate attention)
|
||||
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)
|
||||
|
||||
## Configuration
|
||||
|
||||
### Basic Structure
|
||||
|
||||
Thresholds are configured in the YAML configuration file under the `thresholds` section:
|
||||
|
||||
```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      enabled: true
```
|
||||
|
||||
### Configuration Parameters
|
||||
|
||||
#### Required Parameters
|
||||
|
||||
- **warning**: Warning threshold value (numeric)
|
||||
- **critical**: Critical threshold value (numeric)
|
||||
|
||||
Note: At least one of `warning` or `critical` must be specified.
|
||||
|
||||
#### Optional Parameters
|
||||
|
||||
- **operator**: Comparison operator (default: `">"`)
|
||||
- `">"` - Greater than
|
||||
- `">="` - Greater than or equal
|
||||
- `"<"` - Less than
|
||||
- `"<="` - Less than or equal
|
||||
- `"=="` - Equal to
|
||||
- `"!="` - Not equal to
|
||||
|
||||
- **hysteresis**: Hysteresis, expressed as a fraction of the threshold, used to prevent flapping (default: `0.1` = 10%)
|
||||
- Range: 0.0 to 1.0
|
||||
- Prevents rapid state transitions when value hovers near threshold
|
||||
|
||||
- **enabled**: Whether this threshold is active (default: `true`)
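
The operator table above maps naturally onto Python's `operator` module. The following is a sketch of how a single value might be classified, not the code in `hbd/threshold.py`; the function name `classify` and the choice to check `critical` before `warning` are assumptions made for this example.

```python
import operator
from typing import Any, Optional

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def classify(value: Any, warning: Optional[float] = None,
             critical: Optional[float] = None, op: str = ">") -> str:
    """Return OK, WARNING, CRITICAL, or UNKNOWN for one metric value."""
    # Non-numeric data cannot be compared against a threshold
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return "UNKNOWN"
    compare = OPERATORS[op]
    # Check the more severe level first so it wins when both match
    if critical is not None and compare(value, critical):
        return "CRITICAL"
    if warning is not None and compare(value, warning):
        return "WARNING"
    return "OK"
```

With `op="<"` the same function covers "lower is worse" metrics: `classify(400, warning=1000, critical=500, op="<")` evaluates to `"CRITICAL"`.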
|
||||
|
||||
### Comparison Operators
|
||||
|
||||
#### Greater Than (`>`, `>=`)
|
||||
|
||||
Used for metrics where **higher values are problematic**:
|
||||
|
||||
```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0    # Alert when CPU > 80%
    critical: 90.0   # Alert when CPU > 90%
    operator: ">"
```
|
||||
|
||||
Examples:
|
||||
- CPU usage percentage
|
||||
- Memory usage percentage
|
||||
- Disk usage percentage
|
||||
- Load average
|
||||
- Error counters
|
||||
|
||||
#### Less Than (`<`, `<=`)
|
||||
|
||||
Used for metrics where **lower values are problematic**:
|
||||
|
||||
```yaml
memory_monitor:
  available_mb:
    warning: 1000    # Alert when available memory < 1GB
    critical: 500    # Alert when available memory < 500MB
    operator: "<"
```
|
||||
|
||||
Examples:
|
||||
- Available memory
|
||||
- Free disk space
|
||||
- Connection pool availability
|
||||
- Battery level
|
||||
|
||||
## Hysteresis
|
||||
|
||||
Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.
|
||||
|
||||
### How It Works
|
||||
|
||||
When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), hysteresis is applied when the value improves:
|
||||
|
||||
```
|
||||
Threshold: 90
|
||||
Hysteresis: 0.1 (10%)
|
||||
Recovery threshold: 90 - (90 * 0.1) = 81
|
||||
|
||||
Value 91 -> CRITICAL (threshold crossed)
|
||||
Value 89 -> CRITICAL (still above recovery threshold of 81)
|
||||
Value 85 -> CRITICAL (still above recovery threshold)
|
||||
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
|
||||
```
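
The trace above can be reproduced with a few lines of Python. This sketch covers only the `>` operator and is an interpretation of the behaviour described here, not the `ThresholdConfig` implementation; `next_level` and `SEVERITY` are names invented for the example.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def next_level(value: float, current: str, warning: float,
               critical: float, hysteresis: float = 0.1) -> str:
    """Apply hysteresis when a '>' threshold metric improves."""
    # Plain evaluation, ignoring the previous state
    if value > critical:
        fresh = "CRITICAL"
    elif value > warning:
        fresh = "WARNING"
    else:
        fresh = "OK"

    # Escalations and unchanged levels take effect immediately
    if SEVERITY[fresh] >= SEVERITY[current]:
        return fresh

    # Improvement: hold the current level until the value drops below
    # the recovery threshold, threshold - threshold * hysteresis
    threshold = critical if current == "CRITICAL" else warning
    if value > threshold * (1 - hysteresis):
        return current
    return fresh
```

With `warning=80`, `critical=90`, and the default hysteresis, feeding 91, 89, 85, and 80 in sequence yields CRITICAL, CRITICAL, CRITICAL, and OK, because the recovery threshold is 81 and 80 no longer exceeds the warning threshold.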
|
||||
|
||||
### Configuration Recommendations
|
||||
|
||||
- **Stable metrics** (CPU, memory): 10-15% hysteresis
|
||||
```yaml
|
||||
hysteresis: 0.1
|
||||
```
|
||||
|
||||
- **Very stable metrics** (disk usage): 5% hysteresis
|
||||
```yaml
|
||||
hysteresis: 0.05
|
||||
```
|
||||
|
||||
- **Counter metrics** (errors, packets): 20% hysteresis
|
||||
```yaml
|
||||
hysteresis: 0.2
|
||||
```
|
||||
|
||||
- **Binary states** (exit codes): No hysteresis
|
||||
```yaml
|
||||
hysteresis: 0.0
|
||||
```
|
||||
|
||||
## Plugin-Specific Configuration
|
||||
|
||||
### CPU Monitor
|
||||
|
||||
```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1

  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15

  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"

  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```
|
||||
|
||||
### Memory Monitor
|
||||
|
||||
```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"

  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"

  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```
|
||||
|
||||
### Disk Monitor
|
||||
|
||||
Disk thresholds support **partition-specific configuration**:
|
||||
|
||||
```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05

      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"

    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"

    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"

      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```
|
||||
|
||||
### Network Monitor
|
||||
|
||||
```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2

  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"

  dropout_total:
    warning: 50
    critical: 200
    operator: ">"

  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"

  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```
|
||||
|
||||
### Nagios Runner
|
||||
|
||||
The Nagios plugin runner reports exit codes that can be thresholded:
|
||||
|
||||
```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```
|
||||
|
||||
## Notification Behavior
|
||||
|
||||
### When Notifications Are Sent
|
||||
|
||||
Notifications are triggered on **state changes**:
|
||||
|
||||
1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
|
||||
```
|
||||
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
|
||||
```
|
||||
|
||||
2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
|
||||
```
|
||||
RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
|
||||
```
|
||||
|
||||
3. **Re-notifications**: Periodic reminders for ongoing alerts
|
||||
```
|
||||
REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
|
||||
```
|
||||
|
||||
### Notification Frequency
|
||||
|
||||
- **State changes**: Immediate notification
|
||||
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)
|
||||
|
||||
```yaml
|
||||
threshold_renotify_interval: 3600 # Re-notify every hour for ongoing alerts
|
||||
```
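
The rules above (notify on every state change, then remind at a fixed interval) can be expressed as a small decision function. This is a sketch under stated assumptions: `notification_kind` is a hypothetical helper, only the OK, WARNING, and CRITICAL levels are handled, and the reminder clock is assumed to start at the last notification sent.

```python
from typing import Optional

SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def notification_kind(old_level: str, new_level: str, now: float,
                      last_notification: Optional[float],
                      renotify_interval: float = 3600) -> Optional[str]:
    """Return ESCALATION, RECOVERY, REMINDER, or None (stay silent)."""
    if new_level != old_level:
        # Any state change is reported immediately
        if SEVERITY[new_level] > SEVERITY[old_level]:
            return "ESCALATION"
        return "RECOVERY"
    # Unchanged, non-OK state: remind once the interval has elapsed
    if (new_level != "OK" and last_notification is not None
            and now - last_notification >= renotify_interval):
        return "REMINDER"
    return None
```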
|
||||
|
||||
### Notification Channels
|
||||
|
||||
Thresholds use the same notification infrastructure as heartbeat monitoring:
|
||||
|
||||
- **Email** (via SMTP)
|
||||
- **Pushover** (mobile notifications)
|
||||
- **Mattermost** (team chat)
|
||||
- **Custom webhooks**
|
||||
|
||||
Configuration:
|
||||
|
||||
```yaml
|
||||
# Email
|
||||
toemail:
|
||||
- admin@example.com
|
||||
- oncall@example.com
|
||||
fromemail: heartbeat@example.com
|
||||
smtpserver: smtp.example.com
|
||||
smtpport: 587
|
||||
smtpuser: heartbeat@example.com
|
||||
smtppassword: your-password
|
||||
|
||||
# Pushover
|
||||
pushover_token: your-app-token
|
||||
pushover_user: your-user-key
|
||||
```
|
||||
|
||||
### Watched Hosts
|
||||
|
||||
Only hosts in the `watchhosts` list will trigger notifications:
|
||||
|
||||
```yaml
|
||||
watchhosts:
|
||||
- webserver01
|
||||
- database01
|
||||
- mailserver
|
||||
```
|
||||
|
||||
Hosts not in this list still have their thresholds checked and their alert states tracked, but no notifications are sent for them.
|
||||
|
||||
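The effect of `watchhosts` can be pictured as a filter applied only at the point of sending. The sketch below uses a hypothetical helper, `handle_state_change`, and a plain list as the notification sink; it is not the daemon's actual code path.

```python
def handle_state_change(host_name, message, watchhosts, send):
    """Send `message` only for watched hosts; return True if it was sent."""
    if host_name not in watchhosts:
        # The alert state is still tracked by the caller, it is just not announced.
        return False
    send(message)
    return True


sent = []
watch = ["webserver01", "database01", "mailserver"]
handle_state_change("webserver01", "WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0", watch, sent.append)
handle_state_change("testbox", "WARNING: testbox - cpu_monitor.cpu_percent = 85.0", watch, sent.append)
print(len(sent))  # 1
```
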
## Alert State Tracking
|
||||
|
||||
Each host maintains alert states for all monitored metrics:
|
||||
|
||||
```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```
|
||||
|
||||
Alert states are held in memory and saved together with the host data (serialized with pickle).
|
||||
|
||||
### Alert State Information
|
||||
|
||||
Each `AlertState` tracks:
|
||||
|
||||
- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
|
||||
- **since**: Timestamp when current state started
|
||||
- **last_value**: Most recent metric value
|
||||
- **last_check**: Timestamp of last threshold check
|
||||
- **notification_count**: Number of notifications sent for this alert
|
||||
- **last_notification**: Timestamp of last notification
|
||||
|
||||
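For orientation, the fields listed above could be modeled as a small dataclass. This is a sketch, not the project's `AlertState` class: the name `AlertStateSketch`, the field types, the defaults, and the `record_notification` helper are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertStateSketch:
    level: str = "OK"                           # OK, WARNING, CRITICAL, or UNKNOWN
    since: float = 0.0                          # when the current level started
    last_value: Optional[float] = None          # most recent metric value
    last_check: float = 0.0                     # time of the last threshold check
    notification_count: int = 0                 # notifications sent for this alert
    last_notification: Optional[float] = None   # time of the last notification

    def record_notification(self, now: float) -> None:
        """Update the bookkeeping fields after a notification has been sent."""
        self.notification_count += 1
        self.last_notification = now


state = AlertStateSketch(level="WARNING", since=1234567890, last_value=85.0)
state.record_notification(now=1234567895)
print(state.notification_count, state.last_notification)
```
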
### Querying Alert States
|
||||
|
||||
Via HTTP API (future enhancement):
|
||||
|
||||
```
|
||||
GET /api/hosts/webserver01/alerts
|
||||
```
|
||||
|
||||
Response:
|
||||
```json
|
||||
{
|
||||
"active_alerts": [
|
||||
{
|
||||
"metric": "cpu_monitor.cpu_percent",
|
||||
"level": "WARNING",
|
||||
"since": 1234567890,
|
||||
"value": 85.0,
|
||||
"duration": 300
|
||||
}
|
||||
],
|
||||
"summary": {
|
||||
"ok": 15,
|
||||
"warning": 1,
|
||||
"critical": 0
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
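Since the endpoint is described as a future enhancement, the following is only a sketch of how a response with this shape might be assembled. It assumes each metric's state is available as a `(level, since, value)` tuple, which is a simplification chosen for the example; `build_alert_response` is a hypothetical name.

```python
import json
import time


def build_alert_response(states, now=None):
    """Build a dict shaped like the example response from {metric: (level, since, value)}."""
    now = time.time() if now is None else now
    active = []
    summary = {"ok": 0, "warning": 0, "critical": 0}
    for metric, (level, since, value) in states.items():
        key = level.lower()
        if key in summary:
            summary[key] += 1
        if level in ("WARNING", "CRITICAL"):
            active.append({
                "metric": metric,
                "level": level,
                "since": since,
                "value": value,
                "duration": int(now - since),  # seconds spent in the current level
            })
    return {"active_alerts": active, "summary": summary}


example_states = {
    "cpu_monitor.cpu_percent": ("WARNING", 1234567890, 85.0),
    "memory_monitor.percent": ("OK", 1234567800, 40.0),
}
response = build_alert_response(example_states, now=1234568190)
print(json.dumps(response, indent=2))
```
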
## Testing
|
||||
|
||||
A comprehensive test suite is provided in `test_threshold.py`:
|
||||
|
||||
```bash
|
||||
python test_threshold.py
|
||||
```
|
||||
|
||||
Tests cover:
|
||||
- Threshold configuration and parsing
|
||||
- All comparison operators
|
||||
- Hysteresis functionality
|
||||
- Alert state tracking
|
||||
- State change detection
|
||||
- Notification triggering
|
||||
- Nested metrics (partitions)
|
||||
- Alert summaries
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Start Conservative
|
||||
|
||||
Begin with higher thresholds to avoid alert fatigue:
|
||||
|
||||
```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0   # Start higher
    critical: 95.0  # Very high for critical
```
|
||||
|
||||
Adjust downward based on observed behavior.
|
||||
|
||||
### 2. Consider Workload Patterns
|
||||
|
||||
Different systems have different normal ranges:
|
||||
|
||||
**Web servers** (bursty traffic):
|
||||
```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```
|
||||
|
||||
**Database servers** (steady load):
|
||||
```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1  # Lower hysteresis for steady metrics
```
|
||||
|
||||
### 3. Use Appropriate Operators
|
||||
|
||||
Match the operator to the metric:
|
||||
|
||||
| Metric Type | Example | Operator | Reason |
|
||||
|-------------|---------|----------|--------|
|
||||
| Resource usage | CPU%, Memory% | `>` | Alert when high |
|
||||
| Available resources | Free memory, Free disk | `<` | Alert when low |
|
||||
| Error counters | Network errors | `>` | Alert when increasing |
|
||||
| Health checks | Nagios exit code | `>=` | Map to standard codes |
|
||||
|
||||
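One way to picture how the operator in the table is applied is a lookup from the operator string to a comparison function. The sketch below uses Python's standard `operator` module; the `OPERATORS` table, the `breaches` function, and the exact set of supported operator strings are assumptions for illustration.

```python
import operator

# Assumed mapping from the configured operator string to a comparison function.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def breaches(value, threshold, op=">"):
    """Return True when `value` violates `threshold` under the given operator."""
    return OPERATORS[op](value, threshold)


print(breaches(85.0, 80.0, ">"))   # CPU% above warning -> True
print(breaches(400, 500, "<"))     # free memory below warning -> True
print(breaches(1, 1, ">="))        # Nagios exit code 1 -> True
```
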
### 4. Align with Monitoring Intervals
|
||||
|
||||
Ensure threshold checks align with plugin collection intervals:
|
||||
|
||||
```yaml
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes
```
|
||||
|
||||
### 5. Test Before Production
|
||||
|
||||
1. **Start with disabled thresholds**:
|
||||
```yaml
|
||||
enabled: false
|
||||
```
|
||||
|
||||
2. **Observe metric ranges** over a week
|
||||
|
||||
3. **Set thresholds** based on observed data
|
||||
|
||||
4. **Enable gradually**:
|
||||
```yaml
|
||||
enabled: true
|
||||
```
|
||||
|
||||
5. **Monitor for false positives**
|
||||
|
||||
### 6. Document Baseline Values
|
||||
|
||||
Keep a record of normal operating ranges:
|
||||
|
||||
```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month

cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```
|
||||
|
||||
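The "peak + margin" rule in the example above can be written down as a one-line calculation. The helper name `suggest_warning` and the default margin of 15 percentage points are assumptions chosen to reproduce the 75.0 value from the baseline comment; pick a margin that fits your own data.

```python
def suggest_warning(observed_peak, margin=15.0, ceiling=100.0):
    """Suggest a warning threshold as the observed peak plus a safety margin."""
    # Cap the result so a percentage metric never gets a threshold above 100.
    return min(observed_peak + margin, ceiling)


print(suggest_warning(60.0))  # CPU peak of 60% -> 75.0
```
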
### 7. Layer Alerts
|
||||
|
||||
Use WARNING for early notification, CRITICAL for immediate action:
|
||||
|
||||
```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0   # Early warning: "check in next few days"
        critical: 90.0  # Critical: "act now before outage"
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### No Notifications Being Sent
|
||||
|
||||
1. **Check if host is watched**:
|
||||
```yaml
|
||||
watchhosts:
|
||||
- your-host-name
|
||||
```
|
||||
|
||||
2. **Verify notification configuration**:
|
||||
```yaml
|
||||
toemail:
|
||||
- admin@example.com
|
||||
smtpserver: smtp.example.com
|
||||
```
|
||||
|
||||
3. **Check threshold configuration**:
|
||||
```bash
|
||||
# Look for parsing errors in server logs
|
||||
grep "threshold" /var/log/heartbeat/hbd.log
|
||||
```
|
||||
|
||||
4. **Verify metric names**:
|
||||
- Metric names must match exactly (case-sensitive)
|
||||
- Check the journal or the server logs for the actual metric names
|
||||
|
||||
### Too Many Alerts (Flapping)
|
||||
|
||||
1. **Increase hysteresis**:
|
||||
```yaml
|
||||
hysteresis: 0.2 # Increase from 0.1 to 0.2 (20%)
|
||||
```
|
||||
|
||||
2. **Adjust thresholds**:
|
||||
```yaml
|
||||
warning: 85.0 # Increase from 80.0
|
||||
```
|
||||
|
||||
3. **Increase renotification interval**:
|
||||
```yaml
|
||||
threshold_renotify_interval: 7200 # 2 hours instead of 1
|
||||
```
|
||||
|
||||
### Alerts Not Triggering
|
||||
|
||||
1. **Check threshold operator**:
|
||||
```yaml
|
||||
# For available memory (alert when LOW):
|
||||
operator: "<" # NOT ">"
|
||||
```
|
||||
|
||||
2. **Verify numeric values**:
|
||||
- Ensure metric values are numeric
|
||||
- Check for unit mismatches (MB vs GB)
|
||||
|
||||
3. **Check if threshold is enabled**:
|
||||
```yaml
|
||||
enabled: true # NOT false
|
||||
```
|
||||
|
||||
4. **Review hysteresis settings**:
|
||||
- Very high hysteresis may prevent state changes
|
||||
- Try reducing or disabling temporarily
|
||||
|
||||
### Alert State Not Recovering
|
||||
|
||||
1. **Check recovery threshold calculation**:
|
||||
```
|
||||
Threshold: 90
|
||||
Hysteresis: 0.1
|
||||
Recovery: 90 - (90 * 0.1) = 81
|
||||
|
||||
Value must drop below 81 to recover
|
||||
```
|
||||
|
||||
2. **Temporarily disable hysteresis**:
|
||||
```yaml
|
||||
hysteresis: 0.0
|
||||
```
|
||||
|
||||
3. **Monitor actual metric values**:
|
||||
```bash
|
||||
# Check journal for actual values
|
||||
grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
|
||||
```
|
||||
|
||||
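The recovery calculation from step 1 can be sketched as follows. For the `>` and `>=` operators this follows the formula shown above (`threshold - threshold * hysteresis`); mirroring it for `<` and `<=` by adding the offset is an assumption, and both function names are hypothetical.

```python
def recovery_threshold(threshold, hysteresis, op=">"):
    """Return the value a metric must cross before the alert is considered recovered."""
    offset = threshold * hysteresis
    if op in (">", ">="):
        return threshold - offset
    # Assumed mirror image for "<" and "<=": recover once the value is comfortably above.
    return threshold + offset


def has_recovered(value, threshold, hysteresis, op=">"):
    """Return True when `value` has moved past the recovery threshold."""
    limit = recovery_threshold(threshold, hysteresis, op)
    if op in (">", ">="):
        return value < limit
    return value > limit


print(recovery_threshold(90, 0.1))   # recovery point for threshold 90, hysteresis 0.1
print(has_recovered(85, 90, 0.1))    # still between 81 and 90, so not recovered
print(has_recovered(80, 90, 0.1))    # below the recovery point
```

A value that hovers between the recovery point and the threshold keeps the alert active, which is the usual reason an alert appears "stuck" even though the metric is back under the configured threshold.
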
## Advanced Topics
|
||||
|
||||
### Custom Notification Callbacks
|
||||
|
||||
The ThresholdChecker supports custom notification functions:
|
||||
|
||||
```python
# `pagerduty`, `logger`, and `metrics` are placeholders for clients configured elsewhere.
def custom_notifier(message):
    # Send to incident management system
    pagerduty.trigger(message)

    # Log to custom system
    logger.critical(message)

    # Update dashboard
    metrics.alert_count.inc()


checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)
```
|
||||
|
||||
### Programmatic Access
|
||||
|
||||
Query alert states programmatically:
|
||||
|
||||
```python
import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)

for alert in active:
    print(f"{alert.metric_path}: {alert.level.name} for {time.time() - alert.since}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```
|
||||
|
||||
### Integration with External Systems
|
||||
|
||||
Threshold violations can be integrated with:
|
||||
|
||||
- **PagerDuty**: Incident creation and escalation
|
||||
- **OpsGenie**: On-call scheduling and routing
|
||||
- **ServiceNow**: Ticket creation
|
||||
- **Grafana**: Dashboard annotations
|
||||
- **Elasticsearch**: Alert indexing and analysis
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
Planned features:
|
||||
|
||||
1. **Composite thresholds**: Alert based on multiple metrics
|
||||
```yaml
composite:
  high_load_with_low_memory:
    conditions:
      - cpu_monitor.load_1min > 8.0
      - memory_monitor.available_mb < 500
```
|
||||
|
||||
2. **Time-based thresholds**: Different thresholds by time of day
|
||||
```yaml
schedule:
  business_hours:
    warning: 70.0
  off_hours:
    warning: 85.0
```
|
||||
|
||||
3. **Rate-of-change thresholds**: Alert on rapid changes
|
||||
```yaml
rate_of_change:
  metric: cpu_percent
  period: 300
  threshold: 30.0  # Alert if the value changes by more than 30% within 5 minutes
```
|
||||
|
||||
4. **Alert grouping**: Combine related alerts
|
||||
```yaml
groups:
  disk_critical:
    metrics:
      - disk_monitor./.percent
      - disk_monitor./var.percent
    action: single_notification
```
|
||||
|
||||
5. **Maintenance windows**: Suppress alerts during planned maintenance
|
||||
```yaml
maintenance:
  - host: webserver01
    start: 2024-01-15T02:00:00Z
    end: 2024-01-15T04:00:00Z
```
|
||||
|
||||
## See Also
|
||||
|
||||
- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
|
||||
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
|
||||
- Configuration examples: `hbd/config_thresholds_example.yaml`
|
||||
- Test suite: `test_threshold.py`
|
||||