Major refactoring of the codebase, including restructuring of files and directories, renaming of modules and classes, and improvements to the overall organization and readability of the code. This refactoring aims to enhance maintainability, scalability, and clarity of the codebase while preserving existing functionality. The changes include:

- Restructuring of the project directory into client and server components
- Renaming of modules and classes to better reflect their purpose and functionality
- Moving common utilities and configurations to a shared location
- Updating import statements to reflect the new structure
- Adding new documentation files for better clarity on various aspects of the project
- Removing deprecated or unused code to streamline the codebase
- Ensuring that all existing functionality is preserved and that the codebase remains functional after the refactoring.
This commit is contained in:
Andreas Wrede
2026-03-29 11:13:40 -04:00
parent 7e2038ecac
commit 0543266c92
65 changed files with 11371 additions and 140 deletions
+532
@@ -0,0 +1,532 @@
# HTTP API and Web UI Documentation
## Overview
The Heartbeat Daemon provides a comprehensive HTTP API and web-based UI for monitoring plugin data and alert states. The API follows RESTful conventions and returns JSON responses.
## Base URL
All API endpoints are relative to the server base URL:
```
http://your-server:50004
```
Default port is `50004` (configurable via `hbd_port` in configuration).
---
## API Endpoints
### Host Management
#### GET /api/0/hosts
Get list of all monitored hosts with their state information.
**Response:**
```json
[
{
"name": "webserver01",
"dyn": false,
"ver": 6,
"connections": [...]
}
]
```
#### GET /api/0/messages
Get recent heartbeat messages (last 30).
**Response:**
```json
[
{
"time": 1711234567.123,
"host": "webserver01",
"msg": "heartbeat received"
}
]
```
---
### Plugin Data Endpoints
#### GET /api/0/hosts/{hostname}/plugins
Get all plugin data for a specific host.
**Parameters:**
- `hostname` (path): Name of the host
**Response:**
```json
{
"hostname": "webserver01",
"plugins": {
"cpu_monitor": {
"timestamp": 1711234567.123,
"data": {
"cpu_percent": 45.2,
"load_1min": 2.5,
"load_5min": 2.1,
"load_15min": 1.8
},
"sample_count": 100
},
"memory_monitor": {
"timestamp": 1711234568.456,
"data": {
"percent": 65.4,
"available_mb": 4096,
"total_mb": 16384
},
"sample_count": 100
}
}
}
```
**Example:**
```bash
curl http://localhost:50004/api/0/hosts/webserver01/plugins
```
#### GET /api/0/hosts/{hostname}/plugins/{plugin_name}
Get detailed historical data for a specific plugin.
**Parameters:**
- `hostname` (path): Name of the host
- `plugin_name` (path): Name of the plugin
- `limit` (query, optional): Number of recent samples to return (default: 10)
**Response:**
```json
{
"hostname": "webserver01",
"plugin": "cpu_monitor",
"samples": [
{
"timestamp": 1711234567.123,
"data": {
"cpu_percent": 45.2,
"load_1min": 2.5
}
},
{
"timestamp": 1711234267.123,
"data": {
"cpu_percent": 42.1,
"load_1min": 2.3
}
}
],
"sample_count": 2
}
```
**Examples:**
```bash
# Get the most recent sample (quote the URL so the shell does not treat `?` as a glob)
curl 'http://localhost:50004/api/0/hosts/webserver01/plugins/cpu_monitor?limit=1'
# Get last 50 samples
curl 'http://localhost:50004/api/0/hosts/webserver01/plugins/memory_monitor?limit=50'
# Get disk monitor data (default limit applies)
curl http://localhost:50004/api/0/hosts/database01/plugins/disk_monitor
```
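The `limit` parameter and the `samples` array can also be consumed from Python. The following is a minimal sketch using only the standard library; `plugin_url`, `fetch_samples`, and `average_metric` are illustrative names that are not part of the project, and the base URL assumes the default port documented above.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:50004"  # assumption: default port from this document


def plugin_url(hostname, plugin_name, limit=None):
    """Build the plugin history URL, adding ?limit= only when requested."""
    url = f"{BASE_URL}/api/0/hosts/{quote(hostname)}/plugins/{quote(plugin_name)}"
    if limit is not None:
        url += "?" + urlencode({"limit": limit})
    return url


def fetch_samples(hostname, plugin_name, limit=None):
    """Fetch and return the 'samples' list (performs a network request)."""
    with urlopen(plugin_url(hostname, plugin_name, limit), timeout=5) as resp:
        return json.load(resp)["samples"]


def average_metric(samples, metric):
    """Average one metric over the samples that actually carry it."""
    values = [s["data"][metric] for s in samples if metric in s.get("data", {})]
    return sum(values) / len(values) if values else None
```

For example, `average_metric(fetch_samples("webserver01", "cpu_monitor", limit=50), "cpu_percent")` would average CPU usage over the last 50 samples, provided the server is reachable.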
---
### Alert Endpoints
#### GET /api/0/hosts/{hostname}/alerts
Get alert states for a specific host.
**Parameters:**
- `hostname` (path): Name of the host
**Response:**
```json
{
"hostname": "webserver01",
"alerts": [
{
"metric_path": "cpu_monitor.cpu_percent",
"level": "WARNING",
"since": 1711234000.0,
"last_value": 85.5,
"last_check": 1711234567.123,
"notification_count": 2
},
{
"metric_path": "disk_monitor./.percent",
"level": "OK",
"since": 1711230000.0,
"last_value": 65.0,
"last_check": 1711234567.123,
"notification_count": 0
}
],
"summary": {
"ok": 15,
"warning": 1,
"critical": 0,
"unknown": 0
}
}
```
**Example:**
```bash
curl http://localhost:50004/api/0/hosts/webserver01/alerts
```
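The `since` and `level` fields are enough to derive how long an alert has been in its current state. The sketch below assumes that `since` is the Unix time at which the alert entered its current level (the response format suggests this, but it is not stated explicitly); the function names are illustrative.

```python
import time


def alert_duration(alert, now=None):
    """Seconds the alert has been at its current level (now - since)."""
    now = time.time() if now is None else now
    return max(0.0, now - alert["since"])


def format_duration(seconds):
    """Render a duration as e.g. '0h 09m 27s'."""
    seconds = int(seconds)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


def non_ok_alerts(response):
    """Return alerts from a per-host response whose level is not OK."""
    return [a for a in response["alerts"] if a["level"] != "OK"]
```

Passing `now` explicitly keeps the helpers deterministic, which is convenient when replaying stored responses.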
#### GET /api/0/alerts
Get all active alerts across all monitored hosts.
**Response:**
```json
{
"alerts": [
{
"hostname": "webserver01",
"metric_path": "cpu_monitor.cpu_percent",
"level": "CRITICAL",
"since": 1711234000.0,
"last_value": 95.5,
"last_check": 1711234567.123,
"notification_count": 3
},
{
"hostname": "database01",
"metric_path": "memory_monitor.percent",
"level": "WARNING",
"since": 1711233000.0,
"last_value": 88.2,
"last_check": 1711234567.123,
"notification_count": 1
}
],
"summary": {
"critical": 1,
"warning": 1,
"unknown": 0,
"total": 2
},
"host_count": 5
}
```
**Example:**
```bash
curl http://localhost:50004/api/0/alerts | jq .
```
---
## Web UI Pages
### Live Dashboard
**URL:** `/live`
Real-time dashboard showing:
- Host connection states
- IPv4/IPv6 connectivity
- Latency metrics
- Recent messages
**Features:**
- WebSocket-powered live updates
- Sortable columns
- Color-coded status indicators
### Plugin Metrics
**URL:** `/plugins`
Interactive visualization of plugin metrics:
- Select host and plugin from dropdown
- View current metric values
- Automatic refresh every 30 seconds
- Support for nested metrics (e.g., per-partition disk stats)
**Features:**
- Card-based metric display
- Unit formatting (%, MB, GB)
- Nested object visualization
- Auto-refresh
**Available data:**
- CPU usage, load average, frequency
- Memory usage, available memory, swap
- Disk usage per partition, I/O statistics
- Network interface statistics, connection counts
- Custom plugin data
### Alerts Dashboard
**URL:** `/alerts`
Comprehensive alert monitoring:
- Summary cards (Critical, Warning, Total Hosts)
- Filter by severity (All, Critical, Warning)
- Alert details with duration
- Auto-refresh every 15 seconds
**Features:**
- Color-coded alert levels
- Duration tracking
- Filterable list
- Real-time updates
- Summary statistics
---
## Integration Examples
### Monitoring Script
```bash
#!/bin/bash
# Check for critical alerts and send notification
RESPONSE=$(curl -s http://localhost:50004/api/0/alerts)
CRITICAL_COUNT=$(echo "$RESPONSE" | jq '.summary.critical')
if [ "$CRITICAL_COUNT" -gt 0 ]; then
echo "CRITICAL: $CRITICAL_COUNT critical alerts detected!"
echo "$RESPONSE" | jq '.alerts[] | select(.level=="CRITICAL")'
# Send notification
# mail -s "Critical Alerts" admin@example.com < alert_details.txt
fi
```
### Python Client
```python
import requests
import json

# Get all plugin data for a host
response = requests.get('http://localhost:50004/api/0/hosts/webserver01/plugins')
data = response.json()
print(f"Host: {data['hostname']}")
print(f"Plugins: {', '.join(data['plugins'].keys())}")
for plugin, info in data['plugins'].items():
    print(f"\n{plugin}:")
    for metric, value in info['data'].items():
        print(f" {metric}: {value}")

# Check for alerts
response = requests.get('http://localhost:50004/api/0/alerts')
alerts = response.json()
if alerts['summary']['critical'] > 0:
    print(f"\n⚠️ {alerts['summary']['critical']} CRITICAL ALERTS!")
    for alert in alerts['alerts']:
        if alert['level'] == 'CRITICAL':
            print(f" - {alert['hostname']}: {alert['metric_path']} = {alert['last_value']}")
```
### Grafana Integration
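The API endpoints can be queried from Grafana through a JSON datasource plugin:
1. Install a datasource plugin that can query arbitrary JSON endpoints (for example, JSON API or Infinity). The SimpleJSON datasource expects its own `/search` and `/query` backend endpoints, which are not among the endpoints documented here.
2. Configure datasource URL: `http://your-server:50004`
3. Create queries:
   - Metrics: `/api/0/hosts/webserver01/plugins/cpu_monitor?limit=100`
   - Alerts: `/api/0/alerts`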
### Prometheus Integration
Export metrics in Prometheus format (future enhancement):
```python
# Example prometheus exporter
from prometheus_client import Gauge, generate_latest
import requests

cpu_usage = Gauge('heartbeat_cpu_percent', 'CPU usage percentage', ['hostname'])
memory_usage = Gauge('heartbeat_memory_percent', 'Memory usage percentage', ['hostname'])

def collect_metrics():
    hosts = requests.get('http://localhost:50004/api/0/hosts').json()
    for host in hosts:
        hostname = host['name']
        plugins = requests.get(f'http://localhost:50004/api/0/hosts/{hostname}/plugins').json()
        if 'cpu_monitor' in plugins['plugins']:
            cpu_data = plugins['plugins']['cpu_monitor']['data']
            cpu_usage.labels(hostname=hostname).set(cpu_data.get('cpu_percent', 0))
        if 'memory_monitor' in plugins['plugins']:
            mem_data = plugins['plugins']['memory_monitor']['data']
            memory_usage.labels(hostname=hostname).set(mem_data.get('percent', 0))
```
---
## Response Formats
### Success Response
All successful API calls return HTTP 200 with JSON body:
```json
{
"field": "value",
...
}
```
### Error Response
API errors return appropriate HTTP status codes with JSON:
```json
{
"error": "Host 'unknown-host' not found"
}
```
**Common Status Codes:**
- `200 OK` - Success
- `400 Bad Request` - Invalid parameters
- `404 Not Found` - Resource not found
- `500 Internal Server Error` - Server error
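A client can turn these error responses into exceptions that carry both the status code and the server's `error` text. The following is a minimal standard-library sketch; `ApiError`, `error_message`, and `get_json` are hypothetical helper names, and the fallback for non-JSON bodies is an assumption about how a proxy or crash page might respond.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


class ApiError(Exception):
    """Raised when the API answers with a non-2xx status."""

    def __init__(self, status, message):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message


def error_message(body):
    """Extract the 'error' field from an error body, falling back to raw text."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return body.strip() or "unknown error"
    if isinstance(parsed, dict) and "error" in parsed:
        return str(parsed["error"])
    return body.strip() or "unknown error"


def get_json(url, timeout=5):
    """GET a URL and return decoded JSON, raising ApiError on HTTP errors."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except HTTPError as exc:
        body = exc.read().decode("utf-8", errors="replace")
        raise ApiError(exc.code, error_message(body)) from exc
```

Callers can then branch on `exc.status`, for instance treating `404` as "host not yet seen" and anything else as a failure.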
---
## WebSocket API
For real-time updates, connect to the WebSocket endpoint:
**URL:** `ws://your-server:50005/hbd` (or `wss://` on the configured `wss_port` for secure connections)
**Messages:**
```json
{
"type": "host",
"data": {
"name": "webserver01",
"state": "UP"
}
}
```
```json
{
"type": "plugin",
"data": {
"host": "webserver01",
"plugin": "cpu_monitor",
"data": {...},
"timestamp": 1711234567.123
}
}
```
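Both message shapes share a `type` discriminator, so a consumer can dispatch on it. The sketch below handles already-received text frames and leaves the actual WebSocket connection to whichever client library is in use; `handle_frame` and the layout of `state` are illustrative choices, not part of hbd.

```python
import json


def handle_frame(raw, state):
    """Apply one decoded WebSocket text frame to an in-memory state dict."""
    message = json.loads(raw)
    kind = message.get("type")
    data = message.get("data", {})
    if kind == "host":
        # Track the latest known state per host name.
        state.setdefault("hosts", {})[data["name"]] = data.get("state")
    elif kind == "plugin":
        # Keep only the most recent sample per (host, plugin) pair.
        key = (data["host"], data["plugin"])
        state.setdefault("plugins", {})[key] = {
            "timestamp": data.get("timestamp"),
            "data": data.get("data", {}),
        }
    else:
        # Unknown types are recorded rather than raising, so new
        # message kinds do not break older clients.
        state.setdefault("ignored", []).append(kind)
    return state
```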
---
## Configuration
### Enable HTTP Server
```yaml
# In your hbd configuration file
hbd_host: "" # Listen on all interfaces
hbd_port: 50004 # HTTP port
ws_port: 50005 # WebSocket port (optional)
# wss_port: 50006 # Secure WebSocket (requires SSL)
```
### SSL/TLS Configuration
For secure WebSocket connections:
```yaml
wss_port: 50006
cert_path: /etc/heartbeat/certs/
wss_pem: server.pem
wss_key: server.key
```
---
## Rate Limiting
The API currently does not implement rate limiting. For production use, consider:
- Placing behind a reverse proxy (nginx, Apache)
- Using API gateway for rate limiting
- Implementing caching for frequently accessed endpoints
---
## CORS Support
By default, CORS is not enabled. To enable for web applications:
```python
# In http.py, add CORS middleware
import aiohttp_cors
from aiohttp import web

app = web.Application()
cors = aiohttp_cors.setup(app)
# Register the application's routes first, then configure CORS for all of them
for route in list(app.router.routes()):
    cors.add(route, {
        "*": aiohttp_cors.ResourceOptions(
            allow_credentials=True,
            expose_headers="*",
            allow_headers="*",
        )
    })
```
---
## Performance Considerations
### Caching
- Plugin data is cached in memory (last 100 samples per plugin)
- No database queries required
- Responses are fast (<10ms typical)
### Scalability
- Each host stores its own data independently
- Memory usage: ~1KB per host + ~1KB per plugin sample
- For 100 hosts with 5 plugins each, at the 100-sample retention: 100 × 5 × 100 × ~1KB ≈ 50MB memory
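The estimate can be reproduced with a few lines of arithmetic. This sketch takes the ~1KB figures and the 100-sample retention from this section at face value; `estimate_memory_bytes` is an illustrative name, and real usage will vary with the size of each plugin's payload.

```python
SAMPLES_PER_PLUGIN = 100   # retention stated under "Caching"
BYTES_PER_HOST = 1024      # ~1KB per host (approximate)
BYTES_PER_SAMPLE = 1024    # ~1KB per plugin sample (approximate)


def estimate_memory_bytes(hosts, plugins_per_host, samples=SAMPLES_PER_PLUGIN):
    """Rough in-memory footprint for the cached plugin samples."""
    per_host = BYTES_PER_HOST + plugins_per_host * samples * BYTES_PER_SAMPLE
    return hosts * per_host


estimate = estimate_memory_bytes(hosts=100, plugins_per_host=5)
print(f"{estimate / 1024 / 1024:.1f} MiB")
```

The per-host overhead is negligible next to the sample history, so the total scales almost linearly with hosts × plugins × retained samples.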
### Best Practices
1. Use `limit` parameter to control response size
2. Cache responses on client side when appropriate
3. Use WebSocket for real-time updates instead of polling
4. Consider pagination for large deployments (future enhancement)
---
## Troubleshooting
### API Returns 404
- Verify hostname in URL matches actual host name
- Check host is sending heartbeats: `curl http://localhost:50004/api/0/hosts`
### No Plugin Data
- Verify client is configured with plugins
- Check client logs for plugin errors
- Ensure plugins are sending data (check journal logs)
### Empty Alerts
- Verify thresholds are configured
- Check host is in `watchhosts` list
- Ensure plugins are collecting metrics
- Review server logs for threshold checker errors
---
## See Also
- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Threshold Alerting Documentation](THRESHOLD_ALERTING.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_example.yaml`
+413
@@ -0,0 +1,413 @@
# Message Journal
The message journal provides persistent logging of all received heartbeat messages with automatic size-based log rotation.
## Overview
The journal logs every message received by the heartbeat daemon (hbd) in JSON format, making it easy to:
- Audit message history
- Debug connection issues
- Analyze traffic patterns
- Replay messages for testing
- Create historical reports
## Features
- **JSON Format**: Each message is logged as a single JSON line for easy parsing
- **Size-Based Rotation**: Automatically rotates logs when size threshold is reached
- **Automatic Cleanup**: Keeps only a configurable number of backup files
- **Thread-Safe**: Safe for concurrent access from multiple async tasks
- **Configurable**: All settings controllable via configuration file
- **Performance**: Non-blocking async operation with minimal overhead
## Configuration
Add these settings to your hbd configuration file (e.g., `.hb.yaml`):
```yaml
# Message journal configuration
journal_enabled: true # Enable/disable journaling
journal_dir: /var/log/heartbeat # Directory for journal files
journal_file: messages.journal # Base filename
journal_max_size: 104857600 # Max size in bytes (100MB default)
journal_max_backups: 10 # Number of backup files to keep
```
### Configuration Options
| Option | Default | Description |
|--------|---------|-------------|
| `journal_enabled` | `true` | Enable or disable message journaling |
| `journal_dir` | `/var/log/heartbeat` | Directory where journal files are stored |
| `journal_file` | `messages.journal` | Base filename for the journal |
| `journal_max_size` | `104857600` (100MB) | Maximum file size before rotation |
| `journal_max_backups` | `10` | Number of rotated backup files to keep |
## File Format
Messages are logged in JSONL (JSON Lines) format - one JSON object per line:
```json
{"timestamp":1711234567.123,"datetime":"2026-03-28T12:34:56","source_ip":"192.168.1.100","source_port":50003,"message":{"ID":"HTB","name":"webserver1","interval":30,"ver":1}}
{"timestamp":1711234597.456,"datetime":"2026-03-28T12:35:37","source_ip":"192.168.1.101","source_port":50003,"message":{"ID":"PLG","plugin":"cpu_monitor","cpu_percent":45.2,"load_1min":1.5}}
```
### Entry Structure
Each journal entry contains:
| Field | Type | Description |
|-------|------|-------------|
| `timestamp` | float | Unix timestamp (seconds since epoch) |
| `datetime` | string | ISO 8601 formatted datetime |
| `source_ip` | string | Source IP address |
| `source_port` | integer | Source UDP port |
| `message` | object | Complete parsed message dictionary |
## Log Rotation
### How Rotation Works
1. Journal writes messages to the current file
2. When file size exceeds `journal_max_size`, rotation is triggered
3. Current file is renamed with timestamp: `messages.journal.YYYYMMDD-HHMMSS`
4. New empty file is created as the current journal
5. Old backup files exceeding `journal_max_backups` are deleted
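The steps above can be sketched as a single function. This is not the hbd implementation: it is a synchronous illustration with hypothetical names, it triggers when the next write would exceed the limit, and it does not guard against two rotations landing in the same second (which would reuse a backup name).

```python
import os
import time
from pathlib import Path


def rotate_if_needed(journal, next_len, max_size, max_backups, now=None):
    """Rotate `journal` if appending `next_len` bytes would exceed `max_size`.

    Returns the backup Path when a rotation happened, else None.
    """
    journal = Path(journal)
    current = journal.stat().st_size if journal.exists() else 0
    if current == 0 or current + next_len <= max_size:
        return None

    # Step 3: rename the current file with a YYYYMMDD-HHMMSS suffix.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    backup = journal.with_name(f"{journal.name}.{stamp}")
    os.replace(journal, backup)

    # Step 4: start a new, empty current journal.
    journal.touch()

    # Step 5: the timestamp suffix sorts lexicographically, so the
    # oldest backups come first and are the ones removed.
    backups = sorted(journal.parent.glob(journal.name + ".*"))
    for old in backups[:max(0, len(backups) - max_backups)]:
        old.unlink()
    return backup
```

Because the suffix format sorts chronologically, no extra metadata is needed to decide which backups to prune.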
### Example File Structure
```
/var/log/heartbeat/
├── messages.journal # Current active journal
├── messages.journal.20260328-120000 # Rotated backup (oldest)
├── messages.journal.20260328-140000 # Rotated backup
└── messages.journal.20260328-160000 # Rotated backup (newest)
```
### Rotation Behavior
- Rotation is triggered when the next message would exceed the size limit
- Rotation is automatic and requires no manual intervention
- Old backups are deleted in FIFO order (oldest first)
- Rotation is thread-safe and won't lose messages
## Usage Examples
### Reading Journal Files
#### Using Python
```python
import json

# Read all entries from current journal
with open('/var/log/heartbeat/messages.journal', 'r') as f:
    for line in f:
        entry = json.loads(line)
        print(f"{entry['datetime']} - {entry['source_ip']} - {entry['message']['ID']}")
```
#### Using jq (command line)
```bash
# View all messages
cat /var/log/heartbeat/messages.journal | jq .
# Filter by message type
cat /var/log/heartbeat/messages.journal | jq 'select(.message.ID == "HTB")'
# Filter by hostname
cat /var/log/heartbeat/messages.journal | jq 'select(.message.name == "webserver1")'
# Count messages by type
cat /var/log/heartbeat/messages.journal | jq -r '.message.ID' | sort | uniq -c
# Extract timestamps and source IPs
cat /var/log/heartbeat/messages.journal | jq -r '[.datetime, .source_ip, .message.ID] | @tsv'
```
#### Using shell tools
```bash
# Count total messages
wc -l /var/log/heartbeat/messages.journal
# View recent messages
tail -n 100 /var/log/heartbeat/messages.journal | jq .
# Search for specific host
grep -F '"name":"webserver1"' /var/log/heartbeat/messages.journal
# Check journal file size
du -h /var/log/heartbeat/messages.journal
```
### Analyzing Historical Data
```bash
# Combine all journal files (current + backups)
cat /var/log/heartbeat/messages.journal* | jq . > all_messages.json
# Count messages per host
cat /var/log/heartbeat/messages.journal* | jq -r '.message.name // "unknown"' | sort | uniq -c
# Find all plugin messages
cat /var/log/heartbeat/messages.journal* | jq 'select(.message.ID == "PLG")'
# Extract CPU metrics from plugin messages
cat /var/log/heartbeat/messages.journal* | \
jq 'select(.message.plugin == "cpu_monitor") | {time: .datetime, host: .message.name, cpu: .message.cpu_percent}'
```
## Integration with Log Management
### Logrotate
While the journal has built-in rotation, you can also use logrotate for additional management:
```
/var/log/heartbeat/messages.journal.* {
daily
rotate 30
compress
delaycompress
missingok
notifempty
}
```
### Elasticsearch/OpenSearch
Import journal data into Elasticsearch for advanced analysis:
```python
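from elasticsearch import Elasticsearch
import json

# elasticsearch-py 8.x requires the URL scheme in the host string
es = Elasticsearch('http://localhost:9200')
with open('/var/log/heartbeat/messages.journal', 'r') as f:
    for line in f:
        entry = json.loads(line)
        # `document=` replaces the deprecated `body=` parameter
        es.index(index='heartbeat-messages', document=entry)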
```
### Splunk
Create a Splunk input for the journal:
```ini
[monitor:///var/log/heartbeat/messages.journal*]
sourcetype = heartbeat_json
index = heartbeat
```
## Performance Considerations
### Overhead
- Journal writing is async and non-blocking
- Typical overhead: < 1ms per message
- Minimal impact on heartbeat processing
### Disk Usage
Calculate expected disk usage:
```
Messages per day = (86400 seconds / interval) * number_of_hosts
Average message size ≈ 200-500 bytes
Daily disk usage = Messages per day * Average message size
Example:
- 100 hosts
- 30 second interval
- 2880 messages/day per host
- 288,000 messages/day total
- ~60-140 MB/day
```
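The same calculation in Python, extended with a rough retention estimate. The sketch counts heartbeat messages only (plugin messages add to the total), and `retention_days` assumes the current file plus all backups fill to `journal_max_size`; both function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def daily_journal_bytes(hosts, interval_s, avg_message_bytes):
    """Messages per day times average entry size."""
    messages_per_day = (SECONDS_PER_DAY / interval_s) * hosts
    return messages_per_day * avg_message_bytes


def retention_days(max_size_bytes, max_backups, daily_bytes):
    """Days of history kept by the current file plus its backups."""
    return max_size_bytes * (max_backups + 1) / daily_bytes


low = daily_journal_bytes(hosts=100, interval_s=30, avg_message_bytes=200)
high = daily_journal_bytes(hosts=100, interval_s=30, avg_message_bytes=500)
print(f"{low / 1e6:.1f} to {high / 1e6:.1f} MB/day")
print(f"{retention_days(104857600, 10, high):.0f} to "
      f"{retention_days(104857600, 10, low):.0f} days with default settings")
```

With the defaults (100MB files, 10 backups), the example deployment would keep roughly one to three weeks of history.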
### Recommendations
- **Small deployments** (< 50 hosts): Default settings work well
- **Medium deployments** (50-500 hosts): Increase `journal_max_size` to 500MB, `journal_max_backups` to 20
- **Large deployments** (> 500 hosts): Consider 1GB+ journal files, 30+ backups, or external log aggregation
## Monitoring
### Check Journal Status
The journal exposes statistics that can be queried:
```python
from hbd.journal import get_journal
journal = get_journal()
stats = journal.get_stats()
print(f"Current size: {stats['current_size']:,} bytes")
print(f"Rotation threshold: {stats['rotation_threshold']}")
```
### Log Messages
Journal operations are logged at appropriate levels:
- `INFO`: Initialization, rotation events, cleanup
- `DEBUG`: Individual message logging
- `WARNING`: Non-critical issues
- `ERROR`: Critical failures
Check hbd logs for journal-related messages:
```bash
grep journal /var/log/heartbeat.log
```
## Troubleshooting
### Journal Files Not Created
**Problem**: No journal files appear in the configured directory.
**Solutions**:
- Check `journal_enabled: true` in configuration
- Verify directory exists and hbd has write permissions
- Check hbd logs for initialization errors
- Verify disk space is available
### Rotation Not Working
**Problem**: Journal file grows beyond `journal_max_size`.
**Solutions**:
- Check that `journal_max_size` is properly configured
- Verify hbd has permission to rename/create files
- Check for filesystem issues
- Review hbd logs for rotation errors
### Missing Messages
**Problem**: Some messages don't appear in journal.
**Solutions**:
- Verify `journal_enabled: true`
- Check for write errors in hbd logs
- Verify sufficient disk space
- Check if filesystem is read-only
### Performance Issues
**Problem**: Journal causing slow message processing.
**Solutions**:
- Use faster storage (SSD) for journal directory
- Increase `journal_max_size` to reduce rotation frequency
- Disable journal if not needed: `journal_enabled: false`
- Consider async syslog forwarding instead
## Security Considerations
### File Permissions
Ensure proper permissions on journal files:
```bash
# Journal directory
chmod 750 /var/log/heartbeat
chown hbd:hbd /var/log/heartbeat
# Journal files
chmod 640 /var/log/heartbeat/messages.journal*
```
### Sensitive Data
Journal files may contain:
- Hostnames and IP addresses
- System metrics
- Custom message content
**Recommendations**:
- Restrict read access to authorized users only
- Consider encryption for archived journals
- Implement log retention policies
- Sanitize data if sharing for debugging
## API Reference
### MessageJournal Class
```python
class MessageJournal:
    def __init__(self, config: Dict[str, Any])
    async def initialize(self) -> bool
    async def log_message(self, msg: Dict, addr: tuple, timestamp: float)
    async def close(self)
    def get_stats(self) -> Dict[str, Any]
```
### Module Functions
```python
def get_journal(config: Dict = None) -> MessageJournal
async def log_message(msg: Dict, addr: tuple, timestamp: float = None)
```
## Example: Custom Message Processing
Process journal messages in real-time:
```python
import asyncio
import json
from pathlib import Path

async def tail_journal(journal_path):
    """Follow journal file and process new messages."""
    path = Path(journal_path)
    with open(path, 'r') as f:
        # Jump to end
        f.seek(0, 2)
        while True:
            line = f.readline()
            if line:
                entry = json.loads(line)
                await process_message(entry)
            else:
                await asyncio.sleep(0.1)

async def process_message(entry):
    """Process a journal entry."""
    msg = entry['message']
    # Alert on boot messages
    if msg.get('boot'):
        print(f"ALERT: {msg['name']} rebooted at {entry['datetime']}")
    # Track CPU usage
    if msg.get('ID') == 'PLG' and msg.get('plugin') == 'cpu_monitor':
        cpu = msg.get('cpu_percent', 0)
        if cpu > 90:
            print(f"WARNING: {entry['source_ip']} CPU usage: {cpu}%")
```
## Future Enhancements
Potential improvements for future versions:
- Compression of rotated logs (gzip)
- Time-based rotation in addition to size-based
- Filtering to exclude certain message types
- Structured logging output formats (CEF, GELF)
- Remote syslog forwarding
- Message deduplication
- Journal file encryption
- Signed journal entries
## See Also
- [Configuration Guide](../hbd/config.py) - Full configuration options
- [UDP Protocol](../hbd/udp.py) - Message handling
- [Server Architecture](../hbd/server.py) - Server initialization
+331
@@ -0,0 +1,331 @@
# Nagios Plugin Integration Guide
The Heartbeat monitoring system now supports running existing Nagios-compatible monitoring plugins through the `nagios_runner` plugin. This allows you to leverage the thousands of existing Nagios plugins without modification.
## Quick Start
### 1. Install Nagios Plugins
**Debian/Ubuntu:**
```bash
sudo apt-get install nagios-plugins
```
**RHEL/CentOS/Fedora:**
```bash
sudo yum install nagios-plugins-all
# or
sudo dnf install nagios-plugins-all
```
**Arch Linux:**
```bash
sudo pacman -S monitoring-plugins
```
### 2. Configure Heartbeat
Add the `nagios_runner` section to your `~/.hb.yaml` config:
```yaml
nagios_runner:
  interval: 60 # Run plugins every 60 seconds
  timeout: 30 # Command timeout in seconds
  commands:
    - name: check_disk_root
      command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
    - name: check_load
      command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
    - name: check_procs
      command: /usr/lib/nagios/plugins/check_procs -w 250 -c 400
### 3. Start Heartbeat Client
```bash
hbc -v localhost
```
The client will now execute the configured Nagios plugins and send their results to the server.
## How It Works
### Nagios Plugin Standard
Nagios plugins follow a simple interface:
1. **Exit Codes:**
   - `0` = OK
   - `1` = WARNING
   - `2` = CRITICAL
   - `3` = UNKNOWN
2. **Output Format:**
   ```
   STATUS - Message | performance_data
   ```
3. **Performance Data Format:**
   ```
   'label'=value[UOM];[warn];[crit];[min];[max]
   ```
### Example Plugin Output
```bash
$ /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
DISK OK - free space: / 156 GB (78%); | /=44GB;127;142;0;159
```
This output includes:
- **Status:** `DISK OK`
- **Message:** `free space: / 156 GB (78%)`
- **Performance Data:** `/=44GB;127;142;0;159`
  - Current value: 44GB
  - Warning threshold: 127GB
  - Critical threshold: 142GB
  - Min: 0GB
  - Max: 159GB
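The breakdown above can be reproduced with a small parser. This is a sketch of the general Nagios output convention, not the parser used by `nagios_runner`: it keeps the whole text before `|` as the output, takes the status from the exit code, and leaves thresholds as strings because they may be ranges. All names are illustrative.

```python
import re

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# 'label'=value[UOM];[warn];[crit];[min];[max]  (label quotes are optional)
PERF_RE = re.compile(
    r"(?:'(?P<qlabel>[^']+)'|(?P<label>[^\s=']+))="
    r"(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^;\s]*)"
    r"(?:;(?P<warn>[^;\s]*))?(?:;(?P<crit>[^;\s]*))?"
    r"(?:;(?P<min>[^;\s]*))?(?:;(?P<max>[^;\s]*))?"
)


def parse_plugin_output(line, exit_code):
    """Split one plugin output line into status, message, and metrics."""
    text, _, perf = line.partition("|")
    metrics = {}
    for m in PERF_RE.finditer(perf):
        label = m.group("qlabel") or m.group("label")
        entry = {"value": float(m.group("value")), "uom": m.group("uom") or None}
        for key in ("warn", "crit", "min", "max"):
            # Empty fields (e.g. "time=0.5s;;;0") are reported as None.
            entry[key] = m.group(key) or None
        metrics[label] = entry
    return {
        "status": STATUS_NAMES.get(exit_code, "UNKNOWN"),
        "status_code": exit_code,
        "output": text.strip(),
        "metrics": metrics,
    }
```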
### Data Collected
The `nagios_runner` plugin collects:
**For each configured command:**
- `{name}_status` - Status string (OK, WARNING, CRITICAL, UNKNOWN)
- `{name}_status_code` - Numeric exit code (0-3)
- `{name}_output` - Status message
- `{name}_{metric}` - Each performance metric value
- `{name}_{metric}_uom` - Unit of measurement (if present)
- `{name}_{metric}_warn` - Warning threshold (if present)
- `{name}_{metric}_crit` - Critical threshold (if present)
- `{name}_{metric}_min` - Minimum value (if present)
- `{name}_{metric}_max` - Maximum value (if present)
**Overall:**
- `overall_status` - Worst status from all commands
- `overall_status_code` - Worst status code
- `plugin_count` - Number of Nagios plugins executed
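The key naming scheme can be illustrated by flattening per-command results into one dictionary. This sketch is an assumption about how the keys are assembled, not the plugin's code; in particular it treats the highest exit code as "worst", so UNKNOWN (3) outranks CRITICAL (2), and the real plugin may rank statuses differently. The input shape and `flatten_results` are hypothetical.

```python
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def flatten_results(results):
    """Flatten {command_name: {"status_code", "output", "metrics"}} results."""
    data = {}
    worst = 0
    for name, res in results.items():
        code = res["status_code"]
        data[f"{name}_status"] = STATUS_NAMES.get(code, "UNKNOWN")
        data[f"{name}_status_code"] = code
        data[f"{name}_output"] = res["output"]
        for metric, fields in res.get("metrics", {}).items():
            data[f"{name}_{metric}"] = fields["value"]
            for suffix in ("uom", "warn", "crit", "min", "max"):
                # Optional fields are emitted only "if present".
                if fields.get(suffix) is not None:
                    data[f"{name}_{metric}_{suffix}"] = fields[suffix]
        # Assumption: a higher exit code counts as a worse status.
        worst = max(worst, code)
    data["overall_status"] = STATUS_NAMES.get(worst, "UNKNOWN")
    data["overall_status_code"] = worst
    data["plugin_count"] = len(results)
    return data
```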
## Configuration Options
```yaml
nagios_runner:
  # Collection interval in seconds (default: 60)
  interval: 60
  # Command execution timeout in seconds (default: 30)
  timeout: 30
  # Execute commands via shell (default: true)
  # Set to false for direct execution (more secure but less flexible)
  shell: true
  # List of Nagios plugins to run
  commands:
    - name: unique_name # Required: unique identifier
      command: /path/to/plugin [args] # Required: full command to execute
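The `timeout` and `shell` options map naturally onto Python's `subprocess.run`. The sketch below shows one way a runner might apply them on a POSIX system; it is not the `nagios_runner` source, `run_check` is a hypothetical name, and mapping failures to exit code 3 (UNKNOWN) is an assumption.

```python
import shlex
import subprocess


def run_check(command, timeout=30, shell=True):
    """Run one plugin command and return (exit_code, first_output_line)."""
    # With shell=False the command is split into argv and executed directly,
    # so shell features (pipes, variables, globs) are not interpreted.
    argv = command if shell else shlex.split(command)
    try:
        proc = subprocess.run(
            argv, shell=shell, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return 3, f"Command timed out after {timeout}s"
    except (FileNotFoundError, PermissionError) as exc:
        return 3, f"Error: {exc}"
    # Exit codes outside 0-3 are not defined by the plugin standard.
    code = proc.returncode if proc.returncode in (0, 1, 2, 3) else 3
    lines = proc.stdout.splitlines()
    return code, lines[0] if lines else ""
```

Direct execution avoids shell injection through configuration values, at the cost of not supporting pipes or variable expansion in `command`.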
## Common Nagios Plugins
### System Resources
**Disk Space:**
```yaml
- name: check_disk_root
  command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```
**Load Average:**
```yaml
- name: check_load
  command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
```
**Swap Usage:**
```yaml
- name: check_swap
  command: /usr/lib/nagios/plugins/check_swap -w 20% -c 10%
```
**Process Count:**
```yaml
- name: check_procs
  command: /usr/lib/nagios/plugins/check_procs -w 250 -c 400
```
**Users Logged In:**
```yaml
- name: check_users
  command: /usr/lib/nagios/plugins/check_users -w 5 -c 10
```
### Network Services
**SSH:**
```yaml
- name: check_ssh
  command: /usr/lib/nagios/plugins/check_ssh localhost
```
**HTTP:**
```yaml
- name: check_http_local
  command: /usr/lib/nagios/plugins/check_http -H localhost
- name: check_http_ssl
  command: /usr/lib/nagios/plugins/check_http -H example.com --ssl
```
**DNS:**
```yaml
- name: check_dns
  command: /usr/lib/nagios/plugins/check_dns -H google.com
```
**Ping:**
```yaml
- name: check_ping_gateway
  command: /usr/lib/nagios/plugins/check_ping -H 192.168.1.1 -w 100,20% -c 500,60%
```
### Databases
**MySQL:**
```yaml
- name: check_mysql
  command: /usr/lib/nagios/plugins/check_mysql -H localhost -u user -p password
```
**PostgreSQL:**
```yaml
- name: check_pgsql
  command: /usr/lib/nagios/plugins/check_pgsql -H localhost -d database
```
## Writing Custom Nagios Plugins
You can write your own Nagios-compatible plugins in any language. Here's a simple example:
**Bash:**
```bash
#!/bin/bash
# /usr/local/bin/check_example.sh
# Get the value to check
value=$(some_command)
# Define thresholds
warn=80
crit=90
# Check and output result
# Quote variables so an empty value fails the test instead of breaking the syntax
if [ "$value" -ge "$crit" ]; then
    echo "CRITICAL - Value is $value | value=${value};${warn};${crit};0;100"
    exit 2
elif [ "$value" -ge "$warn" ]; then
    echo "WARNING - Value is $value | value=${value};${warn};${crit};0;100"
    exit 1
else
    echo "OK - Value is $value | value=${value};${warn};${crit};0;100"
    exit 0
fi
```
**Python:**
```python
#!/usr/bin/env python3
# /usr/local/bin/check_example.py
import sys


def get_value():
    """Placeholder (hypothetical): replace with your own measurement."""
    return 42


def check_something():
    value = get_value()  # Your check logic here
    warn = 80
    crit = 90
    perfdata = f"value={value};{warn};{crit};0;100"
    if value >= crit:
        print(f"CRITICAL - Value is {value} | {perfdata}")
        sys.exit(2)
    elif value >= warn:
        print(f"WARNING - Value is {value} | {perfdata}")
        sys.exit(1)
    else:
        print(f"OK - Value is {value} | {perfdata}")
        sys.exit(0)


if __name__ == "__main__":
    check_something()
```
Then configure in Heartbeat:
```yaml
nagios_runner:
  commands:
    - name: my_custom_check
      command: /usr/local/bin/check_example.sh
```
## Troubleshooting
### Plugin not found
```
Error: Command not found
```
**Solution:** Use the full path to the plugin. Common locations:
- `/usr/lib/nagios/plugins/`
- `/usr/lib64/nagios/plugins/`
- `/usr/local/nagios/libexec/`
### Permission denied
```
Error: Permission denied
```
**Solution:** Ensure the plugin is executable:
```bash
chmod +x /path/to/plugin
```
### Timeout errors
```
Command timed out after 30s
```
**Solution:** Increase the timeout in config:
```yaml
nagios_runner:
  timeout: 60 # Increase timeout
```
### No performance data
If performance data is not being parsed:
1. Check plugin output includes `|` separator
2. Verify performance data format: `'label'=value[UOM];...`
3. Enable debug logging: `hbc -v -x localhost`
## Benefits
1. **Massive Plugin Library:** Thousands of existing Nagios plugins available
2. **No Rewriting:** Use plugins as-is without modification
3. **Community Support:** Well-documented and maintained plugins
4. **Flexibility:** Mix Nagios plugins with native Heartbeat plugins
5. **Standard Interface:** Consistent exit codes and output format
6. **Performance Data:** Automatic extraction of metrics
## Resources
- [Nagios Plugin Development Guidelines](https://nagios-plugins.org/doc/guidelines.html)
- [Monitoring Plugins Project](https://www.monitoring-plugins.org/)
- [Nagios Exchange](https://exchange.nagios.org/) - Plugin repository
- [Check_MK Local Checks](https://docs.checkmk.com/latest/en/localchecks.html) - Compatible format
## Next Steps
- Configure threshold alerts based on Nagios plugin status codes
- View plugin data in the Heartbeat web UI
- Create custom plugins for your specific monitoring needs
- Integrate with existing Nagios/Icinga configurations
+544
@@ -0,0 +1,544 @@
# Plugin Development Guide
This guide explains how to create custom plugins for the Heartbeat monitoring system.
## Table of Contents
- [Plugin Architecture](#plugin-architecture)
- [Plugin Types](#plugin-types)
- [Creating a Plugin](#creating-a-plugin)
- [Plugin Lifecycle](#plugin-lifecycle)
- [Configuration](#configuration)
- [Best Practices](#best-practices)
- [Examples](#examples)
- [Testing](#testing)
## Plugin Architecture
Heartbeat's plugin system is designed to be simple yet powerful. Plugins are Python classes that inherit from one of the base plugin types and implement a few key methods.
### Key Concepts
- **Plugin Registry**: Central registry that manages all loaded plugins
- **Plugin Loader**: Automatically discovers and loads plugins from the `hbd/plugins/` directory
- **Plugin Types**: InfoPlugin (static data) and MonitorPlugin (periodic metrics)
- **Async/Await**: All plugin methods are async for non-blocking operation
## Plugin Types
### InfoPlugin
InfoPlugins collect static information that doesn't change frequently (OS version, hardware specs, etc.).
- **Runs once** at startup (interval = 0)
- **Cached** - data is collected once and reused
- **Lightweight** - no periodic overhead
**Use InfoPlugin for:**
- Operating system details
- Hardware information
- Software versions
- Configuration data
- Static inventory
### MonitorPlugin
MonitorPlugins collect metrics that change over time (CPU usage, memory, network traffic).
- **Runs periodically** based on configured interval
- **Scheduled** - collected at regular intervals
- **Dynamic** - captures changing system state
**Use MonitorPlugin for:**
- Resource usage (CPU, memory, disk, network)
- Performance metrics
- Counters and gauges
- Time-series data
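The difference between the two types comes down to how `interval` drives scheduling. The sketch below uses stand-in classes rather than the real `InfoPlugin` and `MonitorPlugin` base classes, and `run_plugin` is a hypothetical scheduler loop, so it only illustrates the run-once versus run-periodically behaviour described above.

```python
import asyncio


class DemoInfo:
    """Stand-in for an InfoPlugin: interval 0 means collect once."""
    name = "demo_info"
    interval = 0

    async def collect(self):
        return {"os": "example"}


class DemoMonitor:
    """Stand-in for a MonitorPlugin: collected every `interval` seconds."""
    name = "demo_monitor"
    interval = 0.01  # shortened for the demo; real plugins would use e.g. 60

    def __init__(self):
        self.calls = 0

    async def collect(self):
        self.calls += 1
        return {"calls": self.calls}


async def run_plugin(plugin, sink, stop):
    """Collect once for interval == 0, otherwise repeat until `stop` is set."""
    sink.append((plugin.name, await plugin.collect()))
    if plugin.interval <= 0:
        return
    while not stop.is_set():
        try:
            # Sleep for one interval, waking early if shutdown is requested.
            await asyncio.wait_for(stop.wait(), timeout=plugin.interval)
        except asyncio.TimeoutError:
            sink.append((plugin.name, await plugin.collect()))


async def demo():
    sink, stop = [], asyncio.Event()
    tasks = [asyncio.create_task(run_plugin(p, sink, stop))
             for p in (DemoInfo(), DemoMonitor())]
    await asyncio.sleep(0.05)
    stop.set()
    await asyncio.gather(*tasks)
    return sink


results = asyncio.run(demo())
```

Waiting on the stop event instead of a plain `asyncio.sleep` lets a long-interval plugin shut down promptly.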
## Creating a Plugin
### Step 1: Choose Plugin Type
Decide whether your plugin collects static information (InfoPlugin) or dynamic metrics (MonitorPlugin).
### Step 2: Create Plugin File
Create a new Python file in `hbd/plugins/` directory:
```python
"""
My awesome plugin for Heartbeat.

Brief description of what this plugin does.
"""
import logging
from typing import Dict, Any, Optional

# Import psutil or other dependencies if needed
try:
    import psutil
except ImportError:
    psutil = None

from hbd.plugin import MonitorPlugin  # or InfoPlugin

logger = logging.getLogger(__name__)


class MyAwesomePlugin(MonitorPlugin):  # or InfoPlugin
    """
    One-line description of the plugin.

    Collects:
    - List of metrics/data collected
    - Another metric

    Configuration:
        interval: Collection interval in seconds (default: 60)
        option1: Description of option1 (default: value)
        option2: Description of option2 (default: value)
    """

    name = "my_awesome_plugin"  # Unique plugin name
    interval = 60  # Seconds between collections (MonitorPlugin); use 0 for InfoPlugin

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize the plugin with optional configuration."""
        super().__init__(config)

        # Extract configuration options
        self.option1 = self.config.get('option1', 'default_value')
        self.option2 = self.config.get('option2', True)

        # Check dependencies
        if psutil is None:
            raise ImportError("psutil is required for my_awesome_plugin")

    async def initialize(self):
        """
        Initialize the plugin.

        This is called once when the plugin is loaded.
        Use this to verify dependencies, establish connections, etc.

        Returns:
            True if initialization successful, False otherwise
        """
        logger.info(f"My awesome plugin initialized (option1: {self.option1})")
        return True

    async def collect(self) -> Dict[str, Any]:
        """
        Collect data.

        This is called periodically (MonitorPlugin) or once (InfoPlugin).

        Returns:
            Dictionary of collected data (will be sent to server)
        """
        try:
            data = await self._collect_metrics()
            logger.debug(f"Collected {len(data)} metrics")
            return data
        except Exception as e:
            logger.error(f"Error collecting data: {e}")
            return {"error": str(e)}

    async def _collect_metrics(self) -> Dict[str, Any]:
        """Internal method to collect actual metrics."""
        metrics = {}
        # Collect your data here
        metrics['metric1'] = self._get_metric1()
        metrics['metric2'] = self._get_metric2()
        return metrics

    def _get_metric1(self):
        """Helper method for metric collection."""
        # Implementation here
        return 42

    def _get_metric2(self):
        """Helper method for metric collection."""
        # Implementation here
        return "hello"

    async def cleanup(self):
        """
        Cleanup resources.

        This is called when the plugin is unloaded or the client shuts down.
        Use this to close connections, release resources, etc.
        """
        logger.info("My awesome plugin cleanup")


# Plugin class exported for automatic discovery (the loader creates the instance)
plugin = MyAwesomePlugin
```
### Step 3: Test Your Plugin
Create a test script to verify your plugin works:
```python
#!/usr/bin/env python3
import asyncio
import sys
from pathlib import Path

# Add this script's directory (assumed to be the project root) to the import path
sys.path.insert(0, str(Path(__file__).parent))

from hbd.plugins.my_awesome_plugin import MyAwesomePlugin


async def test():
    # Create plugin instance
    plugin = MyAwesomePlugin({'option1': 'test_value'})

    # Initialize
    if not await plugin.initialize():
        print("Failed to initialize")
        return False

    # Collect data
    data = await plugin.collect()
    print(f"Collected data: {data}")

    # Cleanup
    await plugin.cleanup()
    return True


if __name__ == '__main__':
    success = asyncio.run(test())
    sys.exit(0 if success else 1)
```
## Plugin Lifecycle
Understanding the plugin lifecycle helps you implement plugins correctly:
```
1. Plugin Discovery
└─> Loader scans hbd/plugins/ directory
└─> Finds Python files (except those starting with _)
└─> Imports modules
2. Plugin Instantiation
└─> Creates instance with configuration
└─> __init__() is called
3. Plugin Initialization
└─> initialize() is called
└─> Plugin verifies dependencies, establishes connections
└─> Returns True/False for success/failure
4. Plugin Registration
└─> If initialization succeeds, plugin is registered
└─> Plugin becomes active
5. Data Collection
└─> For InfoPlugin: collect() called once after initialization
└─> For MonitorPlugin: collect() called periodically based on interval
└─> Data is sent to server via PLG message
6. Plugin Shutdown
└─> cleanup() is called
└─> Plugin releases resources, closes connections
```
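The lifecycle above can be sketched as a small driver. This is an illustrative stand-in, not the actual loader: `DemoPlugin`, `run_lifecycle`, and the `cycles` parameter are hypothetical names, and the real loader and registry may order or name things differently.

```python
import asyncio
from typing import Any, Dict, List, Optional


class DemoPlugin:
    """Hypothetical stand-in for a MonitorPlugin subclass."""

    name = "demo"
    interval = 0.01  # seconds; kept tiny so the sketch finishes quickly

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        self.config = config or {}
        self.calls = 0

    async def initialize(self) -> bool:
        return True

    async def collect(self) -> Dict[str, Any]:
        self.calls += 1
        return {"calls": self.calls}

    async def cleanup(self) -> None:
        pass


async def run_lifecycle(plugin_cls, config: Dict[str, Any],
                        cycles: int = 3) -> List[Dict[str, Any]]:
    """Walk one plugin class through steps 2-6 of the lifecycle."""
    plugin = plugin_cls(config)              # 2. instantiation
    if not await plugin.initialize():        # 3. initialization
        return []                            # a failed plugin is never registered
    results = []                             # 4. registration (implicit in this sketch)
    try:
        if plugin.interval == 0:             # 5. InfoPlugin style: collect once
            results.append(await plugin.collect())
        else:                                # 5. MonitorPlugin style: collect periodically
            for _ in range(cycles):
                results.append(await plugin.collect())
                await asyncio.sleep(plugin.interval)
    finally:
        await plugin.cleanup()               # 6. shutdown always runs
    return results
```

Calling `asyncio.run(run_lifecycle(DemoPlugin, {}))` returns one result per collection cycle. The `try`/`finally` is a design choice of this sketch so that `cleanup()` runs even when `collect()` raises.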
## Configuration
### Plugin-Specific Configuration
Plugins receive configuration through the `config` parameter in `__init__`:
```python
def __init__(self, config: Optional[Dict[str, Any]] = None):
    super().__init__(config)
    # Access configuration with defaults
    self.interval = self.config.get('interval', 60)
    self.threshold = self.config.get('threshold', 80)
    self.enabled_features = self.config.get('features', ['feature1', 'feature2'])
```
### Client Configuration File
Users configure plugins in the client configuration YAML:
```yaml
plugins:
  my_awesome_plugin:
    enabled: true
    interval: 120
    option1: custom_value
    option2: false
```
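As a rough illustration of how the `plugins` section could map onto the `config` dictionary each plugin receives, the sketch below filters out disabled plugins and passes the remaining options through. The helper name `enabled_plugin_configs`, the `disabled_plugin` entry, and the choice to treat a missing `enabled` key as enabled are assumptions made for this example, not documented behavior of the client.

```python
from typing import Any, Dict

# Stand-in for the parsed YAML above (what a YAML loader would return).
client_config: Dict[str, Any] = {
    "plugins": {
        "my_awesome_plugin": {
            "enabled": True,
            "interval": 120,
            "option1": "custom_value",
            "option2": False,
        },
        "disabled_plugin": {"enabled": False},  # hypothetical second entry
    }
}


def enabled_plugin_configs(config: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return {plugin_name: options} for every plugin not explicitly disabled."""
    selected = {}
    for name, options in (config.get("plugins") or {}).items():
        options = dict(options or {})
        # Assumption: a missing `enabled` key means the plugin is enabled.
        if options.pop("enabled", True):
            selected[name] = options
    return selected
```

Each resulting options dictionary is what a plugin would then read with `self.config.get(...)` in `__init__`.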
## Best Practices
### 1. Error Handling
Always handle errors gracefully:
```python
async def collect(self) -> Dict[str, Any]:
    try:
        return await self._collect_metrics()
    except Exception as e:
        logger.error(f"Error collecting metrics: {e}")
        return {"error": str(e)}
```
### 2. Logging
Use appropriate log levels:
```python
logger.debug("Detailed information for debugging")
logger.info("Normal operation messages")
logger.warning("Warning messages for unusual but handled situations")
logger.error("Error messages for failures")
```
### 3. Dependencies
Check for optional dependencies:
```python
try:
    import some_optional_library
except ImportError:
    some_optional_library = None

# Later in __init__:
if some_optional_library is None:
    raise ImportError("some_optional_library is required")
```
### 4. Performance
- Keep collection methods fast (< 1 second)
- Use async/await for I/O operations
- Cache expensive computations
- Don't block the event loop
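One way to keep a blocking call from stalling the event loop is to run it in a worker thread and bound the wait. The sketch below uses only the standard library (`asyncio.to_thread` requires Python 3.9 or later); `slow_blocking_probe` and the `timeout` parameter are hypothetical names for illustration.

```python
import asyncio
import time
from typing import Any, Dict


def slow_blocking_probe() -> int:
    """Stand-in for a blocking call (a synchronous library, a slow syscall, ...)."""
    time.sleep(0.05)
    return 42


async def collect(timeout: float = 1.0) -> Dict[str, Any]:
    """Run the blocking probe in a worker thread and bound how long we wait."""
    try:
        value = await asyncio.wait_for(asyncio.to_thread(slow_blocking_probe), timeout)
        return {"metric1": value}
    except asyncio.TimeoutError:
        # The worker thread itself cannot be cancelled; we only stop waiting for it.
        return {"error": f"probe exceeded {timeout}s"}
```

Note that the timeout limits how long `collect()` waits, not how long the thread runs, so a probe that can hang indefinitely still needs its own timeout.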
### 5. Data Structure
Return clean, structured data:
```python
{
    'metric_name': value,
    'nested_data': {
        'sub_metric': value
    },
    'list_data': [item1, item2],
    'timestamp': time.time()  # Optional timestamp
}
```
### 6. Documentation
Document your plugin thoroughly:
- Class docstring with description and configuration
- Method docstrings explaining purpose and return values
- Inline comments for complex logic
## Examples
### Example 1: Simple InfoPlugin
```python
import platform
from typing import Any, Dict

from hbd.plugin import InfoPlugin


class SimpleInfoPlugin(InfoPlugin):
    """Collect basic system information."""

    name = "simple_info"
    interval = 0  # InfoPlugin: collected once

    async def initialize(self):
        return True

    async def collect(self) -> Dict[str, Any]:
        return {
            'hostname': platform.node(),
            'system': platform.system(),
            'python_version': platform.python_version()
        }

    async def cleanup(self):
        pass


plugin = SimpleInfoPlugin
```
### Example 2: MonitorPlugin with State
```python
import time
from typing import Any, Dict

from hbd.plugin import MonitorPlugin


class CounterPlugin(MonitorPlugin):
    """Track a counter over time."""

    name = "counter"
    interval = 30

    def __init__(self, config=None):
        super().__init__(config)
        self._counter = 0
        # Monotonic clock: elapsed time is unaffected by wall-clock adjustments
        self._start_time = time.monotonic()

    async def initialize(self):
        return True

    async def collect(self) -> Dict[str, Any]:
        self._counter += 1
        uptime = time.monotonic() - self._start_time
        return {
            'count': self._counter,
            'uptime': uptime,
            # Guard against division by zero on the very first sample
            'rate': self._counter / uptime if uptime > 0 else 0.0
        }

    async def cleanup(self):
        pass


plugin = CounterPlugin
```
### Example 3: Plugin with External Command
```python
import asyncio
from typing import Any, Dict

from hbd.plugin import MonitorPlugin


class CommandPlugin(MonitorPlugin):
    """Execute external command and capture output."""

    name = "command_executor"
    interval = 60

    def __init__(self, config=None):
        super().__init__(config)
        # The command runs through the shell, so take it from trusted configuration only
        self.command = self.config.get('command', 'echo "no command"')

    async def initialize(self):
        return True

    async def collect(self) -> Dict[str, Any]:
        process = None
        try:
            process = await asyncio.create_subprocess_shell(
                self.command,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE
            )
            stdout, stderr = await asyncio.wait_for(
                process.communicate(),
                timeout=30
            )
            return {
                'exit_code': process.returncode,
                'stdout': stdout.decode('utf-8', errors='replace'),
                'stderr': stderr.decode('utf-8', errors='replace')
            }
        except asyncio.TimeoutError:
            # Don't leave a timed-out child process running
            if process is not None and process.returncode is None:
                process.kill()
                await process.wait()
            return {'error': 'command timed out after 30 seconds'}
        except Exception as e:
            return {'error': str(e)}

    async def cleanup(self):
        pass


plugin = CommandPlugin
```
## Testing
### Unit Testing
Create unit tests for your plugins:
```python
import asyncio
import unittest

from hbd.plugins.my_awesome_plugin import MyAwesomePlugin


class TestMyPlugin(unittest.TestCase):
    def setUp(self):
        self.plugin = MyAwesomePlugin({'option1': 'test'})

    def test_initialization(self):
        result = asyncio.run(self.plugin.initialize())
        self.assertTrue(result)

    def test_collection(self):
        asyncio.run(self.plugin.initialize())
        data = asyncio.run(self.plugin.collect())
        self.assertIsInstance(data, dict)
        self.assertIn('metric1', data)
        self.assertGreater(data['metric1'], 0)

    def tearDown(self):
        asyncio.run(self.plugin.cleanup())


if __name__ == '__main__':
    unittest.main()
```
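As an alternative to calling `asyncio.run()` in every test, the standard library's `unittest.IsolatedAsyncioTestCase` (Python 3.8+) lets test methods be coroutines that share one event loop per test. The sketch below uses a hypothetical `FakePlugin` stand-in so it runs without the `hbd` package; in a real test you would import your own plugin class instead.

```python
import unittest
from typing import Any, Dict


class FakePlugin:
    """Hypothetical stand-in so the sketch runs without the hbd package."""

    def __init__(self, config=None):
        self.config = config or {}

    async def initialize(self) -> bool:
        return True

    async def collect(self) -> Dict[str, Any]:
        return {"metric1": 42}

    async def cleanup(self) -> None:
        pass


class TestPluginAsync(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self):
        # initialize() runs on the same loop as the test itself
        self.plugin = FakePlugin({"option1": "test"})
        self.assertTrue(await self.plugin.initialize())

    async def asyncTearDown(self):
        await self.plugin.cleanup()

    async def test_collection(self):
        data = await self.plugin.collect()
        self.assertIsInstance(data, dict)
        self.assertEqual(data["metric1"], 42)
```

This matters mainly for plugins that create loop-bound resources (connections, tasks) in `initialize()`, since separate `asyncio.run()` calls each create and close their own event loop.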
### Integration Testing
Test your plugin with the actual client:
```bash
# Create test configuration
cat > test_config.yaml <<EOF
server: localhost
plugins:
  my_awesome_plugin:
    enabled: true
    interval: 10
    option1: test_value
EOF
# Run client in test mode
python -m hbd.hbc -c test_config.yaml --verbose
```
## Troubleshooting
### My plugin isn't loading
1. Check filename doesn't start with underscore
2. Verify plugin class inherits from InfoPlugin or MonitorPlugin
3. Check `initialize()` returns True
4. Look for import errors in logs
### Plugin loads but doesn't collect data
1. Check `interval` is set correctly (0 for InfoPlugin, > 0 for MonitorPlugin)
2. Verify `collect()` returns a dictionary
3. Check for exceptions in `collect()` method
4. Enable DEBUG logging to see detailed errors
### Data isn't appearing on server
1. Verify client is connected to server
2. Check server logs for PLG message handling
3. Verify returned data is JSON-serializable
4. Check for large data sizes (may exceed UDP packet size)
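Checks 3 and 4 can be run locally before sending anything. The sketch below is a hypothetical helper (`check_payload` is not part of Heartbeat) that tests whether a `collect()` result is JSON-serializable and how large it is. The default limit is the theoretical maximum UDP payload over IPv4; the limit Heartbeat actually enforces is an assumption here and may be lower.

```python
import json
from typing import Any, Dict, Tuple

# Theoretical maximum UDP payload over IPv4:
# 65,535 bytes minus a 20-byte IP header and an 8-byte UDP header.
# Assumption: Heartbeat's practical limit may be smaller than this.
MAX_UDP_PAYLOAD = 65507


def check_payload(data: Dict[str, Any], limit: int = MAX_UDP_PAYLOAD) -> Tuple[bool, str]:
    """Return (ok, reason) for a collect() result."""
    try:
        encoded = json.dumps(data).encode("utf-8")
    except (TypeError, ValueError) as exc:
        return False, f"not JSON-serializable: {exc}"
    if len(encoded) > limit:
        return False, f"payload is {len(encoded)} bytes, limit is {limit}"
    return True, f"{len(encoded)} bytes"
```

Typical offenders are `set`, `bytes`, and `datetime` values, which `json.dumps` rejects with a `TypeError`.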
## Further Reading
- [Plugin Framework Source](../hbd/plugin.py) - Core plugin implementation
- [Built-in Plugins](../hbd/plugins/) - Examples of working plugins
- [Nagios Integration](NAGIOS_INTEGRATION.md) - Running external plugins
- [Configuration Guide](../hbd/config_example.yaml) - Full configuration reference
+742
View File
@@ -0,0 +1,742 @@
# Threshold Alerting System
## Overview
The Heartbeat Monitoring System includes a comprehensive threshold alerting system that monitors plugin metrics and triggers notifications when values exceed configured thresholds. This system is designed to:
- **Detect anomalies**: Automatically identify when system metrics exceed safe operating ranges
- **Prevent alert fatigue**: Use hysteresis to prevent notification flapping
- **Escalate appropriately**: Support WARNING and CRITICAL severity levels
- **Track state**: Maintain alert history and state transitions per host
- **Integrate seamlessly**: Work with existing notification infrastructure (email, pushover, etc.)
## Architecture
### Components
1. **ThresholdChecker** (`hbd/threshold.py`)
- Main threshold checking engine
- Parses configuration
- Evaluates metrics against thresholds
- Triggers notifications on state changes
2. **ThresholdConfig**
- Individual threshold configuration
- Supports multiple comparison operators
- Implements hysteresis logic
3. **AlertState**
- Tracks current alert state per metric
- Records state transitions
- Manages notification timing
4. **Integration Points**
- UDP handler: Checks thresholds when plugin data arrives
- Host objects: Store alert states per host
- Notification system: Sends alerts via configured channels
### Alert Levels
- **OK**: Metric is within normal range
- **WARNING**: Metric has exceeded warning threshold (first-level concern)
- **CRITICAL**: Metric has exceeded critical threshold (requires immediate attention)
- **UNKNOWN**: Metric value cannot be evaluated (e.g., non-numeric data)
## Configuration
### Basic Structure
Thresholds are configured in the YAML configuration file under the `thresholds` section:
```yaml
thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      enabled: true
```
### Configuration Parameters
#### Required Parameters
- **warning**: Warning threshold value (numeric)
- **critical**: Critical threshold value (numeric)

Note: Only one of `warning` or `critical` has to be present; a threshold that defines just one level is valid.

#### Optional Parameters
- **operator**: Comparison operator (default: `">"`)
  - `">"` - Greater than
  - `">="` - Greater than or equal
  - `"<"` - Less than
  - `"<="` - Less than or equal
  - `"=="` - Equal to
  - `"!="` - Not equal to
- **hysteresis**: Fraction of the threshold value used as a recovery margin to prevent flapping (default: `0.1` = 10%)
  - Range: 0.0 to 1.0
  - Prevents rapid state transitions when the value hovers near the threshold
- **enabled**: Whether this threshold is active (default: `true`)
### Comparison Operators
#### Greater Than (`>`, `>=`)
Used for metrics where **higher values are problematic**:
```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0   # Alert when CPU > 80%
    critical: 90.0  # Alert when CPU > 90%
    operator: ">"
```
Examples:
- CPU usage percentage
- Memory usage percentage
- Disk usage percentage
- Load average
- Error counters
#### Less Than (`<`, `<=`)
Used for metrics where **lower values are problematic**:
```yaml
memory_monitor:
  available_mb:
    warning: 1000   # Alert when available memory < 1GB
    critical: 500   # Alert when available memory < 500MB
    operator: "<"
```
Examples:
- Available memory
- Free disk space
- Connection pool availability
- Battery level
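The operator semantics described above can be sketched with the standard `operator` module. This is a minimal illustration, not the `ThresholdChecker` implementation: the function name `classify`, the string return values, and the order of checks are assumptions, and hysteresis is deliberately left out.

```python
import operator
from typing import Optional

# Maps the configured operator strings to Python comparison functions.
OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def classify(value, warning: Optional[float], critical: Optional[float],
             op: str = ">") -> str:
    """Return "OK", "WARNING", "CRITICAL", or "UNKNOWN" for one metric value."""
    if not isinstance(value, (int, float)):
        return "UNKNOWN"  # non-numeric data cannot be evaluated
    compare = OPERATORS[op]
    # Check CRITICAL first: a value past the critical threshold is also past warning.
    if critical is not None and compare(value, critical):
        return "CRITICAL"
    if warning is not None and compare(value, warning):
        return "WARNING"
    return "OK"
```

With the examples above, `classify(85.0, 80.0, 90.0, ">")` gives `"WARNING"`, and `classify(400, 1000, 500, "<")` gives `"CRITICAL"` because available memory has fallen below both thresholds.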
## Hysteresis
Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.
### How It Works
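Once a metric crosses a threshold (e.g., CPU rises from 85% to 91%, triggering CRITICAL), the alert level is held until the value improves past a recovery threshold. For a `>` operator, the recovery threshold is the threshold minus the hysteresis fraction of it: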
```
Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81
Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
```
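The trace above can be reproduced with a short sketch. It covers only `>`-style thresholds and assumes a warning threshold of 80 alongside the critical threshold of 90; the names `Level` and `evaluate` are hypothetical, and the real `ThresholdConfig` logic may differ in detail (for example, in how `<` operators mirror the margin).

```python
from enum import IntEnum


class Level(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def evaluate(value: float, previous: Level, warning: float = 80.0,
             critical: float = 90.0, hysteresis: float = 0.1) -> Level:
    """Evaluate one sample for a ">" threshold, holding alerts via hysteresis."""
    # Level the value would get with no history.
    if value > critical:
        raw = Level.CRITICAL
    elif value > warning:
        raw = Level.WARNING
    else:
        raw = Level.OK

    # Escalations and unchanged levels take effect immediately.
    if raw >= previous:
        return raw

    # The value improved: keep the previous level until the value drops
    # below that level's recovery threshold, then re-evaluate normally.
    held = critical if previous is Level.CRITICAL else warning
    recovery = held - held * hysteresis
    if value >= recovery:
        return previous
    return raw
```

Feeding the values 91, 89, 85, 80 through `evaluate` in sequence, starting from `Level.OK`, yields CRITICAL three times and then OK, because 80 is below the recovery threshold of 81 and does not exceed the assumed warning threshold of 80.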
### Configuration Recommendations
- **Stable metrics** (CPU, memory): 10-15% hysteresis
```yaml
hysteresis: 0.1
```
- **Very stable metrics** (disk usage): 5% hysteresis
```yaml
hysteresis: 0.05
```
- **Counter metrics** (errors, packets): 20% hysteresis
```yaml
hysteresis: 0.2
```
- **Binary states** (exit codes): No hysteresis
```yaml
hysteresis: 0.0
```
## Plugin-Specific Configuration
### CPU Monitor
```yaml
cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1
  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15
  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"
  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"
```
### Memory Monitor
```yaml
memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"
  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"
  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"
```
### Disk Monitor
Disk thresholds support **partition-specific configuration**:
```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"
    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"
    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
```
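Elsewhere in this document, per-partition alerts are keyed by dotted paths such as `disk_monitor./.percent` and `disk_monitor./var.percent`. A minimal sketch of how nested partition data could be flattened into those keys is shown below. The shape of the incoming plugin data (a `partitions` mapping of mount point to metrics) and the helper name `flatten_disk_metrics` are assumptions for illustration.

```python
from typing import Any, Dict


def flatten_disk_metrics(plugin: str, data: Dict[str, Any]) -> Dict[str, float]:
    """Map {"partitions": {"/": {"percent": 82.0}}} to {"disk_monitor./.percent": 82.0}."""
    flat = {}
    for mount, metrics in data.get("partitions", {}).items():
        for metric, value in metrics.items():
            # Key format: <plugin>.<mount point>.<metric>
            flat[f"{plugin}.{mount}.{metric}"] = value
    return flat
```

Each flattened key can then be looked up against the matching `partitions` entry in the threshold configuration.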
### Network Monitor
```yaml
network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2
  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"
  dropout_total:
    warning: 50
    critical: 200
    operator: ">"
  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"
  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"
```
### Nagios Runner
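The Nagios plugin runner reports each check's exit code (by Nagios convention 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN), which can be thresholded. Note that with the `>=` operator, an exit code of 3 also satisfies the critical threshold: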
```yaml
nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes
```
## Notification Behavior
### When Notifications Are Sent
Notifications are triggered on **state changes**:
1. **Escalation**: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
```
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
```
2. **Recovery**: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK
```
RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)
```
3. **Re-notifications**: Periodic reminders for ongoing alerts
```
REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
```
### Notification Frequency
- **State changes**: Immediate notification
- **Re-notifications**: Controlled by `threshold_renotify_interval` (default: 3600 seconds = 1 hour)
```yaml
threshold_renotify_interval: 3600 # Re-notify every hour for ongoing alerts
```
### Notification Channels
Thresholds use the same notification infrastructure as heartbeat monitoring:
- **Email** (via SMTP)
- **Pushover** (mobile notifications)
- **Mattermost** (team chat)
- **Custom webhooks**
Configuration:
```yaml
# Email
toemail:
- admin@example.com
- oncall@example.com
fromemail: heartbeat@example.com
smtpserver: smtp.example.com
smtpport: 587
smtpuser: heartbeat@example.com
smtppassword: your-password
# Pushover
pushover_token: your-app-token
pushover_user: your-user-key
```
### Watched Hosts
Only hosts in the `watchhosts` list will trigger notifications:
```yaml
watchhosts:
- webserver01
- database01
- mailserver
```
Hosts not in this list will still have thresholds checked and alert states tracked, but won't send notifications.
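The notification rules in this section (state changes, re-notifications, and watched hosts) can be combined into one decision function. The sketch below is illustrative only: `notification_kind` and its parameters are hypothetical names, and UNKNOWN is left out because this document does not say how it ranks against the other levels.

```python
from typing import Iterable, Optional

# Ordering used to tell escalation from recovery (UNKNOWN intentionally omitted).
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def notification_kind(host: str, watchhosts: Iterable[str],
                      old_level: str, new_level: str,
                      last_notification: Optional[float], now: float,
                      renotify_interval: float = 3600) -> Optional[str]:
    """Return "escalation", "recovery", "reminder", or None (send nothing)."""
    if host not in watchhosts:
        return None  # state is still tracked, but unwatched hosts never notify
    if new_level != old_level:
        # State changes notify immediately.
        if SEVERITY[new_level] > SEVERITY[old_level]:
            return "escalation"
        return "recovery"
    # Same non-OK level: remind once the renotify interval has elapsed.
    if (new_level != "OK" and last_notification is not None
            and now - last_notification >= renotify_interval):
        return "reminder"
    return None
```

With the default interval, a host that has been CRITICAL since its last notification 3600 seconds ago gets a reminder, while the same host at 3599 seconds does not.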
## Alert State Tracking
Each host maintains alert states for all monitored metrics:
```python
host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}
```
Alert states persist in memory and are saved with host data (pickle).
### Alert State Information
Each `AlertState` tracks:
- **level**: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
- **since**: Timestamp when current state started
- **last_value**: Most recent metric value
- **last_check**: Timestamp of last threshold check
- **notification_count**: Number of notifications sent for this alert
- **last_notification**: Timestamp of last notification
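The fields listed above map naturally onto a dataclass. The sketch below mirrors those field names but is not the class from `hbd/threshold.py`: the `AlertLevel` enum, the default values, and the `update` method are assumptions added for illustration.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class AlertLevel(Enum):
    OK = "OK"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"
    UNKNOWN = "UNKNOWN"


@dataclass
class AlertState:
    """Illustrative mirror of the fields tracked per metric."""
    level: AlertLevel = AlertLevel.OK
    since: float = field(default_factory=time.time)
    last_value: Optional[float] = None
    last_check: Optional[float] = None
    notification_count: int = 0
    last_notification: Optional[float] = None

    def update(self, level: AlertLevel, value: float, now: float) -> None:
        """Record a check; restart the clock and counters only on a level change."""
        if level != self.level:
            self.level = level
            self.since = now
            self.notification_count = 0
            self.last_notification = None
        self.last_value = value
        self.last_check = now
```

Keeping `since` separate from `last_check` is what allows reminders to report how long an alert has been ongoing.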
### Querying Alert States
Via HTTP API (future enhancement):
```bash
GET /api/hosts/webserver01/alerts
```
Response:
```json
{
"active_alerts": [
{
"metric": "cpu_monitor.cpu_percent",
"level": "WARNING",
"since": 1234567890,
"value": 85.0,
"duration": 300
}
],
"summary": {
"ok": 15,
"warning": 1,
"critical": 0
}
}
```
## Testing
A comprehensive test suite is provided in `test_threshold.py`:
```bash
python test_threshold.py
```
Tests cover:
- Threshold configuration and parsing
- All comparison operators
- Hysteresis functionality
- Alert state tracking
- State change detection
- Notification triggering
- Nested metrics (partitions)
- Alert summaries
## Best Practices
### 1. Start Conservative
Begin with higher thresholds to avoid alert fatigue:
```yaml
cpu_monitor:
  cpu_percent:
    warning: 85.0   # Start higher
    critical: 95.0  # Very high for critical
```
Adjust downward based on observed behavior.
### 2. Consider Workload Patterns
Different systems have different normal ranges:
**Web servers** (bursty traffic):
```yaml
cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness
```
**Database servers** (steady load):
```yaml
cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1  # Lower hysteresis for steady metrics
```
### 3. Use Appropriate Operators
Match the operator to the metric:
| Metric Type | Example | Operator | Reason |
|-------------|---------|----------|--------|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when count is high |
| Health checks | Nagios exit code | `>=` | Map to standard codes |
### 4. Align with Monitoring Intervals
Ensure threshold checks align with plugin collection intervals:
```yaml
plugins:
  cpu_monitor:
    interval: 300  # Check every 5 minutes
thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes
```
### 5. Test Before Production
1. **Start with disabled thresholds**:
```yaml
enabled: false
```
2. **Observe metric ranges** over a week
3. **Set thresholds** based on observed data
4. **Enable gradually**:
```yaml
enabled: true
```
5. **Monitor for false positives**
### 6. Document Baseline Values
Keep a record of normal operating ranges:
```yaml
# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month
cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone
```
### 7. Layer Alerts
Use WARNING for early notification, CRITICAL for immediate action:
```yaml
disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0   # Early warning: "check in next few days"
        critical: 90.0  # Critical: "act now before outage"
```
## Troubleshooting
### No Notifications Being Sent
1. **Check if host is watched**:
```yaml
watchhosts:
- your-host-name
```
2. **Verify notification configuration**:
```yaml
toemail:
- admin@example.com
smtpserver: smtp.example.com
```
3. **Check threshold configuration**:
```bash
# Look for parsing errors in server logs
grep "threshold" /var/log/heartbeat/hbd.log
```
4. **Verify metric names**:
- Metric names must match exactly (case-sensitive)
- Check journal or logs for actual metric names
### Too Many Alerts (Flapping)
1. **Increase hysteresis**:
```yaml
hysteresis: 0.2 # Increase from 0.1 to 0.2 (20%)
```
2. **Adjust thresholds**:
```yaml
warning: 85.0 # Increase from 80.0
```
3. **Increase renotification interval**:
```yaml
threshold_renotify_interval: 7200 # 2 hours instead of 1
```
### Alerts Not Triggering
1. **Check threshold operator**:
```yaml
# For available memory (alert when LOW):
operator: "<" # NOT ">"
```
2. **Verify numeric values**:
- Ensure metric values are numeric
- Check for unit mismatches (MB vs GB)
3. **Check if threshold is enabled**:
```yaml
enabled: true # NOT false
```
4. **Review hysteresis settings**:
- Very high hysteresis may prevent state changes
- Try reducing or disabling temporarily
### Alert State Not Recovering
1. **Check recovery threshold calculation**:
```
Threshold: 90
Hysteresis: 0.1
Recovery: 90 - (90 * 0.1) = 81
Value must drop below 81 to recover
```
2. **Temporarily disable hysteresis**:
```yaml
hysteresis: 0.0
```
3. **Monitor actual metric values**:
```bash
# Check journal for actual values
grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20
```
## Advanced Topics
### Custom Notification Callbacks
The ThresholdChecker supports custom notification functions:
```python
def custom_notifier(message):
    # `pagerduty` and `metrics` are placeholders for your own integrations
    # Send to incident management system
    pagerduty.trigger(message)
    # Log to custom system
    logger.critical(message)
    # Update dashboard
    metrics.alert_count.inc()


checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)
```
### Programmatic Access
Query alert states programmatically:
```python
# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)
for alert in active:
    print(f"{alert.metric_path}: {alert.level.name} for {time.time() - alert.since}s")
# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")
```
### Integration with External Systems
Threshold violations can be integrated with:
- **PagerDuty**: Incident creation and escalation
- **OpsGenie**: On-call scheduling and routing
- **ServiceNow**: Ticket creation
- **Grafana**: Dashboard annotations
- **Elasticsearch**: Alert indexing and analysis
## Future Enhancements
Planned features:
1. **Composite thresholds**: Alert based on multiple metrics
```yaml
composite:
  high_load_with_low_memory:
    conditions:
      - cpu_monitor.load_1min > 8.0
      - memory_monitor.available_mb < 500
```
2. **Time-based thresholds**: Different thresholds by time of day
```yaml
schedule:
  business_hours:
    warning: 70.0
  off_hours:
    warning: 85.0
```
3. **Rate-of-change thresholds**: Alert on rapid changes
```yaml
rate_of_change:
  metric: cpu_percent
  period: 300
  threshold: 30.0  # Alert if changes >30% in 5 minutes
```
4. **Alert grouping**: Combine related alerts
```yaml
groups:
  disk_critical:
    metrics:
      - disk_monitor./.percent
      - disk_monitor./var.percent
    action: single_notification
```
5. **Maintenance windows**: Suppress alerts during planned maintenance
```yaml
maintenance:
  - host: webserver01
    start: 2024-01-15T02:00:00Z
    end: 2024-01-15T04:00:00Z
```
## See Also
- [Plugin Development Guide](PLUGIN_DEVELOPMENT.md)
- [Message Journal Documentation](MESSAGE_JOURNAL.md)
- Configuration examples: `hbd/config_thresholds_example.yaml`
- Test suite: `test_threshold.py`