# Configuration Reload
The heartbeat daemon (hbd) supports runtime configuration reloading without requiring a full restart. This allows you to update certain configuration settings while the service continues running.
## How to Reload Configuration
Send a SIGHUP signal to the running hbd process:
```bash
# Find the process ID
ps aux | grep hbd
# Or use pidof/pgrep
pidof hbd
pgrep -f hbd
# Send SIGHUP signal
kill -HUP <pid>
# Or if using systemd
systemctl reload heartbeat
```
## What Can Be Reloaded
The following configuration sections can be reloaded without restarting:
### ✅ Fully Reloadable
- **Notification Channels** (`notification_channels`)
  - Add, remove, or modify notification channel definitions
  - Update tokens, API keys, SMTP credentials
  - Change recipient lists
- **Threshold Configurations** (`threshold_configs`)
  - Modify warning and critical thresholds
  - Add or remove threshold rules
  - Change operators and hysteresis values
  - Update display formats
- **Host Configuration** (`hosts`)
  - Change watch status
  - Update notification channel assignments
  - Modify threshold config assignments
  - Change dyndns status
- **Host Lists**
  - `watchhosts` - hosts to monitor
  - `dyndnshosts` - hosts with dynamic DNS
  - `drophosts` - hosts to ignore
- **Runtime Settings**
  - `grace` - grace period multiplier
  - `interval` - expected heartbeat interval
  - `threshold_renotify_interval` - re-notification interval
  - `debug` - debug level
  - `verbose` - verbose output
- **DNS Settings**
  - `dyndomains` - dynamic DNS domains
  - `nsupdate_bin` - nsupdate binary path
  - `rndc_key` - RNDC key path
### ⚠️ Requires Restart
The following settings **cannot** be reloaded and require a service restart:
- **Network Ports**
  - `hb_port` - UDP heartbeat port
  - `hbd_port` - HTTP API port
  - `ws_port` - WebSocket port
  - `wss_port` - Secure WebSocket port
- **SSL/TLS Settings**
  - `cert_path` - SSL certificate path
  - `wss_pem` - SSL certificate file
  - `wss_key` - SSL key file
- **Persistence**
  - `pickfile` - Pickle file path
- **Logging**
  - `logfile` - Log file path
- **Journal Settings**
  - `journal_enabled` - Enable/disable journaling
  - `journal_dir` - Journal directory
  - `journal_file` - Journal filename
  - `journal_max_size` - Maximum journal size
  - `journal_max_backups` - Number of backup files
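
Before sending SIGHUP it can help to know whether an edit touched any of the restart-only settings listed above. The following is a minimal sketch, not part of hbd: the helper name `restart_required_changes` is hypothetical, the port value is illustrative, and it assumes these keys sit at the top level of the parsed config.

```python
# Keys copied from the "Requires Restart" list above. Treating them as
# top-level keys of the parsed config is an assumption of this sketch.
RESTART_REQUIRED_KEYS = (
    "hb_port", "hbd_port", "ws_port", "wss_port",
    "cert_path", "wss_pem", "wss_key",
    "pickfile", "logfile",
    "journal_enabled", "journal_dir", "journal_file",
    "journal_max_size", "journal_max_backups",
)

# Sentinel so that "key absent" and "key set to None" compare as different.
_MISSING = object()


def restart_required_changes(old: dict, new: dict) -> list[str]:
    """Return the restart-only keys whose values differ between two configs."""
    return [
        key
        for key in RESTART_REQUIRED_KEYS
        if old.get(key, _MISSING) != new.get(key, _MISSING)
    ]


if __name__ == "__main__":
    # Illustrative values only; 50003/50004 are not documented defaults.
    running = {"hb_port": 50003, "grace": 2}
    edited = {"hb_port": 50004, "grace": 3}
    print(restart_required_changes(running, edited))  # prints ['hb_port']
```

An empty result suggests a reload is enough; a non-empty one means those settings will only change after a restart.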
## Reload Process
When a SIGHUP signal is received:
1. **Configuration File Loading**
   - The config file is re-read from disk
   - YAML parsing is performed
   - Validation checks are run
2. **Component Updates**
   - Notification system is updated with new channel definitions
   - Threshold checker reloads all threshold configurations
   - Alert states are preserved to maintain hysteresis
3. **Error Handling**
   - If reload fails, the previous configuration is kept
   - Error messages are logged
   - Service continues running with old configuration
4. **Logging**
   - Reload start and completion are logged
   - Each component reports its reload status
   - Total number of thresholds is reported
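
The load, validate, and fall-back behavior in the steps above can be sketched roughly as follows. This is an illustration under assumptions, not the hbd source: `reload_config` and its `load`, `validate`, and `log` parameters are hypothetical names, and the failure messages are invented for the example.

```python
from typing import Callable


def reload_config(
    current: dict,
    load: Callable[[], dict],
    validate: Callable[[dict], list],
    log: Callable[[str], None] = print,
) -> dict:
    """Return the newly loaded config, or `current` if loading or validation fails."""
    log("Starting configuration reload...")
    try:
        candidate = load()  # e.g. read the file and parse the YAML
    except Exception as exc:  # any load or parse error keeps the old config
        log(f"Reload failed, keeping previous configuration: {exc}")
        return current

    problems = validate(candidate)  # a list of human-readable problems
    if problems:
        for problem in problems:
            log(f"Validation error: {problem}")
        log("Reload rejected, keeping previous configuration")
        return current

    log("Configuration reload completed successfully")
    return candidate


if __name__ == "__main__":
    running = {"grace": 2}

    def broken_load() -> dict:
        raise ValueError("could not parse config")  # simulated parse error

    # The failed load leaves the running configuration untouched.
    running = reload_config(running, broken_load, lambda cfg: [])
    print(running)
```

Returning the candidate only after every check has passed is one simple way to guarantee that a failed reload never leaves the service with a half-applied configuration.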
## Example Reload Session
```bash
# Terminal 1: Watch the logs
tail -f /var/log/heartbeat.log
# Terminal 2: Edit configuration
vim /path/to/.hb.yaml
# Make changes to notification channels or thresholds
# Save the file
# Terminal 3: Trigger reload
kill -HUP $(pgrep -f hbd)
# Terminal 1: See reload messages
2026-04-01 12:34:56 INFO: Received SIGHUP, initiating config reload...
2026-04-01 12:34:56 INFO: ============================================================
2026-04-01 12:34:56 INFO: Starting configuration reload...
2026-04-01 12:34:56 INFO: ============================================================
2026-04-01 12:34:56 INFO: Configuration reloaded from /path/to/.hb.yaml
2026-04-01 12:34:56 INFO: Notification configuration reloaded
2026-04-01 12:34:56 INFO: Reloading threshold configuration...
2026-04-01 12:34:56 INFO: Threshold configuration reloaded: 42 total thresholds
2026-04-01 12:34:56 INFO: ============================================================
2026-04-01 12:34:56 INFO: Configuration reload completed successfully
2026-04-01 12:34:56 INFO: ============================================================
```
## Common Use Cases
### 1. Update Notification Credentials
If you need to rotate API keys or update SMTP passwords:
```yaml
notification_channels:
  pushover_standard:
    type: pushover
    token: new-token-here  # Updated
    user: new-user-key-here  # Updated
```
Just edit the config file and send SIGHUP - no restart needed.
### 2. Adjust Threshold Values
Fine-tune alerting thresholds based on observed behavior:
```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 85.0  # Increased from 80.0
          critical: 95.0  # Increased from 90.0
```
Send SIGHUP to apply the new thresholds immediately.
### 3. Add New Notification Channels
Add a new notification destination:
```yaml
notification_channels:
  email_oncall:
    type: email
    recipients: [oncall@example.com]
    sender: alerts@example.com
    smtp_server: smtp.example.com
hosts:
  critical_server:
    threshold_config: default
    watch: true
    notification_channels: [pushover_standard, email_oncall]  # Added
```
The new channel becomes active immediately after SIGHUP.
### 4. Update Watch List
Start or stop monitoring hosts without restart:
```yaml
hosts:
  new_server:
    threshold_config: default
    watch: true  # Start watching
    notification_channels: [pushover_standard]
```
## Best Practices
1. **Test Configuration Before Reload**
   - Validate YAML syntax before sending SIGHUP
   - Check for typos in channel names
   - Verify threshold values are reasonable
2. **Monitor Reload Logs**
   - Always check logs after reload to confirm success
   - Look for error messages if reload fails
   - Verify expected number of thresholds loaded
3. **Backup Before Changes**
   - Keep a backup of working configuration
   - Use version control (git) for config files
   - Document why changes were made
4. **Gradual Rollout**
   - Test changes on development server first
   - Apply to one production server at a time
   - Verify behavior before applying everywhere
5. **Plan for Restart-Required Changes**
   - Schedule downtime for port or SSL changes
   - Use blue-green deployment if possible
   - Keep service downtime minimal
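
The first practice above, checking for typos in channel names before sending SIGHUP, could be automated with a small pre-flight check. The sketch below is not an hbd tool: `find_reference_problems` is a hypothetical helper, and it assumes the config has already been parsed into a dict with the `hosts`, `notification_channels`, and `threshold_configs` layout shown in the examples in this document.

```python
def find_reference_problems(config: dict) -> list[str]:
    """List hosts that reference undefined channels or threshold configs."""
    channels = config.get("notification_channels") or {}
    threshold_configs = config.get("threshold_configs") or {}
    problems = []

    for host, settings in (config.get("hosts") or {}).items():
        settings = settings or {}  # tolerate a host entry with no body

        for name in settings.get("notification_channels") or []:
            if name not in channels:
                problems.append(
                    f"host {host!r}: unknown notification channel {name!r}"
                )

        threshold_config = settings.get("threshold_config")
        if threshold_config is not None and threshold_config not in threshold_configs:
            problems.append(
                f"host {host!r}: unknown threshold config {threshold_config!r}"
            )

    return problems


if __name__ == "__main__":
    config = {
        "notification_channels": {"pushover_standard": {"type": "pushover"}},
        "threshold_configs": {"default": {}},
        "hosts": {
            "critical_server": {
                "threshold_config": "default",
                # "email_oncal" is a deliberate typo for this example
                "notification_channels": ["pushover_standard", "email_oncal"],
            }
        },
    }
    for problem in find_reference_problems(config):
        print(problem)
```

Running a check like this against the edited file, and only sending SIGHUP when it reports nothing, catches the most common cross-reference mistakes before they reach the running daemon.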
## Troubleshooting
### Reload Doesn't Apply Changes
**Check:**
- Is the config file path correct?
- Did you save the file after editing?
- Are there YAML syntax errors?
- Do the logs show any error messages?
**Solution:**
```bash
# Validate YAML syntax (requires PyYAML; prints a traceback if parsing fails)
python3 -c "import yaml; yaml.safe_load(open('.hb.yaml'))"
# Check file modification time
ls -l .hb.yaml
# View logs
journalctl -u heartbeat -f
```
### Partial Configuration Applied
**Cause:** Some sections reloaded, others didn't.
**Solution:** Check logs to see which components failed. Common issues:
- Invalid channel type
- Missing required threshold fields
- Invalid host references
### Service Becomes Unresponsive
**Cause:** Malformed configuration caused an exception.
**Solution:**
1. Revert to backup configuration
2. Send SIGHUP again to reload the good config
3. If service is completely stuck, restart it
## Implementation Details
The reload mechanism uses:
- **Signal Handling**: SIGHUP triggers reload event
- **Async-Safe Reloading**: Configuration is loaded asynchronously
- **Component Coordination**: All affected components are updated atomically
- **State Preservation**: Alert states and hysteresis information are maintained
- **Error Recovery**: Failed reloads don't affect running configuration
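
To make the signal-handling idea concrete, here is one common pattern for it in an asyncio program. Whether hbd uses asyncio in this way is an assumption; `reload_on_sighup` and `demo` are hypothetical names. The handler only sets an event, and the reload itself runs in ordinary task context. The sketch is Unix-only and must run in the main thread.

```python
import asyncio
import os
import signal


async def reload_on_sighup(reload) -> None:
    """Call `reload()` once for each SIGHUP until this task is cancelled."""
    loop = asyncio.get_running_loop()
    requested = asyncio.Event()
    # The handler does no real work; it only flags that a reload is wanted.
    loop.add_signal_handler(signal.SIGHUP, requested.set)
    try:
        while True:
            await requested.wait()
            requested.clear()
            reload()
    finally:
        loop.remove_signal_handler(signal.SIGHUP)


async def demo() -> int:
    """Send SIGHUP to this process and count how often the reload ran."""
    calls = 0

    def fake_reload() -> None:
        nonlocal calls
        calls += 1

    task = asyncio.create_task(reload_on_sighup(fake_reload))
    await asyncio.sleep(0)  # let the task install its handler first
    os.kill(os.getpid(), signal.SIGHUP)
    await asyncio.sleep(0.1)  # give the loop time to dispatch the signal
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keeping the handler this small means the reload never interrupts other work mid-operation; it runs only when the event loop schedules the waiting task.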
## See Also
- [NOTIFICATIONS.md](NOTIFICATIONS.md) - Notification channel configuration
- [THRESHOLD_ALERTING.md](THRESHOLD_ALERTING.md) - Threshold configuration details
- Configuration examples in `hbd/config_*.yaml`