# Configuration Reload

The heartbeat daemon (`hbd`) supports runtime configuration reloading without requiring a full restart. This allows you to update certain configuration settings while the service continues running.

## How to Reload Configuration

Send a SIGHUP signal to the running `hbd` process:

```bash
# Find the process ID (the [h] keeps grep from matching itself)
ps aux | grep '[h]bd'

# Or use pidof/pgrep
pidof hbd
pgrep -f hbd

# Send SIGHUP signal
kill -HUP <pid>

# Or, if using systemd (the unit must define ExecReload=)
systemctl reload heartbeat
```

## What Can Be Reloaded

Most configuration sections can be reloaded without restarting; a few require a full service restart.

### ✅ Fully Reloadable

- **Notification Channels** (`notification_channels`)
  - Add, remove, or modify notification channel definitions
  - Update tokens, API keys, SMTP credentials
  - Change recipient lists

- **Threshold Configurations** (`threshold_configs`)
  - Modify warning and critical thresholds
  - Add or remove threshold rules
  - Change operators and hysteresis values
  - Update display formats

- **Host Configuration** (`hosts`)
  - Change watch status
  - Update notification channel assignments
  - Modify threshold config assignments
  - Change dyndns status

- **Host Lists**
  - `watchhosts` - hosts to monitor
  - `dyndnshosts` - hosts with dynamic DNS
  - `drophosts` - hosts to ignore

- **Runtime Settings**
  - `grace` - grace period multiplier
  - `interval` - expected heartbeat interval
  - `threshold_renotify_interval` - re-notification interval
  - `debug` - debug level
  - `verbose` - verbose output

- **DNS Settings**
  - `dyndomains` - dynamic DNS domains
  - `nsupdate_bin` - nsupdate binary path
  - `rndc_key` - RNDC key path

### ⚠️ Requires Restart

The following settings **cannot** be reloaded and require a service restart:

- **Network Ports**
  - `hb_port` - UDP heartbeat port
  - `hbd_port` - HTTP API port
  - `ws_port` - WebSocket port
  - `wss_port` - Secure WebSocket port

- **SSL/TLS Settings**
  - `cert_path` - SSL certificate path
  - `wss_pem` - SSL certificate file
  - `wss_key` - SSL key file

- **Persistence**
  - `pickfile` - Pickle file path

- **Logging**
  - `logfile` - Log file path

- **Journal Settings**
  - `journal_enabled` - Enable/disable journaling
  - `journal_dir` - Journal directory
  - `journal_file` - Journal filename
  - `journal_max_size` - Maximum journal size
  - `journal_max_backups` - Number of backup files

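Before sending SIGHUP, it can help to know whether an edit touches a restart-only key. The following is a minimal sketch and not part of `hbd`: it assumes the restart-only keys listed above sit at the top level of the YAML file and that the running and edited versions have already been parsed into dictionaries. The function name `restart_required_changes` and the sample values are hypothetical.

```python
# Keys copied from the "Requires Restart" list in this document.
RESTART_REQUIRED_KEYS = frozenset({
    "hb_port", "hbd_port", "ws_port", "wss_port",
    "cert_path", "wss_pem", "wss_key",
    "pickfile", "logfile",
    "journal_enabled", "journal_dir", "journal_file",
    "journal_max_size", "journal_max_backups",
})

_MISSING = object()  # sentinel so that "key absent" differs from "key set to None"


def restart_required_changes(old: dict, new: dict) -> list[str]:
    """Return the restart-only keys whose values differ between two configs."""
    return sorted(
        key for key in RESTART_REQUIRED_KEYS
        if old.get(key, _MISSING) != new.get(key, _MISSING)
    )


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    running = {"hb_port": 50003, "interval": 30}
    edited = {"hb_port": 50004, "interval": 60}
    changed = restart_required_changes(running, edited)
    if changed:
        print("Restart needed for:", ", ".join(changed))
    else:
        print("SIGHUP is sufficient")
```

An empty result suggests a reload is enough; any key in the result means the change will only take effect after a restart.
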
## Reload Process

When a SIGHUP signal is received:

1. **Configuration File Loading**
   - The config file is re-read from disk
   - YAML parsing is performed
   - Validation checks are run

2. **Component Updates**
   - Notification system is updated with new channel definitions
   - Threshold checker reloads all threshold configurations
   - Alert states are preserved to maintain hysteresis

3. **Error Handling**
   - If reload fails, the previous configuration is kept
   - Error messages are logged
   - Service continues running with old configuration

4. **Logging**
   - Reload start and completion are logged
   - Each component reports its reload status
   - Total number of thresholds is reported

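The "parse, validate, then swap, or keep the old configuration" behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the `hbd` implementation: the class `ConfigHolder`, the `validate` rules, and the use of `json.loads` as the default parser are all hypothetical stand-ins (the daemon reads YAML, so a YAML loader such as `yaml.safe_load` would be passed as `parse` in practice).

```python
import json
from typing import Callable, Optional


def validate(config: object) -> None:
    """Hypothetical validation: a mapping at top level and a positive interval."""
    if not isinstance(config, dict):
        raise ValueError("top level of config must be a mapping")
    interval = config.get("interval", 1)
    if not isinstance(interval, (int, float)) or interval <= 0:
        raise ValueError("interval must be a positive number")


class ConfigHolder:
    """Holds the active configuration and replaces it only after validation."""

    def __init__(self, initial: dict) -> None:
        self.active = initial
        self.last_error: Optional[str] = None

    def reload(self, text: str, parse: Callable[[str], object] = json.loads) -> bool:
        """Try to replace the active config; return True on success."""
        try:
            candidate = parse(text)
            validate(candidate)
        except Exception as exc:
            # Any parse or validation failure leaves self.active untouched.
            self.last_error = str(exc)
            return False
        self.active = candidate  # one assignment: readers see old or new, never a mix
        self.last_error = None
        return True


if __name__ == "__main__":
    holder = ConfigHolder({"interval": 30})
    print(holder.reload('{"interval": 60}'), holder.active)
    print(holder.reload('{"interval": -5}'), holder.active, holder.last_error)
```

Building the candidate completely before assigning it is what lets a failed reload leave the running configuration unchanged.
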
## Example Reload Session

```bash
# Terminal 1: Watch the logs
tail -f /var/log/heartbeat.log

# Terminal 2: Edit configuration
vim /path/to/.hb.yaml

# Make changes to notification channels or thresholds
# Save the file

# Terminal 3: Trigger reload
kill -HUP $(pgrep -f hbd)

# Terminal 1: See reload messages
2026-04-01 12:34:56 INFO: Received SIGHUP, initiating config reload...
2026-04-01 12:34:56 INFO: ============================================================
2026-04-01 12:34:56 INFO: Starting configuration reload...
2026-04-01 12:34:56 INFO: ============================================================
2026-04-01 12:34:56 INFO: Configuration reloaded from /path/to/.hb.yaml
2026-04-01 12:34:56 INFO: Notification configuration reloaded
2026-04-01 12:34:56 INFO: Reloading threshold configuration...
2026-04-01 12:34:56 INFO: Threshold configuration reloaded: 42 total thresholds
2026-04-01 12:34:56 INFO: ============================================================
2026-04-01 12:34:56 INFO: Configuration reload completed successfully
2026-04-01 12:34:56 INFO: ============================================================
```

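Checking the log after a reload can also be scripted. The sketch below is a hypothetical helper, not something shipped with `hbd`: it assumes the log messages match the sample session above word for word, and the function name `summarize_reload` is made up for illustration.

```python
import re

# Patterns modeled on the sample log lines in this document.
THRESHOLD_LINE = re.compile(r"Threshold configuration reloaded: (\d+) total thresholds")
START_MARKER = "Starting configuration reload"
SUCCESS_MARKER = "Configuration reload completed successfully"


def summarize_reload(log_lines):
    """Return (succeeded, threshold_count) for the most recent reload attempt."""
    succeeded = False
    count = None
    for line in log_lines:
        if START_MARKER in line:
            # A new attempt begins, so forget results from earlier attempts.
            succeeded, count = False, None
        match = THRESHOLD_LINE.search(line)
        if match:
            count = int(match.group(1))
        if SUCCESS_MARKER in line:
            succeeded = True
    return succeeded, count


if __name__ == "__main__":
    sample = [
        "2026-04-01 12:34:56 INFO: Starting configuration reload...",
        "2026-04-01 12:34:56 INFO: Threshold configuration reloaded: 42 total thresholds",
        "2026-04-01 12:34:56 INFO: Configuration reload completed successfully",
    ]
    print(summarize_reload(sample))
```

Comparing the returned count with the number of thresholds you expect is a quick way to notice a rule that was silently dropped.
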
## Common Use Cases

### 1. Update Notification Credentials

If you need to rotate API keys or update SMTP passwords:

```yaml
notification_channels:
  pushover_standard:
    type: pushover
    token: new-token-here  # Updated
    user: new-user-key-here  # Updated
```

Just edit the config file and send SIGHUP - no restart needed.

### 2. Adjust Threshold Values

Fine-tune alerting thresholds based on observed behavior:

```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 85.0  # Increased from 80.0
          critical: 95.0  # Increased from 90.0
```

Send SIGHUP to apply the new thresholds immediately.

### 3. Add New Notification Channels

Add a new notification destination:

```yaml
notification_channels:
  email_oncall:
    type: email
    recipients: [oncall@example.com]
    sender: alerts@example.com
    smtp_server: smtp.example.com

hosts:
  critical_server:
    threshold_config: default
    watch: true
    notification_channels: [pushover_standard, email_oncall]  # Added
```

The new channel becomes active immediately after SIGHUP.

### 4. Update Watch List

Start or stop monitoring hosts without restart:

```yaml
hosts:
  new_server:
    threshold_config: default
    watch: true  # Start watching
    notification_channels: [pushover_standard]
```

## Best Practices

1. **Test Configuration Before Reload**
   - Validate YAML syntax before sending SIGHUP
   - Check for typos in channel names
   - Verify threshold values are reasonable

2. **Monitor Reload Logs**
   - Always check logs after reload to confirm success
   - Look for error messages if reload fails
   - Verify expected number of thresholds loaded

3. **Backup Before Changes**
   - Keep a backup of working configuration
   - Use version control (git) for config files
   - Document why changes were made

4. **Gradual Rollout**
   - Test changes on development server first
   - Apply to one production server at a time
   - Verify behavior before applying everywhere

5. **Plan for Restart-Required Changes**
   - Schedule downtime for port or SSL changes
   - Use blue-green deployment if possible
   - Keep service downtime minimal

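The "check for typos in channel names" advice can be automated with a small pre-reload check. This is a sketch under assumptions rather than a tool provided by `hbd`: it assumes hosts refer to channels and threshold configs by name, as in the use-case examples, and that the file has already been parsed into a dictionary. The function name `find_dangling_references` is hypothetical.

```python
def find_dangling_references(config: dict) -> list[str]:
    """Report hosts that reference undefined channels or threshold configs."""
    channels = config.get("notification_channels") or {}
    threshold_configs = config.get("threshold_configs") or {}
    problems = []
    for host, settings in (config.get("hosts") or {}).items():
        settings = settings or {}  # tolerate a host entry with no body
        for channel in settings.get("notification_channels") or []:
            if channel not in channels:
                problems.append(f"{host}: unknown notification channel '{channel}'")
        name = settings.get("threshold_config")
        if name is not None and name not in threshold_configs:
            problems.append(f"{host}: unknown threshold config '{name}'")
    return problems


if __name__ == "__main__":
    config = {
        "notification_channels": {"pushover_standard": {"type": "pushover"}},
        "threshold_configs": {"default": {}},
        "hosts": {
            "critical_server": {
                "threshold_config": "default",
                # "email_oncal" is a deliberate typo for the demonstration.
                "notification_channels": ["pushover_standard", "email_oncal"],
            },
        },
    }
    for problem in find_dangling_references(config):
        print(problem)
```

Running a check like this before sending SIGHUP catches misspelled references that YAML syntax validation alone would not.
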
## Troubleshooting

### Reload Doesn't Apply Changes

**Check:**
- Is the config file path correct?
- Did you save the file after editing?
- Are there YAML syntax errors?
- Do the logs show error messages?

**Solution:**

```bash
# Validate YAML syntax (requires PyYAML)
python -c "import yaml; yaml.safe_load(open('.hb.yaml'))"

# Check file modification time
ls -l .hb.yaml

# View logs
journalctl -u heartbeat -f
```

### Partial Configuration Applied

**Cause:** Some sections reloaded, others didn't.

**Solution:** Check logs to see which components failed. Common issues:
- Invalid channel type
- Missing required threshold fields
- Invalid host references

### Service Becomes Unresponsive

**Cause:** Malformed configuration caused an exception.

**Solution:**
1. Revert to backup configuration
2. Send SIGHUP again to reload the good config
3. If service is completely stuck, restart it

## Implementation Details

The reload mechanism uses:

- **Signal Handling**: SIGHUP triggers reload event
- **Async-Safe Reloading**: Configuration is loaded asynchronously
- **Component Coordination**: All affected components are updated atomically
- **State Preservation**: Alert states and hysteresis information are maintained
- **Error Recovery**: Failed reloads don't affect running configuration

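The pattern of a signal handler that only raises an event, with the actual reload done elsewhere, might look like the sketch below. It assumes an `asyncio` event loop on a Unix system, which this document does not confirm for `hbd`; `watch_for_sighup`, `demo`, and the `max_reloads` parameter are hypothetical names used only for illustration.

```python
import asyncio
import os
import signal


async def watch_for_sighup(reload_fn, max_reloads=None):
    """Call reload_fn each time SIGHUP arrives; return the number of reloads."""
    loop = asyncio.get_running_loop()
    reload_requested = asyncio.Event()
    # The handler only sets a flag; the reload itself runs in the loop below.
    loop.add_signal_handler(signal.SIGHUP, reload_requested.set)
    reloads = 0
    try:
        while max_reloads is None or reloads < max_reloads:
            await reload_requested.wait()
            reload_requested.clear()
            reload_fn()
            reloads += 1
    finally:
        loop.remove_signal_handler(signal.SIGHUP)
    return reloads


async def demo():
    """Send SIGHUP to this process once and report what the callback recorded."""
    calls = []
    task = asyncio.create_task(
        watch_for_sighup(lambda: calls.append("reloaded"), max_reloads=1)
    )
    await asyncio.sleep(0)  # let the watcher install its handler first
    os.kill(os.getpid(), signal.SIGHUP)
    await asyncio.wait_for(task, timeout=5)
    return calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keeping the handler this small means the reload work runs as ordinary loop code rather than inside the signal handler.
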
## See Also

- [NOTIFICATIONS.md](NOTIFICATIONS.md) - Notification channel configuration
- [THRESHOLD_ALERTING.md](THRESHOLD_ALERTING.md) - Threshold configuration details
- Configuration examples in `hbd/config_*.yaml`