hbc proper termination, hbd config reloadable
@@ -0,0 +1,292 @@
# Configuration Reload

The heartbeat daemon (hbd) supports runtime configuration reloading without requiring a full restart. This allows you to update certain configuration settings while the service continues running.

## How to Reload Configuration

Send a SIGHUP signal to the running hbd process:

```bash
# Find the process ID
ps aux | grep hbd

# Or use pidof/pgrep
pidof hbd
pgrep -f hbd

# Send SIGHUP signal
kill -HUP <pid>

# Or if using systemd (the unit must define ExecReload=)
systemctl reload heartbeat
```

## What Can Be Reloaded

The following configuration sections can be reloaded without restarting:

### ✅ Fully Reloadable

- **Notification Channels** (`notification_channels`)
  - Add, remove, or modify notification channel definitions
  - Update tokens, API keys, SMTP credentials
  - Change recipient lists

- **Threshold Configurations** (`threshold_configs`)
  - Modify warning and critical thresholds
  - Add or remove threshold rules
  - Change operators and hysteresis values
  - Update display formats

- **Host Configuration** (`hosts`)
  - Change watch status
  - Update notification channel assignments
  - Modify threshold config assignments
  - Change dyndns status

- **Host Lists**
  - `watchhosts` - hosts to monitor
  - `dyndnshosts` - hosts with dynamic DNS
  - `drophosts` - hosts to ignore

- **Runtime Settings**
  - `grace` - grace period multiplier
  - `interval` - expected heartbeat interval
  - `threshold_renotify_interval` - re-notification interval
  - `debug` - debug level
  - `verbose` - verbose output

- **DNS Settings**
  - `dyndomains` - dynamic DNS domains
  - `nsupdate_bin` - nsupdate binary path
  - `rndc_key` - RNDC key path

### ⚠️ Requires Restart

The following settings **cannot** be reloaded and require a service restart:

- **Network Ports**
  - `hb_port` - UDP heartbeat port
  - `hbd_port` - HTTP API port
  - `ws_port` - WebSocket port
  - `wss_port` - Secure WebSocket port

- **SSL/TLS Settings**
  - `cert_path` - SSL certificate path
  - `wss_pem` - SSL certificate file
  - `wss_key` - SSL key file

- **Persistence**
  - `pickfile` - Pickle file path

- **Logging**
  - `logfile` - Log file path
  - `logfmt` - Log format

- **Journal Settings**
  - `journal_enabled` - Enable/disable journaling
  - `journal_dir` - Journal directory
  - `journal_file` - Journal filename
  - `journal_max_size` - Maximum journal size
  - `journal_max_backups` - Number of backup files
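
Because a SIGHUP cannot apply these settings, it can help to check whether an edited config touches any of them before reloading. The following is a minimal sketch, not part of hbd: the function name `restart_required_changes` is hypothetical, and it assumes the keys listed above appear at the top level of the parsed config mapping.

```python
# Keys the list above marks as restart-only. Treating them as top-level
# config keys is an assumption made for this sketch.
RESTART_REQUIRED_KEYS = (
    "hb_port", "hbd_port", "ws_port", "wss_port",
    "cert_path", "wss_pem", "wss_key",
    "pickfile",
    "logfile", "logfmt",
    "journal_enabled", "journal_dir", "journal_file",
    "journal_max_size", "journal_max_backups",
)


def restart_required_changes(old_cfg, new_cfg):
    """Return {key: (old, new)} for restart-only keys whose value differs."""
    changes = {}
    for key in RESTART_REQUIRED_KEYS:
        old_value = old_cfg.get(key)
        new_value = new_cfg.get(key)
        if old_value != new_value:
            changes[key] = (old_value, new_value)
    return changes
```

An empty result suggests a reload is sufficient; a non-empty result names the settings that would only take effect after a restart.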

## Reload Process

When a SIGHUP signal is received:

1. **Configuration File Loading**
   - The config file is re-read from disk
   - YAML parsing is performed
   - Validation checks are run

2. **Component Updates**
   - Notification system is updated with new channel definitions
   - Threshold checker reloads all threshold configurations
   - Alert states are preserved to maintain hysteresis

3. **Error Handling**
   - If reload fails, the previous configuration is kept
   - Error messages are logged
   - Service continues running with old configuration

4. **Logging**
   - Reload start and completion are logged
   - Each component reports its reload status
   - Total number of thresholds is reported
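
The load, validate, and keep-the-old-config-on-failure flow described above can be sketched roughly as follows. This is an illustration under assumptions, not hbd's actual code: `ConfigHolder`, `validate`, and `install_sighup_handler` are hypothetical names, the loader is any callable that returns a parsed config mapping, and the validation shown is only a placeholder.

```python
import logging
import signal

log = logging.getLogger("hbd.reload")  # logger name is an assumption


def validate(cfg):
    """Placeholder validation: only checks the top-level type."""
    if not isinstance(cfg, dict):
        raise ValueError("top-level config must be a mapping")


class ConfigHolder:
    """Holds the active config and swaps it only after a successful load."""

    def __init__(self, load_config, initial):
        self._load_config = load_config  # hypothetical callable returning a dict
        self.current = initial

    def reload(self):
        """Try to load a new config; keep the old one on any failure."""
        try:
            candidate = self._load_config()
            validate(candidate)
        except Exception as exc:
            log.error("Config reload failed, keeping previous config: %s", exc)
            return False
        self.current = candidate
        log.info("Configuration reload completed successfully")
        return True


def install_sighup_handler(loop, holder):
    """Run holder.reload() on SIGHUP (asyncio event loop, Unix only)."""
    loop.add_signal_handler(signal.SIGHUP, holder.reload)
```

The candidate config is assigned to `current` only after loading and validation both succeed, which is what lets a failed reload leave the running configuration untouched.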

## Example Reload Session

```bash
# Terminal 1: Watch the logs
tail -f /var/log/heartbeat.log

# Terminal 2: Edit configuration
vim /path/to/.hb.yaml

# Make changes to notification channels or thresholds
# Save the file

# Terminal 3: Trigger reload
# (pgrep -f matches the full command line; confirm it returns only the hbd PID)
kill -HUP $(pgrep -f hbd)

# Terminal 1: See reload messages
2026-04-01 12:34:56 INFO: Received SIGHUP, initiating config reload...
2026-04-01 12:34:56 INFO: ============================================================
2026-04-01 12:34:56 INFO: Starting configuration reload...
2026-04-01 12:34:56 INFO: ============================================================
2026-04-01 12:34:56 INFO: Configuration reloaded from /path/to/.hb.yaml
2026-04-01 12:34:56 INFO: Notification configuration reloaded
2026-04-01 12:34:56 INFO: Reloading threshold configuration...
2026-04-01 12:34:56 INFO: Threshold configuration reloaded: 42 total thresholds
2026-04-01 12:34:56 INFO: ============================================================
2026-04-01 12:34:56 INFO: Configuration reload completed successfully
2026-04-01 12:34:56 INFO: ============================================================
```

## Common Use Cases

### 1. Update Notification Credentials

If you need to rotate API keys or update SMTP passwords:

```yaml
notification_channels:
  pushover_standard:
    type: pushover
    token: new-token-here  # Updated
    user: new-user-key-here  # Updated
```

Edit the config file and send SIGHUP; no restart is needed.

### 2. Adjust Threshold Values

Fine-tune alerting thresholds based on observed behavior:

```yaml
threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 85.0  # Increased from 80.0
          critical: 95.0  # Increased from 90.0
```

Send SIGHUP to apply the new thresholds immediately.

### 3. Add New Notification Channels

Add a new notification destination:

```yaml
notification_channels:
  email_oncall:
    type: email
    recipients: [oncall@example.com]
    sender: alerts@example.com
    smtp_server: smtp.example.com

hosts:
  critical_server:
    threshold_config: default
    watch: true
    notification_channels: [pushover_standard, email_oncall]  # Added
```

The new channel becomes active immediately after SIGHUP.

### 4. Update Watch List

Start or stop monitoring hosts without a restart:

```yaml
hosts:
  new_server:
    threshold_config: default
    watch: true  # Start watching
    notification_channels: [pushover_standard]
```

## Best Practices

1. **Test Configuration Before Reload**
   - Validate YAML syntax before sending SIGHUP
   - Check for typos in channel names
   - Verify threshold values are reasonable

2. **Monitor Reload Logs**
   - Always check logs after reload to confirm success
   - Look for error messages if reload fails
   - Verify expected number of thresholds loaded

3. **Backup Before Changes**
   - Keep a backup of working configuration
   - Use version control (git) for config files
   - Document why changes were made

4. **Gradual Rollout**
   - Test changes on development server first
   - Apply to one production server at a time
   - Verify behavior before applying everywhere

5. **Plan for Restart-Required Changes**
   - Schedule downtime for port or SSL changes
   - Use blue-green deployment if possible
   - Keep service downtime minimal
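
A syntax check alone will not catch a mistyped channel or threshold config name. As a rough sketch of the "check for typos" step, the function below looks for host entries that reference names not defined elsewhere in the config. It is not part of hbd; `find_broken_references` is a hypothetical helper, and it assumes the `hosts`, `notification_channels`, and `threshold_configs` layout shown in the use cases above.

```python
def find_broken_references(cfg):
    """Return a list of human-readable problems found in host entries."""
    channels = cfg.get("notification_channels") or {}
    threshold_configs = cfg.get("threshold_configs") or {}
    problems = []
    for host, settings in (cfg.get("hosts") or {}).items():
        settings = settings or {}
        # Every channel a host lists must be defined under notification_channels.
        for channel in settings.get("notification_channels") or []:
            if channel not in channels:
                problems.append(f"{host}: unknown notification channel '{channel}'")
        # The host's threshold_config, if set, must exist under threshold_configs.
        name = settings.get("threshold_config")
        if name is not None and name not in threshold_configs:
            problems.append(f"{host}: unknown threshold config '{name}'")
    return problems
```

The input is the mapping produced by parsing the config file, for example with `yaml.safe_load`; an empty list means no dangling references were found.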

## Troubleshooting

### Reload Doesn't Apply Changes

**Check:**
- Is the config file path correct?
- Did you save the file after editing?
- Are there YAML syntax errors?
- Check the logs for error messages

**Solution:**
```bash
# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('.hb.yaml'))"

# Check file modification time
ls -l .hb.yaml

# View logs
journalctl -u heartbeat -f
```

### Partial Configuration Applied

**Cause:** Some sections reloaded, others didn't.

**Solution:** Check logs to see which components failed. Common issues:
- Invalid channel type
- Missing required threshold fields
- Invalid host references

### Service Becomes Unresponsive

**Cause:** Malformed configuration caused an exception.

**Solution:**
1. Revert to backup configuration
2. Send SIGHUP again to reload the good config
3. If service is completely stuck, restart it

## Implementation Details

The reload mechanism uses:

- **Signal Handling**: SIGHUP triggers reload event
- **Async-Safe Reloading**: Configuration is loaded asynchronously
- **Component Coordination**: All affected components are updated atomically
- **State Preservation**: Alert states and hysteresis information are maintained
- **Error Recovery**: Failed reloads don't affect running configuration
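
One way to picture state preservation is a checker that replaces its threshold definitions on reload but keeps the alert state of every threshold that still exists, so hysteresis is not reset by a config change. The sketch below is illustrative only: this `ThresholdChecker` class, its attributes, and the state values are assumptions, not hbd's real data structures.

```python
class ThresholdChecker:
    """Hypothetical checker that keeps per-threshold alert state across reloads."""

    def __init__(self, thresholds):
        self.thresholds = dict(thresholds)  # key -> threshold definition
        self.alert_states = {}              # key -> current state, e.g. "warning"

    def reload(self, new_thresholds):
        """Swap in new definitions, keeping state for thresholds that still exist."""
        new_thresholds = dict(new_thresholds)
        # Drop state only for thresholds that were removed from the config.
        self.alert_states = {
            key: state
            for key, state in self.alert_states.items()
            if key in new_thresholds
        }
        self.thresholds = new_thresholds
        return len(self.thresholds)  # total thresholds now loaded
```

Keeping the state keyed by threshold rather than rebuilding it means a host already in a warning state does not trigger a fresh notification merely because its limits were edited.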

## See Also

- [NOTIFICATIONS.md](NOTIFICATIONS.md) - Notification channel configuration
- [THRESHOLD_ALERTING.md](THRESHOLD_ALERTING.md) - Threshold configuration details
- Configuration examples in `hbd/config_*.yaml`