Compare commits
14 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 8da3d550eb | |||
| a76d0fc840 | |||
| 94cbb31c48 | |||
| ae60844a8a | |||
| 49fa310361 | |||
| 28e2180f7b | |||
| ce0590f015 | |||
| f50acca509 | |||
| 72fc82b91f | |||
| 46f8c32c0b | |||
| 691f62aa69 | |||
| cffc9805f9 | |||
| 917d6a401b | |||
| 2bd3a9beb6 |
@@ -267,6 +267,41 @@ All plugin metrics can be thresholded:
|
|||||||
- **Network**: errors_total, dropped packets, connection counts
|
- **Network**: errors_total, dropped packets, connection counts
|
||||||
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
|
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
|
||||||
|
|
||||||
|
### Per-Host Threshold Profiles
|
||||||
|
|
||||||
|
Named threshold configurations let different hosts use different limits. A host's `threshold_config` can be a single name or a **list** — lists are applied left-to-right so profiles compose without duplication:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
threshold_configs:
|
||||||
|
default:
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 80, critical: 90}
|
||||||
|
memory_monitor:
|
||||||
|
memory_percent: {warning: 85, critical: 95}
|
||||||
|
|
||||||
|
tight_cpu: # override CPU limits only
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 60, critical: 75}
|
||||||
|
|
||||||
|
db_disk: # add a database partition check
|
||||||
|
thresholds:
|
||||||
|
disk_monitor:
|
||||||
|
partitions:
|
||||||
|
/var/lib/postgresql:
|
||||||
|
percent: {warning: 75, critical: 88}
|
||||||
|
|
||||||
|
hosts:
|
||||||
|
web-01:
|
||||||
|
threshold_config: default # single profile
|
||||||
|
|
||||||
|
db-01:
|
||||||
|
threshold_config: [tight_cpu, db_disk] # layered: CPU override + extra disk check
|
||||||
|
```
|
||||||
|
|
||||||
|
Each named config's overrides are applied in order on top of the defaults. Metrics not mentioned in a profile are inherited unchanged.
|
||||||
|
|
||||||
See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.
|
See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|||||||
+183
-47
@@ -814,34 +814,32 @@ Planned features:
|
|||||||
|
|
||||||
## Multi-Threshold Configuration
|
## Multi-Threshold Configuration
|
||||||
|
|
||||||
**New in version 2.0**: Support for multiple named threshold configurations with per-host mapping.
|
Support for multiple named threshold configurations with per-host mapping and composable layering.
|
||||||
|
|
||||||
### Overview
|
### Overview
|
||||||
|
|
||||||
The multi-threshold feature allows you to:
|
The multi-threshold feature allows you to:
|
||||||
- Define multiple sets of threshold configurations
|
- Define multiple named threshold configurations
|
||||||
- Map different hosts to different threshold sets
|
- Assign one or more configurations to each host
|
||||||
|
- Compose configurations by layering — each named config's overrides are applied in order on top of the defaults
|
||||||
- Use different sensitivity levels for different environments
|
- Use different sensitivity levels for different environments
|
||||||
- Maintain a default configuration for unmapped hosts
|
|
||||||
|
|
||||||
### Configuration Structure
|
### Configuration Structure
|
||||||
|
|
||||||
|
Named configurations are defined under `threshold_configs`. Each host selects which ones to use via `threshold_config` in the `hosts` section (a string for a single config, or a list to layer multiple):
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
# Optional: Set the default configuration name (defaults to "default")
|
# Optional: set the default configuration name (defaults to "default")
|
||||||
default_threshold_config: "default"
|
default_threshold_config: "default"
|
||||||
|
|
||||||
# Define multiple named threshold configurations
|
|
||||||
threshold_configs:
|
threshold_configs:
|
||||||
# Configuration name 1
|
|
||||||
default:
|
default:
|
||||||
thresholds:
|
thresholds:
|
||||||
# Standard threshold definitions
|
|
||||||
cpu_monitor:
|
cpu_monitor:
|
||||||
cpu_percent:
|
cpu_percent:
|
||||||
warning: 80.0
|
warning: 80.0
|
||||||
critical: 90.0
|
critical: 90.0
|
||||||
|
|
||||||
# Configuration name 2
|
|
||||||
high_sensitivity:
|
high_sensitivity:
|
||||||
thresholds:
|
thresholds:
|
||||||
cpu_monitor:
|
cpu_monitor:
|
||||||
@@ -849,7 +847,6 @@ threshold_configs:
|
|||||||
warning: 60.0
|
warning: 60.0
|
||||||
critical: 75.0
|
critical: 75.0
|
||||||
|
|
||||||
# Configuration name 3
|
|
||||||
low_sensitivity:
|
low_sensitivity:
|
||||||
thresholds:
|
thresholds:
|
||||||
cpu_monitor:
|
cpu_monitor:
|
||||||
@@ -857,14 +854,77 @@ threshold_configs:
|
|||||||
warning: 90.0
|
warning: 90.0
|
||||||
critical: 95.0
|
critical: 95.0
|
||||||
|
|
||||||
# Map specific hosts to specific configurations
|
hosts:
|
||||||
host_threshold_mapping:
|
prod-web-01:
|
||||||
prod-web-01: high_sensitivity
|
threshold_config: high_sensitivity # single config
|
||||||
prod-web-02: high_sensitivity
|
|
||||||
dev-server-01: low_sensitivity
|
dev-server-01:
|
||||||
# Unmapped hosts use default_threshold_config
|
threshold_config: low_sensitivity
|
||||||
|
|
||||||
|
# Hosts with no threshold_config use default_threshold_config
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Composable Configurations (list form)
|
||||||
|
|
||||||
|
`threshold_config` can be a list. Configs are applied **left to right**: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
threshold_configs:
|
||||||
|
default:
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 80, critical: 90}
|
||||||
|
memory_monitor:
|
||||||
|
memory_percent: {warning: 85, critical: 95}
|
||||||
|
disk_monitor:
|
||||||
|
partitions:
|
||||||
|
/:
|
||||||
|
percent: {warning: 80, critical: 90}
|
||||||
|
|
||||||
|
# Tighter CPU limits for busy servers
|
||||||
|
high_cpu_load:
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 60, critical: 75}
|
||||||
|
|
||||||
|
# Tighter disk limits for data-heavy servers
|
||||||
|
busy_disk:
|
||||||
|
thresholds:
|
||||||
|
disk_monitor:
|
||||||
|
partitions:
|
||||||
|
/:
|
||||||
|
percent: {warning: 70, critical: 85}
|
||||||
|
|
||||||
|
hosts:
|
||||||
|
# Gets default thresholds only
|
||||||
|
web-01:
|
||||||
|
threshold_config: default
|
||||||
|
|
||||||
|
# Gets tighter CPU limits, default memory and disk
|
||||||
|
build-server:
|
||||||
|
threshold_config: high_cpu_load
|
||||||
|
|
||||||
|
# Layers both: tighter CPU AND tighter disk, default memory
|
||||||
|
db-01:
|
||||||
|
threshold_config: [high_cpu_load, busy_disk]
|
||||||
|
|
||||||
|
# Three layers: busy_disk overrides high_cpu_load if they conflict
|
||||||
|
storage-01:
|
||||||
|
threshold_config: [default, high_cpu_load, busy_disk]
|
||||||
|
```
|
||||||
|
|
||||||
|
**How layering works:**
|
||||||
|
|
||||||
|
Starting from the `default` thresholds:
|
||||||
|
|
||||||
|
| Layer | Applied config | Effect |
|
||||||
|
|-------|---------------|--------|
|
||||||
|
| Base | `default` | all default thresholds |
|
||||||
|
| +1 | `high_cpu_load` | cpu_percent overridden to 60/75 |
|
||||||
|
| +2 | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |
|
||||||
|
|
||||||
|
Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.
|
||||||
|
|
||||||
### Use Cases
|
### Use Cases
|
||||||
|
|
||||||
#### 1. Environment-Based Thresholds
|
#### 1. Environment-Based Thresholds
|
||||||
@@ -887,11 +947,15 @@ threshold_configs:
|
|||||||
warning: 90.0 # More relaxed for dev
|
warning: 90.0 # More relaxed for dev
|
||||||
critical: 98.0
|
critical: 98.0
|
||||||
|
|
||||||
host_threshold_mapping:
|
hosts:
|
||||||
prod-web-01: production
|
prod-web-01:
|
||||||
prod-web-02: production
|
threshold_config: production
|
||||||
dev-web-01: development
|
prod-web-02:
|
||||||
dev-web-02: development
|
threshold_config: production
|
||||||
|
dev-web-01:
|
||||||
|
threshold_config: development
|
||||||
|
dev-web-02:
|
||||||
|
threshold_config: development
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 2. Server Role-Based Thresholds
|
#### 2. Server Role-Based Thresholds
|
||||||
@@ -914,7 +978,7 @@ threshold_configs:
|
|||||||
warning: 70.0
|
warning: 70.0
|
||||||
critical: 85.0
|
critical: 85.0
|
||||||
memory_monitor:
|
memory_monitor:
|
||||||
percent:
|
memory_percent:
|
||||||
warning: 90.0 # Databases can use high memory
|
warning: 90.0 # Databases can use high memory
|
||||||
critical: 97.0
|
critical: 97.0
|
||||||
disk_monitor:
|
disk_monitor:
|
||||||
@@ -927,17 +991,23 @@ threshold_configs:
|
|||||||
cache:
|
cache:
|
||||||
thresholds:
|
thresholds:
|
||||||
memory_monitor:
|
memory_monitor:
|
||||||
percent:
|
memory_percent:
|
||||||
warning: 95.0 # Redis/Memcached can use very high memory
|
warning: 95.0 # Redis/Memcached can use very high memory
|
||||||
critical: 99.0
|
critical: 99.0
|
||||||
|
|
||||||
host_threshold_mapping:
|
hosts:
|
||||||
web-01: webserver
|
web-01:
|
||||||
web-02: webserver
|
threshold_config: webserver
|
||||||
db-01: database
|
web-02:
|
||||||
db-02: database
|
threshold_config: webserver
|
||||||
redis-01: cache
|
db-01:
|
||||||
memcached-01: cache
|
threshold_config: database
|
||||||
|
db-02:
|
||||||
|
threshold_config: database
|
||||||
|
redis-01:
|
||||||
|
threshold_config: cache
|
||||||
|
memcached-01:
|
||||||
|
threshold_config: cache
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 3. Sensitivity Levels
|
#### 3. Sensitivity Levels
|
||||||
@@ -952,7 +1022,7 @@ threshold_configs:
|
|||||||
partitions:
|
partitions:
|
||||||
/:
|
/:
|
||||||
percent:
|
percent:
|
||||||
warning: 70.0 # Very sensitive
|
warning: 70.0
|
||||||
critical: 80.0
|
critical: 80.0
|
||||||
hysteresis: 0.15
|
hysteresis: 0.15
|
||||||
|
|
||||||
@@ -976,12 +1046,69 @@ threshold_configs:
|
|||||||
critical: 98.0
|
critical: 98.0
|
||||||
hysteresis: 0.05
|
hysteresis: 0.05
|
||||||
|
|
||||||
host_threshold_mapping:
|
hosts:
|
||||||
payment-gateway: critical
|
payment-gateway:
|
||||||
auth-server: critical
|
threshold_config: critical
|
||||||
web-01: standard
|
auth-server:
|
||||||
web-02: standard
|
threshold_config: critical
|
||||||
test-server: relaxed
|
web-01:
|
||||||
|
threshold_config: standard
|
||||||
|
web-02:
|
||||||
|
threshold_config: standard
|
||||||
|
test-server:
|
||||||
|
threshold_config: relaxed
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 4. Composable Profiles
|
||||||
|
|
||||||
|
Build host-specific thresholds by combining small, focused configs:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
threshold_configs:
|
||||||
|
# Baseline — everything at default levels
|
||||||
|
default:
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 80, critical: 90}
|
||||||
|
memory_monitor:
|
||||||
|
memory_percent: {warning: 85, critical: 95}
|
||||||
|
|
||||||
|
# Overlay: tighter CPU only
|
||||||
|
tight_cpu:
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 60, critical: 75}
|
||||||
|
|
||||||
|
# Overlay: tighter memory only
|
||||||
|
tight_memory:
|
||||||
|
thresholds:
|
||||||
|
memory_monitor:
|
||||||
|
memory_percent: {warning: 70, critical: 85}
|
||||||
|
|
||||||
|
# Overlay: extra disk partition for database servers
|
||||||
|
db_disk:
|
||||||
|
thresholds:
|
||||||
|
disk_monitor:
|
||||||
|
partitions:
|
||||||
|
/var/lib/postgresql:
|
||||||
|
percent: {warning: 75, critical: 88}
|
||||||
|
|
||||||
|
hosts:
|
||||||
|
# Plain web server
|
||||||
|
web-01:
|
||||||
|
threshold_config: default
|
||||||
|
|
||||||
|
# Build server: tight CPU, default memory and disk
|
||||||
|
build-01:
|
||||||
|
threshold_config: tight_cpu
|
||||||
|
|
||||||
|
# Database: tight CPU + tight memory + extra disk partition
|
||||||
|
db-01:
|
||||||
|
threshold_config: [tight_cpu, tight_memory, db_disk]
|
||||||
|
|
||||||
|
# Replica database: tight memory + extra disk, normal CPU
|
||||||
|
db-02:
|
||||||
|
threshold_config: [tight_memory, db_disk]
|
||||||
```
|
```
|
||||||
|
|
||||||
### Backward Compatibility
|
### Backward Compatibility
|
||||||
@@ -1012,16 +1139,25 @@ threshold_configs:
|
|||||||
|
|
||||||
### Configuration Priority
|
### Configuration Priority
|
||||||
|
|
||||||
1. **Host-specific mapping**: If host is in `host_threshold_mapping`, use that config
|
1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
|
||||||
2. **Default config**: Use `default_threshold_config`
|
2. **Host `threshold_config` (string)**: Use that single named config directly
|
||||||
3. **First alphabetically**: If default not found, use first config alphabetically
|
3. **`host_threshold_mapping`** (legacy): Same as above, string only
|
||||||
4. **Legacy fallback**: If `threshold_configs` not present, use `thresholds`
|
4. **`default_threshold_config`**: Used for hosts with no mapping
|
||||||
|
5. **First alphabetically**: If the default config is not found, use the first config alphabetically
|
||||||
|
6. **Legacy `thresholds` section**: Used when `threshold_configs` is absent entirely
|
||||||
|
|
||||||
### Example: Complete Multi-Threshold Setup
|
### Backward Compatibility
|
||||||
|
|
||||||
See `hbd/config_multi_threshold_example.yaml` for a complete example with:
|
The legacy `host_threshold_mapping` top-level key and the flat `thresholds` section are still fully supported:
|
||||||
- 4 named configurations (default, high_sensitivity, low_sensitivity, database)
|
|
||||||
- Host-to-config mappings for production, development, and test systems
|
```yaml
|
||||||
- Specialized database server thresholds
|
# Still works — equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
|
||||||
- Custom display messages with plugin data
|
host_threshold_mapping:
|
||||||
|
prod-web-01: high_sensitivity
|
||||||
|
|
||||||
|
# Still works — equivalent to threshold_configs: {default: {thresholds: ...}}
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 80, critical: 90}
|
||||||
|
```
|
||||||
|
|
||||||
|
|||||||
+1
-1
@@ -14,4 +14,4 @@ Install options:
|
|||||||
"""
|
"""
|
||||||
|
|
||||||
__all__ = ["__version__"]
|
__all__ = ["__version__"]
|
||||||
__version__ = "5.1.12"
|
__version__ = "5.1.16"
|
||||||
|
|||||||
@@ -525,6 +525,13 @@ async def async_main(args, config):
|
|||||||
for sig in (signal.SIGTERM, signal.SIGINT):
|
for sig in (signal.SIGTERM, signal.SIGINT):
|
||||||
loop.add_signal_handler(sig, stop)
|
loop.add_signal_handler(sig, stop)
|
||||||
|
|
||||||
|
def _sighup():
|
||||||
|
global dorestart
|
||||||
|
dorestart = True
|
||||||
|
stop()
|
||||||
|
|
||||||
|
loop.add_signal_handler(signal.SIGHUP, _sighup)
|
||||||
|
|
||||||
# Start async tasks
|
# Start async tasks
|
||||||
# Heartbeat senders (one per connection)
|
# Heartbeat senders (one per connection)
|
||||||
for conn in connections:
|
for conn in connections:
|
||||||
|
|||||||
@@ -0,0 +1,130 @@
|
|||||||
|
"""
|
||||||
|
ZFS pool monitoring plugin for Heartbeat.
|
||||||
|
|
||||||
|
Collects per-pool health, capacity, and cumulative I/O statistics via zpool(8).
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import logging
|
||||||
|
import shutil
|
||||||
|
from typing import Any, Dict, List, Optional
|
||||||
|
|
||||||
|
from hbd.client.plugin import MonitorPlugin
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def _int(s: str) -> Optional[int]:
|
||||||
|
try:
|
||||||
|
return int(s.strip().rstrip("KMGTkBkmgt%x"))
|
||||||
|
except (ValueError, AttributeError):
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def _float(s: str) -> Optional[float]:
|
||||||
|
try:
|
||||||
|
return float(s.strip().rstrip("%x"))
|
||||||
|
except (ValueError, AttributeError):
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
class ZFSMonitorPlugin(MonitorPlugin):
|
||||||
|
"""Monitor ZFS pool health, capacity, and I/O statistics.
|
||||||
|
|
||||||
|
Collects per pool:
|
||||||
|
- health: ONLINE, DEGRADED, FAULTED, etc.
|
||||||
|
- size / alloc / free: total, allocated and free bytes
|
||||||
|
- capacity: percentage used (0-100)
|
||||||
|
- frag: fragmentation percentage
|
||||||
|
- dedup: deduplication ratio
|
||||||
|
- read_ops / write_ops: cumulative I/O operations since last boot/clear
|
||||||
|
- read_bw / write_bw: cumulative bytes transferred since last boot/clear
|
||||||
|
|
||||||
|
Configuration:
|
||||||
|
interval: collection interval in seconds (default: 300)
|
||||||
|
pools: list of pool names to monitor (default: all)
|
||||||
|
"""
|
||||||
|
|
||||||
|
name = "zfs_monitor"
|
||||||
|
description = "ZFS pool health, capacity, and I/O statistics"
|
||||||
|
interval = 300
|
||||||
|
|
||||||
|
def __init__(self, config: Optional[Dict[str, Any]] = None):
|
||||||
|
super().__init__(config)
|
||||||
|
self.interval = self.config.get("interval", 300)
|
||||||
|
self._pools_filter: Optional[List[str]] = self.config.get("pools", None)
|
||||||
|
|
||||||
|
async def initialize(self) -> bool:
|
||||||
|
if not shutil.which("zpool"):
|
||||||
|
self.skip_reason = "zpool not found"
|
||||||
|
return False
|
||||||
|
logger.info("ZFS monitor initialized (interval: %ds)", self.interval)
|
||||||
|
return True
|
||||||
|
|
||||||
|
async def _run(self, *args: str) -> List[str]:
|
||||||
|
"""Run a command and return its stdout lines, or [] on error."""
|
||||||
|
try:
|
||||||
|
proc = await asyncio.create_subprocess_exec(
|
||||||
|
*args,
|
||||||
|
stdout=asyncio.subprocess.PIPE,
|
||||||
|
stderr=asyncio.subprocess.DEVNULL,
|
||||||
|
)
|
||||||
|
stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=15)
|
||||||
|
return stdout.decode(errors="replace").splitlines()
|
||||||
|
except (FileNotFoundError, asyncio.TimeoutError) as exc:
|
||||||
|
logger.warning("zfs_monitor: %s: %s", args[0], exc)
|
||||||
|
return []
|
||||||
|
|
||||||
|
async def _zpool_list(self) -> Dict[str, Dict]:
|
||||||
|
"""Return per-pool health and capacity from `zpool list`."""
|
||||||
|
lines = await self._run(
|
||||||
|
"zpool", "list", "-H", "-p",
|
||||||
|
"-o", "name,health,size,alloc,free,cap,frag,dedup",
|
||||||
|
)
|
||||||
|
pools: Dict[str, Dict] = {}
|
||||||
|
for line in lines:
|
||||||
|
parts = line.split("\t")
|
||||||
|
if len(parts) < 8:
|
||||||
|
continue
|
||||||
|
name = parts[0].strip()
|
||||||
|
if self._pools_filter and name not in self._pools_filter:
|
||||||
|
continue
|
||||||
|
pools[name] = {
|
||||||
|
"health": parts[1].strip(),
|
||||||
|
"size": _int(parts[2]),
|
||||||
|
"alloc": _int(parts[3]),
|
||||||
|
"free": _int(parts[4]),
|
||||||
|
"capacity": _float(parts[5]),
|
||||||
|
"frag": _float(parts[6]),
|
||||||
|
"dedup": _float(parts[7]),
|
||||||
|
}
|
||||||
|
return pools
|
||||||
|
|
||||||
|
async def _zpool_iostat(self) -> Dict[str, Dict]:
|
||||||
|
"""Return per-pool cumulative I/O counters from `zpool iostat`."""
|
||||||
|
lines = await self._run("zpool", "iostat", "-H", "-p")
|
||||||
|
io: Dict[str, Dict] = {}
|
||||||
|
for line in lines:
|
||||||
|
parts = line.split("\t")
|
||||||
|
if len(parts) < 7:
|
||||||
|
continue
|
||||||
|
name = parts[0].strip()
|
||||||
|
if not name or name.startswith(" "):
|
||||||
|
continue
|
||||||
|
io[name] = {
|
||||||
|
"read_ops": _int(parts[3]),
|
||||||
|
"write_ops": _int(parts[4]),
|
||||||
|
"read_bw": _int(parts[5]),
|
||||||
|
"write_bw": _int(parts[6]),
|
||||||
|
}
|
||||||
|
return io
|
||||||
|
|
||||||
|
async def _collect_metrics(self) -> Dict[str, Any]:
|
||||||
|
pools, io = await asyncio.gather(self._zpool_list(), self._zpool_iostat())
|
||||||
|
for name, stats in io.items():
|
||||||
|
if name in pools:
|
||||||
|
pools[name].update(stats)
|
||||||
|
return {"pools": pools}
|
||||||
|
|
||||||
|
|
||||||
|
plugin = ZFSMonitorPlugin
|
||||||
@@ -225,7 +225,7 @@ def get_watchhosts(config):
|
|||||||
hosts_config = config.get("hosts", {})
|
hosts_config = config.get("hosts", {})
|
||||||
if isinstance(hosts_config, dict):
|
if isinstance(hosts_config, dict):
|
||||||
for host_name, host_attrs in hosts_config.items():
|
for host_name, host_attrs in hosts_config.items():
|
||||||
if isinstance(host_attrs, dict) and host_attrs.get("watch", False):
|
if isinstance(host_attrs, dict) and host_attrs.get("watch", True):
|
||||||
watchhosts.append(host_name)
|
watchhosts.append(host_name)
|
||||||
return watchhosts
|
return watchhosts
|
||||||
|
|
||||||
|
|||||||
@@ -95,7 +95,7 @@ class Connection:
|
|||||||
if not Null:
|
if not Null:
|
||||||
d["addr"] = self.addr
|
d["addr"] = self.addr
|
||||||
if self.rtts[-1]:
|
if self.rtts[-1]:
|
||||||
d["rtt"] = "%0.1f" % self.rtts[-1]
|
d["rtt"] = "%d" % round(self.rtts[-1])
|
||||||
elif self.state == Connection.UNKNOWN:
|
elif self.state == Connection.UNKNOWN:
|
||||||
d["rtt"] = ""
|
d["rtt"] = ""
|
||||||
else:
|
else:
|
||||||
@@ -286,7 +286,7 @@ class Host:
|
|||||||
Host.hosts[name] = self
|
Host.hosts[name] = self
|
||||||
self.num = num
|
self.num = num
|
||||||
self.dyn = False
|
self.dyn = False
|
||||||
self.watched = False
|
self.watched = True
|
||||||
self.upcount = 0
|
self.upcount = 0
|
||||||
self.interval = 0
|
self.interval = 0
|
||||||
self.doesack = -1
|
self.doesack = -1
|
||||||
@@ -304,6 +304,7 @@ class Host:
|
|||||||
|
|
||||||
def statedict(self):
|
def statedict(self):
|
||||||
d = {}
|
d = {}
|
||||||
|
d["raw_name"] = self.name
|
||||||
d["name"] = self.name
|
d["name"] = self.name
|
||||||
if self.dyn:
|
if self.dyn:
|
||||||
d["name"] += "*"
|
d["name"] += "*"
|
||||||
|
|||||||
+4
-2
@@ -258,7 +258,9 @@ async def start(
|
|||||||
extra_scripts=extra_scripts,
|
extra_scripts=extra_scripts,
|
||||||
hbd_version=hbd_version,
|
hbd_version=hbd_version,
|
||||||
hosts=[
|
hosts=[
|
||||||
hbdclass.Host.hosts[h].stateinfo() for h in sorted(hbdclass.Host.hosts)
|
hbdclass.Host.hosts[h].stateinfo()
|
||||||
|
for h in sorted(hbdclass.Host.hosts)
|
||||||
|
if _can_operate_host(current_user, hbdclass.Host.hosts[h])
|
||||||
],
|
],
|
||||||
messages=data.msgs[-30:],
|
messages=data.msgs[-30:],
|
||||||
current_user=current_user.to_dict() if current_user else None,
|
current_user=current_user.to_dict() if current_user else None,
|
||||||
@@ -510,7 +512,7 @@ async def start(
|
|||||||
hosts_with_plugins = []
|
hosts_with_plugins = []
|
||||||
for hostname in sorted(hbdclass.Host.hosts.keys()):
|
for hostname in sorted(hbdclass.Host.hosts.keys()):
|
||||||
host = hbdclass.Host.hosts[hostname]
|
host = hbdclass.Host.hosts[hostname]
|
||||||
if not _can_view_host(current_user, host):
|
if not _can_operate_host(current_user, host):
|
||||||
continue
|
continue
|
||||||
if host.plugin_data:
|
if host.plugin_data:
|
||||||
hosts_with_plugins.append({
|
hosts_with_plugins.append({
|
||||||
|
|||||||
+54
-2
@@ -24,7 +24,7 @@ sensitive bool True when the raw value must never be shown
|
|||||||
# Credential field names that should always be masked.
|
# Credential field names that should always be masked.
|
||||||
_SECRET_KEYS = frozenset({
|
_SECRET_KEYS = frozenset({
|
||||||
"password", "token", "user_key", "api_key", "secret",
|
"password", "token", "user_key", "api_key", "secret",
|
||||||
"smtp_password", "smtp_user",
|
"smtp_password", "smtp_user", "api_password", "access_token",
|
||||||
})
|
})
|
||||||
|
|
||||||
_CHANNEL_TYPE_LABELS = {
|
_CHANNEL_TYPE_LABELS = {
|
||||||
@@ -181,6 +181,48 @@ def get_settings_sections(config: dict) -> list:
|
|||||||
"notification_channels": attrs.get("notification_channels", []),
|
"notification_channels": attrs.get("notification_channels", []),
|
||||||
})
|
})
|
||||||
|
|
||||||
|
# ---- Threshold configurations -----------------------------------------
|
||||||
|
def _parse_metric_row(metric_path, metric_cfg):
|
||||||
|
if not isinstance(metric_cfg, dict):
|
||||||
|
return None
|
||||||
|
return {
|
||||||
|
"metric": metric_path,
|
||||||
|
"operator": metric_cfg.get("operator", ">"),
|
||||||
|
"warning": metric_cfg.get("warning"),
|
||||||
|
"critical": metric_cfg.get("critical"),
|
||||||
|
"hysteresis": metric_cfg.get("hysteresis"),
|
||||||
|
"count": metric_cfg.get("count", 1),
|
||||||
|
"enabled": metric_cfg.get("enabled", True),
|
||||||
|
}
|
||||||
|
|
||||||
|
threshold_config_list = []
|
||||||
|
raw_tconfigs = config.get("threshold_configs") or {}
|
||||||
|
if raw_tconfigs:
|
||||||
|
for cfg_name, cfg_data in sorted(raw_tconfigs.items()):
|
||||||
|
if not isinstance(cfg_data, dict):
|
||||||
|
continue
|
||||||
|
metrics = [
|
||||||
|
r for r in (
|
||||||
|
_parse_metric_row(mp, mc)
|
||||||
|
for mp, mc in (cfg_data.get("thresholds") or {}).items()
|
||||||
|
) if r
|
||||||
|
]
|
||||||
|
threshold_config_list.append({
|
||||||
|
"name": cfg_name,
|
||||||
|
"metrics": sorted(metrics, key=lambda m: m["metric"]),
|
||||||
|
})
|
||||||
|
elif config.get("thresholds"):
|
||||||
|
metrics = [
|
||||||
|
r for r in (
|
||||||
|
_parse_metric_row(mp, mc)
|
||||||
|
for mp, mc in config["thresholds"].items()
|
||||||
|
) if r
|
||||||
|
]
|
||||||
|
threshold_config_list.append({
|
||||||
|
"name": "default",
|
||||||
|
"metrics": sorted(metrics, key=lambda m: m["metric"]),
|
||||||
|
})
|
||||||
|
|
||||||
# ---- Hosts summary ----------------------------------------------------
|
# ---- Hosts summary ----------------------------------------------------
|
||||||
hosts_list = []
|
hosts_list = []
|
||||||
for hname, hcfg in (config.get("hosts") or {}).items():
|
for hname, hcfg in (config.get("hosts") or {}).items():
|
||||||
@@ -188,7 +230,7 @@ def get_settings_sections(config: dict) -> list:
|
|||||||
continue
|
continue
|
||||||
hosts_list.append({
|
hosts_list.append({
|
||||||
"name": hname,
|
"name": hname,
|
||||||
"watch": bool(hcfg.get("watch", False)),
|
"watch": bool(hcfg.get("watch", True)),
|
||||||
"dyndns": bool(hcfg.get("dyndns", False)),
|
"dyndns": bool(hcfg.get("dyndns", False)),
|
||||||
"owner": hcfg.get("owner", ""),
|
"owner": hcfg.get("owner", ""),
|
||||||
"managers": hcfg.get("managers", []),
|
"managers": hcfg.get("managers", []),
|
||||||
@@ -312,6 +354,16 @@ def get_settings_sections(config: dict) -> list:
|
|||||||
"hosts": hosts_list,
|
"hosts": hosts_list,
|
||||||
"fields": [],
|
"fields": [],
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"id": "thresholds",
|
||||||
|
"title": "Threshold Configurations",
|
||||||
|
"description": "Named alert threshold sets. Each defines warning/critical levels per metric.",
|
||||||
|
"threshold_configs": threshold_config_list,
|
||||||
|
"fields": [
|
||||||
|
field("default_threshold_config", "Default config", "text",
|
||||||
|
"Threshold config used for hosts with no explicit mapping."),
|
||||||
|
],
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"id": "runtime",
|
"id": "runtime",
|
||||||
"title": "Runtime",
|
"title": "Runtime",
|
||||||
|
|||||||
@@ -236,6 +236,8 @@
|
|||||||
color: #ff9800;
|
color: #ff9800;
|
||||||
font-weight: 700;
|
font-weight: 700;
|
||||||
}
|
}
|
||||||
|
#ntable a.host-link { color: inherit; text-decoration: none; }
|
||||||
|
#ntable a.host-link:hover { text-decoration: underline; }
|
||||||
</style>
|
</style>
|
||||||
<script type="text/javascript">
|
<script type="text/javascript">
|
||||||
var cnt = 0;
|
var cnt = 0;
|
||||||
@@ -245,11 +247,13 @@
|
|||||||
var HBD_VERSION = "{{ hbd_version }}";
|
var HBD_VERSION = "{{ hbd_version }}";
|
||||||
|
|
||||||
function hostNameHtml(data) {
|
function hostNameHtml(data) {
|
||||||
|
var rawName = data.raw_name || data.name.replace(/<[^>]+>/g, '').replace('*', '').trim();
|
||||||
var nameHtml = data.name;
|
var nameHtml = data.name;
|
||||||
if (!data.hbc_version || data.hbc_version !== HBD_VERSION) {
|
if (!data.hbc_version || data.hbc_version !== HBD_VERSION) {
|
||||||
nameHtml += ' 🥀';
|
nameHtml += ' 🥀';
|
||||||
}
|
}
|
||||||
return data.dyn ? '<b>' + nameHtml + '</b>' : nameHtml;
|
var display = data.dyn ? '<b>' + nameHtml + '</b>' : nameHtml;
|
||||||
|
return '<a class="host-link" href="/plugins#' + encodeURIComponent(rawName) + '">' + display + '</a>';
|
||||||
}
|
}
|
||||||
|
|
||||||
function setup() {
|
function setup() {
|
||||||
@@ -404,7 +408,7 @@
|
|||||||
);
|
);
|
||||||
if (data.connections[i].state == "up") {
|
if (data.connections[i].state == "up") {
|
||||||
state = '<span class="state-up">up</span>';
|
state = '<span class="state-up">up</span>';
|
||||||
latency = Number.parseFloat(data.connections[i].rtts[0]).toFixed(2);
|
latency = String(Math.round(Number.parseFloat(data.connections[i].rtts[0])));
|
||||||
} else {
|
} else {
|
||||||
if (data.connections[i].state == "unknown") {
|
if (data.connections[i].state == "unknown") {
|
||||||
state = "";
|
state = "";
|
||||||
@@ -511,7 +515,7 @@
|
|||||||
<tbody id="ntablebody">
|
<tbody id="ntablebody">
|
||||||
{% for host in hosts %}
|
{% for host in hosts %}
|
||||||
<tr class="{% if host.alert_critical_unacked > 0 or host.alert_critical_acked > 0 %}row-critical{% elif host.alert_warning_unacked > 0 or host.alert_warning_acked > 0 %}row-warning{% endif %}">
|
<tr class="{% if host.alert_critical_unacked > 0 or host.alert_critical_acked > 0 %}row-critical{% elif host.alert_warning_unacked > 0 or host.alert_warning_acked > 0 %}row-warning{% endif %}">
|
||||||
<td data-name="{{ host.name }}">{{ host.name }}{% if not host.hbc_version or host.hbc_version != hbd_version %} 🥀{% endif %}</td>
|
<td data-name="{{ host.name }}"><a class="host-link" href="/plugins#{{ host.raw_name | urlencode }}">{{ host.name }}{% if not host.hbc_version or host.hbc_version != hbd_version %} 🥀{% endif %}</a></td>
|
||||||
<td style="text-align: center; color: #ff9800; font-weight: bold;">
|
<td style="text-align: center; color: #ff9800; font-weight: bold;">
|
||||||
{%- set warning_unacked = host.alert_warning_unacked -%}
|
{%- set warning_unacked = host.alert_warning_unacked -%}
|
||||||
{%- set warning_acked = host.alert_warning_acked -%}
|
{%- set warning_acked = host.alert_warning_acked -%}
|
||||||
|
|||||||
@@ -383,7 +383,7 @@
|
|||||||
</div>
|
</div>
|
||||||
|
|
||||||
<div class="host-body">
|
<div class="host-body">
|
||||||
{% set plugin_order = ['os_info','cpu_monitor','memory_monitor','disk_monitor','network_monitor','nagios_runner','filesystem_info'] %}
|
{% set plugin_order = ['os_info','cpu_monitor','memory_monitor','disk_monitor','network_monitor','zfs_monitor','nagios_runner','filesystem_info'] %}
|
||||||
{% for plugin in plugin_order if plugin in host.plugins %}
|
{% for plugin in plugin_order if plugin in host.plugins %}
|
||||||
<div class="plugin-accordion collapsed"
|
<div class="plugin-accordion collapsed"
|
||||||
data-hostname="{{ host.name }}"
|
data-hostname="{{ host.name }}"
|
||||||
@@ -673,6 +673,19 @@
|
|||||||
text = `${count} filesystem${count !== 1 ? 's' : ''}`;
|
text = `${count} filesystem${count !== 1 ? 's' : ''}`;
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
|
case 'zfs_monitor': {
|
||||||
|
const pools = d.pools || {};
|
||||||
|
const names = Object.keys(pools);
|
||||||
|
if (names.length === 0) { text = 'No pools'; break; }
|
||||||
|
const degraded = names.filter(n => pools[n].health && pools[n].health !== 'ONLINE');
|
||||||
|
text = names.map(n => {
|
||||||
|
const p = pools[n];
|
||||||
|
const cap = p.capacity != null ? ` ${p.capacity.toFixed(0)}%` : '';
|
||||||
|
return `${n}${cap}`;
|
||||||
|
}).join(' · ');
|
||||||
|
if (degraded.length) text += ` ⚠ ${degraded.map(n => pools[n].health).join(',')}`;
|
||||||
|
break;
|
||||||
|
}
|
||||||
default:
|
default:
|
||||||
text = 'Loaded';
|
text = 'Loaded';
|
||||||
}
|
}
|
||||||
@@ -694,6 +707,7 @@
|
|||||||
case 'memory_monitor': html = renderMemoryTable(cached.data); break;
|
case 'memory_monitor': html = renderMemoryTable(cached.data); break;
|
||||||
case 'disk_monitor': html = renderDiskTables(cached.data); break;
|
case 'disk_monitor': html = renderDiskTables(cached.data); break;
|
||||||
case 'network_monitor':html = renderNetworkTables(cached.data); break;
|
case 'network_monitor':html = renderNetworkTables(cached.data); break;
|
||||||
|
case 'zfs_monitor': html = renderZfsTables(cached.data); break;
|
||||||
case 'nagios_runner': html = renderNagiosTable(cached.data); break;
|
case 'nagios_runner': html = renderNagiosTable(cached.data); break;
|
||||||
case 'filesystem_info':html = renderFilesystemTable(cached.data); break;
|
case 'filesystem_info':html = renderFilesystemTable(cached.data); break;
|
||||||
default: html = renderGenericTable(cached.data); break;
|
default: html = renderGenericTable(cached.data); break;
|
||||||
@@ -1024,6 +1038,66 @@
|
|||||||
return html;
|
return html;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
function renderZfsTables(d) {
|
||||||
|
const pools = d.pools || {};
|
||||||
|
const names = Object.keys(pools);
|
||||||
|
if (names.length === 0) return '<div class="no-data">No ZFS pools found</div>';
|
||||||
|
|
||||||
|
const healthCls = h => {
|
||||||
|
if (!h || h === 'ONLINE') return 'pct-ok';
|
||||||
|
if (h === 'DEGRADED') return 'pct-warn';
|
||||||
|
return 'pct-crit';
|
||||||
|
};
|
||||||
|
|
||||||
|
let pt = '<table class="data-table"><thead><tr>'
|
||||||
|
+ '<th>Pool</th><th>Health</th>'
|
||||||
|
+ '<th class="num">Size</th><th class="num">Used</th>'
|
||||||
|
+ '<th class="num">Free</th><th class="num">Cap %</th>'
|
||||||
|
+ '<th class="num">Frag %</th><th class="num">Dedup</th>'
|
||||||
|
+ '</tr></thead><tbody>';
|
||||||
|
for (const name of names) {
|
||||||
|
const p = pools[name];
|
||||||
|
const cap = p.capacity != null ? p.capacity : 0;
|
||||||
|
const capCls = cap > 90 ? 'pct-crit' : cap > 75 ? 'pct-warn' : 'pct-ok';
|
||||||
|
pt += `<tr>
|
||||||
|
<td class="iface-name">${escHtml(name)}</td>
|
||||||
|
<td class="${healthCls(p.health)}">${escHtml(p.health || '—')}</td>
|
||||||
|
<td class="num">${formatBytes(p.size || 0)}</td>
|
||||||
|
<td class="num">${formatBytes(p.alloc || 0)}</td>
|
||||||
|
<td class="num">${formatBytes(p.free || 0)}</td>
|
||||||
|
<td class="num ${capCls}">${cap.toFixed(1)}%</td>
|
||||||
|
<td class="num">${p.frag != null ? p.frag.toFixed(1) + '%' : '—'}</td>
|
||||||
|
<td class="num">${p.dedup != null ? p.dedup.toFixed(2) + 'x' : '—'}</td>
|
||||||
|
</tr>`;
|
||||||
|
}
|
||||||
|
pt += '</tbody></table>';
|
||||||
|
|
||||||
|
const hasIo = names.some(n => pools[n].read_ops != null);
|
||||||
|
if (!hasIo) return pt;
|
||||||
|
|
||||||
|
let iot = '<table class="data-table"><thead><tr>'
|
||||||
|
+ '<th>Pool</th>'
|
||||||
|
+ '<th class="num">Read ops</th><th class="num">Write ops</th>'
|
||||||
|
+ '<th class="num">Read BW</th><th class="num">Write BW</th>'
|
||||||
|
+ '</tr></thead><tbody>';
|
||||||
|
for (const name of names) {
|
||||||
|
const p = pools[name];
|
||||||
|
iot += `<tr>
|
||||||
|
<td class="iface-name">${escHtml(name)}</td>
|
||||||
|
<td class="num">${p.read_ops != null ? p.read_ops.toLocaleString() : '—'}</td>
|
||||||
|
<td class="num">${p.write_ops != null ? p.write_ops.toLocaleString() : '—'}</td>
|
||||||
|
<td class="num">${p.read_bw != null ? formatBytes(p.read_bw) : '—'}</td>
|
||||||
|
<td class="num">${p.write_bw != null ? formatBytes(p.write_bw) : '—'}</td>
|
||||||
|
</tr>`;
|
||||||
|
}
|
||||||
|
iot += '</tbody></table>';
|
||||||
|
|
||||||
|
return `<div class="flex-tables">
|
||||||
|
<div><div class="table-section-label">Pools</div>${pt}</div>
|
||||||
|
<div><div class="table-section-label">I/O (cumulative)</div>${iot}</div>
|
||||||
|
</div>`;
|
||||||
|
}
|
||||||
|
|
||||||
function renderGenericTable(d) {
|
function renderGenericTable(d) {
|
||||||
let html = '<table class="data-table"><thead><tr><th>Field</th><th>Value</th></tr></thead><tbody>';
|
let html = '<table class="data-table"><thead><tr><th>Field</th><th>Value</th></tr></thead><tbody>';
|
||||||
for (const [k, v] of Object.entries(d)) {
|
for (const [k, v] of Object.entries(d)) {
|
||||||
@@ -1082,6 +1156,19 @@
|
|||||||
// ── Init ────────────────────────────────────────────────────────────────
|
// ── Init ────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
document.addEventListener('DOMContentLoaded', () => {
|
document.addEventListener('DOMContentLoaded', () => {
|
||||||
|
// If a host fragment is in the URL, expand and scroll to that host;
|
||||||
|
// otherwise expand the first host as before.
|
||||||
|
const hash = window.location.hash;
|
||||||
|
if (hash) {
|
||||||
|
const hostname = decodeURIComponent(hash.slice(1));
|
||||||
|
const card = document.querySelector(`.host-card[data-hostname="${hostname}"]`);
|
||||||
|
if (card) {
|
||||||
|
card.classList.remove('collapsed');
|
||||||
|
fetchHostGlance(hostname);
|
||||||
|
setTimeout(() => card.scrollIntoView({ behavior: 'smooth', block: 'start' }), 150);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
}
|
||||||
const first = document.querySelector('.host-card');
|
const first = document.querySelector('.host-card');
|
||||||
if (first) {
|
if (first) {
|
||||||
first.classList.remove('collapsed');
|
first.classList.remove('collapsed');
|
||||||
|
|||||||
@@ -254,6 +254,17 @@
|
|||||||
.host-bool { text-align: center; }
|
.host-bool { text-align: center; }
|
||||||
.dot-yes { color: #2e7d32; font-size: 1.1em; }
|
.dot-yes { color: #2e7d32; font-size: 1.1em; }
|
||||||
.dot-no { color: #ddd; font-size: 1.1em; }
|
.dot-no { color: #ddd; font-size: 1.1em; }
|
||||||
|
|
||||||
|
/* ---- Threshold configurations ---- */
|
||||||
|
.thresh-config { margin: 12px 20px 20px; }
|
||||||
|
.thresh-config-name {
|
||||||
|
font-weight: 600; font-size: 0.9em; color: #1a237e;
|
||||||
|
margin-bottom: 6px;
|
||||||
|
}
|
||||||
|
.mini-table .warn { color: #e65100; font-weight: 600; }
|
||||||
|
.mini-table .crit { color: #b71c1c; font-weight: 600; }
|
||||||
|
.mini-table .dim { color: #aaa; }
|
||||||
|
.mini-table .metric-path { font-family: monospace; font-size: 0.88em; }
|
||||||
</style>
|
</style>
|
||||||
|
|
||||||
<body>
|
<body>
|
||||||
@@ -394,6 +405,49 @@
|
|||||||
{% endif %}
|
{% endif %}
|
||||||
{% endif %}
|
{% endif %}
|
||||||
|
|
||||||
|
{# ---- Threshold configurations section ---- #}
|
||||||
|
{% if section.id == "thresholds" %}
|
||||||
|
{% if section.threshold_configs %}
|
||||||
|
{% for tc in section.threshold_configs %}
|
||||||
|
<div class="thresh-config">
|
||||||
|
<div class="thresh-config-name">{{ tc.name }}</div>
|
||||||
|
{% if tc.metrics %}
|
||||||
|
<div style="overflow-x: auto;">
|
||||||
|
<table class="mini-table">
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Metric</th>
|
||||||
|
<th>Op</th>
|
||||||
|
<th>Warning</th>
|
||||||
|
<th>Critical</th>
|
||||||
|
<th>Hysteresis</th>
|
||||||
|
<th>Count</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
{% for m in tc.metrics %}
|
||||||
|
<tr {% if not m.enabled %} style="opacity:0.45"{% endif %}>
|
||||||
|
<td class="metric-path">{{ m.metric }}</td>
|
||||||
|
<td>{{ m.operator or '>' }}</td>
|
||||||
|
<td class="warn">{{ m.warning if m.warning is not none else '—' }}</td>
|
||||||
|
<td class="crit">{{ m.critical if m.critical is not none else '—' }}</td>
|
||||||
|
<td class="dim">{{ '%.0f%%' % (m.hysteresis * 100) if m.hysteresis else '—' }}</td>
|
||||||
|
<td class="dim">{{ m.count }}</td>
|
||||||
|
</tr>
|
||||||
|
{% endfor %}
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
{% else %}
|
||||||
|
<span class="val-empty">No thresholds defined.</span>
|
||||||
|
{% endif %}
|
||||||
|
</div>
|
||||||
|
{% endfor %}
|
||||||
|
{% else %}
|
||||||
|
<div class="field-row"><span class="val-empty">No threshold configurations defined.</span></div>
|
||||||
|
{% endif %}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
{# ---- Hosts section ---- #}
|
{# ---- Hosts section ---- #}
|
||||||
{% if section.id == "hosts" %}
|
{% if section.id == "hosts" %}
|
||||||
{% if section.hosts %}
|
{% if section.hosts %}
|
||||||
|
|||||||
+121
-48
@@ -9,10 +9,11 @@ This module provides a flexible threshold checking system that:
|
|||||||
- Supports multiple comparison operators
|
- Supports multiple comparison operators
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
import logging
|
import logging
|
||||||
import time
|
import time
|
||||||
from enum import Enum
|
from enum import Enum
|
||||||
from typing import Dict, Any, Optional, Tuple, Callable
|
from typing import Dict, List, Any, Optional, Tuple, Callable
|
||||||
from . import notify as notify_mod
|
from . import notify as notify_mod
|
||||||
from .config import THRESHOLD_DEFAULTS
|
from .config import THRESHOLD_DEFAULTS
|
||||||
|
|
||||||
@@ -328,14 +329,17 @@ class ThresholdChecker:
|
|||||||
renotify_interval: Seconds between repeat notifications (default: 1 hour)
|
renotify_interval: Seconds between repeat notifications (default: 1 hour)
|
||||||
journal: Optional MessageJournal instance for logging threshold events
|
journal: Optional MessageJournal instance for logging threshold events
|
||||||
"""
|
"""
|
||||||
# Named threshold configurations: {config_name: {metric_path: ThresholdConfig}}
|
# Named threshold configurations (pre-merged: defaults + overrides): {config_name: {metric_path: ThresholdConfig}}
|
||||||
self.threshold_configs = {}
|
self.threshold_configs = {}
|
||||||
|
|
||||||
|
# Raw overrides only for each named config (no defaults baked in): {config_name: {metric_path: ThresholdConfig}}
|
||||||
|
self.threshold_raw_configs: Dict[str, Dict[str, ThresholdConfig]] = {}
|
||||||
|
|
||||||
# Single threshold set for backward compatibility: {metric_path: ThresholdConfig}
|
# Single threshold set for backward compatibility: {metric_path: ThresholdConfig}
|
||||||
self.thresholds = {}
|
self.thresholds = {}
|
||||||
|
|
||||||
# Host to config name mapping: {host_name: config_name}
|
# Host to ordered list of config names: {host_name: [config_name, ...]}
|
||||||
self.host_config_mapping = {}
|
self.host_config_mapping: Dict[str, List[str]] = {}
|
||||||
|
|
||||||
# Default config name to use when no mapping exists
|
# Default config name to use when no mapping exists
|
||||||
self.default_config = "default"
|
self.default_config = "default"
|
||||||
@@ -372,6 +376,7 @@ class ThresholdChecker:
|
|||||||
|
|
||||||
# Clear old configuration
|
# Clear old configuration
|
||||||
self.threshold_configs.clear()
|
self.threshold_configs.clear()
|
||||||
|
self.threshold_raw_configs.clear()
|
||||||
self.thresholds.clear()
|
self.thresholds.clear()
|
||||||
self.host_config_mapping.clear()
|
self.host_config_mapping.clear()
|
||||||
self.grace_seconds = float(config.get("grace", 2))
|
self.grace_seconds = float(config.get("grace", 2))
|
||||||
@@ -424,9 +429,10 @@ class ThresholdChecker:
|
|||||||
self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=effective_defaults)
|
self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=effective_defaults)
|
||||||
|
|
||||||
self.threshold_configs["default"] = dict(effective_defaults)
|
self.threshold_configs["default"] = dict(effective_defaults)
|
||||||
|
self.threshold_raw_configs["default"] = {}
|
||||||
logger.info("Registered 'default' threshold config with %d metrics", len(effective_defaults))
|
logger.info("Registered 'default' threshold config with %d metrics", len(effective_defaults))
|
||||||
|
|
||||||
# Parse each named configuration, seeding it with effective_defaults first
|
# Parse each named configuration
|
||||||
for config_name, config_data in threshold_configs.items():
|
for config_name, config_data in threshold_configs.items():
|
||||||
if config_name == "default":
|
if config_name == "default":
|
||||||
continue # already handled above
|
continue # already handled above
|
||||||
@@ -440,33 +446,41 @@ class ThresholdChecker:
|
|||||||
continue
|
continue
|
||||||
|
|
||||||
logger.info("Parsing threshold configuration: %s", config_name)
|
logger.info("Parsing threshold configuration: %s", config_name)
|
||||||
self.threshold_configs[config_name] = dict(effective_defaults)
|
|
||||||
|
|
||||||
|
# Raw overrides only (used for multi-config layering)
|
||||||
|
raw_overrides: Dict[str, ThresholdConfig] = {}
|
||||||
thresholds_config = config_data["thresholds"]
|
thresholds_config = config_data["thresholds"]
|
||||||
for plugin_name, plugin_thresholds in thresholds_config.items():
|
for plugin_name, plugin_thresholds in thresholds_config.items():
|
||||||
if not isinstance(plugin_thresholds, dict):
|
if isinstance(plugin_thresholds, dict):
|
||||||
continue
|
self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=raw_overrides)
|
||||||
|
self.threshold_raw_configs[config_name] = raw_overrides
|
||||||
|
|
||||||
self._parse_plugin_thresholds(
|
# Pre-merged version (defaults + overrides) for single-config fast path
|
||||||
plugin_name,
|
self.threshold_configs[config_name] = dict(effective_defaults)
|
||||||
plugin_thresholds,
|
self.threshold_configs[config_name].update(raw_overrides)
|
||||||
target_dict=self.threshold_configs[config_name]
|
|
||||||
)
|
|
||||||
|
|
||||||
# Parse host to config mapping from two possible sources
|
# Parse host → config list mapping from two possible sources
|
||||||
# 1. New format: hosts section with threshold_config attribute
|
|
||||||
|
def _normalise(value) -> List[str]:
|
||||||
|
"""Accept a string or list; always return a list."""
|
||||||
|
if isinstance(value, list):
|
||||||
|
return [str(v) for v in value]
|
||||||
|
return [str(value)]
|
||||||
|
|
||||||
|
# 1. hosts section with threshold_config attribute (string or list)
|
||||||
if "hosts" in config:
|
if "hosts" in config:
|
||||||
hosts_config = config["hosts"]
|
hosts_config = config["hosts"]
|
||||||
if isinstance(hosts_config, dict):
|
if isinstance(hosts_config, dict):
|
||||||
for host_name, host_attrs in hosts_config.items():
|
for host_name, host_attrs in hosts_config.items():
|
||||||
if isinstance(host_attrs, dict) and "threshold_config" in host_attrs:
|
if isinstance(host_attrs, dict) and "threshold_config" in host_attrs:
|
||||||
self.host_config_mapping[host_name] = host_attrs["threshold_config"]
|
self.host_config_mapping[host_name] = _normalise(host_attrs["threshold_config"])
|
||||||
|
|
||||||
# 2. Legacy format: host_threshold_mapping section (for backward compatibility)
|
# 2. Legacy host_threshold_mapping section (string values only)
|
||||||
if "host_threshold_mapping" in config:
|
if "host_threshold_mapping" in config:
|
||||||
legacy_mapping = config.get("host_threshold_mapping", {})
|
legacy_mapping = config.get("host_threshold_mapping", {})
|
||||||
if isinstance(legacy_mapping, dict):
|
if isinstance(legacy_mapping, dict):
|
||||||
self.host_config_mapping.update(legacy_mapping)
|
for host_name, value in legacy_mapping.items():
|
||||||
|
self.host_config_mapping[host_name] = _normalise(value)
|
||||||
|
|
||||||
# Set default config (first one alphabetically or explicitly set)
|
# Set default config (first one alphabetically or explicitly set)
|
||||||
self.default_config = config.get("default_threshold_config", "default")
|
self.default_config = config.get("default_threshold_config", "default")
|
||||||
@@ -664,7 +678,10 @@ class ThresholdChecker:
|
|||||||
)
|
)
|
||||||
|
|
||||||
def get_thresholds_for_host(self, host_name: str) -> Dict[str, ThresholdConfig]:
|
def get_thresholds_for_host(self, host_name: str) -> Dict[str, ThresholdConfig]:
|
||||||
"""Get the appropriate threshold configuration for a host.
|
"""Get the effective threshold configuration for a host.
|
||||||
|
|
||||||
|
When threshold_config is a list, configs are applied left-to-right on top
|
||||||
|
of the default thresholds so earlier entries can be overridden by later ones.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
host_name: Name of the host
|
host_name: Name of the host
|
||||||
@@ -676,23 +693,40 @@ class ThresholdChecker:
|
|||||||
if self.thresholds and not self.threshold_configs:
|
if self.thresholds and not self.threshold_configs:
|
||||||
return self.thresholds
|
return self.thresholds
|
||||||
|
|
||||||
# Multi-config mode: look up host-specific configuration
|
if not self.threshold_configs:
|
||||||
if self.threshold_configs:
|
return {}
|
||||||
config_name = self.host_config_mapping.get(host_name, self.default_config)
|
|
||||||
|
|
||||||
if config_name in self.threshold_configs:
|
config_names = self.host_config_mapping.get(host_name)
|
||||||
return self.threshold_configs[config_name]
|
|
||||||
else:
|
# No host-specific mapping → return pre-merged default
|
||||||
|
if not config_names:
|
||||||
|
return self.threshold_configs.get(self.default_config, {})
|
||||||
|
|
||||||
|
# Single config → fast path using pre-merged copy
|
||||||
|
if len(config_names) == 1:
|
||||||
|
name = config_names[0]
|
||||||
|
if name in self.threshold_configs:
|
||||||
|
return self.threshold_configs[name]
|
||||||
|
logger.warning(
|
||||||
|
"Threshold config '%s' not found for host '%s', using default '%s'",
|
||||||
|
name, host_name, self.default_config,
|
||||||
|
)
|
||||||
|
return self.threshold_configs.get(self.default_config, {})
|
||||||
|
|
||||||
|
# Multiple configs → start from defaults, layer raw overrides in order
|
||||||
|
result = dict(self.threshold_configs.get(self.default_config, {}))
|
||||||
|
for name in config_names:
|
||||||
|
if name == self.default_config:
|
||||||
|
continue # defaults already the base
|
||||||
|
raw = self.threshold_raw_configs.get(name)
|
||||||
|
if raw is None:
|
||||||
logger.warning(
|
logger.warning(
|
||||||
"Threshold config '%s' not found for host '%s', using default '%s'",
|
"Threshold config '%s' not found for host '%s', skipping",
|
||||||
config_name,
|
name, host_name,
|
||||||
host_name,
|
|
||||||
self.default_config
|
|
||||||
)
|
)
|
||||||
return self.threshold_configs.get(self.default_config, {})
|
else:
|
||||||
|
result.update(raw)
|
||||||
# No thresholds configured
|
return result
|
||||||
return {}
|
|
||||||
|
|
||||||
def check_value(
|
def check_value(
|
||||||
self,
|
self,
|
||||||
@@ -769,6 +803,29 @@ class ThresholdChecker:
|
|||||||
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)
|
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)
|
||||||
|
|
||||||
return None
|
return None
|
||||||
|
def _find_threshold(
|
||||||
|
self, thresholds: Dict[str, "ThresholdConfig"], metric_path: str
|
||||||
|
) -> Optional["ThresholdConfig"]:
|
||||||
|
"""Return the threshold for *metric_path*, falling back to suffix matches.
|
||||||
|
|
||||||
|
Allows generic thresholds like ``ping_monitor.rtt_avg`` to match
|
||||||
|
fully-qualified paths like ``ping_monitor.8_8_8_8_rtt_avg``.
|
||||||
|
The exact match is always tried first; then successive leading
|
||||||
|
underscore-delimited segments are stripped from the field name until
|
||||||
|
a match is found or no segments remain.
|
||||||
|
"""
|
||||||
|
if metric_path in thresholds:
|
||||||
|
return thresholds[metric_path]
|
||||||
|
plugin, sep, field = metric_path.partition(".")
|
||||||
|
if not sep:
|
||||||
|
return None
|
||||||
|
parts = field.split("_")
|
||||||
|
for i in range(1, len(parts)):
|
||||||
|
candidate = plugin + "." + "_".join(parts[i:])
|
||||||
|
if candidate in thresholds:
|
||||||
|
return thresholds[candidate]
|
||||||
|
return None
|
||||||
|
|
||||||
def check_plugin_data(
|
def check_plugin_data(
|
||||||
self,
|
self,
|
||||||
host_name: str,
|
host_name: str,
|
||||||
@@ -797,11 +854,10 @@ class ThresholdChecker:
|
|||||||
for metric_name, value in data.items():
|
for metric_name, value in data.items():
|
||||||
metric_path = f"{plugin_name}.{metric_name}"
|
metric_path = f"{plugin_name}.{metric_name}"
|
||||||
|
|
||||||
if metric_path not in thresholds:
|
threshold = self._find_threshold(thresholds, metric_path)
|
||||||
|
if threshold is None:
|
||||||
continue
|
continue
|
||||||
|
|
||||||
threshold = thresholds[metric_path]
|
|
||||||
|
|
||||||
# Get or create alert state
|
# Get or create alert state
|
||||||
if metric_path not in alert_states:
|
if metric_path not in alert_states:
|
||||||
alert_states[metric_path] = AlertState(metric_path)
|
alert_states[metric_path] = AlertState(metric_path)
|
||||||
@@ -987,6 +1043,11 @@ class ThresholdChecker:
|
|||||||
value: Any,
|
value: Any,
|
||||||
):
|
):
|
||||||
"""Send notification and log to journal/eventlog."""
|
"""Send notification and log to journal/eventlog."""
|
||||||
|
from . import hbdclass
|
||||||
|
host = hbdclass.Host.hosts.get(host_name)
|
||||||
|
if host is not None and not host.watched:
|
||||||
|
eventlog(host_name, lvl, message, service="threshold")
|
||||||
|
return
|
||||||
asyncio.get_event_loop().create_task(notify_mod.send_notification(
|
asyncio.get_event_loop().create_task(notify_mod.send_notification(
|
||||||
host_name,
|
host_name,
|
||||||
notify_mod.Notification(
|
notify_mod.Notification(
|
||||||
@@ -999,7 +1060,6 @@ class ThresholdChecker:
|
|||||||
# Log to journal
|
# Log to journal
|
||||||
if self.journal is not None:
|
if self.journal is not None:
|
||||||
try:
|
try:
|
||||||
import asyncio
|
|
||||||
loop = asyncio.get_event_loop()
|
loop = asyncio.get_event_loop()
|
||||||
loop.create_task(self.journal.log_threshold_event(
|
loop.create_task(self.journal.log_threshold_event(
|
||||||
host_name=host_name,
|
host_name=host_name,
|
||||||
@@ -1076,7 +1136,9 @@ class ThresholdChecker:
|
|||||||
) -> None:
|
) -> None:
|
||||||
"""Handle a state-change transition with grace-period logic.
|
"""Handle a state-change transition with grace-period logic.
|
||||||
|
|
||||||
Transitioning INTO alert: defers the notification for grace_seconds.
|
Transitioning INTO alert (worsening): defers the notification for grace_seconds.
|
||||||
|
De-escalation within alert states (e.g. CRITICAL→WARNING): no new notification;
|
||||||
|
the metric is still alerting so no RECOVER was sent.
|
||||||
Transitioning TO OK:
|
Transitioning TO OK:
|
||||||
- Still in grace window (pending_since set): suppresses both the alert
|
- Still in grace window (pending_since set): suppresses both the alert
|
||||||
and the recovery — the spike never warranted a page.
|
and the recovery — the spike never warranted a page.
|
||||||
@@ -1096,12 +1158,20 @@ class ThresholdChecker:
|
|||||||
alert_state.pending_since = None
|
alert_state.pending_since = None
|
||||||
else:
|
else:
|
||||||
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
|
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
|
||||||
else:
|
elif new_level.value > old_level.value:
|
||||||
|
# Worsening (OK→WARNING, OK→CRITICAL, WARNING→CRITICAL): schedule notification.
|
||||||
alert_state.pending_since = time.time()
|
alert_state.pending_since = time.time()
|
||||||
logger.debug(
|
logger.debug(
|
||||||
"Alert deferred (%.0fs grace): %s on %s = %s",
|
"Alert deferred (%.0fs grace): %s on %s = %s",
|
||||||
self.grace_seconds, metric_path, host_name, value,
|
self.grace_seconds, metric_path, host_name, value,
|
||||||
)
|
)
|
||||||
|
else:
|
||||||
|
# De-escalation within alert states (e.g. CRITICAL→WARNING): metric is still
|
||||||
|
# alerting but did not recover, so no new notification.
|
||||||
|
logger.debug(
|
||||||
|
"De-escalation %s→%s for %s on %s, no notification",
|
||||||
|
old_level.name, new_level.name, metric_path, host_name,
|
||||||
|
)
|
||||||
|
|
||||||
def _check_pending_or_renotify(
|
def _check_pending_or_renotify(
|
||||||
self,
|
self,
|
||||||
@@ -1191,17 +1261,20 @@ class ThresholdChecker:
|
|||||||
else:
|
else:
|
||||||
message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
|
message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
|
||||||
|
|
||||||
asyncio.get_event_loop().create_task(notify_mod.send_notification(
|
from . import hbdclass
|
||||||
host_name,
|
host = hbdclass.Host.hosts.get(host_name)
|
||||||
notify_mod.Notification(
|
if host is None or host.watched:
|
||||||
title=f"[REMINDER/{alert_state.level.name}] {host_name}",
|
asyncio.get_event_loop().create_task(notify_mod.send_notification(
|
||||||
body=message,
|
host_name,
|
||||||
level=alert_state.level.name,
|
notify_mod.Notification(
|
||||||
),
|
title=f"[REMINDER/{alert_state.level.name}] {host_name}",
|
||||||
))
|
body=message,
|
||||||
|
level=alert_state.level.name,
|
||||||
|
),
|
||||||
|
))
|
||||||
|
logger.info("Re-notification sent: %s", message)
|
||||||
alert_state.last_notification = now
|
alert_state.last_notification = now
|
||||||
alert_state.notification_count += 1
|
alert_state.notification_count += 1
|
||||||
logger.info("Re-notification sent: %s", message)
|
|
||||||
|
|
||||||
def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
|
def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
|
||||||
"""
|
"""
|
||||||
|
|||||||
+30
-21
@@ -211,10 +211,11 @@ def _make_timer_callbacks(uname, host, ctx):
|
|||||||
connection.newstate(connection.__class__.OVERDUE, now, cfg.get("grace", 2))
|
connection.newstate(connection.__class__.OVERDUE, now, cfg.get("grace", 2))
|
||||||
msg = f"{connection.afam} overdue"
|
msg = f"{connection.afam} overdue"
|
||||||
eventlog(uname, "CRITICAL", msg)
|
eventlog(uname, "CRITICAL", msg)
|
||||||
asyncio.create_task(notify_mod.send_notification(
|
if host.watched:
|
||||||
uname,
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
|
uname,
|
||||||
))
|
notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
|
||||||
|
))
|
||||||
# Track in alert_states so the Alerts Dashboard shows this
|
# Track in alert_states so the Alerts Dashboard shows this
|
||||||
_set_connectivity_alert(host, connection.afam, "CRITICAL")
|
_set_connectivity_alert(host, connection.afam, "CRITICAL")
|
||||||
if threshold_checker:
|
if threshold_checker:
|
||||||
@@ -407,10 +408,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
|
|
||||||
if res:
|
if res:
|
||||||
eventlog(uname, "WARNING", res)
|
eventlog(uname, "WARNING", res)
|
||||||
asyncio.create_task(notify_mod.send_notification(
|
if host.watched:
|
||||||
uname,
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
notify_mod.Notification(title=f"[WARNING] {uname}", body=res, level="WARNING"),
|
uname,
|
||||||
))
|
notify_mod.Notification(title=f"[WARNING] {uname}", body=res, level="WARNING"),
|
||||||
|
))
|
||||||
|
|
||||||
interval = int(msg.get("interval", 0) or 0)
|
interval = int(msg.get("interval", 0) or 0)
|
||||||
shutdown = msg.get("shutdown", 0)
|
shutdown = msg.get("shutdown", 0)
|
||||||
@@ -420,10 +422,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
|
|
||||||
if boot:
|
if boot:
|
||||||
eventlog(uname, "INFO", "booted")
|
eventlog(uname, "INFO", "booted")
|
||||||
asyncio.create_task(notify_mod.send_notification(
|
if host.watched:
|
||||||
uname,
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
notify_mod.Notification(title=f"[INFO] {uname}", body=f"{host.name} booted", level="INFO"),
|
uname,
|
||||||
))
|
notify_mod.Notification(title=f"[INFO] {uname}", body=f"{host.name} booted", level="INFO"),
|
||||||
|
))
|
||||||
if message:
|
if message:
|
||||||
eventlog(uname, "INFO", "msg: %s" % message, service=service)
|
eventlog(uname, "INFO", "msg: %s" % message, service=service)
|
||||||
|
|
||||||
@@ -437,13 +440,18 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
if not newh:
|
if not newh:
|
||||||
if d == 0 or lasts == "unknown":
|
if d == 0 or lasts == "unknown":
|
||||||
m = "%s is up" % (conn.afam)
|
m = "%s is up" % (conn.afam)
|
||||||
|
elif d < 4:
|
||||||
|
# Transient blip (likely client restart) — skip log and notification
|
||||||
|
m = None
|
||||||
else:
|
else:
|
||||||
m = "%s back after being %s for %s" % (conn.afam, lasts, dur(d))
|
m = "%s back after being %s for %s" % (conn.afam, lasts, dur(d))
|
||||||
eventlog(uname, "RECOVER", m)
|
if m:
|
||||||
asyncio.create_task(notify_mod.send_notification(
|
eventlog(uname, "RECOVER", m)
|
||||||
uname,
|
if host.watched:
|
||||||
notify_mod.Notification(title=f"[RECOVER] {uname}", body=m, level="RECOVER"),
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
))
|
uname,
|
||||||
|
notify_mod.Notification(title=f"[RECOVER] {uname}", body=m, level="RECOVER"),
|
||||||
|
))
|
||||||
|
|
||||||
if boot or newh:
|
if boot or newh:
|
||||||
host.upcount = host.doesack
|
host.upcount = host.doesack
|
||||||
@@ -453,10 +461,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
if shutdown:
|
if shutdown:
|
||||||
m = "%s shutdown" % conn.afam
|
m = "%s shutdown" % conn.afam
|
||||||
eventlog(uname, "INFO", m)
|
eventlog(uname, "INFO", m)
|
||||||
asyncio.create_task(notify_mod.send_notification(
|
if host.watched:
|
||||||
uname,
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
|
uname,
|
||||||
))
|
notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
|
||||||
|
))
|
||||||
conn.newstate(hbdcls.Connection.DOWN, now)
|
conn.newstate(hbdcls.Connection.DOWN, now)
|
||||||
_set_connectivity_alert(host, conn.afam, "CRITICAL")
|
_set_connectivity_alert(host, conn.afam, "CRITICAL")
|
||||||
|
|
||||||
|
|||||||
+53
-10
@@ -13,7 +13,8 @@ from . import data
|
|||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
_connections: set = set()
|
# Map of WebSocket → User object (or None when auth is disabled)
|
||||||
|
_connections: dict = {}
|
||||||
_loop: Optional[asyncio.AbstractEventLoop] = None
|
_loop: Optional[asyncio.AbstractEventLoop] = None
|
||||||
_get_hosts: Optional[Callable[[], Iterable]] = None
|
_get_hosts: Optional[Callable[[], Iterable]] = None
|
||||||
_verbose: bool = False
|
_verbose: bool = False
|
||||||
@@ -34,23 +35,53 @@ def setup(
|
|||||||
_verbose = verbose
|
_verbose = verbose
|
||||||
|
|
||||||
|
|
||||||
|
def _user_can_see_host(user, host_name: str) -> bool:
|
||||||
|
"""Return True if *user* may see updates for *host_name* (manager or higher)."""
|
||||||
|
from . import hbdclass, users as users_mod
|
||||||
|
if user is None or not users_mod.users_enabled():
|
||||||
|
return True
|
||||||
|
if user.admin:
|
||||||
|
return True
|
||||||
|
host = hbdclass.Host.hosts.get(host_name)
|
||||||
|
if host is None:
|
||||||
|
return False
|
||||||
|
return host.is_manager(user.username)
|
||||||
|
|
||||||
|
|
||||||
|
def _get_token(request) -> str:
|
||||||
|
"""Extract session token from request (mirrors logic in http.py)."""
|
||||||
|
auth = request.headers.get("Authorization", "")
|
||||||
|
if auth.startswith("Bearer "):
|
||||||
|
return auth[7:].strip()
|
||||||
|
token = request.headers.get("X-Auth-Token", "")
|
||||||
|
if token:
|
||||||
|
return token
|
||||||
|
return request.cookies.get("hbd_session", "")
|
||||||
|
|
||||||
|
|
||||||
async def handler(request):
|
async def handler(request):
|
||||||
"""aiohttp WebSocket upgrade handler — register as GET /ws."""
|
"""aiohttp WebSocket upgrade handler — register as GET /ws."""
|
||||||
from aiohttp import web
|
from aiohttp import web
|
||||||
|
from . import users as users_mod
|
||||||
|
|
||||||
ws = web.WebSocketResponse()
|
ws = web.WebSocketResponse()
|
||||||
await ws.prepare(request)
|
await ws.prepare(request)
|
||||||
|
|
||||||
_connections.add(ws)
|
token = _get_token(request)
|
||||||
|
user = users_mod.get_session_user(token) if token else None
|
||||||
|
|
||||||
|
_connections[ws] = user
|
||||||
remote = request.remote
|
remote = request.remote
|
||||||
logger.info("WebSocket connected from %s", remote)
|
logger.info("WebSocket connected from %s", remote)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
# Send current host state to the new client
|
# Send current host state, filtered to hosts this user may see
|
||||||
if _get_hosts:
|
if _get_hosts:
|
||||||
try:
|
try:
|
||||||
for h in list(_get_hosts()):
|
for h in list(_get_hosts()):
|
||||||
await ws.send_str(json.dumps({"type": "host", "data": h}))
|
host_name = h.get("raw_name") or h.get("name", "")
|
||||||
|
if _user_can_see_host(user, host_name):
|
||||||
|
await ws.send_str(json.dumps({"type": "host", "data": h}))
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error("Error sending initial hosts: %s", e)
|
logger.error("Error sending initial hosts: %s", e)
|
||||||
|
|
||||||
@@ -74,7 +105,7 @@ async def handler(request):
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.exception("WebSocket handler error from %s: %s", remote, e)
|
logger.exception("WebSocket handler error from %s: %s", remote, e)
|
||||||
finally:
|
finally:
|
||||||
_connections.discard(ws)
|
_connections.pop(ws, None)
|
||||||
logger.info("WebSocket disconnected from %s", remote)
|
logger.info("WebSocket disconnected from %s", remote)
|
||||||
|
|
||||||
return ws
|
return ws
|
||||||
@@ -83,25 +114,37 @@ async def handler(request):
|
|||||||
def broadcast(typ: str, payload) -> bool:
|
def broadcast(typ: str, payload) -> bool:
|
||||||
"""Thread-safe broadcast to all connected WebSocket clients.
|
"""Thread-safe broadcast to all connected WebSocket clients.
|
||||||
|
|
||||||
|
For host and plugin updates, only sends to clients whose user has
|
||||||
|
manager-or-higher access to that host. Other message types are
|
||||||
|
broadcast to all clients.
|
||||||
|
|
||||||
Can be called from any thread; schedules sends on the event loop.
|
Can be called from any thread; schedules sends on the event loop.
|
||||||
Returns False if the loop is not running yet.
|
Returns False if the loop is not running yet.
|
||||||
"""
|
"""
|
||||||
if not _loop:
|
if not _loop:
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
# Determine the host name for access-filtered message types
|
||||||
|
host_name: Optional[str] = None
|
||||||
|
if typ in ("host", "plugin"):
|
||||||
|
host_name = payload.get("raw_name") or payload.get("host") or payload.get("name")
|
||||||
|
|
||||||
jmsg = json.dumps({"type": typ, "data": payload})
|
jmsg = json.dumps({"type": typ, "data": payload})
|
||||||
|
|
||||||
async def _send_all():
|
async def _send_all():
|
||||||
dead = set()
|
dead = set()
|
||||||
for ws in list(_connections):
|
for ws, user in list(_connections.items()):
|
||||||
try:
|
try:
|
||||||
if not ws.closed:
|
if ws.closed:
|
||||||
await ws.send_str(jmsg)
|
|
||||||
else:
|
|
||||||
dead.add(ws)
|
dead.add(ws)
|
||||||
|
continue
|
||||||
|
if host_name is not None and not _user_can_see_host(user, host_name):
|
||||||
|
continue
|
||||||
|
await ws.send_str(jmsg)
|
||||||
except Exception:
|
except Exception:
|
||||||
dead.add(ws)
|
dead.add(ws)
|
||||||
for ws in dead:
|
for ws in dead:
|
||||||
_connections.discard(ws)
|
_connections.pop(ws, None)
|
||||||
|
|
||||||
asyncio.run_coroutine_threadsafe(_send_all(), _loop)
|
asyncio.run_coroutine_threadsafe(_send_all(), _loop)
|
||||||
return True
|
return True
|
||||||
|
|||||||
+1
-1
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|||||||
|
|
||||||
[project]
|
[project]
|
||||||
name = "hbd"
|
name = "hbd"
|
||||||
version = "5.1.12"
|
version = "5.1.16"
|
||||||
description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
|
description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
|
||||||
readme = "README.md"
|
readme = "README.md"
|
||||||
requires-python = ">=3.11"
|
requires-python = ">=3.11"
|
||||||
|
|||||||
+8
-1
@@ -41,7 +41,7 @@ from pathlib import Path
|
|||||||
from typing import Any, Dict, List, Optional, Tuple
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
|
||||||
# updated by scripts/bumpminor.sh
|
# updated by scripts/bumpminor.sh
|
||||||
__version__ = "5.1.12"
|
__version__ = "5.1.16"
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Protocol (mirrors hbd/common/proto.py)
|
# Protocol (mirrors hbd/common/proto.py)
|
||||||
@@ -1066,6 +1066,13 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
|
|||||||
for sig in (signal.SIGTERM, signal.SIGINT):
|
for sig in (signal.SIGTERM, signal.SIGINT):
|
||||||
loop.add_signal_handler(sig, _stop)
|
loop.add_signal_handler(sig, _stop)
|
||||||
|
|
||||||
|
def _sighup():
|
||||||
|
global _dorestart
|
||||||
|
_dorestart = True
|
||||||
|
_stop()
|
||||||
|
|
||||||
|
loop.add_signal_handler(signal.SIGHUP, _sighup)
|
||||||
|
|
||||||
for conn in connections:
|
for conn in connections:
|
||||||
_active_tasks.append(asyncio.create_task(_heartbeat_sender(conn, interval)))
|
_active_tasks.append(asyncio.create_task(_heartbeat_sender(conn, interval)))
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user