Compare commits
67 Commits
| SHA1 | Author | Date | |
|---|---|---|---|
| f50acca509 | |||
| 72fc82b91f | |||
| 46f8c32c0b | |||
| 691f62aa69 | |||
| cffc9805f9 | |||
| 917d6a401b | |||
| 2bd3a9beb6 | |||
| 5523c60866 | |||
| ab37ac7194 | |||
| f811a19d80 | |||
| 6239825f43 | |||
| b56245bb23 | |||
| 331c4e804d | |||
| 9fd945a481 | |||
| 26df08eeff | |||
| 5819dd6b25 | |||
| 6fb67f8615 | |||
| e70ae6f176 | |||
| a77f6d380c | |||
| 6aae2a1dab | |||
| 85ee0e1040 | |||
| c4f09e9ced | |||
| 64710fd4cd | |||
| 1f5e7465a3 | |||
| b290b21e23 | |||
| 65c4267847 | |||
| 462a445235 | |||
| 368e178f93 | |||
| 6905bf266a | |||
| b6dcce4f35 | |||
| e6436fc236 | |||
| c5ce41762e | |||
| 26ca0c095f | |||
| 1eecd67594 | |||
| caf3c2c0ac | |||
| 9af4006097 | |||
| ddf7067d13 | |||
| 505353a8a8 | |||
| 0402d33c71 | |||
| 7d8ca5d8db | |||
| 56037a036d | |||
| 65ceb31d8d | |||
| 1c9b6c1ca9 | |||
| d7e6b478e1 | |||
| 535dbda47d | |||
| c9567dddae | |||
| b5963badd6 | |||
| a76a39b4a0 | |||
| 94e1597978 | |||
| c9c2ed772f | |||
| aeb78dcb8e | |||
| 77b337e4dd | |||
| 293461f3f6 | |||
| c70a4807dc | |||
| 1a470e7cfa | |||
| 990c658e65 | |||
| b78d6ac0fe | |||
| afd5060f59 | |||
| f61f7aebc2 | |||
| 5c382d2b8d | |||
| 35bba451f5 | |||
| 80edfba0c0 | |||
| 6bc8de192e | |||
| 2d8166d04a | |||
| ab33d81b30 | |||
| 2c0328f36d | |||
| fb8e27825d |
@@ -24,11 +24,11 @@ jobs:

       - name: Install build tools
         run: |
-          python -m pip install --upgrade pip
-          pip install build twine
+          python3 -m pip install --upgrade pip
+          python3 -m pip install build twine

       - name: Build package
-        run: python -m build
+        run: python3 -m build

       - name: Extract version from tag
         id: get_version
@@ -39,7 +39,7 @@ jobs:
           TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
           TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
         run: |
-          python -m twine upload --repository-url https://git.wrede.ca/api/packages/andreas/pypi dist/*
+          python3 -m twine upload --repository-url https://git.wrede.ca/api/packages/andreas/pypi dist/*

       - name: Create release
         uses: actions/gitea-release-action@v1
@@ -0,0 +1,4 @@
+1. Don't assume. Don't hide confusion. Surface tradeoffs.
+2. Minimum code that solves the problem. Nothing speculative.
+3. Touch only what you must. Clean up only your own mess.
+4. Define success criteria. Loop until verified.
@@ -267,6 +267,41 @@ All plugin metrics can be thresholded:
 - **Network**: errors_total, dropped packets, connection counts
 - **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)

+### Per-Host Threshold Profiles
+
+Named threshold configurations let different hosts use different limits. A host's `threshold_config` can be a single name or a **list** — lists are applied left-to-right so profiles compose without duplication:
+
+```yaml
+threshold_configs:
+  default:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}
+
+  tight_cpu:  # override CPU limits only
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  db_disk:  # add a database partition check
+    thresholds:
+      disk_monitor:
+        partitions:
+          /var/lib/postgresql:
+            percent: {warning: 75, critical: 88}
+
+hosts:
+  web-01:
+    threshold_config: default  # single profile
+
+  db-01:
+    threshold_config: [tight_cpu, db_disk]  # layered: CPU override + extra disk check
+```
+
+Each named config's overrides are applied in order on top of the defaults. Metrics not mentioned in a profile are inherited unchanged.
+
 See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.

 ---
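The layering rule stated in the Per-Host Threshold Profiles section of the hunk above can be pictured as a recursive dictionary merge. The following is a minimal sketch under stated assumptions, not the project's implementation: `deep_merge` and `effective_thresholds` are hypothetical names, a string `threshold_config` is treated like a one-element list, and the merge goes down to the individual `warning`/`critical` keys.

```python
from copy import deepcopy

# Profiles copied from the README example in this hunk.
THRESHOLD_CONFIGS = {
    "default": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
        "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
    }},
    "tight_cpu": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}},
    }},
    "db_disk": {"thresholds": {
        "disk_monitor": {"partitions": {
            "/var/lib/postgresql": {"percent": {"warning": 75, "critical": 88}},
        }},
    }},
}


def deep_merge(base, override):
    """Return a copy of base with override layered on top, key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def effective_thresholds(configs, threshold_config, default_name="default"):
    """Layer the named profiles left to right on top of the default profile."""
    if isinstance(threshold_config, str):
        names = [threshold_config]  # assumption: a string behaves like a 1-item list
    else:
        names = list(threshold_config)
    result = deepcopy(configs.get(default_name, {}).get("thresholds", {}))
    for name in names:
        result = deep_merge(result, configs[name]["thresholds"])
    return result


db01 = effective_thresholds(THRESHOLD_CONFIGS, ["tight_cpu", "db_disk"])
print(db01["cpu_monitor"]["cpu_percent"])        # overridden by tight_cpu
print(db01["memory_monitor"]["memory_percent"])  # inherited from default
print(db01["disk_monitor"]["partitions"])        # added by db_disk
```

Because each layer is merged into a copy, the named profiles themselves are never modified, so the same profile can be reused by many hosts.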
@@ -377,7 +412,7 @@ This project now declares its dependencies in `pyproject.toml`. Instead
 of the old `requirements.txt` flow, install the package into a virtualenv
 using `pip`:

-See `scripts/install.sh` for a way to install.
+See `scripts/hb_install.sh` for a way to install.

 Run the daemon (example):

@@ -441,6 +476,68 @@ plugins:

 All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.

+### hbc_mini — single-file client (no external dependencies)
+
+`scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly — no virtualenv, no `pip install`.
+
+```bash
+# Basic usage
+python3 hbc_mini.py your-server.example.com
+
+# Run as daemon
+python3 hbc_mini.py -d your-server.example.com
+
+# Send a boot message
+python3 hbc_mini.py -b your-server.example.com
+
+# Send a one-off message
+python3 hbc_mini.py -m "maintenance starting" your-server.example.com
+```
+
+**Config:** `~/.hbc.json` (same keys as `~/.hbc.yaml`, JSON format). Example:
+
+```json
+{
+  "hb_port": 50003,
+  "interval": 30,
+  "plugins": {
+    "ping_monitor": {
+      "interval": 60,
+      "hosts": ["8.8.8.8", "192.168.1.1"]
+    },
+    "nagios_runner": {
+      "interval": 300,
+      "commands": [
+        {"name": "check_load", "command": "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6"}
+      ]
+    }
+  }
+}
+```
+
+**Plugin availability:**
+
+| Plugin | Platform | Data source |
+|---|---|---|
+| `os_info` | all | `platform` stdlib |
+| `ping_monitor` | all | `ping` subprocess |
+| `nagios_runner` | all (not Windows) | subprocess |
+| `cpu_monitor` | Linux | `/proc/stat` |
+| `memory_monitor` | Linux | `/proc/meminfo` |
+| `disk_monitor` | Linux, macOS, BSD | `df -P` subprocess |
+| `network_monitor` | Linux | `/proc/net/dev` |
+
+**What is not available compared to the full `hbc`:**
+
+- No YAML config (use JSON instead)
+- No `filesystem_info` plugin
+- `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
+- Plugins cannot be loaded from external `.py` files — all plugins are compiled in
+
+Everything else — heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog — is identical to the full client.
+
+---
+
 ## 🐞 Debugging in VS Code

 This repository includes a ready-to-use `.vscode/launch.json` with configurations to run or attach the VS Code debugger to `hbd`.
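The plugin table in the hbc_mini hunk lists `/proc/meminfo` as the data source for `memory_monitor` on Linux. To illustrate how a collector can work with the standard library alone, here is a minimal sketch. `parse_meminfo` and `memory_percent` are hypothetical helpers, and computing usage as `(MemTotal - MemAvailable) / MemTotal` is an assumption, not necessarily what `hbc_mini.py` does.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integers (most values are in kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_percent(meminfo):
    """Used memory as a percentage of total, based on MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return round(100.0 * (total - available) / total, 1)


# Sample input so the sketch runs on any platform.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
HugePages_Total:       0
"""

info = parse_meminfo(SAMPLE)
print(memory_percent(info))  # 75.0
```

On a Linux host the same functions could be fed the real file contents, for example `parse_meminfo(open("/proc/meminfo").read())`. A kernel that does not report `MemAvailable` would need a different formula.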
-234
@@ -1,234 +0,0 @@
|
|||||||
# HBD/HBC Separation Refactoring
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
The heartbeat monitoring system has been refactored into a modular package structure with separate client and server components. This allows users to install only what they need and provides clear separation of concerns.
|
|
||||||
|
|
||||||
## New Package Structure
|
|
||||||
|
|
||||||
```
|
|
||||||
hbd/
|
|
||||||
├── __init__.py # Main package (minimal)
|
|
||||||
├── client/ # HBC - System monitoring client
|
|
||||||
│ ├── __init__.py
|
|
||||||
│ ├── main.py # Entry point (was hbc.py)
|
|
||||||
│ ├── config.py # Client-specific configuration
|
|
||||||
│ ├── plugin.py # Plugin framework
|
|
||||||
│ ├── threshold.py # Threshold checking
|
|
||||||
│ └── plugins/ # Monitoring plugins
|
|
||||||
│ ├── cpu_monitor.py
|
|
||||||
│ ├── disk_monitor.py
|
|
||||||
│ ├── memory_monitor.py
|
|
||||||
│ ├── network_monitor.py
|
|
||||||
│ ├── filesystem_info.py
|
|
||||||
│ ├── os_info.py
|
|
||||||
│ └── nagios_runner.py
|
|
||||||
├── server/ # HBD - Heartbeat daemon/server
|
|
||||||
│ ├── __init__.py
|
|
||||||
│ ├── main.py # Server runtime (was server.py)
|
|
||||||
│ ├── cli.py # Command-line interface
|
|
||||||
│ ├── config.py # Server-specific configuration
|
|
||||||
│ ├── http.py # HTTP/REST API
|
|
||||||
│ ├── ws.py # WebSocket server
|
|
||||||
│ ├── udp.py # UDP heartbeat listener
|
|
||||||
│ ├── dns.py # DNS update functionality
|
|
||||||
│ ├── notify.py # Notification handlers
|
|
||||||
│ ├── monitor.py # Host monitoring
|
|
||||||
│ ├── hbdclass.py # Host class definitions
|
|
||||||
│ ├── journal.py # Message journaling
|
|
||||||
│ ├── templates/ # Jinja2 web templates
|
|
||||||
│ └── static/ # Web UI assets
|
|
||||||
└── common/ # Shared utilities
|
|
||||||
├── __init__.py
|
|
||||||
├── proto.py # Protocol encoding/decoding
|
|
||||||
└── utils.py # Common utilities
|
|
||||||
|
|
||||||
## Configuration Files
|
|
||||||
|
|
||||||
### Client Configuration (hbd/client/config.py)
|
|
||||||
|
|
||||||
Client-specific defaults:
|
|
||||||
- `hb_port`: Port where hbd servers listen (default: 50003)
|
|
||||||
- `interval`: Heartbeat interval in seconds (default: 10)
|
|
||||||
- `plugins`: Per-plugin configuration
|
|
||||||
- `thresholds`: Threshold configuration for monitoring
|
|
||||||
|
|
||||||
### Server Configuration (hbd/server/config.py)
|
|
||||||
|
|
||||||
Server-specific defaults:
|
|
||||||
- `hb_port`: Port to listen for heartbeats (default: 50003)
|
|
||||||
- `hbd_port`: HTTP API port (default: 50004)
|
|
||||||
- `ws_port`: WebSocket port (default: 50005)
|
|
||||||
- `logfile`: Log file path
|
|
||||||
- `pushsrv`, `pushover_token`, etc.: Notification settings
|
|
||||||
- `watchhosts`, `dyndnshosts`: Host monitoring
|
|
||||||
- `smtpserver`, etc.: Email settings
|
|
||||||
- `journal_*`: Message journaling settings
|
|
||||||
|
|
||||||
## Installation Options
|
|
||||||
|
|
||||||
### Install Core Only (minimal, PyYAML only)
|
|
||||||
```bash
|
|
||||||
pip install hbd
|
|
||||||
```
|
|
||||||
|
|
||||||
### Install Client Only (for monitoring)
|
|
||||||
```bash
|
|
||||||
pip install hbd[client]
|
|
||||||
# Installs: PyYAML, psutil
|
|
||||||
```
|
|
||||||
|
|
||||||
### Install Server Only (for daemon)
|
|
||||||
```bash
|
|
||||||
pip install hbd[server]
|
|
||||||
# Installs: PyYAML, websockets, mattermostdriver, aiohttp, Jinja2
|
|
||||||
```
|
|
||||||
|
|
||||||
### Install Everything
|
|
||||||
```bash
|
|
||||||
pip install hbd[all]
|
|
||||||
# Installs all dependencies for both client and server
|
|
||||||
```
|
|
||||||
|
|
||||||
### Development Installation
|
|
||||||
```bash
|
|
||||||
pip install -e ".[dev]"
|
|
||||||
# Includes all dependencies plus testing/linting tools
|
|
||||||
```
|
|
||||||
|
|
||||||
## Command-Line Interfaces
|
|
||||||
|
|
||||||
### HBC (Client)
|
|
||||||
```bash
|
|
||||||
hbc [options] host1 [host2 ...]
|
|
||||||
|
|
||||||
# Entry point: hbd.client.main:main
|
|
||||||
# Location: hbd/client/main.py
|
|
||||||
```
|
|
||||||
|
|
||||||
### HBD (Server)
|
|
||||||
```bash
|
|
||||||
hbd [options]
|
|
||||||
|
|
||||||
# Entry point: hbd.server.cli:main
|
|
||||||
# Location: hbd/server/cli.py → hbd/server/main.py
|
|
||||||
```
|
|
||||||
|
|
||||||
## Import Changes
|
|
||||||
|
|
||||||
### Client Code
|
|
||||||
```python
|
|
||||||
# Old imports
|
|
||||||
from .config import load_config
|
|
||||||
from .proto import dicttos, stodict
|
|
||||||
from .plugin import PluginRegistry
|
|
||||||
|
|
||||||
# New imports
|
|
||||||
from .config import load_config # Still in client/
|
|
||||||
from ..common.proto import dicttos # Moved to common/
|
|
||||||
from .plugin import PluginRegistry # Still in client/
|
|
||||||
```
|
|
||||||
|
|
||||||
### Server Code
|
|
||||||
```python
|
|
||||||
# Old imports
|
|
||||||
from .config import load_config
|
|
||||||
from .proto import stodict
|
|
||||||
from .threshold import AlertLevel
|
|
||||||
|
|
||||||
# New imports
|
|
||||||
from .config import load_config # Server-specific config
|
|
||||||
from ..common.proto import stodict # Moved to common/
|
|
||||||
from ..client.threshold import AlertLevel # Client module
|
|
||||||
```
|
|
||||||
|
|
||||||
### Plugin Code
|
|
||||||
```python
|
|
||||||
# Old import
|
|
||||||
from hbd.plugin import MonitorPlugin
|
|
||||||
|
|
||||||
# New import
|
|
||||||
from hbd.client.plugin import MonitorPlugin
|
|
||||||
```
|
|
||||||
|
|
||||||
## Benefits
|
|
||||||
|
|
||||||
1. **Modular Installation**: Install only what you need
|
|
||||||
- Client-only systems don't need web server dependencies
|
|
||||||
- Server-only systems don't need psutil
|
|
||||||
|
|
||||||
2. **Clearer Architecture**: Explicit separation of concerns
|
|
||||||
- Client: System monitoring and data collection
|
|
||||||
- Server: Heartbeat reception, web UI, notifications
|
|
||||||
- Common: Shared protocol and utilities
|
|
||||||
|
|
||||||
3. **Independent Evolution**: Client and server can evolve separately
|
|
||||||
- Different release cycles possible
|
|
||||||
- Clear API boundaries via common/
|
|
||||||
|
|
||||||
4. **Smaller Footprint**: Reduced dependency installation
|
|
||||||
- Client: ~1 dependency (psutil)
|
|
||||||
- Server: ~4 dependencies (websockets, aiohttp, Jinja2, mattermostdriver)
|
|
||||||
|
|
||||||
## Migration Guide
|
|
||||||
|
|
||||||
### For Existing Installations
|
|
||||||
|
|
||||||
1. **Reinstall the package**:
|
|
||||||
```bash
|
|
||||||
pip install -e ".[all]" # For development
|
|
||||||
# or
|
|
||||||
pip install hbd[all] # For production
|
|
||||||
```
|
|
||||||
|
|
||||||
2. **Configuration files remain unchanged**:
|
|
||||||
- Both client and server read from `~/.hb.yaml`
|
|
||||||
- All existing config keys are supported in both configs
|
|
||||||
- Server has additional keys (journal, websocket, email, etc.)
|
|
||||||
- Client has minimal keys (interval, plugins, thresholds)
|
|
||||||
|
|
||||||
3. **Commands remain the same**:
|
|
||||||
- `hbc` command works identically
|
|
||||||
- `hbd` command works identically
|
|
||||||
|
|
||||||
### For New Deployments
|
|
||||||
|
|
||||||
1. **Client-only system** (monitoring host):
|
|
||||||
```bash
|
|
||||||
pip install hbd[client]
|
|
||||||
hbc server1.example.com server2.example.com
|
|
||||||
```
|
|
||||||
|
|
||||||
2. **Server-only system** (monitoring daemon):
|
|
||||||
```bash
|
|
||||||
pip install hbd[server]
|
|
||||||
hbd -c /etc/hbd.yaml -f
|
|
||||||
```
|
|
||||||
|
|
||||||
3. **Combined system** (dev/test):
|
|
||||||
```bash
|
|
||||||
pip install hbd[all]
|
|
||||||
```
|
|
||||||
|
|
||||||
## Testing
|
|
||||||
|
|
||||||
All imports and entry points have been tested and validated:
|
|
||||||
- ✅ Package imports work correctly
|
|
||||||
- ✅ `hbc` command entry point functional
|
|
||||||
- ✅ `hbd` command entry point functional
|
|
||||||
- ✅ Optional dependencies properly configured
|
|
||||||
- ✅ All internal imports updated
|
|
||||||
|
|
||||||
## Files Archived
|
|
||||||
|
|
||||||
The following files were renamed to avoid conflicts:
|
|
||||||
- `hbd/config.py` → `hbd/config.py.old` (split into client/server configs)
|
|
||||||
- `hbd/hbc_old.py` → `hbd/hbc_old.py.bak` (backup file)
|
|
||||||
|
|
||||||
## Next Steps
|
|
||||||
|
|
||||||
1. Test client functionality with a monitoring host
|
|
||||||
2. Test server functionality with web UI and notifications
|
|
||||||
3. Update documentation (README.md) with new structure
|
|
||||||
4. Consider publishing to PyPI with new structure
|
|
||||||
5. Update any deployment scripts/Dockerfiles to use optional dependencies
|
|
||||||
+183
-47
@@ -814,34 +814,32 @@ Planned features:
|
|||||||
|
|
||||||
## Multi-Threshold Configuration
|
## Multi-Threshold Configuration
|
||||||
|
|
||||||
**New in version 2.0**: Support for multiple named threshold configurations with per-host mapping.
|
Support for multiple named threshold configurations with per-host mapping and composable layering.
|
||||||
|
|
||||||
### Overview
|
### Overview
|
||||||
|
|
||||||
The multi-threshold feature allows you to:
|
The multi-threshold feature allows you to:
|
||||||
- Define multiple sets of threshold configurations
|
- Define multiple named threshold configurations
|
||||||
- Map different hosts to different threshold sets
|
- Assign one or more configurations to each host
|
||||||
|
- Compose configurations by layering — each named config's overrides are applied in order on top of the defaults
|
||||||
- Use different sensitivity levels for different environments
|
- Use different sensitivity levels for different environments
|
||||||
- Maintain a default configuration for unmapped hosts
|
|
||||||
|
|
||||||
### Configuration Structure
|
### Configuration Structure
|
||||||
|
|
||||||
|
Named configurations are defined under `threshold_configs`. Each host selects which ones to use via `threshold_config` in the `hosts` section (a string for a single config, or a list to layer multiple):
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
# Optional: Set the default configuration name (defaults to "default")
|
# Optional: set the default configuration name (defaults to "default")
|
||||||
default_threshold_config: "default"
|
default_threshold_config: "default"
|
||||||
|
|
||||||
# Define multiple named threshold configurations
|
|
||||||
threshold_configs:
|
threshold_configs:
|
||||||
# Configuration name 1
|
|
||||||
default:
|
default:
|
||||||
thresholds:
|
thresholds:
|
||||||
# Standard threshold definitions
|
|
||||||
cpu_monitor:
|
cpu_monitor:
|
||||||
cpu_percent:
|
cpu_percent:
|
||||||
warning: 80.0
|
warning: 80.0
|
||||||
critical: 90.0
|
critical: 90.0
|
||||||
|
|
||||||
# Configuration name 2
|
|
||||||
high_sensitivity:
|
high_sensitivity:
|
||||||
thresholds:
|
thresholds:
|
||||||
cpu_monitor:
|
cpu_monitor:
|
||||||
@@ -849,7 +847,6 @@ threshold_configs:
|
|||||||
warning: 60.0
|
warning: 60.0
|
||||||
critical: 75.0
|
critical: 75.0
|
||||||
|
|
||||||
# Configuration name 3
|
|
||||||
low_sensitivity:
|
low_sensitivity:
|
||||||
thresholds:
|
thresholds:
|
||||||
cpu_monitor:
|
cpu_monitor:
|
||||||
@@ -857,14 +854,77 @@ threshold_configs:
|
|||||||
warning: 90.0
|
warning: 90.0
|
||||||
critical: 95.0
|
critical: 95.0
|
||||||
|
|
||||||
# Map specific hosts to specific configurations
|
hosts:
|
||||||
host_threshold_mapping:
|
prod-web-01:
|
||||||
prod-web-01: high_sensitivity
|
threshold_config: high_sensitivity # single config
|
||||||
prod-web-02: high_sensitivity
|
|
||||||
dev-server-01: low_sensitivity
|
dev-server-01:
|
||||||
# Unmapped hosts use default_threshold_config
|
threshold_config: low_sensitivity
|
||||||
|
|
||||||
|
# Hosts with no threshold_config use default_threshold_config
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Composable Configurations (list form)
|
||||||
|
|
||||||
|
`threshold_config` can be a list. Configs are applied **left to right**: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
threshold_configs:
|
||||||
|
default:
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 80, critical: 90}
|
||||||
|
memory_monitor:
|
||||||
|
memory_percent: {warning: 85, critical: 95}
|
||||||
|
disk_monitor:
|
||||||
|
partitions:
|
||||||
|
/:
|
||||||
|
percent: {warning: 80, critical: 90}
|
||||||
|
|
||||||
|
# Tighter CPU limits for busy servers
|
||||||
|
high_cpu_load:
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 60, critical: 75}
|
||||||
|
|
||||||
|
# Tighter disk limits for data-heavy servers
|
||||||
|
busy_disk:
|
||||||
|
thresholds:
|
||||||
|
disk_monitor:
|
||||||
|
partitions:
|
||||||
|
/:
|
||||||
|
percent: {warning: 70, critical: 85}
|
||||||
|
|
||||||
|
hosts:
|
||||||
|
# Gets default thresholds only
|
||||||
|
web-01:
|
||||||
|
threshold_config: default
|
||||||
|
|
||||||
|
# Gets tighter CPU limits, default memory and disk
|
||||||
|
build-server:
|
||||||
|
threshold_config: high_cpu_load
|
||||||
|
|
||||||
|
# Layers both: tighter CPU AND tighter disk, default memory
|
||||||
|
db-01:
|
||||||
|
threshold_config: [high_cpu_load, busy_disk]
|
||||||
|
|
||||||
|
# Three layers: busy_disk overrides high_cpu_load if they conflict
|
||||||
|
storage-01:
|
||||||
|
threshold_config: [default, high_cpu_load, busy_disk]
|
||||||
|
```
|
||||||
|
|
||||||
|
**How layering works:**
|
||||||
|
|
||||||
|
Starting from the `default` thresholds:
|
||||||
|
|
||||||
|
| Layer | Applied config | Effect |
|
||||||
|
|-------|---------------|--------|
|
||||||
|
| Base | `default` | all default thresholds |
|
||||||
|
| +1 | `high_cpu_load` | cpu_percent overridden to 60/75 |
|
||||||
|
| +2 | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |
|
||||||
|
|
||||||
|
Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.
|
||||||
|
|
||||||
### Use Cases
|
### Use Cases
|
||||||
|
|
||||||
#### 1. Environment-Based Thresholds
|
#### 1. Environment-Based Thresholds
|
||||||
@@ -887,11 +947,15 @@ threshold_configs:
|
|||||||
warning: 90.0 # More relaxed for dev
|
warning: 90.0 # More relaxed for dev
|
||||||
critical: 98.0
|
critical: 98.0
|
||||||
|
|
||||||
host_threshold_mapping:
|
hosts:
|
||||||
prod-web-01: production
|
prod-web-01:
|
||||||
prod-web-02: production
|
threshold_config: production
|
||||||
dev-web-01: development
|
prod-web-02:
|
||||||
dev-web-02: development
|
threshold_config: production
|
||||||
|
dev-web-01:
|
||||||
|
threshold_config: development
|
||||||
|
dev-web-02:
|
||||||
|
threshold_config: development
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 2. Server Role-Based Thresholds
|
#### 2. Server Role-Based Thresholds
|
||||||
@@ -914,7 +978,7 @@ threshold_configs:
|
|||||||
warning: 70.0
|
warning: 70.0
|
||||||
critical: 85.0
|
critical: 85.0
|
||||||
memory_monitor:
|
memory_monitor:
|
||||||
percent:
|
memory_percent:
|
||||||
warning: 90.0 # Databases can use high memory
|
warning: 90.0 # Databases can use high memory
|
||||||
critical: 97.0
|
critical: 97.0
|
||||||
disk_monitor:
|
disk_monitor:
|
||||||
@@ -927,17 +991,23 @@ threshold_configs:
|
|||||||
cache:
|
cache:
|
||||||
thresholds:
|
thresholds:
|
||||||
memory_monitor:
|
memory_monitor:
|
||||||
percent:
|
memory_percent:
|
||||||
warning: 95.0 # Redis/Memcached can use very high memory
|
warning: 95.0 # Redis/Memcached can use very high memory
|
||||||
critical: 99.0
|
critical: 99.0
|
||||||
|
|
||||||
host_threshold_mapping:
|
hosts:
|
||||||
web-01: webserver
|
web-01:
|
||||||
web-02: webserver
|
threshold_config: webserver
|
||||||
db-01: database
|
web-02:
|
||||||
db-02: database
|
threshold_config: webserver
|
||||||
redis-01: cache
|
db-01:
|
||||||
memcached-01: cache
|
threshold_config: database
|
||||||
|
db-02:
|
||||||
|
threshold_config: database
|
||||||
|
redis-01:
|
||||||
|
threshold_config: cache
|
||||||
|
memcached-01:
|
||||||
|
threshold_config: cache
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 3. Sensitivity Levels
|
#### 3. Sensitivity Levels
|
||||||
@@ -952,7 +1022,7 @@ threshold_configs:
|
|||||||
partitions:
|
partitions:
|
||||||
/:
|
/:
|
||||||
percent:
|
percent:
|
||||||
warning: 70.0 # Very sensitive
|
warning: 70.0
|
||||||
critical: 80.0
|
critical: 80.0
|
||||||
hysteresis: 0.15
|
hysteresis: 0.15
|
||||||
|
|
||||||
@@ -976,12 +1046,69 @@ threshold_configs:
|
|||||||
critical: 98.0
|
critical: 98.0
|
||||||
hysteresis: 0.05
|
hysteresis: 0.05
|
||||||
|
|
||||||
host_threshold_mapping:
|
hosts:
|
||||||
payment-gateway: critical
|
payment-gateway:
|
||||||
auth-server: critical
|
threshold_config: critical
|
||||||
web-01: standard
|
auth-server:
|
||||||
web-02: standard
|
threshold_config: critical
|
||||||
test-server: relaxed
|
web-01:
|
||||||
|
threshold_config: standard
|
||||||
|
web-02:
|
||||||
|
threshold_config: standard
|
||||||
|
test-server:
|
||||||
|
threshold_config: relaxed
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 4. Composable Profiles
|
||||||
|
|
||||||
|
Build host-specific thresholds by combining small, focused configs:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
threshold_configs:
|
||||||
|
# Baseline — everything at default levels
|
||||||
|
default:
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 80, critical: 90}
|
||||||
|
memory_monitor:
|
||||||
|
memory_percent: {warning: 85, critical: 95}
|
||||||
|
|
||||||
|
# Overlay: tighter CPU only
|
||||||
|
tight_cpu:
|
||||||
|
thresholds:
|
||||||
|
cpu_monitor:
|
||||||
|
cpu_percent: {warning: 60, critical: 75}
|
||||||
|
|
||||||
|
# Overlay: tighter memory only
|
||||||
|
tight_memory:
|
||||||
|
thresholds:
|
||||||
|
memory_monitor:
|
||||||
|
memory_percent: {warning: 70, critical: 85}
|
||||||
|
|
||||||
|
# Overlay: extra disk partition for database servers
|
||||||
|
db_disk:
|
||||||
|
thresholds:
|
||||||
|
disk_monitor:
|
||||||
|
partitions:
|
||||||
|
/var/lib/postgresql:
|
||||||
|
percent: {warning: 75, critical: 88}
|
||||||
|
|
||||||
|
hosts:
|
||||||
|
# Plain web server
|
||||||
|
web-01:
|
||||||
|
threshold_config: default
|
||||||
|
|
||||||
|
# Build server: tight CPU, default memory and disk
|
||||||
|
build-01:
|
||||||
|
threshold_config: tight_cpu
|
||||||
|
|
||||||
|
# Database: tight CPU + tight memory + extra disk partition
|
||||||
|
db-01:
|
||||||
|
threshold_config: [tight_cpu, tight_memory, db_disk]
|
||||||
|
|
||||||
|
# Replica database: tight memory + extra disk, normal CPU
|
||||||
|
db-02:
|
||||||
|
threshold_config: [tight_memory, db_disk]
|
||||||
```
|
```
|
||||||
|
|
||||||
### Backward Compatibility
|
### Backward Compatibility
|
||||||
@@ -1012,16 +1139,25 @@ threshold_configs:

 ### Configuration Priority

-1. **Host-specific mapping**: If host is in `host_threshold_mapping`, use that config
-2. **Default config**: Use `default_threshold_config`
-3. **First alphabetically**: If default not found, use first config alphabetically
-4. **Legacy fallback**: If `threshold_configs` not present, use `thresholds`
+1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
+2. **Host `threshold_config` (string)**: Use that single named config directly
+3. **`host_threshold_mapping`** (legacy): Same as above, string only
+4. **`default_threshold_config`**: Used for hosts with no mapping
+5. **First alphabetically**: If the default config is not found, use the first config alphabetically
+6. **Legacy `thresholds` section**: Used when `threshold_configs` is absent entirely

-### Example: Complete Multi-Threshold Setup
+### Backward Compatibility

-See `hbd/config_multi_threshold_example.yaml` for a complete example with:
-- 4 named configurations (default, high_sensitivity, low_sensitivity, database)
-- Host-to-config mappings for production, development, and test systems
-- Specialized database server thresholds
-- Custom display messages with plugin data
+The legacy `host_threshold_mapping` top-level key and the flat `thresholds` section are still fully supported:
+
+```yaml
+# Still works — equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
+host_threshold_mapping:
+  prod-web-01: high_sensitivity
+
+# Still works — equivalent to threshold_configs: {default: {thresholds: ...}}
+thresholds:
+  cpu_monitor:
+    cpu_percent: {warning: 80, critical: 90}
+```

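The six-step Configuration Priority list in the hunk above reads naturally as a chain of fallbacks. The sketch below is one possible reading of that list, not the project's code: `resolve_config_names` is a hypothetical function, the returned labels are invented for illustration, and it only decides which profile names apply without merging any thresholds.

```python
def resolve_config_names(config, host):
    """Return (source, names): where the selection came from and which profiles apply."""
    configs = config.get("threshold_configs")
    if not configs:
        # Priority 6: no named configs at all, so the flat `thresholds` section is used.
        return ("legacy_thresholds", [])

    host_entry = (config.get("hosts") or {}).get(host) or {}
    selected = host_entry.get("threshold_config")
    if isinstance(selected, (list, tuple)):
        return ("host_list", list(selected))  # priority 1
    if isinstance(selected, str):
        return ("host_string", [selected])  # priority 2

    legacy = (config.get("host_threshold_mapping") or {}).get(host)
    if isinstance(legacy, str):
        return ("host_threshold_mapping", [legacy])  # priority 3

    default_name = config.get("default_threshold_config", "default")
    if default_name in configs:
        return ("default", [default_name])  # priority 4

    return ("first_alphabetical", [sorted(configs)[0]])  # priority 5


CONFIG = {
    "threshold_configs": {"default": {}, "tight_cpu": {}, "db_disk": {}, "high_sensitivity": {}},
    "hosts": {
        "db-01": {"threshold_config": ["tight_cpu", "db_disk"]},
        "build-01": {"threshold_config": "tight_cpu"},
    },
    "host_threshold_mapping": {"prod-web-01": "high_sensitivity"},
}

for name in ("db-01", "build-01", "prod-web-01", "unlisted-host"):
    print(name, resolve_config_names(CONFIG, name))
```

Returning the source label alongside the names makes it easy to log why a host ended up with a given profile, which helps when a legacy `host_threshold_mapping` entry and a `hosts` entry disagree.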
@@ -0,0 +1,602 @@
+# Plugin Error Checking Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode.
+
+**Architecture:** Three focused changes across three files: (1) `hbd/client/plugin.py` gains a `skip_reason` attribute on Plugin and updated PluginLoader messaging; (2) `hbd/client/plugins/nagios_runner.py` gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) `hbd/client/main.py` gains proper post-fork logging reconfiguration to syslog.
+
+**Tech Stack:** Python 3.11+, asyncio, `logging.handlers.SysLogHandler`, pytest
+
+---
+
+## File Map
+
+| Action | Path | What changes |
+|---|---|---|
+| Modify | `hbd/client/plugin.py` | `Plugin.__init__` gains `skip_reason`; `PluginLoader` checks it |
+| Modify | `hbd/client/plugins/nagios_runner.py` | async subprocess, stderr, signal codes, init validation, `skip_reason` |
+| Modify | `hbd/client/main.py` | `_reconfigure_logging_for_daemon()` helper; remove redundant syslog calls |
+| Create | `tests/test_plugin.py` | PluginLoader messaging tests |
+| Create | `tests/test_nagios_runner.py` | NagiosRunnerPlugin behaviour tests |
+
+Run tests throughout with:
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+
+---
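The Architecture paragraph of the plan asks `nagios_runner` for async subprocess execution, stderr capture, and handling of processes killed by a signal. As a rough illustration of how those three pieces fit together with `asyncio`, here is a minimal sketch. `run_check`, the result dictionary, and the choice to report failures as UNKNOWN (3) are assumptions made for this example, not the plan's actual `NagiosRunnerPlugin` code.

```python
import asyncio
import sys

# Conventional Nagios plugin exit codes; 3 (UNKNOWN) serves as the catch-all here.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def _unknown(message):
    """Build an UNKNOWN result carrying the reason in stderr."""
    return {"state": "UNKNOWN", "exit_code": 3, "stdout": "", "stderr": message}


async def run_check(argv, timeout=10.0):
    """Run one check command without blocking the event loop."""
    try:
        proc = await asyncio.create_subprocess_exec(
            *argv,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
    except (FileNotFoundError, PermissionError) as exc:
        return _unknown(str(exc))  # command missing or not executable

    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()  # reap the child so it does not linger as a zombie
        return _unknown(f"timed out after {timeout}s")

    code = proc.returncode
    stderr = err.decode(errors="replace").strip()
    if code < 0:
        # On POSIX a negative return code means the process died from signal -code.
        stderr = f"killed by signal {-code}" + (f": {stderr}" if stderr else "")
        code = 3
    return {
        "state": NAGIOS_STATES.get(code, "UNKNOWN"),
        "exit_code": code,
        "stdout": out.decode(errors="replace").strip(),
        "stderr": stderr,
    }


# Stand-in for a real check: prints a status line, writes to stderr, exits with 2.
child = "import sys; print('load high'); sys.stderr.write('details'); sys.exit(2)"
result = asyncio.run(run_check([sys.executable, "-c", child]))
print(result)
```

Passing the command as an argument list avoids shell quoting issues. A configured command string such as the `check_load` example would first need splitting, for instance with `shlex.split`.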
|
|
||||||
|
## Task 1: Plugin.skip_reason + PluginLoader messaging

**Files:**
- Modify: `hbd/client/plugin.py:40-48` (`Plugin.__init__`)
- Modify: `hbd/client/plugin.py:369-381` (`PluginLoader.load_from_directory`)
- Create: `tests/test_plugin.py`

- [ ] **Step 1: Write failing tests**

Create `tests/test_plugin.py`:

```python
import asyncio
import logging
import textwrap

from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry


def test_plugin_skip_reason_defaults_none(tmp_path):
    plugin_code = textwrap.dedent("""
        from hbd.client.plugin import MonitorPlugin

        class MinimalPlugin(MonitorPlugin):
            name = "minimal"
            version = "1.0.0"
            interval = 60

            async def initialize(self):
                return True

            async def _collect_metrics(self):
                return {}
    """)
    (tmp_path / "minimal.py").write_text(plugin_code)
    registry = PluginRegistry()
    loader = PluginLoader(registry)
    asyncio.run(loader.load_from_directory(tmp_path))
    plugin = registry.get("minimal")
    assert plugin is not None
    assert plugin.skip_reason is None


def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
    plugin_code = textwrap.dedent("""
        from hbd.client.plugin import MonitorPlugin

        class SkippablePlugin(MonitorPlugin):
            name = "skippable"
            version = "1.0.0"
            interval = 60

            async def initialize(self):
                self.skip_reason = "not configured in yaml"
                return False

            async def _collect_metrics(self):
                return {}
    """)
    (tmp_path / "skippable.py").write_text(plugin_code)
    registry = PluginRegistry()
    loader = PluginLoader(registry)

    with caplog.at_level(logging.INFO, logger="plugin.loader"):
        count = asyncio.run(loader.load_from_directory(tmp_path))

    assert count == 0
    assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
    assert not any("failed initialization" in r.message for r in caplog.records)


def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
    plugin_code = textwrap.dedent("""
        from hbd.client.plugin import MonitorPlugin

        class FailPlugin(MonitorPlugin):
            name = "fail"
            version = "1.0.0"
            interval = 60

            async def initialize(self):
                return False

            async def _collect_metrics(self):
                return {}
    """)
    (tmp_path / "fail_plugin.py").write_text(plugin_code)
    registry = PluginRegistry()
    loader = PluginLoader(registry)

    with caplog.at_level(logging.WARNING, logger="plugin.loader"):
        count = asyncio.run(loader.load_from_directory(tmp_path))

    assert count == 0
    assert any("failed initialization" in r.message for r in caplog.records)
```

- [ ] **Step 2: Run tests to verify they fail**

```bash
python -m pytest tests/test_plugin.py -v
```

Expected: `test_plugin_skip_reason_defaults_none` FAILS (attribute missing) and `test_loader_logs_info_when_skip_reason_set` FAILS (the loader still logs the generic warning). `test_loader_logs_warning_when_no_skip_reason` covers existing behaviour and is expected to pass already.

- [ ] **Step 3: Add `skip_reason` to `Plugin.__init__`**

In `hbd/client/plugin.py`, in `Plugin.__init__` (around line 46), add one line:

```python
def __init__(self, config: Optional[Dict[str, Any]] = None):
    self.config = config or {}
    self.logger = logging.getLogger(f"plugin.{self.name}")
    self._initialized = False
    self.skip_reason: Optional[str] = None
```

- [ ] **Step 4: Update PluginLoader messaging**

In `hbd/client/plugin.py`, replace the `if not initialized:` block (around line 372):

```python
if not initialized:
    if plugin.skip_reason:
        self.logger.info(
            f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
        )
    else:
        self.logger.warning(
            f"Plugin {plugin.name} failed initialization, skipping"
        )
    continue
```

- [ ] **Step 5: Run tests to verify they pass**

```bash
python -m pytest tests/test_plugin.py -v
```

Expected: all 3 tests PASS.

- [ ] **Step 6: Commit**

```bash
git add hbd/client/plugin.py tests/test_plugin.py
git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging"
```

---

## Task 2: NagiosRunnerPlugin — skip_reason when no commands

**Files:**
- Modify: `hbd/client/plugins/nagios_runner.py:88-105` (initialize)
- Create: `tests/test_nagios_runner.py`

- [ ] **Step 1: Write failing test**

Create `tests/test_nagios_runner.py`:

```python
import asyncio
import logging
import os
import stat

import pytest

from hbd.client.plugins.nagios_runner import (
    NagiosRunnerPlugin,
    NAGIOS_OK,
    NAGIOS_WARNING,
    NAGIOS_CRITICAL,
    NAGIOS_UNKNOWN,
)


def test_no_commands_sets_skip_reason():
    plugin = NagiosRunnerPlugin(config={"commands": []})
    result = asyncio.run(plugin.initialize())
    assert result is False
    assert plugin.skip_reason is not None
    assert "nagios_runner.commands" in plugin.skip_reason
```

- [ ] **Step 2: Run test to verify it fails**

```bash
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
```

Expected: FAIL — `plugin.skip_reason` is `None`.

- [ ] **Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()**

In `hbd/client/plugins/nagios_runner.py`, replace the early-return block in `initialize()` (around line 96):

```python
if not self.commands:
    self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
    self.logger.info("No Nagios commands configured")
    return False
```

- [ ] **Step 4: Run test to verify it passes**

```bash
python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
```

Expected: PASS.

- [ ] **Step 5: Commit**

```bash
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: set skip_reason on nagios_runner when no commands configured"
```

---

## Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes

**Files:**
- Modify: `hbd/client/plugins/nagios_runner.py` (imports + `_run_nagios_plugin`)
- Modify: `tests/test_nagios_runner.py`

- [ ] **Step 1: Write failing tests**

Append to `tests/test_nagios_runner.py`:

```python
def test_stderr_used_when_stdout_empty(tmp_path):
    script = tmp_path / "check_err.sh"
    script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
    script.chmod(script.stat().st_mode | stat.S_IEXEC)

    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
    plugin = NagiosRunnerPlugin(config=config)
    asyncio.run(plugin.initialize())
    data = asyncio.run(plugin._collect_metrics())

    assert "error from stderr" in data["t_output"]
    assert data["t_status_code"] == NAGIOS_CRITICAL


def test_stderr_appended_when_both_present(tmp_path):
    script = tmp_path / "check_both.sh"
    script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
    script.chmod(script.stat().st_mode | stat.S_IEXEC)

    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
    plugin = NagiosRunnerPlugin(config=config)
    asyncio.run(plugin.initialize())
    data = asyncio.run(plugin._collect_metrics())

    assert "OK - all good" in data["t_output"]
    assert "extra detail" in data["t_output"]
    assert data["t_status_code"] == NAGIOS_OK


def test_negative_returncode_maps_to_unknown():
    # kill -9 $$ kills the shell itself; asyncio sees returncode -9
    config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
    plugin = NagiosRunnerPlugin(config=config)
    asyncio.run(plugin.initialize())
    data = asyncio.run(plugin._collect_metrics())

    assert data["t_status_code"] == NAGIOS_UNKNOWN
    assert "signal" in data["t_output"].lower()
```

- [ ] **Step 2: Run tests to verify they fail**

```bash
python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \
  tests/test_nagios_runner.py::test_stderr_appended_when_both_present \
  tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v
```

Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes.

- [ ] **Step 3: Update imports in nagios_runner.py**

Replace the import block at the top of `hbd/client/plugins/nagios_runner.py`:

```python
import asyncio
import os
import re
from typing import Any, Dict, List, Optional, Tuple

from hbd.client.plugin import MonitorPlugin
```

(Remove `import subprocess`; add `import asyncio` and `import os`.)

- [ ] **Step 4: Upgrade collection log level from DEBUG to INFO**

In `hbd/client/plugins/nagios_runner.py`, in `_collect_metrics()`, change the debug log (around line 144) so results are visible at INFO level:

```python
self.logger.info(
    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
)
```

- [ ] **Step 5: Replace `_run_nagios_plugin` with async implementation**

Replace the entire `_run_nagios_plugin` method in `hbd/client/plugins/nagios_runner.py`:

```python
async def _run_nagios_plugin(
    self,
    command: str
) -> Tuple[int, str, Dict[str, Any]]:
    """Execute a Nagios plugin and parse its output."""
    try:
        proc = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            stdout_bytes, stderr_bytes = await asyncio.wait_for(
                proc.communicate(), timeout=self.timeout
            )
        except asyncio.TimeoutError:
            proc.kill()
            await proc.communicate()
            self.logger.error(f"Command timed out: {command}")
            return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}

        status_code = proc.returncode

        if status_code < 0:
            return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}

        if status_code > 3:
            status_code = NAGIOS_UNKNOWN

        stdout = stdout_bytes.decode(errors="replace").strip()
        stderr = stderr_bytes.decode(errors="replace").strip()

        # Parse perfdata from stdout before mixing in stderr
        perfdata = self._parse_perfdata(stdout)

        # Build status message
        status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout

        if not stdout and stderr:
            output_msg = stderr
        elif stdout and stderr:
            output_msg = f"{status_part} [stderr: {stderr}]"
        else:
            output_msg = status_part

        return status_code, output_msg, perfdata

    except Exception as e:
        self.logger.error(f"Error executing command: {e}")
        return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
```

Also remove the now-unused `self.shell` line from `__init__` (the `shell` config key is no longer used since `create_subprocess_shell` always uses a shell):

In `NagiosRunnerPlugin.__init__`, remove:

```python
self.shell: bool = config.get("shell", True) if config else True
```

- [ ] **Step 6: Run tests to verify they pass**

```bash
python -m pytest tests/test_nagios_runner.py -v
```

Expected: all tests PASS including the 3 new ones.

- [ ] **Step 7: Commit**

```bash
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling"
```

---

## Task 4: NagiosRunnerPlugin — command path validation at init

**Files:**
- Modify: `hbd/client/plugins/nagios_runner.py` (initialize)
- Modify: `tests/test_nagios_runner.py`

- [ ] **Step 1: Write failing tests**

Append to `tests/test_nagios_runner.py`:

```python
def test_absolute_path_not_found_warns(caplog):
    fake_cmd = "/nonexistent_hbc_test_path/check_something"
    config = {"commands": [{"name": "t", "command": fake_cmd}]}
    plugin = NagiosRunnerPlugin(config=config)

    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
        asyncio.run(plugin.initialize())

    assert any("not found" in r.message for r in caplog.records)


def test_absolute_path_not_executable_warns(caplog, tmp_path):
    non_exec = tmp_path / "check_test"
    non_exec.write_text("#!/bin/sh\necho OK\n")
    non_exec.chmod(0o644)  # readable but not executable

    config = {"commands": [{"name": "t", "command": str(non_exec)}]}
    plugin = NagiosRunnerPlugin(config=config)

    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
        asyncio.run(plugin.initialize())

    assert any("not executable" in r.message for r in caplog.records)


def test_relative_path_not_checked(caplog):
    # Relative paths (resolved via PATH) must not generate warnings
    config = {"commands": [{"name": "t", "command": "echo OK"}]}
    plugin = NagiosRunnerPlugin(config=config)

    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
        asyncio.run(plugin.initialize())

    assert not any(
        "not found" in r.message or "not executable" in r.message
        for r in caplog.records
    )
```

- [ ] **Step 2: Run tests to verify they fail**

```bash
python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \
  tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \
  tests/test_nagios_runner.py::test_relative_path_not_checked -v
```

Expected: `test_absolute_path_not_found_warns` and `test_absolute_path_not_executable_warns` FAIL (no warnings logged); `test_relative_path_not_checked` may pass.

- [ ] **Step 3: Add command path validation to `initialize()`**

In `hbd/client/plugins/nagios_runner.py`, extend `initialize()` by adding validation after the existing "log each command" loop (after line 103, before `return True`):

```python
# Validate absolute command paths early
for cmd_config in self.commands:
    name = cmd_config.get("name", "unnamed")
    command = cmd_config.get("command", "")
    # Skip empty or whitespace-only commands; split()[0] would raise on them
    if not command.strip():
        continue
    exe = command.split()[0]
    if os.path.isabs(exe):
        if not os.path.isfile(exe):
            self.logger.warning(
                f"Command '{name}': executable not found: {exe}"
            )
        elif not os.access(exe, os.X_OK):
            self.logger.warning(
                f"Command '{name}': file is not executable: {exe}"
            )
```

- [ ] **Step 4: Run full test suite to verify all pass**

```bash
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
```

Expected: all tests PASS.

- [ ] **Step 5: Commit**

```bash
git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
git commit -m "feat: validate absolute command paths at nagios_runner init"
```

---

## Task 5: Daemon mode logging — route to syslog after fork

**Files:**
- Modify: `hbd/client/main.py` (new helper + updated daemon block)

No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below.

- [ ] **Step 1: Add `_reconfigure_logging_for_daemon` helper**

In `hbd/client/main.py`, add this function just before `def build_parser()` (around line 589):

```python
def _reconfigure_logging_for_daemon(log_level: int) -> None:
    """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
    from logging.handlers import SysLogHandler

    root = logging.getLogger()
    for handler in root.handlers[:]:
        root.removeHandler(handler)
        handler.close()

    try:
        syslog_handler = SysLogHandler(
            address="/dev/log",
            facility=SysLogHandler.LOG_DAEMON,
        )
    except OSError:
        syslog_handler = SysLogHandler(
            address=("localhost", 514),
            facility=SysLogHandler.LOG_DAEMON,
        )
        # Attach the fallback first so the warning reaches syslog
        syslog_handler.setFormatter(
            logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
        )
        root.addHandler(syslog_handler)
        root.setLevel(log_level)
        logging.warning("/dev/log not found, using syslog UDP localhost:514")
        return

    syslog_handler.setFormatter(
        logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
    )
    root.addHandler(syslog_handler)
    root.setLevel(log_level)
```

- [ ] **Step 2: Update the daemon block in `main()`**

In `hbd/client/main.py`, replace the entire `if args.daemon:` block (lines 664–675):

```python
if args.daemon:
    print("Daemonizing...")
    daemonize()
    _reconfigure_logging_for_daemon(log_level)
    logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
```

This removes the `import syslog`, `syslog.openlog()`, and `syslog.syslog()` calls (now handled by the logging system) and removes the no-op second `logging.basicConfig()` call.

- [ ] **Step 3: Run existing test suite to confirm no regressions**

```bash
python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
```

Expected: all tests still PASS.

- [ ] **Step 4: Manual smoke test — verify syslog output in daemon mode**

```bash
# In one terminal, tail syslog
sudo journalctl -f -t hbc

# In another terminal, start hbc in daemon mode (replace localhost with a real or dummy host)
python -m hbd.client.main -d -v localhost

# Expected in journalctl output:
# hbc[<pid>]: hbc.main INFO: Starting hbc for <hostname> -> ['localhost']
# hbc[<pid>]: hbc.main INFO: hbc starting, sending heartbeat to localhost
# hbc[<pid>]: plugin.loader INFO: ...

# Stop the daemon
pkill -f "hbd.client.main"
```

- [ ] **Step 5: Commit**

```bash
git add hbd/client/main.py
git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig"
```
@@ -0,0 +1,92 @@
# Plugin Error Checking & Daemon Logging — Design Spec

**Date:** 2026-04-25
**Scope:** hbc client — daemon mode logging, nagios_runner plugin robustness, PluginLoader messaging
**Files affected:** `hbd/client/main.py`, `hbd/client/plugins/nagios_runner.py`, `hbd/client/plugin.py`

---

## 1. Daemon Mode Logging

### Problem
In `main()`, `logging.basicConfig()` is called before `daemonize()` (establishing a StreamHandler to stderr), then called again after `daemonize()`. The second call is a no-op — Python ignores `basicConfig()` when handlers are already configured. After daemonization, stderr is redirected to `/dev/null`, so all subsequent log output is silently discarded.

The existing `syslog.openlog()` / `syslog.syslog()` calls (lines 666–668) write a single startup message but do not integrate with the `logging` system, so plugin and connection log messages never reach syslog.
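
The no-op behaviour described above can be reproduced in isolation. The following is a minimal sketch, not code from `hbd/client/main.py`: the in-memory streams and the `demo` logger name are stand-ins chosen for illustration, and the root logger's previous handlers and level are restored at the end.

```python
import io
import logging

root = logging.getLogger()
saved_handlers = root.handlers[:]      # keep whatever the host process installed
saved_level = root.level
for handler in saved_handlers:
    root.removeHandler(handler)

first_stream = io.StringIO()           # stands in for the pre-fork stderr
second_stream = io.StringIO()          # stands in for the intended new target

logging.basicConfig(stream=first_stream, level=logging.INFO)    # installs one StreamHandler
logging.basicConfig(stream=second_stream, level=logging.DEBUG)  # ignored: root already has a handler

logging.getLogger("demo").info("visible")
logging.getLogger("demo").debug("dropped")  # the level was not changed to DEBUG either

handler_count = len(root.handlers)
first_output = first_stream.getvalue()
second_output = second_stream.getvalue()
effective_level = root.level

# Restore the original logging state so the demo leaves no trace.
for handler in root.handlers[:]:
    root.removeHandler(handler)
for handler in saved_handlers:
    root.addHandler(handler)
root.setLevel(saved_level)
```

Everything ends up in `first_stream`; `second_stream` stays empty. Python 3.8 and later also accept `basicConfig(force=True)` to replace existing handlers, but the fix in this spec swaps the handlers explicitly instead.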

### Fix
After `daemonize()`, explicitly reconfigure the root logger:

1. Remove all existing handlers (they now write to `/dev/null`).
2. Add `logging.handlers.SysLogHandler(address='/dev/log', facility=LOG_DAEMON)`.
3. Set formatter: `hbc[%(process)d]: %(name)s %(levelname)s: %(message)s`
4. Preserve the `log_level` already determined from `-v`/`-x` CLI flags.

Remove the redundant `syslog.openlog()` / `syslog.syslog()` calls — the logging system handles routing.

**Fallback:** If `/dev/log` does not exist (containers, some BSDs), fall back to `SysLogHandler(address=('localhost', 514))`. Log one warning through the fallback handler once it is attached so the operator knows; stderr already points at `/dev/null` by then, so a warning written there would be lost.
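
The address selection can be sketched as a small pure function. This is one possible variant under stated assumptions: `pick_syslog_address` is a hypothetical helper that tests for the socket path up front, whereas the implementation plan catches `OSError` from the `SysLogHandler` constructor instead. The return value is what would be passed as `address=` to `SysLogHandler`.

```python
import os
from typing import Tuple, Union

SyslogAddress = Union[str, Tuple[str, int]]


def pick_syslog_address(socket_path: str = "/dev/log") -> SyslogAddress:
    """Return the local syslog socket path if it exists, else the UDP fallback.

    Hypothetical helper for illustration; not part of hbd/client/main.py.
    """
    if os.path.exists(socket_path):
        return socket_path
    # No local socket (containers, some BSDs): use syslog over UDP on localhost.
    return ("localhost", 514)
```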

---

## 2. Nagios Runner Improvements

### 2a — Async Subprocess
`_run_nagios_plugin()` is declared `async def` but calls `subprocess.run()` synchronously, blocking the event loop for the full command duration.

**Fix:** Replace with `asyncio.create_subprocess_shell()` + `await proc.communicate()`. Enforce timeout with `asyncio.wait_for(..., timeout=self.timeout)` and catch `asyncio.TimeoutError`.
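
The timeout pattern from the fix can be sketched on its own. This is a minimal illustration, assuming a POSIX shell is available; `run_with_timeout` is a hypothetical name and returns `None` as the exit code on timeout, which is not how the plugin reports it.

```python
import asyncio
from typing import Optional, Tuple


async def run_with_timeout(command: str, timeout: float) -> Tuple[Optional[int], str]:
    """Run a shell command without blocking the event loop.

    Returns (returncode, stdout); returncode is None if the command timed out.
    """
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout_bytes, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()        # stop the child process
        await proc.wait()  # reap it so no zombie is left behind
        return None, "timed out"
    return proc.returncode, stdout_bytes.decode(errors="replace").strip()


if __name__ == "__main__":
    print(asyncio.run(run_with_timeout("echo hi", 5.0)))
    # "exec" replaces the shell with sleep, so kill() reaches the sleeping process.
    print(asyncio.run(run_with_timeout("exec sleep 5", 0.2)))
```

`proc.kill()` only signals the shell that `create_subprocess_shell` started. If that shell has spawned children that keep the pipes open, cleanup may take longer than expected, which is why the demo uses `exec`.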

### 2b — Stderr Capture
Subprocess stderr is currently discarded: `capture_output=True` in the sync call captures both stdout and stderr, but only stdout is read afterwards, so stderr content is lost.

**Fix:** Pass `stderr=asyncio.subprocess.PIPE` to `create_subprocess_shell`. After `communicate()`, if stdout is empty but stderr has content, use stderr as the output message. If both have content, append stderr to the output for visibility.
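
The three cases in the fix reduce to a small pure function. The sketch below uses a hypothetical helper name, `build_output_message`, and borrows the `[stderr: ...]` suffix format from the implementation plan; perfdata after the `|` separator is dropped from the message.

```python
def build_output_message(stdout: str, stderr: str) -> str:
    """Combine already-stripped stdout/stderr into one status message."""
    # Nagios plugins put perfdata after '|'; keep only the human-readable part.
    status_part = stdout.split("|", 1)[0].strip()
    if not stdout and stderr:
        return stderr                                # stderr is all there is
    if stdout and stderr:
        return f"{status_part} [stderr: {stderr}]"   # keep both visible
    return status_part                               # stdout only, or nothing at all
```

For example, `build_output_message("OK - fine | load=0.1", "note")` yields `OK - fine [stderr: note]`.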

### 2c — Negative Return Codes
A negative `returncode` means the process was killed by a signal (SIGKILL, OOM, etc.). The current code treats these as-is, which may produce unexpected status values.

**Fix:** If `returncode < 0`, map to `NAGIOS_UNKNOWN` with message `"Process killed by signal {-returncode}"`.
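
A minimal sketch of the mapping follows. The constants are defined locally using the conventional Nagios values 0 to 3, which is an assumption about what `nagios_runner.py` exports. `normalize_returncode` is a hypothetical helper, and appending the signal name goes slightly beyond the message given in the fix.

```python
import signal
from typing import Optional, Tuple

# Conventional Nagios plugin exit codes (assumed to match the module's constants).
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3


def normalize_returncode(returncode: int) -> Tuple[int, Optional[str]]:
    """Map a raw process return code to (nagios_status, optional_message)."""
    if returncode < 0:
        signum = -returncode
        try:
            label = f"{signum} ({signal.Signals(signum).name})"
        except ValueError:
            label = str(signum)  # signal number unknown on this platform
        return NAGIOS_UNKNOWN, f"Process killed by signal {label}"
    if returncode > NAGIOS_UNKNOWN:
        return NAGIOS_UNKNOWN, None  # out-of-range exit codes become UNKNOWN
    return returncode, None
```

Exit codes 0 through 3 pass through unchanged. Out-of-range codes and signal deaths both map to UNKNOWN, but only a signal death produces a message.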

### 2d — Command Path Validation at Init
`initialize()` currently only checks that the commands list is non-empty.

**Fix:** For each command entry during `initialize()`:
- Warn and skip the entry if `name` or `command` is missing.
- Extract the executable (first whitespace-delimited token of the command string).
- If the executable is an absolute path, check `os.path.isfile()` and `os.access(..., os.X_OK)`. Log a `WARNING` if either check fails.
- Commands with relative paths or shell builtins are not checked (they may be on PATH) — just noted.
- Validation warns only; all original entries in `self.commands` are retained and still attempted at collection time (where the existing missing-name/command guard already skips them). The plugin initializes successfully as long as the commands list is non-empty.
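
The per-command check listed above can be sketched as a function that returns a warning string, or `None` when there is nothing to report. `check_command` is a hypothetical helper and its message texts are illustrative. It only inspects the command; logging and the decision to continue stay with the caller, in line with the warn-only rule.

```python
import os
from typing import Optional


def check_command(command: str) -> Optional[str]:
    """Return a warning for a bad absolute executable path, else None."""
    tokens = command.split()
    if not tokens:
        return "empty command"
    exe = tokens[0]  # first whitespace-delimited token
    if not os.path.isabs(exe):
        return None  # relative names and shell builtins are left to PATH lookup
    if not os.path.isfile(exe):
        return f"executable not found: {exe}"
    if not os.access(exe, os.X_OK):
        return f"file is not executable: {exe}"
    return None
```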

---

## 3. PluginLoader Messaging

### Problem
When `initialize()` returns `False`, the loader always logs:
> `WARNING: Plugin X failed initialization, skipping`

This is alarming when the real reason is simply "no commands configured". There is no API to distinguish "not configured" from "genuinely broken".

### Fix
Add an optional `skip_reason` attribute to `Plugin.__init__()` (defaults to `None`).

In `PluginLoader.load_from_directory()`, after `initialize()` returns `False`:
- If `plugin.skip_reason` is set → `logger.info(f"Plugin {plugin.name} skipped: {plugin.skip_reason}")`
- If `plugin.skip_reason` is `None` → `logger.warning(f"Plugin {plugin.name} failed initialization, skipping")` (existing behaviour)

In `NagiosRunnerPlugin.initialize()`, when no commands are configured:
```python
self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
return False
```

Genuine failures (exceptions) continue to go through the existing `except` block in the loader, logging at `ERROR` with traceback — unchanged.
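
The resulting outcomes (loaded, skipped, failed, crashed) can be seen side by side in a self-contained sketch. `DemoPlugin`, `ListHandler`, and `load_all` are stand-ins invented for this illustration and do not mirror the real `Plugin` or `PluginLoader` classes; the wording of the ERROR message is likewise an assumption.

```python
import asyncio
import logging
from typing import List, Optional, Tuple


class ListHandler(logging.Handler):
    """Collects (levelname, message) pairs so the outcome can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[Tuple[str, str]] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append((record.levelname, record.getMessage()))


class DemoPlugin:
    """Stand-in plugin whose initialize() outcome is fixed by its arguments."""

    def __init__(self, name: str, ok: bool, reason: Optional[str] = None,
                 crash: bool = False) -> None:
        self.name = name
        self._ok = ok
        self._reason = reason
        self._crash = crash
        self.skip_reason: Optional[str] = None

    async def initialize(self) -> bool:
        if self._crash:
            raise RuntimeError("boom")
        self.skip_reason = self._reason
        return self._ok


async def load_all(plugins: List[DemoPlugin], logger: logging.Logger) -> List[str]:
    loaded = []
    for plugin in plugins:
        try:
            initialized = await plugin.initialize()
        except Exception:
            # Genuine failure: ERROR level, traceback attached to the record.
            logger.exception(f"Plugin {plugin.name} raised during initialization")
            continue
        if not initialized:
            if plugin.skip_reason:
                logger.info(f"Plugin {plugin.name} skipped: {plugin.skip_reason}")
            else:
                logger.warning(f"Plugin {plugin.name} failed initialization, skipping")
            continue
        loaded.append(plugin.name)
    return loaded


handler = ListHandler()
demo_logger = logging.getLogger("demo.loader")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)

loaded = asyncio.run(load_all([
    DemoPlugin("good", ok=True),
    DemoPlugin("unconfigured", ok=False, reason="no commands configured"),
    DemoPlugin("broken", ok=False),
    DemoPlugin("crashing", ok=False, crash=True),
], demo_logger))
levels = [level for level, _ in handler.records]
```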

---

## Decisions

| Topic | Decision |
|---|---|
| Daemon log destination | syslog only (LOG_DAEMON facility) |
| Syslog fallback | localhost:514 UDP if `/dev/log` absent |
| Nagios result log level | INFO for all statuses (OK/WARNING/CRITICAL/UNKNOWN) |
| Invalid command handling at init | Warn and continue; still attempt at collection time |
| PluginLoader API change | `skip_reason` attribute on Plugin base class, checked by loader |
-21
@@ -1,21 +0,0 @@
Plan the following changes, ask questions to clarify before implementing

Re-factor the notification system:
- use available libraries for pushover, matrix, email and sms notifications.
- notifications have a title/subject: alert_type (recover/warning/critical), a body (info from threshold check) and a link to the host plugin metrics page
- define a list of notification channels for each user
- notifications are dispatched to users that are listed as managers for the host



1 - correct
2 - for now channels are defined globally
3 - matrix-nio sounds good, homeserver URL, access token, room ID per channel?
4 - use the REST api provided by https://voip.ms/api/v1/rest.php
5 - The page does not exist yet, point at the host tab in the /plugins
6 - per-channel minimum severity is a good idea, go for it
7 - yes

1 - use base_url, there might not have been any incoming requests yet
2 - use same asyncio loop for matrix-nio
3 - for now, just silently do nothing
+1
-1
@@ -14,4 +14,4 @@ Install options:
|
|||||||
"""
|
"""
|
||||||
|
|
||||||
__all__ = ["__version__"]
|
__all__ = ["__version__"]
|
||||||
__version__ = "5.1.1"
|
__version__ = "5.1.14"
|
||||||
|
|||||||
+74
-41
@@ -14,7 +14,7 @@ import signal
|
|||||||
import socket
|
import socket
|
||||||
import sys
|
import sys
|
||||||
import time
|
import time
|
||||||
from hashlib import md5
|
from logging.handlers import SysLogHandler
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import Dict, List, Optional
|
from typing import Dict, List, Optional
|
||||||
|
|
||||||
@@ -55,6 +55,7 @@ class AsyncConnection:
|
|||||||
|
|
||||||
self.transport: Optional[asyncio.DatagramTransport] = None
|
self.transport: Optional[asyncio.DatagramTransport] = None
|
||||||
self.protocol: Optional[asyncio.DatagramProtocol] = None
|
self.protocol: Optional[asyncio.DatagramProtocol] = None
|
||||||
|
self._dead = False
|
||||||
|
|
||||||
self.logger = logging.getLogger(f"hbc.conn.{addr}")
|
self.logger = logging.getLogger(f"hbc.conn.{addr}")
|
||||||
|
|
||||||
@@ -92,6 +93,9 @@ class AsyncConnection:
|
|||||||
msg: Message dictionary
|
msg: Message dictionary
|
||||||
msg_id: Message ID (HTB, PLG, etc.)
|
msg_id: Message ID (HTB, PLG, etc.)
|
||||||
"""
|
"""
|
||||||
|
if self._dead:
|
||||||
|
return
|
||||||
|
|
||||||
if not self.transport:
|
if not self.transport:
|
||||||
await self.open()
|
await self.open()
|
||||||
|
|
||||||
@@ -166,7 +170,9 @@ class HeartbeatProtocol(asyncio.DatagramProtocol):
|
|||||||
|
|
||||||
def error_received(self, exc):
|
def error_received(self, exc):
|
||||||
"""Handle protocol errors."""
|
"""Handle protocol errors."""
|
||||||
self.logger.error(f"Protocol error: {exc}")
|
self.logger.warning(f"Protocol error on {self.connection.addr}: {exc} — dropping connection")
|
||||||
|
self.connection._dead = True
|
||||||
|
self.connection.close()
|
||||||
|
|
||||||
|
|
||||||
async def handle_command(conn: AsyncConnection, msg: dict):
|
async def handle_command(conn: AsyncConnection, msg: dict):
|
||||||
@@ -203,48 +209,45 @@ async def handle_command(conn: AsyncConnection, msg: dict):
|
|||||||
await conn.sendto(response)
|
await conn.sendto(response)
|
||||||
|
|
||||||
|
|
||||||
async def handle_update(conn: AsyncConnection, msg: dict):
|
async def handle_update(conn: AsyncConnection, _msg: dict): # pyright: ignore[reportUnusedParameter]
|
||||||
"""Handle self-update from server."""
|
"""Handle self-update by running hb_install.sh."""
|
||||||
import codecs
|
|
||||||
import shutil
|
import shutil
|
||||||
|
|
||||||
logger = logging.getLogger("hbc.update")
|
logger = logging.getLogger("hbc.update")
|
||||||
|
|
||||||
try:
|
installer = shutil.which("hb_install.sh")
|
||||||
code = codecs.decode(msg["code"], "base64").decode()
|
if installer is None:
|
||||||
csum = msg["csum"]
|
candidate = Path(sys.argv[0]).parent / "hb_install.sh"
|
||||||
except Exception as e:
|
if candidate.exists():
|
||||||
error = f"Missing code/csum: {e}"
|
installer = str(candidate)
|
||||||
|
|
||||||
|
if installer is None:
|
||||||
|
error = "hb_install.sh not found in PATH or alongside hbc"
|
||||||
logger.error(error)
|
logger.error(error)
|
||||||
await conn.sendto({"service": "update", "msg": error})
|
await conn.sendto({"service": "update", "msg": error})
|
||||||
return
|
return
|
||||||
|
|
||||||
# Verify checksum
|
logger.info(f"Running installer: {installer}")
|
||||||
m = md5()
|
try:
|
||||||
m.update(code.encode())
|
proc = await asyncio.create_subprocess_exec(
|
||||||
if m.hexdigest() != csum:
|
installer, "client",
|
||||||
error = "Checksum mismatch"
|
stdout=asyncio.subprocess.PIPE,
|
||||||
|
stderr=asyncio.subprocess.STDOUT,
|
||||||
|
)
|
||||||
|
out, _ = await asyncio.wait_for(proc.communicate(), timeout=120)
|
||||||
|
except asyncio.TimeoutError:
|
||||||
|
error = "Installer timed out"
|
||||||
|
logger.error(error)
|
||||||
|
await conn.sendto({"service": "update", "msg": error})
|
||||||
|
return
|
||||||
|
except Exception as e:
|
||||||
|
error = f"Installer failed: {e}"
|
||||||
logger.error(error)
|
logger.error(error)
|
||||||
await conn.sendto({"service": "update", "msg": error})
|
await conn.sendto({"service": "update", "msg": error})
|
||||||
return
|
return
|
||||||
|
|
||||||
# Backup current file
|
if proc.returncode != 0:
|
||||||
fn = sys.argv[0]
|
error = f"Installer exited {proc.returncode}: {out.decode().strip()}"
|
||||||
ofn = f"{fn}.sav"
|
|
||||||
try:
|
|
||||||
shutil.copy2(fn, ofn)
|
|
||||||
except Exception as e:
|
|
||||||
error = f"Backup failed: {e}"
|
|
||||||
logger.error(error)
|
|
||||||
await conn.sendto({"service": "update", "msg": error})
|
|
||||||
return
|
|
||||||
|
|
||||||
# Write new code
|
|
||||||
try:
|
|
||||||
with open(fn, "w") as fh:
|
|
||||||
fh.write(code)
|
|
||||||
except Exception as e:
|
|
||||||
error = f"Write failed: {e}"
|
|
||||||
logger.error(error)
|
logger.error(error)
|
||||||
await conn.sendto({"service": "update", "msg": error})
|
await conn.sendto({"service": "update", "msg": error})
|
||||||
return
|
return
|
||||||
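The installer lookup in the new `handle_update` (search `PATH` first, then the directory that holds the running script) can be sketched on its own. This is a minimal illustration, not the project's code: `find_installer` and its `script_path` parameter are hypothetical names introduced here so the fallback directory can be supplied explicitly.

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def find_installer(name: str = "hb_install.sh",
                   script_path: Optional[str] = None) -> Optional[str]:
    """Return a path to `name`, or None when it cannot be located.

    Hypothetical helper: mirrors the two-step lookup in the diff above.
    """
    # Step 1: look for the installer on PATH.
    found = shutil.which(name)
    if found is not None:
        return found

    # Step 2: fall back to the directory containing the running script.
    base = Path(script_path if script_path is not None else sys.argv[0]).parent
    candidate = base / name
    if candidate.exists():
        return str(candidate)
    return None
```

Note that the fallback only checks that the file exists, as the diff does; whether it is executable is discovered when the subprocess is started.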
@@ -522,6 +525,13 @@ async def async_main(args, config):
|
|||||||
for sig in (signal.SIGTERM, signal.SIGINT):
|
for sig in (signal.SIGTERM, signal.SIGINT):
|
||||||
loop.add_signal_handler(sig, stop)
|
loop.add_signal_handler(sig, stop)
|
||||||
|
|
||||||
|
def _sighup():
|
||||||
|
global dorestart
|
||||||
|
dorestart = True
|
||||||
|
stop()
|
||||||
|
|
||||||
|
loop.add_signal_handler(signal.SIGHUP, _sighup)
|
||||||
|
|
||||||
# Start async tasks
|
# Start async tasks
|
||||||
# Heartbeat senders (one per connection)
|
# Heartbeat senders (one per connection)
|
||||||
for conn in connections:
|
for conn in connections:
|
||||||
@@ -586,6 +596,36 @@ def daemonize(
|
|||||||
os.dup2(se.fileno(), sys.stderr.fileno())
|
os.dup2(se.fileno(), sys.stderr.fileno())
|
||||||
|
|
||||||
|
|
||||||
|
def _reconfigure_logging_for_daemon(log_level: int) -> None:
|
||||||
|
"""Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
|
||||||
|
root = logging.getLogger()
|
||||||
|
for handler in root.handlers[:]:
|
||||||
|
root.removeHandler(handler)
|
||||||
|
handler.close()
|
||||||
|
|
||||||
|
use_udp_fallback = not os.path.exists("/dev/log")
|
||||||
|
|
||||||
|
if use_udp_fallback:
|
||||||
|
syslog_handler = SysLogHandler(
|
||||||
|
address=("localhost", 514),
|
||||||
|
facility=SysLogHandler.LOG_DAEMON,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
syslog_handler = SysLogHandler(
|
||||||
|
address="/dev/log",
|
||||||
|
facility=SysLogHandler.LOG_DAEMON,
|
||||||
|
)
|
||||||
|
|
||||||
|
syslog_handler.setFormatter(
|
||||||
|
logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
|
||||||
|
)
|
||||||
|
root.addHandler(syslog_handler)
|
||||||
|
root.setLevel(log_level)
|
||||||
|
|
||||||
|
if use_udp_fallback:
|
||||||
|
logging.warning("/dev/log not found, using syslog UDP localhost:514")
|
||||||
|
|
||||||
|
|
||||||
def build_parser():
|
def build_parser():
|
||||||
"""Build argument parser."""
|
"""Build argument parser."""
|
||||||
parser = argparse.ArgumentParser(
|
parser = argparse.ArgumentParser(
|
||||||
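The address selection in `_reconfigure_logging_for_daemon` (use the local `/dev/log` socket when it exists, otherwise UDP to `localhost:514`) can be separated from the handler construction so it is easy to check. The sketch below assumes that split; `pick_syslog_address`, `build_syslog_handler`, and `make_formatter` are hypothetical names, while `SysLogHandler` and `LOG_DAEMON` come from the standard library.

```python
import logging
import os
from logging.handlers import SysLogHandler
from typing import Tuple, Union

SyslogAddress = Union[str, Tuple[str, int]]


def pick_syslog_address(socket_path: str = "/dev/log") -> SyslogAddress:
    """Prefer the local syslog socket; fall back to UDP on localhost:514."""
    if os.path.exists(socket_path):
        return socket_path
    return ("localhost", 514)


def make_formatter() -> logging.Formatter:
    """Same format string as in the diff above."""
    return logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")


def build_syslog_handler(address: SyslogAddress) -> SysLogHandler:
    """Create a daemon-facility syslog handler for the chosen address."""
    handler = SysLogHandler(address=address, facility=SysLogHandler.LOG_DAEMON)
    handler.setFormatter(make_formatter())
    return handler
```

A caller would attach the result with `logging.getLogger().addHandler(build_syslog_handler(pick_syslog_address()))` after removing the stream handlers, as the diff does.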
@@ -663,16 +703,9 @@ def main(argv=None):
|
|||||||
# Daemonize if requested
|
# Daemonize if requested
|
||||||
if args.daemon:
|
if args.daemon:
|
||||||
print("Daemonizing...")
|
print("Daemonizing...")
|
||||||
import syslog
|
|
||||||
syslog.openlog("hbc", syslog.LOG_PID, syslog.LOG_DAEMON)
|
|
||||||
syslog.syslog(syslog.LOG_INFO, f"Starting heartbeat to {', '.join(args.hosts)}")
|
|
||||||
daemonize()
|
daemonize()
|
||||||
|
_reconfigure_logging_for_daemon(log_level)
|
||||||
# Reconfigure logging for syslog
|
logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
|
||||||
logging.basicConfig(
|
|
||||||
level=log_level,
|
|
||||||
format="hbc[%(process)d]: %(name)s %(levelname)s: %(message)s"
|
|
||||||
)
|
|
||||||
|
|
||||||
# Run async main
|
# Run async main
|
||||||
try:
|
try:
|
||||||
|
|||||||
+14
-5
@@ -29,6 +29,7 @@ class Plugin(ABC):
|
|||||||
description: Human-readable description
|
description: Human-readable description
|
||||||
interval: Collection interval in seconds (0 for InfoPlugin = collect once)
|
interval: Collection interval in seconds (0 for InfoPlugin = collect once)
|
||||||
enabled: Whether plugin is active (can be disabled via config)
|
enabled: Whether plugin is active (can be disabled via config)
|
||||||
|
skip_reason: Set by plugin before returning False from initialize(); causes loader to log INFO instead of WARNING.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
name: str = ""
|
name: str = ""
|
||||||
@@ -46,6 +47,7 @@ class Plugin(ABC):
|
|||||||
self.config = config or {}
|
self.config = config or {}
|
||||||
self.logger = logging.getLogger(f"plugin.{self.name}")
|
self.logger = logging.getLogger(f"plugin.{self.name}")
|
||||||
self._initialized = False
|
self._initialized = False
|
||||||
|
self.skip_reason: Optional[str] = None
|
||||||
|
|
||||||
@abstractmethod
|
@abstractmethod
|
||||||
async def initialize(self) -> bool:
|
async def initialize(self) -> bool:
|
||||||
@@ -312,9 +314,10 @@ class PluginLoader:
|
|||||||
|
|
||||||
loaded_count = 0
|
loaded_count = 0
|
||||||
raw_config = config or {}
|
raw_config = config or {}
|
||||||
# Per-plugin config lives under the 'plugins' key; fall back to top-level
|
# Per-plugin config lives under the 'plugins' key or at top-level.
|
||||||
# for backwards compatibility.
|
# CLIENT_DEFAULTS seeds "plugins": {} so the key always exists; check
|
||||||
plugin_config = raw_config.get("plugins", raw_config)
|
# both the subdict and top-level so that either layout in .hbc.yaml works.
|
||||||
|
plugins_subconfig = raw_config.get("plugins", {})
|
||||||
|
|
||||||
# Scan for Python files
|
# Scan for Python files
|
||||||
for plugin_file in directory.glob("*.py"):
|
for plugin_file in directory.glob("*.py"):
|
||||||
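The two-layout lookup this change introduces (per-plugin settings under `plugins:` first, then a top-level key named after the plugin) can be sketched as a small function. `resolve_plugin_config` is a hypothetical name, and the extra `or {}` guard against `plugins: null` is an assumption added here, not something the diff contains.

```python
from typing import Any, Dict


def resolve_plugin_config(raw_config: Dict[str, Any], plugin_name: str) -> Dict[str, Any]:
    """Return the config for one plugin from either supported layout."""
    # Assumption: tolerate an explicit `plugins: null` in the YAML.
    plugins_subconfig = raw_config.get("plugins", {}) or {}
    # An empty or missing entry in the subdict falls through to the top level.
    return plugins_subconfig.get(plugin_name) or raw_config.get(plugin_name, {})


nested = {"plugins": {"nagios_runner": {"timeout": 5}}}
flat = {"plugins": {}, "nagios_runner": {"timeout": 9}}
print(resolve_plugin_config(nested, "nagios_runner"))
print(resolve_plugin_config(flat, "nagios_runner"))
```

One consequence of using `or` is that an empty dict under `plugins:` does not shadow a populated top-level entry of the same name.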
@@ -359,14 +362,20 @@ class PluginLoader:
|
|||||||
|
|
||||||
self.logger.debug(f"Found plugin class: {name}")
|
self.logger.debug(f"Found plugin class: {name}")
|
||||||
|
|
||||||
# Instantiate plugin with config
|
# Instantiate plugin with config — check plugins subdict first,
|
||||||
plugin_instance_config = plugin_config.get(obj.name, {})
|
# then top-level keys (e.g. nagios_runner: ... at root of config).
|
||||||
|
plugin_instance_config = plugins_subconfig.get(obj.name) or raw_config.get(obj.name, {})
|
||||||
plugin = obj(config=plugin_instance_config)
|
plugin = obj(config=plugin_instance_config)
|
||||||
|
|
||||||
# Initialize plugin
|
# Initialize plugin
|
||||||
try:
|
try:
|
||||||
initialized = await plugin.initialize()
|
initialized = await plugin.initialize()
|
||||||
if not initialized:
|
if not initialized:
|
||||||
|
if plugin.skip_reason:
|
||||||
|
self.logger.info(
|
||||||
|
f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
|
||||||
|
)
|
||||||
|
else:
|
||||||
self.logger.warning(
|
self.logger.warning(
|
||||||
f"Plugin {plugin.name} failed initialization, skipping"
|
f"Plugin {plugin.name} failed initialization, skipping"
|
||||||
)
|
)
|
||||||
|
|||||||
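The `skip_reason` convention added above lets a plugin distinguish "not applicable on this host" from "failed to start". A minimal sketch of both sides of that contract follows; `DemoPlugin`, `skip_log_level`, and `load_plugin` are hypothetical stand-ins for the real `Plugin` and `PluginLoader` classes.

```python
import asyncio
import logging
from typing import Optional


class DemoPlugin:
    """Hypothetical plugin that may decline to initialize."""

    name = "demo"

    def __init__(self, available: bool, explain: bool):
        self._available = available
        self._explain = explain
        self.skip_reason: Optional[str] = None

    async def initialize(self) -> bool:
        if not self._available:
            if self._explain:
                # Set the reason *before* returning False, as the docstring requires.
                self.skip_reason = "required tool not found"
            return False
        return True


def skip_log_level(plugin: DemoPlugin) -> int:
    """INFO for an explained skip, WARNING for an unexplained failure."""
    return logging.INFO if plugin.skip_reason else logging.WARNING


async def load_plugin(plugin: DemoPlugin, logger: logging.Logger) -> bool:
    """Initialize one plugin and log at the level the loader in the diff uses."""
    if await plugin.initialize():
        return True
    if plugin.skip_reason:
        logger.log(skip_log_level(plugin), "Plugin %s skipped: %s", plugin.name, plugin.skip_reason)
    else:
        logger.log(skip_log_level(plugin), "Plugin %s failed initialization, skipping", plugin.name)
    return False
```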
@@ -21,8 +21,10 @@ nagios_runner:
|
|||||||
```
|
```
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import os
|
||||||
import re
|
import re
|
||||||
import subprocess
|
import shlex
|
||||||
from typing import Any, Dict, List, Optional, Tuple
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
|
||||||
from hbd.client.plugin import MonitorPlugin
|
from hbd.client.plugin import MonitorPlugin
|
||||||
@@ -52,7 +54,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
|
|||||||
interval: Collection interval in seconds (default: 300)
|
interval: Collection interval in seconds (default: 300)
|
||||||
commands: List of command definitions with 'name' and 'command' keys
|
commands: List of command definitions with 'name' and 'command' keys
|
||||||
timeout: Command execution timeout in seconds (default: 30)
|
timeout: Command execution timeout in seconds (default: 30)
|
||||||
shell: Whether to execute commands via shell (default: True)
|
|
||||||
|
|
||||||
Example:
|
Example:
|
||||||
nagios_runner:
|
nagios_runner:
|
||||||
@@ -76,15 +77,8 @@ class NagiosRunnerPlugin(MonitorPlugin):
|
|||||||
# Extract configuration
|
# Extract configuration
|
||||||
self.commands: List[Dict[str, str]] = config.get("commands", []) if config else []
|
self.commands: List[Dict[str, str]] = config.get("commands", []) if config else []
|
||||||
self.timeout: int = config.get("timeout", 30) if config else 30
|
self.timeout: int = config.get("timeout", 30) if config else 30
|
||||||
self.shell: bool = config.get("shell", True) if config else True
|
|
||||||
self.interval = config.get("interval", 300) if config else 300
|
self.interval = config.get("interval", 300) if config else 300
|
||||||
|
|
||||||
# Validate commands
|
|
||||||
if not self.commands:
|
|
||||||
self.logger.info(
|
|
||||||
"No Nagios commands configured. Add 'nagios_runner.commands' to config."
|
|
||||||
)
|
|
||||||
|
|
||||||
async def initialize(self) -> bool:
|
async def initialize(self) -> bool:
|
||||||
"""Initialize the Nagios runner plugin.
|
"""Initialize the Nagios runner plugin.
|
||||||
|
|
||||||
@@ -94,7 +88,7 @@ class NagiosRunnerPlugin(MonitorPlugin):
|
|||||||
self.logger.info(f"Initializing {self.name} plugin")
|
self.logger.info(f"Initializing {self.name} plugin")
|
||||||
|
|
||||||
if not self.commands:
|
if not self.commands:
|
||||||
self.logger.info("No Nagios commands configured")
|
self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
|
||||||
return False
|
return False
|
||||||
|
|
||||||
self.logger.info(f"Configured to run {len(self.commands)} Nagios plugin(s)")
|
self.logger.info(f"Configured to run {len(self.commands)} Nagios plugin(s)")
|
||||||
@@ -102,6 +96,29 @@ class NagiosRunnerPlugin(MonitorPlugin):
|
|||||||
name = cmd_config.get("name", "unnamed")
|
name = cmd_config.get("name", "unnamed")
|
||||||
self.logger.info(f" - {name}: {cmd_config.get('command', 'N/A')}")
|
self.logger.info(f" - {name}: {cmd_config.get('command', 'N/A')}")
|
||||||
|
|
||||||
|
# Validate absolute command paths early
|
||||||
|
for cmd_config in self.commands:
|
||||||
|
name = cmd_config.get("name", "unnamed")
|
||||||
|
command = cmd_config.get("command", "")
|
||||||
|
if not command:
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
tokens = shlex.split(command)
|
||||||
|
except ValueError:
|
||||||
|
continue # malformed command string; skip validation
|
||||||
|
if not tokens:
|
||||||
|
continue
|
||||||
|
exe = tokens[0]
|
||||||
|
if os.path.isabs(exe):
|
||||||
|
if not os.path.isfile(exe):
|
||||||
|
self.logger.warning(
|
||||||
|
f"Command '{name}': executable not found: {exe}"
|
||||||
|
)
|
||||||
|
elif not os.access(exe, os.X_OK):
|
||||||
|
self.logger.warning(
|
||||||
|
f"Command '{name}': executable not executable: {exe}"
|
||||||
|
)
|
||||||
|
|
||||||
return True
|
return True
|
||||||
|
|
||||||
async def _collect_metrics(self) -> Dict[str, Any]:
|
async def _collect_metrics(self) -> Dict[str, Any]:
|
||||||
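The early validation added to `initialize()` only inspects commands whose first token is an absolute path; bare names are left to the shell's `PATH` lookup at run time. A minimal sketch of that check as a pure function, assuming a POSIX system; `check_command` is a hypothetical name, and it returns the warning text instead of logging it.

```python
import os
import shlex
from typing import Optional


def check_command(command: str) -> Optional[str]:
    """Return a warning for an unusable absolute executable, else None."""
    if not command:
        return None
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Malformed quoting: skip validation, as the diff does.
        return None
    if not tokens:
        return None

    exe = tokens[0]
    if not os.path.isabs(exe):
        return None  # relative names are resolved later via PATH
    if not os.path.isfile(exe):
        return f"executable not found: {exe}"
    if not os.access(exe, os.X_OK):
        return f"executable not executable: {exe}"
    return None
```

As in the diff, a failed check only produces a warning; the plugin still initializes and the command is attempted at collection time.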
@@ -141,7 +158,7 @@ class NagiosRunnerPlugin(MonitorPlugin):
|
|||||||
for metric_name, metric_value in perfdata.items():
|
for metric_name, metric_value in perfdata.items():
|
||||||
results[f"{name}_{metric_name}"] = metric_value
|
results[f"{name}_{metric_name}"] = metric_value
|
||||||
|
|
||||||
self.logger.debug(
|
self.logger.info(
|
||||||
f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
|
f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -163,46 +180,49 @@ class NagiosRunnerPlugin(MonitorPlugin):
|
|||||||
self,
|
self,
|
||||||
command: str
|
command: str
|
||||||
) -> Tuple[int, str, Dict[str, Any]]:
|
) -> Tuple[int, str, Dict[str, Any]]:
|
||||||
"""Execute a Nagios plugin and parse its output.
|
"""Execute a Nagios plugin and parse its output."""
|
||||||
|
|
||||||
Args:
|
|
||||||
command: Command string to execute
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Tuple of (status_code, output_message, performance_data_dict)
|
|
||||||
"""
|
|
||||||
try:
|
try:
|
||||||
# Run command
|
proc = await asyncio.create_subprocess_shell(
|
||||||
result = subprocess.run(
|
|
||||||
command,
|
command,
|
||||||
shell=self.shell,
|
stdout=asyncio.subprocess.PIPE,
|
||||||
capture_output=True,
|
stderr=asyncio.subprocess.PIPE,
|
||||||
timeout=self.timeout,
|
|
||||||
text=True
|
|
||||||
)
|
)
|
||||||
|
try:
|
||||||
|
stdout_bytes, stderr_bytes = await asyncio.wait_for(
|
||||||
|
proc.communicate(), timeout=self.timeout
|
||||||
|
)
|
||||||
|
except asyncio.TimeoutError:
|
||||||
|
proc.kill()
|
||||||
|
await proc.communicate()
|
||||||
|
self.logger.error(f"Command timed out: {command}")
|
||||||
|
return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
|
||||||
|
|
||||||
status_code = result.returncode
|
status_code = proc.returncode
|
||||||
output = result.stdout.strip()
|
|
||||||
|
if status_code < 0:
|
||||||
|
return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}
|
||||||
|
|
||||||
# Nagios plugins can return codes > 3, treat as UNKNOWN
|
|
||||||
if status_code > 3:
|
if status_code > 3:
|
||||||
status_code = NAGIOS_UNKNOWN
|
status_code = NAGIOS_UNKNOWN
|
||||||
|
|
||||||
# Parse performance data
|
stdout = stdout_bytes.decode(errors="replace").strip()
|
||||||
perfdata = self._parse_perfdata(output)
|
stderr = stderr_bytes.decode(errors="replace").strip()
|
||||||
|
|
||||||
# Extract just the status message (before the pipe if present)
|
# Parse perfdata from stdout before mixing in stderr
|
||||||
if '|' in output:
|
perfdata = self._parse_perfdata(stdout)
|
||||||
output_msg = output.split('|')[0].strip()
|
|
||||||
|
# Build status message
|
||||||
|
status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout
|
||||||
|
|
||||||
|
if not stdout and stderr:
|
||||||
|
output_msg = stderr
|
||||||
|
elif stdout and stderr:
|
||||||
|
output_msg = f"{status_part} [stderr: {stderr}]"
|
||||||
else:
|
else:
|
||||||
output_msg = output
|
output_msg = status_part
|
||||||
|
|
||||||
return status_code, output_msg, perfdata
|
return status_code, output_msg, perfdata
|
||||||
|
|
||||||
except subprocess.TimeoutExpired:
|
|
||||||
self.logger.error(f"Command timed out: {command}")
|
|
||||||
return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
|
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
self.logger.error(f"Error executing command: {e}")
|
self.logger.error(f"Error executing command: {e}")
|
||||||
return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
|
return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
|
||||||
|
|||||||
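The rewritten `_execute_nagios_plugin` replaces the blocking `subprocess.run` call with an asyncio subprocess, a `wait_for` timeout, and explicit handling of signal exits and stderr. A condensed sketch of that flow is shown below, assuming a POSIX shell; `run_check` is a hypothetical name, perfdata parsing is omitted, and `NAGIOS_UNKNOWN = 3` follows the usual Nagios exit-code convention.

```python
import asyncio
from typing import Tuple

NAGIOS_UNKNOWN = 3  # conventional Nagios "UNKNOWN" exit code


async def run_check(command: str, timeout: float) -> Tuple[int, str]:
    """Run a check through the shell and return (status_code, message)."""
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out_b, err_b = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()               # kills the shell process itself
        await proc.communicate()  # reap it so no zombie is left behind
        return NAGIOS_UNKNOWN, f"Command timed out after {timeout}s"

    status = proc.returncode
    if status < 0:  # negative return code means death by signal
        return NAGIOS_UNKNOWN, f"Process killed by signal {-status}"
    if status > 3:
        status = NAGIOS_UNKNOWN

    stdout = out_b.decode(errors="replace").strip()
    stderr = err_b.decode(errors="replace").strip()
    status_part = stdout.split("|")[0].strip()  # text before any perfdata

    if not stdout and stderr:
        return status, stderr
    if stdout and stderr:
        return status, f"{status_part} [stderr: {stderr}]"
    return status, status_part
```

One caveat with the shell variant: `proc.kill()` signals the shell, so a child the shell has spawned may outlive the timeout unless the command uses `exec` or the process group is handled separately.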
@@ -60,6 +60,7 @@ class OSInfoPlugin(InfoPlugin):
|
|||||||
"python_version": platform.python_version(),
|
"python_version": platform.python_version(),
|
||||||
"python_implementation": platform.python_implementation(),
|
"python_implementation": platform.python_implementation(),
|
||||||
"hbc_version": hbc_version,
|
"hbc_version": hbc_version,
|
||||||
|
"hbc_type": "full",
|
||||||
}
|
}
|
||||||
|
|
||||||
# Add Linux-specific distribution info
|
# Add Linux-specific distribution info
|
||||||
|
|||||||
@@ -0,0 +1,130 @@
"""
ZFS pool monitoring plugin for Heartbeat.

Collects per-pool health, capacity, and cumulative I/O statistics via zpool(8).
"""

import asyncio
import logging
import shutil
from typing import Any, Dict, List, Optional

from hbd.client.plugin import MonitorPlugin

logger = logging.getLogger(__name__)


def _int(s: str) -> Optional[int]:
    try:
        return int(s.strip().rstrip("KMGTkBkmgt%x"))
    except (ValueError, AttributeError):
        return None


def _float(s: str) -> Optional[float]:
    try:
        return float(s.strip().rstrip("%x"))
    except (ValueError, AttributeError):
        return None


class ZFSMonitorPlugin(MonitorPlugin):
    """Monitor ZFS pool health, capacity, and I/O statistics.

    Collects per pool:
      - health: ONLINE, DEGRADED, FAULTED, etc.
      - size / alloc / free: total, allocated and free bytes
      - capacity: percentage used (0-100)
      - frag: fragmentation percentage
      - dedup: deduplication ratio
      - read_ops / write_ops: cumulative I/O operations since last boot/clear
      - read_bw / write_bw: cumulative bytes transferred since last boot/clear

    Configuration:
        interval: collection interval in seconds (default: 300)
        pools: list of pool names to monitor (default: all)
    """

    name = "zfs_monitor"
    description = "ZFS pool health, capacity, and I/O statistics"
    interval = 300

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        super().__init__(config)
        self.interval = self.config.get("interval", 300)
        self._pools_filter: Optional[List[str]] = self.config.get("pools", None)

    async def initialize(self) -> bool:
        if not shutil.which("zpool"):
            self.skip_reason = "zpool not found"
            return False
        logger.info("ZFS monitor initialized (interval: %ds)", self.interval)
        return True

    async def _run(self, *args: str) -> List[str]:
        """Run a command and return its stdout lines, or [] on error."""
        try:
            proc = await asyncio.create_subprocess_exec(
                *args,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.DEVNULL,
            )
            stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=15)
            return stdout.decode(errors="replace").splitlines()
        except (FileNotFoundError, asyncio.TimeoutError) as exc:
            logger.warning("zfs_monitor: %s: %s", args[0], exc)
            return []

    async def _zpool_list(self) -> Dict[str, Dict]:
        """Return per-pool health and capacity from `zpool list`."""
        lines = await self._run(
            "zpool", "list", "-H", "-p",
            "-o", "name,health,size,alloc,free,cap,frag,dedup",
        )
        pools: Dict[str, Dict] = {}
        for line in lines:
            parts = line.split("\t")
            if len(parts) < 8:
                continue
            name = parts[0].strip()
            if self._pools_filter and name not in self._pools_filter:
                continue
            pools[name] = {
                "health": parts[1].strip(),
                "size": _int(parts[2]),
                "alloc": _int(parts[3]),
                "free": _int(parts[4]),
                "capacity": _float(parts[5]),
                "frag": _float(parts[6]),
                "dedup": _float(parts[7]),
            }
        return pools

    async def _zpool_iostat(self) -> Dict[str, Dict]:
        """Return per-pool cumulative I/O counters from `zpool iostat`."""
        lines = await self._run("zpool", "iostat", "-H", "-p")
        io: Dict[str, Dict] = {}
        for line in lines:
            parts = line.split("\t")
            if len(parts) < 7:
                continue
            name = parts[0].strip()
            if not name or name.startswith(" "):
                continue
            io[name] = {
                "read_ops": _int(parts[3]),
                "write_ops": _int(parts[4]),
                "read_bw": _int(parts[5]),
                "write_bw": _int(parts[6]),
            }
        return io

    async def _collect_metrics(self) -> Dict[str, Any]:
        pools, io = await asyncio.gather(self._zpool_list(), self._zpool_iostat())
        for name, stats in io.items():
            if name in pools:
                pools[name].update(stats)
        return {"pools": pools}


plugin = ZFSMonitorPlugin
||||||
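The parsing step of the ZFS plugin can be exercised without a ZFS host by feeding it tab-separated lines of the shape `zpool list -H -p -o name,health,size,alloc,free,cap,frag,dedup` is expected to print. The sketch below is a standalone rework of that step: `parse_zpool_list` and `_num` are hypothetical names, and the sample lines in the usage note are made up for illustration, not captured from a real pool.

```python
from typing import Any, Dict, Iterable, List, Optional


def _num(text: str, as_float: bool = False) -> Optional[float]:
    """Convert a field to int or float; return None for '-' or other junk."""
    try:
        cleaned = text.strip().rstrip("%x")
        return float(cleaned) if as_float else int(cleaned)
    except ValueError:
        return None


def parse_zpool_list(lines: Iterable[str],
                     pools_filter: Optional[List[str]] = None) -> Dict[str, Dict[str, Any]]:
    """Parse tab-separated pool rows into a dict keyed by pool name."""
    pools: Dict[str, Dict[str, Any]] = {}
    for line in lines:
        parts = line.split("\t")
        if len(parts) < 8:
            continue  # header, blank, or truncated row
        name = parts[0].strip()
        if pools_filter and name not in pools_filter:
            continue
        pools[name] = {
            "health": parts[1].strip(),
            "size": _num(parts[2]),
            "alloc": _num(parts[3]),
            "free": _num(parts[4]),
            "capacity": _num(parts[5], as_float=True),
            "frag": _num(parts[6], as_float=True),
            "dedup": _num(parts[7], as_float=True),
        }
    return pools
```

For example, `parse_zpool_list(["tank\tONLINE\t1000\t400\t600\t40\t5\t1.00"])` yields one entry for `tank`, and a field printed as `-` is stored as `None`.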
+8
-3
@@ -52,11 +52,16 @@ def decode_value(val: str) -> Any:
|
|||||||
except Exception:
|
except Exception:
|
||||||
return val[1:] # Return as string without @
|
return val[1:] # Return as string without @
|
||||||
|
|
||||||
# Try numeric evaluation (original behavior)
|
# Try numeric conversion (avoid eval to prevent SyntaxWarnings on version strings)
|
||||||
if val[0].isdigit() or (val[0] == '-' and len(val) > 1 and val[1].isdigit()):
|
if val[0].isdigit() or (val[0] == '-' and len(val) > 1 and val[1].isdigit()):
|
||||||
try:
|
try:
|
||||||
return eval(val)
|
return int(val)
|
||||||
except Exception:
|
except ValueError:
|
||||||
|
pass
|
||||||
|
try:
|
||||||
|
return float(val)
|
||||||
|
except ValueError:
|
||||||
|
pass
|
||||||
return val
|
return val
|
||||||
|
|
||||||
return val
|
return val
|
||||||
|
|||||||
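Replacing `eval` with `int` followed by `float` narrows what `decode_value` accepts to plain numeric literals. The sketch below isolates just that branch to show the effect; `parse_numeric` is a hypothetical name, and the `@`-prefixed handling from the surrounding function is left out.

```python
from typing import Union


def parse_numeric(val: str) -> Union[int, float, str]:
    """Return an int or float when `val` is a plain number, else `val` unchanged."""
    if not val:
        return val
    looks_numeric = val[0].isdigit() or (
        val[0] == "-" and len(val) > 1 and val[1].isdigit()
    )
    if looks_numeric:
        try:
            return int(val)
        except ValueError:
            pass
        try:
            return float(val)
        except ValueError:
            pass
    return val


for sample in ("42", "-7", "3.5", "1.2.3", "0x10"):
    print(repr(sample), "->", repr(parse_numeric(sample)))
```

Version strings such as `1.2.3` now fall through to the string branch without invoking the parser, and expressions that `eval` would have computed (for instance hexadecimal literals) are kept as text.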
+5
-6
@@ -144,17 +144,16 @@ def cmd_notify(args):
|
|||||||
url=f"{base_url}/plugins" if base_url else "",
|
url=f"{base_url}/plugins" if base_url else "",
|
||||||
)
|
)
|
||||||
|
|
||||||
# Bypass min_level for explicit test sends; run async channels directly
|
|
||||||
import asyncio
|
import asyncio
|
||||||
|
from .notify import _send_matrix_async, _send_sms_voipms_async, _DRIVERS
|
||||||
ch_type = channel_cfg.get("type", "")
|
ch_type = channel_cfg.get("type", "")
|
||||||
print(f"Sending via {args.channel} ({ch_type}): {title} — {args.message}")
|
print(f"Sending via {args.channel} ({ch_type}): {title} — {args.message}")
|
||||||
|
|
||||||
if ch_type in ("matrix", "sms_voipms"):
|
if ch_type == "matrix":
|
||||||
from .notify import _send_matrix_async, _send_sms_voipms_async
|
ok = asyncio.run(_send_matrix_async(channel_cfg, notif))
|
||||||
driver_async = _send_matrix_async if ch_type == "matrix" else _send_sms_voipms_async
|
elif ch_type == "sms_voipms":
|
||||||
ok = asyncio.run(driver_async(channel_cfg, notif))
|
ok = asyncio.run(_send_sms_voipms_async(channel_cfg, notif))
|
||||||
else:
|
else:
|
||||||
from .notify import _DRIVERS
|
|
||||||
driver = _DRIVERS.get(ch_type)
|
driver = _DRIVERS.get(ch_type)
|
||||||
if driver is None:
|
if driver is None:
|
||||||
print(f"Error: unknown channel type '{ch_type}'", file=sys.stderr)
|
print(f"Error: unknown channel type '{ch_type}'", file=sys.stderr)
|
||||||
|
|||||||
@@ -225,7 +225,7 @@ def get_watchhosts(config):
|
|||||||
hosts_config = config.get("hosts", {})
|
hosts_config = config.get("hosts", {})
|
||||||
if isinstance(hosts_config, dict):
|
if isinstance(hosts_config, dict):
|
||||||
for host_name, host_attrs in hosts_config.items():
|
for host_name, host_attrs in hosts_config.items():
|
||||||
if isinstance(host_attrs, dict) and host_attrs.get("watch", False):
|
if isinstance(host_attrs, dict) and host_attrs.get("watch", True):
|
||||||
watchhosts.append(host_name)
|
watchhosts.append(host_name)
|
||||||
return watchhosts
|
return watchhosts
|
||||||
|
|
||||||
|
|||||||
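Flipping the default of `watch` from `False` to `True` means a host entry is watched unless it opts out. A reduced sketch of the filtering loop shows the effect; `watched_hosts` is a hypothetical name, and the real `get_watchhosts` does more before this point.

```python
from typing import Any, Dict, List


def watched_hosts(config: Dict[str, Any]) -> List[str]:
    """Return host names that are watched under the new default."""
    result: List[str] = []
    hosts_config = config.get("hosts", {})
    if isinstance(hosts_config, dict):
        for host_name, host_attrs in hosts_config.items():
            # Missing "watch" key now counts as watched.
            if isinstance(host_attrs, dict) and host_attrs.get("watch", True):
                result.append(host_name)
    return result


example = {"hosts": {"a": {}, "b": {"watch": False}, "c": {"watch": True}, "d": None}}
print(watched_hosts(example))
```

Entries whose attributes are not a mapping (such as `d` above) are still skipped, because the `isinstance` check runs before the default is applied.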
@@ -286,7 +286,7 @@ class Host:
|
|||||||
Host.hosts[name] = self
|
Host.hosts[name] = self
|
||||||
self.num = num
|
self.num = num
|
||||||
self.dyn = False
|
self.dyn = False
|
||||||
self.watched = False
|
self.watched = True
|
||||||
self.upcount = 0
|
self.upcount = 0
|
||||||
self.interval = 0
|
self.interval = 0
|
||||||
self.doesack = -1
|
self.doesack = -1
|
||||||
@@ -304,6 +304,7 @@ class Host:
|
|||||||
|
|
||||||
def statedict(self):
|
def statedict(self):
|
||||||
d = {}
|
d = {}
|
||||||
|
d["raw_name"] = self.name
|
||||||
d["name"] = self.name
|
d["name"] = self.name
|
||||||
if self.dyn:
|
if self.dyn:
|
||||||
d["name"] += "*"
|
d["name"] += "*"
|
||||||
|
|||||||
+58
-13
@@ -1,7 +1,11 @@
|
|||||||
"""HTTP server implementation using aiohttp and jinja2."""
|
"""HTTP server implementation using aiohttp and jinja2."""
|
||||||
|
|
||||||
import asyncio
|
import asyncio
|
||||||
|
import datetime
|
||||||
import json
|
import json
|
||||||
|
import platform
|
||||||
|
import socket
|
||||||
|
import sys
|
||||||
import time
|
import time
|
||||||
import urllib.parse
|
import urllib.parse
|
||||||
import os
|
import os
|
||||||
@@ -111,6 +115,7 @@ async def start(
|
|||||||
This function is intended to be awaited inside the main asyncio event loop.
|
This function is intended to be awaited inside the main asyncio event loop.
|
||||||
"""
|
"""
|
||||||
get_now = get_now or (lambda: time.time())
|
get_now = get_now or (lambda: time.time())
|
||||||
|
_start_epoch = time.time()
|
||||||
|
|
||||||
async def old_index(request):
|
async def old_index(request):
|
||||||
_require_auth_redirect(request)
|
_require_auth_redirect(request)
|
||||||
@@ -210,15 +215,11 @@ async def start(
|
|||||||
return err
|
return err
|
||||||
qa = request.rel_url.query
|
qa = request.rel_url.query
|
||||||
uname = urllib.parse.unquote(qa.get("h", ""))
|
uname = urllib.parse.unquote(qa.get("h", ""))
|
||||||
ucode = qa.get("c")
|
if not uname:
|
||||||
if not ucode or not uname:
|
return web.Response(status=400, text="need h= argument")
|
||||||
return web.Response(status=400, text="need h= and c= arguments")
|
|
||||||
if uname != "All" and uname not in hbdclass.Host.hosts:
|
if uname != "All" and uname not in hbdclass.Host.hosts:
|
||||||
return web.Response(status=400, text=f"h={uname} not found")
|
return web.Response(status=400, text=f"h={uname} not found")
|
||||||
if uname != "All":
|
names = [uname] if uname != "All" else list(hbdclass.Host.hosts)
|
||||||
names = [uname]
|
|
||||||
else:
|
|
||||||
names = [n for n in hbdclass.Host.hosts]
|
|
||||||
out = []
|
out = []
|
||||||
for n in names:
|
for n in names:
|
||||||
host = hbdclass.Host.hosts[n]
|
host = hbdclass.Host.hosts[n]
|
||||||
@@ -227,8 +228,7 @@ async def start(
|
|||||||
continue
|
continue
|
||||||
op_err = None
|
op_err = None
|
||||||
try:
|
try:
|
||||||
r = {"csum": None, "code": ucode}
|
host.cmds.append(("UPD", {}))
|
||||||
host.cmds.append(("UPD", r))
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
op_err = str(e)
|
op_err = str(e)
|
||||||
out.append(f"update started for {n}: {op_err if op_err else 'OK'}")
|
out.append(f"update started for {n}: {op_err if op_err else 'OK'}")
|
||||||
@@ -258,7 +258,9 @@ async def start(
|
|||||||
extra_scripts=extra_scripts,
|
extra_scripts=extra_scripts,
|
||||||
hbd_version=hbd_version,
|
hbd_version=hbd_version,
|
||||||
hosts=[
|
hosts=[
|
||||||
hbdclass.Host.hosts[h].stateinfo() for h in sorted(hbdclass.Host.hosts)
|
hbdclass.Host.hosts[h].stateinfo()
|
||||||
|
for h in sorted(hbdclass.Host.hosts)
|
||||||
|
if _can_operate_host(current_user, hbdclass.Host.hosts[h])
|
||||||
],
|
],
|
||||||
messages=data.msgs[-30:],
|
messages=data.msgs[-30:],
|
||||||
current_user=current_user.to_dict() if current_user else None,
|
current_user=current_user.to_dict() if current_user else None,
|
||||||
@@ -510,7 +512,7 @@ async def start(
|
|||||||
hosts_with_plugins = []
|
hosts_with_plugins = []
|
||||||
for hostname in sorted(hbdclass.Host.hosts.keys()):
|
for hostname in sorted(hbdclass.Host.hosts.keys()):
|
||||||
host = hbdclass.Host.hosts[hostname]
|
host = hbdclass.Host.hosts[hostname]
|
||||||
if not _can_view_host(current_user, host):
|
if not _can_operate_host(current_user, host):
|
||||||
continue
|
continue
|
||||||
if host.plugin_data:
|
if host.plugin_data:
|
||||||
hosts_with_plugins.append({
|
hosts_with_plugins.append({
|
||||||
@@ -520,8 +522,8 @@ async def start(
|
|||||||
|
|
||||||
tmpl = env.get_template("plugins.html")
|
tmpl = env.get_template("plugins.html")
|
||||||
body = tmpl.render(
|
body = tmpl.render(
|
||||||
title="Plugin Metrics - Heartbeat",
|
title="Host Overview - Heartbeat",
|
||||||
header="Plugin Metrics",
|
header="Host Overview",
|
||||||
hosts=hosts_with_plugins,
|
hosts=hosts_with_plugins,
|
||||||
current_user=current_user.to_dict() if current_user else None,
|
current_user=current_user.to_dict() if current_user else None,
|
||||||
active_page="plugins",
|
active_page="plugins",
|
||||||
@@ -811,6 +813,48 @@ async def start(
|
|||||||
)
|
)
|
||||||
return web.Response(text=body, content_type="text/html")
|
return web.Response(text=body, content_type="text/html")
|
||||||
|
|
||||||
|
# -------------------------------------------------------------------------
|
||||||
|
# About page
|
||||||
|
# -------------------------------------------------------------------------
|
||||||
|
|
||||||
|
async def about_page(request):
|
||||||
|
"""GET /about — version, runtime, and project information."""
|
||||||
|
current_user, _ = _require_auth_redirect(request)
|
||||||
|
pkg_dir = os.path.dirname(__file__)
|
||||||
|
templates_dir = config.get("templates_dir", os.path.join(pkg_dir, "templates"))
|
||||||
|
env = jinja2.Environment(loader=jinja2.FileSystemLoader(templates_dir))
|
||||||
|
from hbd import __version__ as hbd_version
|
||||||
|
|
||||||
|
uptime_secs = int(time.time() - _start_epoch)
|
||||||
|
days, rem = divmod(uptime_secs, 86400)
|
||||||
|
hours, rem = divmod(rem, 3600)
|
||||||
|
mins, secs = divmod(rem, 60)
|
||||||
|
if days:
|
||||||
|
uptime_str = f"{days}d {hours}h {mins}m"
|
||||||
|
elif hours:
|
||||||
|
uptime_str = f"{hours}h {mins}m {secs}s"
|
||||||
|
else:
|
||||||
|
uptime_str = f"{mins}m {secs}s"
|
||||||
|
|
||||||
|
start_dt = datetime.datetime.fromtimestamp(_start_epoch)
|
||||||
|
start_time_str = start_dt.strftime("%Y-%m-%d %H:%M:%S")
|
||||||
|
|
||||||
|
tmpl = env.get_template("about.html")
|
||||||
|
body = tmpl.render(
|
||||||
|
title="About - Heartbeat",
|
||||||
|
header="About",
|
||||||
|
hbd_version=hbd_version,
|
||||||
|
python_version=f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro} ({platform.python_implementation()})",
|
||||||
|
server_hostname=socket.gethostname(),
|
||||||
|
start_epoch=int(_start_epoch),
|
||||||
|
start_time_str=start_time_str,
|
||||||
|
uptime_str=uptime_str,
|
||||||
|
host_count=len(hbdclass.Host.hosts),
|
||||||
|
current_user=current_user.to_dict() if current_user else None,
|
||||||
|
active_page="about",
|
||||||
|
)
|
||||||
|
return web.Response(text=body, content_type="text/html")
|
||||||
|
|
||||||
# -------------------------------------------------------------------------
|
# -------------------------------------------------------------------------
|
||||||
# Settings page (admin only)
|
# Settings page (admin only)
|
||||||
# -------------------------------------------------------------------------
|
# -------------------------------------------------------------------------
|
||||||
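The uptime string on the new About page is built with three `divmod` steps and drops the least significant unit as the span grows. The same logic as a standalone function, where `format_uptime` is a hypothetical name:

```python
def format_uptime(uptime_secs: int) -> str:
    """Format a duration the way the About page handler does."""
    days, rem = divmod(uptime_secs, 86400)   # seconds per day
    hours, rem = divmod(rem, 3600)           # seconds per hour
    mins, secs = divmod(rem, 60)
    if days:
        return f"{days}d {hours}h {mins}m"   # seconds omitted once days appear
    if hours:
        return f"{hours}h {mins}m {secs}s"
    return f"{mins}m {secs}s"


print(format_uptime(59))
print(format_uptime(3661))
print(format_uptime(90061))
```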
@@ -864,6 +908,7 @@ async def start(
|
|||||||
web.get("/live", live),
|
web.get("/live", live),
|
||||||
web.get("/plugins", plugins_page),
|
web.get("/plugins", plugins_page),
|
||||||
web.get("/alerts", alerts_page),
|
web.get("/alerts", alerts_page),
|
||||||
|
web.get("/about", about_page),
|
||||||
web.get("/profile", profile_page),
|
web.get("/profile", profile_page),
|
||||||
web.get("/settings", settings_page),
|
web.get("/settings", settings_page),
|
||||||
web.get("/static/{path:.*}", static),
|
web.get("/static/{path:.*}", static),
|
||||||
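Review note: the uptime formatting added to the `about_page` handler above can be pulled out and checked on its own. The sketch below mirrors the `divmod` chain from the diff; the function name `format_uptime` is hypothetical and does not appear in the change.

```python
def format_uptime(uptime_secs: int) -> str:
    """Format a duration in whole seconds the way the about page does.

    Hypothetical helper: the diff computes this inline in the handler.
    """
    days, rem = divmod(uptime_secs, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    mins, secs = divmod(rem, 60)
    if days:
        # Seconds are dropped once the uptime exceeds a day.
        return f"{days}d {hours}h {mins}m"
    if hours:
        return f"{hours}h {mins}m {secs}s"
    return f"{mins}m {secs}s"


if __name__ == "__main__":
    for value in (59, 3661, 90061):
        print(value, "->", format_uptime(value))
```

The JavaScript `fmt()` in `about.html` applies the same three branches, so the server-rendered string and the client-side ticker should agree on format.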
|
|||||||
@@ -210,7 +210,6 @@ async def _run_async(config, config_path=None):
|
|||||||
ctx = dict(
|
ctx = dict(
|
||||||
config=config,
|
config=config,
|
||||||
hbdclass=hbdclass,
|
hbdclass=hbdclass,
|
||||||
log=eventlog,
|
|
||||||
msg_to_websockets=msg_to_websockets,
|
msg_to_websockets=msg_to_websockets,
|
||||||
msg_journal=msg_journal,
|
msg_journal=msg_journal,
|
||||||
threshold_checker=threshold_checker,
|
threshold_checker=threshold_checker,
|
||||||
@@ -237,7 +236,6 @@ async def _run_async(config, config_path=None):
|
|||||||
restore_ctx = dict(
|
restore_ctx = dict(
|
||||||
config=config,
|
config=config,
|
||||||
hbdclass=hbdclass,
|
hbdclass=hbdclass,
|
||||||
log=eventlog,
|
|
||||||
msg_to_websockets=msg_to_websockets,
|
msg_to_websockets=msg_to_websockets,
|
||||||
threshold_checker=threshold_checker,
|
threshold_checker=threshold_checker,
|
||||||
)
|
)
|
||||||
|
|||||||
+29
-50
@@ -15,7 +15,6 @@ their own ``notification_channels`` list. When no users are configured the
|
|||||||
server runs silently (no notifications sent).
|
server runs silently (no notifications sent).
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import asyncio
|
|
||||||
import asyncio
|
import asyncio
|
||||||
import logging
|
import logging
|
||||||
import smtplib
|
import smtplib
|
||||||
@@ -30,13 +29,10 @@ from . import ws as ws_mod
|
|||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
msg_to_websockets = ws_mod.broadcast
|
msg_to_websockets = ws_mod.broadcast
|
||||||
|
|
||||||
# Module-level state set via setup()
|
# Module-level state set via setup()
|
||||||
_config: dict = {}
|
_config: dict = {}
|
||||||
_loop: Optional[asyncio.AbstractEventLoop] = None
|
|
||||||
|
|
||||||
# Tracks which channels fired a WARNING/CRITICAL per host.
|
# Tracks which channels fired a WARNING/CRITICAL per host.
|
||||||
# {host_name: set of channel_names} — used to route RECOVER to the same channels.
|
# {host_name: set of channel_names} — used to route RECOVER to the same channels.
|
||||||
@@ -73,11 +69,9 @@ class Notification:
|
|||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
def setup(cfg: dict, loop: Optional[asyncio.AbstractEventLoop] = None):
|
def setup(cfg: dict, loop: Optional[asyncio.AbstractEventLoop] = None):
|
||||||
"""Initialize notifier from configuration dict and event loop."""
|
"""Initialize notifier from configuration dict."""
|
||||||
global _config, _loop
|
global _config
|
||||||
_config = dict(cfg)
|
_config = dict(cfg)
|
||||||
if loop is not None:
|
|
||||||
_loop = loop
|
|
||||||
|
|
||||||
|
|
||||||
def reload_config(cfg: dict):
|
def reload_config(cfg: dict):
|
||||||
@@ -299,17 +293,6 @@ async def _send_sms_voipms_async(channel_cfg: dict, notif: Notification) -> bool
|
|||||||
return False
|
return False
|
||||||
|
|
||||||
|
|
||||||
def _send_sms_voipms(channel_cfg: dict, notif: Notification) -> bool:
|
|
||||||
"""Dispatch voip.ms SMS send onto the shared event loop."""
|
|
||||||
if _loop is None:
|
|
||||||
logger.warning("sms_voipms: event loop not available")
|
|
||||||
return False
|
|
||||||
future = asyncio.run_coroutine_threadsafe(_send_sms_voipms_async(channel_cfg, notif), _loop)
|
|
||||||
try:
|
|
||||||
return future.result(timeout=15)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error("sms_voipms send timed out or failed: %s", e)
|
|
||||||
return False
|
|
||||||
|
|
||||||
|
|
||||||
async def _send_matrix_async(channel_cfg: dict, notif: Notification) -> bool:
|
async def _send_matrix_async(channel_cfg: dict, notif: Notification) -> bool:
|
||||||
@@ -357,48 +340,48 @@ async def _send_matrix_async(channel_cfg: dict, notif: Notification) -> bool:
|
|||||||
await client.close()
|
await client.close()
|
||||||
|
|
||||||
|
|
||||||
def _send_matrix(channel_cfg: dict, notif: Notification) -> bool:
|
|
||||||
"""Dispatch matrix send onto the shared event loop."""
|
|
||||||
if _loop is None:
|
|
||||||
logger.warning("matrix: event loop not available")
|
|
||||||
return False
|
|
||||||
future = asyncio.run_coroutine_threadsafe(_send_matrix_async(channel_cfg, notif), _loop)
|
|
||||||
try:
|
|
||||||
return future.result(timeout=15)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error("matrix send timed out or failed: %s", e)
|
|
||||||
return False
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Channel dispatcher
|
# Channel dispatcher (all async — sync drivers run in a thread executor)
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
# Sync drivers kept for `hbd notify` CLI usage (asyncio.run wraps them there).
|
||||||
_DRIVERS = {
|
_DRIVERS = {
|
||||||
"pushover": _send_pushover,
|
"pushover": _send_pushover,
|
||||||
"email": _send_email,
|
"email": _send_email,
|
||||||
"mattermost": _send_mattermost,
|
"mattermost": _send_mattermost,
|
||||||
"signal": _send_signal,
|
"signal": _send_signal,
|
||||||
"sms_voipms": _send_sms_voipms,
|
|
||||||
"matrix": _send_matrix,
|
|
||||||
}
|
}
|
||||||
|
|
||||||
|
_TIMEOUT = 15 # seconds per channel send
|
||||||
|
|
||||||
def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
|
|
||||||
|
async def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
|
||||||
"""Send *notif* to a single named channel, honouring min_level."""
|
"""Send *notif* to a single named channel, honouring min_level."""
|
||||||
|
level = notif.level.upper()
|
||||||
|
if level != "RECOVER":
|
||||||
min_level = channel_cfg.get("min_level", "WARNING").upper()
|
min_level = channel_cfg.get("min_level", "WARNING").upper()
|
||||||
if _level_value(notif.level) < _level_value(min_level):
|
if _level_value(level) < _level_value(min_level):
|
||||||
logger.debug(
|
logger.debug(
|
||||||
"channel '%s': skipping level %s (min_level=%s)", channel_name, notif.level, min_level
|
"channel '%s': skipping level %s (min_level=%s)", channel_name, level, min_level
|
||||||
)
|
)
|
||||||
return True # not an error — filtered intentionally
|
return True # filtered intentionally
|
||||||
|
|
||||||
ch_type = channel_cfg.get("type", "")
|
ch_type = channel_cfg.get("type", "")
|
||||||
driver = _DRIVERS.get(ch_type)
|
try:
|
||||||
if driver is None:
|
if ch_type == "matrix":
|
||||||
|
return await asyncio.wait_for(_send_matrix_async(channel_cfg, notif), timeout=_TIMEOUT)
|
||||||
|
if ch_type == "sms_voipms":
|
||||||
|
return await asyncio.wait_for(_send_sms_voipms_async(channel_cfg, notif), timeout=_TIMEOUT)
|
||||||
|
sync_driver = _DRIVERS.get(ch_type)
|
||||||
|
if sync_driver is None:
|
||||||
logger.warning("unknown channel type '%s' for channel '%s'", ch_type, channel_name)
|
logger.warning("unknown channel type '%s' for channel '%s'", ch_type, channel_name)
|
||||||
return False
|
return False
|
||||||
return driver(channel_cfg, notif)
|
return await asyncio.wait_for(
|
||||||
|
asyncio.to_thread(sync_driver, channel_cfg, notif), timeout=_TIMEOUT
|
||||||
|
)
|
||||||
|
except asyncio.TimeoutError:
|
||||||
|
logger.error("channel '%s' timed out after %ds", channel_name, _TIMEOUT)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
@@ -412,7 +395,7 @@ def _build_url(host_name: str) -> str:
|
|||||||
return f"{base_url}/plugins#{host_name}"
|
return f"{base_url}/plugins#{host_name}"
|
||||||
|
|
||||||
|
|
||||||
def send_notification(host_name: str, notif: Notification) -> dict:
|
async def send_notification(host_name: str, notif: Notification) -> dict:
|
||||||
"""Dispatch *notif* to all managers/owner of *host_name*.
|
"""Dispatch *notif* to all managers/owner of *host_name*.
|
||||||
|
|
||||||
Looks up the host's owner + managers, resolves each user's
|
Looks up the host's owner + managers, resolves each user's
|
||||||
@@ -462,16 +445,12 @@ def send_notification(host_name: str, notif: Notification) -> dict:
|
|||||||
if not channel_cfg:
|
if not channel_cfg:
|
||||||
continue
|
continue
|
||||||
try:
|
try:
|
||||||
ch_type = channel_cfg.get("type", "")
|
ok = await _dispatch_to_channel(channel_name, channel_cfg, notif)
|
||||||
driver = _DRIVERS.get(ch_type)
|
|
||||||
if driver:
|
|
||||||
ok = driver(channel_cfg, notif)
|
|
||||||
results[channel_name] = ok
|
results[channel_name] = ok
|
||||||
if ok:
|
if ok:
|
||||||
logger.info("recover sent to channel '%s': %s", channel_name, notif.title)
|
logger.info("recover sent to channel '%s': %s", channel_name, notif.title)
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error("error sending recover to channel '%s': %s", channel_name, e)
|
logger.error("error sending recover to channel '%s': %s", channel_name, e)
|
||||||
# Clear the alerted set once recovery is delivered
|
|
||||||
del _alerted_channels[host_name]
|
del _alerted_channels[host_name]
|
||||||
return results
|
return results
|
||||||
|
|
||||||
@@ -482,14 +461,14 @@ def send_notification(host_name: str, notif: Notification) -> dict:
|
|||||||
continue
|
continue
|
||||||
for channel_name in user.notification_channels:
|
for channel_name in user.notification_channels:
|
||||||
if channel_name in results:
|
if channel_name in results:
|
||||||
continue # already dispatched to this channel this notification
|
continue
|
||||||
channel_cfg = global_channels.get(channel_name)
|
channel_cfg = global_channels.get(channel_name)
|
||||||
if not channel_cfg:
|
if not channel_cfg:
|
||||||
logger.warning("channel '%s' not defined in notification_channels", channel_name)
|
logger.warning("channel '%s' not defined in notification_channels", channel_name)
|
||||||
results[channel_name] = False
|
results[channel_name] = False
|
||||||
continue
|
continue
|
||||||
try:
|
try:
|
||||||
ok = _dispatch_to_channel(channel_name, channel_cfg, notif)
|
ok = await _dispatch_to_channel(channel_name, channel_cfg, notif)
|
||||||
results[channel_name] = ok
|
results[channel_name] = ok
|
||||||
if ok:
|
if ok:
|
||||||
logger.info("notification sent to channel '%s': %s", channel_name, notif.title)
|
logger.info("notification sent to channel '%s': %s", channel_name, notif.title)
|
||||||
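Review note: the new `_dispatch_to_channel` runs sync drivers through `asyncio.to_thread` and bounds each send with `asyncio.wait_for`. A minimal, standalone sketch of that pattern follows. The names `slow_sync_driver` and `dispatch` and the short timeout are assumptions for illustration; the diff itself uses `_TIMEOUT = 15`.

```python
import asyncio
import time

TIMEOUT = 0.2  # seconds; shortened for the demo (assumption, the diff uses 15)


def slow_sync_driver(delay: float) -> bool:
    """Stand-in for a blocking driver such as an SMTP or HTTP send."""
    time.sleep(delay)
    return True


async def dispatch(driver, *args, timeout: float = TIMEOUT) -> bool:
    """Run a blocking driver in a worker thread, giving up after `timeout`."""
    try:
        return await asyncio.wait_for(asyncio.to_thread(driver, *args), timeout=timeout)
    except asyncio.TimeoutError:
        # Same outcome as the diff: a timed-out channel counts as a failed send.
        return False


async def main() -> tuple:
    fast = await dispatch(slow_sync_driver, 0.01)  # finishes within the timeout
    slow = await dispatch(slow_sync_driver, 0.5)   # exceeds the timeout
    return fast, slow


if __name__ == "__main__":
    print(asyncio.run(main()))
```

One design point worth keeping in mind: the timeout only stops the coroutine from waiting. The worker thread started by `asyncio.to_thread` cannot be interrupted and keeps running until the blocking call returns, so a hung driver still occupies an executor thread.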
|
|||||||
@@ -24,7 +24,7 @@ sensitive bool True when the raw value must never be shown
|
|||||||
# Credential field names that should always be masked.
|
# Credential field names that should always be masked.
|
||||||
_SECRET_KEYS = frozenset({
|
_SECRET_KEYS = frozenset({
|
||||||
"password", "token", "user_key", "api_key", "secret",
|
"password", "token", "user_key", "api_key", "secret",
|
||||||
"smtp_password", "smtp_user",
|
"smtp_password", "smtp_user", "api_password", "access_token",
|
||||||
})
|
})
|
||||||
|
|
||||||
_CHANNEL_TYPE_LABELS = {
|
_CHANNEL_TYPE_LABELS = {
|
||||||
@@ -188,7 +188,7 @@ def get_settings_sections(config: dict) -> list:
|
|||||||
continue
|
continue
|
||||||
hosts_list.append({
|
hosts_list.append({
|
||||||
"name": hname,
|
"name": hname,
|
||||||
"watch": bool(hcfg.get("watch", False)),
|
"watch": bool(hcfg.get("watch", True)),
|
||||||
"dyndns": bool(hcfg.get("dyndns", False)),
|
"dyndns": bool(hcfg.get("dyndns", False)),
|
||||||
"owner": hcfg.get("owner", ""),
|
"owner": hcfg.get("owner", ""),
|
||||||
"managers": hcfg.get("managers", []),
|
"managers": hcfg.get("managers", []),
|
||||||
|
|||||||
@@ -0,0 +1,199 @@
|
|||||||
|
<!DOCTYPE html>
|
||||||
|
<html>
|
||||||
|
{% include 'head.html' %}
|
||||||
|
|
||||||
|
<style>
|
||||||
|
html, body { overflow: visible; }
|
||||||
|
|
||||||
|
.container {
|
||||||
|
max-width: 700px;
|
||||||
|
margin: 0 auto;
|
||||||
|
}
|
||||||
|
|
||||||
|
h1 {
|
||||||
|
color: #333;
|
||||||
|
margin-bottom: 4px;
|
||||||
|
font-size: 1.5em;
|
||||||
|
}
|
||||||
|
|
||||||
|
.subtitle {
|
||||||
|
color: #666;
|
||||||
|
margin-bottom: 24px;
|
||||||
|
font-size: 0.9em;
|
||||||
|
}
|
||||||
|
|
||||||
|
.section {
|
||||||
|
background: #fff;
|
||||||
|
border-radius: 8px;
|
||||||
|
box-shadow: 0 1px 6px rgba(0,0,0,0.1);
|
||||||
|
padding: 20px 24px;
|
||||||
|
margin-bottom: 20px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.section h2 {
|
||||||
|
font-size: 1em;
|
||||||
|
font-weight: 700;
|
||||||
|
color: #333;
|
||||||
|
margin: 0 0 16px;
|
||||||
|
padding-bottom: 10px;
|
||||||
|
border-bottom: 1px solid #eee;
|
||||||
|
text-transform: uppercase;
|
||||||
|
letter-spacing: 0.5px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.info-row {
|
||||||
|
display: flex;
|
||||||
|
align-items: baseline;
|
||||||
|
padding: 8px 0;
|
||||||
|
border-bottom: 1px solid #f5f5f5;
|
||||||
|
font-size: 0.9em;
|
||||||
|
}
|
||||||
|
.info-row:last-child { border-bottom: none; }
|
||||||
|
|
||||||
|
.info-label {
|
||||||
|
width: 160px;
|
||||||
|
flex-shrink: 0;
|
||||||
|
color: #666;
|
||||||
|
font-size: 0.88em;
|
||||||
|
}
|
||||||
|
|
||||||
|
.info-value {
|
||||||
|
color: #222;
|
||||||
|
word-break: break-all;
|
||||||
|
}
|
||||||
|
|
||||||
|
.info-value a {
|
||||||
|
color: #0066cc;
|
||||||
|
text-decoration: none;
|
||||||
|
}
|
||||||
|
.info-value a:hover { text-decoration: underline; }
|
||||||
|
|
||||||
|
.version-badge {
|
||||||
|
display: inline-block;
|
||||||
|
padding: 3px 12px;
|
||||||
|
background: #e8f0fe;
|
||||||
|
color: #1a73e8;
|
||||||
|
border-radius: 12px;
|
||||||
|
font-size: 0.85em;
|
||||||
|
font-weight: 600;
|
||||||
|
font-family: monospace;
|
||||||
|
}
|
||||||
|
|
||||||
|
.hb-logo {
|
||||||
|
font-size: 2.5em;
|
||||||
|
font-weight: 700;
|
||||||
|
color: #0066cc;
|
||||||
|
letter-spacing: -1px;
|
||||||
|
margin-bottom: 6px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.hb-tagline {
|
||||||
|
color: #555;
|
||||||
|
font-size: 0.95em;
|
||||||
|
}
|
||||||
|
|
||||||
|
.logo-section {
|
||||||
|
display: flex;
|
||||||
|
align-items: center;
|
||||||
|
gap: 20px;
|
||||||
|
padding: 8px 0 4px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.logo-text { flex: 1; }
|
||||||
|
</style>
|
||||||
|
|
||||||
|
<body>
|
||||||
|
{% include 'nav.html' %}
|
||||||
|
|
||||||
|
<div class="container">
|
||||||
|
<h1>{{ header }}</h1>
|
||||||
|
<p class="subtitle">Heartbeat monitoring system</p>
|
||||||
|
|
||||||
|
<div class="section">
|
||||||
|
<div class="logo-section">
|
||||||
|
<div class="logo-text">
|
||||||
|
<div class="hb-logo">Heartbeat</div>
|
||||||
|
<div class="hb-tagline">Lightweight host monitoring over UDP</div>
|
||||||
|
</div>
|
||||||
|
<span class="version-badge">v{{ hbd_version }}</span>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="section">
|
||||||
|
<h2>Version</h2>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">Server version</span>
|
||||||
|
<span class="info-value">{{ hbd_version }}</span>
|
||||||
|
</div>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">Python</span>
|
||||||
|
<span class="info-value">{{ python_version }}</span>
|
||||||
|
</div>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">License</span>
|
||||||
|
<span class="info-value">MIT</span>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="section">
|
||||||
|
<h2>Runtime</h2>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">Host</span>
|
||||||
|
<span class="info-value">{{ server_hostname }}</span>
|
||||||
|
</div>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">Started</span>
|
||||||
|
<span class="info-value">{{ start_time_str }}</span>
|
||||||
|
</div>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">Uptime</span>
|
||||||
|
<span class="info-value" id="uptime-value">{{ uptime_str }}</span>
|
||||||
|
</div>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">Hosts monitored</span>
|
||||||
|
<span class="info-value">{{ host_count }}</span>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="section">
|
||||||
|
<h2>Contact & Source</h2>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">Author</span>
|
||||||
|
<span class="info-value">Andreas Wrede</span>
|
||||||
|
</div>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">Email</span>
|
||||||
|
<span class="info-value"><a href="mailto:aew@wrede.ca">aew@wrede.ca</a></span>
|
||||||
|
</div>
|
||||||
|
<div class="info-row">
|
||||||
|
<span class="info-label">Repository</span>
|
||||||
|
<span class="info-value"><a href="https://git.wrede.ca/andreas/heartbeat" target="_blank" rel="noopener">git.wrede.ca/andreas/heartbeat</a></span>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<script>
|
||||||
|
(function() {
|
||||||
|
var startEpoch = {{ start_epoch }};
|
||||||
|
var el = document.getElementById('uptime-value');
|
||||||
|
if (!el) return;
|
||||||
|
function fmt(s) {
|
||||||
|
var d = Math.floor(s / 86400);
|
||||||
|
var h = Math.floor((s % 86400) / 3600);
|
||||||
|
var m = Math.floor((s % 3600) / 60);
|
||||||
|
var sec = s % 60;
|
||||||
|
if (d > 0) return d + 'd ' + h + 'h ' + m + 'm';
|
||||||
|
if (h > 0) return h + 'h ' + m + 'm ' + sec + 's';
|
||||||
|
return m + 'm ' + sec + 's';
|
||||||
|
}
|
||||||
|
function tick() {
|
||||||
|
var up = Math.floor(Date.now() / 1000 - startEpoch);
|
||||||
|
el.textContent = fmt(up);
|
||||||
|
}
|
||||||
|
tick();
|
||||||
|
setInterval(tick, 1000);
|
||||||
|
})();
|
||||||
|
</script>
|
||||||
|
</body>
|
||||||
|
</html>
|
||||||
@@ -3,20 +3,13 @@
|
|||||||
{% include 'head.html' %}
|
{% include 'head.html' %}
|
||||||
|
|
||||||
<style>
|
<style>
|
||||||
body {
|
|
||||||
margin: 20px;
|
|
||||||
background: #f5f5f5;
|
|
||||||
}
|
|
||||||
|
|
||||||
.container {
|
.container {
|
||||||
max-width: 1400px;
|
max-width: 1400px;
|
||||||
margin: 0 auto;
|
margin: 0 auto;
|
||||||
}
|
}
|
||||||
|
|
||||||
h1 {
|
h1 { color: #333; margin-bottom: 5px; margin-top: 15px; font-size: 1.5em; }
|
||||||
color: #333;
|
|
||||||
margin-bottom: 10px;
|
|
||||||
}
|
|
||||||
|
|
||||||
.subtitle {
|
.subtitle {
|
||||||
color: #666;
|
color: #666;
|
||||||
@@ -41,7 +34,7 @@
|
|||||||
border-left: 4px solid #ddd;
|
border-left: 4px solid #ddd;
|
||||||
}
|
}
|
||||||
|
|
||||||
.summary-card.critical { border-left-color: #f44336; }
|
.summary-card.critical { border-left-color: #ea1e0f; }
|
||||||
.summary-card.warning { border-left-color: #ff9800; }
|
.summary-card.warning { border-left-color: #ff9800; }
|
||||||
.summary-card.ok { border-left-color: #4caf50; }
|
.summary-card.ok { border-left-color: #4caf50; }
|
||||||
|
|
||||||
@@ -51,7 +44,7 @@
|
|||||||
line-height: 1;
|
line-height: 1;
|
||||||
}
|
}
|
||||||
|
|
||||||
.summary-number.critical { color: #f44336; }
|
.summary-number.critical { color: #ea1e0f; }
|
||||||
.summary-number.warning { color: #ff9800; }
|
.summary-number.warning { color: #ff9800; }
|
||||||
.summary-number.ok { color: #4caf50; }
|
.summary-number.ok { color: #4caf50; }
|
||||||
|
|
||||||
@@ -116,7 +109,7 @@
|
|||||||
}
|
}
|
||||||
|
|
||||||
.alert-item.acknowledged {
|
.alert-item.acknowledged {
|
||||||
opacity: 0.6;
|
opacity: 0.8;
|
||||||
background: #f0f0f0;
|
background: #f0f0f0;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@@ -6,13 +6,32 @@
|
|||||||
<title>{{ title }}</title>
|
<title>{{ title }}</title>
|
||||||
{% if extra_scripts %}<script src="{{ extra_scripts }}"></script>{% endif %}
|
{% if extra_scripts %}<script src="{{ extra_scripts }}"></script>{% endif %}
|
||||||
<style>
|
<style>
|
||||||
|
/* ── Reset / shared baseline ── */
|
||||||
|
*, *::before, *::after { box-sizing: border-box; }
|
||||||
|
html {
|
||||||
|
font-family: 'Segoe UI', system-ui, -apple-system, sans-serif;
|
||||||
|
font-size: 14px;
|
||||||
|
}
|
||||||
|
body {
|
||||||
|
margin: 0;
|
||||||
|
padding: 10px;
|
||||||
|
padding-top: 60px;
|
||||||
|
background: #f5f5f5;
|
||||||
|
}
|
||||||
|
h1 { font-size: 1.5em; color: #333; margin: 0 0 5px; }
|
||||||
|
h2 { font-size: 1.1em; color: #333; margin: 0 0 8px; }
|
||||||
|
p { margin: 0; }
|
||||||
|
|
||||||
/* Navigation bar — shared across all pages */
|
/* Navigation bar — shared across all pages */
|
||||||
.nav {
|
.nav {
|
||||||
|
position: fixed;
|
||||||
|
top: 0;
|
||||||
|
left: 0;
|
||||||
|
right: 0;
|
||||||
|
z-index: 200;
|
||||||
background: #fff;
|
background: #fff;
|
||||||
padding: 10px 15px;
|
padding: 6px 12px;
|
||||||
margin-bottom: 10px;
|
|
||||||
box-shadow: 0 2px 4px rgba(0,0,0,.1);
|
box-shadow: 0 2px 4px rgba(0,0,0,.1);
|
||||||
border-radius: 4px;
|
|
||||||
display: flex;
|
display: flex;
|
||||||
align-items: center;
|
align-items: center;
|
||||||
justify-content: space-between;
|
justify-content: space-between;
|
||||||
@@ -42,6 +61,17 @@
|
|||||||
transition: background 0.15s;
|
transition: background 0.15s;
|
||||||
}
|
}
|
||||||
.nav-user:hover { background: #f0f4ff; text-decoration: none; }
|
.nav-user:hover { background: #f0f4ff; text-decoration: none; }
|
||||||
|
.nav-username {
|
||||||
|
max-width: 0;
|
||||||
|
overflow: hidden;
|
||||||
|
white-space: nowrap;
|
||||||
|
opacity: 0;
|
||||||
|
transition: max-width 0.2s ease, opacity 0.2s ease;
|
||||||
|
}
|
||||||
|
.nav-user:hover .nav-username {
|
||||||
|
max-width: 160px;
|
||||||
|
opacity: 1;
|
||||||
|
}
|
||||||
.nav-avatar {
|
.nav-avatar {
|
||||||
width: 28px; height: 28px;
|
width: 28px; height: 28px;
|
||||||
border-radius: 50%;
|
border-radius: 50%;
|
||||||
@@ -94,6 +124,158 @@
|
|||||||
.nav-links.nav-open { display: flex; }
|
.nav-links.nav-open { display: flex; }
|
||||||
.nav-links a { margin-right: 0; padding: 6px 0; font-size: 1em; }
|
.nav-links a { margin-right: 0; padding: 6px 0; font-size: 1em; }
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* Swiss railway clock — nav */
|
||||||
|
.nav-clock {
|
||||||
|
flex-shrink: 0;
|
||||||
|
line-height: 0;
|
||||||
|
margin-left: auto;
|
||||||
|
padding: 4px 4px 4px 0;
|
||||||
|
cursor: pointer;
|
||||||
|
}
|
||||||
|
#swiss-clock { display: block; }
|
||||||
|
|
||||||
|
/* Swiss railway clock — full-page overlay */
|
||||||
|
#clock-overlay {
|
||||||
|
display: none;
|
||||||
|
position: fixed;
|
||||||
|
inset: 0;
|
||||||
|
z-index: 9999;
|
||||||
|
background: #1a1a1a;
|
||||||
|
align-items: center;
|
||||||
|
justify-content: center;
|
||||||
|
cursor: pointer;
|
||||||
|
}
|
||||||
|
#clock-overlay.visible { display: flex; }
|
||||||
|
#swiss-clock-overlay { display: block; }
|
||||||
</style>
|
</style>
|
||||||
|
<script>
|
||||||
|
/* ── Swiss Federal Railway (SBB) clock ── */
|
||||||
|
|
||||||
|
/* Draw one frame of the clock onto any canvas element. */
|
||||||
|
function drawSwissClock(canvas) {
|
||||||
|
var SIZE = canvas.width;
|
||||||
|
var R = SIZE / 2;
|
||||||
|
var ctx = canvas.getContext('2d');
|
||||||
|
var now = new Date();
|
||||||
|
var h = now.getHours() % 12;
|
||||||
|
var m = now.getMinutes();
|
||||||
|
var s = now.getSeconds();
|
||||||
|
var ms = now.getMilliseconds();
|
||||||
|
|
||||||
|
/* Seconds hand idles ~1.5 s at 12 before advancing (SBB behaviour) */
|
||||||
|
var sFrac = s + ms / 1000;
|
||||||
|
var sAngle = sFrac >= 58.5 ? 0 : (sFrac / 58.5) * Math.PI * 2;
|
||||||
|
|
||||||
|
ctx.clearRect(0, 0, SIZE, SIZE);
|
||||||
|
|
||||||
|
/* face */
|
||||||
|
ctx.beginPath();
|
||||||
|
ctx.arc(R, R, R - 1, 0, Math.PI * 2);
|
||||||
|
ctx.fillStyle = '#fff';
|
||||||
|
ctx.fill();
|
||||||
|
ctx.strokeStyle = '#333';
|
||||||
|
ctx.lineWidth = SIZE * 0.018;
|
||||||
|
ctx.stroke();
|
||||||
|
|
||||||
|
/* tick marks */
|
||||||
|
for (var i = 0; i < 60; i++) {
|
||||||
|
var a = (i / 60) * Math.PI * 2 - Math.PI / 2;
|
||||||
|
var isHour = (i % 5 === 0);
|
||||||
|
ctx.beginPath();
|
||||||
|
ctx.moveTo(R + Math.cos(a) * (isHour ? R * 0.72 : R * 0.88),
|
||||||
|
R + Math.sin(a) * (isHour ? R * 0.72 : R * 0.88));
|
||||||
|
ctx.lineTo(R + Math.cos(a) * R * 0.94,
|
||||||
|
R + Math.sin(a) * R * 0.94);
|
||||||
|
ctx.strokeStyle = '#222';
|
||||||
|
ctx.lineWidth = isHour ? SIZE * 0.027 : SIZE * 0.011;
|
||||||
|
ctx.lineCap = 'butt';
|
||||||
|
ctx.stroke();
|
||||||
|
}
|
||||||
|
|
||||||
|
/* hands */
|
||||||
|
function hand(angle, tip, tail, width, color) {
|
||||||
|
ctx.save();
|
||||||
|
ctx.translate(R, R);
|
||||||
|
ctx.rotate(angle);
|
||||||
|
ctx.beginPath();
|
||||||
|
ctx.moveTo(tail, 0);
|
||||||
|
ctx.lineTo(tip, 0);
|
||||||
|
ctx.strokeStyle = color;
|
||||||
|
ctx.lineWidth = width;
|
||||||
|
ctx.lineCap = 'square';
|
||||||
|
ctx.stroke();
|
||||||
|
ctx.restore();
|
||||||
|
}
|
||||||
|
|
||||||
|
hand((m + s / 60) / 60 * Math.PI * 2 - Math.PI / 2,
|
||||||
|
R * 0.88, -R * 0.12, SIZE * 0.027, '#222'); /* minute */
|
||||||
|
hand((h + m / 60) / 12 * Math.PI * 2 - Math.PI / 2,
|
||||||
|
R * 0.58, -R * 0.12, SIZE * 0.039, '#222'); /* hour */
|
||||||
|
hand(sAngle - Math.PI / 2, R * 0.78, -R * 0.22,
|
||||||
|
SIZE * 0.013, '#e00'); /* second tail+tip */
|
||||||
|
|
||||||
|
/* round dot at tip of second hand */
|
||||||
|
var dotR = SIZE * 0.028;
|
||||||
|
ctx.save();
|
||||||
|
ctx.translate(R, R);
|
||||||
|
ctx.rotate(sAngle - Math.PI / 2);
|
||||||
|
ctx.beginPath();
|
||||||
|
ctx.arc(R * 0.78, 0, dotR, 0, Math.PI * 2);
|
||||||
|
ctx.fillStyle = '#e00';
|
||||||
|
ctx.fill();
|
||||||
|
ctx.restore();
|
||||||
|
|
||||||
|
/* centre cap */
|
||||||
|
ctx.beginPath();
|
||||||
|
ctx.arc(R, R, R * 0.04, 0, Math.PI * 2);
|
||||||
|
ctx.fillStyle = '#222';
|
||||||
|
ctx.fill();
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Resize the overlay canvas to fit the viewport, keeping it square. */
|
||||||
|
function resizeOverlayClock() {
|
||||||
|
var oc = document.getElementById('swiss-clock-overlay');
|
||||||
|
if (!oc) return;
|
||||||
|
var size = Math.min(window.innerWidth, window.innerHeight) * 0.88;
|
||||||
|
size = Math.floor(size);
|
||||||
|
oc.width = size;
|
||||||
|
oc.height = size;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Main tick — redraws both nav clock and (if visible) overlay clock. */
|
||||||
|
function clockTick() {
|
||||||
|
var nav = document.getElementById('swiss-clock');
|
||||||
|
if (nav) drawSwissClock(nav);
|
||||||
|
var overlay = document.getElementById('clock-overlay');
|
||||||
|
if (overlay && overlay.classList.contains('visible')) {
|
||||||
|
var oc = document.getElementById('swiss-clock-overlay');
|
||||||
|
if (oc) drawSwissClock(oc);
|
||||||
|
}
|
||||||
|
var delay = 100 - (Date.now() % 100);
|
||||||
|
setTimeout(clockTick, delay);
|
||||||
|
}
|
||||||
|
|
||||||
|
document.addEventListener('DOMContentLoaded', function() {
|
||||||
|
/* Start the shared tick loop */
|
||||||
|
clockTick();
|
||||||
|
|
||||||
|
/* Overlay toggle — clicking the nav clock opens it */
|
||||||
|
var navClock = document.querySelector('.nav-clock');
|
||||||
|
var overlay = document.getElementById('clock-overlay');
|
||||||
|
if (navClock && overlay) {
|
||||||
|
navClock.addEventListener('click', function() {
|
||||||
|
resizeOverlayClock();
|
||||||
|
overlay.classList.add('visible');
|
||||||
|
});
|
||||||
|
overlay.addEventListener('click', function() {
|
||||||
|
overlay.classList.remove('visible');
|
||||||
|
});
|
||||||
|
window.addEventListener('resize', function() {
|
||||||
|
if (overlay.classList.contains('visible')) resizeOverlayClock();
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
</script>
|
||||||
<script src="static/sorttable.js"></script>
|
<script src="static/sorttable.js"></script>
|
||||||
</head>
|
</head>
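Review note: the seconds-hand behaviour in `drawSwissClock` (a full sweep in 58.5 s, then a pause at 12) is easy to check numerically. The sketch below restates the `sAngle` expression from the diff in Python; the function name `second_hand_angle` is hypothetical.

```python
import math

SWEEP_SECS = 58.5  # from the diff: the hand completes one revolution in 58.5 s


def second_hand_angle(seconds: float) -> float:
    """Angle in radians, clockwise from 12 o'clock, for seconds in [0, 60).

    Mirrors `sFrac >= 58.5 ? 0 : (sFrac / 58.5) * Math.PI * 2` from the diff.
    """
    if seconds >= SWEEP_SECS:
        return 0.0  # hand waits at 12 until the minute rolls over
    return seconds / SWEEP_SECS * 2 * math.pi


if __name__ == "__main__":
    for s in (0.0, 29.25, 58.4, 59.0):
        print(s, round(math.degrees(second_hand_angle(s)), 1))
```

Because the sweep is compressed into 58.5 s, the hand runs slightly ahead of true seconds for most of the minute and passes the 6 o'clock position at 29.25 s rather than 30 s.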
|
||||||
@@ -7,10 +7,6 @@
|
|||||||
display: flex;
|
display: flex;
|
||||||
flex-direction: column;
|
flex-direction: column;
|
||||||
height: 100vh;
|
height: 100vh;
|
||||||
box-sizing: border-box;
|
|
||||||
padding: 10px;
|
|
||||||
margin: 0;
|
|
||||||
background: #f5f5f5;
|
|
||||||
overflow: hidden;
|
overflow: hidden;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -49,6 +45,7 @@
|
|||||||
h1 {
|
h1 {
|
||||||
color: #333;
|
color: #333;
|
||||||
margin-bottom: 5px;
|
margin-bottom: 5px;
|
||||||
|
margin-top: 15px;
|
||||||
font-size: 1.5em;
|
font-size: 1.5em;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -489,8 +486,10 @@
|
|||||||
{% include 'menu.html' %}
|
{% include 'menu.html' %}
|
||||||
|
|
||||||
<div class="container">
|
<div class="container">
|
||||||
|
<div>
|
||||||
<h1>{{ header }}</h1>
|
<h1>{{ header }}</h1>
|
||||||
<p class="subtitle">Real-time host monitoring and event log</p>
|
<p class="subtitle">Real-time host monitoring and event log</p>
|
||||||
|
</div>
|
||||||
|
|
||||||
<div class="table-section">
|
<div class="table-section">
|
||||||
<table id="ntable" class="sortable">
|
<table id="ntable" class="sortable">
|
||||||
|
|||||||
@@ -4,11 +4,15 @@
|
|||||||
</button>
|
</button>
|
||||||
<div class="nav-links" id="nav-links">
|
<div class="nav-links" id="nav-links">
|
||||||
<a href="/live"{% if active_page == "live" %} class="active"{% endif %}>Live Dashboard</a>
|
<a href="/live"{% if active_page == "live" %} class="active"{% endif %}>Live Dashboard</a>
|
||||||
<a href="/plugins"{% if active_page == "plugins" %} class="active"{% endif %}>Plugin Metrics</a>
|
<a href="/plugins"{% if active_page == "plugins" %} class="active"{% endif %}>Host Overview</a>
|
||||||
<a href="/alerts"{% if active_page == "alerts" %} class="active"{% endif %}>Alerts</a>
|
<a href="/alerts"{% if active_page == "alerts" %} class="active"{% endif %}>Alerts</a>
|
||||||
{% if current_user and current_user.admin %}
|
{% if current_user and current_user.admin %}
|
||||||
<a href="/settings"{% if active_page == "settings" %} class="active"{% endif %}>Settings</a>
|
<a href="/settings"{% if active_page == "settings" %} class="active"{% endif %}>Settings</a>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
|
<a href="/about"{% if active_page == "about" %} class="active"{% endif %}>About</a>
|
||||||
|
</div>
|
||||||
|
<div class="nav-clock" title="Click for full-screen clock">
|
||||||
|
<canvas id="swiss-clock" width="44" height="44"></canvas>
|
||||||
</div>
|
</div>
|
||||||
{% if current_user %}
|
{% if current_user %}
|
||||||
<a href="/profile" class="nav-user{% if active_page == 'profile' %} active{% endif %}" title="{{ current_user.full_name or current_user.username }}">
|
<a href="/profile" class="nav-user{% if active_page == 'profile' %} active{% endif %}" title="{{ current_user.full_name or current_user.username }}">
|
||||||
@@ -21,6 +25,12 @@
|
|||||||
</a>
|
</a>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
<!-- Full-page clock overlay (click anywhere to dismiss) -->
|
||||||
|
<div id="clock-overlay">
|
||||||
|
<canvas id="swiss-clock-overlay" width="400" height="400"></canvas>
|
||||||
|
</div>
|
||||||
|
|
||||||
<script>
|
<script>
|
||||||
(function() {
|
(function() {
|
||||||
var btn = document.getElementById('nav-hamburger-btn');
|
var btn = document.getElementById('nav-hamburger-btn');
|
||||||
|
|||||||
+962
-863
File diff suppressed because it is too large
@@ -3,15 +3,7 @@
|
|||||||
{% include 'head.html' %}
|
{% include 'head.html' %}
|
||||||
|
|
||||||
<style>
|
<style>
|
||||||
html, body {
|
html, body { overflow: visible; }
|
||||||
overflow: visible;
|
|
||||||
}
|
|
||||||
|
|
||||||
body {
|
|
||||||
margin: 20px;
|
|
||||||
background: #f5f5f5;
|
|
||||||
font-family: 'Segoe UI', system-ui, sans-serif;
|
|
||||||
}
|
|
||||||
|
|
||||||
.container {
|
.container {
|
||||||
max-width: 900px;
|
max-width: 900px;
|
||||||
|
|||||||
@@ -3,22 +3,13 @@
|
|||||||
{% include 'head.html' %}
|
{% include 'head.html' %}
|
||||||
|
|
||||||
<style>
|
<style>
|
||||||
html, body {
|
html, body { overflow: visible; }
|
||||||
overflow: visible;
|
|
||||||
}
|
|
||||||
|
|
||||||
body {
|
|
||||||
margin: 20px;
|
|
||||||
background: #f5f5f5;
|
|
||||||
font-family: 'Segoe UI', system-ui, sans-serif;
|
|
||||||
}
|
|
||||||
|
|
||||||
.container {
|
.container {
|
||||||
max-width: 960px;
|
max-width: 960px;
|
||||||
margin: 0 auto;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
h1 { color: #333; margin-bottom: 4px; font-size: 1.5em; }
|
h1 { color: #333; margin-bottom: 5px; margin-top: 15px; font-size: 1.5em; }
|
||||||
.subtitle { color: #666; margin-bottom: 24px; font-size: 0.9em; }
|
.subtitle { color: #666; margin-bottom: 24px; font-size: 0.9em; }
|
||||||
|
|
||||||
/* ---- Sidebar + content layout ---- */
|
/* ---- Sidebar + content layout ---- */
|
||||||
@@ -32,7 +23,7 @@
|
|||||||
width: 180px;
|
width: 180px;
|
||||||
flex-shrink: 0;
|
flex-shrink: 0;
|
||||||
position: sticky;
|
position: sticky;
|
||||||
top: 20px;
|
top: 60px;
|
||||||
}
|
}
|
||||||
|
|
||||||
.sidebar-nav a {
|
.sidebar-nav a {
|
||||||
|
|||||||
+156
-70
@@ -9,10 +9,11 @@ This module provides a flexible threshold checking system that:
|
|||||||
- Supports multiple comparison operators
|
- Supports multiple comparison operators
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
import logging
|
import logging
|
||||||
import time
|
import time
|
||||||
from enum import Enum
|
from enum import Enum
|
||||||
from typing import Dict, Any, Optional, Tuple, Callable
|
from typing import Dict, List, Any, Optional, Tuple, Callable
|
||||||
from . import notify as notify_mod
|
from . import notify as notify_mod
|
||||||
from .config import THRESHOLD_DEFAULTS
|
from .config import THRESHOLD_DEFAULTS
|
||||||
|
|
||||||
@@ -60,6 +61,7 @@ class AlertState:
|
|||||||
self.acknowledged = False # Whether alert has been acknowledged
|
self.acknowledged = False # Whether alert has been acknowledged
|
||||||
self.acknowledged_at = None # Timestamp when acknowledged
|
self.acknowledged_at = None # Timestamp when acknowledged
|
||||||
self.consecutive_count = 0 # Consecutive exceedances while still OK (for count gating)
|
self.consecutive_count = 0 # Consecutive exceedances while still OK (for count gating)
|
||||||
|
self.pending_since: Optional[float] = None # non-None while waiting out grace period before notifying
|
||||||
|
|
||||||
def update(
|
def update(
|
||||||
self,
|
self,
|
||||||
@@ -105,6 +107,7 @@ class AlertState:
|
|||||||
self.level = level
|
self.level = level
|
||||||
self.since = now
|
self.since = now
|
||||||
self.notification_count = 0
|
self.notification_count = 0
|
||||||
|
self.last_notification = None # restart reminder interval on level change
|
||||||
# Reset acknowledgment on state change
|
# Reset acknowledgment on state change
|
||||||
if level != AlertLevel.OK:
|
if level != AlertLevel.OK:
|
||||||
# Only reset if changing to a different alert level
|
# Only reset if changing to a different alert level
|
||||||
@@ -326,19 +329,23 @@ class ThresholdChecker:
|
|||||||
renotify_interval: Seconds between repeat notifications (default: 1 hour)
|
renotify_interval: Seconds between repeat notifications (default: 1 hour)
|
||||||
journal: Optional MessageJournal instance for logging threshold events
|
journal: Optional MessageJournal instance for logging threshold events
|
||||||
"""
|
"""
|
||||||
# Named threshold configurations: {config_name: {metric_path: ThresholdConfig}}
|
# Named threshold configurations (pre-merged: defaults + overrides): {config_name: {metric_path: ThresholdConfig}}
|
||||||
self.threshold_configs = {}
|
self.threshold_configs = {}
|
||||||
|
|
||||||
|
# Raw overrides only for each named config (no defaults baked in): {config_name: {metric_path: ThresholdConfig}}
|
||||||
|
self.threshold_raw_configs: Dict[str, Dict[str, ThresholdConfig]] = {}
|
||||||
|
|
||||||
# Single threshold set for backward compatibility: {metric_path: ThresholdConfig}
|
# Single threshold set for backward compatibility: {metric_path: ThresholdConfig}
|
||||||
self.thresholds = {}
|
self.thresholds = {}
|
||||||
|
|
||||||
# Host to config name mapping: {host_name: config_name}
|
# Host to ordered list of config names: {host_name: [config_name, ...]}
|
||||||
self.host_config_mapping = {}
|
self.host_config_mapping: Dict[str, List[str]] = {}
|
||||||
|
|
||||||
# Default config name to use when no mapping exists
|
# Default config name to use when no mapping exists
|
||||||
self.default_config = "default"
|
self.default_config = "default"
|
||||||
|
|
||||||
self.renotify_interval = renotify_interval
|
self.renotify_interval = renotify_interval
|
||||||
|
self.grace_seconds: float = float(config.get("grace", 2))
|
||||||
self.journal = journal
|
self.journal = journal
|
||||||
|
|
||||||
# Parse configuration
|
# Parse configuration
|
||||||
@@ -369,8 +376,10 @@ class ThresholdChecker:
|
|||||||
|
|
||||||
# Clear old configuration
|
# Clear old configuration
|
||||||
self.threshold_configs.clear()
|
self.threshold_configs.clear()
|
||||||
|
self.threshold_raw_configs.clear()
|
||||||
self.thresholds.clear()
|
self.thresholds.clear()
|
||||||
self.host_config_mapping.clear()
|
self.host_config_mapping.clear()
|
||||||
|
self.grace_seconds = float(config.get("grace", 2))
|
||||||
|
|
||||||
# Parse new configuration
|
# Parse new configuration
|
||||||
self._parse_config(config)
|
self._parse_config(config)
|
||||||
@@ -420,9 +429,10 @@ class ThresholdChecker:
|
|||||||
self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=effective_defaults)
|
self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=effective_defaults)
|
||||||
|
|
||||||
self.threshold_configs["default"] = dict(effective_defaults)
|
self.threshold_configs["default"] = dict(effective_defaults)
|
||||||
|
self.threshold_raw_configs["default"] = {}
|
||||||
logger.info("Registered 'default' threshold config with %d metrics", len(effective_defaults))
|
logger.info("Registered 'default' threshold config with %d metrics", len(effective_defaults))
|
||||||
|
|
||||||
# Parse each named configuration, seeding it with effective_defaults first
|
# Parse each named configuration
|
||||||
for config_name, config_data in threshold_configs.items():
|
for config_name, config_data in threshold_configs.items():
|
||||||
if config_name == "default":
|
if config_name == "default":
|
||||||
continue # already handled above
|
continue # already handled above
|
||||||
@@ -436,33 +446,41 @@ class ThresholdChecker:
|
|||||||
continue
|
continue
|
||||||
|
|
||||||
logger.info("Parsing threshold configuration: %s", config_name)
|
logger.info("Parsing threshold configuration: %s", config_name)
|
||||||
self.threshold_configs[config_name] = dict(effective_defaults)
|
|
||||||
|
|
||||||
|
# Raw overrides only (used for multi-config layering)
|
||||||
|
raw_overrides: Dict[str, ThresholdConfig] = {}
|
||||||
thresholds_config = config_data["thresholds"]
|
thresholds_config = config_data["thresholds"]
|
||||||
for plugin_name, plugin_thresholds in thresholds_config.items():
|
for plugin_name, plugin_thresholds in thresholds_config.items():
|
||||||
if not isinstance(plugin_thresholds, dict):
|
if isinstance(plugin_thresholds, dict):
|
||||||
continue
|
self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=raw_overrides)
|
||||||
|
self.threshold_raw_configs[config_name] = raw_overrides
|
||||||
|
|
||||||
self._parse_plugin_thresholds(
|
# Pre-merged version (defaults + overrides) for single-config fast path
|
||||||
plugin_name,
|
self.threshold_configs[config_name] = dict(effective_defaults)
|
||||||
plugin_thresholds,
|
self.threshold_configs[config_name].update(raw_overrides)
|
||||||
target_dict=self.threshold_configs[config_name]
|
|
||||||
)
|
|
||||||
|
|
||||||
# Parse host to config mapping from two possible sources
|
# Parse host → config list mapping from two possible sources
|
||||||
# 1. New format: hosts section with threshold_config attribute
|
|
||||||
|
def _normalise(value) -> List[str]:
|
||||||
|
"""Accept a string or list; always return a list."""
|
||||||
|
if isinstance(value, list):
|
||||||
|
return [str(v) for v in value]
|
||||||
|
return [str(value)]
|
||||||
|
|
||||||
|
# 1. hosts section with threshold_config attribute (string or list)
|
||||||
if "hosts" in config:
|
if "hosts" in config:
|
||||||
hosts_config = config["hosts"]
|
hosts_config = config["hosts"]
|
||||||
if isinstance(hosts_config, dict):
|
if isinstance(hosts_config, dict):
|
||||||
for host_name, host_attrs in hosts_config.items():
|
for host_name, host_attrs in hosts_config.items():
|
||||||
if isinstance(host_attrs, dict) and "threshold_config" in host_attrs:
|
if isinstance(host_attrs, dict) and "threshold_config" in host_attrs:
|
||||||
self.host_config_mapping[host_name] = host_attrs["threshold_config"]
|
self.host_config_mapping[host_name] = _normalise(host_attrs["threshold_config"])
|
||||||
|
|
||||||
# 2. Legacy format: host_threshold_mapping section (for backward compatibility)
|
# 2. Legacy host_threshold_mapping section (string values only)
|
||||||
if "host_threshold_mapping" in config:
|
if "host_threshold_mapping" in config:
|
||||||
legacy_mapping = config.get("host_threshold_mapping", {})
|
legacy_mapping = config.get("host_threshold_mapping", {})
|
||||||
if isinstance(legacy_mapping, dict):
|
if isinstance(legacy_mapping, dict):
|
||||||
self.host_config_mapping.update(legacy_mapping)
|
for host_name, value in legacy_mapping.items():
|
||||||
|
self.host_config_mapping[host_name] = _normalise(value)
|
||||||
|
|
||||||
# Set default config (first one alphabetically or explicitly set)
|
# Set default config (first one alphabetically or explicitly set)
|
||||||
self.default_config = config.get("default_threshold_config", "default")
|
self.default_config = config.get("default_threshold_config", "default")
|
||||||
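Review note: the hunk above makes `threshold_config` accept either a string or a list and stores every host mapping as a list. The sketch below is one way to read that behaviour in isolation. The names `normalise`, `build_host_mapping` and the sample host and config names are hypothetical stand-ins, not the module's real API; only the two config sections (`hosts` and `host_threshold_mapping`) come from the diff.

```python
from typing import Any, Dict, List


def normalise(value: Any) -> List[str]:
    """Accept a string or a list; always return a list of strings."""
    if isinstance(value, list):
        return [str(v) for v in value]
    return [str(value)]


def build_host_mapping(config: Dict[str, Any]) -> Dict[str, List[str]]:
    """Collect host -> [config_name, ...] from both supported sections."""
    mapping: Dict[str, List[str]] = {}

    # 1. hosts section: threshold_config may be a string or a list
    hosts = config.get("hosts")
    if isinstance(hosts, dict):
        for host_name, attrs in hosts.items():
            if isinstance(attrs, dict) and "threshold_config" in attrs:
                mapping[host_name] = normalise(attrs["threshold_config"])

    # 2. legacy section: processed second, so it overwrites entries from (1)
    legacy = config.get("host_threshold_mapping", {})
    if isinstance(legacy, dict):
        for host_name, value in legacy.items():
            mapping[host_name] = normalise(value)

    return mapping


# Hypothetical sample configuration, for illustration only
example = {
    "hosts": {
        "web1": {"threshold_config": ["web", "strict"]},
        "db1": {"threshold_config": "database"},
    },
    "host_threshold_mapping": {"db1": "legacy-db"},
}
mapping = build_host_mapping(example)
```

Because the legacy section is read after `hosts`, a host listed in both ends up with the legacy value, which matches the order of the two blocks in the diff.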
@@ -660,7 +678,10 @@ class ThresholdChecker:
|
|||||||
)
|
)
|
||||||
|
|
||||||
def get_thresholds_for_host(self, host_name: str) -> Dict[str, ThresholdConfig]:
|
def get_thresholds_for_host(self, host_name: str) -> Dict[str, ThresholdConfig]:
|
||||||
"""Get the appropriate threshold configuration for a host.
|
"""Get the effective threshold configuration for a host.
|
||||||
|
|
||||||
|
When threshold_config is a list, configs are applied left-to-right on top
|
||||||
|
of the default thresholds so earlier entries can be overridden by later ones.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
host_name: Name of the host
|
host_name: Name of the host
|
||||||
@@ -672,23 +693,40 @@ class ThresholdChecker:
|
|||||||
if self.thresholds and not self.threshold_configs:
|
if self.thresholds and not self.threshold_configs:
|
||||||
return self.thresholds
|
return self.thresholds
|
||||||
|
|
||||||
# Multi-config mode: look up host-specific configuration
|
if not self.threshold_configs:
|
||||||
if self.threshold_configs:
|
return {}
|
||||||
config_name = self.host_config_mapping.get(host_name, self.default_config)
|
|
||||||
|
|
||||||
if config_name in self.threshold_configs:
|
config_names = self.host_config_mapping.get(host_name)
|
||||||
return self.threshold_configs[config_name]
|
|
||||||
else:
|
# No host-specific mapping → return pre-merged default
|
||||||
|
if not config_names:
|
||||||
|
return self.threshold_configs.get(self.default_config, {})
|
||||||
|
|
||||||
|
# Single config → fast path using pre-merged copy
|
||||||
|
if len(config_names) == 1:
|
||||||
|
name = config_names[0]
|
||||||
|
if name in self.threshold_configs:
|
||||||
|
return self.threshold_configs[name]
|
||||||
logger.warning(
|
logger.warning(
|
||||||
"Threshold config '%s' not found for host '%s', using default '%s'",
|
"Threshold config '%s' not found for host '%s', using default '%s'",
|
||||||
config_name,
|
name, host_name, self.default_config,
|
||||||
host_name,
|
|
||||||
self.default_config
|
|
||||||
)
|
)
|
||||||
return self.threshold_configs.get(self.default_config, {})
|
return self.threshold_configs.get(self.default_config, {})
|
||||||
|
|
||||||
# No thresholds configured
|
# Multiple configs → start from defaults, layer raw overrides in order
|
||||||
return {}
|
result = dict(self.threshold_configs.get(self.default_config, {}))
|
||||||
|
for name in config_names:
|
||||||
|
if name == self.default_config:
|
||||||
|
continue # defaults already the base
|
||||||
|
raw = self.threshold_raw_configs.get(name)
|
||||||
|
if raw is None:
|
||||||
|
logger.warning(
|
||||||
|
"Threshold config '%s' not found for host '%s', skipping",
|
||||||
|
name, host_name,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
result.update(raw)
|
||||||
|
return result
|
||||||
|
|
||||||
def check_value(
|
def check_value(
|
||||||
self,
|
self,
|
||||||
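Review note: the reason for keeping `threshold_raw_configs` next to the pre-merged `threshold_configs` is easiest to see with a small example. The sketch below assumes plain numbers in place of `ThresholdConfig` objects, and the function name, metric names and config names are made up for illustration. It follows the lookup order shown in the hunk above: no mapping returns the default, a single config uses the pre-merged copy, and several configs are layered from raw overrides.

```python
from typing import Any, Dict, List

Thresholds = Dict[str, Any]


def thresholds_for_host(
    host_name: str,
    merged: Dict[str, Thresholds],      # defaults + overrides per config
    raw: Dict[str, Thresholds],         # overrides only per config
    host_mapping: Dict[str, List[str]],
    default_name: str = "default",
) -> Thresholds:
    if not merged:
        return {}
    names = host_mapping.get(host_name)
    if not names:
        return merged.get(default_name, {})
    if len(names) == 1:
        # Fast path: the pre-merged copy already contains the defaults
        return merged.get(names[0], merged.get(default_name, {}))
    # Several configs: start from the defaults, apply raw overrides in order
    result = dict(merged.get(default_name, {}))
    for name in names:
        if name == default_name:
            continue
        overrides = raw.get(name)
        if overrides is not None:
            result.update(overrides)
    return result


# Hypothetical data: "web" relaxes the disk limit, "strict" changes the CPU limit
raw = {"default": {}, "web": {"disk.used": 95}, "strict": {"cpu.load": 6}}
merged = {
    "default": {"cpu.load": 4, "disk.used": 90},
    "web": {"cpu.load": 4, "disk.used": 95},
    "strict": {"cpu.load": 6, "disk.used": 90},
}
host_mapping = {"web1": ["web", "strict"], "db1": ["web"]}

layered = thresholds_for_host("web1", merged, raw, host_mapping)

# Layering the pre-merged dicts instead would let the defaults baked into
# "strict" undo the disk override from "web"
naive = dict(merged["web"])
naive.update(merged["strict"])
```

With the raw overrides, `web1` keeps the disk limit from `web` and the CPU limit from `strict`. With the naive variant, the default disk limit copied into `strict` wins again, which is presumably what the separate raw dictionary is meant to avoid.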
@@ -759,15 +797,10 @@ class ThresholdChecker:
|
|||||||
# Update state and check for changes
|
# Update state and check for changes
|
||||||
old_level = alert_state.level
|
old_level = alert_state.level
|
||||||
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
|
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
|
||||||
# For check_value, we don't have full plugin data, pass None
|
self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, None)
|
||||||
lvl, message, formatted_msg = self._trigger_notification(host_name, metric_path, old_level, new_level, value, threshold, None)
|
|
||||||
# Update alert state with formatted message
|
|
||||||
alert_state.formatted_message = formatted_msg
|
|
||||||
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
|
|
||||||
return (old_level, new_level)
|
return (old_level, new_level)
|
||||||
elif new_level != AlertLevel.OK:
|
elif new_level != AlertLevel.OK:
|
||||||
# Check if we should re-notify
|
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)
|
||||||
self._check_renotify(host_name, alert_state, metric_path, value, threshold, None)
|
|
||||||
|
|
||||||
return None
|
return None
|
||||||
def check_plugin_data(
|
def check_plugin_data(
|
||||||
@@ -826,13 +859,9 @@ class ThresholdChecker:
|
|||||||
old_level = alert_state.level
|
old_level = alert_state.level
|
||||||
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
|
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
|
||||||
state_changes.append((metric_path, old_level, new_level, value))
|
state_changes.append((metric_path, old_level, new_level, value))
|
||||||
lvl, message, formatted_msg = self._trigger_notification(host_name, metric_path, old_level, new_level, value, threshold, data)
|
self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
|
||||||
# Update alert state with formatted message
|
|
||||||
alert_state.formatted_message = formatted_msg
|
|
||||||
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
|
|
||||||
elif new_level != AlertLevel.OK:
|
elif new_level != AlertLevel.OK:
|
||||||
# Check if we should re-notify
|
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
|
||||||
self._check_renotify(host_name, alert_state, metric_path, value, threshold, data)
|
|
||||||
|
|
||||||
# Check nested metrics (e.g., partition data in disk_monitor)
|
# Check nested metrics (e.g., partition data in disk_monitor)
|
||||||
self._check_nested_metrics(
|
self._check_nested_metrics(
|
||||||
@@ -895,20 +924,9 @@ class ThresholdChecker:
|
|||||||
old_level = alert_state.level
|
old_level = alert_state.level
|
||||||
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
|
if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
|
||||||
state_changes.append((metric_path, old_level, new_level, value))
|
state_changes.append((metric_path, old_level, new_level, value))
|
||||||
lvl, message, formatted_msg = self._trigger_notification(
|
self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
|
||||||
host_name,
|
|
||||||
metric_path,
|
|
||||||
old_level,
|
|
||||||
new_level,
|
|
||||||
value,
|
|
||||||
threshold,
|
|
||||||
data # Pass full plugin data for format string
|
|
||||||
)
|
|
||||||
# Update alert state with formatted message
|
|
||||||
alert_state.formatted_message = formatted_msg
|
|
||||||
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
|
|
||||||
elif new_level != AlertLevel.OK:
|
elif new_level != AlertLevel.OK:
|
||||||
self._check_renotify(host_name, alert_state, metric_path, value, threshold, data)
|
self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
|
||||||
|
|
||||||
def _trigger_notification(
|
def _trigger_notification(
|
||||||
self,
|
self,
|
||||||
@@ -947,7 +965,7 @@ class ThresholdChecker:
|
|||||||
|
|
||||||
# Format message
|
# Format message
|
||||||
if new_level == AlertLevel.OK:
|
if new_level == AlertLevel.OK:
|
||||||
lvl = "RECOVERED"
|
lvl = "RECOVER"
|
||||||
message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
|
message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
|
||||||
elif new_level == AlertLevel.WARNING:
|
elif new_level == AlertLevel.WARNING:
|
||||||
lvl = "WARNING"
|
lvl = "WARNING"
|
||||||
@@ -1003,23 +1021,23 @@ class ThresholdChecker:
|
|||||||
value: Any,
|
value: Any,
|
||||||
):
|
):
|
||||||
"""Send notification and log to journal/eventlog."""
|
"""Send notification and log to journal/eventlog."""
|
||||||
try:
|
from . import hbdclass
|
||||||
notify_mod.send_notification(
|
host = hbdclass.Host.hosts.get(host_name)
|
||||||
|
if host is not None and not host.watched:
|
||||||
|
eventlog(host_name, lvl, message, service="threshold")
|
||||||
|
return
|
||||||
|
asyncio.get_event_loop().create_task(notify_mod.send_notification(
|
||||||
host_name,
|
host_name,
|
||||||
notify_mod.Notification(
|
notify_mod.Notification(
|
||||||
title=f"[{lvl}] {host_name}",
|
title=f"[{lvl}] {host_name}",
|
||||||
body=message,
|
body=message,
|
||||||
level=lvl,
|
level=lvl,
|
||||||
),
|
),
|
||||||
)
|
))
|
||||||
logger.info("Notification sent: %s", message)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error("Failed to send notification: %s", e)
|
|
||||||
|
|
||||||
# Log to journal
|
# Log to journal
|
||||||
if self.journal is not None:
|
if self.journal is not None:
|
||||||
try:
|
try:
|
||||||
import asyncio
|
|
||||||
loop = asyncio.get_event_loop()
|
loop = asyncio.get_event_loop()
|
||||||
loop.create_task(self.journal.log_threshold_event(
|
loop.create_task(self.journal.log_threshold_event(
|
||||||
host_name=host_name,
|
host_name=host_name,
|
||||||
@@ -1083,6 +1101,74 @@ class ThresholdChecker:
|
|||||||
)
|
)
|
||||||
return f"(threshold: {op_symbol} {threshold_value})"
|
return f"(threshold: {op_symbol} {threshold_value})"
|
||||||
|
|
||||||
|
def _apply_grace(
|
||||||
|
self,
|
||||||
|
host_name: str,
|
||||||
|
alert_state: AlertState,
|
||||||
|
metric_path: str,
|
||||||
|
old_level: AlertLevel,
|
||||||
|
new_level: AlertLevel,
|
||||||
|
value: Any,
|
||||||
|
threshold: ThresholdConfig,
|
||||||
|
plugin_data: Optional[Dict[str, Any]],
|
||||||
|
) -> None:
|
||||||
|
"""Handle a state-change transition with grace-period logic.
|
||||||
|
|
||||||
|
Transitioning INTO alert: defers the notification for grace_seconds.
|
||||||
|
Transitioning TO OK:
|
||||||
|
- Still in grace window (pending_since set): suppresses both the alert
|
||||||
|
and the recovery — the spike never warranted a page.
|
||||||
|
- Past grace: fires the RECOVER notification normally.
|
||||||
|
"""
|
||||||
|
lvl, message, formatted_msg = self._trigger_notification(
|
||||||
|
host_name, metric_path, old_level, new_level, value, threshold, plugin_data
|
||||||
|
)
|
||||||
|
alert_state.formatted_message = formatted_msg
|
||||||
|
|
||||||
|
if new_level == AlertLevel.OK:
|
||||||
|
if alert_state.pending_since is not None:
|
||||||
|
logger.info(
|
||||||
|
"Alert suppressed (recovered within %.0fs grace): %s on %s",
|
||||||
|
self.grace_seconds, metric_path, host_name,
|
||||||
|
)
|
||||||
|
alert_state.pending_since = None
|
||||||
|
else:
|
||||||
|
self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
|
||||||
|
else:
|
||||||
|
alert_state.pending_since = time.time()
|
||||||
|
logger.debug(
|
||||||
|
"Alert deferred (%.0fs grace): %s on %s = %s",
|
||||||
|
self.grace_seconds, metric_path, host_name, value,
|
||||||
|
)
|
||||||
|
|
||||||
|
def _check_pending_or_renotify(
|
||||||
|
self,
|
||||||
|
host_name: str,
|
||||||
|
alert_state: AlertState,
|
||||||
|
metric_path: str,
|
||||||
|
value: Any,
|
||||||
|
threshold: ThresholdConfig,
|
||||||
|
plugin_data: Optional[Dict[str, Any]],
|
||||||
|
) -> None:
|
||||||
|
"""Called when alert level is unchanged and non-OK.
|
||||||
|
|
||||||
|
If a deferred notification is pending and grace_seconds have elapsed,
|
||||||
|
fires it now. Otherwise falls through to normal reminder logic.
|
||||||
|
"""
|
||||||
|
if alert_state.pending_since is not None:
|
||||||
|
if time.time() - alert_state.pending_since >= self.grace_seconds:
|
||||||
|
lvl, message, formatted_msg = self._trigger_notification(
|
||||||
|
host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data
|
||||||
|
)
|
||||||
|
alert_state.formatted_message = formatted_msg
|
||||||
|
self._send_notification(
|
||||||
|
host_name, lvl, message, metric_path, AlertLevel.OK, alert_state.level, value
|
||||||
|
)
|
||||||
|
alert_state.pending_since = None
|
||||||
|
# else: still within grace window, do nothing
|
||||||
|
else:
|
||||||
|
self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data)
|
||||||
|
|
||||||
def _check_renotify(
|
def _check_renotify(
|
||||||
self,
|
self,
|
||||||
host_name: str,
|
host_name: str,
|
||||||
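Review note: the grace-period logic added in `_apply_grace` and `_check_pending_or_renotify` can be read as a small state machine around `pending_since`. The sketch below is a simplified model under stated assumptions: `GraceGate`, its method names and the injectable `clock` are hypothetical, levels are plain strings, and the reminder logic is left out. It is meant to show the three outcomes described in the docstrings, not to replace the real methods.

```python
import time
from typing import Callable, List, Optional


class GraceGate:
    """Toy model of the deferred-notification logic (not the real class)."""

    def __init__(self, grace_seconds: float = 2.0,
                 clock: Callable[[], float] = time.time):
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.pending_since: Optional[float] = None
        self.sent: List[str] = []  # stands in for real notifications

    def on_level_change(self, new_level: str) -> None:
        if new_level == "OK":
            if self.pending_since is not None:
                # Recovered inside the grace window: drop alert and recovery
                self.pending_since = None
            else:
                self.sent.append("RECOVER")
        else:
            # Entering an alert level: defer the notification
            self.pending_since = self.clock()

    def on_unchanged_alert(self, level: str) -> None:
        if self.pending_since is None:
            return  # the real code would run the reminder logic here
        if self.clock() - self.pending_since >= self.grace_seconds:
            self.sent.append(level)
            self.pending_since = None


# Fake clock so the example does not need to sleep
now = [100.0]
gate = GraceGate(grace_seconds=2.0, clock=lambda: now[0])

gate.on_level_change("WARNING")      # deferred
now[0] += 1.0
gate.on_level_change("OK")           # short spike: nothing is sent
spike_sent = list(gate.sent)

gate.on_level_change("CRITICAL")     # deferred again
now[0] += 1.0
gate.on_unchanged_alert("CRITICAL")  # still inside the grace window
now[0] += 1.5
gate.on_unchanged_alert("CRITICAL")  # grace elapsed: alert is sent
now[0] += 10.0
gate.on_level_change("OK")           # past grace: recovery is sent
```

One consequence visible in this model: the deferred alert only goes out on the next check after the grace period has elapsed, so the effective delay depends on how often values are checked, not only on `grace_seconds`.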
@@ -1143,20 +1229,20 @@ class ThresholdChecker:
|
|||||||
else:
|
else:
|
||||||
message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
|
message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
|
||||||
|
|
||||||
try:
|
from . import hbdclass
|
||||||
notify_mod.send_notification(
|
host = hbdclass.Host.hosts.get(host_name)
|
||||||
|
if host is None or host.watched:
|
||||||
|
asyncio.get_event_loop().create_task(notify_mod.send_notification(
|
||||||
host_name,
|
host_name,
|
||||||
notify_mod.Notification(
|
notify_mod.Notification(
|
||||||
title=f"[REMINDER/{alert_state.level.name}] {host_name}",
|
title=f"[REMINDER/{alert_state.level.name}] {host_name}",
|
||||||
body=message,
|
body=message,
|
||||||
level=alert_state.level.name,
|
level=alert_state.level.name,
|
||||||
),
|
),
|
||||||
)
|
))
|
||||||
|
logger.info("Re-notification sent: %s", message)
|
||||||
alert_state.last_notification = now
|
alert_state.last_notification = now
|
||||||
alert_state.notification_count += 1
|
alert_state.notification_count += 1
|
||||||
logger.info("Re-notification sent: %s", message)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error("Failed to send re-notification: %s", e)
|
|
||||||
|
|
||||||
def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
|
def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
|
||||||
"""
|
"""
|
||||||
|
|||||||
+41
-15
@@ -171,6 +171,24 @@ def dicttos(ID, d):
|
|||||||
DROPOVERDUE = 7 * 24 * 3600 # seconds before an overdue host becomes UNKNOWN
|
DROPOVERDUE = 7 * 24 * 3600 # seconds before an overdue host becomes UNKNOWN
|
||||||
|
|
||||||
|
|
||||||
|
def _set_connectivity_alert(host, afam, level_name):
|
||||||
|
"""Update (or clear) a connectivity alert_state entry for a host/address-family.
|
||||||
|
|
||||||
|
level_name is "CRITICAL", "WARNING", or "OK". "OK" removes the entry so
|
||||||
|
that recovered hosts don't clutter the Alerts Dashboard.
|
||||||
|
"""
|
||||||
|
from .threshold import AlertState, AlertLevel
|
||||||
|
metric_path = f"connectivity.{afam}"
|
||||||
|
level = getattr(AlertLevel, level_name, AlertLevel.OK)
|
||||||
|
if level == AlertLevel.OK:
|
||||||
|
host.alert_states.pop(metric_path, None)
|
||||||
|
return
|
||||||
|
if metric_path not in host.alert_states:
|
||||||
|
host.alert_states[metric_path] = AlertState(metric_path)
|
||||||
|
state = host.alert_states[metric_path]
|
||||||
|
state.update(level, level_name)
|
||||||
|
|
||||||
|
|
||||||
def _make_timer_callbacks(uname, host, ctx):
|
def _make_timer_callbacks(uname, host, ctx):
|
||||||
"""Return (on_overdue, on_unknown) async callbacks for connection timer logic.
|
"""Return (on_overdue, on_unknown) async callbacks for connection timer logic.
|
||||||
|
|
||||||
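Review note: `_set_connectivity_alert` keeps one entry per address family under `connectivity.<afam>` and removes it on recovery. The sketch below is a reduced model of that bookkeeping. It assumes a plain dict in place of `host.alert_states` and stores the level directly instead of an `AlertState` object; the `AlertLevel` enum here is a local stand-in with the three names used in the diff.

```python
from enum import Enum
from typing import Dict


class AlertLevel(Enum):  # local stand-in for threshold.AlertLevel
    OK = 0
    WARNING = 1
    CRITICAL = 2


def set_connectivity_alert(alert_states: Dict[str, AlertLevel],
                           afam: str, level_name: str) -> None:
    """Record or clear the connectivity alert for one address family."""
    metric_path = f"connectivity.{afam}"
    # Unknown level names fall back to OK, as in the diff
    level = getattr(AlertLevel, level_name, AlertLevel.OK)
    if level == AlertLevel.OK:
        alert_states.pop(metric_path, None)  # recovered: remove the entry
        return
    alert_states[metric_path] = level


states: Dict[str, AlertLevel] = {}
set_connectivity_alert(states, "ipv4", "CRITICAL")
after_overdue = dict(states)
set_connectivity_alert(states, "ipv4", "OK")
after_recover = dict(states)
set_connectivity_alert(states, "ipv6", "WARNING")
set_connectivity_alert(states, "ipv6", "BOGUS")  # unknown name acts like OK
```

The `getattr` fallback means a misspelled level name silently clears the alert instead of raising, which may or may not be the intended behaviour.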
@@ -182,6 +200,7 @@ def _make_timer_callbacks(uname, host, ctx):
|
|||||||
|
|
||||||
async def on_unknown(connection):
|
async def on_unknown(connection):
|
||||||
connection.newstate(connection.__class__.UNKNOWN, connection.lastbeat)
|
connection.newstate(connection.__class__.UNKNOWN, connection.lastbeat)
|
||||||
|
# Keep connectivity alert active when host transitions to unknown
|
||||||
if msg_to_websockets:
|
if msg_to_websockets:
|
||||||
msg_to_websockets("host", host.stateinfo())
|
msg_to_websockets("host", host.stateinfo())
|
||||||
|
|
||||||
@@ -192,10 +211,13 @@ def _make_timer_callbacks(uname, host, ctx):
|
|||||||
connection.newstate(connection.__class__.OVERDUE, now, cfg.get("grace", 2))
|
connection.newstate(connection.__class__.OVERDUE, now, cfg.get("grace", 2))
|
||||||
msg = f"{connection.afam} overdue"
|
msg = f"{connection.afam} overdue"
|
||||||
eventlog(uname, "CRITICAL", msg)
|
eventlog(uname, "CRITICAL", msg)
|
||||||
notify_mod.send_notification(
|
if host.watched:
|
||||||
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
uname,
|
uname,
|
||||||
notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
|
notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
|
||||||
)
|
))
|
||||||
|
# Track in alert_states so the Alerts Dashboard shows this
|
||||||
|
_set_connectivity_alert(host, connection.afam, "CRITICAL")
|
||||||
if threshold_checker:
|
if threshold_checker:
|
||||||
threshold_checker.check_value(
|
threshold_checker.check_value(
|
||||||
host_name=uname,
|
host_name=uname,
|
||||||
@@ -294,7 +316,6 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
|
|
||||||
cfg = ctx.get("config", {})
|
cfg = ctx.get("config", {})
|
||||||
hbdcls = ctx.get("hbdclass")
|
hbdcls = ctx.get("hbdclass")
|
||||||
log = ctx.get("log")
|
|
||||||
msg_to_websockets = ctx.get("msg_to_websockets")
|
msg_to_websockets = ctx.get("msg_to_websockets")
|
||||||
DEBUG = ctx.get("DEBUG", 0)
|
DEBUG = ctx.get("DEBUG", 0)
|
||||||
verbose = ctx.get("verbose", False)
|
verbose = ctx.get("verbose", False)
|
||||||
@@ -387,10 +408,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
|
|
||||||
if res:
|
if res:
|
||||||
eventlog(uname, "WARNING", res)
|
eventlog(uname, "WARNING", res)
|
||||||
notify_mod.send_notification(
|
if host.watched:
|
||||||
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
uname,
|
uname,
|
||||||
notify_mod.Notification(title=f"[WARNING] {uname}", body=res, level="WARNING"),
|
notify_mod.Notification(title=f"[WARNING] {uname}", body=res, level="WARNING"),
|
||||||
)
|
))
|
||||||
|
|
||||||
interval = int(msg.get("interval", 0) or 0)
|
interval = int(msg.get("interval", 0) or 0)
|
||||||
shutdown = msg.get("shutdown", 0)
|
shutdown = msg.get("shutdown", 0)
|
||||||
@@ -400,16 +422,19 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
|
|
||||||
if boot:
|
if boot:
|
||||||
eventlog(uname, "INFO", "booted")
|
eventlog(uname, "INFO", "booted")
|
||||||
notify_mod.send_notification(
|
if host.watched:
|
||||||
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
uname,
|
uname,
|
||||||
notify_mod.Notification(title=f"[INFO] {uname}", body=f"{host.name} booted", level="INFO"),
|
notify_mod.Notification(title=f"[INFO] {uname}", body=f"{host.name} booted", level="INFO"),
|
||||||
)
|
))
|
||||||
if message:
|
if message:
|
||||||
eventlog(uname, "INFO", "msg: %s" % message, service=service)
|
eventlog(uname, "INFO", "msg: %s" % message, service=service)
|
||||||
|
|
||||||
if conn.getstate() != hbdcls.Connection.UP:
|
if conn.getstate() != hbdcls.Connection.UP:
|
||||||
lasts = conn.state
|
lasts = conn.state
|
||||||
d = conn.newstate(hbdcls.Connection.UP, now)
|
d = conn.newstate(hbdcls.Connection.UP, now)
|
||||||
|
# Clear connectivity alert now that the host is back up
|
||||||
|
_set_connectivity_alert(host, conn.afam, "OK")
|
||||||
# Don't log/notify RECOVER for a brand-new host seen for the first time —
|
# Don't log/notify RECOVER for a brand-new host seen for the first time —
|
||||||
# it was never down, it just hasn't been seen before.
|
# it was never down, it just hasn't been seen before.
|
||||||
if not newh:
|
if not newh:
|
||||||
@@ -418,10 +443,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
else:
|
else:
|
||||||
m = "%s back after being %s for %s" % (conn.afam, lasts, dur(d))
|
m = "%s back after being %s for %s" % (conn.afam, lasts, dur(d))
|
||||||
eventlog(uname, "RECOVER", m)
|
eventlog(uname, "RECOVER", m)
|
||||||
notify_mod.send_notification(
|
if host.watched:
|
||||||
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
uname,
|
uname,
|
||||||
notify_mod.Notification(title=f"[RECOVER] {uname}", body=m, level="RECOVER"),
|
notify_mod.Notification(title=f"[RECOVER] {uname}", body=m, level="RECOVER"),
|
||||||
)
|
))
|
||||||
|
|
||||||
if boot or newh:
|
if boot or newh:
|
||||||
host.upcount = host.doesack
|
host.upcount = host.doesack
|
||||||
@@ -431,11 +457,13 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
if shutdown:
|
if shutdown:
|
||||||
m = "%s shutdown" % conn.afam
|
m = "%s shutdown" % conn.afam
|
||||||
eventlog(uname, "INFO", m)
|
eventlog(uname, "INFO", m)
|
||||||
notify_mod.send_notification(
|
if host.watched:
|
||||||
|
asyncio.create_task(notify_mod.send_notification(
|
||||||
uname,
|
uname,
|
||||||
notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
|
notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
|
||||||
)
|
))
|
||||||
conn.newstate(hbdcls.Connection.DOWN, now)
|
conn.newstate(hbdcls.Connection.DOWN, now)
|
||||||
|
_set_connectivity_alert(host, conn.afam, "CRITICAL")
|
||||||
|
|
||||||
if interval > 0:
|
if interval > 0:
|
||||||
host.interval = interval
|
host.interval = interval
|
||||||
@@ -467,12 +495,10 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
|
|||||||
op, rmsg = host.cmds[0]
|
op, rmsg = host.cmds[0]
|
||||||
if op == "CMD":
|
if op == "CMD":
|
||||||
del host.cmds[0]
|
del host.cmds[0]
|
||||||
if log:
|
eventlog(uname, "INFO", "command sent")
|
||||||
log(uname, "command sent")
|
|
||||||
elif op == "UPD":
|
elif op == "UPD":
|
||||||
del host.cmds[0]
|
del host.cmds[0]
|
||||||
if log:
|
eventlog(uname, "INFO", "update initiated")
|
||||||
log(uname, "update initiated")
|
|
||||||
opkt = dicttos(op, rmsg)
|
opkt = dicttos(op, rmsg)
|
||||||
try:
|
try:
|
||||||
transport.sendto(opkt, addr)
|
transport.sendto(opkt, addr)
|
||||||
|
|||||||
+52
-9
@@ -13,7 +13,8 @@ from . import data
|
|||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
_connections: set = set()
|
# Map of WebSocket → User object (or None when auth is disabled)
|
||||||
|
_connections: dict = {}
|
||||||
_loop: Optional[asyncio.AbstractEventLoop] = None
|
_loop: Optional[asyncio.AbstractEventLoop] = None
|
||||||
_get_hosts: Optional[Callable[[], Iterable]] = None
|
_get_hosts: Optional[Callable[[], Iterable]] = None
|
||||||
_verbose: bool = False
|
_verbose: bool = False
|
||||||
@@ -34,22 +35,52 @@ def setup(
|
|||||||
_verbose = verbose
|
_verbose = verbose
|
||||||
|
|
||||||
|
|
||||||
|
def _user_can_see_host(user, host_name: str) -> bool:
|
||||||
|
"""Return True if *user* may see updates for *host_name* (manager or higher)."""
|
||||||
|
from . import hbdclass, users as users_mod
|
||||||
|
if user is None or not users_mod.users_enabled():
|
||||||
|
return True
|
||||||
|
if user.admin:
|
||||||
|
return True
|
||||||
|
host = hbdclass.Host.hosts.get(host_name)
|
||||||
|
if host is None:
|
||||||
|
return False
|
||||||
|
return host.is_manager(user.username)
|
||||||
|
|
||||||
|
|
||||||
|
def _get_token(request) -> str:
|
||||||
|
"""Extract session token from request (mirrors logic in http.py)."""
|
||||||
|
auth = request.headers.get("Authorization", "")
|
||||||
|
if auth.startswith("Bearer "):
|
||||||
|
return auth[7:].strip()
|
||||||
|
token = request.headers.get("X-Auth-Token", "")
|
||||||
|
if token:
|
||||||
|
return token
|
||||||
|
return request.cookies.get("hbd_session", "")
|
||||||
|
|
||||||
|
|
||||||
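Review note: `_get_token` checks three sources in a fixed order: a `Bearer` value in `Authorization`, then `X-Auth-Token`, then the `hbd_session` cookie. The sketch below reproduces that order against plain dictionaries so it can be run without aiohttp. The function name `get_token` and the dictionary inputs are assumptions for illustration; real aiohttp headers are case-insensitive, which a plain `dict` is not.

```python
from typing import Mapping


def get_token(headers: Mapping[str, str], cookies: Mapping[str, str]) -> str:
    """Return the session token, using the lookup order shown in the diff."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[7:].strip()  # drop the 7-character "Bearer " prefix
    token = headers.get("X-Auth-Token", "")
    if token:
        return token
    return cookies.get("hbd_session", "")  # empty string when nothing is set


# The Authorization header wins over the other two sources.
from_bearer = get_token(
    {"Authorization": "Bearer abc123 ", "X-Auth-Token": "ignored"},
    {"hbd_session": "ignored-too"},
)
from_header = get_token({"X-Auth-Token": "tok"}, {"hbd_session": "cookie"})
from_cookie = get_token({}, {"hbd_session": "cookie"})
missing = get_token({}, {})
```

An empty result means "no token", which the handler in this diff treats as `user = None`.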
async def handler(request):
|
async def handler(request):
|
||||||
"""aiohttp WebSocket upgrade handler — register as GET /ws."""
|
"""aiohttp WebSocket upgrade handler — register as GET /ws."""
|
||||||
from aiohttp import web
|
from aiohttp import web
|
||||||
|
from . import users as users_mod
|
||||||
|
|
||||||
ws = web.WebSocketResponse()
|
ws = web.WebSocketResponse()
|
||||||
await ws.prepare(request)
|
await ws.prepare(request)
|
||||||
|
|
||||||
_connections.add(ws)
|
token = _get_token(request)
|
||||||
|
user = users_mod.get_session_user(token) if token else None
|
||||||
|
|
||||||
|
_connections[ws] = user
|
||||||
remote = request.remote
|
remote = request.remote
|
||||||
logger.info("WebSocket connected from %s", remote)
|
logger.info("WebSocket connected from %s", remote)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
# Send current host state to the new client
|
# Send current host state, filtered to hosts this user may see
|
||||||
if _get_hosts:
|
if _get_hosts:
|
||||||
try:
|
try:
|
||||||
for h in list(_get_hosts()):
|
for h in list(_get_hosts()):
|
||||||
|
host_name = h.get("raw_name") or h.get("name", "")
|
||||||
|
if _user_can_see_host(user, host_name):
|
||||||
await ws.send_str(json.dumps({"type": "host", "data": h}))
|
await ws.send_str(json.dumps({"type": "host", "data": h}))
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error("Error sending initial hosts: %s", e)
|
logger.error("Error sending initial hosts: %s", e)
|
||||||
@@ -74,7 +105,7 @@ async def handler(request):
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.exception("WebSocket handler error from %s: %s", remote, e)
|
logger.exception("WebSocket handler error from %s: %s", remote, e)
|
||||||
finally:
|
finally:
|
||||||
_connections.discard(ws)
|
_connections.pop(ws, None)
|
||||||
logger.info("WebSocket disconnected from %s", remote)
|
logger.info("WebSocket disconnected from %s", remote)
|
||||||
|
|
||||||
return ws
|
return ws
|
||||||
@@ -83,25 +114,37 @@ async def handler(request):
|
|||||||
def broadcast(typ: str, payload) -> bool:
|
def broadcast(typ: str, payload) -> bool:
|
||||||
"""Thread-safe broadcast to all connected WebSocket clients.
|
"""Thread-safe broadcast to all connected WebSocket clients.
|
||||||
|
|
||||||
|
For host and plugin updates, only sends to clients whose user has
|
||||||
|
manager-or-higher access to that host. Other message types are
|
||||||
|
broadcast to all clients.
|
||||||
|
|
||||||
Can be called from any thread; schedules sends on the event loop.
|
Can be called from any thread; schedules sends on the event loop.
|
||||||
Returns False if the loop is not running yet.
|
Returns False if the loop is not running yet.
|
||||||
"""
|
"""
|
||||||
if not _loop:
|
if not _loop:
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
# Determine the host name for access-filtered message types
|
||||||
|
host_name: Optional[str] = None
|
||||||
|
if typ in ("host", "plugin"):
|
||||||
|
host_name = payload.get("raw_name") or payload.get("host") or payload.get("name")
|
||||||
|
|
||||||
jmsg = json.dumps({"type": typ, "data": payload})
|
jmsg = json.dumps({"type": typ, "data": payload})
|
||||||
|
|
||||||
async def _send_all():
|
async def _send_all():
|
||||||
dead = set()
|
dead = set()
|
||||||
for ws in list(_connections):
|
for ws, user in list(_connections.items()):
|
||||||
try:
|
try:
|
||||||
if not ws.closed:
|
if ws.closed:
|
||||||
await ws.send_str(jmsg)
|
|
||||||
else:
|
|
||||||
dead.add(ws)
|
dead.add(ws)
|
||||||
|
continue
|
||||||
|
if host_name is not None and not _user_can_see_host(user, host_name):
|
||||||
|
continue
|
||||||
|
await ws.send_str(jmsg)
|
||||||
except Exception:
|
except Exception:
|
||||||
dead.add(ws)
|
dead.add(ws)
|
||||||
for ws in dead:
|
for ws in dead:
|
||||||
_connections.discard(ws)
|
_connections.pop(ws, None)
|
||||||
|
|
||||||
asyncio.run_coroutine_threadsafe(_send_all(), _loop)
|
asyncio.run_coroutine_threadsafe(_send_all(), _loop)
|
||||||
return True
|
return True
|
||||||
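Review note: the change from a `set` of sockets to a `dict` of socket to user is what makes per-client filtering possible in `broadcast()`. The sketch below shows the same control flow in isolation. `FakeSocket`, the `MANAGERS` table and `user_can_see_host` are hypothetical stand-ins; the real check goes through `hbdclass.Host` and `users_mod`, which are not part of this excerpt.

```python
import asyncio
import json
from typing import Optional


class FakeSocket:
    """Hypothetical stand-in for aiohttp's WebSocketResponse."""

    def __init__(self) -> None:
        self.closed = False
        self.received: list[str] = []

    async def send_str(self, msg: str) -> None:
        self.received.append(msg)


# Hypothetical access table: host name -> usernames with manager access.
MANAGERS = {"web1": {"alice"}}


def user_can_see_host(user: Optional[str], host_name: str) -> bool:
    if user is None:  # no user attached: no filtering, as in the diff
        return True
    return user in MANAGERS.get(host_name, set())


async def send_filtered(connections: dict, typ: str, payload: dict) -> None:
    # Only "host" and "plugin" messages carry a host name to filter on.
    host_name = None
    if typ in ("host", "plugin"):
        host_name = payload.get("raw_name") or payload.get("host") or payload.get("name")
    jmsg = json.dumps({"type": typ, "data": payload})
    dead = set()
    for ws, user in list(connections.items()):
        if ws.closed:
            dead.add(ws)
            continue
        if host_name is not None and not user_can_see_host(user, host_name):
            continue
        await ws.send_str(jmsg)
    for ws in dead:
        connections.pop(ws, None)  # prune closed sockets


alice_ws, bob_ws, gone_ws = FakeSocket(), FakeSocket(), FakeSocket()
gone_ws.closed = True
conns = {alice_ws: "alice", bob_ws: "bob", gone_ws: "alice"}
asyncio.run(send_filtered(conns, "host", {"name": "web1", "state": "UP"}))
asyncio.run(send_filtered(conns, "event", {"text": "hello"}))
```

After both calls, `alice_ws` has received the host update and the event, `bob_ws` only the event, and the closed socket has been dropped from `conns`.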
|
|||||||
+7
-1
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|||||||
|
|
||||||
[project]
|
[project]
|
||||||
name = "hbd"
|
name = "hbd"
|
||||||
version = "5.1.1"
|
version = "5.1.14"
|
||||||
description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
|
description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
|
||||||
readme = "README.md"
|
readme = "README.md"
|
||||||
requires-python = ">=3.11"
|
requires-python = ">=3.11"
|
||||||
@@ -34,6 +34,9 @@ server = [
|
|||||||
"matrix-nio>=0.24",
|
"matrix-nio>=0.24",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
# Minimal client — hbc_mini only, no external dependencies
|
||||||
|
mini = []
|
||||||
|
|
||||||
# Install both client and server
|
# Install both client and server
|
||||||
all = [
|
all = [
|
||||||
"hbd[client,server]",
|
"hbd[client,server]",
|
||||||
@@ -54,6 +57,9 @@ dev = [
|
|||||||
hbd = "hbd.server.cli:main"
|
hbd = "hbd.server.cli:main"
|
||||||
hbc = "hbd.client.main:main"
|
hbc = "hbd.client.main:main"
|
||||||
|
|
||||||
|
[tool.setuptools]
|
||||||
|
script-files = ["scripts/hb_install.sh", "scripts/hbc_mini.py"]
|
||||||
|
|
||||||
[tool.setuptools.packages.find]
|
[tool.setuptools.packages.find]
|
||||||
where = ["."]
|
where = ["."]
|
||||||
include = ["hbd*"]
|
include = ["hbd*"]
|
||||||
|
|||||||
@@ -4,12 +4,14 @@ set -e
|
|||||||
uv version --bump patch
|
uv version --bump patch
|
||||||
VER=$(uv version --short)
|
VER=$(uv version --short)
|
||||||
sed -i".bak" "s/__version__ = \"[0-9.]*\"\(.*\)$/__version__ = \"$VER\"\1/" hbd/__init__.py
|
sed -i".bak" "s/__version__ = \"[0-9.]*\"\(.*\)$/__version__ = \"$VER\"\1/" hbd/__init__.py
|
||||||
|
sed -i".bak" "s/__version__ = \"[0-9.]*\"\(.*\)$/__version__ = \"$VER\"\1/" scripts/hbc_mini.py
|
||||||
|
|
||||||
# commit pyproject.toml
|
# commit pyproject.toml
|
||||||
git commit -m "version $VER" pyproject.toml hbd/__init__.py
|
git commit -m "version $VER" pyproject.toml hbd/__init__.py scripts/hbc_mini.py
|
||||||
git push
|
git push
|
||||||
# tag version
|
# tag version
|
||||||
git tag -a v$VER -m "Version $VER"
|
git tag -a v$VER -m "Version $VER"
|
||||||
git push --tags
|
git push --tags
|
||||||
|
|
||||||
rm hbd/__init__.py.bak
|
rm hbd/__init__.py.bak
|
||||||
|
rm scripts/hbc_mini.py.bak
|
||||||
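Review note: the two `sed` calls rewrite the `__version__ = "..."` line in place while preserving whatever follows the closing quote on that line (the `\(.*\)` group). For readers less familiar with `sed`, the same substitution can be sketched in Python. `set_version` and the sample text are illustrative assumptions, not part of the release script.

```python
import re

# Same shape as the sed pattern: version string, then the rest of the line.
_VERSION_RE = re.compile(r'__version__ = "[0-9.]*"(.*)$', re.MULTILINE)


def set_version(source: str, new_version: str) -> str:
    """Replace the version string, keeping any trailing text on the line."""
    return _VERSION_RE.sub(
        lambda m: f'__version__ = "{new_version}"{m.group(1)}', source
    )


sample = '__version__ = "5.1.13"  # bumped by release script\nprint("x")\n'
updated = set_version(sample, "5.1.14")
```

Using a replacement function rather than a replacement string avoids having to escape backslashes or group references that might appear in the version text.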
|
|||||||
Executable
+115
@@ -0,0 +1,115 @@
|
|||||||
|
#!/bin/sh
|
||||||
|
|
||||||
|
# Helper script to install the heartbeat tools. By default, it will only
|
||||||
|
# install the heartbeat client, hbc. The server is installed when the arg 'server' is passed
|
||||||
|
# to the script. The script will install the heartbeat tools in a python
|
||||||
|
# virtual environment in ~/venvs/hbd. The hbd and hbc commands will be
|
||||||
|
# installed from the wheel and symlinked to ~/bin/hbd and ~/bin/hbc,
|
||||||
|
# respectively. If the virtual environment already exists, it will be
|
||||||
|
# reused. The script will also remove any existing symlinks for hbd and hbc
|
||||||
|
# in ~/bin before creating new ones.
|
||||||
|
|
||||||
|
set -e
|
||||||
|
what=$1
|
||||||
|
on_ha=0
|
||||||
|
where=""
|
||||||
|
venv=""
|
||||||
|
[ "$2" = "HA" ] && on_ha=1
|
||||||
|
[ -z "$what" ] && what="client"
|
||||||
|
|
||||||
|
if [ -d /homeassistant ]; then # if running from HA command line
|
||||||
|
echo "HA, running \"docker exec homeassistant /config/bin/hb_install.sh $@\""
|
||||||
|
docker exec homeassistant /config/bin/hb_install.sh $@ HA
|
||||||
|
rc=$?
|
||||||
|
if [ $rc -ne 0 ]; then
|
||||||
|
echo "Failed to install heartbeat in HA, please check the logs for more details"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $on_ha -eq 1 ] || [ -r /.dockerenv ] && [ -d /config/bin ]; then
|
||||||
|
# Installing under docker on Home Assistant OS, using /config/bin for executables and /config/venvs for virtual environments
|
||||||
|
echo "Home Assistant OS detected, installing under docker"
|
||||||
|
where="/config/bin"
|
||||||
|
venv="/config/venvs"
|
||||||
|
else
|
||||||
|
if [ ! -d $HOME/.local/bin ] && [ ! -d $HOME/bin ]; then
|
||||||
|
echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
for where in $HOME/bin $HOME/.local/bin notset ; do
|
||||||
|
if echo ":$PATH:" | grep -q ":$where:" ; then
|
||||||
|
break
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
if [ "$where" = "notset" ]; then
|
||||||
|
echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
if [ "$what" = "mini" ]; then
|
||||||
|
venv=""
|
||||||
|
else
|
||||||
|
venv="$HOME/venvs"
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
echo "Installing $what to $where"
|
||||||
|
if [ ! -z "$venv" ]; then
|
||||||
|
echo "Using virtual environment at $venv/hbd"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ "$venv" != "" ] && [ ! -d $venv/hbd ]; then
|
||||||
|
arg=""
|
||||||
|
have_pip=$(python3 -c "import pip" >/dev/null 2>&1 && echo "Installed" || echo "Not Installed")
|
||||||
|
if [ "$have_pip" = "Not Installed" ]; then
|
||||||
|
# some systems do not have pip installed by default, so we need to fetch get-pip.py and install pip
|
||||||
|
echo "pip is not installed, fetching get-pip.py and installing pip"
|
||||||
|
arg="--without-pip"
|
||||||
|
fi
|
||||||
|
mkdir -p $venv
|
||||||
|
have_venv=$(python3 -c "import venv" >/dev/null 2>&1 && echo "Installed" || echo "Not Installed")
|
||||||
|
if [ "$have_venv" = "Not Installed" ]; then
|
||||||
|
if [ "$have_pip" = "Not Installed" ]; then
|
||||||
|
echo "python has no venv, and no pip to install virtualenv, cannot continue"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "python venv module not found, installing virtualenv"
|
||||||
|
python3 -m pip install --user virtualenv
|
||||||
|
python3 -m virtualenv $venv/hbd --system-site-packages $arg
|
||||||
|
else
|
||||||
|
python3 -m venv $venv/hbd --system-site-packages $arg
|
||||||
|
fi
|
||||||
|
. $venv/hbd/bin/activate
|
||||||
|
if [ -n "$arg" ]; then
|
||||||
|
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py
|
||||||
|
fi
|
||||||
|
deactivate
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ! -z "$venv" ]; then
|
||||||
|
. $venv/hbd/bin/activate
|
||||||
|
fi
|
||||||
|
if [ "$what" = "mini" ]; then
|
||||||
|
curl -s -o $where/hbc_mini https://git.wrede.ca/andreas/heartbeat/raw/branch/master/scripts/hbc_mini.py
|
||||||
|
chmod +x $where/hbc_mini
|
||||||
|
else
|
||||||
|
python3 -m pip install --upgrade --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple "hbd[$what]"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ! -z "$venv" ]; then
|
||||||
|
echo "linking executables to $where"
|
||||||
|
if [ "$what" = "server" ]; then
|
||||||
|
rm -f $where/hbd
|
||||||
|
ln -sf $(which hbd) $where/hbd
|
||||||
|
elif [ "$what" = "client" ]; then
|
||||||
|
rm -f $where/hbc
|
||||||
|
ln -sf $(which hbc) $where/hbc
|
||||||
|
fi
|
||||||
|
rm -f $where/hb_install.sh
|
||||||
|
ln -sf $(which hb_install.sh) $where/hb_install.sh
|
||||||
|
fi
|
||||||
|
echo "Installation complete. To upgrade, run the following:"
|
||||||
|
echo " $where/hb_install.sh $what"
|
||||||
|
echo "To install on another machine, obtain the install script"
|
||||||
|
echo "from https://git.wrede.ca/andreas/heartbeat/raw/branch/master/scripts/hb_install.sh"
|
||||||
|
echo "and then run sh hb_install.sh [mini|client]"
|
||||||
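Review note: two decisions in `hb_install.sh` are easy to get wrong in shell: picking the first candidate bin directory that is actually on `PATH`, and probing whether `pip` or `venv` can be imported. The sketch below expresses both in Python for comparison. The function names and the sample paths are assumptions for illustration; the installer itself stays a shell script.

```python
import importlib.util
import os
from typing import Optional, Sequence


def pick_bin_dir(candidates: Sequence[str], path_env: str) -> Optional[str]:
    """Return the first candidate that is an exact entry of PATH, else None."""
    entries = path_env.split(os.pathsep)
    for cand in candidates:
        if cand in entries:  # exact match, no regex or substring surprises
            return cand
    return None


def have_module(name: str) -> bool:
    """Report whether a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


path_env = os.pathsep.join(["/usr/bin", "/home/demo/.local/bin"])
chosen = pick_bin_dir(["/home/demo/bin", "/home/demo/.local/bin"], path_env)
nothing = pick_bin_dir(["/home/demo/bin"], path_env)
```

Returning `None` plays the role of the script's `notset` sentinel: the caller has to handle the case where no candidate is on `PATH`.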
Executable
+1154
File diff suppressed because it is too large
@@ -1,58 +0,0 @@
|
|||||||
#!/bin/sh
|
|
||||||
|
|
||||||
# install the heartbeat client, hbc. The server is installed when the arg 'server' is passed
|
|
||||||
# install the heartbeat client, hbc. The server is installed when the arg 'server' is passed
|
|
||||||
# to the script. The script will install the heartbeat tools in a python
|
|
||||||
# virtual environment in ~/venvs/hbd. The hbd and hbc commands will be
|
|
||||||
# installed from the wheel and symlinked to ~/bin/hbd and ~/bin/hbc,
|
|
||||||
# respectively. If the virtual environment already exists, it will be
|
|
||||||
# reused. The script will also remove any existing symlinks for hbd and hbc
|
|
||||||
# in ~/bin before creating new ones.
|
|
||||||
|
|
||||||
|
|
||||||
# hbd/hbc from wheel and create symlinks for hbd and hbc in ~/bin
|
|
||||||
|
|
||||||
set -e
|
|
||||||
what=$1
|
|
||||||
|
|
||||||
if [ -d /homeassistant ]; then
|
|
||||||
echo "cannot install in HA, run \"docker exec -it homeassistant $0 $@\""
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
if [ -d /config ]; then
|
|
||||||
echo "Installing on HA"
|
|
||||||
where="/config/bin"
|
|
||||||
venv="/config/venvs"
|
|
||||||
else
|
|
||||||
if [ ! -d ~/.local/bin ] && [ ! -d ~/bin ]; then
|
|
||||||
echo "No suitable bin directory found in PATH, please add either ~/.local/bin or ~/bin to your PATH"
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
for where in ~/bin ~/.local/bin; do
|
|
||||||
if echo ":$PATH:" | grep -q ":$where:" ; then
|
|
||||||
break
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
venv="~/venvs"
|
|
||||||
fi
|
|
||||||
python3 -m pip --version > /dev/null 2>&1 || { echo "pip is not installed, please install pip for python3"; exit 1; }
|
|
||||||
|
|
||||||
if [ "$what" = "server" ]; then
|
|
||||||
echo "Installing heartbeat server (hbd)"
|
|
||||||
else
|
|
||||||
what="client"
|
|
||||||
echo "Installing heartbeat client (hbc)"
|
|
||||||
fi
|
|
||||||
if [ ! -d $venv/hbd ]; then
|
|
||||||
mkdir -p $venv
|
|
||||||
python3 -m venv $venv/hbd --system-site-packages
|
|
||||||
fi
|
|
||||||
. $venv/hbd/bin/activate
|
|
||||||
pip install --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]
|
|
||||||
if [ "$what" = "server" ]; then
|
|
||||||
rm -f ~$where/hbd
|
|
||||||
ln -sf $(which hbd) $where/hbd
|
|
||||||
else
|
|
||||||
rm -f $where/hbc
|
|
||||||
ln -sf $(which hbc) $where/hbc
|
|
||||||
fi
|
|
||||||
@@ -0,0 +1,99 @@
|
|||||||
|
import asyncio
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import stat
|
||||||
|
|
||||||
|
from hbd.client.plugins.nagios_runner import (
|
||||||
|
NagiosRunnerPlugin,
|
||||||
|
NAGIOS_OK,
|
||||||
|
NAGIOS_WARNING,
|
||||||
|
NAGIOS_CRITICAL,
|
||||||
|
NAGIOS_UNKNOWN,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_no_commands_sets_skip_reason():
|
||||||
|
plugin = NagiosRunnerPlugin(config={"commands": []})
|
||||||
|
result = asyncio.run(plugin.initialize())
|
||||||
|
assert result is False
|
||||||
|
assert plugin.skip_reason is not None
|
||||||
|
assert "nagios_runner.commands" in plugin.skip_reason
|
||||||
|
|
||||||
|
|
||||||
|
def test_stderr_used_when_stdout_empty(tmp_path):
|
||||||
|
script = tmp_path / "check_err.sh"
|
||||||
|
script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
|
||||||
|
script.chmod(script.stat().st_mode | stat.S_IEXEC)
|
||||||
|
|
||||||
|
config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
|
||||||
|
plugin = NagiosRunnerPlugin(config=config)
|
||||||
|
asyncio.run(plugin.initialize())
|
||||||
|
data = asyncio.run(plugin._collect_metrics())
|
||||||
|
|
||||||
|
assert "error from stderr" in data["t_output"]
|
||||||
|
assert data["t_status_code"] == NAGIOS_CRITICAL
|
||||||
|
|
||||||
|
|
||||||
|
def test_stderr_appended_when_both_present(tmp_path):
|
||||||
|
script = tmp_path / "check_both.sh"
|
||||||
|
script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
|
||||||
|
script.chmod(script.stat().st_mode | stat.S_IEXEC)
|
||||||
|
|
||||||
|
config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
|
||||||
|
plugin = NagiosRunnerPlugin(config=config)
|
||||||
|
asyncio.run(plugin.initialize())
|
||||||
|
data = asyncio.run(plugin._collect_metrics())
|
||||||
|
|
||||||
|
assert "OK - all good" in data["t_output"]
|
||||||
|
assert "extra detail" in data["t_output"]
|
||||||
|
assert data["t_status_code"] == NAGIOS_OK
|
||||||
|
|
||||||
|
|
||||||
|
def test_negative_returncode_maps_to_unknown():
|
||||||
|
# kill -9 $$ kills the shell itself; asyncio sees returncode -9
|
||||||
|
config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
|
||||||
|
plugin = NagiosRunnerPlugin(config=config)
|
||||||
|
asyncio.run(plugin.initialize())
|
||||||
|
data = asyncio.run(plugin._collect_metrics())
|
||||||
|
|
||||||
|
assert data["t_status_code"] == NAGIOS_UNKNOWN
|
||||||
|
assert "signal" in data["t_output"].lower()
|
||||||
|
|
||||||
|
|
||||||
|
def test_absolute_path_not_found_warns(caplog):
|
||||||
|
fake_cmd = "/nonexistent_hbc_test_path/check_something"
|
||||||
|
config = {"commands": [{"name": "t", "command": fake_cmd}]}
|
||||||
|
plugin = NagiosRunnerPlugin(config=config)
|
||||||
|
|
||||||
|
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
|
||||||
|
asyncio.run(plugin.initialize())
|
||||||
|
|
||||||
|
assert any("not found" in r.message for r in caplog.records)
|
||||||
|
|
||||||
|
|
||||||
|
def test_absolute_path_not_executable_warns(caplog, tmp_path):
|
||||||
|
non_exec = tmp_path / "check_test"
|
||||||
|
non_exec.write_text("#!/bin/sh\necho OK\n")
|
||||||
|
non_exec.chmod(0o644) # readable but not executable
|
||||||
|
|
||||||
|
config = {"commands": [{"name": "t", "command": str(non_exec)}]}
|
||||||
|
plugin = NagiosRunnerPlugin(config=config)
|
||||||
|
|
||||||
|
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
|
||||||
|
asyncio.run(plugin.initialize())
|
||||||
|
|
||||||
|
assert any("not executable" in r.message for r in caplog.records)
|
||||||
|
|
||||||
|
|
||||||
|
def test_relative_path_not_checked(caplog):
|
||||||
|
# Relative paths (resolved via PATH) must not generate warnings
|
||||||
|
config = {"commands": [{"name": "t", "command": "echo OK"}]}
|
||||||
|
plugin = NagiosRunnerPlugin(config=config)
|
||||||
|
|
||||||
|
with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
|
||||||
|
asyncio.run(plugin.initialize())
|
||||||
|
|
||||||
|
assert not any(
|
||||||
|
"not found" in r.message or "not executable" in r.message
|
||||||
|
for r in caplog.records
|
||||||
|
)
|
||||||
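Review note: the tests above pin down three behaviours of the runner: stderr is used (or appended) alongside stdout, the exit code becomes the status, and a negative return code (process killed by a signal) maps to UNKNOWN with "signal" in the output. The sketch below is one way such a runner could behave; it is an assumption for illustration, not the code of `NagiosRunnerPlugin`, which is not shown in this diff. The numeric values are the conventional Nagios plugin exit codes.

```python
import asyncio

# Conventional Nagios plugin exit codes.
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3


async def run_check(command: str, timeout: float = 5.0) -> tuple[int, str]:
    """Run a shell command and return (status_code, output)."""
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await asyncio.wait_for(proc.communicate(), timeout)
    stdout, stderr = out.decode().strip(), err.decode().strip()
    rc = proc.returncode
    if rc is not None and rc < 0:
        # asyncio reports death by signal N as returncode -N.
        return NAGIOS_UNKNOWN, f"terminated by signal {-rc}"
    if rc not in (NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL):
        rc = NAGIOS_UNKNOWN  # anything outside 0..2 is treated as UNKNOWN
    # Keep stdout first and append stderr when both are present.
    output = " ".join(part for part in (stdout, stderr) if part)
    return rc, output


both = asyncio.run(run_check("echo 'OK - all good'; echo 'extra detail' >&2"))
err_only = asyncio.run(run_check("echo boom >&2; exit 2"))
killed = asyncio.run(run_check("kill -9 $$"))
```

Timeout handling is left out: `asyncio.wait_for` raises `TimeoutError` here, and a real runner would also need to terminate the child process.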
@@ -0,0 +1,83 @@
|
|||||||
|
import asyncio
|
||||||
|
import logging
|
||||||
|
import textwrap
|
||||||
|
|
||||||
|
from hbd.client.plugin import PluginLoader, PluginRegistry
|
||||||
|
|
||||||
|
|
||||||
|
def test_plugin_skip_reason_defaults_none(tmp_path):
|
||||||
|
plugin_code = textwrap.dedent("""
|
||||||
|
from hbd.client.plugin import MonitorPlugin
|
||||||
|
|
||||||
|
class MinimalPlugin(MonitorPlugin):
|
||||||
|
name = "minimal"
|
||||||
|
version = "1.0.0"
|
||||||
|
interval = 60
|
||||||
|
|
||||||
|
async def initialize(self):
|
||||||
|
return True
|
||||||
|
|
||||||
|
async def _collect_metrics(self):
|
||||||
|
return {}
|
||||||
|
""")
|
||||||
|
(tmp_path / "minimal.py").write_text(plugin_code)
|
||||||
|
registry = PluginRegistry()
|
||||||
|
loader = PluginLoader(registry)
|
||||||
|
asyncio.run(loader.load_from_directory(tmp_path))
|
||||||
|
plugin = registry.get("minimal")
|
||||||
|
assert plugin is not None
|
||||||
|
assert plugin.skip_reason is None
|
||||||
|
|
||||||
|
|
||||||
|
def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
|
||||||
|
plugin_code = textwrap.dedent("""
|
||||||
|
from hbd.client.plugin import MonitorPlugin
|
||||||
|
|
||||||
|
class SkippablePlugin(MonitorPlugin):
|
||||||
|
name = "skippable"
|
||||||
|
version = "1.0.0"
|
||||||
|
interval = 60
|
||||||
|
|
||||||
|
async def initialize(self):
|
||||||
|
self.skip_reason = "not configured in yaml"
|
||||||
|
return False
|
||||||
|
|
||||||
|
async def _collect_metrics(self):
|
||||||
|
return {}
|
||||||
|
""")
|
||||||
|
(tmp_path / "skippable.py").write_text(plugin_code)
|
||||||
|
registry = PluginRegistry()
|
||||||
|
loader = PluginLoader(registry)
|
||||||
|
|
||||||
|
with caplog.at_level(logging.INFO, logger="plugin.loader"):
|
||||||
|
count = asyncio.run(loader.load_from_directory(tmp_path))
|
||||||
|
|
||||||
|
assert count == 0
|
||||||
|
assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
|
||||||
|
assert not any("failed initialization" in r.message for r in caplog.records)
|
||||||
|
|
||||||
|
|
||||||
|
def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
|
||||||
|
plugin_code = textwrap.dedent("""
|
||||||
|
from hbd.client.plugin import MonitorPlugin
|
||||||
|
|
||||||
|
class FailPlugin(MonitorPlugin):
|
||||||
|
name = "fail"
|
||||||
|
version = "1.0.0"
|
||||||
|
interval = 60
|
||||||
|
|
||||||
|
async def initialize(self):
|
||||||
|
return False
|
||||||
|
|
||||||
|
async def _collect_metrics(self):
|
||||||
|
return {}
|
||||||
|
""")
|
||||||
|
(tmp_path / "fail_plugin.py").write_text(plugin_code)
|
||||||
|
registry = PluginRegistry()
|
||||||
|
loader = PluginLoader(registry)
|
||||||
|
|
||||||
|
with caplog.at_level(logging.WARNING, logger="plugin.loader"):
|
||||||
|
count = asyncio.run(loader.load_from_directory(tmp_path))
|
||||||
|
|
||||||
|
assert count == 0
|
||||||
|
assert any("failed initialization" in r.message for r in caplog.records)
|
||||||
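Review note: the three loader tests above describe a simple rule. When `initialize()` returns `False`, a plugin that set `skip_reason` is logged at INFO as skipped, and one that did not is logged at WARNING as a failed initialization. A minimal sketch of that rule follows; `DemoPlugin`, `try_load`, the logger name and the exact message wording beyond the substrings asserted in the tests are assumptions, since `PluginLoader` itself is not part of this diff.

```python
import asyncio
import logging

logger = logging.getLogger("plugin.loader.sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep demo output out of the root logger

records: list[logging.LogRecord] = []


class _ListHandler(logging.Handler):
    """Collect records in a list so the demo can inspect them."""

    def emit(self, record: logging.LogRecord) -> None:
        records.append(record)


logger.addHandler(_ListHandler())


class DemoPlugin:
    """Hypothetical plugin exposing only what the loader rule needs."""

    name = "demo"

    def __init__(self, reason=None):
        self._reason = reason
        self.skip_reason = None  # default, as the first test expects

    async def initialize(self) -> bool:
        self.skip_reason = self._reason
        return False


async def try_load(plugin) -> bool:
    if await plugin.initialize():
        return True
    if plugin.skip_reason:
        logger.info("Plugin %s skipped: %s", plugin.name, plugin.skip_reason)
    else:
        logger.warning("Plugin %s failed initialization", plugin.name)
    return False


skipped = asyncio.run(try_load(DemoPlugin("not configured in yaml")))
failed = asyncio.run(try_load(DemoPlugin()))
```

Separating the two cases keeps an unconfigured optional plugin from looking like an error in the logs.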