version 5.1.21

feat: nagios_runner improvements and alerts page fixes
- nagios_runner: remove overall_status/overall_status_code/plugin_count fields; each command still reports its own <name>_status and <name>_status_code
- threshold: expose {output} and {status} aliases in display templates for nagios_runner generic matches (mapped from <check_name>_output/status)
- alerts.html: fix scrolling by overriding html,body height/overflow (style.css sets both); make hostname a link to /plugins/<hostname>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:05:48 -04:00 · 2026-05-05 11:05:45 -04:00 · 2026-05-05 10:48:24 -04:00 · 2026-05-05 10:48:17 -04:00 · 2026-05-04 14:47:50 -04:00 · 2026-05-04 12:10:01 -04:00
42 changed files with 5233 additions and 1734 deletions
@@ -24,11 +24,11 @@ jobs:
          
      - name: Install build tools
        run: |
-          python -m pip install --upgrade pip
-          pip install build twine
+          python3 -m pip install --upgrade pip
+          python3 -m pip install build twine
          
      - name: Build package
-        run: python -m build
+        run: python3 -m build
        
      - name: Extract version from tag
        id: get_version
@@ -39,7 +39,7 @@ jobs:
          TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
        run: |
-          python -m twine upload --repository-url https://git.wrede.ca/api/packages/andreas/pypi dist/*
+          python3 -m twine upload --repository-url https://git.wrede.ca/api/packages/andreas/pypi dist/*

      - name: Create release
        uses: actions/gitea-release-action@v1
@@ -0,0 +1,4 @@
+1. Don't assume. Don't hide confusion. Surface tradeoffs.
+2. Minimum code that solves the problem. Nothing speculative.
+3. Touch only what you must. Clean up only your own mess.
+4. Define success criteria. Loop until verified.
@@ -27,6 +27,7 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
  - Configurable retention and backup management
 - **Plugin system for extensible monitoring** ✅
  - Collect system metrics (CPU, memory, disk, network)
+  - Monitor ZFS pool health, capacity, and I/O via `zpool(8)`
  - Execute existing Nagios monitoring plugins
  - Create custom plugins with simple Python classes
 - **Threshold alerting system** ✅
@@ -34,6 +35,8 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
  - Hysteresis to prevent alert flapping
  - Automatic notifications on state changes
  - Re-notification for ongoing alerts
+- **Per-host watch flag** — set `watch: false` on any host to silence all notifications for that host without removing its configuration ✅
+- **Role-filtered dashboards** — Live Dashboard and Host Overview show only hosts where the logged-in user is owner or manager (admins see all) ✅
 - Modular codebase suitable for unit testing and CI ✅

 ---
@@ -61,12 +64,16 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
 - `network_monitor`: Monitors network interface statistics, bandwidth, and connections
 - `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
 - `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
+- `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`

 ### Nagios Integration

 The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:

- Executes plugins via subprocess with timeout protection
+- Executes plugins asynchronously (non-blocking) with timeout protection
+- Captures both stdout and stderr; if stdout is empty, stderr is used as the status message
+- Handles signal-killed processes (negative exit code → UNKNOWN status)
+- Validates absolute command paths at startup and warns on missing or non-executable files
 - Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
 - Extracts performance data with thresholds
 - Reports aggregated status across all configured checks
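
The execution behaviour listed in this hunk (non-blocking run with a timeout, stderr fallback, signal-killed processes mapped to UNKNOWN) can be illustrated with a minimal sketch. The names `run_check` and `STATUS_NAMES` are hypothetical and are not taken from `nagios_runner.py`; the only assumptions are the standard Nagios exit codes and the `asyncio` subprocess API.

```python
import asyncio
import sys

# Standard Nagios plugin exit codes.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


async def run_check(argv, timeout=10.0):
    """Run one check command; return (status_code, status_name, message)."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Timeout protection: kill the child and reap it before reporting.
        proc.kill()
        await proc.wait()
        return 3, "UNKNOWN", f"timed out after {timeout}s"

    code = proc.returncode
    if code < 0:
        # On POSIX a negative return code means the process died from a signal.
        return 3, "UNKNOWN", f"killed by signal {-code}"

    message = stdout.decode(errors="replace").strip()
    if not message:
        # Fall back to stderr when the plugin printed nothing on stdout.
        message = stderr.decode(errors="replace").strip()

    # Exit codes outside 0..3 are treated as UNKNOWN in this sketch.
    status = code if code in STATUS_NAMES else 3
    return status, STATUS_NAMES[status], message


if __name__ == "__main__":
    # Hypothetical check: a child process that exits with code 1 (WARNING).
    cmd = [sys.executable, "-c", "print('load is high'); raise SystemExit(1)"]
    print(asyncio.run(run_check(cmd)))
```

Because the coroutine awaits the child instead of blocking on it, several checks could be scheduled concurrently with `asyncio.gather`; whether the plugin does so is not stated in this diff.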
@@ -147,9 +154,11 @@ Heartbeat includes a sophisticated threshold alerting system that monitors plugi
 - **Multi-level alerts**: WARNING and CRITICAL severity levels
 - **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
 - **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
- **Smart notifications**: Alerts only on state changes, not every check
+- **Smart notifications**: Alerts only on state changes, not every check; de-escalations (e.g. CRITICAL → WARNING) do not generate a notification
 - **Re-notifications**: Periodic reminders for ongoing alerts
+- **Short-duration suppression**: Recovery notifications are suppressed for down events under 4 seconds (avoids noise from transient blips)
 - **Journal integration**: All threshold events logged for audit trail
+- **`ping_monitor` thresholds**: Latency and packet-loss thresholds use the same format as all other plugin metrics

 ### Configuration

@@ -172,7 +181,8 @@ thresholds:
      warning: 80.0      # Warn when CPU > 80%
      critical: 90.0     # Critical when CPU > 90%
      operator: ">"
-      hysteresis: 0.1    # 10% hysteresis to prevent flapping
+      hysteresis: 0.02   # 2% hysteresis to prevent flapping
+      display: "(threshold: {op_symbol} {threshold_value}%)"  # optional
  
  memory_monitor:
    percent:
@@ -265,7 +275,96 @@ All plugin metrics can be thresholded:
 - **Memory**: percent, available_mb, swap_percent
 - **Disk**: Per-partition percent, free_gb, free_mb
 - **Network**: errors_total, dropped packets, connection counts
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
+- **Nagios**: Any field emitted by `nagios_runner` (status_code, exit_code, performance data, …)
+
+### Display Format Templates
+
+Each threshold entry accepts an optional `display` field — a Python format string shown in notifications and on the Alerts dashboard:
+
+```yaml
+nagios_runner:
+  status_code:
+    warning: 1
+    critical: 2
+    operator: ">="
+    display: "{check_name}: exit {value} (expected < {threshold_value})"
+```
+
+Available variables:
+
+| Variable | Description |
+|---|---|
+| `{value}` | Current metric value |
+| `{threshold_value}` | Threshold that was crossed |
+| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …) |
+| `{check_name}` | Prefix stripped by generic matching (see below) |
+| `{metric_name}` | Full field name within the plugin data |
+| `{output}` | For `nagios_runner` generic matches: the matched check's status text (alias for `{check_name}_output`) |
+| `{status}` | For `nagios_runner` generic matches: the matched check's status name — OK/WARNING/CRITICAL/UNKNOWN (alias for `{check_name}_status`) |
+| any plugin field | Any other field present in the plugin's data |
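
As a rough illustration of how such a template could be filled, the sketch below builds the variable table as a dict and passes it to `str.format`. The function `render_display` and its parameters are hypothetical; the alias mapping for `{output}` and `{status}` follows the commit message (`<check_name>_output` and `<check_name>_status`) and is otherwise an assumption about the server's behaviour.

```python
def render_display(template, plugin_data, *, value, threshold_value,
                   op_symbol, metric_name, check_name=""):
    """Fill a threshold `display` template using str.format()."""
    # Start from the plugin's own fields so any of them can be referenced.
    fields = dict(plugin_data)
    fields.update(
        value=value,
        threshold_value=threshold_value,
        op_symbol=op_symbol,
        metric_name=metric_name,
        check_name=check_name,
    )
    if check_name:
        # Assumed alias mapping for generic nagios_runner matches.
        fields["output"] = plugin_data.get(f"{check_name}_output", "")
        fields["status"] = plugin_data.get(f"{check_name}_status", "")
    return template.format(**fields)


# Hypothetical plugin data for a single check named "check_load".
data = {
    "check_load_status_code": 2,
    "check_load_status": "CRITICAL",
    "check_load_output": "load average: 12.1, 9.8, 7.0",
}
text = render_display(
    "{check_name}: {status} - {output} ({op_symbol} {threshold_value})",
    data,
    value=2,
    threshold_value=2,
    op_symbol=">=",
    metric_name="check_load_status_code",
    check_name="check_load",
)
print(text)
```

A template that names a variable missing from the data raises `KeyError` in this sketch; how the real server handles that case is not shown in the diff.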
+
+### Generic Threshold Matching
+
+When a metric name has no exact threshold entry, the server progressively strips leading underscore-separated segments and re-tries the lookup. This lets a single generic entry cover an entire family of metrics.
+
+The classic use case is `nagios_runner`, which names each metric after the command that produced it:
+
+```
+nagios_runner.check_disk_root_status_code    → no exact match
+nagios_runner.disk_root_status_code          → no match
+nagios_runner.root_status_code               → no match
+nagios_runner.status_code                    → matched ✓
+```
+
+Configure the generic threshold once:
+
+```yaml
+nagios_runner:
+  status_code:
+    warning: 1
+    critical: 2
+    operator: ">="
+    display: "{check_name}: exit {value}"
+```
+
+The stripped prefix (`check_disk_root` in the example above) is available as `{check_name}` in the display template, so you can identify which check triggered the alert without writing a separate threshold entry per command.
+
+Exact matches always take priority. A generic entry only applies when no specific one is defined.
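
The lookup described in this section can be sketched in a few lines. This is an illustration of the stripping rule as documented, not the server's code; `find_threshold` is a hypothetical name and `entries` stands for the threshold entries configured for one plugin.

```python
def find_threshold(metric_name, entries):
    """Return (entry, check_name) for a metric, or (None, "") if nothing matches.

    An exact match wins. Otherwise leading underscore-separated segments are
    stripped one at a time and the lookup is retried.
    """
    if metric_name in entries:
        return entries[metric_name], ""

    segments = metric_name.split("_")
    for i in range(1, len(segments)):
        candidate = "_".join(segments[i:])
        if candidate in entries:
            # The stripped prefix becomes {check_name} in display templates.
            return entries[candidate], "_".join(segments[:i])
    return None, ""


entries = {
    "status_code": {"warning": 1, "critical": 2, "operator": ">="},
}
entry, check_name = find_threshold("check_disk_root_status_code", entries)
print(check_name, entry["critical"])
```

With the entries above, the candidates tried are `disk_root_status_code`, `root_status_code`, and then `status_code`, which matches and leaves `check_disk_root` as the stripped prefix.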
+
+### Per-Host Threshold Profiles
+
+Named threshold configurations let different hosts use different limits. A host's `threshold_config` can be a single name or a **list** — lists are applied left-to-right so profiles compose without duplication:
+
+```yaml
+threshold_configs:
+  default:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}
+
+  tight_cpu:           # override CPU limits only
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  db_disk:             # add a database partition check
+    thresholds:
+      disk_monitor:
+        partitions:
+          /var/lib/postgresql:
+            percent: {warning: 75, critical: 88}
+
+hosts:
+  web-01:
+    threshold_config: default          # single profile
+
+  db-01:
+    threshold_config: [tight_cpu, db_disk]   # layered: CPU override + extra disk check
+```
+
+Each named config's overrides are applied in order on top of the defaults. Metrics not mentioned in a profile are inherited unchanged.
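
One way to picture the layering is a recursive dictionary merge applied left to right. The sketch below is an assumption about the semantics, in particular that merging recurses down to individual fields such as `warning` and that the `default` config is always the base; `deep_merge` and `effective_thresholds` are hypothetical names.

```python
import copy


def deep_merge(base, override):
    """Return a copy of base with override merged in, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def effective_thresholds(threshold_configs, names, default_name="default"):
    """Layer the named configs, in order, on top of the default config."""
    if isinstance(names, str):
        names = [names]  # a single name behaves like a one-element list
    base = threshold_configs.get(default_name, {}).get("thresholds", {})
    result = copy.deepcopy(base)
    for name in names:
        result = deep_merge(result, threshold_configs[name]["thresholds"])
    return result


threshold_configs = {
    "default": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
        "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
    }},
    "tight_cpu": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}},
    }},
    "db_disk": {"thresholds": {
        "disk_monitor": {"partitions": {
            "/var/lib/postgresql": {"percent": {"warning": 75, "critical": 88}},
        }},
    }},
}
db01 = effective_thresholds(threshold_configs, ["tight_cpu", "db_disk"])
print(db01["cpu_monitor"]["cpu_percent"])
```

For `db-01` this yields the tightened CPU limits, the inherited memory limits, and the extra PostgreSQL partition check, matching the comments in the YAML example.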

 See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.

@@ -328,9 +427,10 @@ Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST AP
 ### Web Dashboards

 - **Login** (`/login`): Browser login form (shown automatically when auth is configured)
- **Live View** (`/live`): Real-time host connectivity, latency, and messages
- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering
+- **Live View** (`/live`): Real-time host connectivity, latency, and messages; hostnames link directly to the Host Overview page
+- **Host Overview** (`/plugins/<host>`): Per-host plugin metrics with ZFS pool visualization; filtered to hosts where the logged-in user is owner or manager (admins see all)
+- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering; alert count pie chart shown in the navigation bar
+- **Settings** (`/settings`): Server configuration, user management, and threshold configuration viewer

 ### API Endpoints

@@ -377,7 +477,7 @@ This project now declares its dependencies in `pyproject.toml`. Instead
 of the old `requirements.txt` flow, install the package into a virtualenv
 using `pip`:

-See `scripts/install.sh` for a way to install.
+See `scripts/hb_install.sh` for a way to install.

 Run the daemon (example):

@@ -441,6 +541,74 @@ plugins:

 All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.

+**Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
+
+**Daemon logging:** When running with `-d`, `hbc` routes all log output to syslog (`LOG_DAEMON` facility) after daemonizing. Without `-d`, logs go to stderr as usual.
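
For readers unfamiliar with syslog routing in Python, the switch described above might look roughly like the following. This is a sketch using only the standard library; the function name, the log format, and the default socket path are assumptions and do not reproduce the helper in `hbd/client/main.py`.

```python
import logging
from logging.handlers import SysLogHandler


def reconfigure_logging_for_daemon(address="/dev/log"):
    """Replace the root logger's handlers with one syslog handler.

    `address` is a Unix socket path or a (host, port) tuple. "/dev/log" is the
    usual Linux path; other platforms differ (macOS uses "/var/run/syslog").
    """
    root = logging.getLogger()
    # After daemonizing, stderr is gone, so drop the existing handlers.
    for old in list(root.handlers):
        root.removeHandler(old)

    handler = SysLogHandler(address=address, facility=SysLogHandler.LOG_DAEMON)
    handler.setFormatter(logging.Formatter("hbc: %(levelname)s %(name)s: %(message)s"))
    root.addHandler(handler)
    return handler
```

Such a function would be called once in the child process after the fork, so that messages logged later are not written to a closed stderr.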
+
+### hbc_mini — single-file client (no external dependencies)
+
+`scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly — no virtualenv, no `pip install`.
+
+```bash
+# Basic usage
+python3 hbc_mini.py your-server.example.com
+
+# Run as daemon
+python3 hbc_mini.py -d your-server.example.com
+
+# Send a boot message
+python3 hbc_mini.py -b your-server.example.com
+
+# Send a one-off message
+python3 hbc_mini.py -m "maintenance starting" your-server.example.com
+```
+
+**Config:** `~/.hbc.json` (same keys as `~/.hbc.yaml`, JSON format). Example:
+
+```json
+{
+  "hb_port": 50003,
+  "interval": 30,
+  "plugins": {
+    "ping_monitor": {
+      "interval": 60,
+      "hosts": ["8.8.8.8", "192.168.1.1"]
+    },
+    "nagios_runner": {
+      "interval": 300,
+      "commands": [
+        {"name": "check_load", "command": "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6"}
+      ]
+    }
+  }
+}
+```
+
+**Plugin availability:**
+
+| Plugin | Platform | Data source |
+|---|---|---|
+| `os_info` | all | `platform` stdlib |
+| `ping_monitor` | all | `ping` subprocess |
+| `nagios_runner` | all (not Windows) | subprocess |
+| `cpu_monitor` | Linux | `/proc/stat` |
+| `memory_monitor` | Linux | `/proc/meminfo` |
+| `disk_monitor` | Linux, macOS, BSD | `df -P` subprocess |
+| `network_monitor` | Linux | `/proc/net/dev` |
+
+**What is not available compared to the full `hbc`:**
+
+- No YAML config (use JSON instead)
+- No `filesystem_info` plugin
+- No `zfs_monitor` plugin (requires `zpool(8)` and the full plugin loader)
+- `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
+- Plugins cannot be loaded from external `.py` files — all plugins are compiled in
+- No IPv6 early-fail protection — connections that fail to open at startup are silently skipped rather than retried
+
+Everything else — heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog — is identical to the full client.
+
+---
+
 ## 🐞 Debugging in VS Code

 This repository includes a ready-to-use `.vscode/launch.json` with configurations to run or attach the VS Code debugger to `hbd`.
@@ -1,234 +0,0 @@
-# HBD/HBC Separation Refactoring
-
-## Overview
-
-The heartbeat monitoring system has been refactored into a modular package structure with separate client and server components. This allows users to install only what they need and provides clear separation of concerns.
-
-## New Package Structure
-
-```
-hbd/
-├── __init__.py                 # Main package (minimal)
-├── client/                     # HBC - System monitoring client
-│   ├── __init__.py
-│   ├── main.py                # Entry point (was hbc.py)
-│   ├── config.py              # Client-specific configuration
-│   ├── plugin.py              # Plugin framework
-│   ├── threshold.py           # Threshold checking
-│   └── plugins/               # Monitoring plugins
-│       ├── cpu_monitor.py
-│       ├── disk_monitor.py
-│       ├── memory_monitor.py
-│       ├── network_monitor.py
-│       ├── filesystem_info.py
-│       ├── os_info.py
-│       └── nagios_runner.py
-├── server/                     # HBD - Heartbeat daemon/server
-│   ├── __init__.py
-│   ├── main.py                # Server runtime (was server.py)
-│   ├── cli.py                 # Command-line interface
-│   ├── config.py              # Server-specific configuration
-│   ├── http.py                # HTTP/REST API
-│   ├── ws.py                  # WebSocket server
-│   ├── udp.py                 # UDP heartbeat listener
-│   ├── dns.py                 # DNS update functionality
-│   ├── notify.py              # Notification handlers
-│   ├── monitor.py             # Host monitoring
-│   ├── hbdclass.py            # Host class definitions
-│   ├── journal.py             # Message journaling
-│   ├── templates/             # Jinja2 web templates
-│   └── static/                # Web UI assets
-└── common/                     # Shared utilities
-    ├── __init__.py
-    ├── proto.py               # Protocol encoding/decoding
-    └── utils.py               # Common utilities
-
-## Configuration Files
-
-### Client Configuration (hbd/client/config.py)
-
-Client-specific defaults:
- `hb_port`: Port where hbd servers listen (default: 50003)
- `interval`: Heartbeat interval in seconds (default: 10)
- `plugins`: Per-plugin configuration
- `thresholds`: Threshold configuration for monitoring
-
-### Server Configuration (hbd/server/config.py)
-
-Server-specific defaults:
- `hb_port`: Port to listen for heartbeats (default: 50003)
- `hbd_port`: HTTP API port (default: 50004)
- `ws_port`: WebSocket port (default: 50005)
- `logfile`: Log file path
- `pushsrv`, `pushover_token`, etc.: Notification settings
- `watchhosts`, `dyndnshosts`: Host monitoring
- `smtpserver`, etc.: Email settings
- `journal_*`: Message journaling settings
-
-## Installation Options
-
-### Install Core Only (minimal, PyYAML only)
-```bash
-pip install hbd
-```
-
-### Install Client Only (for monitoring)
-```bash
-pip install hbd[client]
-# Installs: PyYAML, psutil
-```
-
-### Install Server Only (for daemon)
-```bash
-pip install hbd[server]
-# Installs: PyYAML, websockets, mattermostdriver, aiohttp, Jinja2
-```
-
-### Install Everything
-```bash
-pip install hbd[all]
-# Installs all dependencies for both client and server
-```
-
-### Development Installation
-```bash
-pip install -e ".[dev]"
-# Includes all dependencies plus testing/linting tools
-```
-
-## Command-Line Interfaces
-
-### HBC (Client)
-```bash
-hbc [options] host1 [host2 ...]
-
-# Entry point: hbd.client.main:main
-# Location: hbd/client/main.py
-```
-
-### HBD (Server)
-```bash
-hbd [options]
-
-# Entry point: hbd.server.cli:main
-# Location: hbd/server/cli.py → hbd/server/main.py
-```
-
-## Import Changes
-
-### Client Code
-```python
-# Old imports
-from .config import load_config
-from .proto import dicttos, stodict
-from .plugin import PluginRegistry
-
-# New imports
-from .config import load_config          # Still in client/
-from ..common.proto import dicttos       # Moved to common/
-from .plugin import PluginRegistry       # Still in client/
-```
-
-### Server Code
-```python
-# Old imports
-from .config import load_config
-from .proto import stodict
-from .threshold import AlertLevel
-
-# New imports
-from .config import load_config          # Server-specific config
-from ..common.proto import stodict       # Moved to common/
-from ..client.threshold import AlertLevel # Client module
-```
-
-### Plugin Code
-```python
-# Old import
-from hbd.plugin import MonitorPlugin
-
-# New import
-from hbd.client.plugin import MonitorPlugin
-```
-
-## Benefits
-
-1. **Modular Installation**: Install only what you need
-   - Client-only systems don't need web server dependencies
-   - Server-only systems don't need psutil
-   
-2. **Clearer Architecture**: Explicit separation of concerns
-   - Client: System monitoring and data collection
-   - Server: Heartbeat reception, web UI, notifications
-   - Common: Shared protocol and utilities
-
-3. **Independent Evolution**: Client and server can evolve separately
-   - Different release cycles possible
-   - Clear API boundaries via common/
-
-4. **Smaller Footprint**: Reduced dependency installation
-   - Client: ~1 dependency (psutil)
-   - Server: ~4 dependencies (websockets, aiohttp, Jinja2, mattermostdriver)
-
-## Migration Guide
-
-### For Existing Installations
-
-1. **Reinstall the package**:
-   ```bash
-   pip install -e ".[all]"  # For development
-   # or
-   pip install hbd[all]     # For production
-   ```
-
-2. **Configuration files remain unchanged**:
-   - Both client and server read from `~/.hb.yaml`
-   - All existing config keys are supported in both configs
-   - Server has additional keys (journal, websocket, email, etc.)
-   - Client has minimal keys (interval, plugins, thresholds)
-
-3. **Commands remain the same**:
-   - `hbc` command works identically
-   - `hbd` command works identically
-
-### For New Deployments
-
-1. **Client-only system** (monitoring host):
-   ```bash
-   pip install hbd[client]
-   hbc server1.example.com server2.example.com
-   ```
-
-2. **Server-only system** (monitoring daemon):
-   ```bash
-   pip install hbd[server]
-   hbd -c /etc/hbd.yaml -f
-   ```
-
-3. **Combined system** (dev/test):
-   ```bash
-   pip install hbd[all]
-   ```
-
-## Testing
-
-All imports and entry points have been tested and validated:
- ✅ Package imports work correctly
- ✅ `hbc` command entry point functional
- ✅ `hbd` command entry point functional
- ✅ Optional dependencies properly configured
- ✅ All internal imports updated
-
-## Files Archived
-
-The following files were renamed to avoid conflicts:
- `hbd/config.py` → `hbd/config.py.old` (split into client/server configs)
- `hbd/hbc_old.py` → `hbd/hbc_old.py.bak` (backup file)
-
-## Next Steps
-
-1. Test client functionality with a monitoring host
-2. Test server functionality with web UI and notifications
-3. Update documentation (README.md) with new structure
-4. Consider publishing to PyPI with new structure
-5. Update any deployment scripts/Dockerfiles to use optional dependencies
@@ -814,34 +814,32 @@ Planned features:

 ## Multi-Threshold Configuration

-**New in version 2.0**: Support for multiple named threshold configurations with per-host mapping.
+Support for multiple named threshold configurations with per-host mapping and composable layering.

 ### Overview

 The multi-threshold feature allows you to:
- Define multiple sets of threshold configurations
- Map different hosts to different threshold sets
+- Define multiple named threshold configurations
+- Assign one or more configurations to each host
+- Compose configurations by layering — each named config's overrides are applied in order on top of the defaults
 - Use different sensitivity levels for different environments
- Maintain a default configuration for unmapped hosts

 ### Configuration Structure

+Named configurations are defined under `threshold_configs`. Each host selects which ones to use via `threshold_config` in the `hosts` section (a string for a single config, or a list to layer multiple):
+
 ```yaml
-# Optional: Set the default configuration name (defaults to "default")
+# Optional: set the default configuration name (defaults to "default")
 default_threshold_config: "default"

-# Define multiple named threshold configurations
 threshold_configs:
-  # Configuration name 1
  default:
    thresholds:
-      # Standard threshold definitions
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

-  # Configuration name 2
  high_sensitivity:
    thresholds:
      cpu_monitor:
@@ -849,7 +847,6 @@ threshold_configs:
          warning: 60.0
          critical: 75.0

-  # Configuration name 3
  low_sensitivity:
    thresholds:
      cpu_monitor:
@@ -857,14 +854,77 @@ threshold_configs:
          warning: 90.0
          critical: 95.0

-# Map specific hosts to specific configurations
-host_threshold_mapping:
-  prod-web-01: high_sensitivity
-  prod-web-02: high_sensitivity
-  dev-server-01: low_sensitivity
-  # Unmapped hosts use default_threshold_config
+hosts:
+  prod-web-01:
+    threshold_config: high_sensitivity   # single config
+
+  dev-server-01:
+    threshold_config: low_sensitivity
+
+  # Hosts with no threshold_config use default_threshold_config
 ```

+### Composable Configurations (list form)
+
+`threshold_config` can be a list. Configs are applied **left to right**: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.
+
+```yaml
+threshold_configs:
+  default:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}
+      disk_monitor:
+        partitions:
+          /:
+            percent: {warning: 80, critical: 90}
+
+  # Tighter CPU limits for busy servers
+  high_cpu_load:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  # Tighter disk limits for data-heavy servers
+  busy_disk:
+    thresholds:
+      disk_monitor:
+        partitions:
+          /:
+            percent: {warning: 70, critical: 85}
+
+hosts:
+  # Gets default thresholds only
+  web-01:
+    threshold_config: default
+
+  # Gets tighter CPU limits, default memory and disk
+  build-server:
+    threshold_config: high_cpu_load
+
+  # Layers both: tighter CPU AND tighter disk, default memory
+  db-01:
+    threshold_config: [high_cpu_load, busy_disk]
+
+  # Three layers: busy_disk overrides high_cpu_load if they conflict
+  storage-01:
+    threshold_config: [default, high_cpu_load, busy_disk]
+```
+
+**How layering works:**
+
+Starting from the `default` thresholds:
+
+| Layer | Applied config | Effect |
+|-------|---------------|--------|
+| Base  | `default` | all default thresholds |
+| +1    | `high_cpu_load` | cpu_percent overridden to 60/75 |
+| +2    | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |
+
+Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.
+
 ### Use Cases

 #### 1. Environment-Based Thresholds
@@ -887,11 +947,15 @@ threshold_configs:
          warning: 90.0   # More relaxed for dev
          critical: 98.0

-host_threshold_mapping:
-  prod-web-01: production
-  prod-web-02: production
-  dev-web-01: development
-  dev-web-02: development
+hosts:
+  prod-web-01:
+    threshold_config: production
+  prod-web-02:
+    threshold_config: production
+  dev-web-01:
+    threshold_config: development
+  dev-web-02:
+    threshold_config: development
 ```

 #### 2. Server Role-Based Thresholds
@@ -914,7 +978,7 @@ threshold_configs:
          warning: 70.0
          critical: 85.0
      memory_monitor:
-        percent:
+        memory_percent:
          warning: 90.0   # Databases can use high memory
          critical: 97.0
      disk_monitor:
@@ -927,17 +991,23 @@ threshold_configs:
  cache:
    thresholds:
      memory_monitor:
-        percent:
+        memory_percent:
          warning: 95.0   # Redis/Memcached can use very high memory
          critical: 99.0

-host_threshold_mapping:
-  web-01: webserver
-  web-02: webserver
-  db-01: database
-  db-02: database
-  redis-01: cache
-  memcached-01: cache
+hosts:
+  web-01:
+    threshold_config: webserver
+  web-02:
+    threshold_config: webserver
+  db-01:
+    threshold_config: database
+  db-02:
+    threshold_config: database
+  redis-01:
+    threshold_config: cache
+  memcached-01:
+    threshold_config: cache
 ```

 #### 3. Sensitivity Levels
@@ -952,7 +1022,7 @@ threshold_configs:
        partitions:
          /:
            percent:
-              warning: 70.0    # Very sensitive
+              warning: 70.0
              critical: 80.0
              hysteresis: 0.15

@@ -976,12 +1046,69 @@ threshold_configs:
              critical: 98.0
              hysteresis: 0.05

-host_threshold_mapping:
-  payment-gateway: critical
-  auth-server: critical
-  web-01: standard
-  web-02: standard
-  test-server: relaxed
+hosts:
+  payment-gateway:
+    threshold_config: critical
+  auth-server:
+    threshold_config: critical
+  web-01:
+    threshold_config: standard
+  web-02:
+    threshold_config: standard
+  test-server:
+    threshold_config: relaxed
+```
+
+#### 4. Composable Profiles
+
+Build host-specific thresholds by combining small, focused configs:
+
+```yaml
+threshold_configs:
+  # Baseline — everything at default levels
+  default:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}
+
+  # Overlay: tighter CPU only
+  tight_cpu:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  # Overlay: tighter memory only
+  tight_memory:
+    thresholds:
+      memory_monitor:
+        memory_percent: {warning: 70, critical: 85}
+
+  # Overlay: extra disk partition for database servers
+  db_disk:
+    thresholds:
+      disk_monitor:
+        partitions:
+          /var/lib/postgresql:
+            percent: {warning: 75, critical: 88}
+
+hosts:
+  # Plain web server
+  web-01:
+    threshold_config: default
+
+  # Build server: tight CPU, default memory and disk
+  build-01:
+    threshold_config: tight_cpu
+
+  # Database: tight CPU + tight memory + extra disk partition
+  db-01:
+    threshold_config: [tight_cpu, tight_memory, db_disk]
+
+  # Replica database: tight memory + extra disk, normal CPU
+  db-02:
+    threshold_config: [tight_memory, db_disk]
 ```

 ### Backward Compatibility
@@ -1012,16 +1139,25 @@ threshold_configs:

 ### Configuration Priority

-1. **Host-specific mapping**: If host is in `host_threshold_mapping`, use that config
-2. **Default config**: Use `default_threshold_config` 
-3. **First alphabetically**: If default not found, use first config alphabetically
-4. **Legacy fallback**: If `threshold_configs` not present, use `thresholds`
+1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
+2. **Host `threshold_config` (string)**: Use that single named config directly
+3. **`host_threshold_mapping`** (legacy): Same as above, string only
+4. **`default_threshold_config`**: Used for hosts with no mapping
+5. **First alphabetically**: If the default config is not found, use the first config alphabetically
+6. **Legacy `thresholds` section**: Used when `threshold_configs` is absent entirely
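
Read as a decision procedure, the priority list could be sketched like this. The function `resolve_config_names` is hypothetical, `config` stands for the parsed YAML document, and the relative order of the `hosts` entry and the legacy mapping is taken from the numbering above.

```python
def resolve_config_names(host, config):
    """Return the ordered config names for a host, or None for legacy mode."""
    configs = config.get("threshold_configs")
    if not configs:
        # Rule 6: no named configs, so the flat `thresholds` section applies.
        return None

    # Rules 1 and 2: the host's own threshold_config (list or string).
    host_entry = (config.get("hosts") or {}).get(host) or {}
    selected = host_entry.get("threshold_config")
    if selected is None:
        # Rule 3: legacy host_threshold_mapping, string only.
        selected = (config.get("host_threshold_mapping") or {}).get(host)

    if isinstance(selected, str):
        return [selected]
    if selected:
        return list(selected)

    # Rule 4: the configured default, if it exists.
    default = config.get("default_threshold_config", "default")
    if default in configs:
        return [default]
    # Rule 5: otherwise the first config name in alphabetical order.
    return sorted(configs)[0:1]


config = {
    "threshold_configs": {"default": {}, "high_cpu_load": {}, "busy_disk": {}},
    "hosts": {"db-01": {"threshold_config": ["high_cpu_load", "busy_disk"]}},
    "host_threshold_mapping": {"prod-web-01": "high_cpu_load"},
}
print(resolve_config_names("db-01", config))
```

The returned names would then be layered in order on top of the defaults, as described in the composable configurations section.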

-### Example: Complete Multi-Threshold Setup
+### Backward Compatibility

-See `hbd/config_multi_threshold_example.yaml` for a complete example with:
- 4 named configurations (default, high_sensitivity, low_sensitivity, database)
- Host-to-config mappings for production, development, and test systems
- Specialized database server thresholds
- Custom display messages with plugin data
+The legacy `host_threshold_mapping` top-level key and the flat `thresholds` section are still fully supported:
+
+```yaml
+# Still works — equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
+host_threshold_mapping:
+  prod-web-01: high_sensitivity
+
+# Still works — equivalent to threshold_configs: {default: {thresholds: ...}}
+thresholds:
+  cpu_monitor:
+    cpu_percent: {warning: 80, critical: 90}
+```

@@ -0,0 +1,602 @@
+# Plugin Error Checking Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Improve plugin error checking in hbc, especially for nagios_runner, and fix logger messages silently discarded in daemon mode.
+
+**Architecture:** Three focused changes across three files: (1) `hbd/client/plugin.py` gains a `skip_reason` attribute on Plugin and updated PluginLoader messaging; (2) `hbd/client/plugins/nagios_runner.py` gains async subprocess execution, stderr capture, signal-killed process handling, and init-time command path validation; (3) `hbd/client/main.py` gains proper post-fork logging reconfiguration to syslog.
+
+**Tech Stack:** Python 3.11+, asyncio, `logging.handlers.SysLogHandler`, pytest
+
+---
+
+## File Map
+
+| Action | Path | What changes |
+|---|---|---|
+| Modify | `hbd/client/plugin.py` | `Plugin.__init__` gains `skip_reason`; `PluginLoader` checks it |
+| Modify | `hbd/client/plugins/nagios_runner.py` | async subprocess, stderr, signal codes, init validation, `skip_reason` |
+| Modify | `hbd/client/main.py` | `_reconfigure_logging_for_daemon()` helper; remove redundant syslog calls |
+| Create | `tests/test_plugin.py` | PluginLoader messaging tests |
+| Create | `tests/test_nagios_runner.py` | NagiosRunnerPlugin behaviour tests |
+
+Run tests throughout with:
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+
+---
+
+## Task 1: Plugin.skip_reason + PluginLoader messaging
+
+**Files:**
+- Modify: `hbd/client/plugin.py:40-48` (Plugin.__init__)
+- Modify: `hbd/client/plugin.py:369-381` (PluginLoader.load_from_directory)
+- Create: `tests/test_plugin.py`
+
+- [ ] **Step 1: Write failing tests**
+
+Create `tests/test_plugin.py`:
+
+```python
+import asyncio
+import logging
+import textwrap
+
+from hbd.client.plugin import Plugin, PluginLoader, PluginRegistry
+
+
+def test_plugin_skip_reason_defaults_none(tmp_path):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class MinimalPlugin(MonitorPlugin):
+            name = "minimal"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return True
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "minimal.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+    asyncio.run(loader.load_from_directory(tmp_path))
+    plugin = registry.get("minimal")
+    assert plugin is not None
+    assert plugin.skip_reason is None
+
+
+def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class SkippablePlugin(MonitorPlugin):
+            name = "skippable"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                self.skip_reason = "not configured in yaml"
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "skippable.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.INFO, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
+    assert not any("failed initialization" in r.message for r in caplog.records)
+
+
+def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class FailPlugin(MonitorPlugin):
+            name = "fail"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "fail_plugin.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("failed initialization" in r.message for r in caplog.records)
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+python -m pytest tests/test_plugin.py -v
+```
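+Expected: `test_plugin_skip_reason_defaults_none` FAILS (attribute missing) and `test_loader_logs_info_when_skip_reason_set` FAILS (no `skipped:` message is logged yet); `test_loader_logs_warning_when_no_skip_reason` should already pass because it exercises the existing warning.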
+
+- [ ] **Step 3: Add `skip_reason` to `Plugin.__init__`**
+
+In `hbd/client/plugin.py`, in `Plugin.__init__` (around line 46), add one line:
+
+```python
+def __init__(self, config: Optional[Dict[str, Any]] = None):
+    self.config = config or {}
+    self.logger = logging.getLogger(f"plugin.{self.name}")
+    self._initialized = False
+    self.skip_reason: Optional[str] = None
+```
+
+- [ ] **Step 4: Update PluginLoader messaging**
+
+In `hbd/client/plugin.py`, replace the `if not initialized:` block (around line 372):
+
+```python
+                    if not initialized:
+                        if plugin.skip_reason:
+                            self.logger.info(
+                                f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
+                            )
+                        else:
+                            self.logger.warning(
+                                f"Plugin {plugin.name} failed initialization, skipping"
+                            )
+                        continue
+```
+
+- [ ] **Step 5: Run tests to verify they pass**
+
+```bash
+python -m pytest tests/test_plugin.py -v
+```
+Expected: all 3 tests PASS.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add hbd/client/plugin.py tests/test_plugin.py
+git commit -m "feat: add skip_reason to Plugin; improve PluginLoader init messaging"
+```
+
+---
+
+## Task 2: NagiosRunnerPlugin — skip_reason when no commands
+
+**Files:**
+- Modify: `hbd/client/plugins/nagios_runner.py:88-105` (initialize)
+- Create: `tests/test_nagios_runner.py`
+
+- [ ] **Step 1: Write failing test**
+
+Create `tests/test_nagios_runner.py`:
+
+```python
+import asyncio
+import logging
+import os
+import stat
+
+import pytest
+
+from hbd.client.plugins.nagios_runner import (
+    NagiosRunnerPlugin,
+    NAGIOS_OK,
+    NAGIOS_WARNING,
+    NAGIOS_CRITICAL,
+    NAGIOS_UNKNOWN,
+)
+
+
+def test_no_commands_sets_skip_reason():
+    plugin = NagiosRunnerPlugin(config={"commands": []})
+    result = asyncio.run(plugin.initialize())
+    assert result is False
+    assert plugin.skip_reason is not None
+    assert "nagios_runner.commands" in plugin.skip_reason
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
+```
+Expected: FAIL — `plugin.skip_reason` is `None`.
+
+- [ ] **Step 3: Set skip_reason in NagiosRunnerPlugin.initialize()**
+
+In `hbd/client/plugins/nagios_runner.py`, replace the early-return block in `initialize()` (around line 96):
+
+```python
+        if not self.commands:
+            self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
+            self.logger.info("No Nagios commands configured")
+            return False
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_no_commands_sets_skip_reason -v
+```
+Expected: PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
+git commit -m "feat: set skip_reason on nagios_runner when no commands configured"
+```
+
+---
+
+## Task 3: NagiosRunnerPlugin — async subprocess, stderr capture, negative return codes
+
+**Files:**
+- Modify: `hbd/client/plugins/nagios_runner.py` (imports + `_run_nagios_plugin`)
+- Modify: `tests/test_nagios_runner.py`
+
+- [ ] **Step 1: Write failing tests**
+
+Append to `tests/test_nagios_runner.py`:
+
+```python
+def test_stderr_used_when_stdout_empty(tmp_path):
+    script = tmp_path / "check_err.sh"
+    script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "error from stderr" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_CRITICAL
+
+
+def test_stderr_appended_when_both_present(tmp_path):
+    script = tmp_path / "check_both.sh"
+    script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "OK - all good" in data["t_output"]
+    assert "extra detail" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_OK
+
+
+def test_negative_returncode_maps_to_unknown():
+    # kill -9 $$ kills the shell itself; asyncio sees returncode -9
+    config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert data["t_status_code"] == NAGIOS_UNKNOWN
+    assert "signal" in data["t_output"].lower()
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_stderr_used_when_stdout_empty \
+    tests/test_nagios_runner.py::test_stderr_appended_when_both_present \
+    tests/test_nagios_runner.py::test_negative_returncode_maps_to_unknown -v
+```
+Expected: all FAIL — current implementation ignores stderr and doesn't handle negative codes.
+
+- [ ] **Step 3: Update imports in nagios_runner.py**
+
+Replace the import block at the top of `hbd/client/plugins/nagios_runner.py`:
+
+```python
+import asyncio
+import os
+import re
+from typing import Any, Dict, List, Optional, Tuple
+
+from hbd.client.plugin import MonitorPlugin
+```
+
+(Remove `import subprocess`; add `import asyncio` and `import os`.)
+
+- [ ] **Step 4: Upgrade collection log level from DEBUG to INFO**
+
+In `hbd/client/plugins/nagios_runner.py`, in `_collect_metrics()`, change the debug log (around line 144) so results are visible at INFO level:
+
+```python
+                self.logger.info(
+                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
+                )
+```
+
+- [ ] **Step 5: Replace `_run_nagios_plugin` with async implementation**
+
+Replace the entire `_run_nagios_plugin` method in `hbd/client/plugins/nagios_runner.py`:
+
+```python
+    async def _run_nagios_plugin(
+        self,
+        command: str
+    ) -> Tuple[int, str, Dict[str, Any]]:
+        """Execute a Nagios plugin and parse its output."""
+        try:
+            proc = await asyncio.create_subprocess_shell(
+                command,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+            try:
+                stdout_bytes, stderr_bytes = await asyncio.wait_for(
+                    proc.communicate(), timeout=self.timeout
+                )
+            except asyncio.TimeoutError:
+                proc.kill()
+                await proc.communicate()
+                self.logger.error(f"Command timed out: {command}")
+                return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
+
+            status_code = proc.returncode
+
+            if status_code < 0:
+                return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}
+
+            if status_code > 3:
+                status_code = NAGIOS_UNKNOWN
+
+            stdout = stdout_bytes.decode(errors="replace").strip()
+            stderr = stderr_bytes.decode(errors="replace").strip()
+
+            # Parse perfdata from stdout before mixing in stderr
+            perfdata = self._parse_perfdata(stdout)
+
+            # Build status message
+            status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout
+
+            if not stdout and stderr:
+                output_msg = stderr
+            elif stdout and stderr:
+                output_msg = f"{status_part} [stderr: {stderr}]"
+            else:
+                output_msg = status_part
+
+            return status_code, output_msg, perfdata
+
+        except Exception as e:
+            self.logger.error(f"Error executing command: {e}")
+            return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
+```
+
+Also remove the now-unused `self.shell` line from `NagiosRunnerPlugin.__init__`. The `shell` config key no longer has any effect because `create_subprocess_shell` always runs the command through a shell:
+```python
+        self.shell: bool = config.get("shell", True) if config else True
+```
+
+- [ ] **Step 6: Run tests to verify they pass**
+
+```bash
+python -m pytest tests/test_nagios_runner.py -v
+```
+Expected: all tests PASS including the 3 new ones.
+
+- [ ] **Step 7: Commit**
+
+```bash
+git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
+git commit -m "feat: async subprocess in nagios_runner with stderr capture and signal handling"
+```
+
+---
+
+## Task 4: NagiosRunnerPlugin — command path validation at init
+
+**Files:**
+- Modify: `hbd/client/plugins/nagios_runner.py` (initialize)
+- Modify: `tests/test_nagios_runner.py`
+
+- [ ] **Step 1: Write failing tests**
+
+Append to `tests/test_nagios_runner.py`:
+
+```python
+def test_absolute_path_not_found_warns(caplog):
+    fake_cmd = "/nonexistent_hbc_test_path/check_something"
+    config = {"commands": [{"name": "t", "command": fake_cmd}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not found" in r.message for r in caplog.records)
+
+
+def test_absolute_path_not_executable_warns(caplog, tmp_path):
+    non_exec = tmp_path / "check_test"
+    non_exec.write_text("#!/bin/sh\necho OK\n")
+    non_exec.chmod(0o644)  # readable but not executable
+
+    config = {"commands": [{"name": "t", "command": str(non_exec)}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not executable" in r.message for r in caplog.records)
+
+
+def test_relative_path_not_checked(caplog):
+    # Relative paths (resolved via PATH) must not generate warnings
+    config = {"commands": [{"name": "t", "command": "echo OK"}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert not any(
+        "not found" in r.message or "not executable" in r.message
+        for r in caplog.records
+    )
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+python -m pytest tests/test_nagios_runner.py::test_absolute_path_not_found_warns \
+    tests/test_nagios_runner.py::test_absolute_path_not_executable_warns \
+    tests/test_nagios_runner.py::test_relative_path_not_checked -v
+```
+Expected: `test_absolute_path_not_found_warns` and `test_absolute_path_not_executable_warns` FAIL (no warnings logged); `test_relative_path_not_checked` may pass.
+
+- [ ] **Step 3: Add command path validation to `initialize()`**
+
+In `hbd/client/plugins/nagios_runner.py`, extend `initialize()` by adding validation after the existing "log each command" loop (after line 103, before `return True`):
+
+```python
+        # Validate absolute command paths early
+        for cmd_config in self.commands:
+            name = cmd_config.get("name", "unnamed")
+            command = cmd_config.get("command", "")
+            if not command:
+                continue
+            exe = command.split()[0]
+            if os.path.isabs(exe):
+                if not os.path.isfile(exe):
+                    self.logger.warning(
+                        f"Command '{name}': executable not found: {exe}"
+                    )
+                elif not os.access(exe, os.X_OK):
+                    self.logger.warning(
+                        f"Command '{name}': file is not executable: {exe}"
+                    )
+```
+
+- [ ] **Step 4: Run full test suite to verify all pass**
+
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+Expected: all tests PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/client/plugins/nagios_runner.py tests/test_nagios_runner.py
+git commit -m "feat: validate absolute command paths at nagios_runner init"
+```
+
+---
+
+## Task 5: Daemon mode logging — route to syslog after fork
+
+**Files:**
+- Modify: `hbd/client/main.py` (new helper + updated daemon block)
+
+No automated test for daemonization itself (fork behaviour is hard to unit-test). Manual verification steps are provided below.
+
+- [ ] **Step 1: Add `_reconfigure_logging_for_daemon` helper**
+
+In `hbd/client/main.py`, add this function just before `def build_parser()` (around line 589):
+
+```python
+def _reconfigure_logging_for_daemon(log_level: int) -> None:
+    """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
+    import os
+    from logging.handlers import SysLogHandler
+
+    root = logging.getLogger()
+    for handler in root.handlers[:]:
+        root.removeHandler(handler)
+        handler.close()
+
+    # SysLogHandler ignores connection errors for Unix-socket addresses at
+    # construction time, so `except OSError` would never trigger the fallback.
+    # Check for the socket explicitly instead.
+    use_unix_socket = os.path.exists("/dev/log")
+    syslog_handler = SysLogHandler(
+        address="/dev/log" if use_unix_socket else ("localhost", 514),
+        facility=SysLogHandler.LOG_DAEMON,
+    )
+    syslog_handler.setFormatter(
+        logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
+    )
+    root.addHandler(syslog_handler)
+    root.setLevel(log_level)
+
+    if not use_unix_socket:
+        # Logged after the fallback handler is attached so the warning reaches syslog
+        logging.warning("/dev/log not found, using syslog UDP localhost:514")
+
+- [ ] **Step 2: Update the daemon block in `main()`**
+
+In `hbd/client/main.py`, replace the entire `if args.daemon:` block (lines 664–675):
+
+```python
+    if args.daemon:
+        print("Daemonizing...")
+        daemonize()
+        _reconfigure_logging_for_daemon(log_level)
+        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
+```
+
+This removes the `import syslog`, `syslog.openlog()`, and `syslog.syslog()` calls (now handled by the logging system) and removes the no-op second `logging.basicConfig()` call.
+
+- [ ] **Step 3: Run existing test suite to confirm no regressions**
+
+```bash
+python -m pytest tests/test_plugin.py tests/test_nagios_runner.py -v
+```
+Expected: all tests still PASS.
+
+- [ ] **Step 4: Manual smoke test — verify syslog output in daemon mode**
+
+```bash
+# In one terminal, tail syslog
+sudo journalctl -f -t hbc
+
+# In another terminal, start hbc in daemon mode (replace localhost with the target host if needed)
+python -m hbd.client.main -d -v localhost
+
+# Expected in journalctl output:
+#   hbc[<pid>]: hbc.main INFO: Starting hbc for <hostname> -> ['localhost']
+#   hbc[<pid>]: hbc.main INFO: hbc starting, sending heartbeat to localhost
+#   hbc[<pid>]: plugin.loader INFO: ...
+
+# Stop the daemon
+pkill -f "hbd.client.main"
+```
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/client/main.py
+git commit -m "fix: reconfigure logging to syslog after daemonize() instead of no-op basicConfig"
+```
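A note on the `/dev/log` fallback in Task 5: CPython's `SysLogHandler` ignores connection errors for Unix-socket addresses when it is constructed, so a missing socket may not surface as `OSError`. The following is a minimal sketch of an explicit existence check. `pick_syslog_address` and `make_syslog_handler` are hypothetical helper names that are not part of hbc, and the UDP fallback address is the one named in the plan.

```python
import os
from logging.handlers import SysLogHandler
from typing import Tuple, Union

SyslogAddress = Union[str, Tuple[str, int]]


def pick_syslog_address(unix_socket: str = "/dev/log") -> SyslogAddress:
    """Return the Unix socket path if it exists, else the UDP fallback.

    Hypothetical helper: the fallback ("localhost", 514) mirrors the plan.
    """
    if os.path.exists(unix_socket):
        return unix_socket
    return ("localhost", 514)


def make_syslog_handler(unix_socket: str = "/dev/log") -> SysLogHandler:
    """Build a LOG_DAEMON handler on whichever address is available."""
    return SysLogHandler(
        address=pick_syslog_address(unix_socket),
        facility=SysLogHandler.LOG_DAEMON,
    )
```

Under these assumptions, the caller can compare the returned address with the socket path to decide whether to log the one-time fallback warning.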
@@ -0,0 +1,92 @@
+# Plugin Error Checking & Daemon Logging — Design Spec
+
+**Date:** 2026-04-25  
+**Scope:** hbc client — daemon mode logging, nagios_runner plugin robustness, PluginLoader messaging  
+**Files affected:** `hbd/client/main.py`, `hbd/client/plugins/nagios_runner.py`, `hbd/client/plugin.py`
+
+---
+
+## 1. Daemon Mode Logging
+
+### Problem
+In `main()`, `logging.basicConfig()` is called before `daemonize()` (establishing a StreamHandler to stderr), then called again after `daemonize()`. The second call is a no-op — Python ignores `basicConfig()` when handlers are already configured. After daemonization, stderr is redirected to `/dev/null`, so all subsequent log output is silently discarded.
+
+The existing `syslog.openlog()` / `syslog.syslog()` calls (lines 666–668) write a single startup message but do not integrate with the `logging` system, so plugin and connection log messages never reach syslog.
+
+### Fix
+After `daemonize()`, explicitly reconfigure the root logger:
+
+1. Remove all existing handlers (they now write to `/dev/null`).
+2. Add `logging.handlers.SysLogHandler(address='/dev/log', facility=LOG_DAEMON)`.
+3. Set formatter: `hbc[%(process)d]: %(name)s %(levelname)s: %(message)s`
+4. Preserve the `log_level` already determined from `-v`/`-x` CLI flags.
+
+Remove the redundant `syslog.openlog()` / `syslog.syslog()` calls — the logging system handles routing.
+
+**Fallback:** If `/dev/log` does not exist (containers, some BSDs), fall back to `SysLogHandler(address=('localhost', 514))`. Log one warning through the fallback handler so the operator knows; stderr already points at `/dev/null` at this stage, so a warning written there would be lost.
+
+---
+
+## 2. Nagios Runner Improvements
+
+### 2a — Async Subprocess
+`_run_nagios_plugin()` is declared `async def` but calls `subprocess.run()` synchronously, blocking the event loop for the full command duration.
+
+**Fix:** Replace with `asyncio.create_subprocess_shell()` + `await proc.communicate()`. Enforce timeout with `asyncio.wait_for(..., timeout=self.timeout)` and catch `asyncio.TimeoutError`.
+
+### 2b — Stderr Capture
+Subprocess stderr is currently lost: `capture_output=True` captures both stdout and stderr in the sync call, but only stdout is used to build the output message.
+
+**Fix:** Pass `stderr=asyncio.subprocess.PIPE` to `create_subprocess_shell`. After `communicate()`, if stdout is empty but stderr has content, use stderr as the output message. If both have content, append stderr to the output for visibility.
+
+### 2c — Negative Return Codes
+A negative `returncode` means the process was killed by a signal (SIGKILL, OOM, etc.). The current code treats these as-is, which may produce unexpected status values.
+
+**Fix:** If `returncode < 0`, map to `NAGIOS_UNKNOWN` with message `"Process killed by signal {-returncode}"`.
+
+### 2d — Command Path Validation at Init
+`initialize()` currently only checks that the commands list is non-empty.
+
+**Fix:** For each command entry during `initialize()`:
+- Warn and skip the entry if `name` or `command` is missing.
+- Extract the executable (first whitespace-delimited token of the command string).
+- If the executable is an absolute path, check `os.path.isfile()` and `os.access(..., os.X_OK)`. Log a `WARNING` if either check fails.
+- Commands with relative paths or shell builtins are not checked (they may be on PATH) — just noted.
+- Validation warns only; all original entries in `self.commands` are retained and still attempted at collection time (where the existing missing-name/command guard already skips them). The plugin initializes successfully as long as the commands list is non-empty.
+
+---
+
+## 3. PluginLoader Messaging
+
+### Problem
+When `initialize()` returns `False`, the loader always logs:
+> `WARNING: Plugin X failed initialization, skipping`
+
+This is alarming when the real reason is simply "no commands configured". There is no API to distinguish "not configured" from "genuinely broken".
+
+### Fix
+Add an optional `skip_reason` attribute to `Plugin.__init__()` (defaults to `None`).
+
+In `PluginLoader.load_from_directory()`, after `initialize()` returns `False`:
+- If `plugin.skip_reason` is set → `logger.info(f"Plugin {plugin.name} skipped: {plugin.skip_reason}")`
+- If `plugin.skip_reason` is `None` → `logger.warning(f"Plugin {plugin.name} failed initialization, skipping")` (existing behaviour)
+
+In `NagiosRunnerPlugin.initialize()`, when no commands are configured:
+```python
+self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
+return False
+```
+
+Genuine failures (exceptions) continue to go through the existing `except` block in the loader, logging at `ERROR` with traceback — unchanged.
+
+---
+
+## Decisions
+
+| Topic | Decision |
+|---|---|
+| Daemon log destination | syslog only (LOG_DAEMON facility) |
+| Syslog fallback | localhost:514 UDP if `/dev/log` absent |
+| Nagios result log level | INFO for all statuses (OK/WARNING/CRITICAL/UNKNOWN) |
+| Invalid command handling at init | Warn and continue; still attempt at collection time |
+| PluginLoader API change | `skip_reason` attribute on Plugin base class, checked by loader |
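The no-op second `basicConfig()` call described in section 1 of the spec can be reproduced in isolation. The sketch below is illustrative only: `demo_basicconfig_noop` is a hypothetical name, and in-memory streams stand in for the real stderr.

```python
import io
import logging
from typing import Tuple


def demo_basicconfig_noop() -> Tuple[str, str, int]:
    """Show that a second basicConfig() call is ignored once handlers exist."""
    root = logging.getLogger()
    saved_handlers, saved_level = root.handlers[:], root.level
    for handler in saved_handlers:
        root.removeHandler(handler)  # start from an unconfigured root logger

    first, second = io.StringIO(), io.StringIO()
    try:
        logging.basicConfig(stream=first, level=logging.INFO)
        # Ignored: the root logger already has a handler (no force=True).
        logging.basicConfig(stream=second, level=logging.DEBUG)
        logging.getLogger("hbc.demo").info("hello")
        return first.getvalue(), second.getvalue(), len(root.handlers)
    finally:
        # Restore whatever logging setup existed before the demo.
        for handler in root.handlers[:]:
            root.removeHandler(handler)
        for handler in saved_handlers:
            root.addHandler(handler)
        root.setLevel(saved_level)
```

Only the first stream receives the record. Passing `force=True` to the second call would replace the handler, but a default `StreamHandler` would still write to the daemon's redirected stderr, which is consistent with the spec's choice to install a `SysLogHandler` explicitly.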
@@ -1,21 +0,0 @@
-Plan the following changes, ask questions to clarify before implementing
-
-Re-factor the notification system:
- use available libraries for pushover, matrix, email and sms notifications.
- notifications have a title/subject:  alert_type (recover/warning/critical), a body (info from threshold check) and a link to the host plugin metrix page
- define a list of notification channels for each user
- notifications are dispatched to users that are listed as managers for the host
-
-
-
-1 - correct
-2 - for now channels are defined globaly 
-3 - matrix-nio)sounds good, homeserver URL, access token, room ID per channel?
-4 - use the REST api provided by https://voip.ms/api/v1/rest.php
-5 - The page does not exist yet, point at the host tab in the /plugins
-6 - per-channel minimum severity is a good idea, go fo it
-7 - yes
-
-1 - use base_url, there might not have been any incoming requests yet
-2 - use same asyncio loop for matrix-nio
-3 - for now, just silently do nothing
@@ -14,4 +14,4 @@ Install options:
 """

 __all__ = ["__version__"]
-__version__ = "5.1.1"
+__version__ = "5.1.21"
@@ -14,7 +14,7 @@ import signal
 import socket
 import sys
 import time
-from hashlib import md5
+from logging.handlers import SysLogHandler
 from pathlib import Path
 from typing import Dict, List, Optional

@@ -55,6 +55,9 @@ class AsyncConnection:
        
        self.transport: Optional[asyncio.DatagramTransport] = None
        self.protocol: Optional[asyncio.DatagramProtocol] = None
+        self._dead = False
+        self._ever_opened = False
+        self._open_fail_count = 0   # consecutive failures before first success

        self.logger = logging.getLogger(f"hbc.conn.{addr}")

@@ -72,6 +75,7 @@ class AsyncConnection:
                lambda: HeartbeatProtocol(self),
                family=self.af
            )
+            self._ever_opened = True
            self.logger.debug(f"Opened connection to {self.addr}:{self.port}")
            return True
        except Exception as e:
@@ -92,6 +96,9 @@ class AsyncConnection:
            msg: Message dictionary
            msg_id: Message ID (HTB, PLG, etc.)
        """
+        if self._dead:
+            return
+
        if not self.transport:
            await self.open()

@@ -166,7 +173,9 @@ class HeartbeatProtocol(asyncio.DatagramProtocol):
    
    def error_received(self, exc):
        """Handle protocol errors."""
-        self.logger.error(f"Protocol error: {exc}")
+        self.logger.warning(f"Protocol error on {self.connection.addr}: {exc} — dropping connection")
+        self.connection._dead = True
+        self.connection.close()


 async def handle_command(conn: AsyncConnection, msg: dict):
@@ -203,48 +212,45 @@ async def handle_command(conn: AsyncConnection, msg: dict):
    await conn.sendto(response)


-async def handle_update(conn: AsyncConnection, msg: dict):
-    """Handle self-update from server."""
-    import codecs
+async def handle_update(conn: AsyncConnection, _msg: dict):  # pyright: ignore[reportUnusedParameter]
+    """Handle self-update by running hb_install.sh."""
    import shutil

    logger = logging.getLogger("hbc.update")

-    try:
-        code = codecs.decode(msg["code"], "base64").decode()
-        csum = msg["csum"]
-    except Exception as e:
-        error = f"Missing code/csum: {e}"
+    installer = shutil.which("hb_install.sh")
+    if installer is None:
+        candidate = Path(sys.argv[0]).parent / "hb_install.sh"
+        if candidate.exists():
+            installer = str(candidate)
+
+    if installer is None:
+        error = "hb_install.sh not found in PATH or alongside hbc"
        logger.error(error)
        await conn.sendto({"service": "update", "msg": error})
        return

-    # Verify checksum
-    m = md5()
-    m.update(code.encode())
-    if m.hexdigest() != csum:
-        error = "Checksum mismatch"
+    logger.info(f"Running installer: {installer}")
+    try:
+        proc = await asyncio.create_subprocess_exec(
+            installer, "client",
+            stdout=asyncio.subprocess.PIPE,
+            stderr=asyncio.subprocess.STDOUT,
+        )
+        out, _ = await asyncio.wait_for(proc.communicate(), timeout=120)
+    except asyncio.TimeoutError:
+        error = "Installer timed out"
+        logger.error(error)
+        await conn.sendto({"service": "update", "msg": error})
+        return
+    except Exception as e:
+        error = f"Installer failed: {e}"
        logger.error(error)
        await conn.sendto({"service": "update", "msg": error})
        return

-    # Backup current file
-    fn = sys.argv[0]
-    ofn = f"{fn}.sav"
-    try:
-        shutil.copy2(fn, ofn)
-    except Exception as e:
-        error = f"Backup failed: {e}"
-        logger.error(error)
-        await conn.sendto({"service": "update", "msg": error})
-        return
-    
-    # Write new code
-    try:
-        with open(fn, "w") as fh:
-            fh.write(code)
-    except Exception as e:
-        error = f"Write failed: {e}"
+    if proc.returncode != 0:
+        error = f"Installer exited {proc.returncode}: {out.decode().strip()}"
        logger.error(error)
        await conn.sendto({"service": "update", "msg": error})
        return
@@ -259,15 +265,51 @@ async def handle_update(conn: AsyncConnection, msg: dict):


 async def heartbeat_sender(conn: AsyncConnection, interval: int):
-    """Send periodic heartbeats.
+    """Send periodic heartbeats, retrying the connection if it is not open.
+
+    IPv6 connections that fail to open before their first successful send are
+    dropped after IPV6_EARLY_FAIL_LIMIT attempts so that a network without IPv6
+    does not keep a dead sender alive.  IPv4 connections are retried indefinitely.

    Args:
        conn: Connection to send on
        interval: Heartbeat interval in seconds
    """
    logger = logging.getLogger("hbc.heartbeat")
+    IPV6_EARLY_FAIL_LIMIT = 3
+
+    while running and not conn._dead:
+        # Ensure transport is open before attempting to send.
+        if not conn.transport:
+            opened = await conn.open()
+            if opened:
+                conn._open_fail_count = 0
+            else:
+                conn._open_fail_count += 1
+                # Drop an IPv6 connection that has never come up within the
+                # first few attempts — it is likely unavailable on this network.
+                if (not conn._ever_opened
+                        and conn.af == socket.AF_INET6
+                        and conn._open_fail_count >= IPV6_EARLY_FAIL_LIMIT):
+                    logger.warning(
+                        f"IPv6 connection to {conn.addr} unreachable after "
+                        f"{conn._open_fail_count} attempts, disabling"
+                    )
+                    conn._dead = True
+                    break
+                # Retry after the normal interval; IPv4 retries forever.
+                try:
+                    if shutdown_event:
+                        await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
+                        break
+                    else:
+                        await asyncio.sleep(interval)
+                except asyncio.TimeoutError:
+                    pass
+                except asyncio.CancelledError:
+                    raise
+                continue

-    while running:
        try:
            msg = {
                "acks": conn.ackcount,
@@ -276,19 +318,16 @@ async def heartbeat_sender(conn: AsyncConnection, interval: int):
            }
            await conn.sendto(msg, "HTB")

-        except Exception as e:
-            logger.error(f"Error sending heartbeat: {e}", exc_info=True)
        except asyncio.CancelledError:
            logger.debug("Heartbeat sender cancelled")
            raise
+        except Exception as e:
+            logger.error(f"Error sending heartbeat: {e}", exc_info=True)

        # Wait for next interval or shutdown event
        try:
            if shutdown_event:
-                await asyncio.wait_for(
-                    shutdown_event.wait(), 
-                    timeout=interval
-                )
+                await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
                break
            else:
                await asyncio.sleep(interval)
@@ -424,16 +463,13 @@ async def cleanup(connections: List[AsyncConnection]):
    logger = logging.getLogger("hbc.cleanup")
    logger.info("Cleaning up connections")
    
-    for conn in connections:
+    target = next((c for c in connections if c.transport), connections[0] if connections else None)
+    if target:
        try:
-            msg = {
-                "shutdown": 1,
-                "acks": conn.ackcount
-            }
-            await conn.sendto(msg)
+            await target.sendto({"shutdown": 1, "acks": target.ackcount})
        except Exception as e:
            logger.error(f"Error sending shutdown: {e}")
-        
+    for conn in connections:
        conn.close()
    
    # Give messages time to send
@@ -478,12 +514,13 @@ async def async_main(args, config):
            addr = addr_info[4][0]

            conn = AsyncConnection(conn_id, addr, hb_port, af, iam)
-            if await conn.open():
+            if not await conn.open():
+                logger.warning(f"Initial open to {addr} failed, heartbeat sender will retry")
            connections.append(conn)
            conn_id += 1

    if not connections:
-        logger.error("No connections established")
+        logger.error("No connections established (DNS resolution failed for all hosts)")
        return 1
    
    logger.info(f"Created {len(connections)} connections")
@@ -498,8 +535,8 @@ async def async_main(args, config):
            boot_msg["msg"] = args.message
        
        boot_msg["acks"] = 0
-        for conn in connections:
-            await conn.sendto(boot_msg)
+        target = next((c for c in connections if c.transport), connections[0])
+        await target.sendto(boot_msg)
        
        if args.message and not args.daemon:
            # Message-only mode
@@ -522,6 +559,13 @@ async def async_main(args, config):
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop)

+    def _sighup():
+        global dorestart
+        dorestart = True
+        stop()
+
+    loop.add_signal_handler(signal.SIGHUP, _sighup)
+    
    # Start async tasks
    # Heartbeat senders (one per connection)
    for conn in connections:
@@ -586,6 +630,36 @@ def daemonize(
    os.dup2(se.fileno(), sys.stderr.fileno())


+def _reconfigure_logging_for_daemon(log_level: int) -> None:
+    """Replace StreamHandlers (now writing to /dev/null) with a SysLogHandler."""
+    root = logging.getLogger()
+    for handler in root.handlers[:]:
+        root.removeHandler(handler)
+        handler.close()
+
+    use_udp_fallback = not os.path.exists("/dev/log")
+
+    if use_udp_fallback:
+        syslog_handler = SysLogHandler(
+            address=("localhost", 514),
+            facility=SysLogHandler.LOG_DAEMON,
+        )
+    else:
+        syslog_handler = SysLogHandler(
+            address="/dev/log",
+            facility=SysLogHandler.LOG_DAEMON,
+        )
+
+    syslog_handler.setFormatter(
+        logging.Formatter("hbc[%(process)d]: %(name)s %(levelname)s: %(message)s")
+    )
+    root.addHandler(syslog_handler)
+    root.setLevel(log_level)
+
+    if use_udp_fallback:
+        logging.warning("/dev/log not found, using syslog UDP localhost:514")
+
+
 def build_parser():
    """Build argument parser."""
    parser = argparse.ArgumentParser(
@@ -662,17 +736,10 @@ def main(argv=None):
    
    # Daemonize if requested
    if args.daemon:
-        print("Daemonizing...")
-        import syslog
-        syslog.openlog("hbc", syslog.LOG_PID, syslog.LOG_DAEMON)
-        syslog.syslog(syslog.LOG_INFO, f"Starting heartbeat to {', '.join(args.hosts)}")
+        logging.info("Daemonizing...")
        daemonize()
-        
-        # Reconfigure logging for syslog
-        logging.basicConfig(
-            level=log_level,
-            format="hbc[%(process)d]: %(name)s %(levelname)s: %(message)s"
-        )
+        _reconfigure_logging_for_daemon(log_level)
+        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
    
    # Run async main
    try:
@@ -29,6 +29,7 @@ class Plugin(ABC):
        description: Human-readable description
        interval: Collection interval in seconds (0 for InfoPlugin = collect once)
        enabled: Whether plugin is active (can be disabled via config)
+        skip_reason: Set by plugin before returning False from initialize(); causes loader to log INFO instead of WARNING.
    """
    
    name: str = ""
@@ -46,6 +47,7 @@ class Plugin(ABC):
        self.config = config or {}
        self.logger = logging.getLogger(f"plugin.{self.name}")
        self._initialized = False
+        self.skip_reason: Optional[str] = None
        
    @abstractmethod
    async def initialize(self) -> bool:
@@ -312,9 +314,10 @@ class PluginLoader:
        
        loaded_count = 0
        raw_config = config or {}
-        # Per-plugin config lives under the 'plugins' key; fall back to top-level
-        # for backwards compatibility.
-        plugin_config = raw_config.get("plugins", raw_config)
+        # Per-plugin config lives under the 'plugins' key or at top-level.
+        # CLIENT_DEFAULTS seeds "plugins": {} so the key always exists; check
+        # both the subdict and top-level so that either layout in .hbc.yaml works.
+        plugins_subconfig = raw_config.get("plugins", {})
        
        # Scan for Python files
        for plugin_file in directory.glob("*.py"):
@@ -359,14 +362,20 @@ class PluginLoader:
                    
                    self.logger.debug(f"Found plugin class: {name}")
                    
-                    # Instantiate plugin with config
-                    plugin_instance_config = plugin_config.get(obj.name, {})
+                    # Instantiate plugin with config — check plugins subdict first,
+                    # then top-level keys (e.g. nagios_runner: ... at root of config).
+                    plugin_instance_config = plugins_subconfig.get(obj.name) or raw_config.get(obj.name, {})
                    plugin = obj(config=plugin_instance_config)
                    
                    # Initialize plugin
                    try:
                        initialized = await plugin.initialize()
                        if not initialized:
+                            if plugin.skip_reason:
+                                self.logger.info(
+                                    f"Plugin {plugin.name} skipped: {plugin.skip_reason}"
+                                )
+                            else:
                                self.logger.warning(
                                    f"Plugin {plugin.name} failed initialization, skipping"
                                )
@@ -119,6 +119,13 @@ class CPUMonitorPlugin(MonitorPlugin):
            except Exception as e:
                self.logger.debug(f"Could not get CPU times: {e}")

+            # Uptime in seconds
+            try:
+                import time
+                data["uptime_seconds"] = int(time.time() - self.psutil.boot_time())
+            except Exception as e:
+                self.logger.debug(f"Could not get uptime: {e}")
+            
            self.logger.debug(
                f"Collected CPU metrics: {data.get('cpu_percent', 'N/A')}% usage"
            )
@@ -14,6 +14,24 @@ except ImportError:

 from hbd.client.plugin import MonitorPlugin

+
+def _zfs_arc_bytes() -> int:
+    """Return current ZFS ARC size in bytes, or 0 if ZFS is not present.
+
+    ZFS ARC is reclaimable but is not included in MemAvailable by the Linux
+    kernel (it is not in SReclaimable), so it would otherwise be counted as
+    used memory.
+    """
+    try:
+        with open("/proc/spl/kstat/zfs/arcstats") as fh:
+            for line in fh:
+                parts = line.split()
+                if len(parts) >= 3 and parts[0] == "size":
+                    return int(parts[2])
+    except (OSError, ValueError):
+        pass
+    return 0
+
 logger = logging.getLogger(__name__)


@@ -101,11 +119,21 @@ class MemoryMonitorPlugin(MonitorPlugin):
        
        # Virtual (physical) memory statistics
        vmem = psutil.virtual_memory()
+
+        # psutil's available already excludes page cache / file buffers
+        # (uses MemAvailable on Linux). Add ZFS ARC on top because the kernel
+        # does not include it in SReclaimable / MemAvailable even though it is
+        # reclaimable.
+        arc_bytes = _zfs_arc_bytes()
+        available = min(vmem.available + arc_bytes, vmem.total)
+        used = vmem.total - available
+        percent = round(used / vmem.total * 100, 1) if vmem.total else 0.0
+
        metrics['memory_total'] = vmem.total
-        metrics['memory_available'] = vmem.available
-        metrics['memory_used'] = vmem.used
+        metrics['memory_available'] = available
+        metrics['memory_used'] = used
        metrics['memory_free'] = vmem.free
-        metrics['memory_percent'] = vmem.percent
+        metrics['memory_percent'] = percent
        
        # Platform-specific memory details
        if hasattr(vmem, 'active'):
@@ -21,24 +21,23 @@ nagios_runner:
 ```
 """

+import asyncio
+import os
 import re
-import subprocess
+import shlex
 from typing import Any, Dict, List, Optional, Tuple

 from hbd.client.plugin import MonitorPlugin


 # Nagios exit codes
-NAGIOS_OK = 0
-NAGIOS_WARNING = 1
-NAGIOS_CRITICAL = 2
 NAGIOS_UNKNOWN = 3

 STATUS_NAMES = {
-    NAGIOS_OK: "OK",
-    NAGIOS_WARNING: "WARNING",
-    NAGIOS_CRITICAL: "CRITICAL",
-    NAGIOS_UNKNOWN: "UNKNOWN"
+    0: "OK",
+    1: "WARNING",
+    2: "CRITICAL",
+    3: "UNKNOWN",
 }


@@ -52,7 +51,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
        interval: Collection interval in seconds (default: 300)
        commands: List of command definitions with 'name' and 'command' keys
        timeout: Command execution timeout in seconds (default: 30)
-        shell: Whether to execute commands via shell (default: True)

    Example:
        nagios_runner:
@@ -76,15 +74,8 @@ class NagiosRunnerPlugin(MonitorPlugin):
        # Extract configuration
        self.commands: List[Dict[str, str]] = config.get("commands", []) if config else []
        self.timeout: int = config.get("timeout", 30) if config else 30
-        self.shell: bool = config.get("shell", True) if config else True
        self.interval = config.get("interval", 300) if config else 300
    
-        # Validate commands
-        if not self.commands:
-            self.logger.info(
-                "No Nagios commands configured. Add 'nagios_runner.commands' to config."
-            )
-    
    async def initialize(self) -> bool:
        """Initialize the Nagios runner plugin.

@@ -94,7 +85,7 @@ class NagiosRunnerPlugin(MonitorPlugin):
        self.logger.info(f"Initializing {self.name} plugin")

        if not self.commands:
-            self.logger.info("No Nagios commands configured")
+            self.skip_reason = "no commands configured (add nagios_runner.commands to config)"
            return False

        self.logger.info(f"Configured to run {len(self.commands)} Nagios plugin(s)")
@@ -102,6 +93,29 @@ class NagiosRunnerPlugin(MonitorPlugin):
            name = cmd_config.get("name", "unnamed")
            self.logger.info(f"  - {name}: {cmd_config.get('command', 'N/A')}")

+        # Validate absolute command paths early
+        for cmd_config in self.commands:
+            name = cmd_config.get("name", "unnamed")
+            command = cmd_config.get("command", "")
+            if not command:
+                continue
+            try:
+                tokens = shlex.split(command)
+            except ValueError:
+                continue  # malformed command string; skip validation
+            if not tokens:
+                continue
+            exe = tokens[0]
+            if os.path.isabs(exe):
+                if not os.path.isfile(exe):
+                    self.logger.warning(
+                        f"Command '{name}': executable not found: {exe}"
+                    )
+                elif not os.access(exe, os.X_OK):
+                    self.logger.warning(
+                        f"Command '{name}': file is not executable: {exe}"
+                    )
+
        return True
    
    async def _collect_metrics(self) -> Dict[str, Any]:
@@ -112,9 +126,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
        """
        results = {}

-        # Track overall status (worst status wins)
-        worst_status = NAGIOS_OK
-        
        for cmd_config in self.commands:
            name = cmd_config.get("name")
            command = cmd_config.get("command")
@@ -132,16 +143,12 @@ class NagiosRunnerPlugin(MonitorPlugin):
                results[f"{name}_status_code"] = status_code
                results[f"{name}_output"] = output

-                # Track worst status
-                if status_code > worst_status:
-                    worst_status = status_code
-                
                # Parse and add performance data
                if perfdata:
                    for metric_name, metric_value in perfdata.items():
                        results[f"{name}_{metric_name}"] = metric_value

-                self.logger.debug(
+                self.logger.info(
                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
                )

@@ -150,12 +157,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
                results[f"{name}_status"] = "ERROR"
                results[f"{name}_status_code"] = NAGIOS_UNKNOWN
                results[f"{name}_output"] = str(e)
-                worst_status = NAGIOS_UNKNOWN
-        
-        # Add overall status
-        results["overall_status"] = STATUS_NAMES.get(worst_status, "UNKNOWN")
-        results["overall_status_code"] = worst_status
-        results["plugin_count"] = len(self.commands)

        return results
    
@@ -163,46 +164,49 @@ class NagiosRunnerPlugin(MonitorPlugin):
        self,
        command: str
    ) -> Tuple[int, str, Dict[str, Any]]:
-        """Execute a Nagios plugin and parse its output.
-        
-        Args:
-            command: Command string to execute
-            
-        Returns:
-            Tuple of (status_code, output_message, performance_data_dict)
-        """
+        """Execute a Nagios plugin and parse its output."""
        try:
-            # Run command
-            result = subprocess.run(
+            proc = await asyncio.create_subprocess_shell(
                command,
-                shell=self.shell,
-                capture_output=True,
-                timeout=self.timeout,
-                text=True
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
            )
+            try:
+                stdout_bytes, stderr_bytes = await asyncio.wait_for(
+                    proc.communicate(), timeout=self.timeout
+                )
+            except asyncio.TimeoutError:
+                proc.kill()
+                await proc.communicate()
+                self.logger.error(f"Command timed out: {command}")
+                return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}

-            status_code = result.returncode
-            output = result.stdout.strip()
+            status_code = proc.returncode
+
+            if status_code < 0:
+                return NAGIOS_UNKNOWN, f"Process killed by signal {-status_code}", {}

-            # Nagios plugins can return codes > 3, treat as UNKNOWN
            if status_code > 3:
                status_code = NAGIOS_UNKNOWN

-            # Parse performance data
-            perfdata = self._parse_perfdata(output)
+            stdout = stdout_bytes.decode(errors="replace").strip()
+            stderr = stderr_bytes.decode(errors="replace").strip()

-            # Extract just the status message (before the pipe if present)
-            if '|' in output:
-                output_msg = output.split('|')[0].strip()
+            # Parse perfdata from stdout before mixing in stderr
+            perfdata = self._parse_perfdata(stdout)
+
+            # Build status message
+            status_part = stdout.split('|')[0].strip() if '|' in stdout else stdout
+
+            if not stdout and stderr:
+                output_msg = stderr
+            elif stdout and stderr:
+                output_msg = f"{status_part} [stderr: {stderr}]"
            else:
-                output_msg = output
+                output_msg = status_part

            return status_code, output_msg, perfdata

-        except subprocess.TimeoutExpired:
-            self.logger.error(f"Command timed out: {command}")
-            return NAGIOS_UNKNOWN, f"Command timed out after {self.timeout}s", {}
-        
        except Exception as e:
            self.logger.error(f"Error executing command: {e}")
            return NAGIOS_UNKNOWN, f"Execution error: {str(e)}", {}
@@ -60,6 +60,7 @@ class OSInfoPlugin(InfoPlugin):
                "python_version": platform.python_version(),
                "python_implementation": platform.python_implementation(),
                "hbc_version": hbc_version,
+                "hbc_type": "full",
            }
            
            # Add Linux-specific distribution info
@@ -0,0 +1,130 @@
+"""
+ZFS pool monitoring plugin for Heartbeat.
+
+Collects per-pool health, capacity, and cumulative I/O statistics via zpool(8).
+"""
+
+import asyncio
+import logging
+import shutil
+from typing import Any, Dict, List, Optional
+
+from hbd.client.plugin import MonitorPlugin
+
+logger = logging.getLogger(__name__)
+
+
+def _int(s: str) -> Optional[int]:
+    try:
+        return int(s.strip().rstrip("KMGTBkmgt%x"))
+    except (ValueError, AttributeError):
+        return None
+
+
+def _float(s: str) -> Optional[float]:
+    try:
+        return float(s.strip().rstrip("%x"))
+    except (ValueError, AttributeError):
+        return None
+
+
+class ZFSMonitorPlugin(MonitorPlugin):
+    """Monitor ZFS pool health, capacity, and I/O statistics.
+
+    Collects per pool:
+    - health: ONLINE, DEGRADED, FAULTED, etc.
+    - size / alloc / free: total, allocated and free bytes
+    - capacity: percentage used (0-100)
+    - frag: fragmentation percentage
+    - dedup: deduplication ratio
+    - read_ops / write_ops: cumulative I/O operations since last boot/clear
+    - read_bw / write_bw: cumulative bytes transferred since last boot/clear
+
+    Configuration:
+        interval: collection interval in seconds (default: 300)
+        pools: list of pool names to monitor (default: all)
+    """
+
+    name = "zfs_monitor"
+    description = "ZFS pool health, capacity, and I/O statistics"
+    interval = 300
+
+    def __init__(self, config: Optional[Dict[str, Any]] = None):
+        super().__init__(config)
+        self.interval = self.config.get("interval", 300)
+        self._pools_filter: Optional[List[str]] = self.config.get("pools", None)
+
+    async def initialize(self) -> bool:
+        if not shutil.which("zpool"):
+            self.skip_reason = "zpool not found"
+            return False
+        logger.info("ZFS monitor initialized (interval: %ds)", self.interval)
+        return True
+
+    async def _run(self, *args: str) -> List[str]:
+        """Run a command and return its stdout lines, or [] on error."""
+        try:
+            proc = await asyncio.create_subprocess_exec(
+                *args,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.DEVNULL,
+            )
+            stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=15)
+            return stdout.decode(errors="replace").splitlines()
+        except (FileNotFoundError, asyncio.TimeoutError) as exc:
+            logger.warning("zfs_monitor: %s: %s", args[0], exc)
+            return []
+
+    async def _zpool_list(self) -> Dict[str, Dict]:
+        """Return per-pool health and capacity from `zpool list`."""
+        lines = await self._run(
+            "zpool", "list", "-H", "-p",
+            "-o", "name,health,size,alloc,free,cap,frag,dedup",
+        )
+        pools: Dict[str, Dict] = {}
+        for line in lines:
+            parts = line.split("\t")
+            if len(parts) < 8:
+                continue
+            name = parts[0].strip()
+            if self._pools_filter and name not in self._pools_filter:
+                continue
+            pools[name] = {
+                "health":   parts[1].strip(),
+                "size":     _int(parts[2]),
+                "alloc":    _int(parts[3]),
+                "free":     _int(parts[4]),
+                "capacity": _float(parts[5]),
+                "frag":     _float(parts[6]),
+                "dedup":    _float(parts[7]),
+            }
+        return pools
+
+    async def _zpool_iostat(self) -> Dict[str, Dict]:
+        """Return per-pool cumulative I/O counters from `zpool iostat`."""
+        lines = await self._run("zpool", "iostat", "-H", "-p")
+        io: Dict[str, Dict] = {}
+        for line in lines:
+            parts = line.split("\t")
+            if len(parts) < 7:
+                continue
+            name = parts[0].strip()
+            if not name:  # name is already stripped, so only the empty case can occur
+                continue
+            io[name] = {
+                "read_ops": _int(parts[3]),
+                "write_ops": _int(parts[4]),
+                "read_bw":  _int(parts[5]),
+                "write_bw": _int(parts[6]),
+            }
+        return io
+
+    async def _collect_metrics(self) -> Dict[str, Any]:
+        pools, io = await asyncio.gather(self._zpool_list(), self._zpool_iostat())
+        for name, stats in io.items():
+            if name in pools:
+                pools[name].update(stats)
+        return {"pools": pools}
+
+
+plugin = ZFSMonitorPlugin
@@ -52,11 +52,16 @@ def decode_value(val: str) -> Any:
        except Exception:
            return val[1:]  # Return as string without @
    
-    # Try numeric evaluation (original behavior)
+    # Try numeric conversion (avoid eval to prevent SyntaxWarnings on version strings)
    if val[0].isdigit() or (val[0] == '-' and len(val) > 1 and val[1].isdigit()):
        try:
-            return eval(val)
-        except Exception:
+            return int(val)
+        except ValueError:
+            pass
+        try:
+            return float(val)
+        except ValueError:
+            pass
        return val
    
    return val
@@ -144,17 +144,16 @@ def cmd_notify(args):
        url=f"{base_url}/plugins" if base_url else "",
    )

-    # Bypass min_level for explicit test sends; run async channels directly
    import asyncio
+    from .notify import _send_matrix_async, _send_sms_voipms_async, _DRIVERS
    ch_type = channel_cfg.get("type", "")
    print(f"Sending via {args.channel} ({ch_type}): {title} — {args.message}")

-    if ch_type in ("matrix", "sms_voipms"):
-        from .notify import _send_matrix_async, _send_sms_voipms_async
-        driver_async = _send_matrix_async if ch_type == "matrix" else _send_sms_voipms_async
-        ok = asyncio.run(driver_async(channel_cfg, notif))
+    if ch_type == "matrix":
+        ok = asyncio.run(_send_matrix_async(channel_cfg, notif))
+    elif ch_type == "sms_voipms":
+        ok = asyncio.run(_send_sms_voipms_async(channel_cfg, notif))
    else:
-        from .notify import _DRIVERS
        driver = _DRIVERS.get(ch_type)
        if driver is None:
            print(f"Error: unknown channel type '{ch_type}'", file=sys.stderr)
@@ -225,7 +225,7 @@ def get_watchhosts(config):
    hosts_config = config.get("hosts", {})
    if isinstance(hosts_config, dict):
        for host_name, host_attrs in hosts_config.items():
-            if isinstance(host_attrs, dict) and host_attrs.get("watch", False):
+            if isinstance(host_attrs, dict) and host_attrs.get("watch", True):
                watchhosts.append(host_name)
    return watchhosts

@@ -95,7 +95,7 @@ class Connection:
        if not Null:
            d["addr"] = self.addr
            if self.rtts[-1]:
-                d["rtt"] = "%0.1f" % self.rtts[-1]
+                d["rtt"] = "%d" % round(self.rtts[-1])
            elif self.state == Connection.UNKNOWN:
                d["rtt"] = ""
            else:
@@ -286,7 +286,7 @@ class Host:
            Host.hosts[name] = self
        self.num = num
        self.dyn = False
-        self.watched = False
+        self.watched = True
        self.upcount = 0
        self.interval = 0
        self.doesack = -1
@@ -304,6 +304,7 @@ class Host:

    def statedict(self):
        d = {}
+        d["raw_name"] = self.name
        d["name"] = self.name
        if self.dyn:
            d["name"] += "*"
@@ -1,7 +1,11 @@
 """HTTP server implementation using aiohttp and jinja2."""

 import asyncio
+import datetime
 import json
+import platform
+import socket
+import sys
 import time
 import urllib.parse
 import os
@@ -111,6 +115,7 @@ async def start(
    This function is intended to be awaited inside the main asyncio event loop.
    """
    get_now = get_now or (lambda: time.time())
+    _start_epoch = time.time()

    async def old_index(request):
        _require_auth_redirect(request)
@@ -149,6 +154,25 @@ async def start(
        lst = [h.jsons() for h in hosts]
        return web.json_response(json.loads("[" + ",".join(lst) + "]"))

+    async def api_alert_summary(request):
+        """GET /api/0/alert_summary — counts of ok/warning/critical hosts visible to caller."""
+        user, err = _require_auth(request)
+        if err:
+            return err
+        from .threshold import AlertLevel
+        critical = warning = ok = 0
+        for host in hbdclass.Host.hosts.values():
+            if not _can_operate_host(user, host):
+                continue
+            levels = {s.level for s in host.alert_states.values()}
+            if AlertLevel.CRITICAL in levels:
+                critical += 1
+            elif AlertLevel.WARNING in levels:
+                warning += 1
+            else:
+                ok += 1
+        return web.json_response({"critical": critical, "warning": warning, "ok": ok})
+
    async def api_messages(request):
        lst = data.msgs[-30:]
        return web.json_response(lst)
@@ -210,15 +234,11 @@ async def start(
            return err
        qa = request.rel_url.query
        uname = urllib.parse.unquote(qa.get("h", ""))
-        ucode = qa.get("c")
-        if not ucode or not uname:
-            return web.Response(status=400, text="need h= and c= arguments")
+        if not uname:
+            return web.Response(status=400, text="need h= argument")
        if uname != "All" and uname not in hbdclass.Host.hosts:
            return web.Response(status=400, text=f"h={uname} not found")
-        if uname != "All":
-            names = [uname]
-        else:
-            names = [n for n in hbdclass.Host.hosts]
+        names = [uname] if uname != "All" else list(hbdclass.Host.hosts)
        out = []
        for n in names:
            host = hbdclass.Host.hosts[n]
@@ -227,8 +247,7 @@ async def start(
                continue
            op_err = None
            try:
-                r = {"csum": None, "code": ucode}
-                host.cmds.append(("UPD", r))
+                host.cmds.append(("UPD", {}))
            except Exception as e:
                op_err = str(e)
            out.append(f"update started for {n}: {op_err if op_err else 'OK'}")
@@ -258,7 +277,9 @@ async def start(
            extra_scripts=extra_scripts,
            hbd_version=hbd_version,
            hosts=[
-                hbdclass.Host.hosts[h].stateinfo() for h in sorted(hbdclass.Host.hosts)
+                hbdclass.Host.hosts[h].stateinfo()
+                for h in sorted(hbdclass.Host.hosts)
+                if _can_operate_host(current_user, hbdclass.Host.hosts[h])
            ],
            messages=data.msgs[-30:],
            current_user=current_user.to_dict() if current_user else None,
@@ -510,18 +531,19 @@ async def start(
        hosts_with_plugins = []
        for hostname in sorted(hbdclass.Host.hosts.keys()):
            host = hbdclass.Host.hosts[hostname]
-            if not _can_view_host(current_user, host):
+            if not _can_operate_host(current_user, host):
                continue
            if host.plugin_data:
                hosts_with_plugins.append({
                    "name": hostname,
                    "plugins": list(host.plugin_data.keys()),
+                    "is_owner": _can_own_host(current_user, host),
                })

        tmpl = env.get_template("plugins.html")
        body = tmpl.render(
-            title="Plugin Metrics - Heartbeat",
-            header="Plugin Metrics",
+            title="Host Overview - Heartbeat",
+            header="Host Overview",
            hosts=hosts_with_plugins,
            current_user=current_user.to_dict() if current_user else None,
            active_page="plugins",
@@ -811,6 +833,48 @@ async def start(
        )
        return web.Response(text=body, content_type="text/html")

+    # -------------------------------------------------------------------------
+    # About page
+    # -------------------------------------------------------------------------
+
+    async def about_page(request):
+        """GET /about — version, runtime, and project information."""
+        current_user, _ = _require_auth_redirect(request)
+        pkg_dir = os.path.dirname(__file__)
+        templates_dir = config.get("templates_dir", os.path.join(pkg_dir, "templates"))
+        env = jinja2.Environment(loader=jinja2.FileSystemLoader(templates_dir))
+        from hbd import __version__ as hbd_version
+
+        uptime_secs = int(time.time() - _start_epoch)
+        days, rem = divmod(uptime_secs, 86400)
+        hours, rem = divmod(rem, 3600)
+        mins, secs = divmod(rem, 60)
+        if days:
+            uptime_str = f"{days}d {hours}h {mins}m"
+        elif hours:
+            uptime_str = f"{hours}h {mins}m {secs}s"
+        else:
+            uptime_str = f"{mins}m {secs}s"
+
+        start_dt = datetime.datetime.fromtimestamp(_start_epoch)
+        start_time_str = start_dt.strftime("%Y-%m-%d %H:%M:%S")
+
+        tmpl = env.get_template("about.html")
+        body = tmpl.render(
+            title="About - Heartbeat",
+            header="About",
+            hbd_version=hbd_version,
+            python_version=f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro} ({platform.python_implementation()})",
+            server_hostname=socket.gethostname(),
+            start_epoch=int(_start_epoch),
+            start_time_str=start_time_str,
+            uptime_str=uptime_str,
+            host_count=len(hbdclass.Host.hosts),
+            current_user=current_user.to_dict() if current_user else None,
+            active_page="about",
+        )
+        return web.Response(text=body, content_type="text/html")
+
    # -------------------------------------------------------------------------
    # Settings page (admin only)
    # -------------------------------------------------------------------------
@@ -826,7 +890,7 @@ async def start(
        tmpl = env.get_template("settings.html")
        body = tmpl.render(
            title="Settings - Heartbeat",
-            sections=settings_mod.get_settings_sections(config),
+            sections=settings_mod.get_settings_sections(config, threshold_checker=threshold_checker),
            current_user=current_user.to_dict() if current_user else None,
            active_page="settings",
        )
@@ -849,6 +913,7 @@ async def start(
            web.get("/api/0/users/{username}/avatar", api_user_avatar),
            # Hosts
            web.get("/api/0/hosts", api_hosts),
+            web.get("/api/0/alert_summary", api_alert_summary),
            web.get("/api/0/messages", api_messages),
            web.get("/api/0/hosts/{hostname}/plugins", api_host_plugins),
            web.get("/api/0/hosts/{hostname}/plugins/{plugin_name}", api_host_plugin_detail),
@@ -864,6 +929,7 @@ async def start(
            web.get("/live", live),
            web.get("/plugins", plugins_page),
            web.get("/alerts", alerts_page),
+            web.get("/about", about_page),
            web.get("/profile", profile_page),
            web.get("/settings", settings_page),
            web.get("/static/{path:.*}", static),
@@ -101,9 +101,10 @@ async def reload_configuration(config_obj, config_path, components):
            access = config_mod.get_host_access(new_config, hostname)
            host.apply_access(access["owner"], access["managers"], access["monitors"])

-        # Reload threshold checker
+        # Reload threshold checker and prune alerts orphaned by the new config
        if 'threshold_checker' in components:
            components['threshold_checker'].reload(new_config)
+            components['threshold_checker'].purge_stale_alerts(hbdclass)
        
        # Note: Changes to the following require restart:
        # - hb_port, hbd_port, ws_port (already bound)
@@ -210,7 +211,6 @@ async def _run_async(config, config_path=None):
        ctx = dict(
            config=config,
            hbdclass=hbdclass,
-            log=eventlog,
            msg_to_websockets=msg_to_websockets,
            msg_journal=msg_journal,
            threshold_checker=threshold_checker,
@@ -237,12 +237,15 @@ async def _run_async(config, config_path=None):
    restore_ctx = dict(
        config=config,
        hbdclass=hbdclass,
-        log=eventlog,
        msg_to_websockets=msg_to_websockets,
        threshold_checker=threshold_checker,
    )
    udp.restore_connection_timers(hbdclass, restore_ctx)

+    # Drop alert states that no longer have a matching threshold (stale after
+    # upgrade or config change between runs).
+    threshold_checker.purge_stale_alerts(hbdclass)
+
    # HTTP server (asyncio-based via aiohttp)
    try:
        http_task = asyncio.create_task(
@@ -252,6 +255,7 @@ async def _run_async(config, config_path=None):
                config=config,
                hbdclass=hbdclass,
                tcss=None,
+                threshold_checker=threshold_checker,
                verbose=config.get("verbose", False),
                get_now=lambda: time.time(),
                VER="",
@@ -15,7 +15,6 @@ their own ``notification_channels`` list.  When no users are configured the
 server runs silently (no notifications sent).
 """

-import asyncio
 import asyncio
 import logging
 import smtplib
@@ -30,13 +29,10 @@ from . import ws as ws_mod

 logger = logging.getLogger(__name__)

-logger = logging.getLogger(__name__)
-
 msg_to_websockets = ws_mod.broadcast

 # Module-level state set via setup()
 _config: dict = {}
-_loop: Optional[asyncio.AbstractEventLoop] = None

 # Tracks which channels fired a WARNING/CRITICAL per host.
 # {host_name: set of channel_names}  — used to route RECOVER to the same channels.
@@ -73,11 +69,9 @@ class Notification:
 # ---------------------------------------------------------------------------

 def setup(cfg: dict, loop: Optional[asyncio.AbstractEventLoop] = None):
-    """Initialize notifier from configuration dict and event loop."""
-    global _config, _loop
+    """Initialize notifier from configuration dict; ``loop`` is accepted but unused."""
+    global _config
    _config = dict(cfg)
-    if loop is not None:
-        _loop = loop


 def reload_config(cfg: dict):
@@ -299,17 +293,6 @@ async def _send_sms_voipms_async(channel_cfg: dict, notif: Notification) -> bool
        return False


-def _send_sms_voipms(channel_cfg: dict, notif: Notification) -> bool:
-    """Dispatch voip.ms SMS send onto the shared event loop."""
-    if _loop is None:
-        logger.warning("sms_voipms: event loop not available")
-        return False
-    future = asyncio.run_coroutine_threadsafe(_send_sms_voipms_async(channel_cfg, notif), _loop)
-    try:
-        return future.result(timeout=15)
-    except Exception as e:
-        logger.error("sms_voipms send timed out or failed: %s", e)
-        return False


 async def _send_matrix_async(channel_cfg: dict, notif: Notification) -> bool:
@@ -357,48 +340,48 @@ async def _send_matrix_async(channel_cfg: dict, notif: Notification) -> bool:
        await client.close()


-def _send_matrix(channel_cfg: dict, notif: Notification) -> bool:
-    """Dispatch matrix send onto the shared event loop."""
-    if _loop is None:
-        logger.warning("matrix: event loop not available")
-        return False
-    future = asyncio.run_coroutine_threadsafe(_send_matrix_async(channel_cfg, notif), _loop)
-    try:
-        return future.result(timeout=15)
-    except Exception as e:
-        logger.error("matrix send timed out or failed: %s", e)
-        return False
-
-
 # ---------------------------------------------------------------------------
-# Channel dispatcher
+# Channel dispatcher  (all async — sync drivers run in a thread executor)
 # ---------------------------------------------------------------------------

+# Sync drivers kept for `hbd notify` CLI usage (asyncio.run wraps them there).
 _DRIVERS = {
    "pushover": _send_pushover,
    "email": _send_email,
    "mattermost": _send_mattermost,
    "signal": _send_signal,
-    "sms_voipms": _send_sms_voipms,
-    "matrix": _send_matrix,
 }

+_TIMEOUT = 15  # seconds per channel send

-def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
+
+async def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
    """Send *notif* to a single named channel, honouring min_level."""
+    level = notif.level.upper()
+    if level != "RECOVER":
        min_level = channel_cfg.get("min_level", "WARNING").upper()
-    if _level_value(notif.level) < _level_value(min_level):
+        if _level_value(level) < _level_value(min_level):
            logger.debug(
-            "channel '%s': skipping level %s (min_level=%s)", channel_name, notif.level, min_level
+                "channel '%s': skipping level %s (min_level=%s)", channel_name, level, min_level
            )
-        return True  # not an error — filtered intentionally
+            return True  # filtered intentionally

    ch_type = channel_cfg.get("type", "")
-    driver = _DRIVERS.get(ch_type)
-    if driver is None:
+    try:
+        if ch_type == "matrix":
+            return await asyncio.wait_for(_send_matrix_async(channel_cfg, notif), timeout=_TIMEOUT)
+        if ch_type == "sms_voipms":
+            return await asyncio.wait_for(_send_sms_voipms_async(channel_cfg, notif), timeout=_TIMEOUT)
+        sync_driver = _DRIVERS.get(ch_type)
+        if sync_driver is None:
            logger.warning("unknown channel type '%s' for channel '%s'", ch_type, channel_name)
            return False
-    return driver(channel_cfg, notif)
+        return await asyncio.wait_for(
+            asyncio.to_thread(sync_driver, channel_cfg, notif), timeout=_TIMEOUT
+        )
+    except asyncio.TimeoutError:
+        logger.error("channel '%s' timed out after %ds", channel_name, _TIMEOUT)
+        return False


 # ---------------------------------------------------------------------------
@@ -412,7 +395,7 @@ def _build_url(host_name: str) -> str:
    return f"{base_url}/plugins#{host_name}"


-def send_notification(host_name: str, notif: Notification) -> dict:
+async def send_notification(host_name: str, notif: Notification) -> dict:
    """Dispatch *notif* to all managers/owner of *host_name*.

    Looks up the host's owner + managers, resolves each user's
@@ -462,16 +445,12 @@ def send_notification(host_name: str, notif: Notification) -> dict:
            if not channel_cfg:
                continue
            try:
-                ch_type = channel_cfg.get("type", "")
-                driver = _DRIVERS.get(ch_type)
-                if driver:
-                    ok = driver(channel_cfg, notif)
+                ok = await _dispatch_to_channel(channel_name, channel_cfg, notif)
                results[channel_name] = ok
                if ok:
                    logger.info("recover sent to channel '%s': %s", channel_name, notif.title)
            except Exception as e:
                logger.error("error sending recover to channel '%s': %s", channel_name, e)
-        # Clear the alerted set once recovery is delivered
        del _alerted_channels[host_name]
        return results

@@ -482,14 +461,14 @@ def send_notification(host_name: str, notif: Notification) -> dict:
            continue
        for channel_name in user.notification_channels:
            if channel_name in results:
-                continue  # already dispatched to this channel this notification
+                continue
            channel_cfg = global_channels.get(channel_name)
            if not channel_cfg:
                logger.warning("channel '%s' not defined in notification_channels", channel_name)
                results[channel_name] = False
                continue
            try:
-                ok = _dispatch_to_channel(channel_name, channel_cfg, notif)
+                ok = await _dispatch_to_channel(channel_name, channel_cfg, notif)
                results[channel_name] = ok
                if ok:
                    logger.info("notification sent to channel '%s': %s", channel_name, notif.title)
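Review note: the dispatcher change in this file runs synchronous drivers in a worker thread and bounds every send with a timeout. A minimal sketch of that pattern follows, assuming Python 3.9+ for `asyncio.to_thread`. The names `fast_driver`, `slow_driver`, and `dispatch` are hypothetical stand-ins, not the project's functions, and the timeout is shortened for illustration (the diff uses 15 seconds).

```python
import asyncio
import time

TIMEOUT = 0.05  # seconds; illustrative value only, the diff uses 15


def fast_driver(cfg: dict, message: str) -> bool:
    """Hypothetical sync driver that succeeds immediately."""
    return True


def slow_driver(cfg: dict, message: str) -> bool:
    """Hypothetical sync driver that blocks longer than TIMEOUT."""
    time.sleep(0.2)
    return True


async def dispatch(driver, cfg: dict, message: str) -> bool:
    """Run a blocking driver in a worker thread, bounded by TIMEOUT."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(driver, cfg, message), timeout=TIMEOUT
        )
    except asyncio.TimeoutError:
        # Only the wait is abandoned; the worker thread itself keeps running.
        return False


fast_result = asyncio.run(dispatch(fast_driver, {}, "hello"))
slow_result = asyncio.run(dispatch(slow_driver, {}, "hello"))
```

One consequence worth keeping in mind: a timed-out synchronous driver is not interrupted, so a slow SMTP or HTTP call may still complete in the background after the dispatcher has already reported failure.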
@@ -24,7 +24,7 @@ sensitive   bool  True when the raw value must never be shown
 # Credential field names that should always be masked.
 _SECRET_KEYS = frozenset({
    "password", "token", "user_key", "api_key", "secret",
-    "smtp_password", "smtp_user",
+    "smtp_password", "smtp_user", "api_password", "access_token",
 })

 _CHANNEL_TYPE_LABELS = {
@@ -88,7 +88,7 @@ def _sanitize_channel(name, cfg):
 # Public API
 # ---------------------------------------------------------------------------

-def get_settings_sections(config: dict) -> list:
+def get_settings_sections(config: dict, threshold_checker=None) -> list:
    """Return ordered list of setting sections for the settings page.

    Each section:
@@ -181,6 +181,41 @@ def get_settings_sections(config: dict) -> list:
            "notification_channels": attrs.get("notification_channels", []),
        })

+    # ---- Threshold configurations -----------------------------------------
+    def _tc_to_row(tc):
+        return {
+            "metric": tc.metric_path,
+            "operator": tc.operator.value,
+            "warning": tc.warning,
+            "critical": tc.critical,
+            "hysteresis": tc.hysteresis,
+            "count": tc.count,
+            "enabled": tc.enabled,
+        }
+
+    threshold_config_list = []
+    if threshold_checker is not None:
+        if threshold_checker.threshold_configs:
+            for cfg_name, cfg_metrics in sorted(threshold_checker.threshold_configs.items()):
+                # For the default config use the merged effective set;
+                # for named overrides use only the explicitly defined metrics
+                # (threshold_raw_configs) so inherited defaults are not repeated.
+                if cfg_name == "default":
+                    display_metrics = cfg_metrics
+                else:
+                    display_metrics = threshold_checker.threshold_raw_configs.get(cfg_name, cfg_metrics)
+                metrics = sorted(
+                    [_tc_to_row(tc) for tc in display_metrics.values()],
+                    key=lambda m: m["metric"],
+                )
+                threshold_config_list.append({"name": cfg_name, "metrics": metrics})
+        elif threshold_checker.thresholds:
+            metrics = sorted(
+                [_tc_to_row(tc) for tc in threshold_checker.thresholds.values()],
+                key=lambda m: m["metric"],
+            )
+            threshold_config_list.append({"name": "default", "metrics": metrics})
+
    # ---- Hosts summary ----------------------------------------------------
    hosts_list = []
    for hname, hcfg in (config.get("hosts") or {}).items():
@@ -188,7 +223,7 @@ def get_settings_sections(config: dict) -> list:
            continue
        hosts_list.append({
            "name": hname,
-            "watch": bool(hcfg.get("watch", False)),
+            "watch": bool(hcfg.get("watch", True)),
            "dyndns": bool(hcfg.get("dyndns", False)),
            "owner": hcfg.get("owner", ""),
            "managers": hcfg.get("managers", []),
@@ -312,6 +347,16 @@ def get_settings_sections(config: dict) -> list:
            "hosts": hosts_list,
            "fields": [],
        },
+        {
+            "id": "thresholds",
+            "title": "Threshold Configurations",
+            "description": "Named alert threshold sets. Each defines warning/critical levels per metric.",
+            "threshold_configs": threshold_config_list,
+            "fields": [
+                field("default_threshold_config", "Default config", "text",
+                      "Threshold config used for hosts with no explicit mapping."),
+            ],
+        },
        {
            "id": "runtime",
            "title": "Runtime",
@@ -0,0 +1,199 @@
+<!DOCTYPE html>
+<html>
+  {% include 'head.html' %}
+
+  <style>
+    html, body { overflow: visible; }
+
+    .container {
+      max-width: 700px;
+      margin: 0 auto;
+    }
+
+    h1 {
+      color: #333;
+      margin-bottom: 4px;
+      font-size: 1.5em;
+    }
+
+    .subtitle {
+      color: #666;
+      margin-bottom: 24px;
+      font-size: 0.9em;
+    }
+
+    .section {
+      background: #fff;
+      border-radius: 8px;
+      box-shadow: 0 1px 6px rgba(0,0,0,0.1);
+      padding: 20px 24px;
+      margin-bottom: 20px;
+    }
+
+    .section h2 {
+      font-size: 1em;
+      font-weight: 700;
+      color: #333;
+      margin: 0 0 16px;
+      padding-bottom: 10px;
+      border-bottom: 1px solid #eee;
+      text-transform: uppercase;
+      letter-spacing: 0.5px;
+    }
+
+    .info-row {
+      display: flex;
+      align-items: baseline;
+      padding: 8px 0;
+      border-bottom: 1px solid #f5f5f5;
+      font-size: 0.9em;
+    }
+    .info-row:last-child { border-bottom: none; }
+
+    .info-label {
+      width: 160px;
+      flex-shrink: 0;
+      color: #666;
+      font-size: 0.88em;
+    }
+
+    .info-value {
+      color: #222;
+      word-break: break-all;
+    }
+
+    .info-value a {
+      color: #0066cc;
+      text-decoration: none;
+    }
+    .info-value a:hover { text-decoration: underline; }
+
+    .version-badge {
+      display: inline-block;
+      padding: 3px 12px;
+      background: #e8f0fe;
+      color: #1a73e8;
+      border-radius: 12px;
+      font-size: 0.85em;
+      font-weight: 600;
+      font-family: monospace;
+    }
+
+    .hb-logo {
+      font-size: 2.5em;
+      font-weight: 700;
+      color: #0066cc;
+      letter-spacing: -1px;
+      margin-bottom: 6px;
+    }
+
+    .hb-tagline {
+      color: #555;
+      font-size: 0.95em;
+    }
+
+    .logo-section {
+      display: flex;
+      align-items: center;
+      gap: 20px;
+      padding: 8px 0 4px;
+    }
+
+    .logo-text { flex: 1; }
+  </style>
+
+  <body>
+    {% include 'nav.html' %}
+
+    <div class="container">
+      <h1>{{ header }}</h1>
+      <p class="subtitle">Heartbeat monitoring system</p>
+
+      <div class="section">
+        <div class="logo-section">
+          <div class="logo-text">
+            <div class="hb-logo">Heartbeat</div>
+            <div class="hb-tagline">Lightweight host monitoring over UDP</div>
+          </div>
+          <span class="version-badge">v{{ hbd_version }}</span>
+        </div>
+      </div>
+
+      <div class="section">
+        <h2>Version</h2>
+        <div class="info-row">
+          <span class="info-label">Server version</span>
+          <span class="info-value">{{ hbd_version }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Python</span>
+          <span class="info-value">{{ python_version }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">License</span>
+          <span class="info-value">MIT</span>
+        </div>
+      </div>
+
+      <div class="section">
+        <h2>Runtime</h2>
+        <div class="info-row">
+          <span class="info-label">Host</span>
+          <span class="info-value">{{ server_hostname }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Started</span>
+          <span class="info-value">{{ start_time_str }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Uptime</span>
+          <span class="info-value" id="uptime-value">{{ uptime_str }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Hosts monitored</span>
+          <span class="info-value">{{ host_count }}</span>
+        </div>
+      </div>
+
+      <div class="section">
+        <h2>Contact &amp; Source</h2>
+        <div class="info-row">
+          <span class="info-label">Author</span>
+          <span class="info-value">Andreas Wrede</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Email</span>
+          <span class="info-value"><a href="mailto:aew@wrede.ca">aew@wrede.ca</a></span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Repository</span>
+          <span class="info-value"><a href="https://git.wrede.ca/andreas/heartbeat" target="_blank" rel="noopener">git.wrede.ca/andreas/heartbeat</a></span>
+        </div>
+      </div>
+
+    </div>
+
+    <script>
+      (function() {
+        var startEpoch = {{ start_epoch }};
+        var el = document.getElementById('uptime-value');
+        if (!el) return;
+        function fmt(s) {
+          var d = Math.floor(s / 86400);
+          var h = Math.floor((s % 86400) / 3600);
+          var m = Math.floor((s % 3600) / 60);
+          var sec = s % 60;
+          if (d > 0) return d + 'd ' + h + 'h ' + m + 'm';
+          if (h > 0) return h + 'h ' + m + 'm ' + sec + 's';
+          return m + 'm ' + sec + 's';
+        }
+        function tick() {
+          var up = Math.floor(Date.now() / 1000 - startEpoch);
+          el.textContent = fmt(up);
+        }
+        tick();
+        setInterval(tick, 1000);
+      })();
+    </script>
+  </body>
+</html>
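Review note: the client-side `fmt` function in about.html picks one of three display formats depending on the size of the uptime. The sketch below mirrors that logic in Python so the branches are easy to check. The function name `format_uptime` is hypothetical, and the server-side code that builds `uptime_str` is not shown in this part of the diff, so this is not a claim about how the server formats it.

```python
def format_uptime(seconds: int) -> str:
    """Format a whole number of seconds the way the page's JS `fmt` does."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    if days > 0:
        # Seconds are dropped once the uptime exceeds a day.
        return f"{days}d {hours}h {minutes}m"
    if hours > 0:
        return f"{hours}h {minutes}m {secs}s"
    return f"{minutes}m {secs}s"


short_uptime = format_uptime(59)
hour_uptime = format_uptime(3661)
day_uptime = format_uptime(90061)
```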
@@ -3,9 +3,10 @@
  {% include 'head.html' %}

  <style>
-    body {
-      margin: 20px;
-      background: #f5f5f5;
+
+    html, body {
+      height: auto;
+      overflow-y: auto;
    }

    .container {
@@ -13,10 +14,7 @@
      margin: 0 auto;
    }

-    h1 {
-      color: #333;
-      margin-bottom: 10px;
-    }
+    h1 { color: #333; margin-bottom: 5px; margin-top: 15px; font-size: 1.5em; }

    .subtitle {
      color: #666;
@@ -41,7 +39,7 @@
      border-left: 4px solid #ddd;
    }

-    .summary-card.critical { border-left-color: #f44336; }
+    .summary-card.critical { border-left-color: #ea1e0f; }
    .summary-card.warning  { border-left-color: #ff9800; }
    .summary-card.ok       { border-left-color: #4caf50; }

@@ -51,7 +49,7 @@
      line-height: 1;
    }

-    .summary-number.critical { color: #f44336; }
+    .summary-number.critical { color: #ea1e0f; }
    .summary-number.warning  { color: #ff9800; }
    .summary-number.ok       { color: #4caf50; }

@@ -116,7 +114,7 @@
    }
    
    .alert-item.acknowledged {
-      opacity: 0.6;
+      opacity: 0.8;
      background: #f0f0f0;
    }

@@ -177,8 +175,12 @@

    .alert-hostname {
      font-weight: bold;
-      color: #333;
+      color: #0066cc;
      font-size: 1.1em;
+      text-decoration: none;
+    }
+    .alert-hostname:hover {
+      text-decoration: underline;
    }

    .alert-metric {
@@ -407,6 +409,10 @@
        } else if (alert.threshold_value !== undefined && alert.threshold_value !== null && alert.operator) {
          valueText += ` <span class="threshold-info">(threshold: ${alert.operator} ${formatValue(alert.threshold_value)})</span>`;
        }
+        if (alert.recovery_threshold !== undefined && alert.recovery_threshold !== null) {
+          const recOp = (alert.operator === '>' || alert.operator === '>=') ? '<' : '>';
+          valueText += ` <span class="threshold-info" style="color:#888">(recovers ${recOp} ${formatValue(alert.recovery_threshold)})</span>`;
+        }
        
        // Build actions section
        let actionsHtml = '';
@@ -431,7 +437,7 @@
            <div class="alert-main">
              <div class="alert-header">
                <span class="alert-level ${level}">${alert.level}</span>
-                <span class="alert-hostname">${alert.hostname}</span>
+                <a class="alert-hostname" href="/plugins/${encodeURIComponent(alert.hostname)}">${alert.hostname}</a>
              </div>
              <div class="alert-metric">${alert.metric_path}</div>
              <div class="alert-details">
@@ -6,13 +6,32 @@
    <title>{{ title }}</title>
    {% if extra_scripts %}<script src="{{ extra_scripts }}"></script>{% endif %}
    <style>
+      /* ── Reset / shared baseline ── */
+      *, *::before, *::after { box-sizing: border-box; }
+      html {
+        font-family: 'Segoe UI', system-ui, -apple-system, sans-serif;
+        font-size: 14px;
+      }
+      body {
+        margin: 0;
+        padding: 10px;
+        padding-top: 60px;
+        background: #f5f5f5;
+      }
+      h1 { font-size: 1.5em; color: #333; margin: 0 0 5px; }
+      h2 { font-size: 1.1em; color: #333; margin: 0 0 8px; }
+      p  { margin: 0; }
+
      /* Navigation bar — shared across all pages */
      .nav {
+        position: fixed;
+        top: 0;
+        left: 0;
+        right: 0;
+        z-index: 200;
        background: #fff;
-        padding: 10px 15px;
-        margin-bottom: 10px;
+        padding: 6px 12px;
        box-shadow: 0 2px 4px rgba(0,0,0,.1);
-        border-radius: 4px;
        display: flex;
        align-items: center;
        justify-content: space-between;
@@ -42,6 +61,17 @@
        transition: background 0.15s;
      }
      .nav-user:hover { background: #f0f4ff; text-decoration: none; }
+      .nav-username {
+        max-width: 0;
+        overflow: hidden;
+        white-space: nowrap;
+        opacity: 0;
+        transition: max-width 0.2s ease, opacity 0.2s ease;
+      }
+      .nav-user:hover .nav-username {
+        max-width: 160px;
+        opacity: 1;
+      }
      .nav-avatar {
        width: 28px; height: 28px;
        border-radius: 50%;
@@ -94,6 +124,164 @@
        .nav-links.nav-open { display: flex; }
        .nav-links a { margin-right: 0; padding: 6px 0; font-size: 1em; }
      }
+
+      /* Nav widgets: alert pie and Swiss railway clock */
+      .nav-pie {
+        flex-shrink: 0;
+        line-height: 0;
+        margin-left: auto;
+        padding: 4px 4px 4px 0;
+      }
+      #alert-pie { display: block; cursor: default; }
+      .nav-clock {
+        flex-shrink: 0;
+        line-height: 0;
+        padding: 4px 4px 4px 0;
+        cursor: pointer;
+      }
+      #swiss-clock { display: block; }
+
+      /* Swiss railway clock — full-page overlay */
+      #clock-overlay {
+        display: none;
+        position: fixed;
+        inset: 0;
+        z-index: 9999;
+        background: #1a1a1a;
+        align-items: center;
+        justify-content: center;
+        cursor: pointer;
+      }
+      #clock-overlay.visible { display: flex; }
+      #swiss-clock-overlay { display: block; }
    </style>
+    <script>
+    /* ── Swiss Federal Railway (SBB) clock ── */
+
+    /* Draw one frame of the clock onto any canvas element. */
+    function drawSwissClock(canvas) {
+      var SIZE = canvas.width;
+      var R = SIZE / 2;
+      var ctx = canvas.getContext('2d');
+      var now = new Date();
+      var h  = now.getHours() % 12;
+      var m  = now.getMinutes();
+      var s  = now.getSeconds();
+      var ms = now.getMilliseconds();
+
+      /* Seconds hand sweeps the dial in 58.5 s, then idles ~1.5 s at 12 (SBB behaviour) */
+      var sFrac = s + ms / 1000;
+      var sAngle = sFrac >= 58.5 ? 0 : (sFrac / 58.5) * Math.PI * 2;
+
+      ctx.clearRect(0, 0, SIZE, SIZE);
+
+      /* face */
+      ctx.beginPath();
+      ctx.arc(R, R, R - 1, 0, Math.PI * 2);
+      ctx.fillStyle = '#fff';
+      ctx.fill();
+      ctx.strokeStyle = '#333';
+      ctx.lineWidth = SIZE * 0.018;
+      ctx.stroke();
+
+      /* tick marks */
+      for (var i = 0; i < 60; i++) {
+        var a = (i / 60) * Math.PI * 2 - Math.PI / 2;
+        var isHour = (i % 5 === 0);
+        ctx.beginPath();
+        ctx.moveTo(R + Math.cos(a) * (isHour ? R * 0.72 : R * 0.88),
+                   R + Math.sin(a) * (isHour ? R * 0.72 : R * 0.88));
+        ctx.lineTo(R + Math.cos(a) * R * 0.94,
+                   R + Math.sin(a) * R * 0.94);
+        ctx.strokeStyle = '#222';
+        ctx.lineWidth = isHour ? SIZE * 0.027 : SIZE * 0.011;
+        ctx.lineCap = 'butt';
+        ctx.stroke();
+      }
+
+      /* hands */
+      function hand(angle, tip, tail, width, color) {
+        ctx.save();
+        ctx.translate(R, R);
+        ctx.rotate(angle);
+        ctx.beginPath();
+        ctx.moveTo(tail, 0);
+        ctx.lineTo(tip,  0);
+        ctx.strokeStyle = color;
+        ctx.lineWidth = width;
+        ctx.lineCap = 'square';
+        ctx.stroke();
+        ctx.restore();
+      }
+
+      hand((m + s / 60) / 60 * Math.PI * 2 - Math.PI / 2,
+           R * 0.88, -R * 0.12, SIZE * 0.027, '#222');           /* minute */
+      hand((h + m / 60) / 12 * Math.PI * 2 - Math.PI / 2,
+           R * 0.58, -R * 0.12, SIZE * 0.039, '#222');           /* hour   */
+      hand(sAngle - Math.PI / 2, R * 0.78, -R * 0.22,
+           SIZE * 0.013, '#e00');                                 /* second tail+tip */
+
+      /* round dot at tip of second hand */
+      var dotR = SIZE * 0.028;
+      ctx.save();
+      ctx.translate(R, R);
+      ctx.rotate(sAngle - Math.PI / 2);
+      ctx.beginPath();
+      ctx.arc(R * 0.78, 0, dotR, 0, Math.PI * 2);
+      ctx.fillStyle = '#e00';
+      ctx.fill();
+      ctx.restore();
+
+      /* centre cap */
+      ctx.beginPath();
+      ctx.arc(R, R, R * 0.04, 0, Math.PI * 2);
+      ctx.fillStyle = '#222';
+      ctx.fill();
+    }
+
+    /* Resize the overlay canvas to fit the viewport, keeping it square. */
+    function resizeOverlayClock() {
+      var oc = document.getElementById('swiss-clock-overlay');
+      if (!oc) return;
+      var size = Math.min(window.innerWidth, window.innerHeight) * 0.88;
+      size = Math.floor(size);
+      oc.width  = size;
+      oc.height = size;
+    }
+
+    /* Main tick — redraws both nav clock and (if visible) overlay clock. */
+    function clockTick() {
+      var nav = document.getElementById('swiss-clock');
+      if (nav) drawSwissClock(nav);
+      var overlay = document.getElementById('clock-overlay');
+      if (overlay && overlay.classList.contains('visible')) {
+        var oc = document.getElementById('swiss-clock-overlay');
+        if (oc) drawSwissClock(oc);
+      }
+      var delay = 100 - (Date.now() % 100);
+      setTimeout(clockTick, delay);
+    }
+
+    document.addEventListener('DOMContentLoaded', function() {
+      /* Start the shared tick loop */
+      clockTick();
+
+      /* Overlay toggle — clicking the nav clock opens it */
+      var navClock = document.querySelector('.nav-clock');
+      var overlay  = document.getElementById('clock-overlay');
+      if (navClock && overlay) {
+        navClock.addEventListener('click', function() {
+          resizeOverlayClock();
+          overlay.classList.add('visible');
+        });
+        overlay.addEventListener('click', function() {
+          overlay.classList.remove('visible');
+        });
+        window.addEventListener('resize', function() {
+          if (overlay.classList.contains('visible')) resizeOverlayClock();
+        });
+      }
+    });
+    </script>
    <script src="static/sorttable.js"></script>
 </head>
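Review note: the seconds-hand calculation in `drawSwissClock` compresses a full revolution into 58.5 seconds and then parks the hand at 12 for the rest of the minute. A small Python sketch of the same arithmetic is given below. The names `SWEEP_SECONDS` and `second_hand_angle` are hypothetical and exist only for this illustration.

```python
import math

SWEEP_SECONDS = 58.5  # time for one full revolution, taken from the JS above


def second_hand_angle(seconds: float) -> float:
    """Return the hand's angle in radians, measured clockwise from 12."""
    if seconds >= SWEEP_SECONDS:
        # Hand waits at the top until the next minute starts.
        return 0.0
    return seconds / SWEEP_SECONDS * 2 * math.pi


angle_at_start = second_hand_angle(0.0)
angle_at_half = second_hand_angle(29.25)
angle_while_idle = second_hand_angle(59.0)
```

The JavaScript then subtracts `Math.PI / 2` before drawing, because canvas angles are measured from the positive x-axis rather than from 12 o'clock.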
@@ -7,10 +7,6 @@
      display: flex;
      flex-direction: column;
      height: 100vh;
-      box-sizing: border-box;
-      padding: 10px;
-      margin: 0;
-      background: #f5f5f5;
      overflow: hidden;
    }

@@ -49,6 +45,7 @@
    h1 {
      color: #333;
      margin-bottom: 5px;
+      margin-top: 15px;
      font-size: 1.5em;
    }

@@ -239,6 +236,8 @@
      color: #ff9800;
      font-weight: 700;
    }
+    #ntable a.host-link { color: inherit; text-decoration: none; }
+    #ntable a.host-link:hover { text-decoration: underline; }
  </style>
  <script type="text/javascript">
    var cnt = 0;
@@ -248,11 +247,13 @@
    var HBD_VERSION = "{{ hbd_version }}";

    function hostNameHtml(data) {
+      var rawName = data.raw_name || data.name.replace(/<[^>]+>/g, '').replace('*', '').trim();
      var nameHtml = data.name;
      if (!data.hbc_version || data.hbc_version !== HBD_VERSION) {
        nameHtml += ' 🥀';
      }
-      return data.dyn ? '<b>' + nameHtml + '</b>' : nameHtml;
+      var display = data.dyn ? '<b>' + nameHtml + '</b>' : nameHtml;
+      return '<a class="host-link" href="/plugins#' + encodeURIComponent(rawName) + '">' + display + '</a>';
    }

    function setup() {
@@ -407,7 +408,7 @@
        );
        if (data.connections[i].state == "up") {
          state = '<span class="state-up">up</span>';
-          latency = Number.parseFloat(data.connections[i].rtts[0]).toFixed(2);
+          latency = String(Math.round(Number.parseFloat(data.connections[i].rtts[0])));
        } else {
          if (data.connections[i].state == "unknown") {
            state = "";
@@ -489,8 +490,10 @@
    {% include 'menu.html' %}

    <div class="container">
+      <div>
        <h1>{{ header }}</h1>
        <p class="subtitle">Real-time host monitoring and event log</p>
+      </div>
      
      <div class="table-section">
        <table id="ntable" class="sortable">
@@ -512,7 +515,7 @@
          <tbody id="ntablebody">
            {% for host in hosts %}
            <tr class="{% if host.alert_critical_unacked > 0 or host.alert_critical_acked > 0 %}row-critical{% elif host.alert_warning_unacked > 0 or host.alert_warning_acked > 0 %}row-warning{% endif %}">
-              <td data-name="{{ host.name }}">{{ host.name }}{% if not host.hbc_version or host.hbc_version != hbd_version %} 🥀{% endif %}</td>
+              <td data-name="{{ host.name }}"><a class="host-link" href="/plugins#{{ host.raw_name | urlencode }}">{{ host.name }}{% if not host.hbc_version or host.hbc_version != hbd_version %} 🥀{% endif %}</a></td>
              <td style="text-align: center; color: #ff9800; font-weight: bold;">
                {%- set warning_unacked = host.alert_warning_unacked -%}
                {%- set warning_acked = host.alert_warning_acked -%}
@@ -4,11 +4,18 @@
  </button>
  <div class="nav-links" id="nav-links">
    <a href="/live"{% if active_page == "live" %} class="active"{% endif %}>Live Dashboard</a>
-    <a href="/plugins"{% if active_page == "plugins" %} class="active"{% endif %}>Plugin Metrics</a>
+    <a href="/plugins"{% if active_page == "plugins" %} class="active"{% endif %}>Host Overview</a>
    <a href="/alerts"{% if active_page == "alerts" %} class="active"{% endif %}>Alerts</a>
    {% if current_user and current_user.admin %}
    <a href="/settings"{% if active_page == "settings" %} class="active"{% endif %}>Settings</a>
    {% endif %}
+    <a href="/about"{% if active_page == "about" %} class="active"{% endif %}>About</a>
+  </div>
+  <div class="nav-pie" title="Host alert status">
+    <canvas id="alert-pie" width="44" height="44"></canvas>
+  </div>
+  <div class="nav-clock" title="Click for full-screen clock">
+    <canvas id="swiss-clock" width="44" height="44"></canvas>
  </div>
  {% if current_user %}
  <a href="/profile" class="nav-user{% if active_page == 'profile' %} active{% endif %}" title="{{ current_user.full_name or current_user.username }}">
@@ -21,6 +28,12 @@
  </a>
  {% endif %}
 </div>
+
+<!-- Full-page clock overlay (click anywhere to dismiss) -->
+<div id="clock-overlay">
+  <canvas id="swiss-clock-overlay" width="400" height="400"></canvas>
+</div>
+
 <script>
  (function() {
    var btn = document.getElementById('nav-hamburger-btn');
@@ -32,4 +45,52 @@
      });
    }
  })();
+
+  function drawAlertPie(critical, warning, ok) {
+    var canvas = document.getElementById('alert-pie');
+    if (!canvas) return;
+    var ctx = canvas.getContext('2d');
+    var SIZE = canvas.width;
+    var R = SIZE / 2;
+    ctx.clearRect(0, 0, SIZE, SIZE);
+    var total = critical + warning + ok;
+    if (total === 0) {
+      ctx.beginPath();
+      ctx.arc(R, R, R - 1, 0, Math.PI * 2);
+      ctx.fillStyle = '#ccc';
+      ctx.fill();
+      return;
+    }
+    var slices = [
+      { value: critical, color: '#e53935' },
+      { value: warning,  color: '#ffb300' },
+      { value: ok,       color: '#43a047' }
+    ];
+    var start = -Math.PI / 2;
+    slices.forEach(function(s) {
+      if (s.value === 0) return;
+      var sweep = (s.value / total) * Math.PI * 2;
+      ctx.beginPath();
+      ctx.moveTo(R, R);
+      ctx.arc(R, R, R - 1, start, start + sweep);
+      ctx.closePath();
+      ctx.fillStyle = s.color;
+      ctx.fill();
+      start += sweep;
+    });
+  }
+
+  function updateAlertPie() {
+    fetch('/api/0/alert_summary').then(function(r) {
+      if (!r.ok) return;
+      return r.json();
+    }).then(function(d) {
+      if (d) drawAlertPie(d.critical || 0, d.warning || 0, d.ok || 0);
+    }).catch(function() {});
+  }
+
+  document.addEventListener('DOMContentLoaded', function() {
+    updateAlertPie();
+    setInterval(updateAlertPie, 30000);
+  });
 </script>
@@ -3,15 +3,7 @@
  {% include 'head.html' %}

  <style>
-    html, body {
-      overflow: visible;
-    }
-
-    body {
-      margin: 20px;
-      background: #f5f5f5;
-      font-family: 'Segoe UI', system-ui, sans-serif;
-    }
+    html, body { overflow: visible; }

    .container {
      max-width: 900px;
@@ -3,22 +3,13 @@
  {% include 'head.html' %}

  <style>
-    html, body {
-      overflow: visible;
-    }
-
-    body {
-      margin: 20px;
-      background: #f5f5f5;
-      font-family: 'Segoe UI', system-ui, sans-serif;
-    }
+    html, body { overflow: visible; }

    .container {
      max-width: 960px;
-      margin: 0 auto;
    }

-    h1 { color: #333; margin-bottom: 4px; font-size: 1.5em; }
+    h1 { color: #333; margin-bottom: 5px; margin-top: 15px; font-size: 1.5em; }
    .subtitle { color: #666; margin-bottom: 24px; font-size: 0.9em; }

    /* ---- Sidebar + content layout ---- */
@@ -32,7 +23,7 @@
      width: 180px;
      flex-shrink: 0;
      position: sticky;
-      top: 20px;
+      top: 60px;
    }

    .sidebar-nav a {
@@ -263,6 +254,17 @@
    .host-bool { text-align: center; }
    .dot-yes { color: #2e7d32; font-size: 1.1em; }
    .dot-no  { color: #ddd;    font-size: 1.1em; }
+
+    /* ---- Threshold configurations ---- */
+    .thresh-config { margin: 12px 20px 20px; }
+    .thresh-config-name {
+      font-weight: 600; font-size: 0.9em; color: #1a237e;
+      margin-bottom: 6px;
+    }
+    .mini-table .warn  { color: #e65100; font-weight: 600; }
+    .mini-table .crit  { color: #b71c1c; font-weight: 600; }
+    .mini-table .dim   { color: #aaa; }
+    .mini-table .metric-path { font-family: monospace; font-size: 0.88em; }
  </style>

  <body>
@@ -403,6 +405,49 @@
            {% endif %}
            {% endif %}

+            {# ---- Threshold configurations section ---- #}
+            {% if section.id == "thresholds" %}
+            {% if section.threshold_configs %}
+            {% for tc in section.threshold_configs %}
+            <div class="thresh-config">
+              <div class="thresh-config-name">{{ tc.name }}</div>
+              {% if tc.metrics %}
+              <div style="overflow-x: auto;">
+                <table class="mini-table">
+                  <thead>
+                    <tr>
+                      <th>Metric</th>
+                      <th>Op</th>
+                      <th>Warning</th>
+                      <th>Critical</th>
+                      <th>Hysteresis</th>
+                      <th>Count</th>
+                    </tr>
+                  </thead>
+                  <tbody>
+                    {% for m in tc.metrics %}
+                    <tr{% if not m.enabled %} style="opacity:0.45"{% endif %}>
+                      <td class="metric-path">{{ m.metric }}</td>
+                      <td>{{ m.operator or '>' }}</td>
+                      <td class="warn">{{ m.warning if m.warning is not none else '—' }}</td>
+                      <td class="crit">{{ m.critical if m.critical is not none else '—' }}</td>
+                      <td class="dim">{{ '%.0f%%' % (m.hysteresis * 100) if m.hysteresis else '—' }}</td>
+                      <td class="dim">{{ m.count }}</td>
+                    </tr>
+                    {% endfor %}
+                  </tbody>
+                </table>
+              </div>
+              {% else %}
+              <span class="val-empty">No thresholds defined.</span>
+              {% endif %}
+            </div>
+            {% endfor %}
+            {% else %}
+            <div class="field-row"><span class="val-empty">No threshold configurations defined.</span></div>
+            {% endif %}
+            {% endif %}
+
            {# ---- Hosts section ---- #}
            {% if section.id == "hosts" %}
            {% if section.hosts %}
@@ -9,10 +9,11 @@ This module provides a flexible threshold checking system that:
 - Supports multiple comparison operators
 """

+import asyncio
 import logging
 import time
 from enum import Enum
-from typing import Dict, Any, Optional, Tuple, Callable
+from typing import Dict, List, Any, Optional, Tuple, Callable
 from . import notify as notify_mod
 from .config import THRESHOLD_DEFAULTS

@@ -56,10 +57,12 @@ class AlertState:
        self.last_notification = None
        self.threshold_value = None  # The threshold value that triggered alert
        self.operator = None  # The comparison operator (>, <, >=, etc.)
+        self.hysteresis: Optional[float] = None  # Hysteresis fraction used for recovery
        self.formatted_message = None  # Formatted display message for UI
        self.acknowledged = False  # Whether alert has been acknowledged
        self.acknowledged_at = None  # Timestamp when acknowledged
        self.consecutive_count = 0  # Consecutive exceedances while still OK (for count gating)
+        self.pending_since: Optional[float] = None  # non-None while waiting out grace period before notifying
    
    def update(
        self, 
@@ -105,6 +108,7 @@ class AlertState:
            self.level = level
            self.since = now
            self.notification_count = 0
+            self.last_notification = None  # restart reminder interval on level change
            # Reset acknowledgment on state change
            if level != AlertLevel.OK:
                # Only reset if changing to a different alert level
@@ -149,6 +153,15 @@ class AlertState:
        if self.formatted_message is not None:
            result["formatted_message"] = self.formatted_message

+        # Compute and expose the recovery threshold so the UI can display it
+        if (self.hysteresis and self.threshold_value is not None
+                and self.operator is not None):
+            ha = abs(self.threshold_value * self.hysteresis)
+            if self.operator in ('>', '>='):
+                result["recovery_threshold"] = round(self.threshold_value - ha, 4)
+            elif self.operator in ('<', '<='):
+                result["recovery_threshold"] = round(self.threshold_value + ha, 4)
+
        return result
    
    def __setstate__(self, state):
@@ -156,6 +169,8 @@ class AlertState:
        self.__dict__.update(state)
        if not hasattr(self, 'consecutive_count'):
            self.consecutive_count = 0
+        for attr in ('hysteresis', 'pending_since'):
+            self.__dict__.setdefault(attr, None)  # absent in states pickled by older versions

    def acknowledge(self):
        """Acknowledge this alert to stop reminder notifications."""
@@ -326,19 +341,23 @@ class ThresholdChecker:
            renotify_interval: Seconds between repeat notifications (default: 1 hour)
            journal: Optional MessageJournal instance for logging threshold events
        """
-        # Named threshold configurations: {config_name: {metric_path: ThresholdConfig}}
+        # Named threshold configurations (pre-merged: defaults + overrides): {config_name: {metric_path: ThresholdConfig}}
        self.threshold_configs = {}

+        # Raw overrides only for each named config (no defaults baked in): {config_name: {metric_path: ThresholdConfig}}
+        self.threshold_raw_configs: Dict[str, Dict[str, ThresholdConfig]] = {}
+
        # Single threshold set for backward compatibility: {metric_path: ThresholdConfig}
        self.thresholds = {}

-        # Host to config name mapping: {host_name: config_name}
-        self.host_config_mapping = {}
+        # Host to ordered list of config names: {host_name: [config_name, ...]}
+        self.host_config_mapping: Dict[str, List[str]] = {}

        # Default config name to use when no mapping exists
        self.default_config = "default"
        
        self.renotify_interval = renotify_interval
+        self.grace_seconds: float = float(config.get("grace", 2))
        self.journal = journal

        # Parse configuration
@@ -369,8 +388,10 @@ class ThresholdChecker:
        
        # Clear old configuration
        self.threshold_configs.clear()
+        self.threshold_raw_configs.clear()
        self.thresholds.clear()
        self.host_config_mapping.clear()
+        self.grace_seconds = float(config.get("grace", 2))

        # Parse new configuration
        self._parse_config(config)
@@ -420,9 +441,10 @@ class ThresholdChecker:
                        self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=effective_defaults)

        self.threshold_configs["default"] = dict(effective_defaults)
+        self.threshold_raw_configs["default"] = {}
        logger.info("Registered 'default' threshold config with %d metrics", len(effective_defaults))

-        # Parse each named configuration, seeding it with effective_defaults first
+        # Parse each named configuration
        for config_name, config_data in threshold_configs.items():
            if config_name == "default":
                continue  # already handled above
@@ -436,33 +458,41 @@ class ThresholdChecker:
                continue

            logger.info("Parsing threshold configuration: %s", config_name)
-            self.threshold_configs[config_name] = dict(effective_defaults)

+            # Raw overrides only (used for multi-config layering)
+            raw_overrides: Dict[str, ThresholdConfig] = {}
            thresholds_config = config_data["thresholds"]
            for plugin_name, plugin_thresholds in thresholds_config.items():
-                if not isinstance(plugin_thresholds, dict):
-                    continue
+                if isinstance(plugin_thresholds, dict):
+                    self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=raw_overrides)
+            self.threshold_raw_configs[config_name] = raw_overrides

-                self._parse_plugin_thresholds(
-                    plugin_name,
-                    plugin_thresholds,
-                    target_dict=self.threshold_configs[config_name]
-                )
+            # Pre-merged version (defaults + overrides) for single-config fast path
+            self.threshold_configs[config_name] = dict(effective_defaults)
+            self.threshold_configs[config_name].update(raw_overrides)

-        # Parse host to config mapping from two possible sources
-        # 1. New format: hosts section with threshold_config attribute
+        # Parse host → config list mapping from two possible sources
+
+        def _normalise(value) -> List[str]:
+            """Accept a string or list; always return a list."""
+            if isinstance(value, list):
+                return [str(v) for v in value]
+            return [str(value)]
+
+        # 1. hosts section with threshold_config attribute (string or list)
        if "hosts" in config:
            hosts_config = config["hosts"]
            if isinstance(hosts_config, dict):
                for host_name, host_attrs in hosts_config.items():
                    if isinstance(host_attrs, dict) and "threshold_config" in host_attrs:
-                        self.host_config_mapping[host_name] = host_attrs["threshold_config"]
+                        self.host_config_mapping[host_name] = _normalise(host_attrs["threshold_config"])

-        # 2. Legacy format: host_threshold_mapping section (for backward compatibility)
+        # 2. Legacy host_threshold_mapping section (string values only)
        if "host_threshold_mapping" in config:
            legacy_mapping = config.get("host_threshold_mapping", {})
            if isinstance(legacy_mapping, dict):
-                self.host_config_mapping.update(legacy_mapping)
+                for host_name, value in legacy_mapping.items():
+                    self.host_config_mapping[host_name] = _normalise(value)
        
        # Set default config (first one alphabetically or explicitly set)
        self.default_config = config.get("default_threshold_config", "default")
@@ -528,7 +558,7 @@ class ThresholdChecker:
            critical = threshold_config.get("critical")
            operator = threshold_config.get("operator", ">")
            display = threshold_config.get("display", "(threshold: {op_symbol} {threshold_value})")
-            hysteresis = threshold_config.get("hysteresis", 0.1)  # 10% default
+            hysteresis = threshold_config.get("hysteresis", 0.02)  # 2% default
            enabled = threshold_config.get("enabled", True)
            
            if warning is None and critical is None:
@@ -631,7 +661,7 @@ class ThresholdChecker:
        warning = rtt_thresholds.get("warning")
        critical = rtt_thresholds.get("critical")
        operator = rtt_thresholds.get("operator", ">")
-        hysteresis = rtt_thresholds.get("hysteresis", 0.1)  # 10% default
+        hysteresis = rtt_thresholds.get("hysteresis", 0.02)  # 2% default
        enabled = rtt_thresholds.get("enabled", True)
        display = rtt_thresholds.get("display")
        count = rtt_thresholds.get("count", 1)
@@ -660,7 +690,10 @@ class ThresholdChecker:
        )
    
    def get_thresholds_for_host(self, host_name: str) -> Dict[str, ThresholdConfig]:
-        """Get the appropriate threshold configuration for a host.
+        """Get the effective threshold configuration for a host.
+
+        When threshold_config is a list, configs are applied left-to-right on top
+        of the default thresholds so earlier entries can be overridden by later ones.

        Args:
            host_name: Name of the host
@@ -672,23 +705,40 @@ class ThresholdChecker:
        if self.thresholds and not self.threshold_configs:
            return self.thresholds

-        # Multi-config mode: look up host-specific configuration
-        if self.threshold_configs:
-            config_name = self.host_config_mapping.get(host_name, self.default_config)
+        if not self.threshold_configs:
+            return {}

-            if config_name in self.threshold_configs:
-                return self.threshold_configs[config_name]
-            else:
+        config_names = self.host_config_mapping.get(host_name)
+
+        # No host-specific mapping → return pre-merged default
+        if not config_names:
+            return self.threshold_configs.get(self.default_config, {})
+
+        # Single config → fast path using pre-merged copy
+        if len(config_names) == 1:
+            name = config_names[0]
+            if name in self.threshold_configs:
+                return self.threshold_configs[name]
            logger.warning(
                "Threshold config '%s' not found for host '%s', using default '%s'",
-                    config_name,
-                    host_name,
-                    self.default_config
+                name, host_name, self.default_config,
            )
            return self.threshold_configs.get(self.default_config, {})

-        # No thresholds configured
-        return {}
+        # Multiple configs → start from defaults, layer raw overrides in order
+        result = dict(self.threshold_configs.get(self.default_config, {}))
+        for name in config_names:
+            if name == self.default_config:
+                continue  # defaults already the base
+            raw = self.threshold_raw_configs.get(name)
+            if raw is None:
+                logger.warning(
+                    "Threshold config '%s' not found for host '%s', skipping",
+                    name, host_name,
+                )
+            else:
+                result.update(raw)
+        return result
    
    def check_value(
        self,
@@ -756,20 +806,51 @@ class ThresholdChecker:
        elif new_level == AlertLevel.WARNING and threshold.warning is not None:
            threshold_value = threshold.warning

+        # Keep hysteresis on the state so the UI can show the recovery threshold
+        if new_level != AlertLevel.OK:
+            alert_state.hysteresis = threshold.hysteresis
+        else:
+            alert_state.hysteresis = None
+
        # Update state and check for changes
        old_level = alert_state.level
        if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
-            # For check_value, we don't have full plugin data, pass None
-            lvl, message, formatted_msg = self._trigger_notification(host_name, metric_path, old_level, new_level, value, threshold, None)
-            # Update alert state with formatted message
-            alert_state.formatted_message = formatted_msg
-            self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
+            self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, None)
            return (old_level, new_level)
        elif new_level != AlertLevel.OK:
-            # Check if we should re-notify
-            self._check_renotify(host_name, alert_state, metric_path, value, threshold, None)
+            self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)

        return None
+    def _find_threshold(
+        self, thresholds: Dict[str, "ThresholdConfig"], metric_path: str
+    ) -> Tuple[Optional["ThresholdConfig"], Optional[str]]:
+        """Return (threshold, check_name) for *metric_path*, falling back to suffix matches.
+
+        Allows generic thresholds like ``nagios_runner.status_code`` to match
+        fully-qualified paths like ``nagios_runner.check_disk_root_status_code``.
+        The exact match is always tried first; then successive leading
+        underscore-delimited segments are stripped from the field name until
+        a match is found or no segments remain.
+
+        Returns:
+            (ThresholdConfig, None) for an exact match.
+            (ThresholdConfig, "check_disk_root") for a suffix match — the second
+            element is the stripped prefix, available as ``{check_name}`` in
+            display format templates.
+            (None, None) when no threshold is found.
+        """
+        if metric_path in thresholds:
+            return thresholds[metric_path], None
+        plugin, sep, field = metric_path.partition(".")
+        if not sep:
+            return None, None
+        parts = field.split("_")
+        for i in range(1, len(parts)):
+            candidate = plugin + "." + "_".join(parts[i:])
+            if candidate in thresholds:
+                return thresholds[candidate], "_".join(parts[:i])
+        return None, None
+
    def check_plugin_data(
        self,
        host_name: str,
@@ -798,11 +879,10 @@ class ThresholdChecker:
        for metric_name, value in data.items():
            metric_path = f"{plugin_name}.{metric_name}"

-            if metric_path not in thresholds:
+            threshold, check_name = self._find_threshold(thresholds, metric_path)
+            if threshold is None:
                continue

-            threshold = thresholds[metric_path]
-            
            # Get or create alert state
            if metric_path not in alert_states:
                alert_states[metric_path] = AlertState(metric_path)
@@ -822,17 +902,15 @@ class ThresholdChecker:
            elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                threshold_value = threshold.warning

+            alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
            # Update state and check for changes
            old_level = alert_state.level
            if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                state_changes.append((metric_path, old_level, new_level, value))
-                lvl, message, formatted_msg = self._trigger_notification(host_name, metric_path, old_level, new_level, value, threshold, data)
-                # Update alert state with formatted message
-                alert_state.formatted_message = formatted_msg
-                self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
+                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data, check_name=check_name, metric_name=metric_name)
            elif new_level != AlertLevel.OK:
-                # Check if we should re-notify
-                self._check_renotify(host_name, alert_state, metric_path, value, threshold, data)
+                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data, check_name=check_name, metric_name=metric_name)

        # Check nested metrics (e.g., partition data in disk_monitor)
        self._check_nested_metrics(
@@ -892,23 +970,14 @@ class ThresholdChecker:
                    elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                        threshold_value = threshold.warning

+                    alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
                    old_level = alert_state.level
                    if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                        state_changes.append((metric_path, old_level, new_level, value))
-                        lvl, message, formatted_msg = self._trigger_notification(
-                            host_name,
-                            metric_path,
-                            old_level,
-                            new_level,
-                            value,
-                            threshold,
-                            data  # Pass full plugin data for format string
-                        )
-                        # Update alert state with formatted message
-                        alert_state.formatted_message = formatted_msg
-                        self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
+                        self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
                    elif new_level != AlertLevel.OK:
-                        self._check_renotify(host_name, alert_state, metric_path, value, threshold, data)
+                        self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
    
    def _trigger_notification(
        self,
@@ -919,6 +988,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ):
        """Trigger a notification for an alert state change.
        
@@ -947,7 +1018,7 @@ class ThresholdChecker:

        # Format message
        if new_level == AlertLevel.OK:
-            lvl = "RECOVERED"
+            lvl = "RECOVER"
            message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
        elif new_level == AlertLevel.WARNING:
            lvl = "WARNING"
@@ -957,7 +1028,9 @@ class ThresholdChecker:
                    value=display_value,
                    threshold_value=threshold_value,
                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
+                    plugin_data=plugin_data,
+                    check_name=check_name,
+                    metric_name=metric_name,
                )
                message = f"{metric_path} = {display_value} {threshold_info}"
            else:
@@ -970,7 +1043,9 @@ class ThresholdChecker:
                    value=display_value,
                    threshold_value=threshold_value,
                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
+                    plugin_data=plugin_data,
+                    check_name=check_name,
+                    metric_name=metric_name,
                )
                message = f"{metric_path} = {display_value} {threshold_info}"
            else:
@@ -987,7 +1062,9 @@ class ThresholdChecker:
                value=display_value,
                threshold_value=threshold_value,
                op_symbol=op_symbol,
-                plugin_data=plugin_data
+                plugin_data=plugin_data,
+                check_name=check_name,
+                metric_name=metric_name,
            )

        return lvl, message, formatted_threshold_msg
@@ -1003,23 +1080,23 @@ class ThresholdChecker:
        value: Any,
    ):
        """Send notification and log to journal/eventlog."""
-        try:
-            notify_mod.send_notification(
+        from . import hbdclass
+        host = hbdclass.Host.hosts.get(host_name)
+        if host is not None and not host.watched:
+            eventlog(host_name, lvl, message, service="threshold")
+            return
+        asyncio.get_event_loop().create_task(notify_mod.send_notification(
            host_name,
            notify_mod.Notification(
                title=f"[{lvl}] {host_name}",
                body=message,
                level=lvl,
            ),
-            )
-            logger.info("Notification sent: %s", message)
-        except Exception as e:
-            logger.error("Failed to send notification: %s", e)
+        ))
        
        # Log to journal
        if self.journal is not None:
            try:
-                import asyncio
                loop = asyncio.get_event_loop()
                loop.create_task(self.journal.log_threshold_event(
                    host_name=host_name,
@@ -1040,15 +1117,21 @@ class ThresholdChecker:
        threshold_value: float,
        op_symbol: str,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> str:
        """Format the display string using available data.

-        Args:
-            display_format: Format string from threshold config
-            value: Current metric value
-            threshold_value: Threshold value that was exceeded
-            op_symbol: Comparison operator symbol
-            plugin_data: Optional dictionary of plugin data fields
+        Available template variables:
+            {value}           - current metric value
+            {threshold_value} - threshold that was exceeded
+            {op_symbol}       - comparison operator (>, <, >=, <=, ==, !=)
+            {check_name}      - prefix stripped for generic threshold match
+                                (e.g. "check_disk_root" when metric
+                                "check_disk_root_status_code" matched generic
+                                threshold "status_code")
+            {metric_name}     - field name within the plugin data dict
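+            Any key from plugin_data is also available. On generic matches, {output} and {status} alias <check_name>_output / <check_name>_status unless already set.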

        Returns:
            Formatted display string
@@ -1060,10 +1143,29 @@ class ThresholdChecker:
            'op_symbol': op_symbol,
        }

+        # Add generic-match context variables when available
+        if check_name is not None:
+            format_context['check_name'] = check_name
+        if metric_name is not None:
+            format_context['metric_name'] = metric_name
+
        # Add all plugin data fields if available
        if plugin_data:
            format_context.update(plugin_data)

+        # For nagios_runner generic matches, expose the matched check's output
+        # and status as short aliases {output} and {status} so display templates
+        # don't need to use the full {check_disk_root_output} form.
+        if check_name and plugin_data:
+            if 'output' not in format_context:
+                output = plugin_data.get(f"{check_name}_output")
+                if output is not None:
+                    format_context['output'] = output
+            if 'status' not in format_context:
+                status = plugin_data.get(f"{check_name}_status")
+                if status is not None:
+                    format_context['status'] = status
+
        try:
            # Format the display string
            return display_format.format(**format_context)
@@ -1083,6 +1185,90 @@ class ThresholdChecker:
            )
            return f"(threshold: {op_symbol} {threshold_value})"
    
+    def _apply_grace(
+        self,
+        host_name: str,
+        alert_state: AlertState,
+        metric_path: str,
+        old_level: AlertLevel,
+        new_level: AlertLevel,
+        value: Any,
+        threshold: ThresholdConfig,
+        plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
+    ) -> None:
+        """Handle a state-change transition with grace-period logic.
+
+        Transitioning INTO alert (worsening): defers the notification for grace_seconds.
+        De-escalation within alert states (e.g. CRITICAL→WARNING): no new notification;
+          the metric is still alerting so no RECOVER is sent.
+        Transitioning TO OK:
+          - Still in grace window (pending_since set): suppresses both the alert
+            and the recovery — the spike never warranted a page.
+          - Past grace: fires the RECOVER notification normally.
+        """
+        lvl, message, formatted_msg = self._trigger_notification(
+            host_name, metric_path, old_level, new_level, value, threshold, plugin_data,
+            check_name=check_name, metric_name=metric_name,
+        )
+        alert_state.formatted_message = formatted_msg
+
+        if new_level == AlertLevel.OK:
+            if alert_state.pending_since is not None:
+                logger.info(
+                    "Alert suppressed (recovered within %.0fs grace): %s on %s",
+                    self.grace_seconds, metric_path, host_name,
+                )
+                alert_state.pending_since = None
+            else:
+                self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
+        elif new_level.value > old_level.value:
+            # Worsening (OK→WARNING, OK→CRITICAL, WARNING→CRITICAL): schedule notification.
+            alert_state.pending_since = time.time()
+            logger.debug(
+                "Alert deferred (%.0fs grace): %s on %s = %s",
+                self.grace_seconds, metric_path, host_name, value,
+            )
+        else:
+            # De-escalation within alert states (e.g. CRITICAL→WARNING): metric is still
+            # alerting but did not recover, so no new notification.
+            logger.debug(
+                "De-escalation %s→%s for %s on %s, no notification",
+                old_level.name, new_level.name, metric_path, host_name,
+            )
+
+    def _check_pending_or_renotify(
+        self,
+        host_name: str,
+        alert_state: AlertState,
+        metric_path: str,
+        value: Any,
+        threshold: ThresholdConfig,
+        plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
+    ) -> None:
+        """Called when alert level is unchanged and non-OK.
+
+        If a deferred notification is pending and grace_seconds have elapsed,
+        fires it now. Otherwise falls through to normal reminder logic.
+        """
+        if alert_state.pending_since is not None:
+            if time.time() - alert_state.pending_since >= self.grace_seconds:
+                lvl, message, formatted_msg = self._trigger_notification(
+                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data,
+                    check_name=check_name, metric_name=metric_name,
+                )
+                alert_state.formatted_message = formatted_msg
+                self._send_notification(
+                    host_name, lvl, message, metric_path, AlertLevel.OK, alert_state.level, value
+                )
+                alert_state.pending_since = None
+            # else: still within grace window, do nothing
+        else:
+            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data, check_name=check_name, metric_name=metric_name)
+
    def _check_renotify(
        self,
        host_name: str,
@@ -1091,6 +1277,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ):
        """Check if we should send a repeat notification.
        
@@ -1137,26 +1325,48 @@ class ThresholdChecker:
                    value=value,
                    threshold_value=threshold_value,
                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
+                    plugin_data=plugin_data,
+                    check_name=check_name,
+                    metric_name=metric_name,
                )
                message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
            else:
                message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
            
-            try:
-                notify_mod.send_notification(
+            from . import hbdclass
+            host = hbdclass.Host.hosts.get(host_name)
+            if host is None or host.watched:
+                asyncio.get_event_loop().create_task(notify_mod.send_notification(
                    host_name,
                    notify_mod.Notification(
                        title=f"[REMINDER/{alert_state.level.name}] {host_name}",
                        body=message,
                        level=alert_state.level.name,
                    ),
-                )
+                ))
+                logger.info("Re-notification sent: %s", message)
            alert_state.last_notification = now
            alert_state.notification_count += 1
-                logger.info("Re-notification sent: %s", message)
-            except Exception as e:
-                logger.error("Failed to send re-notification: %s", e)
+    
+    def purge_stale_alerts(self, hbdclass) -> None:
+        """Remove alert states that have no matching threshold configuration.
+
+        Called after startup (pickle restore) and after each config reload so
+        that alerts orphaned by configuration changes do not linger forever.
+        Alerts whose metric_path is not present in the current threshold config
+        for that host are silently dropped.
+        """
+        for hostname, host in hbdclass.Host.hosts.items():
+            if not host.alert_states:
+                continue
+            configured = self.get_thresholds_for_host(hostname)
+            stale = [mp for mp in host.alert_states if self._find_threshold(configured, mp)[0] is None]
+            for mp in stale:
+                logger.info(
+                    "Purging stale alert state for %s / %s (no threshold configured)",
+                    hostname, mp,
+                )
+                del host.alert_states[mp]

    def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
        """
@@ -171,6 +171,24 @@ def dicttos(ID, d):
 DROPOVERDUE = 7 * 24 * 3600  # seconds before an overdue host becomes UNKNOWN


+def _set_connectivity_alert(host, afam, level_name):
+    """Update (or clear) a connectivity alert_state entry for a host/address-family.
+
+    level_name is "CRITICAL", "WARNING", or "OK".  "OK" removes the entry so
+    that recovered hosts don't clutter the Alerts Dashboard.
+    """
+    from .threshold import AlertState, AlertLevel
+    metric_path = f"connectivity.{afam}"
+    level = getattr(AlertLevel, level_name, AlertLevel.OK)
+    if level == AlertLevel.OK:
+        host.alert_states.pop(metric_path, None)
+        return
+    if metric_path not in host.alert_states:
+        host.alert_states[metric_path] = AlertState(metric_path)
+    state = host.alert_states[metric_path]
+    state.update(level, level_name)
+
+
 def _make_timer_callbacks(uname, host, ctx):
    """Return (on_overdue, on_unknown) async callbacks for connection timer logic.

@@ -182,6 +200,7 @@ def _make_timer_callbacks(uname, host, ctx):

    async def on_unknown(connection):
        connection.newstate(connection.__class__.UNKNOWN, connection.lastbeat)
+        # Keep connectivity alert active when host transitions to unknown
        if msg_to_websockets:
            msg_to_websockets("host", host.stateinfo())

@@ -192,10 +211,13 @@ def _make_timer_callbacks(uname, host, ctx):
        connection.newstate(connection.__class__.OVERDUE, now, cfg.get("grace", 2))
        msg = f"{connection.afam} overdue"
        eventlog(uname, "CRITICAL", msg)
-        notify_mod.send_notification(
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
                uname,
                notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
-        )
+            ))
+        # Track in alert_states so the Alerts Dashboard shows this
+        _set_connectivity_alert(host, connection.afam, "CRITICAL")
        if threshold_checker:
            threshold_checker.check_value(
                host_name=uname,
@@ -294,7 +316,6 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
    
    cfg = ctx.get("config", {})
    hbdcls = ctx.get("hbdclass")
-    log = ctx.get("log")
    msg_to_websockets = ctx.get("msg_to_websockets")
    DEBUG = ctx.get("DEBUG", 0)
    verbose = ctx.get("verbose", False)
@@ -387,10 +408,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):

    if res:
        eventlog(uname, "WARNING", res)
-        notify_mod.send_notification(
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
                uname,
                notify_mod.Notification(title=f"[WARNING] {uname}", body=res, level="WARNING"),
-        )
+            ))

    interval = int(msg.get("interval", 0) or 0)
    shutdown = msg.get("shutdown", 0)
@@ -400,28 +422,36 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):

    if boot:
        eventlog(uname, "INFO", "booted")
-        notify_mod.send_notification(
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
                uname,
                notify_mod.Notification(title=f"[INFO] {uname}", body=f"{host.name} booted", level="INFO"),
-        )
+            ))
    if message:
        eventlog(uname, "INFO", "msg: %s" % message, service=service)

    if conn.getstate() != hbdcls.Connection.UP:
        lasts = conn.state
        d = conn.newstate(hbdcls.Connection.UP, now)
+        # Clear connectivity alert now that the host is back up
+        _set_connectivity_alert(host, conn.afam, "OK")
        # Don't log/notify RECOVER for a brand-new host seen for the first time —
        # it was never down, it just hasn't been seen before.
        if not newh:
            if d == 0 or lasts == "unknown":
                m = "%s is up" % (conn.afam)
+            elif d < 4:
+                # Transient blip (likely client restart) — skip log and notification
+                m = None
            else:
                m = "%s back after being %s for %s" % (conn.afam, lasts, dur(d))
+            if m:
                eventlog(uname, "RECOVER", m)
-            notify_mod.send_notification(
+                if host.watched:
+                    asyncio.create_task(notify_mod.send_notification(
                        uname,
                        notify_mod.Notification(title=f"[RECOVER] {uname}", body=m, level="RECOVER"),
-            )
+                    ))

    if boot or newh:
        host.upcount = host.doesack
@@ -431,11 +461,13 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
    if shutdown:
        m = "%s shutdown" % conn.afam
        eventlog(uname, "INFO", m)
-        notify_mod.send_notification(
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
                uname,
                notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
-        )
+            ))
        conn.newstate(hbdcls.Connection.DOWN, now)
+        _set_connectivity_alert(host, conn.afam, "CRITICAL")

    if interval > 0:
        host.interval = interval
@@ -467,12 +499,10 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
        op, rmsg = host.cmds[0]
        if op == "CMD":
            del host.cmds[0]
-            if log:
-                log(uname, "command sent")
+            eventlog(uname, "INFO", "command sent")
        elif op == "UPD":
            del host.cmds[0]
-            if log:
-                log(uname, "update initiated")
+            eventlog(uname, "INFO", "update initiated")
        opkt = dicttos(op, rmsg)
        try:
            transport.sendto(opkt, addr)
@@ -13,7 +13,8 @@ from . import data

 logger = logging.getLogger(__name__)

-_connections: set = set()
+# Map of WebSocket → User object (or None when auth is disabled)
+_connections: dict = {}
 _loop: Optional[asyncio.AbstractEventLoop] = None
 _get_hosts: Optional[Callable[[], Iterable]] = None
 _verbose: bool = False
@@ -34,22 +35,52 @@ def setup(
    _verbose = verbose


+def _user_can_see_host(user, host_name: str) -> bool:
+    """Return True if *user* may see updates for *host_name* (manager or higher)."""
+    from . import hbdclass, users as users_mod
+    if user is None or not users_mod.users_enabled():
+        return True
+    if user.admin:
+        return True
+    host = hbdclass.Host.hosts.get(host_name)
+    if host is None:
+        return False
+    return host.is_manager(user.username)
+
+
+def _get_token(request) -> str:
+    """Extract session token from request (mirrors logic in http.py)."""
+    auth = request.headers.get("Authorization", "")
+    if auth.startswith("Bearer "):
+        return auth[7:].strip()
+    token = request.headers.get("X-Auth-Token", "")
+    if token:
+        return token
+    return request.cookies.get("hbd_session", "")
+
+
 async def handler(request):
    """aiohttp WebSocket upgrade handler — register as GET /ws."""
    from aiohttp import web
+    from . import users as users_mod

    ws = web.WebSocketResponse()
    await ws.prepare(request)

-    _connections.add(ws)
+    token = _get_token(request)
+    user = users_mod.get_session_user(token) if token else None
+
+    _connections[ws] = user
    remote = request.remote
    logger.info("WebSocket connected from %s", remote)

    try:
-        # Send current host state to the new client
+        # Send current host state, filtered to hosts this user may see
        if _get_hosts:
            try:
                for h in list(_get_hosts()):
+                    host_name = h.get("raw_name") or h.get("name", "")
+                    if _user_can_see_host(user, host_name):
                        await ws.send_str(json.dumps({"type": "host", "data": h}))
            except Exception as e:
                logger.error("Error sending initial hosts: %s", e)
@@ -74,7 +105,7 @@ async def handler(request):
    except Exception as e:
        logger.exception("WebSocket handler error from %s: %s", remote, e)
    finally:
-        _connections.discard(ws)
+        _connections.pop(ws, None)
        logger.info("WebSocket disconnected from %s", remote)

    return ws
@@ -83,25 +114,37 @@ async def handler(request):
 def broadcast(typ: str, payload) -> bool:
    """Thread-safe broadcast to all connected WebSocket clients.

+    For host and plugin updates, only sends to clients whose user has
+    manager-or-higher access to that host.  Other message types are
+    broadcast to all clients.
+
    Can be called from any thread; schedules sends on the event loop.
    Returns False if the loop is not running yet.
    """
    if not _loop:
        return False
+
+    # Determine the host name for access-filtered message types
+    host_name: Optional[str] = None
+    if typ in ("host", "plugin"):
+        host_name = payload.get("raw_name") or payload.get("host") or payload.get("name")
+
    jmsg = json.dumps({"type": typ, "data": payload})

    async def _send_all():
        dead = set()
-        for ws in list(_connections):
+        for ws, user in list(_connections.items()):
            try:
-                if not ws.closed:
-                    await ws.send_str(jmsg)
-                else:
+                if ws.closed:
                    dead.add(ws)
+                    continue
+                if host_name is not None and not _user_can_see_host(user, host_name):
+                    continue
+                await ws.send_str(jmsg)
            except Exception:
                dead.add(ws)
        for ws in dead:
-            _connections.discard(ws)
+            _connections.pop(ws, None)

    asyncio.run_coroutine_threadsafe(_send_all(), _loop)
    return True
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "hbd"
-version = "5.1.1"
+version = "5.1.21"
 description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
 readme = "README.md"
 requires-python = ">=3.11"
@@ -34,6 +34,9 @@ server = [
  "matrix-nio>=0.24",
 ]

+# Minimal client — hbc_mini only, no external dependencies
+mini = []
+
 # Install both client and server
 all = [
  "hbd[client,server]",
@@ -54,6 +57,9 @@ dev = [
 hbd = "hbd.server.cli:main"
 hbc = "hbd.client.main:main"

+[tool.setuptools]
+script-files = ["scripts/hb_install.sh", "scripts/hbc_mini.py"]
+
 [tool.setuptools.packages.find]
 where = ["."]
 include = ["hbd*"]
@@ -4,12 +4,14 @@ set -e
 uv version --bump patch 
 VER=$(uv  version  --short)
 sed -i".bak"  "s/__version__ = \"[0-9.]*\"\(.*\)$/__version__ = \"$VER\"\1/" hbd/__init__.py
+sed -i".bak"  "s/__version__ = \"[0-9.]*\"\(.*\)$/__version__ = \"$VER\"\1/" scripts/hbc_mini.py

 # commit pyproject.toml
-git commit -m "version $VER" pyproject.toml hbd/__init__.py
+git commit -m "version $VER" pyproject.toml hbd/__init__.py scripts/hbc_mini.py
 git push 
 # tag version
 git tag -a v$VER -m "Version $VER"
 git push --tags

 rm hbd/__init__.py.bak
+rm scripts/hbc_mini.py.bak
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+# Helper script to install the heartbeat tools. By default it installs only
+# the heartbeat client, hbc. Pass 'server' to install the server, hbd, or
+# 'mini' to fetch the standalone hbc_mini script without a virtual
+# environment. Client and server are installed from the wheel into a Python
+# virtual environment in ~/venvs/hbd and symlinked into ~/bin or
+# ~/.local/bin, whichever is on PATH. An existing virtual environment is
+# reused, and existing hbd/hbc symlinks are removed before new ones are
+# created.
+
+set -e
+what=$1
+on_ha=0
+where=""
+venv=""
+[ "$2" = "HA" ] && on_ha=1
+[ -z "$what" ] && what="client"
+
+if [ -d /homeassistant ]; then  # if running from HA command line
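+    echo "HA, running \"docker exec homeassistant /config/bin/hb_install.sh $what HA\""
+    # Pass "$what" so "HA" is always $2. Test the command directly: under
+    # `set -e`, a separate `rc=$?` check would never be reached on failure.
+    if ! docker exec homeassistant /config/bin/hb_install.sh "$what" HA; then
+        echo "Failed to install heartbeat in HA, please check the logs for more details"
+        exit 1
+    fi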
+    exit 0
+fi
+
+if { [ "$on_ha" -eq 1 ] || [ -r /.dockerenv ]; } && [ -d /config/bin ]; then
+    # Installing under docker on Home Assistant OS, using /config/bin for executables and /config/venvs for virtual environments 
+    echo "Home Assistant OS detected, installing under docker"
+    where="/config/bin"
+    venv="/config/venvs"
+else
+    if [ ! -d $HOME/.local/bin ] && [ ! -d $HOME/bin ]; then
+        echo "Neither $HOME/.local/bin nor $HOME/bin exists, please create one of them and add it to your PATH"
+        exit 1
+    fi
+    for where in $HOME/bin $HOME/.local/bin notset ; do
+        if echo ":$PATH:" | grep -q ":$where:" ; then
+            break
+        fi
+    done
+    if [ "$where" = "notset" ]; then
+        echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
+        exit 1
+    fi
+    if [ "$what" = "mini" ]; then
+        venv=""
+    else
+        venv="$HOME/venvs"
+    fi
+fi
+echo "Installing $what to $where"
+if [ ! -z "$venv" ]; then
+    echo "Using virtual environment at $venv/hbd"
+fi
+
+if [ "$venv" != "" ] && [ ! -d  $venv/hbd ]; then
+    arg=""
+    have_pip=$(python3 -c "import pip" >/dev/null 2>&1 && echo "Installed" || echo "Not Installed")
+    if [ "$have_pip" = "Not Installed" ]; then
+        # some systems do not have pip installed by default, so we need to fetch get-pip.py and install pip
+        echo "pip is not installed, fetching get-pip.py and installing pip"
+        arg="--without-pip"
+    fi
+    mkdir -p $venv
+    have_venv=$(python3 -c "import venv" >/dev/null 2>&1 && echo "Installed" || echo "Not Installed")
+    if [ "$have_venv" = "Not Installed" ]; then
+        if [ "$have_pip" = "Not Installed" ]; then
+            echo "python has no venv, and no pip to install virtualenv, cannot continue"
+            exit 1
+        fi
+        echo "python venv module not found, installing virtualenv"
+        python3 -m pip install --user virtualenv
+        python3 -m virtualenv $venv/hbd --system-site-packages $arg
+    else
+        python3 -m venv $venv/hbd --system-site-packages $arg
+    fi
+    . $venv/hbd/bin/activate
+    if [ -n "$arg" ]; then  
+        curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py
+    fi
+    deactivate
+fi
+
+if [ ! -z "$venv" ]; then
+    . $venv/hbd/bin/activate
+fi
+if [ "$what" = "mini" ]; then
+    curl -s -o $where/hbc_mini https://git.wrede.ca/andreas/heartbeat/raw/branch/master/scripts/hbc_mini.py
+    chmod +x $where/hbc_mini
+else
+    python3 -m pip install --upgrade --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple "hbd[$what]"
+fi
+
+if [ ! -z "$venv" ]; then
+    echo "linking executables to $where"
+    if [ "$what" = "server" ]; then
+        rm -f $where/hbd
+        ln -sf $(which hbd) $where/hbd
+    elif [ "$what" = "client" ]; then
+        rm -f $where/hbc
+        ln -sf $(which hbc) $where/hbc
+    fi
+    rm -f $where/hb_install.sh
+    ln -sf $(which hb_install.sh) $where/hb_install.sh
+fi
+echo "Installation complete. To upgrade, run the following:"
+echo "    $where/hb_install.sh $what"
+echo "To install on another machine, download the install script from"
+echo "    https://git.wrede.ca/andreas/heartbeat/raw/branch/master/scripts/hb_install.sh"
+echo "and then run: sh hb_install.sh [mini|client]"
@@ -1,58 +0,0 @@
-#!/bin/sh
-
-# install the heartbeat client, hbc. The server is installed when the arg 'server' is passed 
-# install the heartbeat client, hbc. The server is installed when the arg 'server' is passed 
-# to the script. The script will install the heartbeat tools in a python 
-# virtual environment in ~/venvs/hbd. The hbd and hbc commands will be
-# installed from the wheel and symlinked to ~/bin/hbd and ~/bin/hbc,
-# respectively. If the virtual environment already exists, it will be
-# reused. The script will also remove any existing symlinks for hbd and hbc
-# in ~/bin before creating new ones.
-
-
-# hbd/hbc from wheel and create symlinks for hbd and hbc in ~/bin
-
-set -e
-what=$1
-
-if [ -d /homeassistant ]; then
-    echo "cannot install in HA, run \"docker exec -it homeassistant $0 $@\""
-    exit 1
-fi
-if [ -d /config ]; then
-    echo "Installing on HA"
-    where="/config/bin"
-    venv="/config/venvs"
-else
-    if [ ! -d ~/.local/bin ] && [ ! -d ~/bin ]; then
-        echo "No suitable bin directory found in PATH, please add either ~/.local/bin or ~/bin to your PATH"
-        exit 1
-    fi
-    for where in ~/bin ~/.local/bin; do
-        if echo ":$PATH:" | grep -q ":$where:" ; then
-            break
-        fi
-    done
-    venv="~/venvs"
-fi
-python3 -m pip --version > /dev/null 2>&1 || { echo "pip is not installed, please install pip for python3"; exit 1; }
-
-if [ "$what" = "server" ]; then
-    echo "Installing heartbeat server (hbd)"
-else
-    what="client"
-    echo "Installing heartbeat client (hbc)"
-fi
-if [ ! -d  $venv/hbd ]; then
-    mkdir -p $venv
-    python3 -m venv $venv/hbd --system-site-packages
-fi
-. $venv/hbd/bin/activate
-pip install --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]
-if [ "$what" = "server" ]; then
-    rm -f ~$where/hbd
-    ln -sf $(which hbd) $where/hbd
-else
-    rm -f $where/hbc
-    ln -sf $(which hbc) $where/hbc
-fi
@@ -0,0 +1,99 @@
+import asyncio
+import logging
+import os
+import stat
+
+from hbd.client.plugins.nagios_runner import (
+    NagiosRunnerPlugin,
+    NAGIOS_OK,
+    NAGIOS_WARNING,
+    NAGIOS_CRITICAL,
+    NAGIOS_UNKNOWN,
+)
+
+
+def test_no_commands_sets_skip_reason():
+    plugin = NagiosRunnerPlugin(config={"commands": []})
+    result = asyncio.run(plugin.initialize())
+    assert result is False
+    assert plugin.skip_reason is not None
+    assert "nagios_runner.commands" in plugin.skip_reason
+
+
+def test_stderr_used_when_stdout_empty(tmp_path):
+    script = tmp_path / "check_err.sh"
+    script.write_text("#!/bin/sh\necho 'error from stderr' >&2\nexit 2\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "error from stderr" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_CRITICAL
+
+
+def test_stderr_appended_when_both_present(tmp_path):
+    script = tmp_path / "check_both.sh"
+    script.write_text("#!/bin/sh\necho 'OK - all good'\necho 'extra detail' >&2\nexit 0\n")
+    script.chmod(script.stat().st_mode | stat.S_IEXEC)
+
+    config = {"commands": [{"name": "t", "command": str(script)}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert "OK - all good" in data["t_output"]
+    assert "extra detail" in data["t_output"]
+    assert data["t_status_code"] == NAGIOS_OK
+
+
+def test_negative_returncode_maps_to_unknown():
+    # kill -9 $$ kills the shell itself; asyncio sees returncode -9
+    config = {"commands": [{"name": "t", "command": "kill -9 $$"}], "timeout": 5}
+    plugin = NagiosRunnerPlugin(config=config)
+    asyncio.run(plugin.initialize())
+    data = asyncio.run(plugin._collect_metrics())
+
+    assert data["t_status_code"] == NAGIOS_UNKNOWN
+    assert "signal" in data["t_output"].lower()
+
+
+def test_absolute_path_not_found_warns(caplog):
+    fake_cmd = "/nonexistent_hbc_test_path/check_something"
+    config = {"commands": [{"name": "t", "command": fake_cmd}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not found" in r.message for r in caplog.records)
+
+
+def test_absolute_path_not_executable_warns(caplog, tmp_path):
+    non_exec = tmp_path / "check_test"
+    non_exec.write_text("#!/bin/sh\necho OK\n")
+    non_exec.chmod(0o644)  # readable but not executable
+
+    config = {"commands": [{"name": "t", "command": str(non_exec)}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert any("not executable" in r.message for r in caplog.records)
+
+
+def test_relative_path_not_checked(caplog):
+    # Relative paths (resolved via PATH) must not generate warnings
+    config = {"commands": [{"name": "t", "command": "echo OK"}]}
+    plugin = NagiosRunnerPlugin(config=config)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.nagios_runner"):
+        asyncio.run(plugin.initialize())
+
+    assert not any(
+        "not found" in r.message or "not executable" in r.message
+        for r in caplog.records
+    )
@@ -0,0 +1,83 @@
+import asyncio
+import logging
+import textwrap
+
+from hbd.client.plugin import PluginLoader, PluginRegistry
+
+
+def test_plugin_skip_reason_defaults_none(tmp_path):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class MinimalPlugin(MonitorPlugin):
+            name = "minimal"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return True
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "minimal.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+    asyncio.run(loader.load_from_directory(tmp_path))
+    plugin = registry.get("minimal")
+    assert plugin is not None
+    assert plugin.skip_reason is None
+
+
+def test_loader_logs_info_when_skip_reason_set(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class SkippablePlugin(MonitorPlugin):
+            name = "skippable"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                self.skip_reason = "not configured in yaml"
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "skippable.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.INFO, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("skipped: not configured in yaml" in r.message for r in caplog.records)
+    assert not any("failed initialization" in r.message for r in caplog.records)
+
+
+def test_loader_logs_warning_when_no_skip_reason(tmp_path, caplog):
+    plugin_code = textwrap.dedent("""
+        from hbd.client.plugin import MonitorPlugin
+
+        class FailPlugin(MonitorPlugin):
+            name = "fail"
+            version = "1.0.0"
+            interval = 60
+
+            async def initialize(self):
+                return False
+
+            async def _collect_metrics(self):
+                return {}
+    """)
+    (tmp_path / "fail_plugin.py").write_text(plugin_code)
+    registry = PluginRegistry()
+    loader = PluginLoader(registry)
+
+    with caplog.at_level(logging.WARNING, logger="plugin.loader"):
+        count = asyncio.run(loader.load_from_directory(tmp_path))
+
+    assert count == 0
+    assert any("failed initialization" in r.message for r in caplog.records)