version 5.2.3

hbc/hbc_mini: log name and version at startup; ui: bump alert-metric font size
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:15:11 -04:00 · 2026-05-07 10:15:03 -04:00 · 2026-05-07 06:26:27 -04:00 · 2026-05-07 06:12:15 -04:00 · 2026-05-06 11:57:43 -04:00 · 2026-05-06 11:54:09 -04:00
37 changed files with 4054 additions and 1736 deletions
@@ -0,0 +1,4 @@
+1. Don't assume. Don't hide confusion. Surface tradeoffs.
+2. Minimum code that solves the problem. Nothing speculative.
+3. Touch only what you must. Clean up only your own mess.
+4. Define success criteria. Loop until verified.
@@ -27,6 +27,7 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
  - Configurable retention and backup management
 - **Plugin system for extensible monitoring** ✅
  - Collect system metrics (CPU, memory, disk, network)
+  - Monitor ZFS pool health, capacity, and I/O via `zpool(8)`
  - Execute existing Nagios monitoring plugins
  - Create custom plugins with simple Python classes
 - **Threshold alerting system** ✅
@@ -34,6 +35,8 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
  - Hysteresis to prevent alert flapping
  - Automatic notifications on state changes
  - Re-notification for ongoing alerts
+- **Per-host watch flag** — set `watch: false` on any host to silence all notifications for that host without removing its configuration ✅
+- **Role-filtered dashboards** — Live Dashboard and Host Overview show only hosts where the logged-in user is owner or manager (admins see all) ✅
 - Modular codebase suitable for unit testing and CI ✅

 ---
@@ -55,21 +58,26 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
 ### Built-in Plugins

 - `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
- `memory_monitor`: Monitors RAM and swap usage, available memory
+- `cpu_monitor`: Monitors CPU usage, load average, frequency, process counts, and uptime
+- `memory_monitor`: Monitors RAM and swap usage, available memory (ZFS ARC-aware)
 - `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
 - `network_monitor`: Monitors network interface statistics, bandwidth, and connections
+- `ping_monitor`: Measures round-trip latency to configured hosts
 - `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
 - `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
+- `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`

 ### Nagios Integration

 The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:

- Executes plugins via subprocess with timeout protection
+- Executes plugins asynchronously (non-blocking) with timeout protection
+- Captures both stdout and stderr; if stdout is empty, stderr is used as the status message
+- Handles signal-killed processes (negative exit code → UNKNOWN status)
+- Validates absolute command paths at startup and warns on missing or non-executable files
 - Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
 - Extracts performance data with thresholds
- Reports aggregated status across all configured checks
+- Reports per-check status, exit code, and output; no aggregate rollup field
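
The behaviours listed above can be sketched roughly as follows. This is a minimal illustration, not the actual `nagios_runner` code: the names `interpret_result` and `run_check` are hypothetical, and treating exit codes outside 0-3 as UNKNOWN is an assumption based on the usual Nagios plugin convention.

```python
import asyncio
import shlex

# Conventional Nagios exit-code mapping.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def interpret_result(returncode, stdout, stderr):
    """Map a finished process to (status, message)."""
    # Negative return codes mean the process was killed by a signal;
    # anything outside 0-3 is treated as UNKNOWN here (assumption).
    status = STATUS_NAMES.get(returncode, "UNKNOWN")
    # Prefer stdout; fall back to stderr when stdout is empty.
    message = stdout.strip() or stderr.strip()
    return status, message


async def run_check(command, timeout=10.0):
    """Run one check without blocking the event loop (hypothetical helper)."""
    proc = await asyncio.create_subprocess_exec(
        *shlex.split(command),
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Timeout protection: kill the child and report UNKNOWN.
        proc.kill()
        await proc.wait()
        return "UNKNOWN", "check timed out"
    return interpret_result(proc.returncode, out.decode(), err.decode())
```

A caller would typically `await run_check(...)` once per configured command and store each result separately, which matches the per-check reporting described above.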

 See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for complete integration guide including configuration examples and custom plugin development.

@@ -147,9 +155,11 @@ Heartbeat includes a sophisticated threshold alerting system that monitors plugi
 - **Multi-level alerts**: WARNING and CRITICAL severity levels
 - **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
 - **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
- **Smart notifications**: Alerts only on state changes, not every check
+- **Smart notifications**: Alerts only on state changes, not every check; de-escalations (e.g. CRITICAL → WARNING) do not generate a notification
 - **Re-notifications**: Periodic reminders for ongoing alerts
+- **Short-duration suppression**: Recovery notifications are suppressed for down events under 4 seconds (avoids noise from transient blips)
 - **Journal integration**: All threshold events logged for audit trail
+- **`ping_monitor` thresholds**: Latency and packet-loss thresholds use the same format as all other plugin metrics
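
The state-change rule for smart notifications can be illustrated with a small sketch. The `Level` enum and `should_notify` function are hypothetical names chosen for this example, not the project's API, and the sketch deliberately leaves out re-notifications and the short-duration suppression, which would be layered on top.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical severity scale, ordered from least to most severe."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def should_notify(previous: Level, current: Level) -> bool:
    """Decide whether a state transition produces a notification."""
    if current == previous:
        return False  # no state change, so nothing to send
    if current == Level.OK:
        return True   # recovery
    # Escalations notify; de-escalations such as CRITICAL -> WARNING do not.
    return current > previous
```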

 ### Configuration

@@ -172,7 +182,8 @@ thresholds:
      warning: 80.0      # Warn when CPU > 80%
      critical: 90.0     # Critical when CPU > 90%
      operator: ">"
-      hysteresis: 0.1    # 10% hysteresis to prevent flapping
+      hysteresis: 0.02   # 2% hysteresis to prevent flapping
+      display: "(threshold: {op_symbol} {threshold_value}%)"  # optional
  
  memory_monitor:
    percent:
@@ -214,7 +225,7 @@ thresholds:
    <hostname>:
      warning: <milliseconds>   # Warn when RTT > this value
      critical: <milliseconds>  # Critical when RTT > this value
-      hysteresis: 0.1           # Optional: 10% hysteresis (default)
+      hysteresis: 0.02          # Optional: 2% hysteresis (default)
 ```

 **Example alerts:**
@@ -265,7 +276,94 @@ All plugin metrics can be thresholded:
 - **Memory**: percent, available_mb, swap_percent
 - **Disk**: Per-partition percent, free_gb, free_mb
 - **Network**: errors_total, dropped packets, connection counts
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
+- **Nagios**: Any field emitted by `nagios_runner` (`<name>_status_code`, `<name>_status`, `<name>_output`, performance data fields)
+
+### Display Format Templates
+
+Each threshold entry accepts an optional `display` field — a Python format string shown in notifications and on the Alerts dashboard:
+
+```yaml
+nagios_runner:
+  status_code:
+    warning: 1
+    critical: 2
+    operator: ">="
+    display: "{check_name}: exit {value} (expected < {threshold_value})"
+```
+
+Available variables:
+
+| Variable | Description |
+|---|---|
+| `{value}` | Current metric value |
+| `{threshold_value}` | Threshold that was crossed |
+| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …); `"nagios"` for the nagios operator |
+| `{check_name}` | Prefix stripped by generic matching (see below) |
+| `{metric_name}` | Full field name within the plugin data |
+| `{output}` | For `nagios_runner` generic matches: the matched check's status text (alias for `{check_name}_output`) |
+| `{status}` | For `nagios_runner` generic matches: the matched check's status name — OK/WARNING/CRITICAL/UNKNOWN (alias for `{check_name}_status`) |
+| any plugin field | Any other field present in the plugin's data |
+
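Because `display` is a Python format string, expanding it amounts to a `str.format`-style substitution over the variables in the table. The sketch below is only an illustration: `render_display` is a hypothetical helper, and leaving unknown placeholders untouched (rather than raising an error) is an assumption, not documented behaviour.

```python
class _KeepUnknown(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def render_display(template: str, fields: dict) -> str:
    """Expand a display template using str.format-style placeholders."""
    return template.format_map(_KeepUnknown(fields))


fields = {
    "check_name": "check_disk_root",
    "value": 2,
    "threshold_value": 2,
    "op_symbol": ">=",
}
print(render_display("{check_name}: exit {value} (expected < {threshold_value})", fields))
# prints: check_disk_root: exit 2 (expected < 2)
```
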
+### Generic Threshold Matching
+
+When a metric name has no exact threshold entry, the server progressively strips leading underscore-separated segments and retries the lookup. This lets a single generic entry cover an entire family of metrics.
+
+The classic use case is `nagios_runner`, which names each metric after the command that produced it:
+
+```
+nagios_runner.check_disk_root_status_code    → no exact match
+nagios_runner.disk_root_status_code          → no match
+nagios_runner.root_status_code               → no match
+nagios_runner.status_code                    → matched ✓
+```
+
+Configure the generic threshold once using the `nagios` operator, which maps exit codes directly to alert severity without requiring numeric warning/critical values:
+
+```yaml
+nagios_runner:
+  status_code:
+    operator: "nagios"   # 0=OK  1=WARNING  2=CRITICAL  3=UNKNOWN
+    display: "{check_name}: {output}"
+```
+
+The stripped prefix (`check_disk_root` in the example above) is available as `{check_name}` in the display template, so you can identify which check triggered the alert without writing a separate threshold entry per command.
+
+Exact matches always take priority. A generic entry only applies when no specific one is defined.
+
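The lookup described in this section can be sketched in a few lines. This is an illustration under stated assumptions rather than the server's actual code: `find_threshold` is a hypothetical name, and `entries` stands for the threshold entries configured for one plugin.

```python
from typing import Any, Dict, Optional, Tuple


def find_threshold(metric: str, entries: Dict[str, Any]) -> Tuple[Optional[Any], str]:
    """Return (entry, stripped_prefix); the exact name is tried first."""
    segments = metric.split("_")
    for i in range(len(segments)):
        # i == 0 is the exact name; each later step drops one leading segment.
        candidate = "_".join(segments[i:])
        if candidate in entries:
            return entries[candidate], "_".join(segments[:i])
    return None, ""


entries = {"status_code": {"operator": "nagios"}}
entry, check_name = find_threshold("check_disk_root_status_code", entries)
print(entry, check_name)
# prints: {'operator': 'nagios'} check_disk_root
```

The second return value corresponds to what the display template exposes as `{check_name}`; it is empty when the exact entry matched.
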
+### Per-Host Threshold Profiles
+
+Named threshold configurations let different hosts use different limits. A host's `threshold_config` can be a single name or a **list** — lists are applied left-to-right so profiles compose without duplication:
+
+```yaml
+threshold_configs:
+  default:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}
+
+  tight_cpu:           # override CPU limits only
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  db_disk:             # add a database partition check
+    thresholds:
+      disk_monitor:
+        partitions:
+          /var/lib/postgresql:
+            percent: {warning: 75, critical: 88}
+
+hosts:
+  web-01:
+    threshold_config: default          # single profile
+
+  db-01:
+    threshold_config: [tight_cpu, db_disk]   # layered: CPU override + extra disk check
+```
+
+Each named config's overrides are applied in order on top of the defaults. Metrics not mentioned in a profile are inherited unchanged.

 See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.

@@ -328,9 +426,10 @@ Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST AP
 ### Web Dashboards

 - **Login** (`/login`): Browser login form (shown automatically when auth is configured)
- **Live View** (`/live`): Real-time host connectivity, latency, and messages
- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering
+- **Live View** (`/live`): Real-time host connectivity, latency, and messages; hostnames link directly to the Host Overview page
+- **Host Overview** (`/plugins/<host>`): Per-host plugin metrics with ZFS pool visualization; filtered to hosts where the logged-in user is owner or manager (admins see all)
+- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering; alert count pie chart shown in the navigation bar
+- **Settings** (`/settings`): Server configuration, user management, and threshold configuration viewer

 ### API Endpoints

@@ -377,7 +476,7 @@ This project now declares its dependencies in `pyproject.toml`. Instead
 of the old `requirements.txt` flow, install the package into a virtualenv
 using `pip`:

-See `scripts/install.sh` for a way to install.
+See `scripts/hb_install.sh` for an installation script.

 Run the daemon (example):

@@ -408,6 +507,9 @@ hbc --boot your-server.example.com

 # Verbose output
 hbc -v your-server.example.com
+
+# Send 'boot' and 'shutdown' messages on start and exit
+hbc -b your-server.example.com
 ```

 You can also run it via the module entrypoint:
@@ -416,12 +518,11 @@ You can also run it via the module entrypoint:
 python -m hbd.client.main your-server.example.com
 ```

-Client configuration can also be specified in YAML:
+Client configuration can also be specified in YAML (`~/.hbc.yaml`):

 ```yaml
-server: hbd.example.com
-port: 50003
-interval: 30
+hb_port: 50003        # Server port (default: 50003)
+interval: 30          # Heartbeat interval in seconds
 plugins:
  cpu_monitor:
    interval: 300      # Check every 5 minutes (default)
@@ -435,12 +536,84 @@ plugins:
  nagios_runner:
    interval: 300      # Check every 5 minutes (default)
    commands:
-      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
-      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
+      - name: check_load
+        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
+      - name: check_disk
+        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
 ```

+The server hostname is always passed as a positional command-line argument; there is no `server:` config key.
+
 All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.

+**Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
+
+**Daemon logging:** When running with `-d`, `hbc` routes all log output to syslog (`LOG_DAEMON` facility) after daemonizing. Without `-d`, logs go to stderr as usual.
+
+### hbc_mini — single-file client (no external dependencies)
+
+`scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly — no virtualenv, no `pip install`.
+
+```bash
+# Basic usage
+python3 hbc_mini.py your-server.example.com
+
+# Run as daemon
+python3 hbc_mini.py -d your-server.example.com
+
+# Send a boot message
+python3 hbc_mini.py -b your-server.example.com
+
+# Send a one-off message
+python3 hbc_mini.py -m "maintenance starting" your-server.example.com
+```
+
+**Config:** `~/.hbc.json` (same keys as `~/.hbc.yaml`, JSON format). Example:
+
+```json
+{
+  "hb_port": 50003,
+  "interval": 30,
+  "plugins": {
+    "ping_monitor": {
+      "interval": 60,
+      "hosts": ["8.8.8.8", "192.168.1.1"]
+    },
+    "nagios_runner": {
+      "interval": 300,
+      "commands": [
+        {"name": "check_load", "command": "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6"}
+      ]
+    }
+  }
+}
+```
+
+**Plugin availability:**
+
+| Plugin | Platform | Data source |
+|---|---|---|
+| `os_info` | all | `platform` stdlib |
+| `ping_monitor` | all | `ping` subprocess |
+| `nagios_runner` | all (not Windows) | subprocess |
+| `cpu_monitor` | Linux | `/proc/stat` |
+| `memory_monitor` | Linux | `/proc/meminfo` |
+| `disk_monitor` | Linux, macOS, BSD | `df -P` subprocess |
+| `network_monitor` | Linux | `/proc/net/dev` |
+
+**What is not available compared to the full `hbc`:**
+
+- No YAML config (use JSON instead)
+- No `filesystem_info` plugin
+- No `zfs_monitor` plugin (requires `zpool(8)` and the full plugin loader)
+- `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
+- Plugins cannot be loaded from external `.py` files — all plugins are compiled in
+- No IPv6 early-fail protection — connections that fail to open at startup are silently skipped rather than retried
+
+Everything else — heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog — is identical to the full client.
+
+---
+
 ## 🐞 Debugging in VS Code

 This repository includes a ready-to-use `.vscode/launch.json` with configurations to run or attach the VS Code debugger to `hbd`.
@@ -1,234 +0,0 @@
-# HBD/HBC Separation Refactoring
-
-## Overview
-
-The heartbeat monitoring system has been refactored into a modular package structure with separate client and server components. This allows users to install only what they need and provides clear separation of concerns.
-
-## New Package Structure
-
-```
-hbd/
-├── __init__.py                 # Main package (minimal)
-├── client/                     # HBC - System monitoring client
-│   ├── __init__.py
-│   ├── main.py                # Entry point (was hbc.py)
-│   ├── config.py              # Client-specific configuration
-│   ├── plugin.py              # Plugin framework
-│   ├── threshold.py           # Threshold checking
-│   └── plugins/               # Monitoring plugins
-│       ├── cpu_monitor.py
-│       ├── disk_monitor.py
-│       ├── memory_monitor.py
-│       ├── network_monitor.py
-│       ├── filesystem_info.py
-│       ├── os_info.py
-│       └── nagios_runner.py
-├── server/                     # HBD - Heartbeat daemon/server
-│   ├── __init__.py
-│   ├── main.py                # Server runtime (was server.py)
-│   ├── cli.py                 # Command-line interface
-│   ├── config.py              # Server-specific configuration
-│   ├── http.py                # HTTP/REST API
-│   ├── ws.py                  # WebSocket server
-│   ├── udp.py                 # UDP heartbeat listener
-│   ├── dns.py                 # DNS update functionality
-│   ├── notify.py              # Notification handlers
-│   ├── monitor.py             # Host monitoring
-│   ├── hbdclass.py            # Host class definitions
-│   ├── journal.py             # Message journaling
-│   ├── templates/             # Jinja2 web templates
-│   └── static/                # Web UI assets
-└── common/                     # Shared utilities
-    ├── __init__.py
-    ├── proto.py               # Protocol encoding/decoding
-    └── utils.py               # Common utilities
-
-## Configuration Files
-
-### Client Configuration (hbd/client/config.py)
-
-Client-specific defaults:
- `hb_port`: Port where hbd servers listen (default: 50003)
- `interval`: Heartbeat interval in seconds (default: 10)
- `plugins`: Per-plugin configuration
- `thresholds`: Threshold configuration for monitoring
-
-### Server Configuration (hbd/server/config.py)
-
-Server-specific defaults:
- `hb_port`: Port to listen for heartbeats (default: 50003)
- `hbd_port`: HTTP API port (default: 50004)
- `ws_port`: WebSocket port (default: 50005)
- `logfile`: Log file path
- `pushsrv`, `pushover_token`, etc.: Notification settings
- `watchhosts`, `dyndnshosts`: Host monitoring
- `smtpserver`, etc.: Email settings
- `journal_*`: Message journaling settings
-
-## Installation Options
-
-### Install Core Only (minimal, PyYAML only)
-```bash
-pip install hbd
-```
-
-### Install Client Only (for monitoring)
-```bash
-pip install hbd[client]
-# Installs: PyYAML, psutil
-```
-
-### Install Server Only (for daemon)
-```bash
-pip install hbd[server]
-# Installs: PyYAML, websockets, mattermostdriver, aiohttp, Jinja2
-```
-
-### Install Everything
-```bash
-pip install hbd[all]
-# Installs all dependencies for both client and server
-```
-
-### Development Installation
-```bash
-pip install -e ".[dev]"
-# Includes all dependencies plus testing/linting tools
-```
-
-## Command-Line Interfaces
-
-### HBC (Client)
-```bash
-hbc [options] host1 [host2 ...]
-
-# Entry point: hbd.client.main:main
-# Location: hbd/client/main.py
-```
-
-### HBD (Server)
-```bash
-hbd [options]
-
-# Entry point: hbd.server.cli:main
-# Location: hbd/server/cli.py → hbd/server/main.py
-```
-
-## Import Changes
-
-### Client Code
-```python
-# Old imports
-from .config import load_config
-from .proto import dicttos, stodict
-from .plugin import PluginRegistry
-
-# New imports
-from .config import load_config          # Still in client/
-from ..common.proto import dicttos       # Moved to common/
-from .plugin import PluginRegistry       # Still in client/
-```
-
-### Server Code
-```python
-# Old imports
-from .config import load_config
-from .proto import stodict
-from .threshold import AlertLevel
-
-# New imports
-from .config import load_config          # Server-specific config
-from ..common.proto import stodict       # Moved to common/
-from ..client.threshold import AlertLevel # Client module
-```
-
-### Plugin Code
-```python
-# Old import
-from hbd.plugin import MonitorPlugin
-
-# New import
-from hbd.client.plugin import MonitorPlugin
-```
-
-## Benefits
-
-1. **Modular Installation**: Install only what you need
-   - Client-only systems don't need web server dependencies
-   - Server-only systems don't need psutil
-   
-2. **Clearer Architecture**: Explicit separation of concerns
-   - Client: System monitoring and data collection
-   - Server: Heartbeat reception, web UI, notifications
-   - Common: Shared protocol and utilities
-
-3. **Independent Evolution**: Client and server can evolve separately
-   - Different release cycles possible
-   - Clear API boundaries via common/
-
-4. **Smaller Footprint**: Reduced dependency installation
-   - Client: ~1 dependency (psutil)
-   - Server: ~4 dependencies (websockets, aiohttp, Jinja2, mattermostdriver)
-
-## Migration Guide
-
-### For Existing Installations
-
-1. **Reinstall the package**:
-   ```bash
-   pip install -e ".[all]"  # For development
-   # or
-   pip install hbd[all]     # For production
-   ```
-
-2. **Configuration files remain unchanged**:
-   - Both client and server read from `~/.hb.yaml`
-   - All existing config keys are supported in both configs
-   - Server has additional keys (journal, websocket, email, etc.)
-   - Client has minimal keys (interval, plugins, thresholds)
-
-3. **Commands remain the same**:
-   - `hbc` command works identically
-   - `hbd` command works identically
-
-### For New Deployments
-
-1. **Client-only system** (monitoring host):
-   ```bash
-   pip install hbd[client]
-   hbc server1.example.com server2.example.com
-   ```
-
-2. **Server-only system** (monitoring daemon):
-   ```bash
-   pip install hbd[server]
-   hbd -c /etc/hbd.yaml -f
-   ```
-
-3. **Combined system** (dev/test):
-   ```bash
-   pip install hbd[all]
-   ```
-
-## Testing
-
-All imports and entry points have been tested and validated:
- ✅ Package imports work correctly
- ✅ `hbc` command entry point functional
- ✅ `hbd` command entry point functional
- ✅ Optional dependencies properly configured
- ✅ All internal imports updated
-
-## Files Archived
-
-The following files were renamed to avoid conflicts:
- `hbd/config.py` → `hbd/config.py.old` (split into client/server configs)
- `hbd/hbc_old.py` → `hbd/hbc_old.py.bak` (backup file)
-
-## Next Steps
-
-1. Test client functionality with a monitoring host
-2. Test server functionality with web UI and notifications
-3. Update documentation (README.md) with new structure
-4. Consider publishing to PyPI with new structure
-5. Update any deployment scripts/Dockerfiles to use optional dependencies
@@ -104,11 +104,6 @@ The `nagios_runner` plugin collects:
 - `{name}_{metric}_min` - Minimum value (if present)
 - `{name}_{metric}_max` - Maximum value (if present)

-**Overall:**
- `overall_status` - Worst status from all commands
- `overall_status_code` - Worst status code
- `plugin_count` - Number of Nagios plugins executed
-
 ## Configuration Options

 ```yaml
@@ -814,42 +814,39 @@ Planned features:

 ## Multi-Threshold Configuration

-**New in version 2.0**: Support for multiple named threshold configurations with per-host mapping.
+Support for multiple named threshold configurations with per-host mapping and composable layering.

 ### Overview

 The multi-threshold feature allows you to:
- Define multiple sets of threshold configurations
- Map different hosts to different threshold sets
+- Define multiple named threshold configurations
+- Assign one or more configurations to each host
+- Compose configurations by layering — each named config's overrides are applied in order on top of the defaults
 - Use different sensitivity levels for different environments
- Maintain a default configuration for unmapped hosts

 ### Configuration Structure

+Named configurations are defined under `threshold_configs`. Each host selects which ones to use via `threshold_config` in the `hosts` section (a string for a single config, or a list to layer multiple):
+
 ```yaml
-# Optional: Set the default configuration name (defaults to "default")
+# Optional: set the default configuration name (defaults to "default")
 default_threshold_config: "default"

-# Define multiple named threshold configurations
 threshold_configs:
-  # Configuration name 1
  default:
    thresholds:
-      # Standard threshold definitions
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0
-  
-  # Configuration name 2
+
  high_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 60.0
          critical: 75.0
-  
-  # Configuration name 3
+
  low_sensitivity:
    thresholds:
      cpu_monitor:
@@ -857,14 +854,77 @@ threshold_configs:
          warning: 90.0
          critical: 95.0

-# Map specific hosts to specific configurations
-host_threshold_mapping:
-  prod-web-01: high_sensitivity
-  prod-web-02: high_sensitivity
-  dev-server-01: low_sensitivity
-  # Unmapped hosts use default_threshold_config
+hosts:
+  prod-web-01:
+    threshold_config: high_sensitivity   # single config
+
+  dev-server-01:
+    threshold_config: low_sensitivity
+
+  # Hosts with no threshold_config use default_threshold_config
 ```

+### Composable Configurations (list form)
+
+`threshold_config` can be a list. Configs are applied **left to right**: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.
+
+```yaml
+threshold_configs:
+  default:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}
+      disk_monitor:
+        partitions:
+          /:
+            percent: {warning: 80, critical: 90}
+
+  # Tighter CPU limits for busy servers
+  high_cpu_load:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  # Tighter disk limits for data-heavy servers
+  busy_disk:
+    thresholds:
+      disk_monitor:
+        partitions:
+          /:
+            percent: {warning: 70, critical: 85}
+
+hosts:
+  # Gets default thresholds only
+  web-01:
+    threshold_config: default
+
+  # Gets tighter CPU limits, default memory and disk
+  build-server:
+    threshold_config: high_cpu_load
+
+  # Layers both: tighter CPU AND tighter disk, default memory
+  db-01:
+    threshold_config: [high_cpu_load, busy_disk]
+
+  # Three layers: busy_disk overrides high_cpu_load if they conflict
+  storage-01:
+    threshold_config: [default, high_cpu_load, busy_disk]
+```
+
+**How layering works:**
+
+Starting from the `default` thresholds:
+
+| Layer | Applied config | Effect |
+|-------|---------------|--------|
+| Base  | `default` | all default thresholds |
+| +1    | `high_cpu_load` | cpu_percent overridden to 60/75 |
+| +2    | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |
+
+Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.
+
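One way to picture the layering is as a recursive dictionary merge applied left to right. The sketch below is a hedged illustration, not the project's implementation: `deep_merge` and `resolve_thresholds` are hypothetical names, and merging at the level of individual keys (so a config could override only `warning`) is an assumption that goes slightly beyond the per-metric wording above. A string value is treated as a one-element list.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override layered on top, recursing into dicts."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def resolve_thresholds(threshold_configs: dict, names, default_name: str = "default") -> dict:
    """Layer the named configs, in order, on top of the default thresholds."""
    if isinstance(names, str):
        names = [names]  # a single name behaves like a one-element list
    merged = copy.deepcopy(threshold_configs.get(default_name, {}).get("thresholds", {}))
    for name in names:
        merged = deep_merge(merged, threshold_configs[name]["thresholds"])
    return merged


configs = {
    "default": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
        "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
    }},
    "high_cpu_load": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}},
    }},
    "busy_disk": {"thresholds": {
        "disk_monitor": {"partitions": {"/": {"percent": {"warning": 70, "critical": 85}}}},
    }},
}

effective = resolve_thresholds(configs, ["high_cpu_load", "busy_disk"])
```

Here `effective` carries the tighter CPU and disk limits while memory keeps the default values, mirroring the `db-01` example.
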
 ### Use Cases

 #### 1. Environment-Based Thresholds
@@ -879,7 +939,7 @@ threshold_configs:
        cpu_percent:
          warning: 70.0   # Alert earlier in production
          critical: 85.0
-  
+
  development:
    thresholds:
      cpu_monitor:
@@ -887,11 +947,15 @@ threshold_configs:
          warning: 90.0   # More relaxed for dev
          critical: 98.0

-host_threshold_mapping:
-  prod-web-01: production
-  prod-web-02: production
-  dev-web-01: development
-  dev-web-02: development
+hosts:
+  prod-web-01:
+    threshold_config: production
+  prod-web-02:
+    threshold_config: production
+  dev-web-01:
+    threshold_config: development
+  dev-web-02:
+    threshold_config: development
 ```

 #### 2. Server Role-Based Thresholds
@@ -906,7 +970,7 @@ threshold_configs:
        cpu_percent:
          warning: 80.0
          critical: 90.0
-  
+
  database:
    thresholds:
      cpu_monitor:
@@ -914,7 +978,7 @@ threshold_configs:
          warning: 70.0
          critical: 85.0
      memory_monitor:
-        percent:
+        memory_percent:
          warning: 90.0   # Databases can use high memory
          critical: 97.0
      disk_monitor:
@@ -923,21 +987,27 @@ threshold_configs:
            percent:
              warning: 75.0
              critical: 85.0
-  
+
  cache:
    thresholds:
      memory_monitor:
-        percent:
+        memory_percent:
          warning: 95.0   # Redis/Memcached can use very high memory
          critical: 99.0

-host_threshold_mapping:
-  web-01: webserver
-  web-02: webserver
-  db-01: database
-  db-02: database
-  redis-01: cache
-  memcached-01: cache
+hosts:
+  web-01:
+    threshold_config: webserver
+  web-02:
+    threshold_config: webserver
+  db-01:
+    threshold_config: database
+  db-02:
+    threshold_config: database
+  redis-01:
+    threshold_config: cache
+  memcached-01:
+    threshold_config: cache
 ```

 #### 3. Sensitivity Levels
@@ -952,10 +1022,10 @@ threshold_configs:
        partitions:
          /:
            percent:
-              warning: 70.0    # Very sensitive
+              warning: 70.0
              critical: 80.0
              hysteresis: 0.15
-  
+
  standard:
    thresholds:
      disk_monitor:
@@ -965,7 +1035,7 @@ threshold_configs:
              warning: 85.0
              critical: 95.0
              hysteresis: 0.1
-  
+
  relaxed:
    thresholds:
      disk_monitor:
@@ -976,52 +1046,91 @@ threshold_configs:
              critical: 98.0
              hysteresis: 0.05

-host_threshold_mapping:
-  payment-gateway: critical
-  auth-server: critical
-  web-01: standard
-  web-02: standard
-  test-server: relaxed
+hosts:
+  payment-gateway:
+    threshold_config: critical
+  auth-server:
+    threshold_config: critical
+  web-01:
+    threshold_config: standard
+  web-02:
+    threshold_config: standard
+  test-server:
+    threshold_config: relaxed
 ```

-### Backward Compatibility
+#### 4. Composable Profiles

-The legacy single threshold configuration is fully supported:
+Build host-specific thresholds by combining small, focused configs:

 ```yaml
-# Old format - still works
-thresholds:
-  cpu_monitor:
-    cpu_percent:
-      warning: 80.0
-      critical: 90.0
-```
-
-This is equivalent to:
-
-```yaml
-# New format
 threshold_configs:
+  # Baseline — everything at default levels
  default:
    thresholds:
      cpu_monitor:
-        cpu_percent:
-          warning: 80.0
-          critical: 90.0
-```
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}

+  # Overlay: tighter CPU only
+  tight_cpu:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  # Overlay: tighter memory only
+  tight_memory:
+    thresholds:
+      memory_monitor:
+        memory_percent: {warning: 70, critical: 85}
+
+  # Overlay: extra disk partition for database servers
+  db_disk:
+    thresholds:
+      disk_monitor:
+        partitions:
+          /var/lib/postgresql:
+            percent: {warning: 75, critical: 88}
+
+hosts:
+  # Plain web server
+  web-01:
+    threshold_config: default
+
+  # Build server: tight CPU, default memory and disk
+  build-01:
+    threshold_config: tight_cpu
+
+  # Database: tight CPU + tight memory + extra disk partition
+  db-01:
+    threshold_config: [tight_cpu, tight_memory, db_disk]
+
+  # Replica database: tight memory + extra disk, normal CPU
+  db-02:
+    threshold_config: [tight_memory, db_disk]
+```
+
 ### Configuration Priority

-1. **Host-specific mapping**: If host is in `host_threshold_mapping`, use that config
-2. **Default config**: Use `default_threshold_config` 
-3. **First alphabetically**: If default not found, use first config alphabetically
-4. **Legacy fallback**: If `threshold_configs` not present, use `thresholds`
+1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
+2. **Host `threshold_config` (string)**: Use that single named config directly
+3. **`host_threshold_mapping`** (legacy): Same as above, string only
+4. **`default_threshold_config`**: Used for hosts with no mapping
+5. **First alphabetically**: If the default config is not found, use the first config alphabetically
+6. **Legacy `thresholds` section**: Used when `threshold_configs` is absent entirely

-### Example: Complete Multi-Threshold Setup
+### Backward Compatibility

-See `hbd/config_multi_threshold_example.yaml` for a complete example with:
- 4 named configurations (default, high_sensitivity, low_sensitivity, database)
- Host-to-config mappings for production, development, and test systems
- Specialized database server thresholds
- Custom display messages with plugin data
+The legacy `host_threshold_mapping` top-level key and the flat `thresholds` section are still fully supported:
+
+```yaml
+# Still works — equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
+host_threshold_mapping:
+  prod-web-01: high_sensitivity
+
+# Still works — equivalent to threshold_configs: {default: {thresholds: ...}}
+thresholds:
+  cpu_monitor:
+    cpu_percent: {warning: 80, critical: 90}
+```
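The left-to-right layering described for list-valued `threshold_config` can be pictured as a recursive dictionary merge. The following is a minimal sketch, not the project's implementation: `deep_merge` and `layer_thresholds` are hypothetical names, and the default memory values are made up for illustration.

```python
from copy import deepcopy
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with overlay merged in, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def layer_thresholds(defaults: Dict[str, Any],
                     overlays: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply each overlay in order; later overlays win on conflicting keys."""
    result = deepcopy(defaults)
    for overlay in overlays:
        result = deep_merge(result, overlay)
    return result


# Hypothetical defaults; the overlays mirror the README example values.
defaults = {
    "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
    "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
}
tight_cpu = {"cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}}}
db_disk = {"disk_monitor": {"partitions": {
    "/var/lib/postgresql": {"percent": {"warning": 75, "critical": 88}}}}}

effective = layer_thresholds(defaults, [tight_cpu, db_disk])
```

With these inputs, `effective` keeps the default memory thresholds, takes the tighter CPU values, and gains the extra partition entry. Whether the real resolver merges at this depth is an assumption.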

@@ -1,21 +0,0 @@
-Plan the following changes, ask questions to clarify before implementing
-
-Re-factor the notification system:
- use available libraries for pushover, matrix, email and sms notifications.
- notifications have a title/subject: alert_type (recover/warning/critical), a body (info from threshold check) and a link to the host plugin metrics page
- define a list of notification channels for each user
- notifications are dispatched to users that are listed as managers for the host
-
-
-
-1 - correct
-2 - for now channels are defined globally
-3 - matrix-nio sounds good, homeserver URL, access token, room ID per channel?
-4 - use the REST api provided by https://voip.ms/api/v1/rest.php
-5 - The page does not exist yet, point at the host tab in the /plugins
-6 - per-channel minimum severity is a good idea, go for it
-7 - yes
-
-1 - use base_url, there might not have been any incoming requests yet
-2 - use same asyncio loop for matrix-nio
-3 - for now, just silently do nothing
@@ -14,4 +14,4 @@ Install options:
 """

 __all__ = ["__version__"]
-__version__ = "5.1.3"
+__version__ = "5.2.3"
@@ -14,7 +14,6 @@ import signal
 import socket
 import sys
 import time
-from hashlib import md5
 from logging.handlers import SysLogHandler
 from pathlib import Path
 from typing import Dict, List, Optional
@@ -22,6 +21,7 @@ from typing import Dict, List, Optional
 # Import protocol and config
 from .config import load_config
 from ..common.proto import dicttos, stodict
+from .. import __version__

 # Import plugin system
 from .plugin import PluginRegistry, PluginLoader, InfoPlugin, MonitorPlugin
@@ -56,23 +56,27 @@ class AsyncConnection:
        
        self.transport: Optional[asyncio.DatagramTransport] = None
        self.protocol: Optional[asyncio.DatagramProtocol] = None
-        
+        self._dead = False
+        self._ever_opened = False
+        self._open_fail_count = 0   # consecutive failures before first success
+
        self.logger = logging.getLogger(f"hbc.conn.{addr}")
-    
+
    async def open(self) -> bool:
        """Open the UDP connection.
-        
+
        Returns:
            True if successful, False otherwise
        """
        try:
            loop = asyncio.get_event_loop()
-            
+
            # Create datagram endpoint
            self.transport, self.protocol = await loop.create_datagram_endpoint(
                lambda: HeartbeatProtocol(self),
                family=self.af
            )
+            self._ever_opened = True
            self.logger.debug(f"Opened connection to {self.addr}:{self.port}")
            return True
        except Exception as e:
@@ -93,9 +97,12 @@ class AsyncConnection:
            msg: Message dictionary
            msg_id: Message ID (HTB, PLG, etc.)
        """
+        if self._dead:
+            return
+
        if not self.transport:
            await self.open()
-        
+
        if not self.transport:
            self.logger.error("Cannot send - no transport")
            return
@@ -166,8 +173,9 @@ class HeartbeatProtocol(asyncio.DatagramProtocol):
            self.logger.error(f"Error processing datagram: {e}", exc_info=True)
    
    def error_received(self, exc):
-        """Handle protocol errors."""
-        self.logger.error(f"Protocol error: {exc}")
+        """Handle protocol errors — close transport so the heartbeat sender retries."""
+        self.logger.warning(f"Protocol error on {self.connection.addr}: {exc} — will retry")
+        self.connection.close()


 async def handle_command(conn: AsyncConnection, msg: dict):
@@ -204,55 +212,52 @@ async def handle_command(conn: AsyncConnection, msg: dict):
    await conn.sendto(response)


-async def handle_update(conn: AsyncConnection, msg: dict):
-    """Handle self-update from server."""
-    import codecs
+async def handle_update(conn: AsyncConnection, _msg: dict):  # pyright: ignore[reportUnusedParameter]
+    """Handle self-update by running hb_install.sh."""
    import shutil
-    
+
    logger = logging.getLogger("hbc.update")
-    
+
+    installer = shutil.which("hb_install.sh")
+    if installer is None:
+        candidate = Path(sys.argv[0]).parent / "hb_install.sh"
+        if candidate.exists():
+            installer = str(candidate)
+
+    if installer is None:
+        error = "hb_install.sh not found in PATH or alongside hbc"
+        logger.error(error)
+        await conn.sendto({"service": "update", "msg": error})
+        return
+
+    logger.info(f"Running installer: {installer}")
    try:
-        code = codecs.decode(msg["code"], "base64").decode()
-        csum = msg["csum"]
+        proc = await asyncio.create_subprocess_exec(
+            installer, "client",
+            stdout=asyncio.subprocess.PIPE,
+            stderr=asyncio.subprocess.STDOUT,
+        )
+        out, _ = await asyncio.wait_for(proc.communicate(), timeout=120)
+    except asyncio.TimeoutError:
+        error = "Installer timed out"
+        logger.error(error)
+        await conn.sendto({"service": "update", "msg": error})
+        return
    except Exception as e:
-        error = f"Missing code/csum: {e}"
+        error = f"Installer failed: {e}"
        logger.error(error)
        await conn.sendto({"service": "update", "msg": error})
        return
-    
-    # Verify checksum
-    m = md5()
-    m.update(code.encode())
-    if m.hexdigest() != csum:
-        error = "Checksum mismatch"
+
+    if proc.returncode != 0:
+        error = f"Installer exited {proc.returncode}: {out.decode().strip()}"
        logger.error(error)
        await conn.sendto({"service": "update", "msg": error})
        return
-    
-    # Backup current file
-    fn = sys.argv[0]
-    ofn = f"{fn}.sav"
-    try:
-        shutil.copy2(fn, ofn)
-    except Exception as e:
-        error = f"Backup failed: {e}"
-        logger.error(error)
-        await conn.sendto({"service": "update", "msg": error})
-        return
-    
-    # Write new code
-    try:
-        with open(fn, "w") as fh:
-            fh.write(code)
-    except Exception as e:
-        error = f"Write failed: {e}"
-        logger.error(error)
-        await conn.sendto({"service": "update", "msg": error})
-        return
-    
+
    logger.info("Update successful, restart required")
    await conn.sendto({"service": "update", "msg": "OK"})
-    
+
    # Trigger restart
    global dorestart
    dorestart = True
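The new `handle_update` runs the installer as a subprocess under a timeout. A standalone sketch of that pattern follows. `run_with_timeout` is a hypothetical helper, and killing the child on timeout is an addition made for this sketch; the hunk above only reports the timeout.

```python
import asyncio
import sys
from typing import List, Optional, Tuple


async def run_with_timeout(argv: List[str],
                           timeout: float) -> Tuple[Optional[int], str]:
    """Run argv with stderr folded into stdout.

    Returns (returncode, output). returncode is None when the timeout
    expired, in which case the child is killed and reaped first.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()        # sketch-only choice: do not leave the child running
        await proc.wait()  # reap it so no zombie process remains
        return None, "timed out"
    return proc.returncode, out.decode(errors="replace").strip()


if __name__ == "__main__":
    # Stand-in command; the real code would pass the installer path and "client".
    print(asyncio.run(run_with_timeout([sys.executable, "-c", "print('ok')"], 10)))
```

Returning `None` rather than a numeric sentinel avoids confusion with negative return codes, which on POSIX indicate termination by a signal.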
@@ -260,15 +265,51 @@ async def handle_update(conn: AsyncConnection, msg: dict):


 async def heartbeat_sender(conn: AsyncConnection, interval: int):
-    """Send periodic heartbeats.
-    
+    """Send periodic heartbeats, retrying the connection if it is not open.
+
+    IPv6 connections that fail to open before their first successful send are
+    dropped after IPV6_EARLY_FAIL_LIMIT attempts so that a network without IPv6
+    does not keep a dead sender alive.  IPv4 connections are retried indefinitely.
+
    Args:
        conn: Connection to send on
        interval: Heartbeat interval in seconds
    """
    logger = logging.getLogger("hbc.heartbeat")
-    
-    while running:
+    IPV6_EARLY_FAIL_LIMIT = 3
+
+    while running and not conn._dead:
+        # Ensure transport is open before attempting to send.
+        if not conn.transport:
+            opened = await conn.open()
+            if opened:
+                conn._open_fail_count = 0
+            else:
+                conn._open_fail_count += 1
+                # Drop an IPv6 connection that has never come up within the
+                # first few attempts — it is likely unavailable on this network.
+                if (not conn._ever_opened
+                        and conn.af == socket.AF_INET6
+                        and conn._open_fail_count >= IPV6_EARLY_FAIL_LIMIT):
+                    logger.warning(
+                        f"IPv6 connection to {conn.addr} unreachable after "
+                        f"{conn._open_fail_count} attempts, disabling"
+                    )
+                    conn._dead = True
+                    break
+                # Retry after the normal interval; IPv4 retries forever.
+                try:
+                    if shutdown_event:
+                        await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
+                        break
+                    else:
+                        await asyncio.sleep(interval)
+                except asyncio.TimeoutError:
+                    pass
+                except asyncio.CancelledError:
+                    raise
+                continue
+
        try:
            msg = {
                "acks": conn.ackcount,
@@ -276,20 +317,17 @@ async def heartbeat_sender(conn: AsyncConnection, interval: int):
                "interval": interval
            }
            await conn.sendto(msg, "HTB")
-            
-        except Exception as e:
-            logger.error(f"Error sending heartbeat: {e}", exc_info=True)
+
        except asyncio.CancelledError:
            logger.debug("Heartbeat sender cancelled")
            raise
-        
+        except Exception as e:
+            logger.error(f"Error sending heartbeat: {e}", exc_info=True)
+
        # Wait for next interval or shutdown event
        try:
            if shutdown_event:
-                await asyncio.wait_for(
-                    shutdown_event.wait(), 
-                    timeout=interval
-                )
+                await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
                break
            else:
                await asyncio.sleep(interval)
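The drop rule inside `heartbeat_sender` can be isolated as a pure predicate, which makes the policy easy to unit test. This is a sketch only: `should_disable` is a hypothetical helper, and its arguments correspond to `conn.af`, `conn._ever_opened`, and `conn._open_fail_count` in the hunk above.

```python
import socket

IPV6_EARLY_FAIL_LIMIT = 3  # same value as the constant in the hunk above


def should_disable(af: int, ever_opened: bool, open_fail_count: int,
                   limit: int = IPV6_EARLY_FAIL_LIMIT) -> bool:
    """Return True when a connection should be marked dead.

    Only IPv6 connections that have never opened successfully are dropped,
    and only once the consecutive failure count reaches the limit.
    """
    return (not ever_opened
            and af == socket.AF_INET6
            and open_fail_count >= limit)
```

An IPv4 connection never satisfies the predicate, which matches the stated intent that IPv4 retries forever.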
@@ -425,16 +463,13 @@ async def cleanup(connections: List[AsyncConnection]):
    logger = logging.getLogger("hbc.cleanup")
    logger.info("Cleaning up connections")
    
-    for conn in connections:
+    target = next((c for c in connections if c.transport), connections[0] if connections else None)
+    if target and send_shutdown:
        try:
-            msg = {
-                "shutdown": 1,
-                "acks": conn.ackcount
-            }
-            await conn.sendto(msg)
+            await target.sendto({"shutdown": 1, "acks": target.ackcount})
        except Exception as e:
            logger.error(f"Error sending shutdown: {e}")
-        
+    for conn in connections:
        conn.close()
    
    # Give messages time to send
@@ -443,7 +478,7 @@ async def cleanup(connections: List[AsyncConnection]):

 async def async_main(args, config):
    """Async main function."""
-    global running, shutdown_event, active_tasks
+    global running, shutdown_event, active_tasks, send_shutdown
    
    # Create shutdown event
    shutdown_event = asyncio.Event()
@@ -460,6 +495,7 @@ async def async_main(args, config):
    hb_port = config.get("hb_port", PORT)
    interval = config.get("interval", INTERVAL)
    
+    logger.info(f"hbc {__version__} starting on {iam}")
    logger.info(f"Starting hbc for {iam} -> {hb_hosts}")
    logger.info(f"Port: {hb_port}, Interval: {interval}s")
    
@@ -477,30 +513,34 @@ async def async_main(args, config):
        for addr_info in addrs:
            af = addr_info[0]
            addr = addr_info[4][0]
-            
+
            conn = AsyncConnection(conn_id, addr, hb_port, af, iam)
-            if await conn.open():
-                connections.append(conn)
-                conn_id += 1
-    
+            if not await conn.open():
+                logger.warning(f"Initial open to {addr} failed, heartbeat sender will retry")
+            connections.append(conn)
+            conn_id += 1
+
    if not connections:
-        logger.error("No connections established")
+        logger.error("No connections established (DNS resolution failed for all hosts)")
        return 1
    
    logger.info(f"Created {len(connections)} connections")
    
    # Send boot/message if requested
+    send_shutdown = False
    if args.boot or args.message:
        boot_msg = {}
        if args.boot:
            boot_msg["boot"] = 1
+            args.boot = False  # Clear boot flag so we don't send it again in main loop
+            send_shutdown = True
        if args.message:
            boot_msg["service"] = "service"
            boot_msg["msg"] = args.message
        
        boot_msg["acks"] = 0
-        for conn in connections:
-            await conn.sendto(boot_msg)
+        target = next((c for c in connections if c.transport), connections[0])
+        await target.sendto(boot_msg)
        
        if args.message and not args.daemon:
            # Message-only mode
@@ -522,6 +562,13 @@ async def async_main(args, config):
    loop = asyncio.get_event_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop)
+
+    def _sighup():
+        global dorestart
+        dorestart = True
+        stop()
+
+    loop.add_signal_handler(signal.SIGHUP, _sighup)
    
    # Start async tasks
    # Heartbeat senders (one per connection)
@@ -693,7 +740,7 @@ def main(argv=None):
    
    # Daemonize if requested
    if args.daemon:
-        print("Daemonizing...")
+        logging.info("Daemonizing...")
        daemonize()
        _reconfigure_logging_for_daemon(log_level)
        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
@@ -118,6 +118,13 @@ class CPUMonitorPlugin(MonitorPlugin):
                    data["cpu_iowait"] = round(cpu_times.iowait, 1)
            except Exception as e:
                self.logger.debug(f"Could not get CPU times: {e}")
+
+            # Uptime in seconds
+            try:
+                import time
+                data["uptime_seconds"] = int(time.time() - self.psutil.boot_time())
+            except Exception as e:
+                self.logger.debug(f"Could not get uptime: {e}")
            
            self.logger.debug(
                f"Collected CPU metrics: {data.get('cpu_percent', 'N/A')}% usage"
@@ -14,6 +14,24 @@ except ImportError:

 from hbd.client.plugin import MonitorPlugin

+
+def _zfs_arc_bytes() -> int:
+    """Return current ZFS ARC size in bytes, or 0 if ZFS is not present.
+
+    ZFS ARC is reclaimable but is not included in MemAvailable by the Linux
+    kernel (it is not in SReclaimable), so it would otherwise be counted as
+    used memory.
+    """
+    try:
+        with open("/proc/spl/kstat/zfs/arcstats") as fh:
+            for line in fh:
+                parts = line.split()
+                if len(parts) >= 3 and parts[0] == "size":
+                    return int(parts[2])
+    except (OSError, ValueError):
+        pass
+    return 0
+
 logger = logging.getLogger(__name__)


@@ -101,11 +119,21 @@ class MemoryMonitorPlugin(MonitorPlugin):
        
        # Virtual (physical) memory statistics
        vmem = psutil.virtual_memory()
+
+        # psutil's available already excludes page cache / file buffers
+        # (uses MemAvailable on Linux). Add ZFS ARC on top because the kernel
+        # does not include it in SReclaimable / MemAvailable even though it is
+        # reclaimable.
+        arc_bytes = _zfs_arc_bytes()
+        available = min(vmem.available + arc_bytes, vmem.total)
+        used = vmem.total - available
+        percent = round(used / vmem.total * 100, 1) if vmem.total else 0.0
+
        metrics['memory_total'] = vmem.total
-        metrics['memory_available'] = vmem.available
-        metrics['memory_used'] = vmem.used
+        metrics['memory_available'] = available
+        metrics['memory_used'] = used
        metrics['memory_free'] = vmem.free
-        metrics['memory_percent'] = vmem.percent
+        metrics['memory_percent'] = percent
        
        # Platform-specific memory details
        if hasattr(vmem, 'active'):
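The ARC adjustment above is plain arithmetic and can be checked without psutil or a ZFS host. The sketch below uses hypothetical helper names (`parse_arc_size`, `adjust_for_arc`) and made-up sample values. It follows the same steps as the hunk: read the `size` row, add it to available memory, clamp to total, and recompute the percentage.

```python
from typing import Iterable, Tuple


def parse_arc_size(lines: Iterable[str]) -> int:
    """Return the value of the 'size' row from arcstats-style lines, or 0."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "size":
            try:
                return int(parts[2])
            except ValueError:
                return 0
    return 0


def adjust_for_arc(total: int, available: int,
                   arc_bytes: int) -> Tuple[int, int, float]:
    """Count ARC as available memory, clamped so it never exceeds total."""
    adj_available = min(available + arc_bytes, total)
    used = total - adj_available
    percent = round(used / total * 100, 1) if total else 0.0
    return adj_available, used, percent


GIB = 1024 ** 3
# Illustrative rows in "name type data" layout; the values are made up.
sample = ["name type data", "hits 4 100", f"size 4 {6 * GIB}"]
arc = parse_arc_size(sample)
available, used, percent = adjust_for_arc(16 * GIB, 4 * GIB, arc)
```

Here 4 GiB available plus 6 GiB of ARC gives 10 GiB available out of 16 GiB, so used memory is 6 GiB, or 37.5 percent.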
@@ -31,16 +31,13 @@ from hbd.client.plugin import MonitorPlugin


 # Nagios exit codes
-NAGIOS_OK = 0
-NAGIOS_WARNING = 1
-NAGIOS_CRITICAL = 2
 NAGIOS_UNKNOWN = 3

 STATUS_NAMES = {
-    NAGIOS_OK: "OK",
-    NAGIOS_WARNING: "WARNING",
-    NAGIOS_CRITICAL: "CRITICAL",
-    NAGIOS_UNKNOWN: "UNKNOWN"
+    0: "OK",
+    1: "WARNING",
+    2: "CRITICAL",
+    3: "UNKNOWN",
 }


@@ -128,52 +125,39 @@ class NagiosRunnerPlugin(MonitorPlugin):
            Dictionary with results from all plugins
        """
        results = {}
-        
-        # Track overall status (worst status wins)
-        worst_status = NAGIOS_OK
-        
+
        for cmd_config in self.commands:
            name = cmd_config.get("name")
            command = cmd_config.get("command")
-            
+
            if not name or not command:
                self.logger.warning("Skipping command with missing name or command")
                continue
-            
+
            # Execute plugin
            try:
                status_code, output, perfdata = await self._run_nagios_plugin(command)
-                
+
                # Store results
                results[f"{name}_status"] = STATUS_NAMES.get(status_code, "UNKNOWN")
                results[f"{name}_status_code"] = status_code
                results[f"{name}_output"] = output
-                
-                # Track worst status
-                if status_code > worst_status:
-                    worst_status = status_code
-                
+
                # Parse and add performance data
                if perfdata:
                    for metric_name, metric_value in perfdata.items():
                        results[f"{name}_{metric_name}"] = metric_value
-                
+
                self.logger.info(
                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
                )
-                
+
            except Exception as e:
                self.logger.error(f"Error running {name}: {e}", exc_info=True)
                results[f"{name}_status"] = "ERROR"
                results[f"{name}_status_code"] = NAGIOS_UNKNOWN
                results[f"{name}_output"] = str(e)
-                worst_status = NAGIOS_UNKNOWN
-        
-        # Add overall status
-        results["overall_status"] = STATUS_NAMES.get(worst_status, "UNKNOWN")
-        results["overall_status_code"] = worst_status
-        results["plugin_count"] = len(self.commands)
-        
+
        return results
    
    async def _run_nagios_plugin(
@@ -60,6 +60,7 @@ class OSInfoPlugin(InfoPlugin):
                "python_version": platform.python_version(),
                "python_implementation": platform.python_implementation(),
                "hbc_version": hbc_version,
+                "hbc_type": "full",
            }
            
            # Add Linux-specific distribution info
@@ -13,12 +13,8 @@ plugins:
    count: 3              # ICMP packets per ping run (default 3)
    timeout: 5            # seconds before a host is considered unreachable (default 5)
    hosts:
-      8.8.8.8:
-        warning: 20.0     # ms
-        critical: 100.0   # ms
-      192.168.1.1:
-        warning: 5.0
-        critical: 20.0
+      - 8.8.8.8
+      - 192.168.1.1
 ```

 Reported metrics per host (metric key uses the hostname with dots/colons replaced
@@ -0,0 +1,130 @@
+"""
+ZFS pool monitoring plugin for Heartbeat.
+
+Collects per-pool health, capacity, and cumulative I/O statistics via zpool(8).
+"""
+
+import asyncio
+import logging
+import shutil
+from typing import Any, Dict, List, Optional
+
+from hbd.client.plugin import MonitorPlugin
+
+logger = logging.getLogger(__name__)
+
+
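+def _int(s: str) -> Optional[int]:
+    """Parse an integer, ignoring trailing unit or percent characters; None on failure."""
+    try:
+        return int(s.strip().rstrip("KMGTkBkmgt%x"))
+    except (ValueError, AttributeError):
+        return None
+
+
+def _float(s: str) -> Optional[float]:
+    """Parse a float, ignoring a trailing '%' or 'x' (as in '1.00x'); None on failure."""
+    try:
+        return float(s.strip().rstrip("%x"))
+    except (ValueError, AttributeError):
+        return None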
+
+
+class ZFSMonitorPlugin(MonitorPlugin):
+    """Monitor ZFS pool health, capacity, and I/O statistics.
+
+    Collects per pool:
+    - health: ONLINE, DEGRADED, FAULTED, etc.
+    - size / alloc / free: total, allocated and free bytes
+    - capacity: percentage used (0-100)
+    - frag: fragmentation percentage
+    - dedup: deduplication ratio
+    - read_ops / write_ops: cumulative I/O operations since last boot/clear
+    - read_bw / write_bw: cumulative bytes transferred since last boot/clear
+
+    Configuration:
+        interval: collection interval in seconds (default: 300)
+        pools: list of pool names to monitor (default: all)
+    """
+
+    name = "zfs_monitor"
+    description = "ZFS pool health, capacity, and I/O statistics"
+    interval = 300
+
+    def __init__(self, config: Optional[Dict[str, Any]] = None):
+        super().__init__(config)
+        self.interval = self.config.get("interval", 300)
+        self._pools_filter: Optional[List[str]] = self.config.get("pools", None)
+
+    async def initialize(self) -> bool:
+        if not shutil.which("zpool"):
+            self.skip_reason = "zpool not found"
+            return False
+        logger.info("ZFS monitor initialized (interval: %ds)", self.interval)
+        return True
+
+    async def _run(self, *args: str) -> List[str]:
+        """Run a command and return its stdout lines, or [] on error."""
+        try:
+            proc = await asyncio.create_subprocess_exec(
+                *args,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.DEVNULL,
+            )
+            stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=15)
+            return stdout.decode(errors="replace").splitlines()
+        except (FileNotFoundError, asyncio.TimeoutError) as exc:
+            logger.warning("zfs_monitor: %s: %s", args[0], exc)
+            return []
+
+    async def _zpool_list(self) -> Dict[str, Dict]:
+        """Return per-pool health and capacity from `zpool list`."""
+        lines = await self._run(
+            "zpool", "list", "-H", "-p",
+            "-o", "name,health,size,alloc,free,cap,frag,dedup",
+        )
+        pools: Dict[str, Dict] = {}
+        for line in lines:
+            parts = line.split("\t")
+            if len(parts) < 8:
+                continue
+            name = parts[0].strip()
+            if self._pools_filter and name not in self._pools_filter:
+                continue
+            pools[name] = {
+                "health":   parts[1].strip(),
+                "size":     _int(parts[2]),
+                "alloc":    _int(parts[3]),
+                "free":     _int(parts[4]),
+                "capacity": _float(parts[5]),
+                "frag":     _float(parts[6]),
+                "dedup":    _float(parts[7]),
+            }
+        return pools
+
+    async def _zpool_iostat(self) -> Dict[str, Dict]:
+        """Return per-pool cumulative I/O counters from `zpool iostat`."""
+        lines = await self._run("zpool", "iostat", "-H", "-p")
+        io: Dict[str, Dict] = {}
+        for line in lines:
+            parts = line.split("\t")
+            if len(parts) < 7:
+                continue
+            name = parts[0].strip()
+            if not name:
+                continue
+            io[name] = {
+                "read_ops": _int(parts[3]),
+                "write_ops": _int(parts[4]),
+                "read_bw":  _int(parts[5]),
+                "write_bw": _int(parts[6]),
+            }
+        return io
+
+    async def _collect_metrics(self) -> Dict[str, Any]:
+        pools, io = await asyncio.gather(self._zpool_list(), self._zpool_iostat())
+        for name, stats in io.items():
+            if name in pools:
+                pools[name].update(stats)
+        return {"pools": pools}
+
+
+plugin = ZFSMonitorPlugin
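The row parsing in `_zpool_list` can be exercised without a ZFS host by feeding it a tab-separated line directly. The sketch below is a simplified stand-in: `parse_zpool_list_line` and `_num` are hypothetical names, and the sample row is invented rather than captured from `zpool`.

```python
from typing import Any, Callable, Dict, Optional


def _num(text: str, cast: Callable[[str], Any]) -> Optional[Any]:
    """Convert text with cast, tolerating a trailing '%' or 'x'; None on failure."""
    try:
        return cast(text.strip().rstrip("%x"))
    except ValueError:
        return None


def parse_zpool_list_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one row in name,health,size,alloc,free,cap,frag,dedup column order."""
    parts = line.split("\t")
    if len(parts) < 8:
        return None  # short or malformed row, skipped like in the plugin
    return {
        "name": parts[0].strip(),
        "health": parts[1].strip(),
        "size": _num(parts[2], int),
        "alloc": _num(parts[3], int),
        "free": _num(parts[4], int),
        "capacity": _num(parts[5], float),
        "frag": _num(parts[6], float),
        "dedup": _num(parts[7], float),
    }


# Invented sample row; "-" stands for a value the tool could not report.
row = parse_zpool_list_line("tank\tONLINE\t1000\t250\t750\t25\t-\t1.00x")
```

A field that cannot be converted becomes `None` instead of raising, so one odd column does not discard the whole pool entry.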
@@ -144,17 +144,16 @@ def cmd_notify(args):
        url=f"{base_url}/plugins" if base_url else "",
    )

-    # Bypass min_level for explicit test sends; run async channels directly
    import asyncio
+    from .notify import _send_matrix_async, _send_sms_voipms_async, _DRIVERS
    ch_type = channel_cfg.get("type", "")
    print(f"Sending via {args.channel} ({ch_type}): {title} — {args.message}")

-    if ch_type in ("matrix", "sms_voipms"):
-        from .notify import _send_matrix_async, _send_sms_voipms_async
-        driver_async = _send_matrix_async if ch_type == "matrix" else _send_sms_voipms_async
-        ok = asyncio.run(driver_async(channel_cfg, notif))
+    if ch_type == "matrix":
+        ok = asyncio.run(_send_matrix_async(channel_cfg, notif))
+    elif ch_type == "sms_voipms":
+        ok = asyncio.run(_send_sms_voipms_async(channel_cfg, notif))
    else:
-        from .notify import _DRIVERS
        driver = _DRIVERS.get(ch_type)
        if driver is None:
            print(f"Error: unknown channel type '{ch_type}'", file=sys.stderr)
@@ -95,6 +95,12 @@ THRESHOLD_DEFAULTS = {
                'warning': 200,
                'critical': 250.0,
                'count': 3  # Optional: number of consecutive breaches before alerting
+            },
+            'nagios_runner': {
+                'status_code': {
+                    'display': '{check_name} {output}',
+                    'operator': 'nagios'
+                }
            }
        }
    }
@@ -225,7 +231,7 @@ def get_watchhosts(config):
    hosts_config = config.get("hosts", {})
    if isinstance(hosts_config, dict):
        for host_name, host_attrs in hosts_config.items():
-            if isinstance(host_attrs, dict) and host_attrs.get("watch", False):
+            if isinstance(host_attrs, dict) and host_attrs.get("watch", True):
                watchhosts.append(host_name)
    return watchhosts
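The effect of flipping the `watch` default from `False` to `True` can be seen with a small standalone version of the loop. This is a sketch that mirrors only the lines visible in the hunk; the host names are made up.

```python
from typing import Any, Dict, List


def get_watchhosts(config: Dict[str, Any]) -> List[str]:
    """Return hosts that are watched; a missing 'watch' key now means watched."""
    watchhosts: List[str] = []
    hosts_config = config.get("hosts", {})
    if isinstance(hosts_config, dict):
        for host_name, host_attrs in hosts_config.items():
            if isinstance(host_attrs, dict) and host_attrs.get("watch", True):
                watchhosts.append(host_name)
    return watchhosts


config = {
    "hosts": {
        "web-01": {},                   # no flag: watched under the new default
        "lab-01": {"watch": False},     # explicitly silenced
        "db-01": {"watch": True},
        "bare-01": None,                # not a dict: skipped by the isinstance check
    }
}
watched = get_watchhosts(config)
```

Note that a host entry with no attributes at all (a YAML key with an empty value) is not a dict and is therefore still skipped by this check.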

@@ -95,7 +95,7 @@ class Connection:
        if not Null:
            d["addr"] = self.addr
            if self.rtts[-1]:
-                d["rtt"] = "%0.1f" % self.rtts[-1]
+                d["rtt"] = "%d" % round(self.rtts[-1])
            elif self.state == Connection.UNKNOWN:
                d["rtt"] = ""
            else:
@@ -286,7 +286,7 @@ class Host:
            Host.hosts[name] = self
        self.num = num
        self.dyn = False
-        self.watched = False
+        self.watched = True
        self.upcount = 0
        self.interval = 0
        self.doesack = -1
@@ -304,6 +304,7 @@ class Host:

    def statedict(self):
        d = {}
+        d["raw_name"] = self.name
        d["name"] = self.name
        if self.dyn:
            d["name"] += "*"
@@ -1,7 +1,11 @@
 """HTTP server implementation using aiohttp and jinja2."""

 import asyncio
+import datetime
 import json
+import platform
+import socket
+import sys
 import time
 import urllib.parse
 import os
@@ -111,6 +115,7 @@ async def start(
    This function is intended to be awaited inside the main asyncio event loop.
    """
    get_now = get_now or (lambda: time.time())
+    _start_epoch = time.time()

    async def old_index(request):
        _require_auth_redirect(request)
@@ -149,6 +154,25 @@ async def start(
        lst = [h.jsons() for h in hosts]
        return web.json_response(json.loads("[" + ",".join(lst) + "]"))

+    async def api_alert_summary(request):
+        """GET /api/0/alert_summary — counts of ok/warning/critical hosts visible to caller."""
+        user, err = _require_auth(request)
+        if err:
+            return err
+        from .threshold import AlertLevel
+        critical = warning = ok = 0
+        for host in hbdclass.Host.hosts.values():
+            if not _can_operate_host(user, host):
+                continue
+            levels = {s.level for s in host.alert_states.values()}
+            if AlertLevel.CRITICAL in levels:
+                critical += 1
+            elif AlertLevel.WARNING in levels:
+                warning += 1
+            else:
+                ok += 1
+        return web.json_response({"critical": critical, "warning": warning, "ok": ok})
+
    async def api_messages(request):
        lst = data.msgs[-30:]
        return web.json_response(lst)
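The counting rule in `api_alert_summary` is "worst level wins per host". A self-contained sketch of that rule follows. The `AlertLevel` enum here is a hypothetical stand-in for the one imported from `.threshold`, whose actual members are not shown in this diff.

```python
from enum import Enum
from typing import Dict, Iterable


class AlertLevel(Enum):
    """Hypothetical stand-in for the project's AlertLevel."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def summarize(hosts_levels: Iterable[Iterable[AlertLevel]]) -> Dict[str, int]:
    """Count hosts by their worst alert level; hosts with no alerts count as ok."""
    counts = {"critical": 0, "warning": 0, "ok": 0}
    for levels in hosts_levels:
        seen = set(levels)
        if AlertLevel.CRITICAL in seen:
            counts["critical"] += 1
        elif AlertLevel.WARNING in seen:
            counts["warning"] += 1
        else:
            counts["ok"] += 1
    return counts


summary = summarize([
    [AlertLevel.OK, AlertLevel.CRITICAL],  # one critical metric makes the host critical
    [AlertLevel.WARNING],
    [],                                    # no alert states recorded
    [AlertLevel.OK],
])
```

Each host contributes to exactly one bucket, so the three counts always sum to the number of hosts visible to the caller.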
@@ -210,15 +234,11 @@ async def start(
            return err
        qa = request.rel_url.query
        uname = urllib.parse.unquote(qa.get("h", ""))
-        ucode = qa.get("c")
-        if not ucode or not uname:
-            return web.Response(status=400, text="need h= and c= arguments")
+        if not uname:
+            return web.Response(status=400, text="need h= argument")
        if uname != "All" and uname not in hbdclass.Host.hosts:
            return web.Response(status=400, text=f"h={uname} not found")
-        if uname != "All":
-            names = [uname]
-        else:
-            names = [n for n in hbdclass.Host.hosts]
+        names = [uname] if uname != "All" else list(hbdclass.Host.hosts)
        out = []
        for n in names:
            host = hbdclass.Host.hosts[n]
@@ -227,8 +247,7 @@ async def start(
                continue
            op_err = None
            try:
-                r = {"csum": None, "code": ucode}
-                host.cmds.append(("UPD", r))
+                host.cmds.append(("UPD", {}))
            except Exception as e:
                op_err = str(e)
            out.append(f"update started for {n}: {op_err if op_err else 'OK'}")
@@ -258,7 +277,9 @@ async def start(
            extra_scripts=extra_scripts,
            hbd_version=hbd_version,
            hosts=[
-                hbdclass.Host.hosts[h].stateinfo() for h in sorted(hbdclass.Host.hosts)
+                hbdclass.Host.hosts[h].stateinfo()
+                for h in sorted(hbdclass.Host.hosts)
+                if _can_operate_host(current_user, hbdclass.Host.hosts[h])
            ],
            messages=data.msgs[-30:],
            current_user=current_user.to_dict() if current_user else None,
@@ -510,18 +531,19 @@ async def start(
        hosts_with_plugins = []
        for hostname in sorted(hbdclass.Host.hosts.keys()):
            host = hbdclass.Host.hosts[hostname]
-            if not _can_view_host(current_user, host):
+            if not _can_operate_host(current_user, host):
                continue
            if host.plugin_data:
                hosts_with_plugins.append({
                    "name": hostname,
                    "plugins": list(host.plugin_data.keys()),
+                    "is_owner": _can_own_host(current_user, host),
                })

        tmpl = env.get_template("plugins.html")
        body = tmpl.render(
-            title="Plugin Metrics - Heartbeat",
-            header="Plugin Metrics",
+            title="Host Overview - Heartbeat",
+            header="Host Overview",
            hosts=hosts_with_plugins,
            current_user=current_user.to_dict() if current_user else None,
            active_page="plugins",
@@ -811,6 +833,48 @@ async def start(
        )
        return web.Response(text=body, content_type="text/html")

+    # -------------------------------------------------------------------------
+    # About page
+    # -------------------------------------------------------------------------
+
+    async def about_page(request):
+        """GET /about — version, runtime, and project information."""
+        current_user, _ = _require_auth_redirect(request)
+        pkg_dir = os.path.dirname(__file__)
+        templates_dir = config.get("templates_dir", os.path.join(pkg_dir, "templates"))
+        env = jinja2.Environment(loader=jinja2.FileSystemLoader(templates_dir))
+        from hbd import __version__ as hbd_version
+
+        uptime_secs = int(time.time() - _start_epoch)
+        days, rem = divmod(uptime_secs, 86400)
+        hours, rem = divmod(rem, 3600)
+        mins, secs = divmod(rem, 60)
+        if days:
+            uptime_str = f"{days}d {hours}h {mins}m"
+        elif hours:
+            uptime_str = f"{hours}h {mins}m {secs}s"
+        else:
+            uptime_str = f"{mins}m {secs}s"
+
+        start_dt = datetime.datetime.fromtimestamp(_start_epoch)
+        start_time_str = start_dt.strftime("%Y-%m-%d %H:%M:%S")
+
+        tmpl = env.get_template("about.html")
+        body = tmpl.render(
+            title="About - Heartbeat",
+            header="About",
+            hbd_version=hbd_version,
+            python_version=f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro} ({platform.python_implementation()})",
+            server_hostname=socket.gethostname(),
+            start_epoch=int(_start_epoch),
+            start_time_str=start_time_str,
+            uptime_str=uptime_str,
+            host_count=len(hbdclass.Host.hosts),
+            current_user=current_user.to_dict() if current_user else None,
+            active_page="about",
+        )
+        return web.Response(text=body, content_type="text/html")
+
    # -------------------------------------------------------------------------
    # Settings page (admin only)
    # -------------------------------------------------------------------------
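
Reviewer note: the uptime formatting inside `about_page` can be checked in isolation. The sketch below lifts the same `divmod` chain into a standalone function; the name `format_uptime` is hypothetical and does not exist in the diff, which keeps the logic inline.

```python
def format_uptime(uptime_secs: int) -> str:
    """Format a duration in seconds using the same branches as about_page."""
    days, rem = divmod(uptime_secs, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    mins, secs = divmod(rem, 60)
    if days:
        # Seconds are dropped once the uptime reaches a full day.
        return f"{days}d {hours}h {mins}m"
    if hours:
        return f"{hours}h {mins}m {secs}s"
    return f"{mins}m {secs}s"


print(format_uptime(59))        # 0m 59s
print(format_uptime(3_725))     # 1h 2m 5s
print(format_uptime(200_000))   # 2d 7h 33m
```

The JavaScript `fmt()` in `about.html` follows the same three branches, so the server-rendered string and the client-side ticker should agree.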
@@ -826,7 +890,7 @@ async def start(
        tmpl = env.get_template("settings.html")
        body = tmpl.render(
            title="Settings - Heartbeat",
-            sections=settings_mod.get_settings_sections(config),
+            sections=settings_mod.get_settings_sections(config, threshold_checker=threshold_checker),
            current_user=current_user.to_dict() if current_user else None,
            active_page="settings",
        )
@@ -849,6 +913,7 @@ async def start(
            web.get("/api/0/users/{username}/avatar", api_user_avatar),
            # Hosts
            web.get("/api/0/hosts", api_hosts),
+            web.get("/api/0/alert_summary", api_alert_summary),
            web.get("/api/0/messages", api_messages),
            web.get("/api/0/hosts/{hostname}/plugins", api_host_plugins),
            web.get("/api/0/hosts/{hostname}/plugins/{plugin_name}", api_host_plugin_detail),
@@ -864,6 +929,7 @@ async def start(
            web.get("/live", live),
            web.get("/plugins", plugins_page),
            web.get("/alerts", alerts_page),
+            web.get("/about", about_page),
            web.get("/profile", profile_page),
            web.get("/settings", settings_page),
            web.get("/static/{path:.*}", static),
@@ -101,9 +101,10 @@ async def reload_configuration(config_obj, config_path, components):
            access = config_mod.get_host_access(new_config, hostname)
            host.apply_access(access["owner"], access["managers"], access["monitors"])

-        # Reload threshold checker
+        # Reload threshold checker and prune alerts orphaned by the new config
        if 'threshold_checker' in components:
            components['threshold_checker'].reload(new_config)
+            components['threshold_checker'].purge_stale_alerts(hbdclass)
        
        # Note: Changes to the following require restart:
        # - hb_port, hbd_port, ws_port (already bound)
@@ -210,7 +211,6 @@ async def _run_async(config, config_path=None):
        ctx = dict(
            config=config,
            hbdclass=hbdclass,
-            log=eventlog,
            msg_to_websockets=msg_to_websockets,
            msg_journal=msg_journal,
            threshold_checker=threshold_checker,
@@ -237,12 +237,15 @@ async def _run_async(config, config_path=None):
    restore_ctx = dict(
        config=config,
        hbdclass=hbdclass,
-        log=eventlog,
        msg_to_websockets=msg_to_websockets,
        threshold_checker=threshold_checker,
    )
    udp.restore_connection_timers(hbdclass, restore_ctx)

+    # Drop alert states that no longer have a matching threshold (stale after
+    # upgrade or config change between runs).
+    threshold_checker.purge_stale_alerts(hbdclass)
+
    # HTTP server (asyncio-based via aiohttp)
    try:
        http_task = asyncio.create_task(
@@ -252,6 +255,7 @@ async def _run_async(config, config_path=None):
                config=config,
                hbdclass=hbdclass,
                tcss=None,
+                threshold_checker=threshold_checker,
                verbose=config.get("verbose", False),
                get_now=lambda: time.time(),
                VER="",
@@ -471,6 +475,8 @@ def run(config, config_path=None):
    if config.get("debug", 0) > 0:
        log_level = logging.DEBUG
    logging.basicConfig(level=log_level)
+    if not config.get("debug", 0):
+        logging.getLogger("aiohttp.access").propagate = False
    load_pickled_hosts(config, hbdclass)

    notify_mod.initlog(logfile=config.get("logfile", "messages.log"))
@@ -15,7 +15,6 @@ their own ``notification_channels`` list.  When no users are configured the
 server runs silently (no notifications sent).
 """

-import asyncio
 import asyncio
 import logging
 import smtplib
@@ -30,13 +29,10 @@ from . import ws as ws_mod

 logger = logging.getLogger(__name__)

-logger = logging.getLogger(__name__)
-
 msg_to_websockets = ws_mod.broadcast

 # Module-level state set via setup()
 _config: dict = {}
-_loop: Optional[asyncio.AbstractEventLoop] = None

 # Tracks which channels fired a WARNING/CRITICAL per host.
 # {host_name: set of channel_names}  — used to route RECOVER to the same channels.
@@ -73,11 +69,9 @@ class Notification:
 # ---------------------------------------------------------------------------

 def setup(cfg: dict, loop: Optional[asyncio.AbstractEventLoop] = None):
-    """Initialize notifier from configuration dict and event loop."""
-    global _config, _loop
+    """Initialize notifier from configuration dict; *loop* is accepted but no longer used."""
+    global _config
    _config = dict(cfg)
-    if loop is not None:
-        _loop = loop


 def reload_config(cfg: dict):
@@ -299,17 +293,6 @@ async def _send_sms_voipms_async(channel_cfg: dict, notif: Notification) -> bool
        return False


-def _send_sms_voipms(channel_cfg: dict, notif: Notification) -> bool:
-    """Dispatch voip.ms SMS send onto the shared event loop."""
-    if _loop is None:
-        logger.warning("sms_voipms: event loop not available")
-        return False
-    future = asyncio.run_coroutine_threadsafe(_send_sms_voipms_async(channel_cfg, notif), _loop)
-    try:
-        return future.result(timeout=15)
-    except Exception as e:
-        logger.error("sms_voipms send timed out or failed: %s", e)
-        return False


 async def _send_matrix_async(channel_cfg: dict, notif: Notification) -> bool:
@@ -357,40 +340,23 @@ async def _send_matrix_async(channel_cfg: dict, notif: Notification) -> bool:
        await client.close()


-def _send_matrix(channel_cfg: dict, notif: Notification) -> bool:
-    """Dispatch matrix send onto the shared event loop."""
-    if _loop is None:
-        logger.warning("matrix: event loop not available")
-        return False
-    future = asyncio.run_coroutine_threadsafe(_send_matrix_async(channel_cfg, notif), _loop)
-    try:
-        return future.result(timeout=15)
-    except Exception as e:
-        logger.error("matrix send timed out or failed: %s", e)
-        return False
-
-
 # ---------------------------------------------------------------------------
-# Channel dispatcher
+# Channel dispatcher  (all async — sync drivers run in a thread executor)
 # ---------------------------------------------------------------------------

+# Sync drivers kept for `hbd notify` CLI usage (asyncio.run wraps them there).
 _DRIVERS = {
    "pushover": _send_pushover,
    "email": _send_email,
    "mattermost": _send_mattermost,
    "signal": _send_signal,
-    "sms_voipms": _send_sms_voipms,
-    "matrix": _send_matrix,
 }

+_TIMEOUT = 15  # seconds per channel send

-def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
-    """Send *notif* to a single named channel, honouring min_level.

-    RECOVER always bypasses min_level — a recovery is always relevant if the
-    channel was configured for any alerting (handles the restart-then-recover case
-    where _alerted_channels is empty and we fall through to the normal loop).
-    """
+async def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
+    """Send *notif* to a single named channel, honouring min_level (RECOVER always bypasses it)."""
    level = notif.level.upper()
    if level != "RECOVER":
        min_level = channel_cfg.get("min_level", "WARNING").upper()
@@ -398,14 +364,24 @@ def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notificati
            logger.debug(
                "channel '%s': skipping level %s (min_level=%s)", channel_name, level, min_level
            )
-            return True  # not an error — filtered intentionally
+            return True  # filtered intentionally

    ch_type = channel_cfg.get("type", "")
-    driver = _DRIVERS.get(ch_type)
-    if driver is None:
-        logger.warning("unknown channel type '%s' for channel '%s'", ch_type, channel_name)
+    try:
+        if ch_type == "matrix":
+            return await asyncio.wait_for(_send_matrix_async(channel_cfg, notif), timeout=_TIMEOUT)
+        if ch_type == "sms_voipms":
+            return await asyncio.wait_for(_send_sms_voipms_async(channel_cfg, notif), timeout=_TIMEOUT)
+        sync_driver = _DRIVERS.get(ch_type)
+        if sync_driver is None:
+            logger.warning("unknown channel type '%s' for channel '%s'", ch_type, channel_name)
+            return False
+        return await asyncio.wait_for(
+            asyncio.to_thread(sync_driver, channel_cfg, notif), timeout=_TIMEOUT
+        )
+    except asyncio.TimeoutError:
+        logger.error("channel '%s' timed out after %ds", channel_name, _TIMEOUT)
        return False
-    return driver(channel_cfg, notif)


 # ---------------------------------------------------------------------------
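
Reviewer note: the new `_dispatch_to_channel` runs blocking drivers through `asyncio.to_thread` and bounds each send with `asyncio.wait_for`. The sketch below shows that pattern on its own. `slow_sync_driver`, `dispatch`, and the shortened `TIMEOUT` are hypothetical stand-ins, not names from the diff (which uses `_TIMEOUT = 15`).

```python
import asyncio
import time

TIMEOUT = 0.3  # seconds; shortened from the diff's 15 so the demo finishes quickly


def slow_sync_driver(channel_cfg: dict, message: str) -> bool:
    """Hypothetical blocking driver, standing in for something like an SMTP send."""
    time.sleep(channel_cfg.get("delay", 0))
    return True


async def dispatch(channel_cfg: dict, message: str) -> bool:
    """Run the blocking driver in a worker thread, giving up after TIMEOUT."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_sync_driver, channel_cfg, message),
            timeout=TIMEOUT,
        )
    except asyncio.TimeoutError:
        return False


async def main() -> tuple:
    fast = await dispatch({"delay": 0}, "ok")
    slow = await dispatch({"delay": 1.0}, "too slow")
    return fast, slow


result = asyncio.run(main())
print(result)
```

One caveat with this pattern: when `wait_for` times out, the awaiting coroutine stops waiting, but the worker thread keeps running until the blocking call returns. The timeout bounds how long the caller waits, not how long the driver runs.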
@@ -419,7 +395,7 @@ def _build_url(host_name: str) -> str:
    return f"{base_url}/plugins#{host_name}"


-def send_notification(host_name: str, notif: Notification) -> dict:
+async def send_notification(host_name: str, notif: Notification) -> dict:
    """Dispatch *notif* to all managers/owner of *host_name*.

    Looks up the host's owner + managers, resolves each user's
@@ -469,16 +445,12 @@ def send_notification(host_name: str, notif: Notification) -> dict:
            if not channel_cfg:
                continue
            try:
-                ch_type = channel_cfg.get("type", "")
-                driver = _DRIVERS.get(ch_type)
-                if driver:
-                    ok = driver(channel_cfg, notif)
-                    results[channel_name] = ok
-                    if ok:
-                        logger.info("recover sent to channel '%s': %s", channel_name, notif.title)
+                ok = await _dispatch_to_channel(channel_name, channel_cfg, notif)
+                results[channel_name] = ok
+                if ok:
+                    logger.info("recover sent to channel '%s': %s", channel_name, notif.title)
            except Exception as e:
                logger.error("error sending recover to channel '%s': %s", channel_name, e)
-        # Clear the alerted set once recovery is delivered
        del _alerted_channels[host_name]
        return results

@@ -489,14 +461,14 @@ def send_notification(host_name: str, notif: Notification) -> dict:
            continue
        for channel_name in user.notification_channels:
            if channel_name in results:
-                continue  # already dispatched to this channel this notification
+                continue
            channel_cfg = global_channels.get(channel_name)
            if not channel_cfg:
                logger.warning("channel '%s' not defined in notification_channels", channel_name)
                results[channel_name] = False
                continue
            try:
-                ok = _dispatch_to_channel(channel_name, channel_cfg, notif)
+                ok = await _dispatch_to_channel(channel_name, channel_cfg, notif)
                results[channel_name] = ok
                if ok:
                    logger.info("notification sent to channel '%s': %s", channel_name, notif.title)
@@ -24,7 +24,7 @@ sensitive   bool  True when the raw value must never be shown
 # Credential field names that should always be masked.
 _SECRET_KEYS = frozenset({
    "password", "token", "user_key", "api_key", "secret",
-    "smtp_password", "smtp_user",
+    "smtp_password", "smtp_user", "api_password", "access_token",
 })

 _CHANNEL_TYPE_LABELS = {
@@ -88,7 +88,7 @@ def _sanitize_channel(name, cfg):
 # Public API
 # ---------------------------------------------------------------------------

-def get_settings_sections(config: dict) -> list:
+def get_settings_sections(config: dict, threshold_checker=None) -> list:
    """Return ordered list of setting sections for the settings page.

    Each section:
@@ -181,6 +181,41 @@ def get_settings_sections(config: dict) -> list:
            "notification_channels": attrs.get("notification_channels", []),
        })

+    # ---- Threshold configurations -----------------------------------------
+    def _tc_to_row(tc):
+        return {
+            "metric": tc.metric_path,
+            "operator": tc.operator.value,
+            "warning": tc.warning,
+            "critical": tc.critical,
+            "hysteresis": tc.hysteresis,
+            "count": tc.count,
+            "enabled": tc.enabled,
+        }
+
+    threshold_config_list = []
+    if threshold_checker is not None:
+        if threshold_checker.threshold_configs:
+            for cfg_name, cfg_metrics in sorted(threshold_checker.threshold_configs.items()):
+                # For the default config use the merged effective set;
+                # for named overrides use only the explicitly defined metrics
+                # (threshold_raw_configs) so inherited defaults are not repeated.
+                if cfg_name == "default":
+                    display_metrics = cfg_metrics
+                else:
+                    display_metrics = threshold_checker.threshold_raw_configs.get(cfg_name, cfg_metrics)
+                metrics = sorted(
+                    [_tc_to_row(tc) for tc in display_metrics.values()],
+                    key=lambda m: m["metric"],
+                )
+                threshold_config_list.append({"name": cfg_name, "metrics": metrics})
+        elif threshold_checker.thresholds:
+            metrics = sorted(
+                [_tc_to_row(tc) for tc in threshold_checker.thresholds.values()],
+                key=lambda m: m["metric"],
+            )
+            threshold_config_list.append({"name": "default", "metrics": metrics})
+
    # ---- Hosts summary ----------------------------------------------------
    hosts_list = []
    for hname, hcfg in (config.get("hosts") or {}).items():
@@ -188,7 +223,7 @@ def get_settings_sections(config: dict) -> list:
            continue
        hosts_list.append({
            "name": hname,
-            "watch": bool(hcfg.get("watch", False)),
+            "watch": bool(hcfg.get("watch", True)),
            "dyndns": bool(hcfg.get("dyndns", False)),
            "owner": hcfg.get("owner", ""),
            "managers": hcfg.get("managers", []),
@@ -312,6 +347,16 @@ def get_settings_sections(config: dict) -> list:
            "hosts": hosts_list,
            "fields": [],
        },
+        {
+            "id": "thresholds",
+            "title": "Threshold Configurations",
+            "description": "Named alert threshold sets. Each defines warning/critical levels per metric.",
+            "threshold_configs": threshold_config_list,
+            "fields": [
+                field("default_threshold_config", "Default config", "text",
+                      "Threshold config used for hosts with no explicit mapping."),
+            ],
+        },
        {
            "id": "runtime",
            "title": "Runtime",
@@ -0,0 +1,199 @@
+<!DOCTYPE html>
+<html>
+  {% include 'head.html' %}
+
+  <style>
+    html, body { overflow: visible; }
+
+    .container {
+      max-width: 700px;
+      margin: 0 auto;
+    }
+
+    h1 {
+      color: #333;
+      margin-bottom: 4px;
+      font-size: 1.5em;
+    }
+
+    .subtitle {
+      color: #666;
+      margin-bottom: 24px;
+      font-size: 0.9em;
+    }
+
+    .section {
+      background: #fff;
+      border-radius: 8px;
+      box-shadow: 0 1px 6px rgba(0,0,0,0.1);
+      padding: 20px 24px;
+      margin-bottom: 20px;
+    }
+
+    .section h2 {
+      font-size: 1em;
+      font-weight: 700;
+      color: #333;
+      margin: 0 0 16px;
+      padding-bottom: 10px;
+      border-bottom: 1px solid #eee;
+      text-transform: uppercase;
+      letter-spacing: 0.5px;
+    }
+
+    .info-row {
+      display: flex;
+      align-items: baseline;
+      padding: 8px 0;
+      border-bottom: 1px solid #f5f5f5;
+      font-size: 0.9em;
+    }
+    .info-row:last-child { border-bottom: none; }
+
+    .info-label {
+      width: 160px;
+      flex-shrink: 0;
+      color: #666;
+      font-size: 0.88em;
+    }
+
+    .info-value {
+      color: #222;
+      word-break: break-all;
+    }
+
+    .info-value a {
+      color: #0066cc;
+      text-decoration: none;
+    }
+    .info-value a:hover { text-decoration: underline; }
+
+    .version-badge {
+      display: inline-block;
+      padding: 3px 12px;
+      background: #e8f0fe;
+      color: #1a73e8;
+      border-radius: 12px;
+      font-size: 0.85em;
+      font-weight: 600;
+      font-family: monospace;
+    }
+
+    .hb-logo {
+      font-size: 2.5em;
+      font-weight: 700;
+      color: #0066cc;
+      letter-spacing: -1px;
+      margin-bottom: 6px;
+    }
+
+    .hb-tagline {
+      color: #555;
+      font-size: 0.95em;
+    }
+
+    .logo-section {
+      display: flex;
+      align-items: center;
+      gap: 20px;
+      padding: 8px 0 4px;
+    }
+
+    .logo-text { flex: 1; }
+  </style>
+
+  <body>
+    {% include 'nav.html' %}
+
+    <div class="container">
+      <h1>{{ header }}</h1>
+      <p class="subtitle">Heartbeat monitoring system</p>
+
+      <div class="section">
+        <div class="logo-section">
+          <div class="logo-text">
+            <div class="hb-logo">Heartbeat</div>
+            <div class="hb-tagline">Lightweight host monitoring over UDP</div>
+          </div>
+          <span class="version-badge">v{{ hbd_version }}</span>
+        </div>
+      </div>
+
+      <div class="section">
+        <h2>Version</h2>
+        <div class="info-row">
+          <span class="info-label">Server version</span>
+          <span class="info-value">{{ hbd_version }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Python</span>
+          <span class="info-value">{{ python_version }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">License</span>
+          <span class="info-value">MIT</span>
+        </div>
+      </div>
+
+      <div class="section">
+        <h2>Runtime</h2>
+        <div class="info-row">
+          <span class="info-label">Host</span>
+          <span class="info-value">{{ server_hostname }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Started</span>
+          <span class="info-value">{{ start_time_str }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Uptime</span>
+          <span class="info-value" id="uptime-value">{{ uptime_str }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Hosts monitored</span>
+          <span class="info-value">{{ host_count }}</span>
+        </div>
+      </div>
+
+      <div class="section">
+        <h2>Contact &amp; Source</h2>
+        <div class="info-row">
+          <span class="info-label">Author</span>
+          <span class="info-value">Andreas Wrede</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Email</span>
+          <span class="info-value"><a href="mailto:aew@wrede.ca">aew@wrede.ca</a></span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Repository</span>
+          <span class="info-value"><a href="https://git.wrede.ca/andreas/heartbeat" target="_blank" rel="noopener">git.wrede.ca/andreas/heartbeat</a></span>
+        </div>
+      </div>
+
+    </div>
+
+    <script>
+      (function() {
+        var startEpoch = {{ start_epoch }};
+        var el = document.getElementById('uptime-value');
+        if (!el) return;
+        function fmt(s) {
+          var d = Math.floor(s / 86400);
+          var h = Math.floor((s % 86400) / 3600);
+          var m = Math.floor((s % 3600) / 60);
+          var sec = s % 60;
+          if (d > 0) return d + 'd ' + h + 'h ' + m + 'm';
+          if (h > 0) return h + 'h ' + m + 'm ' + sec + 's';
+          return m + 'm ' + sec + 's';
+        }
+        function tick() {
+          var up = Math.floor(Date.now() / 1000 - startEpoch);
+          el.textContent = fmt(up);
+        }
+        tick();
+        setInterval(tick, 1000);
+      })();
+    </script>
+  </body>
+</html>
@@ -4,12 +4,17 @@

  <style>

+    html, body {
+      height: auto;
+      overflow-y: auto;
+    }
+
    .container {
      max-width: 1400px;
      margin: 0 auto;
    }

-    h1 { color: #333; margin-bottom: 10px; font-size: 1.5em; }
+    h1 { color: #333; margin-bottom: 5px; margin-top: 15px; font-size: 1.5em; }

    .subtitle {
      color: #666;
@@ -170,14 +175,18 @@

    .alert-hostname {
      font-weight: bold;
-      color: #333;
+      color: #0066cc;
      font-size: 1.1em;
+      text-decoration: none;
+    }
+    .alert-hostname:hover {
+      text-decoration: underline;
    }

    .alert-metric {
-      color: #666;
-      font-family: 'Courier New', monospace;
-      font-size: 0.9em;
+      color: #0066cc;
+      font-size: 1.1em;
+      font-weight: normal;
    }

    .alert-details {
@@ -400,6 +409,10 @@
        } else if (alert.threshold_value !== undefined && alert.threshold_value !== null && alert.operator) {
          valueText += ` <span class="threshold-info">(threshold: ${alert.operator} ${formatValue(alert.threshold_value)})</span>`;
        }
+        if (alert.recovery_threshold !== undefined && alert.recovery_threshold !== null) {
+          const recOp = (alert.operator === '>' || alert.operator === '>=') ? '<' : '>';
+          valueText += ` <span class="threshold-info" style="color:#888">(recovers ${recOp} ${formatValue(alert.recovery_threshold)})</span>`;
+        }
        
        // Build actions section
        let actionsHtml = '';
@@ -424,9 +437,9 @@
            <div class="alert-main">
              <div class="alert-header">
                <span class="alert-level ${level}">${alert.level}</span>
-                <span class="alert-hostname">${alert.hostname}</span>
+                <a class="alert-hostname" href="/plugins#${encodeURIComponent(alert.hostname)}">${alert.hostname}</a>
+                <span class="alert-metric">${alert.metric_path.includes('.') ? alert.metric_path.slice(alert.metric_path.indexOf('.') + 1) : alert.metric_path}</span>
              </div>
-              <div class="alert-metric">${alert.metric_path}</div>
              <div class="alert-details">
                <span>${valueText}</span>
                <span class="alert-duration">Active for ${duration}</span>
@@ -15,6 +15,7 @@
      body {
        margin: 0;
        padding: 10px;
+        padding-top: 60px;
        background: #f5f5f5;
      }
      h1 { font-size: 1.5em; color: #333; margin: 0 0 5px; }
@@ -23,11 +24,14 @@

      /* Navigation bar — shared across all pages */
      .nav {
+        position: fixed;
+        top: 0;
+        left: 0;
+        right: 0;
+        z-index: 200;
        background: #fff;
        padding: 6px 12px;
-        margin-bottom: 10px;
        box-shadow: 0 2px 4px rgba(0,0,0,.1);
-        border-radius: 4px;
        display: flex;
        align-items: center;
        justify-content: space-between;
@@ -122,11 +126,17 @@
      }

      /* Swiss railway clock — nav */
-      .nav-clock {
+      .nav-pie {
        flex-shrink: 0;
        line-height: 0;
        margin-left: auto;
        padding: 4px 4px 4px 0;
+      }
+      #alert-pie { display: block; cursor: default; }
+      .nav-clock {
+        flex-shrink: 0;
+        line-height: 0;
+        padding: 4px 4px 4px 0;
        cursor: pointer;
      }
      #swiss-clock { display: block; }
@@ -45,6 +45,7 @@
    h1 {
      color: #333;
      margin-bottom: 5px;
+      margin-top: 15px;
      font-size: 1.5em;
    }

@@ -235,6 +236,8 @@
      color: #ff9800;
      font-weight: 700;
    }
+    #ntable a.host-link { color: inherit; text-decoration: none; }
+    #ntable a.host-link:hover { text-decoration: underline; }
  </style>
  <script type="text/javascript">
    var cnt = 0;
@@ -244,11 +247,13 @@
    var HBD_VERSION = "{{ hbd_version }}";

    function hostNameHtml(data) {
+      var rawName = data.raw_name || data.name.replace(/<[^>]+>/g, '').replace('*', '').trim();
      var nameHtml = data.name;
      if (!data.hbc_version || data.hbc_version !== HBD_VERSION) {
        nameHtml += ' 🥀';
      }
-      return data.dyn ? '<b>' + nameHtml + '</b>' : nameHtml;
+      var display = data.dyn ? '<b>' + nameHtml + '</b>' : nameHtml;
+      return '<a class="host-link" href="/plugins#' + encodeURIComponent(rawName) + '">' + display + '</a>';
    }

    function setup() {
@@ -403,7 +408,7 @@
        );
        if (data.connections[i].state == "up") {
          state = '<span class="state-up">up</span>';
-          latency = Number.parseFloat(data.connections[i].rtts[0]).toFixed(2);
+          latency = String(Math.round(Number.parseFloat(data.connections[i].rtts[0])));
        } else {
          if (data.connections[i].state == "unknown") {
            state = "";
@@ -510,7 +515,7 @@
          <tbody id="ntablebody">
            {% for host in hosts %}
            <tr class="{% if host.alert_critical_unacked > 0 or host.alert_critical_acked > 0 %}row-critical{% elif host.alert_warning_unacked > 0 or host.alert_warning_acked > 0 %}row-warning{% endif %}">
-              <td data-name="{{ host.name }}">{{ host.name }}{% if not host.hbc_version or host.hbc_version != hbd_version %} 🥀{% endif %}</td>
+              <td data-name="{{ host.name }}"><a class="host-link" href="/plugins#{{ host.raw_name | urlencode }}">{{ host.name }}{% if not host.hbc_version or host.hbc_version != hbd_version %} 🥀{% endif %}</a></td>
              <td style="text-align: center; color: #ff9800; font-weight: bold;">
                {%- set warning_unacked = host.alert_warning_unacked -%}
                {%- set warning_acked = host.alert_warning_acked -%}
@@ -4,11 +4,15 @@
  </button>
  <div class="nav-links" id="nav-links">
    <a href="/live"{% if active_page == "live" %} class="active"{% endif %}>Live Dashboard</a>
-    <a href="/plugins"{% if active_page == "plugins" %} class="active"{% endif %}>Plugin Metrics</a>
+    <a href="/plugins"{% if active_page == "plugins" %} class="active"{% endif %}>Host Overview</a>
    <a href="/alerts"{% if active_page == "alerts" %} class="active"{% endif %}>Alerts</a>
    {% if current_user and current_user.admin %}
    <a href="/settings"{% if active_page == "settings" %} class="active"{% endif %}>Settings</a>
    {% endif %}
+    <a href="/about"{% if active_page == "about" %} class="active"{% endif %}>About</a>
+  </div>
+  <div class="nav-pie" title="Host alert status">
+    <canvas id="alert-pie" width="44" height="44"></canvas>
  </div>
  <div class="nav-clock" title="Click for full-screen clock">
    <canvas id="swiss-clock" width="44" height="44"></canvas>
@@ -41,4 +45,52 @@
      });
    }
  })();
+
+  function drawAlertPie(critical, warning, ok) {
+    var canvas = document.getElementById('alert-pie');
+    if (!canvas) return;
+    var ctx = canvas.getContext('2d');
+    var SIZE = canvas.width;
+    var R = SIZE / 2;
+    ctx.clearRect(0, 0, SIZE, SIZE);
+    var total = critical + warning + ok;
+    if (total === 0) {
+      ctx.beginPath();
+      ctx.arc(R, R, R - 1, 0, Math.PI * 2);
+      ctx.fillStyle = '#ccc';
+      ctx.fill();
+      return;
+    }
+    var slices = [
+      { value: critical, color: '#e53935' },
+      { value: warning,  color: '#ffb300' },
+      { value: ok,       color: '#43a047' }
+    ];
+    var start = -Math.PI / 2;
+    slices.forEach(function(s) {
+      if (s.value === 0) return;
+      var sweep = (s.value / total) * Math.PI * 2;
+      ctx.beginPath();
+      ctx.moveTo(R, R);
+      ctx.arc(R, R, R - 1, start, start + sweep);
+      ctx.closePath();
+      ctx.fillStyle = s.color;
+      ctx.fill();
+      start += sweep;
+    });
+  }
+
+  function updateAlertPie() {
+    fetch('/api/0/alert_summary').then(function(r) {
+      if (!r.ok) return;
+      return r.json();
+    }).then(function(d) {
+      if (d) drawAlertPie(d.critical || 0, d.warning || 0, d.ok || 0);
+    }).catch(function() {});
+  }
+
+  document.addEventListener('DOMContentLoaded', function() {
+    updateAlertPie();
+    setInterval(updateAlertPie, 30000);
+  });
 </script>
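
Reviewer note: the slice geometry in `drawAlertPie` can be verified without a canvas. The sketch below reproduces the angle arithmetic in Python; `pie_slices` is a hypothetical helper written for this note, and the colours are copied from the diff.

```python
import math


def pie_slices(critical: int, warning: int, ok: int) -> list:
    """Return (color, start, end) angles in radians, mirroring drawAlertPie."""
    total = critical + warning + ok
    if total == 0:
        # No hosts reported: a single grey full circle.
        return [("#ccc", 0.0, 2 * math.pi)]
    slices = []
    start = -math.pi / 2  # begin at 12 o'clock, as the canvas code does
    for value, color in (
        (critical, "#e53935"),
        (warning, "#ffb300"),
        (ok, "#43a047"),
    ):
        if value == 0:
            continue  # skip empty categories so no zero-width arc is drawn
        sweep = value / total * 2 * math.pi
        slices.append((color, start, start + sweep))
        start += sweep
    return slices


for color, begin, end in pie_slices(1, 1, 2):
    print(color, round(begin, 3), round(end, 3))
```

The slices are laid end to end, so the last slice ends one full turn after the first begins.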
@@ -9,7 +9,7 @@
      max-width: 960px;
    }

-    h1 { color: #333; margin-bottom: 4px; font-size: 1.5em; }
+    h1 { color: #333; margin-bottom: 5px; margin-top: 15px; font-size: 1.5em; }
    .subtitle { color: #666; margin-bottom: 24px; font-size: 0.9em; }

    /* ---- Sidebar + content layout ---- */
@@ -23,7 +23,7 @@
      width: 180px;
      flex-shrink: 0;
      position: sticky;
-      top: 20px;
+      top: 60px;
    }

    .sidebar-nav a {
@@ -254,6 +254,17 @@
    .host-bool { text-align: center; }
    .dot-yes { color: #2e7d32; font-size: 1.1em; }
    .dot-no  { color: #ddd;    font-size: 1.1em; }
+
+    /* ---- Threshold configurations ---- */
+    .thresh-config { margin: 12px 20px 20px; }
+    .thresh-config-name {
+      font-weight: 600; font-size: 0.9em; color: #1a237e;
+      margin-bottom: 6px;
+    }
+    .mini-table .warn  { color: #e65100; font-weight: 600; }
+    .mini-table .crit  { color: #b71c1c; font-weight: 600; }
+    .mini-table .dim   { color: #aaa; }
+    .mini-table .metric-path { font-family: monospace; font-size: 0.88em; }
  </style>

  <body>
@@ -394,6 +405,49 @@
            {% endif %}
            {% endif %}

+            {# ---- Threshold configurations section ---- #}
+            {% if section.id == "thresholds" %}
+            {% if section.threshold_configs %}
+            {% for tc in section.threshold_configs %}
+            <div class="thresh-config">
+              <div class="thresh-config-name">{{ tc.name }}</div>
+              {% if tc.metrics %}
+              <div style="overflow-x: auto;">
+                <table class="mini-table">
+                  <thead>
+                    <tr>
+                      <th>Metric</th>
+                      <th>Op</th>
+                      <th>Warning</th>
+                      <th>Critical</th>
+                      <th>Hysteresis</th>
+                      <th>Count</th>
+                    </tr>
+                  </thead>
+                  <tbody>
+                    {% for m in tc.metrics %}
+                    <tr{% if not m.enabled %} style="opacity:0.45"{% endif %}>
+                      <td class="metric-path">{{ m.metric }}</td>
+                      <td>{{ m.operator or '>' }}</td>
+                      <td class="warn">{{ m.warning if m.warning is not none else '—' }}</td>
+                      <td class="crit">{{ m.critical if m.critical is not none else '—' }}</td>
+                      <td class="dim">{{ '%.0f%%' % (m.hysteresis * 100) if m.hysteresis else '—' }}</td>
+                      <td class="dim">{{ m.count }}</td>
+                    </tr>
+                    {% endfor %}
+                  </tbody>
+                </table>
+              </div>
+              {% else %}
+              <span class="val-empty">No thresholds defined.</span>
+              {% endif %}
+            </div>
+            {% endfor %}
+            {% else %}
+            <div class="field-row"><span class="val-empty">No threshold configurations defined.</span></div>
+            {% endif %}
+            {% endif %}
+
            {# ---- Hosts section ---- #}
            {% if section.id == "hosts" %}
            {% if section.hosts %}
@@ -9,10 +9,11 @@ This module provides a flexible threshold checking system that:
 - Supports multiple comparison operators
 """

+import asyncio
 import logging
 import time
 from enum import Enum
-from typing import Dict, Any, Optional, Tuple, Callable
+from typing import Dict, List, Any, Optional, Tuple, Callable
 from . import notify as notify_mod
 from .config import THRESHOLD_DEFAULTS

@@ -29,12 +30,13 @@ class AlertLevel(Enum):

 class ComparisonOperator(Enum):
    """Supported comparison operators for threshold checks."""
-    GT = ">"      # Greater than
-    GTE = ">="    # Greater than or equal
-    LT = "<"      # Less than
-    LTE = "<="    # Less than or equal
-    EQ = "=="     # Equal to
-    NEQ = "!="    # Not equal to
+    GT = ">"        # Greater than
+    GTE = ">="      # Greater than or equal
+    LT = "<"        # Less than
+    LTE = "<="      # Less than or equal
+    EQ = "=="       # Equal to
+    NEQ = "!="      # Not equal to
+    NAGIOS = "nagios"  # Nagios exit-code semantics: 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN


 class AlertState:
@@ -56,6 +58,7 @@ class AlertState:
        self.last_notification = None
        self.threshold_value = None  # The threshold value that triggered alert
        self.operator = None  # The comparison operator (>, <, >=, etc.)
+        self.hysteresis: Optional[float] = None  # Hysteresis fraction used for recovery
        self.formatted_message = None  # Formatted display message for UI
        self.acknowledged = False  # Whether alert has been acknowledged
        self.acknowledged_at = None  # Timestamp when acknowledged
@@ -150,7 +153,16 @@ class AlertState:
            result["operator"] = self.operator
        if self.formatted_message is not None:
            result["formatted_message"] = self.formatted_message
-        
+
+        # Compute and expose the recovery threshold so the UI can display it
+        if (self.hysteresis and self.threshold_value is not None
+                and self.operator is not None):
+            ha = abs(self.threshold_value * self.hysteresis)
+            if self.operator in ('>', '>='):
+                result["recovery_threshold"] = round(self.threshold_value - ha, 4)
+            elif self.operator in ('<', '<='):
+                result["recovery_threshold"] = round(self.threshold_value + ha, 4)
+
        return result
    
    def __setstate__(self, state):
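Review note: the recovery-threshold arithmetic added to `AlertState` above can be checked in isolation. The following is a minimal sketch, not the module's code; `recovery_threshold` is a hypothetical standalone helper, and it assumes the hysteresis is a fraction of the threshold value, as the `abs(threshold_value * hysteresis)` expression in the hunk suggests.

```python
from typing import Optional


def recovery_threshold(
    threshold_value: Optional[float],
    hysteresis: Optional[float],
    operator: Optional[str],
) -> Optional[float]:
    """Hypothetical helper mirroring the arithmetic in the hunk above."""
    if not hysteresis or threshold_value is None or operator is None:
        return None
    band = abs(threshold_value * hysteresis)  # width of the hysteresis band
    if operator in (">", ">="):
        # Alert fired on a high value: recover once the value drops below this.
        return round(threshold_value - band, 4)
    if operator in ("<", "<="):
        # Alert fired on a low value: recover once the value rises above this.
        return round(threshold_value + band, 4)
    return None  # "==", "!=" and "nagios" expose no recovery threshold


print(recovery_threshold(90.0, 0.02, ">"))
print(recovery_threshold(10.0, 0.1, "<"))
```

With the new 2% default, a `>` threshold of 90 would recover a little above 88, whereas the old 10% default would have held the alert until the value fell to 81.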
@@ -158,6 +170,8 @@ class AlertState:
        self.__dict__.update(state)
        if not hasattr(self, 'consecutive_count'):
            self.consecutive_count = 0
+        if not hasattr(self, 'hysteresis'):
+            self.hysteresis = None

    def acknowledge(self):
        """Acknowledge this alert to stop reminder notifications."""
@@ -216,33 +230,43 @@ class ThresholdConfig:
    def evaluate(self, value: float) -> AlertLevel:
        """
        Evaluate a value against this threshold.
-        
+
        Args:
            value: Metric value to check
-            
+
        Returns:
            AlertLevel indicating the severity
        """
        if not self.enabled:
            return AlertLevel.OK
-        
+
+        # Nagios exit-code semantics: value IS the severity
+        if self.operator == ComparisonOperator.NAGIOS:
+            try:
+                code = int(value)
+            except (TypeError, ValueError):
+                return AlertLevel.UNKNOWN
+            return {0: AlertLevel.OK, 1: AlertLevel.WARNING, 2: AlertLevel.CRITICAL}.get(
+                code, AlertLevel.UNKNOWN
+            )
+
        try:
            # Convert value to float for comparison
            value = float(value)
        except (TypeError, ValueError):
            logger.warning("Cannot convert value %s to float for %s", value, self.metric_path)
            return AlertLevel.UNKNOWN
-        
+
        # Check critical threshold first
        if self.critical is not None:
            if self._compare(value, self.critical):
                return AlertLevel.CRITICAL
-        
+
        # Then check warning threshold
        if self.warning is not None:
            if self._compare(value, self.warning):
                return AlertLevel.WARNING
-        
+
        return AlertLevel.OK
    
    def evaluate_with_hysteresis(
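Review note: the `nagios` operator treats the metric value as the severity itself. A minimal sketch of that mapping follows; `Level` is a stand-in for the module's `AlertLevel`, and its member values are an assumption, since the real enum body is not visible in this diff.

```python
from enum import Enum


class Level(Enum):
    """Stand-in for AlertLevel; the member values are assumed."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def nagios_level(value) -> Level:
    """Map a Nagios plugin exit code to a severity level."""
    try:
        code = int(value)
    except (TypeError, ValueError):
        # Non-numeric input cannot be an exit code.
        return Level.UNKNOWN
    # Any code other than 0, 1 or 2 (including 3) is reported as UNKNOWN.
    return {0: Level.OK, 1: Level.WARNING, 2: Level.CRITICAL}.get(code, Level.UNKNOWN)


for sample in (0, "2", 3, None):
    print(sample, nagios_level(sample).name)
```

Because the conversion uses `int()`, a float such as `1.9` truncates to `1`; plugins are expected to report whole exit codes, so this should not matter in practice.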
@@ -261,7 +285,11 @@ class ThresholdConfig:
            New alert level considering hysteresis
        """
        new_level = self.evaluate(value)
-        
+
+        # Nagios exit codes are discrete integers — hysteresis doesn't apply
+        if self.operator == ComparisonOperator.NAGIOS:
+            return new_level
+
        # If no hysteresis, return new level
        if self.hysteresis == 0.0:
            return new_level
@@ -328,15 +356,18 @@ class ThresholdChecker:
            renotify_interval: Seconds between repeat notifications (default: 1 hour)
            journal: Optional MessageJournal instance for logging threshold events
        """
-        # Named threshold configurations: {config_name: {metric_path: ThresholdConfig}}
+        # Named threshold configurations (pre-merged: defaults + overrides): {config_name: {metric_path: ThresholdConfig}}
        self.threshold_configs = {}
-        
+
+        # Raw overrides only for each named config (no defaults baked in): {config_name: {metric_path: ThresholdConfig}}
+        self.threshold_raw_configs: Dict[str, Dict[str, ThresholdConfig]] = {}
+
        # Single threshold set for backward compatibility: {metric_path: ThresholdConfig}
        self.thresholds = {}
-        
-        # Host to config name mapping: {host_name: config_name}
-        self.host_config_mapping = {}
-        
+
+        # Host to ordered list of config names: {host_name: [config_name, ...]}
+        self.host_config_mapping: Dict[str, List[str]] = {}
+
        # Default config name to use when no mapping exists
        self.default_config = "default"
        
@@ -372,6 +403,7 @@ class ThresholdChecker:
        
        # Clear old configuration
        self.threshold_configs.clear()
+        self.threshold_raw_configs.clear()
        self.thresholds.clear()
        self.host_config_mapping.clear()
        self.grace_seconds = float(config.get("grace", 2))
@@ -387,14 +419,28 @@ class ThresholdChecker:
    
    def _parse_config(self, config: Dict[str, Any]):
        """Parse threshold configuration from YAML structure.
-        
+
        Supports two formats:
        1. Legacy format with direct 'thresholds' section
        2. New format with 'threshold_configs' and 'host_threshold_mapping'
+
+        In all cases, THRESHOLD_DEFAULTS are seeded into threshold_configs["default"]
+        so the Settings page always shows the built-in defaults.
+        _parse_multi_config() overwrites this with the fully-merged effective defaults.
        """
+        # Always expose built-in defaults through threshold_configs["default"] so
+        # the Settings page has something to display even in legacy/no-config mode.
+        seed: Dict[str, ThresholdConfig] = {}
+        for plugin_name, plugin_thresholds in THRESHOLD_DEFAULTS.get("thresholds", {}).items():
+            if isinstance(plugin_thresholds, dict):
+                self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=seed)
+        if seed:
+            self.threshold_configs["default"] = seed
+            self.threshold_raw_configs["default"] = {}
+
        # Check for new multi-config format
        if "threshold_configs" in config:
-            self._parse_multi_config(config)
+            self._parse_multi_config(config)  # overwrites threshold_configs["default"]
        elif "thresholds" in config:
            # Legacy single threshold configuration
            self._parse_legacy_config(config)
@@ -424,9 +470,10 @@ class ThresholdChecker:
                        self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=effective_defaults)

        self.threshold_configs["default"] = dict(effective_defaults)
+        self.threshold_raw_configs["default"] = {}
        logger.info("Registered 'default' threshold config with %d metrics", len(effective_defaults))

-        # Parse each named configuration, seeding it with effective_defaults first
+        # Parse each named configuration
        for config_name, config_data in threshold_configs.items():
            if config_name == "default":
                continue  # already handled above
@@ -440,33 +487,41 @@ class ThresholdChecker:
                continue

            logger.info("Parsing threshold configuration: %s", config_name)
-            self.threshold_configs[config_name] = dict(effective_defaults)

+            # Raw overrides only (used for multi-config layering)
+            raw_overrides: Dict[str, ThresholdConfig] = {}
            thresholds_config = config_data["thresholds"]
            for plugin_name, plugin_thresholds in thresholds_config.items():
-                if not isinstance(plugin_thresholds, dict):
-                    continue
+                if isinstance(plugin_thresholds, dict):
+                    self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=raw_overrides)
+            self.threshold_raw_configs[config_name] = raw_overrides

-                self._parse_plugin_thresholds(
-                    plugin_name,
-                    plugin_thresholds,
-                    target_dict=self.threshold_configs[config_name]
-                )
-        
-        # Parse host to config mapping from two possible sources
-        # 1. New format: hosts section with threshold_config attribute
+            # Pre-merged version (defaults + overrides) for single-config fast path
+            self.threshold_configs[config_name] = dict(effective_defaults)
+            self.threshold_configs[config_name].update(raw_overrides)
+
+        # Parse host → config list mapping from two possible sources
+
+        def _normalise(value) -> List[str]:
+            """Accept a string or list; always return a list."""
+            if isinstance(value, list):
+                return [str(v) for v in value]
+            return [str(value)]
+
+        # 1. hosts section with threshold_config attribute (string or list)
        if "hosts" in config:
            hosts_config = config["hosts"]
            if isinstance(hosts_config, dict):
                for host_name, host_attrs in hosts_config.items():
                    if isinstance(host_attrs, dict) and "threshold_config" in host_attrs:
-                        self.host_config_mapping[host_name] = host_attrs["threshold_config"]
-        
-        # 2. Legacy format: host_threshold_mapping section (for backward compatibility)
+                        self.host_config_mapping[host_name] = _normalise(host_attrs["threshold_config"])
+
+        # 2. Legacy host_threshold_mapping section (string values only)
        if "host_threshold_mapping" in config:
            legacy_mapping = config.get("host_threshold_mapping", {})
            if isinstance(legacy_mapping, dict):
-                self.host_config_mapping.update(legacy_mapping)
+                for host_name, value in legacy_mapping.items():
+                    self.host_config_mapping[host_name] = _normalise(value)
        
        # Set default config (first one alphabetically or explicitly set)
        self.default_config = config.get("default_threshold_config", "default")
@@ -531,11 +586,14 @@ class ThresholdChecker:
            warning = threshold_config.get("warning")
            critical = threshold_config.get("critical")
            operator = threshold_config.get("operator", ">")
-            display = threshold_config.get("display", "(threshold: {op_symbol} {threshold_value})")
-            hysteresis = threshold_config.get("hysteresis", 0.1)  # 10% default
+            # Nagios operator maps exit codes directly; no numeric thresholds needed
+            is_nagios_op = (operator == "nagios")
+            default_display = "{check_name}: {output}" if is_nagios_op else "(threshold: {op_symbol} {threshold_value})"
+            display = threshold_config.get("display", default_display)
+            hysteresis = threshold_config.get("hysteresis", 0.0 if is_nagios_op else 0.02)
            enabled = threshold_config.get("enabled", True)
-            
-            if warning is None and critical is None:
+
+            if warning is None and critical is None and not is_nagios_op:
                logger.warning("No thresholds defined for %s, skipping", metric_path)
                continue
            
@@ -635,7 +693,7 @@ class ThresholdChecker:
        warning = rtt_thresholds.get("warning")
        critical = rtt_thresholds.get("critical")
        operator = rtt_thresholds.get("operator", ">")
-        hysteresis = rtt_thresholds.get("hysteresis", 0.1)  # 10% default
+        hysteresis = rtt_thresholds.get("hysteresis", 0.02)  # 2% default
        enabled = rtt_thresholds.get("enabled", True)
        display = rtt_thresholds.get("display")
        count = rtt_thresholds.get("count", 1)
@@ -664,35 +722,55 @@ class ThresholdChecker:
        )
    
    def get_thresholds_for_host(self, host_name: str) -> Dict[str, ThresholdConfig]:
-        """Get the appropriate threshold configuration for a host.
-        
+        """Get the effective threshold configuration for a host.
+
+        When threshold_config is a list, configs are applied left-to-right on top
+        of the default thresholds so earlier entries can be overridden by later ones.
+
        Args:
            host_name: Name of the host
-            
+
        Returns:
            Dictionary of thresholds for this host
        """
        # Legacy mode: single threshold set for all hosts
        if self.thresholds and not self.threshold_configs:
            return self.thresholds
-        
-        # Multi-config mode: look up host-specific configuration
-        if self.threshold_configs:
-            config_name = self.host_config_mapping.get(host_name, self.default_config)
-            
-            if config_name in self.threshold_configs:
-                return self.threshold_configs[config_name]
-            else:
+
+        if not self.threshold_configs:
+            return {}
+
+        config_names = self.host_config_mapping.get(host_name)
+
+        # No host-specific mapping → return pre-merged default
+        if not config_names:
+            return self.threshold_configs.get(self.default_config, {})
+
+        # Single config → fast path using pre-merged copy
+        if len(config_names) == 1:
+            name = config_names[0]
+            if name in self.threshold_configs:
+                return self.threshold_configs[name]
+            logger.warning(
+                "Threshold config '%s' not found for host '%s', using default '%s'",
+                name, host_name, self.default_config,
+            )
+            return self.threshold_configs.get(self.default_config, {})
+
+        # Multiple configs → start from defaults, layer raw overrides in order
+        result = dict(self.threshold_configs.get(self.default_config, {}))
+        for name in config_names:
+            if name == self.default_config:
+                continue  # defaults already the base
+            raw = self.threshold_raw_configs.get(name)
+            if raw is None:
                logger.warning(
-                    "Threshold config '%s' not found for host '%s', using default '%s'",
-                    config_name,
-                    host_name,
-                    self.default_config
+                    "Threshold config '%s' not found for host '%s', skipping",
+                    name, host_name,
                )
-                return self.threshold_configs.get(self.default_config, {})
-        
-        # No thresholds configured
-        return {}
+            else:
+                result.update(raw)
+        return result
    
    def check_value(
        self,
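Review note: the multi-config path layers raw overrides on top of the defaults in list order, so a later config wins over an earlier one. A minimal sketch with plain dicts instead of `ThresholdConfig` objects; the metric name `cpu_monitor.cpu_percent` and the config names `db` and `quiet` are made up for illustration.

```python
from typing import Dict, List, Union


def normalise(value: Union[str, List[str]]) -> List[str]:
    """Accept a string or a list of config names; always return a list."""
    if isinstance(value, list):
        return [str(v) for v in value]
    return [str(value)]


def effective_thresholds(
    defaults: Dict[str, dict],
    raw_configs: Dict[str, Dict[str, dict]],
    config_names: List[str],
) -> Dict[str, dict]:
    """Start from the defaults, then apply each config's raw overrides in order."""
    result = dict(defaults)
    for name in config_names:
        raw = raw_configs.get(name)
        if raw is None:
            continue  # unknown name: the code above logs a warning and skips it
        result.update(raw)
    return result


# Hypothetical data, for illustration only.
defaults = {"cpu_monitor.cpu_percent": {"warning": 80, "critical": 95}}
raw_configs = {
    "db": {"cpu_monitor.cpu_percent": {"warning": 60, "critical": 85}},
    "quiet": {"cpu_monitor.cpu_percent": {"warning": 90, "critical": 99}},
}
merged = effective_thresholds(defaults, raw_configs, normalise(["db", "quiet"]))
print(merged)
```

Keeping the raw overrides separate from the pre-merged copies is what makes this work: layering two pre-merged configs would let the second config's copy of the defaults clobber the first config's overrides.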
@@ -760,6 +838,12 @@ class ThresholdChecker:
        elif new_level == AlertLevel.WARNING and threshold.warning is not None:
            threshold_value = threshold.warning

+        # Keep hysteresis on the state so the UI can show the recovery threshold
+        if new_level != AlertLevel.OK:
+            alert_state.hysteresis = threshold.hysteresis
+        else:
+            alert_state.hysteresis = None
+
        # Update state and check for changes
        old_level = alert_state.level
        if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
@@ -769,6 +853,36 @@ class ThresholdChecker:
            self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)

        return None
+    def _find_threshold(
+        self, thresholds: Dict[str, "ThresholdConfig"], metric_path: str
+    ) -> Tuple[Optional["ThresholdConfig"], Optional[str]]:
+        """Return (threshold, check_name) for *metric_path*, falling back to suffix matches.
+
+        Allows generic thresholds like ``nagios_runner.status_code`` to match
+        fully-qualified paths like ``nagios_runner.check_disk_root_status_code``.
+        The exact match is always tried first; then successive leading
+        underscore-delimited segments are stripped from the field name until
+        a match is found or no segments remain.
+
+        Returns:
+            (ThresholdConfig, None) for an exact match.
+            (ThresholdConfig, "check_disk_root") for a suffix match — the second
+            element is the stripped prefix, available as ``{check_name}`` in
+            display format templates.
+            (None, None) when no threshold is found.
+        """
+        if metric_path in thresholds:
+            return thresholds[metric_path], None
+        plugin, sep, field = metric_path.partition(".")
+        if not sep:
+            return None, None
+        parts = field.split("_")
+        for i in range(1, len(parts)):
+            candidate = plugin + "." + "_".join(parts[i:])
+            if candidate in thresholds:
+                return thresholds[candidate], "_".join(parts[:i])
+        return None, None
+
    def check_plugin_data(
        self,
        host_name: str,
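Review note: the suffix fallback in `_find_threshold` is easiest to follow with a concrete path. The sketch below reproduces the lookup as a standalone function (`find_threshold` is a hypothetical free-function copy, with strings standing in for `ThresholdConfig` objects) and runs it on the example from the docstring.

```python
from typing import Dict, Optional, Tuple


def find_threshold(
    thresholds: Dict[str, str], metric_path: str
) -> Tuple[Optional[str], Optional[str]]:
    """Exact match first, then strip leading '_'-delimited segments of the field."""
    if metric_path in thresholds:
        return thresholds[metric_path], None
    plugin, sep, field = metric_path.partition(".")
    if not sep:
        return None, None  # no plugin prefix, nothing to strip
    parts = field.split("_")
    for i in range(1, len(parts)):
        candidate = plugin + "." + "_".join(parts[i:])
        if candidate in thresholds:
            # The stripped prefix becomes {check_name} in display templates.
            return thresholds[candidate], "_".join(parts[:i])
    return None, None


thresholds = {"nagios_runner.status_code": "generic-threshold"}
print(find_threshold(thresholds, "nagios_runner.check_disk_root_status_code"))
# -> ('generic-threshold', 'check_disk_root')
```

The candidates tried here are `disk_root_status_code`, `root_status_code`, then `status_code`, so the longest matching suffix wins. Note that a generic threshold named after a common suffix (for example a bare `code`) would match many unrelated fields of the same plugin.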
@@ -796,38 +910,39 @@ class ThresholdChecker:
        # Check flat metrics
        for metric_name, value in data.items():
            metric_path = f"{plugin_name}.{metric_name}"
-            
-            if metric_path not in thresholds:
+
+            threshold, check_name = self._find_threshold(thresholds, metric_path)
+            if threshold is None:
                continue
-            
-            threshold = thresholds[metric_path]
-            
+
            # Get or create alert state
            if metric_path not in alert_states:
                alert_states[metric_path] = AlertState(metric_path)
-            
+
            alert_state = alert_states[metric_path]
-            
+
            # Evaluate threshold with hysteresis
            new_level = threshold.evaluate_with_hysteresis(
                value,
                alert_state.level
            )
-            
+
            # Determine which threshold was exceeded
            threshold_value = None
            if new_level == AlertLevel.CRITICAL and threshold.critical is not None:
                threshold_value = threshold.critical
            elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                threshold_value = threshold.warning
-            
+
+            alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
            # Update state and check for changes
            old_level = alert_state.level
            if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                state_changes.append((metric_path, old_level, new_level, value))
-                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
+                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data, check_name=check_name, metric_name=metric_name)
            elif new_level != AlertLevel.OK:
-                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
+                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data, check_name=check_name, metric_name=metric_name)

        # Check nested metrics (e.g., partition data in disk_monitor)
        self._check_nested_metrics(
@@ -886,7 +1001,9 @@ class ThresholdChecker:
                        threshold_value = threshold.critical
                    elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                        threshold_value = threshold.warning
-                    
+
+                    alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
                    old_level = alert_state.level
                    if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                        state_changes.append((metric_path, old_level, new_level, value))
@@ -903,6 +1020,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ):
        """Trigger a notification for an alert state change.
        
@@ -924,56 +1043,54 @@ class ThresholdChecker:
        
        # Format operator symbol
        op_symbol = threshold.operator.value
-        
+
+        # Short metric label: strip the plugin-name prefix for readability
+        short_path = metric_path.partition(".")[2] or metric_path
+
        # Use a display-friendly value (inf is the sentinel for "overdue")
        import math
        display_value = "overdue" if isinstance(value, float) and math.isinf(value) else value

-        # Format message
-        if new_level == AlertLevel.OK:
-            lvl = "RECOVER"
-            message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
-        elif new_level == AlertLevel.WARNING:
-            lvl = "WARNING"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
-            else:
-                message = f"{metric_path} = {display_value}"
-        elif new_level == AlertLevel.CRITICAL:
-            lvl = "CRITICAL"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
-            else:
-                message = f"{metric_path} = {display_value}"
-        else:
-            lvl = "UNKNOWN"
-            message = f"{metric_path} = {display_value}"
-        
-        # Return the formatted threshold info for storing in AlertState
-        formatted_threshold_msg = None
-        if threshold_value is not None and new_level != AlertLevel.OK:
-            formatted_threshold_msg = self._format_display(
+        # Format message — for the nagios operator there is no numeric threshold_value;
+        # render the display template whenever one is available.
+        has_display = threshold_value is not None or threshold.operator == ComparisonOperator.NAGIOS
+
+        def _fmt():
+            return self._format_display(
                threshold.display,
                value=display_value,
                threshold_value=threshold_value,
                op_symbol=op_symbol,
-                plugin_data=plugin_data
+                plugin_data=plugin_data,
+                check_name=check_name,
+                metric_name=metric_name,
            )
-        
+
+        if new_level == AlertLevel.OK:
+            lvl = "RECOVER"
+            message = f"{short_path} = {display_value} ({old_level.name} -> OK)"
+        elif new_level == AlertLevel.WARNING:
+            lvl = "WARNING"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+        elif new_level == AlertLevel.CRITICAL:
+            lvl = "CRITICAL"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+        else:
+            lvl = "UNKNOWN"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+
+        # Formatted threshold info stored on AlertState for the UI
+        formatted_threshold_msg = _fmt() if has_display and new_level != AlertLevel.OK else None
+
        return lvl, message, formatted_threshold_msg
    
    def _send_notification(
@@ -987,23 +1104,28 @@ class ThresholdChecker:
        value: Any,
    ):
        """Send notification and log to journal/eventlog."""
-        try:
-            notify_mod.send_notification(
-                host_name,
-                notify_mod.Notification(
-                    title=f"[{lvl}] {host_name}",
-                    body=message,
-                    level=lvl,
-                ),
-            )
-            logger.info("Notification sent: %s", message)
-        except Exception as e:
-            logger.error("Failed to send notification: %s", e)
+        from . import hbdclass
+        host = hbdclass.Host.hosts.get(host_name)
+        if host is not None and not host.watched:
+            eventlog(host_name, lvl, message, service="threshold")
+            return
+        short_path = metric_path.partition(".")[2] or metric_path
+        title = f"[{lvl}] {host_name}  {short_path}"
+        # Strip the "metric = " prefix from message so body is just the value/detail
+        prefix = short_path + " = "
+        body = message[len(prefix):] if message.startswith(prefix) else message
+        asyncio.get_event_loop().create_task(notify_mod.send_notification(
+            host_name,
+            notify_mod.Notification(
+                title=title,
+                body=body,
+                level=lvl,
+            ),
+        ))
        
        # Log to journal
        if self.journal is not None:
            try:
-                import asyncio
                loop = asyncio.get_event_loop()
                loop.create_task(self.journal.log_threshold_event(
                    host_name=host_name,
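Review note: the new title/body split in `_send_notification` can be sketched on its own. `split_notification` is a hypothetical helper, and the host name, metric, and message below are invented sample values.

```python
from typing import Tuple


def split_notification(lvl: str, host_name: str, metric_path: str, message: str) -> Tuple[str, str]:
    """Build the notification title and body the way the hunk above does."""
    # Drop the plugin-name prefix; fall back to the full path if there is no dot.
    short_path = metric_path.partition(".")[2] or metric_path
    title = f"[{lvl}] {host_name}  {short_path}"
    # The message starts with "<metric> = "; the title already names the metric.
    prefix = short_path + " = "
    body = message[len(prefix):] if message.startswith(prefix) else message
    return title, body


title, body = split_notification(
    "CRITICAL", "web1", "cpu_monitor.cpu_percent",
    "cpu_percent = 97.5 (threshold: > 95)",
)
print(title)
print(body)
```

The split relies on `_trigger_notification` building its message from the same `short_path`; if the two ever diverge, the `startswith` guard leaves the message unchanged rather than truncating it.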
@@ -1021,32 +1143,61 @@ class ThresholdChecker:
        self,
        display_format: str,
        value: Any,
-        threshold_value: float,
+        threshold_value: Optional[float],
        op_symbol: str,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> str:
        """Format the display string using available data.
-        
-        Args:
-            display_format: Format string from threshold config
-            value: Current metric value
-            threshold_value: Threshold value that was exceeded
-            op_symbol: Comparison operator symbol
-            plugin_data: Optional dictionary of plugin data fields
-            
+
+        Available template variables:
+            {value}           - current metric value
+            {threshold_value} - threshold that was exceeded
+            {op_symbol}       - comparison operator (>, <, >=, <=, ==, !=, nagios)
+            {check_name}      - prefix stripped for generic threshold match
+                                (e.g. "check_disk_root" when metric
+                                "check_disk_root_status_code" matched generic
+                                threshold "status_code")
+            {metric_name}     - field name within the plugin data dict
+            Any key from plugin_data is also available.
+
        Returns:
            Formatted display string
        """
+        if not display_format:
+            display_format = "(threshold: {op_symbol} {threshold_value})" if threshold_value is not None else ""
+
        # Build format context with standard variables
        format_context = {
            'value': value,
-            'threshold_value': threshold_value,
            'op_symbol': op_symbol,
        }
-        
+        if threshold_value is not None:
+            format_context['threshold_value'] = threshold_value
+
+        # Add generic-match context variables when available
+        if check_name is not None:
+            format_context['check_name'] = check_name
+        if metric_name is not None:
+            format_context['metric_name'] = metric_name
+
        # Add all plugin data fields if available
        if plugin_data:
            format_context.update(plugin_data)
+
+        # For nagios_runner generic matches, expose the matched check's output
+        # and status as short aliases {output} and {status} so display templates
+        # don't need to use the full {check_disk_root_output} form.
+        if check_name and plugin_data:
+            if 'output' not in format_context:
+                output = plugin_data.get(f"{check_name}_output")
+                if output is not None:
+                    format_context['output'] = output
+            if 'status' not in format_context:
+                status = plugin_data.get(f"{check_name}_status")
+                if status is not None:
+                    format_context['status'] = status
        
        try:
            # Format the display string
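Review note: the `{output}` and `{status}` aliases are what let the default nagios display template, `{check_name}: {output}`, work for every check. A minimal sketch of the context building follows; `build_context` is a hypothetical standalone version, the sample plugin output is invented, and the error handling for missing template keys is not reproduced because it is outside this hunk.

```python
from typing import Any, Dict, Optional


def build_context(
    value: Any,
    op_symbol: str,
    threshold_value: Optional[float] = None,
    plugin_data: Optional[Dict[str, Any]] = None,
    check_name: Optional[str] = None,
    metric_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect the variables a display template may reference."""
    ctx: Dict[str, Any] = {"value": value, "op_symbol": op_symbol}
    if threshold_value is not None:
        ctx["threshold_value"] = threshold_value
    if check_name is not None:
        ctx["check_name"] = check_name
    if metric_name is not None:
        ctx["metric_name"] = metric_name
    if plugin_data:
        ctx.update(plugin_data)
    if check_name and plugin_data:
        # Short aliases for the matched check's own fields.
        for alias in ("output", "status"):
            if alias not in ctx:
                found = plugin_data.get(f"{check_name}_{alias}")
                if found is not None:
                    ctx[alias] = found
    return ctx


# Hypothetical plugin payload, for illustration only.
data = {"check_disk_root_status_code": 2, "check_disk_root_output": "DISK CRITICAL - / 96% used"}
ctx = build_context(2, "nagios", plugin_data=data, check_name="check_disk_root",
                    metric_name="check_disk_root_status_code")
print("{check_name}: {output}".format(**ctx))
```

Since `plugin_data` is merged after the standard variables, a plugin field literally named `value` or `output` would take precedence over the built-in variable or the alias.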
@@ -1077,17 +1228,22 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> None:
        """Handle a state-change transition with grace-period logic.

-        Transitioning INTO alert: defers the notification for grace_seconds.
+        Transitioning INTO alert (worsening): defers the notification for grace_seconds.
+        De-escalation within alert states (e.g. CRITICAL→WARNING): no new notification;
+          the metric is still alerting, so no RECOVER is sent either.
        Transitioning TO OK:
          - Still in grace window (pending_since set): suppresses both the alert
            and the recovery — the spike never warranted a page.
          - Past grace: fires the RECOVER notification normally.
        """
        lvl, message, formatted_msg = self._trigger_notification(
-            host_name, metric_path, old_level, new_level, value, threshold, plugin_data
+            host_name, metric_path, old_level, new_level, value, threshold, plugin_data,
+            check_name=check_name, metric_name=metric_name,
        )
        alert_state.formatted_message = formatted_msg

@@ -1100,12 +1256,20 @@ class ThresholdChecker:
                alert_state.pending_since = None
            else:
                self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
-        else:
+        elif new_level.value > old_level.value:
+            # Worsening (OK→WARNING, OK→CRITICAL, WARNING→CRITICAL): schedule notification.
            alert_state.pending_since = time.time()
            logger.debug(
                "Alert deferred (%.0fs grace): %s on %s = %s",
                self.grace_seconds, metric_path, host_name, value,
            )
+        else:
+            # De-escalation within alert states (e.g. CRITICAL→WARNING): metric is still
+            # alerting but did not recover, so no new notification.
+            logger.debug(
+                "De-escalation %s→%s for %s on %s, no notification",
+                old_level.name, new_level.name, metric_path, host_name,
+            )

    def _check_pending_or_renotify(
        self,
@@ -1115,6 +1279,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> None:
        """Called when alert level is unchanged and non-OK.

@@ -1124,7 +1290,8 @@ class ThresholdChecker:
        if alert_state.pending_since is not None:
            if time.time() - alert_state.pending_since >= self.grace_seconds:
                lvl, message, formatted_msg = self._trigger_notification(
-                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data
+                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data,
+                    check_name=check_name, metric_name=metric_name,
                )
                alert_state.formatted_message = formatted_msg
                self._send_notification(
@@ -1133,7 +1300,7 @@ class ThresholdChecker:
                alert_state.pending_since = None
            # else: still within grace window, do nothing
        else:
-            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data)
+            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data, check_name=check_name, metric_name=metric_name)

    def _check_renotify(
        self,
@@ -1143,6 +1310,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ):
        """Check if we should send a repeat notification.
        
@@ -1180,7 +1349,8 @@ class ThresholdChecker:
            
            # Format operator symbol
            op_symbol = threshold.operator.value
-            
+            short_path = metric_path.partition(".")[2] or metric_path
+
            # Time to re-notify
            if threshold_value is not None:
                # Use display format string
@@ -1189,27 +1359,50 @@ class ThresholdChecker:
                    value=value,
                    threshold_value=threshold_value,
                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
+                    plugin_data=plugin_data,
+                    check_name=check_name,
+                    metric_name=metric_name,
                )
-                message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
+                body = f"{value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
            else:
-                message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
-            
-            try:
-                notify_mod.send_notification(
+                body = f"{value} (ongoing for {int(now - alert_state.since)}s)"
+            message = f"REMINDER ({alert_state.level.name}): {host_name} - {short_path} = {body}"
+
+            from . import hbdclass
+            host = hbdclass.Host.hosts.get(host_name)
+            if host is None or host.watched:
+                asyncio.get_event_loop().create_task(notify_mod.send_notification(
                    host_name,
                    notify_mod.Notification(
-                        title=f"[REMINDER/{alert_state.level.name}] {host_name}",
-                        body=message,
+                        title=f"[REMINDER/{alert_state.level.name}] {host_name}  {short_path}",
+                        body=body,
                        level=alert_state.level.name,
                    ),
-                )
-                alert_state.last_notification = now
-                alert_state.notification_count += 1
+                ))
                logger.info("Re-notification sent: %s", message)
-            except Exception as e:
-                logger.error("Failed to send re-notification: %s", e)
+            alert_state.last_notification = now
+            alert_state.notification_count += 1
    
+    def purge_stale_alerts(self, hbdclass) -> None:
+        """Remove alert states that have no matching threshold configuration.
+
+        Called after startup (pickle restore) and after each config reload so
+        that alerts orphaned by configuration changes do not linger forever.
+        Alerts whose metric_path is not present in the current threshold config
+        for that host are silently dropped.
+        """
+        for hostname, host in hbdclass.Host.hosts.items():
+            if not host.alert_states:
+                continue
+            configured = self.get_thresholds_for_host(hostname)
+            stale = [mp for mp in host.alert_states if self._find_threshold(configured, mp)[0] is None]
+            for mp in stale:
+                logger.info(
+                    "Purging stale alert state for %s / %s (no threshold configured)",
+                    hostname, mp,
+                )
+                del host.alert_states[mp]
+
    def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
        """
        Get all currently active (non-OK) alerts.
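Review note: the transition rules described in the grace-period docstring earlier in this file can be summarised as one small decision function. This is a minimal sketch, not the project's code. `classify_transition`, the action strings, and the numeric enum values are assumptions made for illustration; the diff only shows that `AlertLevel` members are compared through `.value`.

```python
from enum import Enum


class AlertLevel(Enum):
    # Hypothetical values; only their ordering (OK < WARNING < CRITICAL) matters here.
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify_transition(old: AlertLevel, new: AlertLevel, pending: bool) -> str:
    """Return the action taken for a state change from old to new.

    pending is True while the alert is still inside its grace window,
    i.e. no notification has been sent for it yet.
    """
    if new is AlertLevel.OK:
        # Recovery: if the alert never left the grace window, nobody was
        # paged, so the recovery message is suppressed as well.
        return "suppress" if pending else "recover"
    if new.value > old.value:
        # Worsening: start (or restart) the grace timer instead of paging now.
        return "defer"
    # De-escalation between alert levels: still alerting, nothing new to send.
    return "none"


if __name__ == "__main__":
    for old, new, pending in [
        (AlertLevel.OK, AlertLevel.WARNING, False),
        (AlertLevel.CRITICAL, AlertLevel.WARNING, False),
        (AlertLevel.WARNING, AlertLevel.OK, True),
        (AlertLevel.WARNING, AlertLevel.OK, False),
    ]:
        print(old.name, "->", new.name, classify_transition(old, new, pending))
```

Keeping the decision separate from the side effects (timers, logging, sending) is one way to make the three branches easy to unit test.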
@@ -211,10 +211,11 @@ def _make_timer_callbacks(uname, host, ctx):
        connection.newstate(connection.__class__.OVERDUE, now, cfg.get("grace", 2))
        msg = f"{connection.afam} overdue"
        eventlog(uname, "CRITICAL", msg)
-        notify_mod.send_notification(
-            uname,
-            notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
-        )
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
+                uname,
+                notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
+            ))
        # Track in alert_states so the Alerts Dashboard shows this
        _set_connectivity_alert(host, connection.afam, "CRITICAL")
        if threshold_checker:
@@ -315,7 +316,6 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
    
    cfg = ctx.get("config", {})
    hbdcls = ctx.get("hbdclass")
-    log = ctx.get("log")
    msg_to_websockets = ctx.get("msg_to_websockets")
    DEBUG = ctx.get("DEBUG", 0)
    verbose = ctx.get("verbose", False)
@@ -336,8 +336,7 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
        # Apply user-access settings from config
        access = config_mod.get_host_access(cfg, uname)
        host.apply_access(access["owner"], access["managers"], access["monitors"])
-        if verbose:
-            print(("XX: New host, num now %s" % (len(hbdcls.Host.hosts))))
+        logger.info("New host signed on: %s (dyn=%s, access=%s)", uname, host.dyn, access)
        newh = True
    else:
        host = hbdcls.Host.hosts[uname]
@@ -408,10 +407,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):

    if res:
        eventlog(uname, "WARNING", res)
-        notify_mod.send_notification(
-            uname,
-            notify_mod.Notification(title=f"[WARNING] {uname}", body=res, level="WARNING"),
-        )
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
+                uname,
+                notify_mod.Notification(title=f"[WARNING] {uname}", body=res, level="WARNING"),
+            ))

    interval = int(msg.get("interval", 0) or 0)
    shutdown = msg.get("shutdown", 0)
@@ -421,10 +421,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):

    if boot:
        eventlog(uname, "INFO", "booted")
-        notify_mod.send_notification(
-            uname,
-            notify_mod.Notification(title=f"[INFO] {uname}", body=f"{host.name} booted", level="INFO"),
-        )
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
+                uname,
+                notify_mod.Notification(title=f"[INFO] {uname}", body=f"{host.name} booted", level="INFO"),
+            ))
    if message:
        eventlog(uname, "INFO", "msg: %s" % message, service=service)

@@ -438,13 +439,18 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
        if not newh:
            if d == 0 or lasts == "unknown":
                m = "%s is up" % (conn.afam)
+            elif d < 4:
+                # Transient blip (likely client restart) — skip log and notification
+                m = None
            else:
                m = "%s back after being %s for %s" % (conn.afam, lasts, dur(d))
-            eventlog(uname, "RECOVER", m)
-            notify_mod.send_notification(
-                uname,
-                notify_mod.Notification(title=f"[RECOVER] {uname}", body=m, level="RECOVER"),
-            )
+            if m:
+                eventlog(uname, "RECOVER", m)
+                if host.watched:
+                    asyncio.create_task(notify_mod.send_notification(
+                        uname,
+                        notify_mod.Notification(title=f"[RECOVER] {uname}", body=m, level="RECOVER"),
+                    ))

    if boot or newh:
        host.upcount = host.doesack
@@ -454,10 +460,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
    if shutdown:
        m = "%s shutdown" % conn.afam
        eventlog(uname, "INFO", m)
-        notify_mod.send_notification(
-            uname,
-            notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
-        )
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
+                uname,
+                notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
+            ))
        conn.newstate(hbdcls.Connection.DOWN, now)
        _set_connectivity_alert(host, conn.afam, "CRITICAL")

@@ -491,12 +498,10 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
        op, rmsg = host.cmds[0]
        if op == "CMD":
            del host.cmds[0]
-            if log:
-                log(uname, "command sent")
+            eventlog(uname, "INFO", "command sent")
        elif op == "UPD":
            del host.cmds[0]
-            if log:
-                log(uname, "update initiated")
+            eventlog(uname, "INFO", "update initiated")
        opkt = dicttos(op, rmsg)
        try:
            transport.sendto(opkt, addr)
@@ -13,7 +13,8 @@ from . import data

 logger = logging.getLogger(__name__)

-_connections: set = set()
+# Map of WebSocket → User object (or None when auth is disabled)
+_connections: dict = {}
 _loop: Optional[asyncio.AbstractEventLoop] = None
 _get_hosts: Optional[Callable[[], Iterable]] = None
 _verbose: bool = False
@@ -34,23 +35,53 @@ def setup(
    _verbose = verbose


+def _user_can_see_host(user, host_name: str) -> bool:
+    """Return True if *user* may see updates for *host_name* (manager or higher)."""
+    from . import hbdclass, users as users_mod
+    if user is None or not users_mod.users_enabled():
+        return True
+    if user.admin:
+        return True
+    host = hbdclass.Host.hosts.get(host_name)
+    if host is None:
+        return False
+    return host.is_manager(user.username)
+
+
+def _get_token(request) -> str:
+    """Extract session token from request (mirrors logic in http.py)."""
+    auth = request.headers.get("Authorization", "")
+    if auth.startswith("Bearer "):
+        return auth[7:].strip()
+    token = request.headers.get("X-Auth-Token", "")
+    if token:
+        return token
+    return request.cookies.get("hbd_session", "")
+
+
 async def handler(request):
    """aiohttp WebSocket upgrade handler — register as GET /ws."""
    from aiohttp import web
+    from . import users as users_mod

    ws = web.WebSocketResponse()
    await ws.prepare(request)

-    _connections.add(ws)
+    token = _get_token(request)
+    user = users_mod.get_session_user(token) if token else None
+
+    _connections[ws] = user
    remote = request.remote
    logger.info("WebSocket connected from %s", remote)

    try:
-        # Send current host state to the new client
+        # Send current host state, filtered to hosts this user may see
        if _get_hosts:
            try:
                for h in list(_get_hosts()):
-                    await ws.send_str(json.dumps({"type": "host", "data": h}))
+                    host_name = h.get("raw_name") or h.get("name", "")
+                    if _user_can_see_host(user, host_name):
+                        await ws.send_str(json.dumps({"type": "host", "data": h}))
            except Exception as e:
                logger.error("Error sending initial hosts: %s", e)

@@ -74,7 +105,7 @@ async def handler(request):
    except Exception as e:
        logger.exception("WebSocket handler error from %s: %s", remote, e)
    finally:
-        _connections.discard(ws)
+        _connections.pop(ws, None)
        logger.info("WebSocket disconnected from %s", remote)

    return ws
@@ -83,25 +114,37 @@ async def handler(request):
 def broadcast(typ: str, payload) -> bool:
    """Thread-safe broadcast to all connected WebSocket clients.

+    For host and plugin updates, only sends to clients whose user has
+    manager-or-higher access to that host.  Other message types are
+    broadcast to all clients.
+
    Can be called from any thread; schedules sends on the event loop.
    Returns False if the loop is not running yet.
    """
    if not _loop:
        return False
+
+    # Determine the host name for access-filtered message types
+    host_name: Optional[str] = None
+    if typ in ("host", "plugin"):
+        host_name = payload.get("raw_name") or payload.get("host") or payload.get("name")
+
    jmsg = json.dumps({"type": typ, "data": payload})

    async def _send_all():
        dead = set()
-        for ws in list(_connections):
+        for ws, user in list(_connections.items()):
            try:
-                if not ws.closed:
-                    await ws.send_str(jmsg)
-                else:
+                if ws.closed:
                    dead.add(ws)
+                    continue
+                if host_name is not None and not _user_can_see_host(user, host_name):
+                    continue
+                await ws.send_str(jmsg)
            except Exception:
                dead.add(ws)
        for ws in dead:
-            _connections.discard(ws)
+            _connections.pop(ws, None)

    asyncio.run_coroutine_threadsafe(_send_all(), _loop)
    return True
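Review note: the per-user filtering added to `broadcast` above reduces to choosing which connections receive a message. The sketch below is a simplified, synchronous illustration under stated assumptions. `User`, `MANAGERS`, `can_see` and `recipients` are hypothetical stand-ins for the real `users` and `hbdclass` modules, and plain strings stand in for WebSocket objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class User:
    username: str
    admin: bool = False


# Hypothetical access table: host name -> usernames with manager-or-higher access.
MANAGERS = {"web1": {"alice"}, "db1": {"bob"}}


def can_see(user: Optional[User], host_name: str) -> bool:
    """Mirror the rule in the diff: no user or admin sees everything."""
    if user is None or user.admin:
        return True
    return user.username in MANAGERS.get(host_name, set())


def recipients(connections: Dict[str, Optional[User]], typ: str, payload: dict) -> List[str]:
    """Return the connections that should receive a message of this type."""
    host_name = None
    if typ in ("host", "plugin"):
        # Same key precedence as the diff: raw_name, then host, then name.
        host_name = payload.get("raw_name") or payload.get("host") or payload.get("name")
    return [
        conn
        for conn, user in connections.items()
        if host_name is None or can_see(user, host_name)
    ]


if __name__ == "__main__":
    conns = {"c1": User("alice"), "c2": User("bob"), "c3": User("root", admin=True), "c4": None}
    print(recipients(conns, "host", {"name": "web1"}))
    print(recipients(conns, "event", {"text": "hello"}))
```

As in the diff, a connection with no resolved user is treated as unrestricted. Whether an unauthenticated socket can reach the handler while authentication is enabled depends on code not shown in this commit, so that case may be worth checking.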
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "hbd"
-version = "5.1.3"
+version = "5.2.3"
 description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
 readme = "README.md"
 requires-python = ">=3.11"
@@ -34,6 +34,9 @@ server = [
  "matrix-nio>=0.24",
 ]

+# Minimal client — hbc_mini only, no external dependencies
+mini = []
+
 # Install both client and server
 all = [
  "hbd[client,server]",
@@ -54,6 +57,9 @@ dev = [
 hbd = "hbd.server.cli:main"
 hbc = "hbd.client.main:main"

+[tool.setuptools]
+script-files = ["scripts/hb_install.sh", "scripts/hbc_mini.py"]
+
 [tool.setuptools.packages.find]
 where = ["."]
 include = ["hbd*"]
@@ -4,12 +4,14 @@ set -e
 uv version --bump patch 
 VER=$(uv  version  --short)
 sed -i".bak"  "s/__version__ = \"[0-9.]*\"\(.*\)$/__version__ = \"$VER\"\1/" hbd/__init__.py
+sed -i".bak"  "s/__version__ = \"[0-9.]*\"\(.*\)$/__version__ = \"$VER\"\1/" scripts/hbc_mini.py

 # commit pyproject.toml
-git commit -m "version $VER" pyproject.toml hbd/__init__.py
+git commit -m "version $VER" pyproject.toml hbd/__init__.py scripts/hbc_mini.py
 git push 
 # tag version
 git tag -a v$VER -m "Version $VER"
 git push --tags

 rm hbd/__init__.py.bak
+rm scripts/hbc_mini.py.bak
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+# Helper script to install the heartbeat tools. By default, it will only
+# install the heartbeat client, hbc. The server is installed when the arg 'server' is passed 
+# to the script. The script will install the heartbeat tools in a python 
+# virtual environment in ~/venvs/hbd. The hbd and hbc commands will be
+# installed from the wheel and symlinked to ~/bin/hbd and ~/bin/hbc,
+# respectively. If the virtual environment already exists, it will be
+# reused. The script will also remove any existing symlinks for hbd and hbc
+# in ~/bin before creating new ones.
+
+set -e
+what=$1
+on_ha=0
+where=""
+venv=""
+[ "$2" = "HA" ] && on_ha=1
+[ -z "$what" ] && what="client"
+
+if [ -d /homeassistant ]; then  # if running from HA command line
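+    echo "HA, running \"docker exec homeassistant /config/bin/hb_install.sh $what HA\""
+    # Test the command directly: under 'set -e' a separate 'rc=$?' check would never run.
+    # Pass "$what" rather than $@ so that HA is always the second argument.
+    if ! docker exec homeassistant /config/bin/hb_install.sh "$what" HA; then
+        echo "Failed to install heartbeat in HA, please check the logs for more details"
+        exit 1
+    fi
+    exit 0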
+fi
+
+if [ $on_ha -eq 1 ] || [ -r /.dockerenv ] && [ -d /config/bin ]; then
+    # Installing under docker on Home Assistant OS, using /config/bin for executables and /config/venvs for virtual environments 
+    echo "Home Assistant OS detected, installing under docker"
+    where="/config/bin"
+    venv="/config/venvs"
+else
+    if [ ! -d $HOME/.local/bin ] && [ ! -d $HOME/bin ]; then
+        echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
+        exit 1
+    fi
+    for where in $HOME/bin $HOME/.local/bin notset ; do
+        if echo ":$PATH:" | grep -q ":$where:" ; then
+            break
+        fi
+    done
+    if [ "$where" = "notset" ]; then
+        echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
+        exit 1
+    fi
+    if [ "$what" = "mini" ]; then
+        venv=""
+    else
+        venv="$HOME/venvs"
+    fi
+fi
+echo "Installing $what to $where"
+if [ ! -z "$venv" ]; then
+    echo "Using virtual environment at $venv/hbd"
+fi
+
+if [ "$venv" != "" ] && [ ! -d  $venv/hbd ]; then
+    arg=""
+    have_pip=$(python3 -c "import pip" >/dev/null 2>&1 && echo "Installed" || echo "Not Installed")
+    if [ "$have_pip" = "Not Installed" ]; then
+        # some systems do not have pip installed by default, so we need to fetch get-pip.py and install pip
+        echo "pip is not installed, fetching get-pip.py and installing pip"
+        arg="--without-pip"
+    fi
+    mkdir -p $venv
+    have_venv=$(python3 -c "import venv" >/dev/null 2>&1 && echo "Installed" || echo "Not Installed")
+    if [ "$have_venv" = "Not Installed" ]; then
+        if [ "$have_pip" = "Not Installed" ]; then
+            echo "python has no venv, and no pip to install virtualenv, cannot continue"
+            exit 1
+        fi
+        echo "python venv module not found, installing virtualenv"
+        python3 -m pip install --user virtualenv
+        python3 -m virtualenv $venv/hbd --system-site-packages $arg
+    else
+        python3 -m venv $venv/hbd --system-site-packages $arg
+    fi
+    . $venv/hbd/bin/activate
+    if [ -n "$arg" ]; then  
+        curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py
+    fi
+    deactivate
+fi
+
+if [ ! -z "$venv" ]; then
+    . $venv/hbd/bin/activate
+fi
+if [ "$what" = "mini" ]; then
+    curl -fsS -o $where/hbc_mini https://git.wrede.ca/andreas/heartbeat/raw/branch/master/scripts/hbc_mini.py
+    chmod +x $where/hbc_mini
+else
+    python3 -mpip install --upgrade --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]
+fi
+
+if [ ! -z "$venv" ]; then
+    echo "linking executables to $where"
+    if [ "$what" = "server" ]; then
+        rm -f $where/hbd
+        ln -sf $(which hbd) $where/hbd
+    elif [ "$what" = "client" ]; then
+        rm -f $where/hbc
+        ln -sf $(which hbc) $where/hbc
+    fi
+    rm -f $where/hb_install.sh
+    ln -sf $(which hb_install.sh) $where/hb_install.sh
+fi
+echo "Installation complete. To upgrade, run the following:"
+echo "    $where/hb_install.sh $what"
+echo "To install on another machine, download the install script from"
+echo "    https://git.wrede.ca/andreas/heartbeat/raw/branch/master/scripts/hb_install.sh"
+echo "and then run: sh hb_install.sh [mini|client]"
@@ -1,88 +0,0 @@
-#!/bin/sh
-
-# install the heartbeat client, hbc. The server is installed when the arg 'server' is passed 
-# install the heartbeat client, hbc. The server is installed when the arg 'server' is passed 
-# to the script. The script will install the heartbeat tools in a python 
-# virtual environment in ~/venvs/hbd. The hbd and hbc commands will be
-# installed from the wheel and symlinked to ~/bin/hbd and ~/bin/hbc,
-# respectively. If the virtual environment already exists, it will be
-# reused. The script will also remove any existing symlinks for hbd and hbc
-# in ~/bin before creating new ones.
-
-
-# hbd/hbc from wheel and create symlinks for hbd and hbc in ~/bin
-
-set -e
-what=$1
-on_ha=0
-[ -z "$what" ] && what="client"
-
-if [ -d /homeassistant ]; then
-    echo "cannot install in HA, run \"docker exec -it homeassistant $0 $@\""
-    exit 1
-fi
-if [ -d /config ]; then
-    echo "Installing on HA"
-    where="/config/bin"
-    venv="/config/venvs"
-    on_ha=1
-else
-    if [ ! -d $HOME/.local/bin ] && [ ! -d $HOME/bin ]; then
-        echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
-        exit 1
-    fi
-    for where in $HOME/bin $HOME/.local/bin notset ; do
-        if echo ":$PATH:" | grep -q ":$where:" ; then
-            break
-        fi
-    done
-    if [ "$where" = "notset" ]; then
-        echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
-        exit 1
-    fi
-    venv="$HOME/venvs"
-fi
-
-echo "Installing heartbeat $what"
-
-if [ ! -d  $venv/hbd ]; then
-    python3 -m pip --version > /dev/null 2>&1 
-    if [ $? -ne 0 ]; then
-        # truenas does not have pip installed by default, so we need to fetch get-pip.py and install pip
-        echo "pip is not installed, fetching get-pip.py and installing pip"
-        arg="--without-pip"
-    fi
-    mkdir -p $venv
-    have_venv=$(python3 -c "import venv" &> /dev/null && echo "Installed" || echo "Not Installed")
-    if [ "$have_venv" = "Not Installed" ]; then
-        echo "python venv module not found, installing virtualenv"
-        python3 -m pip install --user virtualenv
-        python3 -m virtualenv $venv/hbd --system-site-packages $arg
-    else
-        python3 -m venv $venv/hbd --system-site-packages $arg
-    fi
-    . $venv/hbd/bin/activate
-    if [ -n "$arg" ]; then  
-        curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py
-    fi
-    deactivate
-fi
-
-. $venv/hbd/bin/activate
-python3 -mpip install --upgrade --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]
-
-if [ "$what" = "server" ]; then
-    rm -f $where/hbd
-    ln -sf $(which hbd) $where/hbd
-    echo "hbd installed, you can run it with \"$where/hbd\" or \"hbd\" if $where is in your PATH"
-else
-    rm -f $where/hbc
-    ln -sf $(which hbc) $where/hbc
-    if [ $on_ha -eq 1 ]; then
-        echo "restarting hbc "
-        job=$(grep run_hbc configuration.yaml | sed 's/run_hbc://')
-        $job
-    else
-        echo "hbc installed, you can run it with \"$where/hbc\" or \"hbc\" if $where is in your PATH"
-    fi  
-fi
@@ -68,8 +68,7 @@ async def test_nagios_runner():
    print(f"   ✓ Collected {len(data)} data points")
    
    print(f"\n4. Results:")
-    print(f"   Overall Status: {data.get('overall_status')} (code: {data.get('overall_status_code')})")
-    print(f"   Plugins Executed: {data.get('plugin_count')}")
+    print(f"   Data points collected: {len(data)}")
    
    # Show individual plugin results
    print(f"\n5. Individual Plugin Results:")