version 5.2.5

fix: agree: zpool ONLINE=OK, DEGRADED=WARNING, all else is CRITICAL
fix: typo
2026-05-08 17:25:50 -04:00 · 2026-05-08 17:18:41 -04:00 · 2026-05-08 17:03:32 -04:00 · 2026-05-08 16:57:45 -04:00 · 2026-05-08 16:39:16 -04:00 · 2026-05-08 16:25:55 -04:00
46 changed files with 5235 additions and 583 deletions
@@ -27,6 +27,7 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
  - Configurable retention and backup management
 - **Plugin system for extensible monitoring** ✅
  - Collect system metrics (CPU, memory, disk, network)
+  - Monitor ZFS pool health, capacity, and I/O via `zpool(8)`
  - Execute existing Nagios monitoring plugins
  - Create custom plugins with simple Python classes
 - **Threshold alerting system** ✅
@@ -34,6 +35,8 @@ A lightweight daemon that listens for UDP heartbeat messages and acts on them: k
  - Hysteresis to prevent alert flapping
  - Automatic notifications on state changes
  - Re-notification for ongoing alerts
+- **Per-host watch flag** — set `watch: false` on any host to silence all notifications for that host without removing its configuration ✅
+- **Role-filtered dashboards** — Live Dashboard and Host Overview show only hosts where the logged-in user is owner or manager (admins see all) ✅
 - Modular codebase suitable for unit testing and CI ✅

 ---
@@ -55,21 +58,26 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
 ### Built-in Plugins

 - `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
- `memory_monitor`: Monitors RAM and swap usage, available memory
+- `cpu_monitor`: Monitors CPU usage, load average, frequency, process counts, and uptime
+- `memory_monitor`: Monitors RAM and swap usage, available memory (ZFS ARC-aware)
 - `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
 - `network_monitor`: Monitors network interface statistics, bandwidth, and connections
+- `ping_monitor`: Measures round-trip latency to configured hosts
 - `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
 - `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
+- `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`

 ### Nagios Integration

 The `nagios_runner` plugin provides seamless integration with the vast Nagios plugin ecosystem. You can run any Nagios-compatible plugin and have the results automatically parsed and stored:

- Executes plugins via subprocess with timeout protection
+- Executes plugins asynchronously (non-blocking) with timeout protection
+- Captures both stdout and stderr; if stdout is empty, stderr is used as the status message
+- Handles signal-killed processes (negative exit code → UNKNOWN status)
+- Validates absolute command paths at startup and warns on missing or non-executable files
 - Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
 - Extracts performance data with thresholds
- Reports aggregated status across all configured checks
+- Reports per-check status, exit code, and output; no aggregate rollup field
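
The result handling listed above can be sketched as a pure function. This is a minimal illustration, not the `nagios_runner` implementation: `interpret_result` and the returned field names are hypothetical, and treating any exit code outside 0-3 as UNKNOWN is an assumption (the text only states this for negative codes).

```python
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def interpret_result(returncode: int, stdout: str, stderr: str) -> dict:
    """Turn one finished check into a status code, status name, and message."""
    # A negative return code means the process was killed by a signal.
    # Assumption: any other code outside 0-3 is also reported as UNKNOWN.
    code = returncode if returncode in STATUS_NAMES else 3

    # Prefer stdout; fall back to stderr when stdout is empty.
    message = stdout.strip() or stderr.strip()
    first_line = message.splitlines()[0] if message else ""

    # Nagios plugins separate status text from performance data with "|".
    text, _, perfdata = first_line.partition("|")
    return {
        "status_code": code,
        "status": STATUS_NAMES[code],
        "output": text.strip(),
        "perfdata": perfdata.strip(),
    }


print(interpret_result(-9, "", "")["status"])  # UNKNOWN
```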

 See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for complete integration guide including configuration examples and custom plugin development.

@@ -147,9 +155,11 @@ Heartbeat includes a sophisticated threshold alerting system that monitors plugi
 - **Multi-level alerts**: WARNING and CRITICAL severity levels
 - **Flexible operators**: Support for >, >=, <, <=, ==, != comparisons
 - **Hysteresis**: Prevents alert flapping with configurable recovery thresholds
- **Smart notifications**: Alerts only on state changes, not every check
+- **Smart notifications**: Alerts only on state changes, not every check; de-escalations (e.g. CRITICAL → WARNING) do not generate a notification
 - **Re-notifications**: Periodic reminders for ongoing alerts
+- **Short-duration suppression**: Recovery notifications are suppressed for down events under 4 seconds (avoids noise from transient blips)
 - **Journal integration**: All threshold events logged for audit trail
+- **`ping_monitor` thresholds**: Latency and packet-loss thresholds use the same format as all other plugin metrics
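
The notification rules in this list can be read as a small decision function. The sketch below is illustrative only: `should_notify`, `SEVERITY`, and `MIN_DOWN_SECONDS` are hypothetical names, and it assumes a return to OK counts as a recovery (which notifies) rather than a de-escalation. Re-notifications for ongoing alerts are out of scope here.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}
MIN_DOWN_SECONDS = 4.0  # recoveries from shorter events are suppressed


def should_notify(old: str, new: str, down_seconds: float = 0.0) -> bool:
    """Decide whether a transition from `old` to `new` sends a notification."""
    if old == new:
        return False  # no state change
    if new == "OK":
        # Recovery: stay quiet when the event was a transient blip.
        return down_seconds >= MIN_DOWN_SECONDS
    # Escalations notify; de-escalations (CRITICAL -> WARNING) do not.
    return SEVERITY[new] > SEVERITY[old]


print(should_notify("CRITICAL", "WARNING"))  # False
```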

 ### Configuration

@@ -172,7 +182,8 @@ thresholds:
      warning: 80.0      # Warn when CPU > 80%
      critical: 90.0     # Critical when CPU > 90%
      operator: ">"
-      hysteresis: 0.1    # 10% hysteresis to prevent flapping
+      hysteresis: 0.02   # 2% hysteresis to prevent flapping
+      display: "(threshold: {op_symbol} {threshold_value}%)"  # optional
  
  memory_monitor:
    percent:
@@ -214,7 +225,7 @@ thresholds:
    <hostname>:
      warning: <milliseconds>   # Warn when RTT > this value
      critical: <milliseconds>  # Critical when RTT > this value
-      hysteresis: 0.1           # Optional: 10% hysteresis (default)
+      hysteresis: 0.02          # Optional: 2% hysteresis (default)
 ```

 **Example alerts:**
@@ -265,7 +276,94 @@ All plugin metrics can be thresholded:
 - **Memory**: percent, available_mb, swap_percent
 - **Disk**: Per-partition percent, free_gb, free_mb
 - **Network**: errors_total, dropped packets, connection counts
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
+- **Nagios**: Any field emitted by `nagios_runner` (`<name>_status_code`, `<name>_status`, `<name>_output`, performance data fields)
+
+### Display Format Templates
+
+Each threshold entry accepts an optional `display` field — a Python format string shown in notifications and on the Alerts dashboard:
+
+```yaml
+nagios_runner:
+  status_code:
+    warning: 1
+    critical: 2
+    operator: ">="
+    display: "{check_name}: exit {value} (expected < {threshold_value})"
+```
+
+Available variables:
+
+| Variable | Description |
+|---|---|
+| `{value}` | Current metric value |
+| `{threshold_value}` | Threshold that was crossed |
+| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …); `"nagios"` for the nagios operator |
+| `{check_name}` | Prefix stripped by generic matching (see below) |
+| `{metric_name}` | Full field name within the plugin data |
+| `{output}` | For `nagios_runner` generic matches: the matched check's status text (alias for `{check_name}_output`) |
+| `{status}` | For `nagios_runner` generic matches: the matched check's status name — OK/WARNING/CRITICAL/UNKNOWN (alias for `{check_name}_status`) |
+| any plugin field | Any other field present in the plugin's data |
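
Because `display` is a Python format string, rendering it amounts to a `str.format_map` call over the variables in the table above. A minimal sketch, assuming a hypothetical `render_display` helper; leaving unknown placeholders untouched instead of raising is a design choice made for this example, not documented behavior.

```python
class _KeepUnknown(dict):
    """Leave placeholders with no matching variable untouched."""

    def __missing__(self, key):
        return "{" + key + "}"


def render_display(template: str, variables: dict) -> str:
    """Fill a display template with the available variables."""
    return template.format_map(_KeepUnknown(variables))


message = render_display(
    "{check_name}: exit {value} (expected < {threshold_value})",
    {"check_name": "check_disk_root", "value": 2, "threshold_value": 2},
)
print(message)  # check_disk_root: exit 2 (expected < 2)
```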
+
+### Generic Threshold Matching
+
+When a metric name has no exact threshold entry, the server progressively strips leading underscore-separated segments and retries the lookup. This lets a single generic entry cover an entire family of metrics.
+
+The classic use case is `nagios_runner`, which names each metric after the command that produced it:
+
+```
+nagios_runner.check_disk_root_status_code    → no exact match
+nagios_runner.disk_root_status_code          → no match
+nagios_runner.root_status_code               → no match
+nagios_runner.status_code                    → matched ✓
+```
+
+Configure the generic threshold once using the `nagios` operator, which maps exit codes directly to alert severity without requiring numeric warning/critical values:
+
+```yaml
+nagios_runner:
+  status_code:
+    operator: "nagios"   # 0=OK  1=WARNING  2=CRITICAL  3=UNKNOWN
+    display: "{check_name}: {output}"
+```
+
+The stripped prefix (`check_disk_root` in the example above) is available as `{check_name}` in the display template, so you can identify which check triggered the alert without writing a separate threshold entry per command.
+
+Exact matches always take priority. A generic entry only applies when no specific one is defined.
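
The lookup described in this section can be sketched in a few lines. `find_threshold` is a hypothetical name used for illustration; the server's actual function and data layout may differ.

```python
from typing import Optional, Tuple


def find_threshold(thresholds: dict, metric: str) -> Optional[Tuple[dict, str]]:
    """Return (entry, check_name) for the closest match, or None."""
    segments = metric.split("_")
    for i in range(len(segments)):
        candidate = "_".join(segments[i:])
        if candidate in thresholds:
            # i == 0 is the exact match, so it always wins over generic entries.
            # The segments stripped so far become {check_name}.
            return thresholds[candidate], "_".join(segments[:i])
    return None


generic = {"status_code": {"operator": "nagios"}}
entry, check_name = find_threshold(generic, "check_disk_root_status_code")
print(check_name)  # check_disk_root
```

For an exact match the stripped prefix is empty, so `check_name` is an empty string in this sketch.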
+
+### Per-Host Threshold Profiles
+
+Named threshold configurations let different hosts use different limits. A host's `threshold_config` can be a single name or a **list** — lists are applied left-to-right so profiles compose without duplication:
+
+```yaml
+threshold_configs:
+  default:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}
+
+  tight_cpu:           # override CPU limits only
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  db_disk:             # add a database partition check
+    thresholds:
+      disk_monitor:
+        partitions:
+          /var/lib/postgresql:
+            percent: {warning: 75, critical: 88}
+
+hosts:
+  web-01:
+    threshold_config: default          # single profile
+
+  db-01:
+    threshold_config: [tight_cpu, db_disk]   # layered: CPU override + extra disk check
+```
+
+Each named config's overrides are applied in order on top of the defaults. Metrics not mentioned in a profile are inherited unchanged.

 See [docs/THRESHOLD_ALERTING.md](docs/THRESHOLD_ALERTING.md) for comprehensive documentation including best practices, troubleshooting, and advanced configuration.

@@ -328,9 +426,10 @@ Heartbeat includes a built-in HTTP/WebSocket server that provides both a REST AP
 ### Web Dashboards

 - **Login** (`/login`): Browser login form (shown automatically when auth is configured)
- **Live View** (`/live`): Real-time host connectivity, latency, and messages
- **Plugin Metrics** (`/plugins`): Browse and visualize metrics from all plugins
- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering
+- **Live View** (`/live`): Real-time host connectivity, latency, and messages; hostnames link directly to the Host Overview page
+- **Host Overview** (`/plugins/<host>`): Per-host plugin metrics with ZFS pool visualization; filtered to hosts where the logged-in user is owner or manager (admins see all)
+- **Alerts Dashboard** (`/alerts`): Monitor active alerts with severity filtering; alert count pie chart shown in the navigation bar
+- **Settings** (`/settings`): Server configuration, user management, and threshold configuration viewer

 ### API Endpoints

@@ -408,6 +507,9 @@ hbc --boot your-server.example.com

 # Verbose output
 hbc -v your-server.example.com
+
+# Send 'boot' and 'shutdown' messages on start and exit
+hbc -b your-server.example.com
 ```

 You can also run it via the module entrypoint:
@@ -416,12 +518,11 @@ You can also run it via the module entrypoint:
 python -m hbd.client.main your-server.example.com
 ```

-Client configuration can also be specified in YAML:
+Client configuration can also be specified in YAML (`~/.hbc.yaml`):

 ```yaml
-server: hbd.example.com
-port: 50003
-interval: 30
+hb_port: 50003        # Server port (default: 50003)
+interval: 30          # Heartbeat interval in seconds
 plugins:
  cpu_monitor:
    interval: 300      # Check every 5 minutes (default)
@@ -435,12 +536,20 @@ plugins:
  nagios_runner:
    interval: 300      # Check every 5 minutes (default)
    commands:
-      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
-      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
+      - name: check_load
+        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
+      - name: check_disk
+        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
 ```

+The server hostname is always passed as a positional command-line argument; there is no `server:` config key.
+
 All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.

+**Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
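
The retry policy above can be modelled with a small per-connection counter. This is a sketch under stated assumptions: the `Connection` class and its `record` method are hypothetical and only mirror the rule as written (IPv6 is dropped after 3 consecutive failures if it never succeeded; IPv4 is never dropped).

```python
MAX_EARLY_IPV6_FAILURES = 3


class Connection:
    """Tracks open() attempts for one server address."""

    def __init__(self, family: str):
        self.family = family  # "ipv4" or "ipv6"
        self.ever_connected = False
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one open() attempt; return False if the connection is dropped."""
        if ok:
            self.ever_connected = True
            self.failures = 0
            return True
        self.failures += 1
        never_worked = self.family == "ipv6" and not self.ever_connected
        # Only IPv6 connections that never succeeded are given up on.
        return not (never_worked and self.failures >= MAX_EARLY_IPV6_FAILURES)


conn = Connection("ipv6")
print([conn.record(False) for _ in range(3)])  # [True, True, False]
```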
+
+**Daemon logging:** When running with `-d`, `hbc` routes all log output to syslog (`LOG_DAEMON` facility) after daemonizing. Without `-d`, logs go to stderr as usual.
+
 ### hbc_mini — single-file client (no external dependencies)

 `scripts/hbc_mini.py` is a self-contained version of the heartbeat client that requires only Python 3.8+ and no external packages. Copy it to any host and run it directly — no virtualenv, no `pip install`.
@@ -496,8 +605,10 @@ python3 hbc_mini.py -m "maintenance starting" your-server.example.com

 - No YAML config (use JSON instead)
 - No `filesystem_info` plugin
+- No `zfs_monitor` plugin (requires `zpool(8)` and the full plugin loader)
 - `cpu_monitor` does not report per-core usage or CPU frequency (no psutil)
 - Plugins cannot be loaded from external `.py` files — all plugins are compiled in
+- No IPv6 early-fail protection — connections that fail to open at startup are silently skipped rather than retried

 Everything else — heartbeat protocol, ACK/CMD/UPD handling, `hb_install.sh`-based self-update, daemonize, syslog — is identical to the full client.

@@ -104,11 +104,6 @@ The `nagios_runner` plugin collects:
 - `{name}_{metric}_min` - Minimum value (if present)
 - `{name}_{metric}_max` - Maximum value (if present)

-**Overall:**
- `overall_status` - Worst status from all commands
- `overall_status_code` - Worst status code
- `plugin_count` - Number of Nagios plugins executed
-
 ## Configuration Options

 ```yaml
@@ -8,6 +8,7 @@ This guide explains how to create custom plugins for the Heartbeat monitoring sy
 - [Plugin Types](#plugin-types)
 - [Creating a Plugin](#creating-a-plugin)
 - [Plugin Lifecycle](#plugin-lifecycle)
+- [Server-initiated InfoPlugin refresh](#server-initiated-infoplugin-refresh)
 - [Configuration](#configuration)
 - [Best Practices](#best-practices)
 - [Examples](#examples)
@@ -250,6 +251,28 @@ Understanding the plugin lifecycle helps you implement plugins correctly:
   └─> Plugin releases resources, closes connections
 ```

+## Server-initiated InfoPlugin refresh
+
+When a heartbeat packet arrives from a host the server has no plugin data for (e.g. after a server restart), the server sets `request_update = 1` in the ACK reply. The client detects this flag and immediately re-runs all InfoPlugins — clearing their cached results first — then resends the data as PLG messages.
+
+This means InfoPlugin data will always reach the server as soon as possible without requiring a client restart. No action is needed from plugin authors: the framework handles cache invalidation and re-collection automatically.
+
+The lifecycle for this case looks like:
+
+```
+Server restarts, host reconnects
+   └─> hbd receives HTB with no existing plugin_data for host
+   └─> hbd sets request_update=1 in ACK
+
+Client receives ACK
+   └─> Detects request_update flag
+   └─> Clears _cache on every registered InfoPlugin
+   └─> Calls collect() on each InfoPlugin
+   └─> Sends fresh PLG messages to server
+```
+
+If you write an `InfoPlugin` with side effects in `_collect_info()` (opening connections, writing files, etc.), be aware it may be called more than once per client session when this mechanism triggers.
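
The client side of this flow can be sketched as follows. The `InfoPlugin` class here is a simplified stand-in for the framework's base class, and `handle_ack` and `send_plg` are hypothetical names; only `request_update`, `_cache`, `collect()`, and `_collect_info()` come from the text above.

```python
class InfoPlugin:
    """Simplified stand-in: collects once, then serves the cached result."""

    def __init__(self, name: str):
        self.name = name
        self._cache = None
        self.calls = 0

    def _collect_info(self) -> dict:
        self.calls += 1  # side effects here run again on every refresh
        return {"collected": self.calls}

    def collect(self) -> dict:
        if self._cache is None:
            self._cache = self._collect_info()
        return self._cache


def handle_ack(ack: dict, plugins: list, send_plg) -> None:
    """Re-run every InfoPlugin when the server asks for an update."""
    if ack.get("request_update") != 1:
        return
    for plugin in plugins:
        plugin._cache = None  # invalidate before re-collecting
        send_plg(plugin.name, plugin.collect())


plugin = InfoPlugin("os_info")
plugin.collect()  # normal startup collection
sent = []
handle_ack({"request_update": 1}, [plugin], lambda name, data: sent.append((name, data)))
print(plugin.calls)  # 2
```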
+
 ## Configuration

 ### Plugin-Specific Configuration
@@ -256,6 +256,56 @@ disk_monitor:
        operator: "<"
 ```

+### ZFS Monitor
+
+ZFS pool health is checked automatically for every pool. A `DEGRADED` pool
+raises a **WARNING** alert, and any other state except `ONLINE` (e.g.
+`SUSPENDED`, `FAULTED`, `UNAVAIL`) raises a **CRITICAL** alert by default; no configuration is required.
+
+The default threshold is equivalent to:
+
+```yaml
+zfs_monitor:
+  pools:
+    '*':
+      status:
+        warning: 1
+        critical: 2
+        operator: ">"
+        hysteresis: 0.0
+        display: "ZFS pool {pool_name} is {health}"
+```
+
+`'*'` matches every pool on the host. The notification message includes the pool
+name and its current health string, e.g. `ZFS pool tank is DEGRADED`.
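
As an illustration of how health strings might be turned into alert severities, the sketch below parses the tab-separated output of `zpool list -H -o name,health` and applies the mapping stated in this change's commit message (ONLINE=OK, DEGRADED=WARNING, everything else CRITICAL). The function names are hypothetical, and the plugin itself may collect this data differently.

```python
def parse_zpool_list(text: str) -> dict:
    """Parse `zpool list -H -o name,health` output into {pool_name: health}."""
    pools = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # -H separates fields with a single tab and prints no header.
        name, health = line.split("\t")[:2]
        pools[name] = health.strip()
    return pools


def pool_severity(health: str) -> str:
    """Map a pool health string to an alert severity."""
    if health == "ONLINE":
        return "OK"
    if health == "DEGRADED":
        return "WARNING"
    return "CRITICAL"  # SUSPENDED, FAULTED, UNAVAIL, ...


sample = "tank\tONLINE\nbackup\tDEGRADED\nscratch\tFAULTED\n"
for name, health in parse_zpool_list(sample).items():
    print(f"ZFS pool {name} is {health}: {pool_severity(health)}")
```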
+
+**Override for specific pools** — named pool entries take priority over `'*'`:
+
+```yaml
+zfs_monitor:
+  pools:
+    # Suppress health alerts for a scratch pool (not mission-critical)
+    scratch:
+      status:
+        enabled: false
+
+    # Capacity threshold for a specific pool
+    tank:
+      capacity:
+        warning: 75.0
+        critical: 90.0
+        operator: ">"
+        hysteresis: 0.05
+```
+
+**Alert state paths** follow the pattern `zfs_monitor.<pool_name>.status`,
+so acknowledgements and silences target individual pools:
+
+```
+zfs_monitor.tank.status
+zfs_monitor.backup.status
+```
+
 ### Network Monitor

 ```yaml
@@ -814,34 +864,32 @@ Planned features:

 ## Multi-Threshold Configuration

-**New in version 2.0**: Support for multiple named threshold configurations with per-host mapping.
+Support for multiple named threshold configurations with per-host mapping and composable layering.

 ### Overview

 The multi-threshold feature allows you to:
- Define multiple sets of threshold configurations
- Map different hosts to different threshold sets
+- Define multiple named threshold configurations
+- Assign one or more configurations to each host
+- Compose configurations by layering — each named config's overrides are applied in order on top of the defaults
 - Use different sensitivity levels for different environments
- Maintain a default configuration for unmapped hosts

 ### Configuration Structure

+Named configurations are defined under `threshold_configs`. Each host selects which ones to use via `threshold_config` in the `hosts` section (a string for a single config, or a list to layer multiple):
+
 ```yaml
-# Optional: Set the default configuration name (defaults to "default")
+# Optional: set the default configuration name (defaults to "default")
 default_threshold_config: "default"

-# Define multiple named threshold configurations
 threshold_configs:
-  # Configuration name 1
  default:
    thresholds:
-      # Standard threshold definitions
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

-  # Configuration name 2
  high_sensitivity:
    thresholds:
      cpu_monitor:
@@ -849,7 +897,6 @@ threshold_configs:
          warning: 60.0
          critical: 75.0

-  # Configuration name 3
  low_sensitivity:
    thresholds:
      cpu_monitor:
@@ -857,14 +904,77 @@ threshold_configs:
          warning: 90.0
          critical: 95.0

-# Map specific hosts to specific configurations
-host_threshold_mapping:
-  prod-web-01: high_sensitivity
-  prod-web-02: high_sensitivity
-  dev-server-01: low_sensitivity
-  # Unmapped hosts use default_threshold_config
+hosts:
+  prod-web-01:
+    threshold_config: high_sensitivity   # single config
+
+  dev-server-01:
+    threshold_config: low_sensitivity
+
+  # Hosts with no threshold_config use default_threshold_config
 ```

+### Composable Configurations (list form)
+
+`threshold_config` can be a list. Configs are applied **left to right**: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.
+
+```yaml
+threshold_configs:
+  default:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}
+      disk_monitor:
+        partitions:
+          /:
+            percent: {warning: 80, critical: 90}
+
+  # Tighter CPU limits for busy servers
+  high_cpu_load:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  # Tighter disk limits for data-heavy servers
+  busy_disk:
+    thresholds:
+      disk_monitor:
+        partitions:
+          /:
+            percent: {warning: 70, critical: 85}
+
+hosts:
+  # Gets default thresholds only
+  web-01:
+    threshold_config: default
+
+  # Gets tighter CPU limits, default memory and disk
+  build-server:
+    threshold_config: high_cpu_load
+
+  # Layers both: tighter CPU AND tighter disk, default memory
+  db-01:
+    threshold_config: [high_cpu_load, busy_disk]
+
+  # Three layers: busy_disk overrides high_cpu_load if they conflict
+  storage-01:
+    threshold_config: [default, high_cpu_load, busy_disk]
+```
+
+**How layering works:**
+
+Starting from the `default` thresholds:
+
+| Layer | Applied config | Effect |
+|-------|---------------|--------|
+| Base  | `default` | all default thresholds |
+| +1    | `high_cpu_load` | cpu_percent overridden to 60/75 |
+| +2    | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |
+
+Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.
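
The layering table can be reproduced with a recursive dictionary merge. This is a minimal sketch, not the server's code: `deep_merge` and `effective_thresholds` are hypothetical names, and merging individual keys inside a metric entry (rather than replacing the entry wholesale) is an assumption.

```python
import copy


def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay values win; nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def effective_thresholds(configs: dict, names, default_name: str = "default") -> dict:
    """Layer the named configs left to right on top of the default config."""
    if isinstance(names, str):
        names = [names]
    result = copy.deepcopy(configs[default_name]["thresholds"])
    for name in names:
        result = deep_merge(result, configs[name]["thresholds"])
    return result


configs = {
    "default": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
        "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
    }},
    "high_cpu_load": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}},
    }},
}
result = effective_thresholds(configs, ["high_cpu_load"])
print(result["cpu_monitor"]["cpu_percent"])        # {'warning': 60, 'critical': 75}
print(result["memory_monitor"]["memory_percent"])  # {'warning': 85, 'critical': 95}
```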
+
 ### Use Cases

 #### 1. Environment-Based Thresholds
@@ -887,11 +997,15 @@ threshold_configs:
          warning: 90.0   # More relaxed for dev
          critical: 98.0

-host_threshold_mapping:
-  prod-web-01: production
-  prod-web-02: production
-  dev-web-01: development
-  dev-web-02: development
+hosts:
+  prod-web-01:
+    threshold_config: production
+  prod-web-02:
+    threshold_config: production
+  dev-web-01:
+    threshold_config: development
+  dev-web-02:
+    threshold_config: development
 ```

 #### 2. Server Role-Based Thresholds
@@ -914,7 +1028,7 @@ threshold_configs:
          warning: 70.0
          critical: 85.0
      memory_monitor:
-        percent:
+        memory_percent:
          warning: 90.0   # Databases can use high memory
          critical: 97.0
      disk_monitor:
@@ -927,17 +1041,23 @@ threshold_configs:
  cache:
    thresholds:
      memory_monitor:
-        percent:
+        memory_percent:
          warning: 95.0   # Redis/Memcached can use very high memory
          critical: 99.0

-host_threshold_mapping:
-  web-01: webserver
-  web-02: webserver
-  db-01: database
-  db-02: database
-  redis-01: cache
-  memcached-01: cache
+hosts:
+  web-01:
+    threshold_config: webserver
+  web-02:
+    threshold_config: webserver
+  db-01:
+    threshold_config: database
+  db-02:
+    threshold_config: database
+  redis-01:
+    threshold_config: cache
+  memcached-01:
+    threshold_config: cache
 ```

 #### 3. Sensitivity Levels
@@ -952,7 +1072,7 @@ threshold_configs:
        partitions:
          /:
            percent:
-              warning: 70.0    # Very sensitive
+              warning: 70.0
              critical: 80.0
              hysteresis: 0.15

@@ -976,52 +1096,91 @@ threshold_configs:
              critical: 98.0
              hysteresis: 0.05

-host_threshold_mapping:
-  payment-gateway: critical
-  auth-server: critical
-  web-01: standard
-  web-02: standard
-  test-server: relaxed
+hosts:
+  payment-gateway:
+    threshold_config: critical
+  auth-server:
+    threshold_config: critical
+  web-01:
+    threshold_config: standard
+  web-02:
+    threshold_config: standard
+  test-server:
+    threshold_config: relaxed
 ```

-### Backward Compatibility
+#### 4. Composable Profiles

-The legacy single threshold configuration is fully supported:
+Build host-specific thresholds by combining small, focused configs:

 ```yaml
-# Old format - still works
-thresholds:
-  cpu_monitor:
-    cpu_percent:
-      warning: 80.0
-      critical: 90.0
-```
-
-This is equivalent to:
-
-```yaml
-# New format
 threshold_configs:
+  # Baseline — everything at default levels
  default:
    thresholds:
      cpu_monitor:
-        cpu_percent:
-          warning: 80.0
-          critical: 90.0
-```
+        cpu_percent: {warning: 80, critical: 90}
+      memory_monitor:
+        memory_percent: {warning: 85, critical: 95}

+  # Overlay: tighter CPU only
+  tight_cpu:
+    thresholds:
+      cpu_monitor:
+        cpu_percent: {warning: 60, critical: 75}
+
+  # Overlay: tighter memory only
+  tight_memory:
+    thresholds:
+      memory_monitor:
+        memory_percent: {warning: 70, critical: 85}
+
+  # Overlay: extra disk partition for database servers
+  db_disk:
+    thresholds:
+      disk_monitor:
+        partitions:
+          /var/lib/postgresql:
+            percent: {warning: 75, critical: 88}
+
+hosts:
+  # Plain web server
+  web-01:
+    threshold_config: default
+
+  # Build server: tight CPU, default memory and disk
+  build-01:
+    threshold_config: tight_cpu
+
+  # Database: tight CPU + tight memory + extra disk partition
+  db-01:
+    threshold_config: [tight_cpu, tight_memory, db_disk]
+
+  # Replica database: tight memory + extra disk, normal CPU
+  db-02:
+    threshold_config: [tight_memory, db_disk]
+```
+
 ### Configuration Priority

-1. **Host-specific mapping**: If host is in `host_threshold_mapping`, use that config
-2. **Default config**: Use `default_threshold_config` 
-3. **First alphabetically**: If default not found, use first config alphabetically
-4. **Legacy fallback**: If `threshold_configs` not present, use `thresholds`
+1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
+2. **Host `threshold_config` (string)**: Use that single named config directly
+3. **`host_threshold_mapping`** (legacy): Use the single named config mapped to the host; accepts a string only
+4. **`default_threshold_config`**: Used for hosts with no mapping
+5. **First alphabetically**: If the default config is not found, use the first config alphabetically
+6. **Legacy `thresholds` section**: Used when `threshold_configs` is absent entirely
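
One way to read the priority list is as a resolver that returns the config names to apply for a host. The sketch below is an interpretation, not the server's implementation; `config_names_for_host` is a hypothetical name, and an empty result stands for "fall back to the legacy `thresholds` section".

```python
def config_names_for_host(cfg: dict, host: str) -> list:
    """Return the ordered list of threshold config names for `host`."""
    # 1-2. Per-host threshold_config: a list is layered, a string is used alone.
    host_entry = (cfg.get("hosts") or {}).get(host) or {}
    choice = host_entry.get("threshold_config")
    # 3. Legacy top-level mapping (string only).
    if choice is None:
        choice = (cfg.get("host_threshold_mapping") or {}).get(host)
    if choice is not None:
        return [choice] if isinstance(choice, str) else list(choice)
    # 4. Configured default, if it exists.
    configs = cfg.get("threshold_configs") or {}
    default = cfg.get("default_threshold_config", "default")
    if default in configs:
        return [default]
    # 5. First config alphabetically; 6. empty list when none are defined.
    return sorted(configs)[:1]


cfg = {
    "threshold_configs": {"default": {}, "tight_cpu": {}, "db_disk": {}},
    "hosts": {"db-01": {"threshold_config": ["tight_cpu", "db_disk"]}},
    "host_threshold_mapping": {"old-01": "tight_cpu"},
}
print(config_names_for_host(cfg, "db-01"))  # ['tight_cpu', 'db_disk']
```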

-### Example: Complete Multi-Threshold Setup
+### Backward Compatibility

-See `hbd/config_multi_threshold_example.yaml` for a complete example with:
- 4 named configurations (default, high_sensitivity, low_sensitivity, database)
- Host-to-config mappings for production, development, and test systems
- Specialized database server thresholds
- Custom display messages with plugin data
+The legacy `host_threshold_mapping` top-level key and the flat `thresholds` section are still fully supported:
+
+```yaml
+# Still works — equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
+host_threshold_mapping:
+  prod-web-01: high_sensitivity
+
+# Still works — equivalent to threshold_configs: {default: {thresholds: ...}}
+thresholds:
+  cpu_monitor:
+    cpu_percent: {warning: 80, critical: 90}
+```

@@ -46,6 +46,24 @@ default_owner: andreas            # owns hosts with no explicit owner
                                  # falls back to the first admin user if omitted
 ```

+### Client-declared host ownership
+
+A host can declare its own owner directly in the hbc or hbc_mini client configuration. This is useful for hosts that are not listed in the server config, or during initial setup before a server-side config entry has been created.
+
+**`~/.hbc.yaml`** (hbc):
+```yaml
+owner: andreas
+```
+
+**`~/.hbc.json`** (hbc_mini):
+```json
+{ "owner": "andreas" }
+```
+
+When set, the value is included in the `os_info` plugin data sent to the server. The server applies it as `host.owner` the first time `os_info` arrives, provided no owner has been configured server-side for that host. Server-configured ownership always takes precedence.
+
+---
+
 ### Assigning roles to hosts

 ```yaml
@@ -0,0 +1,781 @@
+# Gitea OAuth2 Authentication Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Add Gitea as an OAuth2 login provider that coexists with password auth, auto-provisioning new users on first login.
+
+**Architecture:** A new `oauth.py` module owns all Gitea-specific logic (CSRF state, URL building, token exchange, user-info fetch). `users.py` gains one function to upsert an OAuth-sourced user. `http.py` gets two new route handlers and a small login-page change. No new dependencies — `aiohttp.ClientSession` is already used in the codebase.
+
+**Tech Stack:** Python 3.12, aiohttp 3.x, pytest, pytest-asyncio
+
+---
+
+## File Map
+
+| Action | Path | Responsibility |
+|--------|------|----------------|
+| Modify | `hbd/server/config.py` | Add `"oauth": {}` default |
+| Create | `hbd/server/oauth.py` | CSRF state, URL builder, token exchange, user-info fetch |
+| Modify | `hbd/server/users.py` | Add `provision_oauth_user()` |
+| Modify | `hbd/server/http.py` | Import oauth, two new routes, login page button |
+| Create | `tests/test_oauth.py` | All new unit tests |
+
+---
+
+## Task 1: Add config default and `is_enabled()`
+
+**Files:**
+- Modify: `hbd/server/config.py:34` (after the `"users"` line)
+- Create: `hbd/server/oauth.py`
+- Create: `tests/test_oauth.py`
+
+- [ ] **Step 1: Write the failing test**
+
+Create `tests/test_oauth.py`:
+
+```python
+import pytest
+from hbd.server import oauth
+
+
+CFG_OFF = {}
+CFG_ON = {
+    "oauth": {
+        "gitea": {
+            "url": "https://git.example.com",
+            "client_id": "cid",
+            "client_secret": "csec",
+        }
+    }
+}
+CFG_PARTIAL = {"oauth": {"gitea": {"url": "https://git.example.com"}}}
+
+
+def test_is_enabled_when_all_keys_present():
+    assert oauth.is_enabled(CFG_ON) is True
+
+
+def test_is_enabled_false_when_no_oauth_key():
+    assert oauth.is_enabled(CFG_OFF) is False
+
+
+def test_is_enabled_false_when_partial_config():
+    assert oauth.is_enabled(CFG_PARTIAL) is False
+```
+
+- [ ] **Step 2: Run to confirm failure**
+
+```
+pytest tests/test_oauth.py -v
+```
+
+Expected: `ModuleNotFoundError: No module named 'hbd.server.oauth'`
+
+- [ ] **Step 3: Add config default**
+
+In `hbd/server/config.py`, add after the `"default_owner"` line (currently line 35):
+
+```python
+    # OAuth2 providers
+    "oauth": {},                 # oauth.gitea.{url,client_id,client_secret}
+```
+
+- [ ] **Step 4: Create `hbd/server/oauth.py` with `is_enabled`**
+
+```python
+"""Gitea OAuth2 support.
+
+Config shape (in ~/.hb.yaml):
+
+    oauth:
+      gitea:
+        url: https://git.example.com
+        client_id: <client-id>
+        client_secret: <client-secret>
+
+Register a Gitea OAuth2 application at:
+  Gitea → Settings → Applications → OAuth2
+Set the redirect URI to:
+  https://<hbd-host>/login/oauth/gitea/callback
+"""
+
+import logging
+import secrets
+import time
+
+import aiohttp
+
+logger = logging.getLogger(__name__)
+
+STATE_TTL = 600  # 10 minutes
+
+# state_token -> expiry timestamp
+_states: dict[str, float] = {}
+
+
+class OAuthError(Exception):
+    """Raised when the OAuth2 flow fails for any reason."""
+
+
+def _gitea_cfg(config: dict) -> dict:
+    """Return the gitea sub-dict, or {} if the key is absent or null."""
+    return (config.get("oauth") or {}).get("gitea") or {}
+
+
+def is_enabled(config: dict) -> bool:
+    """Return True when all three required Gitea OAuth keys are present."""
+    g = _gitea_cfg(config)
+    return bool(g.get("url") and g.get("client_id") and g.get("client_secret"))
+```
+
+- [ ] **Step 5: Run to confirm tests pass**
+
+```
+pytest tests/test_oauth.py -v
+```
+
+Expected: 3 passed
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add hbd/server/config.py hbd/server/oauth.py tests/test_oauth.py
+git commit -m "feat: add oauth module skeleton and is_enabled()"
+```
+
+---
+
+## Task 2: CSRF state management
+
+**Files:**
+- Modify: `hbd/server/oauth.py` (add `make_state`, `validate_state`)
+- Modify: `tests/test_oauth.py` (add state tests)
+
+- [ ] **Step 1: Write the failing tests**
+
+Append to `tests/test_oauth.py`:
+
+```python
+import time as time_mod
+
+
+def test_make_state_returns_unique_tokens():
+    s1 = oauth.make_state()
+    s2 = oauth.make_state()
+    assert s1 != s2
+    assert len(s1) == 64  # 32 bytes hex
+
+
+def test_validate_state_valid():
+    state = oauth.make_state()
+    assert oauth.validate_state(state) is True
+
+
+def test_validate_state_consumed_on_use():
+    state = oauth.make_state()
+    oauth.validate_state(state)
+    assert oauth.validate_state(state) is False  # replay rejected
+
+
+def test_validate_state_unknown():
+    assert oauth.validate_state("notastate") is False
+
+
+def test_validate_state_expired(monkeypatch):
+    state = oauth.make_state()
+    # Wind expiry into the past
+    monkeypatch.setitem(oauth._states, state, time_mod.time() - 1)
+    assert oauth.validate_state(state) is False
+```
+
+- [ ] **Step 2: Run to confirm failure**
+
+```
+pytest tests/test_oauth.py -v -k "state"
+```
+
+Expected: `AttributeError: module 'hbd.server.oauth' has no attribute 'make_state'`
+
+- [ ] **Step 3: Implement state functions**
+
+Add to `hbd/server/oauth.py` after the `_states` dict definition:
+
+```python
+def make_state() -> str:
+    """Generate a CSRF state token, store it with TTL, and return it."""
+    _purge_states()
+    token = secrets.token_hex(32)
+    _states[token] = time.time() + STATE_TTL
+    return token
+
+
+def validate_state(state: str) -> bool:
+    """Return True if *state* is known and unexpired; always removes it."""
+    expiry = _states.pop(state, None)
+    if expiry is None:
+        return False
+    return time.time() < expiry
+
+
+def _purge_states() -> None:
+    now = time.time()
+    expired = [k for k, exp in list(_states.items()) if exp < now]
+    for k in expired:
+        del _states[k]
+```
+
+- [ ] **Step 4: Run to confirm tests pass**
+
+```
+pytest tests/test_oauth.py -v
+```
+
+Expected: 8 passed
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/server/oauth.py tests/test_oauth.py
+git commit -m "feat: add OAuth2 CSRF state management"
+```
+
+---
+
+## Task 3: `provision_oauth_user` in users.py
+
+**Files:**
+- Modify: `hbd/server/users.py` (add `provision_oauth_user`)
+- Modify: `tests/test_oauth.py` (add provisioning tests)
+
+- [ ] **Step 1: Write the failing tests**
+
+Append to `tests/test_oauth.py`:
+
+```python
+from hbd.server import users as users_mod
+from hbd.server.users import User
+
+
+def _reset_users(entries=None):
+    users_mod.users = entries or {}
+
+
+def test_provision_oauth_user_new():
+    _reset_users()
+    user = users_mod.provision_oauth_user("gituser", "Git User", "https://example.com/avatar.png")
+    assert user.username == "gituser"
+    assert user.full_name == "Git User"
+    assert user.avatar == "https://example.com/avatar.png"
+    assert user.admin is False
+    assert user.password_hash == ""
+    assert "gituser" in users_mod.users
+
+
+def test_provision_oauth_user_no_password_login():
+    _reset_users()
+    user = users_mod.provision_oauth_user("gituser", "Git User", "")
+    assert user.check_password("anything") is False
+
+
+def test_provision_oauth_user_existing_updates_profile():
+    existing = User(
+        username="alice",
+        full_name="Old Name",
+        avatar="old.png",
+        password_hash="pbkdf2:sha256:1:salt:abc",
+        admin=True,
+        notification_channels=["chan1"],
+    )
+    _reset_users({"alice": existing})
+    user = users_mod.provision_oauth_user("alice", "New Name", "new.png")
+    assert user.full_name == "New Name"
+    assert user.avatar == "new.png"
+    # Preserved
+    assert user.admin is True
+    assert user.password_hash == "pbkdf2:sha256:1:salt:abc"
+    assert user.notification_channels == ["chan1"]
+
+
+def test_provision_oauth_user_does_not_overwrite_with_empty():
+    existing = User(username="bob", full_name="Bob", avatar="bob.png")
+    _reset_users({"bob": existing})
+    user = users_mod.provision_oauth_user("bob", "", "")
+    assert user.full_name == "Bob"
+    assert user.avatar == "bob.png"
+```
+
+- [ ] **Step 2: Run to confirm failure**
+
+```
+pytest tests/test_oauth.py -v -k "provision"
+```
+
+Expected: `AttributeError: module 'hbd.server.users' has no attribute 'provision_oauth_user'`
+
+- [ ] **Step 3: Implement `provision_oauth_user`**
+
+Add to `hbd/server/users.py` after the `authenticate()` function (after line 187):
+
+```python
+def provision_oauth_user(username: str, full_name: str, avatar: str) -> "User":
+    """Create or update a user sourced from an OAuth2 provider.
+
+    New users are inserted with no password_hash — they can only authenticate
+    via OAuth.  Existing users (e.g. defined in config with a password) have
+    their display name and avatar refreshed; all other attributes are preserved.
+    """
+    user = users.get(username)
+    if user is None:
+        user = User(username=username, full_name=full_name, avatar=avatar)
+        users[username] = user
+        logger.info("Provisioned OAuth user %r", username)
+    else:
+        if full_name:
+            user.full_name = full_name
+        if avatar:
+            user.avatar = avatar
+    return user
+```
+
+- [ ] **Step 4: Run to confirm tests pass**
+
+```
+pytest tests/test_oauth.py -v
+```
+
+Expected: 12 passed
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/server/users.py tests/test_oauth.py
+git commit -m "feat: add provision_oauth_user() to users module"
+```
+
+---
+
+## Task 4: URL builder, token exchange, and user-info fetch
+
+**Files:**
+- Modify: `hbd/server/oauth.py` (add `authorization_url`, `exchange_code`, `fetch_user`)
+- Modify: `tests/test_oauth.py` (add async tests with mocked HTTP)
+
+- [ ] **Step 1: Write the failing tests**
+
+Append to `tests/test_oauth.py`:
+
+```python
+import pytest
+from unittest.mock import AsyncMock, MagicMock, patch
+from urllib.parse import urlparse, parse_qs
+
+
+def test_authorization_url_shape():
+    state = "teststate"
+    redirect_uri = "https://hbd.example.com/login/oauth/gitea/callback"
+    url = oauth.authorization_url(CFG_ON, state, redirect_uri)
+    parsed = urlparse(url)
+    qs = parse_qs(parsed.query)
+    assert parsed.scheme == "https"
+    assert parsed.netloc == "git.example.com"
+    assert parsed.path == "/login/oauth/authorize"
+    assert qs["client_id"] == ["cid"]
+    assert qs["state"] == ["teststate"]
+    assert qs["redirect_uri"] == [redirect_uri]
+    assert qs["scope"] == ["user:email"]
+    assert qs["response_type"] == ["code"]
+
+
+@pytest.mark.asyncio
+async def test_exchange_code_returns_token():
+    redirect_uri = "https://hbd.example.com/login/oauth/gitea/callback"
+    mock_response = AsyncMock()
+    mock_response.status = 200
+    mock_response.json = AsyncMock(return_value={"access_token": "tok123"})
+
+    mock_session = MagicMock()
+    mock_session.post = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+
+    with patch("hbd.server.oauth.aiohttp.ClientSession", return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_session),
+        __aexit__=AsyncMock(return_value=False),
+    )):
+        token = await oauth.exchange_code(CFG_ON, "mycode", redirect_uri)
+    assert token == "tok123"
+
+
+@pytest.mark.asyncio
+async def test_exchange_code_raises_on_error_status():
+    redirect_uri = "https://hbd.example.com/login/oauth/gitea/callback"
+    mock_response = AsyncMock()
+    mock_response.status = 401
+    mock_response.text = AsyncMock(return_value="unauthorized")
+
+    mock_session = MagicMock()
+    mock_session.post = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+
+    with patch("hbd.server.oauth.aiohttp.ClientSession", return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_session),
+        __aexit__=AsyncMock(return_value=False),
+    )):
+        with pytest.raises(oauth.OAuthError):
+            await oauth.exchange_code(CFG_ON, "badcode", redirect_uri)
+
+
+@pytest.mark.asyncio
+async def test_fetch_user_returns_profile():
+    mock_response = AsyncMock()
+    mock_response.status = 200
+    mock_response.json = AsyncMock(return_value={
+        "login": "alice",
+        "full_name": "Alice Smith",
+        "avatar_url": "https://git.example.com/avatars/alice.png",
+    })
+
+    mock_session = MagicMock()
+    mock_session.get = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+
+    with patch("hbd.server.oauth.aiohttp.ClientSession", return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_session),
+        __aexit__=AsyncMock(return_value=False),
+    )):
+        profile = await oauth.fetch_user(CFG_ON, "tok123")
+    assert profile == {
+        "login": "alice",
+        "full_name": "Alice Smith",
+        "avatar_url": "https://git.example.com/avatars/alice.png",
+    }
+```
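
The three async tests above repeat the same nested mock setup. A minimal sketch of how it could be factored out, using only `unittest.mock`, is shown here. The helper names `async_cm` and `fake_client_session` are hypothetical and not part of the plan; the sketch assumes the code under test only uses `async with` on the session and on the request call, and reads `status`, `json()` and `text()` from the response.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


def async_cm(value):
    """Return a mock usable as `async with <mock> as value`."""
    cm = MagicMock()
    cm.__aenter__.return_value = value
    cm.__aexit__.return_value = False  # do not swallow exceptions
    return cm


def fake_client_session(method, status, json_body=None, text_body=""):
    """Hypothetical helper: stand-in for the object returned by
    aiohttp.ClientSession(...), whose `method` call yields one canned response."""
    response = MagicMock()
    response.status = status
    response.json = AsyncMock(return_value=json_body)
    response.text = AsyncMock(return_value=text_body)

    session = MagicMock()
    getattr(session, method).return_value = async_cm(response)
    return async_cm(session)


async def _demo():
    # Mirrors the shape of exchange_code(): session context, then request context.
    session_cm = fake_client_session("post", 200, {"access_token": "tok123"})
    async with session_cm as session:
        async with session.post("https://git.example.com/login/oauth/access_token") as resp:
            return resp.status, await resp.json()


demo_result = asyncio.run(_demo())
```

In a test, the returned object would be passed as `return_value=` to `patch("hbd.server.oauth.aiohttp.ClientSession", ...)`, in place of the inline `AsyncMock(...)` construction.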
+
+- [ ] **Step 2: Run to confirm failure**
+
+```
+pytest tests/test_oauth.py -v -k "url or exchange or fetch"
+```
+
+Expected: `AttributeError: module 'hbd.server.oauth' has no attribute 'authorization_url'`
+
+- [ ] **Step 3: Implement the three functions**
+
+Add to `hbd/server/oauth.py`:
+
+```python
+import asyncio
+import urllib.parse
+
+
+def authorization_url(config: dict, state: str, redirect_uri: str) -> str:
+    """Return the Gitea OAuth2 authorization URL to redirect the browser to."""
+    g = _gitea_cfg(config)
+    params = urllib.parse.urlencode({
+        "client_id": g["client_id"],
+        "redirect_uri": redirect_uri,
+        "response_type": "code",
+        "scope": "user:email",
+        "state": state,
+    })
+    return f"{g['url'].rstrip('/')}/login/oauth/authorize?{params}"
+
+
+async def exchange_code(config: dict, code: str, redirect_uri: str) -> str:
+    """Exchange an authorization *code* for a Gitea access token.
+
+    Returns the access token string.  Raises OAuthError on any failure.
+    """
+    g = _gitea_cfg(config)
+    url = f"{g['url'].rstrip('/')}/login/oauth/access_token"
+    payload = {
+        "client_id": g["client_id"],
+        "client_secret": g["client_secret"],
+        "code": code,
+        "grant_type": "authorization_code",
+        "redirect_uri": redirect_uri,
+    }
+    timeout = aiohttp.ClientTimeout(total=10)
+    try:
+        async with aiohttp.ClientSession(timeout=timeout) as session:
+            async with session.post(url, json=payload, headers={"Accept": "application/json"}) as resp:
+                if resp.status != 200:
+                    text = await resp.text()
+                    raise OAuthError(f"Token exchange failed ({resp.status}): {text}")
+                data = await resp.json()
+    # The total timeout surfaces as asyncio.TimeoutError, which is not a ClientError.
+    except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
+        raise OAuthError(f"Token exchange network error: {exc}") from exc
+    token = data.get("access_token")
+    if not token:
+        raise OAuthError(f"No access_token in response: {data}")
+    return token
+
+
+async def fetch_user(config: dict, token: str) -> dict:
+    """Fetch the authenticated user's profile from Gitea.
+
+    Returns a dict with keys: login, full_name, avatar_url.
+    Raises OAuthError on any failure.
+    """
+    g = _gitea_cfg(config)
+    url = f"{g['url'].rstrip('/')}/api/v1/user"
+    timeout = aiohttp.ClientTimeout(total=10)
+    try:
+        async with aiohttp.ClientSession(timeout=timeout) as session:
+            async with session.get(url, headers={"Authorization": f"token {token}"}) as resp:
+                if resp.status != 200:
+                    text = await resp.text()
+                    raise OAuthError(f"User fetch failed ({resp.status}): {text}")
+                data = await resp.json()
+    # The total timeout surfaces as asyncio.TimeoutError, which is not a ClientError.
+    except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
+        raise OAuthError(f"User fetch network error: {exc}") from exc
+    return {
+        "login": data.get("login", ""),
+        "full_name": data.get("full_name", ""),
+        "avatar_url": data.get("avatar_url", ""),
+    }
+```
+
+Place the import line(s) shown at the start of this block at the top of `oauth.py` (alongside the existing imports), not mid-module.
+
+- [ ] **Step 4: Run to confirm tests pass**
+
+```
+pytest tests/test_oauth.py -v
+```
+
+Expected: 17 passed
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/server/oauth.py tests/test_oauth.py
+git commit -m "feat: add authorization_url, exchange_code, fetch_user to oauth module"
+```
+
+---
+
+## Task 5: HTTP routes — redirect and callback
+
+**Files:**
+- Modify: `hbd/server/http.py`
+
+`http.py` defines all handlers inside `async def start(...)`. The two new handlers go in the same block, just before the `app = web.Application()` line (~line 900). The import goes at the top of the file.
+
+- [ ] **Step 1: Add the import**
+
+In `hbd/server/http.py`, add after the existing local imports (after `from . import users as users_mod`):
+
+```python
+from . import oauth as oauth_mod
+```
+
+- [ ] **Step 2: Add the two route handlers**
+
+In `hbd/server/http.py`, add the two handlers immediately before the `app = web.Application()` line:
+
+```python
+    async def oauth_gitea_redirect(request):
+        """GET /login/oauth/gitea — kick off the Gitea OAuth2 flow."""
+        if not oauth_mod.is_enabled(config):
+            return web.Response(status=404, text="OAuth not configured")
+        state = oauth_mod.make_state()
+        redirect_uri = f"{request.url.origin()}/login/oauth/gitea/callback"
+        raise web.HTTPFound(oauth_mod.authorization_url(config, state, redirect_uri))
+
+    async def oauth_gitea_callback(request):
+        """GET /login/oauth/gitea/callback — handle Gitea's redirect back."""
+        if not oauth_mod.is_enabled(config):
+            return web.Response(status=404, text="OAuth not configured")
+        code = request.rel_url.query.get("code", "")
+        state = request.rel_url.query.get("state", "")
+        if not code or not state:
+            return web.Response(status=400, text="Missing code or state")
+        if not oauth_mod.validate_state(state):
+            raise web.HTTPFound("/login?error=1")
+        redirect_uri = f"{request.url.origin()}/login/oauth/gitea/callback"
+        try:
+            token = await oauth_mod.exchange_code(config, code, redirect_uri)
+            profile = await oauth_mod.fetch_user(config, token)
+        except oauth_mod.OAuthError as exc:
+            logger.warning("OAuth error: %s", exc)
+            raise web.HTTPFound("/login?error=1")
+        user = users_mod.provision_oauth_user(
+            profile["login"],
+            profile["full_name"],
+            profile["avatar_url"],
+        )
+        session_token = users_mod.create_session(user.username)
+        resp = web.HTTPFound("/")
+        resp.set_cookie(
+            SESSION_COOKIE,
+            session_token,
+            max_age=users_mod.SESSION_TTL,
+            httponly=True,
+            samesite="Lax",
+        )
+        raise resp
+```
+
+- [ ] **Step 3: Register the routes**
+
+In `hbd/server/http.py`, add to the route list after the existing auth routes (after `web.post("/api/0/auth/logout", api_logout)`):
+
+```python
+            web.get("/login/oauth/gitea",          oauth_gitea_redirect),
+            web.get("/login/oauth/gitea/callback", oauth_gitea_callback),
+```
+
+- [ ] **Step 4: Manual smoke test**
+
+Start the server locally with OAuth configured in `~/.hb.yaml`:
+
+```yaml
+oauth:
+  gitea:
+    url: https://your-gitea-instance.example.com
+    client_id: your-client-id
+    client_secret: your-client-secret
+```
+
+Visit `http://localhost:50004/login/oauth/gitea` — confirm you are redirected to Gitea's authorization page.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add hbd/server/http.py
+git commit -m "feat: add Gitea OAuth2 redirect and callback routes"
+```
+
+---
+
+## Task 6: Login page — "Sign in with Gitea" button
+
+**Files:**
+- Modify: `hbd/server/http.py` (update `login_page` handler, ~line 625)
+
+- [ ] **Step 1: Replace the login page HTML**
+
+In `hbd/server/http.py`, find the `html = f"""` block inside `login_page` and replace it with:
+
+```python
+        gitea_button = ""
+        if oauth_mod.is_enabled(config):
+            gitea_url = _gitea_cfg_url(config)
+            gitea_button = f"""
+    <div class="divider">or</div>
+    <a href="/login/oauth/gitea" class="gitea-btn">
+      Sign in with Gitea
+    </a>"""
+
+        html = f"""<!DOCTYPE html>
+<html>
+<head>
+  <meta charset="utf-8">
+  <title>Heartbeat — Login</title>
+  <style>
+    body {{ font-family: sans-serif; background: #f5f5f5; display: flex;
+            justify-content: center; align-items: center; height: 100vh; margin: 0; }}
+    .box {{ background: #fff; padding: 2em 2.5em; border-radius: 8px;
+             box-shadow: 0 2px 12px rgba(0,0,0,.15); min-width: 300px; }}
+    h2 {{ margin: 0 0 1.2em; color: #333; font-size: 1.4em; }}
+    label {{ display: block; margin-bottom: .3em; font-size: .9em; color: #555; }}
+    input {{ width: 100%; padding: .5em .7em; border: 1px solid #ccc;
+              border-radius: 4px; font-size: 1em; box-sizing: border-box; }}
+    button {{ margin-top: 1.2em; width: 100%; padding: .6em; background: #0066cc;
+               color: #fff; border: none; border-radius: 4px; font-size: 1em; cursor: pointer; }}
+    button:hover {{ background: #0055aa; }}
+    .error {{ color: #c00; font-size: .9em; margin-bottom: .8em; }}
+    .field {{ margin-bottom: .9em; }}
+    .divider {{ text-align: center; margin: 1.2em 0 .8em; color: #999;
+                font-size: .85em; border-top: 1px solid #eee; padding-top: .8em; }}
+    .gitea-btn {{ display: block; width: 100%; padding: .6em; background: #609926;
+                  color: #fff; border-radius: 4px; font-size: 1em; text-align: center;
+                  text-decoration: none; box-sizing: border-box; }}
+    .gitea-btn:hover {{ background: #4e7d1e; }}
+  </style>
+</head>
+<body>
+  <div class="box">
+    <h2>Heartbeat</h2>
+    {'<p class="error">Invalid username, password, or OAuth error.</p>' if error else ''}
+    <form method="post">
+      <div class="field"><label>Username</label><input name="username" autofocus></div>
+      <div class="field"><label>Password</label><input name="password" type="password"></div>
+      <button type="submit">Sign in</button>
+    </form>{gitea_button}
+  </div>
+</body>
+</html>"""
+```
+
+- [ ] **Step 2: Add the `_gitea_cfg_url` helper**
+
+Add this small helper in `hbd/server/http.py` just before the `login_page` handler (around line 600) so the template can read the Gitea display URL without importing internal oauth details:
+
+```python
+    # Indented to match the other handlers, which live inside `async def start(...)`.
+    def _gitea_cfg_url(config: dict) -> str:
+        return config.get("oauth", {}).get("gitea", {}).get("url", "")
+```
+
+Also update the `login_page` handler's `error` logic to show the error when the `?error=1` query param is present (set by the callback on OAuth failure):
+
+```python
+    async def login_page(request):
+        """GET /login — show login form; POST /login — process and redirect."""
+        if not users_mod.users_enabled():
+            raise web.HTTPFound("/")
+
+        error = ""
+        if request.method == "POST":
+            form = await request.post()
+            username = form.get("username", "")
+            password = form.get("password", "")
+            user = users_mod.authenticate(username, password)
+            if user:
+                token = users_mod.create_session(username)
+                redirect_to = request.rel_url.query.get("next", "/")
+                resp = web.HTTPFound(redirect_to)
+                resp.set_cookie(
+                    SESSION_COOKIE,
+                    token,
+                    max_age=users_mod.SESSION_TTL,
+                    httponly=True,
+                    samesite="Lax",
+                )
+                raise resp
+            error = "Invalid username or password."
+        elif request.rel_url.query.get("error"):
+            error = "Sign-in failed. Please try again."
+```
+
+- [ ] **Step 3: Manual verification**
+
+Start the server with OAuth configured. Visit `/login`. Confirm:
+- The "Sign in with Gitea" button appears (green, below a divider)
+- Clicking it redirects to Gitea
+- After authorising on Gitea, you are redirected back and land on `/` with a valid session cookie
+
+Without OAuth configured, confirm the button does not appear.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add hbd/server/http.py
+git commit -m "feat: add Sign in with Gitea button to login page"
+```
+
+---
+
+## Self-Review Notes
+
+- All 5 spec requirements covered: coexist ✓, auto-provision ✓, regular user ✓, any Gitea user ✓, config-driven ✓
+- `exchange_code` signature in Task 4 matches usage in Task 5 (`config, code, redirect_uri`) ✓
+- `fetch_user` returns `{login, full_name, avatar_url}` — matched in callback handler ✓
+- `validate_state` removes state on use (replay protection) ✓
+- `provision_oauth_user` skips empty strings so existing avatar/name aren't erased ✓
+- `_gitea_cfg_url` is a plain `def`, not `async` — safe to call in template prep ✓
@@ -0,0 +1,184 @@
+# Gitea OAuth2 Authentication — Design Spec
+
+Date: 2026-05-08
+
+## Overview
+
+Add Gitea as an OAuth2 login provider alongside the existing username/password
+authentication. Any user on the configured Gitea instance can sign in; their
+local account is auto-provisioned on first login as a regular (non-admin) user.
+Password login continues to work unchanged.
+
+---
+
+## Config
+
+A new optional `oauth.gitea` block in `~/.hb.yaml`. OAuth is disabled when the
+block is absent or any of the three required keys is missing.
+
+```yaml
+oauth:
+  gitea:
+    url: https://git.example.com   # Gitea base URL, no trailing slash
+    client_id: <gitea-app-client-id>
+    client_secret: <gitea-app-client-secret>
+```
+
+**Gitea setup:** Create an OAuth2 application in Gitea under
+*Settings → Applications → OAuth2*. Set the redirect URI to
+`https://<hbd-host>/login/oauth/gitea/callback`.
+
+`config.py` default:
+
+```python
+"oauth": {},
+```
+
+---
+
+## New module: `hbd/server/oauth.py`
+
+Owns all OAuth2 logic. No new dependencies — uses `aiohttp.ClientSession`
+already present in the codebase.
+
+### CSRF state store
+
+```python
+# state -> expires (float)
+_states: dict[str, float] = {}
+STATE_TTL = 600  # 10 minutes
+```
+
+`_states` is an in-memory dict. Entries are created on redirect and deleted on
+use or expiry. A purge runs on every new state generation.
+
+### Public API
+
+| Function | Description |
+|---|---|
+| `is_enabled(config)` | Returns `True` when url, client_id, and client_secret are all set |
+| `make_state()` | Generates a random state token, stores it with TTL, returns it |
+| `validate_state(state)` | Returns `True` and removes the state if valid and unexpired |
+| `authorization_url(config, state, redirect_uri)` | Builds the Gitea `/login/oauth/authorize` redirect URL with `client_id`, `redirect_uri`, `response_type=code`, `scope=user:email`, `state` |
+| `exchange_code(config, code, redirect_uri)` async | POSTs to Gitea `/login/oauth/access_token` with code and redirect_uri, returns the access token string or raises `OAuthError` |
+| `fetch_user(config, token)` async | GETs Gitea `/api/v1/user` with the access token in the `Authorization` header, returns `{"login", "full_name", "avatar_url"}` or raises `OAuthError` |
+
+### Error handling
+
+`OAuthError(message)` is a module-level exception. The callback route catches it
+and renders the login page with an error message — identical to an invalid
+password error in UX terms.
+
+Network timeouts use a 10-second `aiohttp` timeout. Any non-2xx response from
+Gitea raises `OAuthError`.
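
For a timeout to end up as an `OAuthError`, it has to be caught separately from connection errors: aiohttp's total timeout is documented to raise `asyncio.TimeoutError`, which is not an `aiohttp.ClientError`. The sketch below illustrates the mapping with the standard library only. `asyncio.wait_for` stands in for the aiohttp timeout, and `call_with_timeout` is a hypothetical helper, not part of the module.

```python
import asyncio


class OAuthError(Exception):
    """Stand-in for the module-level exception described above."""


async def call_with_timeout(coro, seconds: float):
    """Await *coro*; convert a timeout into OAuthError (hypothetical helper)."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError as exc:
        raise OAuthError(f"timed out after {seconds}s") from exc


async def _fast():
    return "ok"


async def _slow():
    await asyncio.sleep(1)  # longer than the timeout used below
    return "never returned"


def run_demo():
    fast = asyncio.run(call_with_timeout(_fast(), 0.5))
    try:
        asyncio.run(call_with_timeout(_slow(), 0.01))
    except OAuthError:
        slow = "OAuthError"
    else:
        slow = "no error"
    return fast, slow
```

With this shape the callback route needs only a single `except OAuthError` branch, regardless of whether Gitea was slow, unreachable, or returned an error status.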
+
+---
+
+## Change: `hbd/server/users.py`
+
+One new function added to the public API:
+
+```python
+def provision_oauth_user(username: str, full_name: str, avatar: str) -> User:
+```
+
+- If the username does not exist in the live `users` dict, creates a `User`
+  with no `password_hash` (so password login is impossible for this account)
+  and inserts it.
+- If the username already exists (e.g. was defined in config with a password),
+  updates `full_name` and `avatar` from the OAuth profile and returns the
+  existing user unchanged in all other respects (preserving admin flag,
+  notification channels, etc.).
+- Logs a one-line INFO message on first provision.
+
+---
+
+## Changes: `hbd/server/http.py`
+
+### Two new route handlers
+
+**`GET /login/oauth/gitea`**
+
+1. Checks `oauth.is_enabled(config)` — returns 404 if not.
+2. Calls `oauth.make_state()`.
+3. Constructs `redirect_uri` as `{request.url.origin()}/login/oauth/gitea/callback` using aiohttp's `request.url.origin()`.
+4. Redirects the browser to `oauth.authorization_url(config, state, redirect_uri)`.
+
+**`GET /login/oauth/gitea/callback`**
+
+1. Reads `code` and `state` query params; returns 400 if either is missing.
+2. Calls `oauth.validate_state(state)` — redirects to `/login` with error if
+   invalid (CSRF or replay protection).
+3. Reconstructs the same `redirect_uri` as the redirect handler (required by OAuth2 spec for token exchange).
+4. Calls `await oauth.exchange_code(config, code, redirect_uri)` to get the access token.
+5. Calls `await oauth.fetch_user(config, token)` to get the Gitea user profile.
+6. Calls `users_mod.provision_oauth_user(login, full_name, avatar_url)`.
+7. Calls `users_mod.create_session(username)` to get a session token.
+8. Sets `hbd_session` cookie (same flags as password login: httponly, Lax,
+   24h TTL).
+9. Redirects to `/`.
+10. Any `OAuthError` re-renders the login page with a generic error message.
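
The cookie flags named in the steps above can be made concrete with a small standard-library sketch of the resulting `Set-Cookie` value. It is only an illustration: the real handler sets the cookie through aiohttp, `session_cookie_header` is a hypothetical name, and the `Path=/` attribute is an assumption rather than something the spec states.

```python
from http.cookies import SimpleCookie

SESSION_COOKIE = "hbd_session"   # cookie name from the spec
SESSION_TTL = 24 * 60 * 60       # 24h TTL from the spec, in seconds


def session_cookie_header(token: str) -> str:
    """Return the Set-Cookie value for a session token (illustrative only)."""
    cookie = SimpleCookie()
    cookie[SESSION_COOKIE] = token
    morsel = cookie[SESSION_COOKIE]
    morsel["max-age"] = SESSION_TTL
    morsel["httponly"] = True    # not readable from JavaScript
    morsel["samesite"] = "Lax"   # still sent on top-level navigations such as the OAuth redirect back
    morsel["path"] = "/"         # assumption: cookie valid for the whole site
    return morsel.OutputString()


header = session_cookie_header("abc123")
```

`SameSite=Lax` matters for this flow because the browser arrives at the callback via a cross-site redirect from Gitea, and the cookie set there must be sent on the following navigation to `/`.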
+
+### Login page change
+
+When `oauth.is_enabled(config)` is `True`, the existing login form gains a
+separator and a "Sign in with Gitea" link button pointing to
+`/login/oauth/gitea`. The password form is always rendered regardless.
+
+### Route registration
+
+```python
+web.get("/login/oauth/gitea",          oauth_gitea_redirect),
+web.get("/login/oauth/gitea/callback", oauth_gitea_callback),
+```
+
+Added alongside the existing `/login` and `/logout` routes.
+
+---
+
+## Data flow
+
+```
+Browser                    hbd                        Gitea
+  |                          |                           |
+  |-- GET /login ----------->|                           |
+  |<- login page (+ button) -|                           |
+  |                          |                           |
+  |-- GET /login/oauth/gitea>|                           |
+  |<- 302 Gitea /authorize --|                           |
+  |                          |                           |
+  |-- GET /login/oauth/authorize ----------------------->|
+  |<- 302 /login/oauth/gitea/callback?code=..&state=.. --|
+  |                          |                           |
+  |-- GET /callback -------->|                           |
+  |                          |-- POST /access_token ---->|
+  |                          |<- {access_token} ---------|
+  |                          |-- GET /api/v1/user ------>|
+  |                          |<- {login, name, avatar} --|
+  |                          | provision_oauth_user()    |
+  |                          | create_session()          |
+  |<- 302 / (set cookie) ----|                           |
+```
+
+---
+
+## Testing
+
+- `test_oauth_state`: `make_state` + `validate_state` happy path; expired state
+  returns False; replay (double-use) returns False.
+- `test_provision_oauth_user_new`: new username creates User with no password.
+- `test_provision_oauth_user_existing`: existing config user updates name/avatar,
+  preserves admin flag and notification_channels.
+- `test_oauth_callback_invalid_state`: callback with bad state redirects to login.
+- Integration: mock Gitea endpoints with `aiohttp_client` fixture; full
+  redirect → callback → session cookie flow.
+
+---
+
+## Out of scope
+
+- Restricting login to specific Gitea organisations or teams.
+- Making OAuth users admin automatically.
+- Multiple OAuth providers.
+- Token refresh (Gitea access tokens are long-lived; the hbd session TTL governs
+  re-authentication).
@@ -14,4 +14,4 @@ Install options:
 """

 __all__ = ["__version__"]
-__version__ = "5.1.7"
+__version__ = "5.2.5"
@@ -16,6 +16,9 @@ CLIENT_DEFAULTS = {
    "hb_port": 50003,          # Port where hbd servers listen
    "interval": 10,             # Heartbeat interval in seconds

+    # Host identity
+    "owner": None,             # Optional username to set as this host's owner on the server
+
    # Runtime flags
    "foreground": False,
    "verbose": False,
@@ -21,6 +21,7 @@ from typing import Dict, List, Optional
 # Import protocol and config
 from .config import load_config
 from ..common.proto import dicttos, stodict
+from .. import __version__

 # Import plugin system
 from .plugin import PluginRegistry, PluginLoader, InfoPlugin, MonitorPlugin
@@ -56,6 +57,9 @@ class AsyncConnection:
        self.transport: Optional[asyncio.DatagramTransport] = None
        self.protocol: Optional[asyncio.DatagramProtocol] = None
        self._dead = False
+        self._ever_opened = False
+        self._open_fail_count = 0   # consecutive failures before first success
+        self.request_info_event: asyncio.Event = asyncio.Event()

        self.logger = logging.getLogger(f"hbc.conn.{addr}")

@@ -73,6 +77,7 @@ class AsyncConnection:
                lambda: HeartbeatProtocol(self),
                family=self.af
            )
+            self._ever_opened = True
            self.logger.debug(f"Opened connection to {self.addr}:{self.port}")
            return True
        except Exception as e:
@@ -134,6 +139,9 @@ class AsyncConnection:
        
        self.ackcount += 1
        self.logger.debug(f"ACK received, RTT: {rtt:.1f}ms")
+        if msg.get("request_update"):
+            self.logger.info("server requested plugin info refresh")
+            self.request_info_event.set()


 class HeartbeatProtocol(asyncio.DatagramProtocol):
@@ -169,9 +177,8 @@ class HeartbeatProtocol(asyncio.DatagramProtocol):
            self.logger.error(f"Error processing datagram: {e}", exc_info=True)
    
    def error_received(self, exc):
-        """Handle protocol errors."""
-        self.logger.warning(f"Protocol error on {self.connection.addr}: {exc} — dropping connection")
-        self.connection._dead = True
+        """Handle protocol errors — close transport so the heartbeat sender retries."""
+        self.logger.warning(f"Protocol error on {self.connection.addr}: {exc} — will retry")
        self.connection.close()


@@ -262,15 +269,51 @@ async def handle_update(conn: AsyncConnection, _msg: dict):  # pyright: ignore[r


 async def heartbeat_sender(conn: AsyncConnection, interval: int):
-    """Send periodic heartbeats.
+    """Send periodic heartbeats, retrying the connection if it is not open.
+
+    IPv6 connections that fail to open before their first successful send are
+    dropped after IPV6_EARLY_FAIL_LIMIT attempts so that a network without IPv6
+    does not keep a dead sender alive.  IPv4 connections are retried indefinitely.

    Args:
        conn: Connection to send on
        interval: Heartbeat interval in seconds
    """
    logger = logging.getLogger("hbc.heartbeat")
+    IPV6_EARLY_FAIL_LIMIT = 3
+
+    while running and not conn._dead:
+        # Ensure transport is open before attempting to send.
+        if not conn.transport:
+            opened = await conn.open()
+            if opened:
+                conn._open_fail_count = 0
+            else:
+                conn._open_fail_count += 1
+                # Drop an IPv6 connection that has never come up within the
+                # first few attempts — it is likely unavailable on this network.
+                if (not conn._ever_opened
+                        and conn.af == socket.AF_INET6
+                        and conn._open_fail_count >= IPV6_EARLY_FAIL_LIMIT):
+                    logger.warning(
+                        f"IPv6 connection to {conn.addr} unreachable after "
+                        f"{conn._open_fail_count} attempts, disabling"
+                    )
+                    conn._dead = True
+                    break
+                # Retry after the normal interval; IPv4 retries forever.
+                try:
+                    if shutdown_event:
+                        await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
+                        break
+                    else:
+                        await asyncio.sleep(interval)
+                except asyncio.TimeoutError:
+                    pass
+                except asyncio.CancelledError:
+                    raise
+                continue

-    while running:
        try:
            msg = {
                "acks": conn.ackcount,
@@ -279,19 +322,16 @@ async def heartbeat_sender(conn: AsyncConnection, interval: int):
            }
            await conn.sendto(msg, "HTB")

-        except Exception as e:
-            logger.error(f"Error sending heartbeat: {e}", exc_info=True)
        except asyncio.CancelledError:
            logger.debug("Heartbeat sender cancelled")
            raise
+        except Exception as e:
+            logger.error(f"Error sending heartbeat: {e}", exc_info=True)

        # Wait for next interval or shutdown event
        try:
            if shutdown_event:
-                await asyncio.wait_for(
-                    shutdown_event.wait(), 
-                    timeout=interval
-                )
+                await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
                break
            else:
                await asyncio.sleep(interval)
@@ -302,6 +342,26 @@ async def heartbeat_sender(conn: AsyncConnection, interval: int):
            raise


+async def _info_plugin_refresh_loop(conn: AsyncConnection, info_plugins: List):
+    """Wait for server requests to re-send InfoPlugin data."""
+    logger = logging.getLogger("hbc.plugins")
+    while running:
+        await conn.request_info_event.wait()
+        if not running:
+            break
+        conn.request_info_event.clear()
+        logger.info("Refreshing InfoPlugins on server request")
+        for plugin in info_plugins:
+            plugin._cache = None
+            try:
+                data = await plugin.collect()
+                if data:
+                    await conn.sendto({"plugin": plugin.name, **data}, "PLG")
+                    logger.info(f"Resent {plugin.name} data")
+            except Exception as e:
+                logger.error(f"Error re-collecting {plugin.name}: {e}", exc_info=True)
+
+
 async def plugin_collector(conn: AsyncConnection, registry: PluginRegistry):
    """Collect and send plugin data.

@@ -333,16 +393,13 @@ async def plugin_collector(conn: AsyncConnection, registry: PluginRegistry):
    for plugin in monitor_plugins:
        by_interval[plugin.interval].append(plugin)

-    # Create tasks for each interval
-    tasks = []
+    # Create tasks for each interval; always include the info-refresh watcher
+    tasks = [asyncio.create_task(_info_plugin_refresh_loop(conn, info_plugins))]
    for interval, plugins in by_interval.items():
-        task = asyncio.create_task(
+        tasks.append(asyncio.create_task(
            plugin_collector_interval(conn, plugins, interval)
-        )
-        tasks.append(task)
+        ))

-    # Wait for all tasks
-    if tasks:
    try:
        await asyncio.gather(*tasks, return_exceptions=True)
    except asyncio.CancelledError:
@@ -427,16 +484,13 @@ async def cleanup(connections: List[AsyncConnection]):
    logger = logging.getLogger("hbc.cleanup")
    logger.info("Cleaning up connections")
    
-    for conn in connections:
+    target = next((c for c in connections if c.transport), connections[0] if connections else None)
+    if target and send_shutdown:
        try:
-            msg = {
-                "shutdown": 1,
-                "acks": conn.ackcount
-            }
-            await conn.sendto(msg)
+            await target.sendto({"shutdown": 1, "acks": target.ackcount})
        except Exception as e:
            logger.error(f"Error sending shutdown: {e}")
-        
+    for conn in connections:
        conn.close()
    
    # Give messages time to send
@@ -445,7 +499,7 @@ async def cleanup(connections: List[AsyncConnection]):

 async def async_main(args, config):
    """Async main function."""
-    global running, shutdown_event, active_tasks
+    global running, shutdown_event, active_tasks, send_shutdown
    
    # Create shutdown event
    shutdown_event = asyncio.Event()
@@ -462,8 +516,7 @@ async def async_main(args, config):
    hb_port = config.get("hb_port", PORT)
    interval = config.get("interval", INTERVAL)
    
-    logger.info(f"Starting hbc for {iam} -> {hb_hosts}")
-    logger.info(f"Port: {hb_port}, Interval: {interval}s")
+    logger.info(f"hbc {__version__} on {iam} -> {hb_hosts} port={hb_port}, interval={interval}s")
    
    # Create connections
    connections = []
@@ -481,28 +534,32 @@ async def async_main(args, config):
            addr = addr_info[4][0]

            conn = AsyncConnection(conn_id, addr, hb_port, af, iam)
-            if await conn.open():
+            if not await conn.open():
+                logger.warning(f"Initial open to {addr} failed, heartbeat sender will retry")
            connections.append(conn)
            conn_id += 1

    if not connections:
-        logger.error("No connections established")
+        logger.error("No connections established (DNS resolution failed for all hosts)")
        return 1
    
    logger.info(f"Created {len(connections)} connections")
    
    # Send boot/message if requested
+    send_shutdown = False
    if args.boot or args.message:
        boot_msg = {}
        if args.boot:
            boot_msg["boot"] = 1
+            args.boot = False  # Clear boot flag so we don't send it again in main loop
+            send_shutdown = True
        if args.message:
            boot_msg["service"] = "service"
            boot_msg["msg"] = args.message
        
        boot_msg["acks"] = 0
-        for conn in connections:
-            await conn.sendto(boot_msg)
+        target = next((c for c in connections if c.transport), connections[0])
+        await target.sendto(boot_msg)
        
        if args.message and not args.daemon:
            # Message-only mode
@@ -525,6 +582,13 @@ async def async_main(args, config):
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop)

+    def _sighup():
+        global dorestart
+        dorestart = True
+        stop()
+
+    loop.add_signal_handler(signal.SIGHUP, _sighup)
+    
    # Start async tasks
    # Heartbeat senders (one per connection)
    for conn in connections:
@@ -695,7 +759,7 @@ def main(argv=None):
    
    # Daemonize if requested
    if args.daemon:
-        print("Daemonizing...")
+        logging.info("Daemonizing...")
        daemonize()
        _reconfigure_logging_for_daemon(log_level)
        logging.info(f"hbc starting, sending heartbeat to {', '.join(args.hosts)}")
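
The retry rule that `heartbeat_sender` adopts in this change can be read as a small predicate. The sketch below is a standalone illustration, not code from hbc: `should_drop` and its parameters are hypothetical names. Only the rule itself comes from the diff: an IPv6 connection that has never opened is dropped after three failed opens, while IPv4 is retried indefinitely.

```python
import socket

IPV6_EARLY_FAIL_LIMIT = 3  # same value the diff uses


def should_drop(af: int, ever_opened: bool, open_fail_count: int,
                limit: int = IPV6_EARLY_FAIL_LIMIT) -> bool:
    """Return True when a connection should stop retrying (hypothetical helper)."""
    return (not ever_opened
            and af == socket.AF_INET6
            and open_fail_count >= limit)


# A never-opened IPv6 connection is dropped at the limit; IPv4 never is.
print(should_drop(socket.AF_INET6, False, 3))
print(should_drop(socket.AF_INET, False, 50))
```

An IPv6 connection that has opened at least once is also retried indefinitely, because `ever_opened` short-circuits the check.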
@@ -364,7 +364,10 @@ class PluginLoader:
                    
                    # Instantiate plugin with config — check plugins subdict first,
                    # then top-level keys (e.g. nagios_runner: ... at root of config).
-                    plugin_instance_config = plugins_subconfig.get(obj.name) or raw_config.get(obj.name, {})
+                    plugin_instance_config = dict(plugins_subconfig.get(obj.name) or raw_config.get(obj.name) or {})
+                    # Propagate top-level owner so os_info (and any future plugin) can report it.
+                    if "owner" in raw_config and "owner" not in plugin_instance_config:
+                        plugin_instance_config["owner"] = raw_config["owner"]
                    plugin = obj(config=plugin_instance_config)
                    
                    # Initialize plugin
@@ -119,6 +119,13 @@ class CPUMonitorPlugin(MonitorPlugin):
            except Exception as e:
                self.logger.debug(f"Could not get CPU times: {e}")

+            # Uptime in seconds
+            try:
+                import time
+                data["uptime_seconds"] = int(time.time() - self.psutil.boot_time())
+            except Exception as e:
+                self.logger.debug(f"Could not get uptime: {e}")
+            
            self.logger.debug(
                f"Collected CPU metrics: {data.get('cpu_percent', 'N/A')}% usage"
            )
@@ -14,6 +14,24 @@ except ImportError:

 from hbd.client.plugin import MonitorPlugin

+
+def _zfs_arc_bytes() -> int:
+    """Return current ZFS ARC size in bytes, or 0 if ZFS is not present.
+
+    ZFS ARC is reclaimable but is not included in MemAvailable by the Linux
+    kernel (it is not in SReclaimable), so it would otherwise be counted as
+    used memory.
+    """
+    try:
+        with open("/proc/spl/kstat/zfs/arcstats") as fh:
+            for line in fh:
+                parts = line.split()
+                if len(parts) >= 3 and parts[0] == "size":
+                    return int(parts[2])
+    except (OSError, ValueError):
+        pass
+    return 0
+
 logger = logging.getLogger(__name__)


@@ -101,11 +119,21 @@ class MemoryMonitorPlugin(MonitorPlugin):
        
        # Virtual (physical) memory statistics
        vmem = psutil.virtual_memory()
+
+        # psutil's available already counts reclaimable page cache / file
+        # buffers as available (it uses MemAvailable on Linux). Add ZFS ARC on
+        # top because the kernel does not include it in SReclaimable /
+        # MemAvailable even though it is reclaimable.
+        arc_bytes = _zfs_arc_bytes()
+        available = min(vmem.available + arc_bytes, vmem.total)
+        used = vmem.total - available
+        percent = round(used / vmem.total * 100, 1) if vmem.total else 0.0
+
        metrics['memory_total'] = vmem.total
-        metrics['memory_available'] = vmem.available
-        metrics['memory_used'] = vmem.used
+        metrics['memory_available'] = available
+        metrics['memory_used'] = used
        metrics['memory_free'] = vmem.free
-        metrics['memory_percent'] = vmem.percent
+        metrics['memory_percent'] = percent
        
        # Platform-specific memory details
        if hasattr(vmem, 'active'):
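
The ARC adjustment in this hunk is plain arithmetic, so it can be checked without psutil or a ZFS host. The following is a minimal sketch under assumptions: `arc_size_bytes` and `arc_adjusted` are hypothetical helper names, and the sample text only imitates the three-column layout of `/proc/spl/kstat/zfs/arcstats`.

```python
from typing import Dict


def arc_size_bytes(arcstats_text: str) -> int:
    """Extract the 'size' row from arcstats-formatted text, or 0 if absent."""
    for line in arcstats_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "size":
            try:
                return int(parts[2])
            except ValueError:
                return 0
    return 0


def arc_adjusted(total: int, available: int, arc_bytes: int) -> Dict[str, float]:
    """Treat ARC as available memory, capped at total, as the diff does."""
    adj_available = min(available + arc_bytes, total)
    used = total - adj_available
    percent = round(used / total * 100, 1) if total else 0.0
    return {"available": adj_available, "used": used, "percent": percent}


GIB = 1024 ** 3
sample = "name type data\nhits 4 100\nsize 4 %d\n" % GIB  # made-up sample
print(arc_adjusted(8 * GIB, 2 * GIB, arc_size_bytes(sample)))
```

The `min(..., total)` cap matters because adding ARC to the available figure could otherwise exceed total memory and make the used figure negative.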
@@ -31,16 +31,13 @@ from hbd.client.plugin import MonitorPlugin


 # Nagios exit codes
-NAGIOS_OK = 0
-NAGIOS_WARNING = 1
-NAGIOS_CRITICAL = 2
 NAGIOS_UNKNOWN = 3

 STATUS_NAMES = {
-    NAGIOS_OK: "OK",
-    NAGIOS_WARNING: "WARNING",
-    NAGIOS_CRITICAL: "CRITICAL",
-    NAGIOS_UNKNOWN: "UNKNOWN"
+    0: "OK",
+    1: "WARNING",
+    2: "CRITICAL",
+    3: "UNKNOWN",
 }


@@ -129,9 +126,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
        """
        results = {}

-        # Track overall status (worst status wins)
-        worst_status = NAGIOS_OK
-        
        for cmd_config in self.commands:
            name = cmd_config.get("name")
            command = cmd_config.get("command")
@@ -149,10 +143,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
                results[f"{name}_status_code"] = status_code
                results[f"{name}_output"] = output

-                # Track worst status
-                if status_code > worst_status:
-                    worst_status = status_code
-                
                # Parse and add performance data
                if perfdata:
                    for metric_name, metric_value in perfdata.items():
@@ -167,12 +157,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
                results[f"{name}_status"] = "ERROR"
                results[f"{name}_status_code"] = NAGIOS_UNKNOWN
                results[f"{name}_output"] = str(e)
-                worst_status = NAGIOS_UNKNOWN
-        
-        # Add overall status
-        results["overall_status"] = STATUS_NAMES.get(worst_status, "UNKNOWN")
-        results["overall_status_code"] = worst_status
-        results["plugin_count"] = len(self.commands)

        return results
    
@@ -60,7 +60,11 @@ class OSInfoPlugin(InfoPlugin):
                "python_version": platform.python_version(),
                "python_implementation": platform.python_implementation(),
                "hbc_version": hbc_version,
+                "hbc_type": "full",
            }
+            if self.config.get("owner"):
+                self.logger.debug(f"Adding owner from config: {self.config['owner']}")
+                data["owner"] = self.config["owner"]
            
            # Add Linux-specific distribution info
            if platform.system() == "Linux":
@@ -13,12 +13,8 @@ plugins:
    count: 3              # ICMP packets per ping run (default 3)
    timeout: 5            # seconds before a host is considered unreachable (default 5)
    hosts:
-      8.8.8.8:
-        warning: 20.0     # ms
-        critical: 100.0   # ms
-      192.168.1.1:
-        warning: 5.0
-        critical: 20.0
+      - 8.8.8.8
+      - 192.168.1.1
 ```

 Reported metrics per host (metric key uses the hostname with dots/colons replaced
@@ -0,0 +1,140 @@
+"""
+ZFS pool monitoring plugin for Heartbeat.
+
+Collects per-pool health, capacity, and cumulative I/O statistics via zpool(8).
+"""
+
+import asyncio
+import logging
+import shutil
+from typing import Any, Dict, List, Optional
+
+from hbd.client.plugin import MonitorPlugin
+
+logger = logging.getLogger(__name__)
+
+
+def _int(s: str) -> Optional[int]:
+    try:
+        return int(s.strip().rstrip("KMGTBkmgt%x"))
+    except (ValueError, AttributeError):
+        return None
+
+
+def _float(s: str) -> Optional[float]:
+    try:
+        return float(s.strip().rstrip("%x"))
+    except (ValueError, AttributeError):
+        return None
+
+
+class ZFSMonitorPlugin(MonitorPlugin):
+    """Monitor ZFS pool health, capacity, and I/O statistics.
+
+    Collects per pool:
+    - health: ONLINE, DEGRADED, FAULTED, etc.
+    - size / alloc / free: total, allocated and free bytes
+    - capacity: percentage used (0-100)
+    - frag: fragmentation percentage
+    - dedup: deduplication ratio
+    - read_ops / write_ops: cumulative I/O operations since last boot/clear
+    - read_bw / write_bw: cumulative bytes transferred since last boot/clear
+
+    Configuration:
+        interval: collection interval in seconds (default: 300)
+        pools: list of pool names to monitor (default: all)
+    """
+
+    name = "zfs_monitor"
+    description = "ZFS pool health, capacity, and I/O statistics"
+    interval = 300
+
+    def __init__(self, config: Optional[Dict[str, Any]] = None):
+        super().__init__(config)
+        self.interval = self.config.get("interval", 300)
+        self._pools_filter: Optional[List[str]] = self.config.get("pools", None)
+
+    async def initialize(self) -> bool:
+        if not shutil.which("zpool"):
+            self.skip_reason = "zpool not found"
+            return False
+        logger.info("ZFS monitor initialized (interval: %ds)", self.interval)
+        return True
+
+    async def _run(self, *args: str) -> List[str]:
+        """Run a command and return its stdout lines, or [] on error."""
+        try:
+            proc = await asyncio.create_subprocess_exec(
+                *args,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.DEVNULL,
+            )
+            stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=15)
+            return stdout.decode(errors="replace").splitlines()
+        except (FileNotFoundError, asyncio.TimeoutError) as exc:
+            logger.warning("zfs_monitor: %s: %s", args[0], exc)
+            return []
+
+    async def _zpool_list(self) -> Dict[str, Dict]:
+        """Return per-pool health and capacity from `zpool list`."""
+        lines = await self._run(
+            "zpool", "list", "-H", "-p",
+            "-o", "name,health,size,alloc,free,cap,frag,dedup",
+        )
+        pools: Dict[str, Dict] = {}
+        for line in lines:
+            parts = line.split("\t")
+            if len(parts) < 8:
+                continue
+            name = parts[0].strip()
+            if self._pools_filter and name not in self._pools_filter:
+                continue
+            health = parts[1].strip()
+            if health == "ONLINE":
+                status = 0
+            elif health in ("DEGRADED", "ONLINE with errors"):
+                status = 1
+            elif health in ("FAULTED", "OFFLINE", "UNAVAIL"):
+                status = 2
+            else:
+                status = 3  # unknown status
+            pools[name] = {
+                "health":    health,
+                "status": status,
+                "size":      _int(parts[2]),
+                "alloc":     _int(parts[3]),
+                "free":      _int(parts[4]),
+                "capacity":  _float(parts[5]),
+                "frag":      _float(parts[6]),
+                "dedup":     _float(parts[7]),
+            }
+        return pools
+
+    async def _zpool_iostat(self) -> Dict[str, Dict]:
+        """Return per-pool cumulative I/O counters from `zpool iostat`."""
+        lines = await self._run("zpool", "iostat", "-H", "-p")
+        io: Dict[str, Dict] = {}
+        for line in lines:
+            parts = line.split("\t")
+            if len(parts) < 7:
+                continue
+            name = parts[0].strip()
+            if not name:
+                continue
+            io[name] = {
+                "read_ops": _int(parts[3]),
+                "write_ops": _int(parts[4]),
+                "read_bw":  _int(parts[5]),
+                "write_bw": _int(parts[6]),
+            }
+        return io
+
+    async def _collect_metrics(self) -> Dict[str, Any]:
+        pools, io = await asyncio.gather(self._zpool_list(), self._zpool_iostat())
+        for name, stats in io.items():
+            if name in pools:
+                pools[name].update(stats)
+        return {"pools": pools}
+
+
+plugin = ZFSMonitorPlugin
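
The parsing step in `_zpool_list` can be exercised without a ZFS host by feeding it text directly. This is a simplified sketch, not the plugin's code: `parse_zpool_list` and `HEALTH_STATUS` are hypothetical names, the sample lines are invented, and real `zpool list -H -p` output may contain values (such as `-`) that this sketch does not handle.

```python
from typing import Dict, List, Optional

# Health-to-status mapping taken from the diff, minus its
# "ONLINE with errors" case; anything unlisted becomes 3.
HEALTH_STATUS = {"ONLINE": 0, "DEGRADED": 1,
                 "FAULTED": 2, "OFFLINE": 2, "UNAVAIL": 2}


def parse_zpool_list(text: str,
                     pools: Optional[List[str]] = None) -> Dict[str, dict]:
    """Parse tab-separated name,health,size,alloc,free,cap,frag,dedup rows."""
    result: Dict[str, dict] = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 8:
            continue  # skip short or malformed rows
        name, health = parts[0].strip(), parts[1].strip()
        if pools and name not in pools:
            continue  # honour the optional pool filter
        result[name] = {
            "health": health,
            "status": HEALTH_STATUS.get(health, 3),
            "size": int(parts[2]),
            "alloc": int(parts[3]),
            "free": int(parts[4]),
            "capacity": float(parts[5]),
        }
    return result


sample = (
    "tank\tONLINE\t1000\t400\t600\t40\t5\t1.00\n"
    "backup\tDEGRADED\t2000\t1500\t500\t75\t10\t1.00\n"
)
print(parse_zpool_list(sample, pools=["backup"]))
```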
@@ -134,6 +134,30 @@ thresholds:
          hysteresis: 0.1
          enabled: true
  
+  # ----------------------------------------------------------------------------
+  # ZFS Monitor Thresholds
+  # ----------------------------------------------------------------------------
+  zfs_monitor:
+    # Pool health check — built-in default; shown here for reference/override.
+    # status is 0 (ONLINE), 1 (DEGRADED), 2 (FAULTED, OFFLINE, UNAVAIL), or 3 (any other state).
+    # Use '*' to apply the same rule to every pool, or name a specific pool.
+    pools:
+      '*':
+        status:
+          warning: 1           # Alert WARNING when pool is DEGRADED
+          critical: 2           # Alert CRITICAL when pool is SUSPENDED/FAULTED/UNAVAIL
+          operator: ">"
+          hysteresis: 0.0       # No hysteresis: pool state changes alert immediately
+          display: "ZFS pool {pool_name} is {health}"
+
+      # Per-pool capacity thresholds (optional; add pools you care about)
+      # tank:
+      #   capacity:
+      #     warning: 75.0       # Warn at 75% used
+      #     critical: 90.0      # Critical at 90% used
+      #     operator: ">"
+      #     hysteresis: 0.05
+
  # ----------------------------------------------------------------------------
  # Network Monitor Thresholds
  # ----------------------------------------------------------------------------
@@ -144,17 +144,16 @@ def cmd_notify(args):
        url=f"{base_url}/plugins" if base_url else "",
    )

-    # Bypass min_level for explicit test sends; run async channels directly
    import asyncio
+    from .notify import _send_matrix_async, _send_sms_voipms_async, _DRIVERS
    ch_type = channel_cfg.get("type", "")
    print(f"Sending via {args.channel} ({ch_type}): {title} — {args.message}")

-    if ch_type in ("matrix", "sms_voipms"):
-        from .notify import _send_matrix_async, _send_sms_voipms_async
-        driver_async = _send_matrix_async if ch_type == "matrix" else _send_sms_voipms_async
-        ok = asyncio.run(driver_async(channel_cfg, notif))
+    if ch_type == "matrix":
+        ok = asyncio.run(_send_matrix_async(channel_cfg, notif))
+    elif ch_type == "sms_voipms":
+        ok = asyncio.run(_send_sms_voipms_async(channel_cfg, notif))
    else:
-        from .notify import _DRIVERS
        driver = _DRIVERS.get(ch_type)
        if driver is None:
            print(f"Error: unknown channel type '{ch_type}'", file=sys.stderr)
@@ -34,6 +34,9 @@ SERVER_DEFAULTS = {
    "users": {},                # username -> {full_name, avatar, password, admin, notification_channels}
    "default_owner": None,      # Username that owns hosts with no explicit owner

+    # OAuth2 providers
+    "oauth": {},                 # oauth.gitea.{url,client_id,client_secret}
+
    # Host management
    "hosts": {},                # Unified host definitions
    "dyndnshosts": [],          # Hosts with dynamic DNS (legacy)
@@ -95,7 +98,26 @@ THRESHOLD_DEFAULTS = {
                'warning': 200,
                'critical': 250.0,
                'count': 3  # Optional: number of consecutive breaches before alerting
+            },
+            'nagios_runner': {
+                'status_code': {
+                    'display': '{check_name} {output}',
+                    'operator': "nagios"
                }
+            },
+            'zfs_monitor': {
+                'pools': {
+                    '*': {
+                        'status': {
+                            'warning': 1,
+                            'critical': 2,
+                            'operator': '>',
+                            'hysteresis': 0.0,
+                            'display': 'ZFS pool {pool_name} is {health}'
+                        }
+                    }
+                }
+            },
        }
    }

@@ -225,7 +247,7 @@ def get_watchhosts(config):
    hosts_config = config.get("hosts", {})
    if isinstance(hosts_config, dict):
        for host_name, host_attrs in hosts_config.items():
-            if isinstance(host_attrs, dict) and host_attrs.get("watch", False):
+            if isinstance(host_attrs, dict) and host_attrs.get("watch", True):
                watchhosts.append(host_name)
    return watchhosts

@@ -303,7 +325,7 @@ def get_host_access(config, hostname) -> dict:
    """
    host_cfg = get_host_config(config, hostname)

-    owner = host_cfg.get("owner") or get_default_owner(config)
+    owner = host_cfg.get("owner") # or get_default_owner(config)

    managers = host_cfg.get("managers", [])
    if isinstance(managers, str):
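
The `get_watchhosts` change flips the default: a host is now watched unless it sets `watch: false`. The sketch below is a reduced version built only from the lines visible in the hunk, so the real function may do more (for example, handle legacy settings). The host names are invented.

```python
from typing import Any, Dict, List


def get_watchhosts(config: Dict[str, Any]) -> List[str]:
    """Return hosts that are watched; 'watch' defaults to True."""
    watchhosts: List[str] = []
    hosts_config = config.get("hosts", {})
    if isinstance(hosts_config, dict):
        for host_name, host_attrs in hosts_config.items():
            if isinstance(host_attrs, dict) and host_attrs.get("watch", True):
                watchhosts.append(host_name)
    return watchhosts


config = {"hosts": {"web1": {}, "lab": {"watch": False}, "db1": {"watch": True}}}
print(get_watchhosts(config))
```

With the previous default of `False`, `web1` in this example would not have been watched.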
@@ -95,7 +95,7 @@ class Connection:
        if not Null:
            d["addr"] = self.addr
            if self.rtts[-1]:
-                d["rtt"] = "%0.1f" % self.rtts[-1]
+                d["rtt"] = "%d" % round(self.rtts[-1])
            elif self.state == Connection.UNKNOWN:
                d["rtt"] = ""
            else:
@@ -286,7 +286,7 @@ class Host:
            Host.hosts[name] = self
        self.num = num
        self.dyn = False
-        self.watched = False
+        self.watched = True
        self.upcount = 0
        self.interval = 0
        self.doesack = -1
@@ -304,6 +304,7 @@ class Host:

    def statedict(self):
        d = {}
+        d["raw_name"] = self.name
        d["name"] = self.name
        if self.dyn:
            d["name"] += "*"
@@ -1,7 +1,11 @@
 """HTTP server implementation using aiohttp and jinja2."""

 import asyncio
+import datetime
 import json
+import platform
+import socket
+import sys
 import time
 import urllib.parse
 import os
@@ -12,6 +16,7 @@ from . import data
 from . import notify as notify_mod
 from . import settings as settings_mod
 from . import users as users_mod
+from . import oauth as oauth_mod
 from . import ws as ws_mod

 logger = logging.getLogger(__name__)
@@ -111,6 +116,7 @@ async def start(
    This function is intended to be awaited inside the main asyncio event loop.
    """
    get_now = get_now or (lambda: time.time())
+    _start_epoch = time.time()

    async def old_index(request):
        _require_auth_redirect(request)
@@ -149,6 +155,25 @@ async def start(
        lst = [h.jsons() for h in hosts]
        return web.json_response(json.loads("[" + ",".join(lst) + "]"))

+    async def api_alert_summary(request):
+        """GET /api/0/alert_summary — counts of ok/warning/critical hosts visible to caller."""
+        user, err = _require_auth(request)
+        if err:
+            return err
+        from .threshold import AlertLevel
+        critical = warning = ok = 0
+        for host in hbdclass.Host.hosts.values():
+            if not _can_operate_host(user, host):
+                continue
+            levels = {s.level for s in host.alert_states.values()}
+            if AlertLevel.CRITICAL in levels:
+                critical += 1
+            elif AlertLevel.WARNING in levels:
+                warning += 1
+            else:
+                ok += 1
+        return web.json_response({"critical": critical, "warning": warning, "ok": ok})
+
    async def api_messages(request):
        lst = data.msgs[-30:]
        return web.json_response(lst)
@@ -253,7 +278,9 @@ async def start(
            extra_scripts=extra_scripts,
            hbd_version=hbd_version,
            hosts=[
-                hbdclass.Host.hosts[h].stateinfo() for h in sorted(hbdclass.Host.hosts)
+                hbdclass.Host.hosts[h].stateinfo()
+                for h in sorted(hbdclass.Host.hosts)
+                if _can_operate_host(current_user, hbdclass.Host.hosts[h])
            ],
            messages=data.msgs[-30:],
            current_user=current_user.to_dict() if current_user else None,
@@ -505,12 +532,14 @@ async def start(
        hosts_with_plugins = []
        for hostname in sorted(hbdclass.Host.hosts.keys()):
            host = hbdclass.Host.hosts[hostname]
-            if not _can_view_host(current_user, host):
+            if not _can_operate_host(current_user, host):
                continue
            if host.plugin_data:
                hosts_with_plugins.append({
                    "name": hostname,
                    "plugins": list(host.plugin_data.keys()),
+                    "is_owner": _can_own_host(current_user, host),
+                    "owner": host.owner,
                })

        tmpl = env.get_template("plugins.html")
@@ -593,6 +622,16 @@ async def start(
                )
                raise resp
            error = "Invalid username or password."
+        elif request.rel_url.query.get("error"):
+            error = "Sign-in failed. Please try again."
+
+        gitea_button = ""
+        if oauth_mod.is_enabled(config):
+            gitea_button = f"""
+    <div class="divider">or</div>
+    <a href="/login/oauth/gitea" class="gitea-btn">
+      Sign in with Gitea
+    </a>"""

        html = f"""<!DOCTYPE html>
 <html>
@@ -613,6 +652,12 @@ async def start(
    button:hover {{ background: #0055aa; }}
    .error {{ color: #c00; font-size: .9em; margin-bottom: .8em; }}
    .field {{ margin-bottom: .9em; }}
+    .divider {{ text-align: center; margin: 1.2em 0 .8em; color: #999;
+                font-size: .85em; border-top: 1px solid #eee; padding-top: .8em; }}
+    .gitea-btn {{ display: block; width: 100%; padding: .6em; background: #609926;
+                  color: #fff; border-radius: 4px; font-size: 1em; text-align: center;
+                  text-decoration: none; box-sizing: border-box; }}
+    .gitea-btn:hover {{ background: #4e7d1e; }}
  </style>
 </head>
 <body>
@@ -623,7 +668,7 @@ async def start(
      <div class="field"><label>Username</label><input name="username" autofocus></div>
      <div class="field"><label>Password</label><input name="password" type="password"></div>
      <button type="submit">Sign in</button>
-    </form>
+    </form>{gitea_button}
  </div>
 </body>
 </html>"""
@@ -806,6 +851,48 @@ async def start(
        )
        return web.Response(text=body, content_type="text/html")

+    # -------------------------------------------------------------------------
+    # About page
+    # -------------------------------------------------------------------------
+
+    async def about_page(request):
+        """GET /about — version, runtime, and project information."""
+        current_user, _ = _require_auth_redirect(request)
+        pkg_dir = os.path.dirname(__file__)
+        templates_dir = config.get("templates_dir", os.path.join(pkg_dir, "templates"))
+        env = jinja2.Environment(loader=jinja2.FileSystemLoader(templates_dir))
+        from hbd import __version__ as hbd_version
+
+        uptime_secs = int(time.time() - _start_epoch)
+        days, rem = divmod(uptime_secs, 86400)
+        hours, rem = divmod(rem, 3600)
+        mins, secs = divmod(rem, 60)
+        if days:
+            uptime_str = f"{days}d {hours}h {mins}m"
+        elif hours:
+            uptime_str = f"{hours}h {mins}m {secs}s"
+        else:
+            uptime_str = f"{mins}m {secs}s"
+
+        start_dt = datetime.datetime.fromtimestamp(_start_epoch)
+        start_time_str = start_dt.strftime("%Y-%m-%d %H:%M:%S")
+
+        tmpl = env.get_template("about.html")
+        body = tmpl.render(
+            title="About - Heartbeat",
+            header="About",
+            hbd_version=hbd_version,
+            python_version=f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro} ({platform.python_implementation()})",
+            server_hostname=socket.gethostname(),
+            start_epoch=int(_start_epoch),
+            start_time_str=start_time_str,
+            uptime_str=uptime_str,
+            host_count=len(hbdclass.Host.hosts),
+            current_user=current_user.to_dict() if current_user else None,
+            active_page="about",
+        )
+        return web.Response(text=body, content_type="text/html")
+
    # -------------------------------------------------------------------------
    # Settings page (admin only)
    # -------------------------------------------------------------------------
@@ -821,12 +908,56 @@ async def start(
        tmpl = env.get_template("settings.html")
        body = tmpl.render(
            title="Settings - Heartbeat",
-            sections=settings_mod.get_settings_sections(config),
+            sections=settings_mod.get_settings_sections(config, threshold_checker=threshold_checker),
            current_user=current_user.to_dict() if current_user else None,
            active_page="settings",
        )
        return web.Response(text=body, content_type="text/html")

+    def _oauth_redirect_uri(request) -> str:
+        base = config.get("base_url", "").rstrip("/") or str(request.url.origin())
+        return f"{base}/login/oauth/gitea/callback"
+
+    async def oauth_gitea_redirect(request):
+        """GET /login/oauth/gitea — kick off the Gitea OAuth2 flow."""
+        if not oauth_mod.is_enabled(config):
+            return web.Response(status=404, text="OAuth not configured")
+        state = oauth_mod.make_state()
+        raise web.HTTPFound(oauth_mod.authorization_url(config, state, _oauth_redirect_uri(request)))
+
+    async def oauth_gitea_callback(request):
+        """GET /login/oauth/gitea/callback — handle Gitea's redirect back."""
+        if not oauth_mod.is_enabled(config):
+            return web.Response(status=404, text="OAuth not configured")
+        code = request.rel_url.query.get("code", "")
+        state = request.rel_url.query.get("state", "")
+        if not code or not state:
+            return web.Response(status=400, text="Missing code or state")
+        if not oauth_mod.validate_state(state):
+            logger.warning("OAuth: invalid or expired state token from %s", request.remote)
+            raise web.HTTPFound("/login?error=1")
+        try:
+            token = await oauth_mod.exchange_code(config, code, _oauth_redirect_uri(request))
+            profile = await oauth_mod.fetch_user(config, token)
+        except oauth_mod.OAuthError as exc:
+            logger.warning("OAuth error: %s", exc)
+            raise web.HTTPFound("/login?error=1")
+        user = users_mod.provision_oauth_user(
+            profile["login"],
+            profile["full_name"],
+            profile["avatar_url"],
+        )
+        session_token = users_mod.create_session(user.username)
+        resp = web.HTTPFound("/")
+        resp.set_cookie(
+            SESSION_COOKIE,
+            session_token,
+            max_age=users_mod.SESSION_TTL,
+            httponly=True,
+            samesite="Lax",
+        )
+        raise resp
+
    app = web.Application()
    app.add_routes(
        [
@@ -838,12 +969,15 @@ async def start(
            web.get("/logout", web_logout),
            web.post("/api/0/auth/login", api_login),
            web.post("/api/0/auth/logout", api_logout),
+            web.get("/login/oauth/gitea",          oauth_gitea_redirect),
+            web.get("/login/oauth/gitea/callback", oauth_gitea_callback),
            # Users
            web.get("/api/0/users", api_users),
            web.get("/api/0/users/me", api_user_self),
            web.get("/api/0/users/{username}/avatar", api_user_avatar),
            # Hosts
            web.get("/api/0/hosts", api_hosts),
+            web.get("/api/0/alert_summary", api_alert_summary),
            web.get("/api/0/messages", api_messages),
            web.get("/api/0/hosts/{hostname}/plugins", api_host_plugins),
            web.get("/api/0/hosts/{hostname}/plugins/{plugin_name}", api_host_plugin_detail),
@@ -859,6 +993,7 @@ async def start(
            web.get("/live", live),
            web.get("/plugins", plugins_page),
            web.get("/alerts", alerts_page),
+            web.get("/about", about_page),
            web.get("/profile", profile_page),
            web.get("/settings", settings_page),
            web.get("/static/{path:.*}", static),
@@ -101,9 +101,10 @@ async def reload_configuration(config_obj, config_path, components):
            access = config_mod.get_host_access(new_config, hostname)
            host.apply_access(access["owner"], access["managers"], access["monitors"])

-        # Reload threshold checker
+        # Reload threshold checker and prune alerts orphaned by the new config
        if 'threshold_checker' in components:
            components['threshold_checker'].reload(new_config)
+            components['threshold_checker'].purge_stale_alerts(hbdclass)
        
        # Note: Changes to the following require restart:
        # - hb_port, hbd_port, ws_port (already bound)
@@ -241,6 +242,10 @@ async def _run_async(config, config_path=None):
    )
    udp.restore_connection_timers(hbdclass, restore_ctx)

+    # Drop alert states that no longer have a matching threshold (stale after
+    # upgrade or config change between runs).
+    threshold_checker.purge_stale_alerts(hbdclass)
+
    # HTTP server (asyncio-based via aiohttp)
    try:
        http_task = asyncio.create_task(
@@ -250,6 +255,7 @@ async def _run_async(config, config_path=None):
                config=config,
                hbdclass=hbdclass,
                tcss=None,
+                threshold_checker=threshold_checker,
                verbose=config.get("verbose", False),
                get_now=lambda: time.time(),
                VER="",
@@ -469,6 +475,8 @@ def run(config, config_path=None):
    if config.get("debug", 0) > 0:
        log_level = logging.DEBUG
    logging.basicConfig(level=log_level)
+    if not config.get("debug", 0):
+        logging.getLogger("aiohttp.access").propagate = False
    load_pickled_hosts(config, hbdclass)

    notify_mod.initlog(logfile=config.get("logfile", "messages.log"))
@@ -15,7 +15,6 @@ their own ``notification_channels`` list.  When no users are configured the
 server runs silently (no notifications sent).
 """

-import asyncio
 import asyncio
 import logging
 import smtplib
@@ -30,13 +29,10 @@ from . import ws as ws_mod

 logger = logging.getLogger(__name__)

-logger = logging.getLogger(__name__)
-
 msg_to_websockets = ws_mod.broadcast

 # Module-level state set via setup()
 _config: dict = {}
-_loop: Optional[asyncio.AbstractEventLoop] = None

 # Tracks which channels fired a WARNING/CRITICAL per host.
 # {host_name: set of channel_names}  — used to route RECOVER to the same channels.
@@ -73,11 +69,9 @@ class Notification:
 # ---------------------------------------------------------------------------

 def setup(cfg: dict, loop: Optional[asyncio.AbstractEventLoop] = None):
-    """Initialize notifier from configuration dict and event loop."""
-    global _config, _loop
+    """Initialize notifier from configuration dict."""
+    global _config
    _config = dict(cfg)
-    if loop is not None:
-        _loop = loop


 def reload_config(cfg: dict):
@@ -112,11 +106,18 @@ def closelog():

 def eventlog(host, lvl, m, service=None):
    ts = time.time()
+    msg = {
+        "ts": ts,
+        "host": host or None,
+        "level": lvl,
+        "service": service,
+        "message": m,
+    }
+    data.msgs.append(msg)
    s = f"{time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(ts))} {lvl} "
    if host:
        s += f"{host} "
    s += m
-    data.msgs.append(s)
    logger.info(s)
    if logf:
        try:
@@ -124,7 +125,7 @@ def eventlog(host, lvl, m, service=None):
            logf.flush()
        except Exception as e:
            logger.warning("failed to write to logfile: %s", e)
-    msg_to_websockets("message", s)
+    msg_to_websockets("message", msg)


 # ---------------------------------------------------------------------------
@@ -140,9 +141,11 @@ def _send_pushover(channel_cfg: dict, notif: Notification) -> bool:
        logger.warning("pushover: missing token or user")
        return False
    params: dict = {"token": token, "user": user, "title": notif.title, "message": notif.body}
+    if channel_cfg.get("sound"):
+        params["sound"] = channel_cfg["sound"]
    if notif.url:
        params["url"] = notif.url
-        params["url_title"] = "Plugin metrics"
+        params["url_title"] = "Heartbeat"
    conn = http.client.HTTPSConnection("api.pushover.net:443")
    try:
        conn.request(
@@ -215,7 +218,7 @@ def _send_mattermost(channel_cfg: dict, notif: Notification) -> bool:
        return False
    text = f"**{notif.title}**\n{notif.body}"
    if notif.url:
-        text += f"\n[Plugin metrics]({notif.url})"
+        text += f"\n[Plugin metrics] {notif.url}"
    ses = {"url": host, "scheme": "http", "basepath": "/api/v4", "port": 8065}
    mm = Driver(ses)
    payload: dict = {"text": text, "channel": channel, "username": channel_cfg.get("username", "hbd")}
@@ -299,17 +302,6 @@ async def _send_sms_voipms_async(channel_cfg: dict, notif: Notification) -> bool
        return False


-def _send_sms_voipms(channel_cfg: dict, notif: Notification) -> bool:
-    """Dispatch voip.ms SMS send onto the shared event loop."""
-    if _loop is None:
-        logger.warning("sms_voipms: event loop not available")
-        return False
-    future = asyncio.run_coroutine_threadsafe(_send_sms_voipms_async(channel_cfg, notif), _loop)
-    try:
-        return future.result(timeout=15)
-    except Exception as e:
-        logger.error("sms_voipms send timed out or failed: %s", e)
-        return False


 async def _send_matrix_async(channel_cfg: dict, notif: Notification) -> bool:
@@ -357,40 +349,23 @@ async def _send_matrix_async(channel_cfg: dict, notif: Notification) -> bool:
        await client.close()


-def _send_matrix(channel_cfg: dict, notif: Notification) -> bool:
-    """Dispatch matrix send onto the shared event loop."""
-    if _loop is None:
-        logger.warning("matrix: event loop not available")
-        return False
-    future = asyncio.run_coroutine_threadsafe(_send_matrix_async(channel_cfg, notif), _loop)
-    try:
-        return future.result(timeout=15)
-    except Exception as e:
-        logger.error("matrix send timed out or failed: %s", e)
-        return False
-
-
 # ---------------------------------------------------------------------------
-# Channel dispatcher
+# Channel dispatcher  (all async — sync drivers run in a thread executor)
 # ---------------------------------------------------------------------------

+# Sync drivers kept for `hbd notify` CLI usage (asyncio.run wraps them there).
 _DRIVERS = {
    "pushover": _send_pushover,
    "email": _send_email,
    "mattermost": _send_mattermost,
    "signal": _send_signal,
-    "sms_voipms": _send_sms_voipms,
-    "matrix": _send_matrix,
 }

+_TIMEOUT = 15  # seconds per channel send

-def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
-    """Send *notif* to a single named channel, honouring min_level.

-    RECOVER always bypasses min_level — a recovery is always relevant if the
-    channel was configured for any alerting (handles the restart-then-recover case
-    where _alerted_channels is empty and we fall through to the normal loop).
-    """
+async def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notification) -> bool:
+    """Send *notif* to a single named channel, honouring min_level (RECOVER always bypasses it)."""
    level = notif.level.upper()
    if level != "RECOVER":
        min_level = channel_cfg.get("min_level", "WARNING").upper()
@@ -398,14 +373,24 @@ def _dispatch_to_channel(channel_name: str, channel_cfg: dict, notif: Notificati
            logger.debug(
                "channel '%s': skipping level %s (min_level=%s)", channel_name, level, min_level
            )
-            return True  # not an error — filtered intentionally
+            return True  # filtered intentionally

    ch_type = channel_cfg.get("type", "")
-    driver = _DRIVERS.get(ch_type)
-    if driver is None:
+    try:
+        if ch_type == "matrix":
+            return await asyncio.wait_for(_send_matrix_async(channel_cfg, notif), timeout=_TIMEOUT)
+        if ch_type == "sms_voipms":
+            return await asyncio.wait_for(_send_sms_voipms_async(channel_cfg, notif), timeout=_TIMEOUT)
+        sync_driver = _DRIVERS.get(ch_type)
+        if sync_driver is None:
            logger.warning("unknown channel type '%s' for channel '%s'", ch_type, channel_name)
            return False
-    return driver(channel_cfg, notif)
+        return await asyncio.wait_for(
+            asyncio.to_thread(sync_driver, channel_cfg, notif), timeout=_TIMEOUT
+        )
+    except asyncio.TimeoutError:
+        logger.error("channel '%s' timed out after %ds", channel_name, _TIMEOUT)
+        return False


 # ---------------------------------------------------------------------------
@@ -419,7 +404,7 @@ def _build_url(host_name: str) -> str:
    return f"{base_url}/plugins#{host_name}"


-def send_notification(host_name: str, notif: Notification) -> dict:
+async def send_notification(host_name: str, notif: Notification) -> dict:
    """Dispatch *notif* to all managers/owner of *host_name*.

    Looks up the host's owner + managers, resolves each user's
@@ -469,16 +454,12 @@ def send_notification(host_name: str, notif: Notification) -> dict:
            if not channel_cfg:
                continue
            try:
-                ch_type = channel_cfg.get("type", "")
-                driver = _DRIVERS.get(ch_type)
-                if driver:
-                    ok = driver(channel_cfg, notif)
+                ok = await _dispatch_to_channel(channel_name, channel_cfg, notif)
                results[channel_name] = ok
                if ok:
                    logger.info("recover sent to channel '%s': %s", channel_name, notif.title)
            except Exception as e:
                logger.error("error sending recover to channel '%s': %s", channel_name, e)
-        # Clear the alerted set once recovery is delivered
        del _alerted_channels[host_name]
        return results

@@ -489,14 +470,14 @@ def send_notification(host_name: str, notif: Notification) -> dict:
            continue
        for channel_name in user.notification_channels:
            if channel_name in results:
-                continue  # already dispatched to this channel this notification
+                continue
            channel_cfg = global_channels.get(channel_name)
            if not channel_cfg:
                logger.warning("channel '%s' not defined in notification_channels", channel_name)
                results[channel_name] = False
                continue
            try:
-                ok = _dispatch_to_channel(channel_name, channel_cfg, notif)
+                ok = await _dispatch_to_channel(channel_name, channel_cfg, notif)
                results[channel_name] = ok
                if ok:
                    logger.info("notification sent to channel '%s': %s", channel_name, notif.title)
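The dispatcher change above puts every channel send behind one timeout: coroutine drivers are awaited directly, and blocking drivers are moved to a worker thread. A minimal standalone sketch of that pattern follows. The names `dispatch`, `slow_sync_driver`, and `async_driver` are hypothetical and not part of the patch, and the timeout is shortened for demonstration.

```python
import asyncio
import time

TIMEOUT = 0.2  # seconds; hypothetical value, shortened for the demo


def slow_sync_driver(cfg: dict, notif: str) -> bool:
    """Stand-in for a blocking driver (e.g. an HTTP or SMTP call)."""
    time.sleep(cfg.get("delay", 0))
    return True


async def async_driver(cfg: dict, notif: str) -> bool:
    """Stand-in for a native coroutine driver."""
    await asyncio.sleep(cfg.get("delay", 0))
    return True


async def dispatch(driver, cfg: dict, notif: str) -> bool:
    """Run *driver* under a common timeout; return False if it times out."""
    try:
        if asyncio.iscoroutinefunction(driver):
            return await asyncio.wait_for(driver(cfg, notif), timeout=TIMEOUT)
        # Blocking drivers run in a worker thread so the event loop stays free.
        return await asyncio.wait_for(
            asyncio.to_thread(driver, cfg, notif), timeout=TIMEOUT
        )
    except asyncio.TimeoutError:
        return False


if __name__ == "__main__":
    print(asyncio.run(dispatch(slow_sync_driver, {"delay": 0}, "test")))
    print(asyncio.run(dispatch(async_driver, {"delay": 0.5}, "test")))
```

One caveat with this approach: a timeout stops the caller from waiting, but it does not interrupt a worker thread that is already running, so a hung blocking driver still occupies an executor thread until it returns.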
@@ -0,0 +1,142 @@
+"""Gitea OAuth2 support.
+
+Config shape (in ~/.hb.yaml):
+
+    oauth:
+      gitea:
+        url: https://git.example.com
+        client_id: <client-id>
+        client_secret: <client-secret>
+
+Register a Gitea OAuth2 application at:
+  Gitea → Settings → Applications → OAuth2
+Set the redirect URI to:
+  https://<hbd-host>/login/oauth/gitea/callback
+"""
+
+import logging
+import secrets
+import time
+import urllib.parse
+
+import aiohttp
+
+logger = logging.getLogger(__name__)
+
+STATE_TTL = 600  # 10 minutes
+
+# state_token -> expiry timestamp
+_states: dict[str, float] = {}
+
+
+def make_state() -> str:
+    """Generate a CSRF state token, store it with TTL, and return it."""
+    _purge_states()
+    token = secrets.token_hex(32)
+    _states[token] = time.time() + STATE_TTL
+    return token
+
+
+def validate_state(state: str) -> bool:
+    """Return True if *state* is known and unexpired; always removes it."""
+    expiry = _states.pop(state, None)
+    if expiry is None:
+        return False
+    return time.time() < expiry
+
+
+def _purge_states() -> None:
+    """Remove all expired CSRF state tokens from the in-memory store."""
+    now = time.time()
+    expired = [k for k, exp in list(_states.items()) if exp < now]
+    for k in expired:
+        del _states[k]
+
+
+class OAuthError(Exception):
+    """Raised when the OAuth2 flow fails for any reason."""
+
+
+def _gitea_cfg(config: dict) -> dict:
+    """Return the gitea sub-dict, or {} if the oauth/gitea section is absent or empty."""
+    return (config.get("oauth") or {}).get("gitea") or {}
+
+
+def is_enabled(config: dict) -> bool:
+    """Return True when all three required Gitea OAuth keys are present."""
+    g = _gitea_cfg(config)
+    return bool(g.get("url") and g.get("client_id") and g.get("client_secret"))
+
+
+def authorization_url(config: dict, state: str, redirect_uri: str) -> str:
+    """Return the Gitea OAuth2 authorization URL to redirect the browser to."""
+    g = _gitea_cfg(config)
+    if not is_enabled(config):
+        raise OAuthError("Gitea OAuth2 is not configured")
+    params = urllib.parse.urlencode({
+        "client_id": g["client_id"],
+        "redirect_uri": redirect_uri,
+        "response_type": "code",
+        "scope": "user:email",
+        "state": state,
+    })
+    return f"{g['url'].rstrip('/')}/login/oauth/authorize?{params}"
+
+
+async def exchange_code(config: dict, code: str, redirect_uri: str) -> str:
+    """Exchange an authorization *code* for a Gitea access token.
+
+    Returns the access token string.  Raises OAuthError on any failure.
+    """
+    g = _gitea_cfg(config)
+    if not is_enabled(config):
+        raise OAuthError("Gitea OAuth2 is not configured")
+    url = f"{g['url'].rstrip('/')}/login/oauth/access_token"
+    payload = {
+        "client_id": g["client_id"],
+        "client_secret": g["client_secret"],
+        "code": code,
+        "grant_type": "authorization_code",
+        "redirect_uri": redirect_uri,
+    }
+    timeout = aiohttp.ClientTimeout(total=10)
+    try:
+        async with aiohttp.ClientSession(timeout=timeout) as session:
+            async with session.post(url, json=payload, headers={"Accept": "application/json"}) as resp:
+                if resp.status != 200:
+                    text = await resp.text()
+                    raise OAuthError(f"Token exchange failed ({resp.status}): {text}")
+                data = await resp.json()
+                token = data.get("access_token")
+                if not token:
+                    raise OAuthError(f"No access_token in response: {data}")
+    except aiohttp.ClientError as exc:
+        raise OAuthError(f"Token exchange network error: {exc}") from exc
+    return token
+
+
+async def fetch_user(config: dict, token: str) -> dict:
+    """Fetch the authenticated user's profile from Gitea.
+
+    Returns a dict with keys: login, full_name, avatar_url.
+    Raises OAuthError on any failure.
+    """
+    g = _gitea_cfg(config)
+    if not is_enabled(config):
+        raise OAuthError("Gitea OAuth2 is not configured")
+    url = f"{g['url'].rstrip('/')}/api/v1/user"
+    timeout = aiohttp.ClientTimeout(total=10)
+    try:
+        async with aiohttp.ClientSession(timeout=timeout) as session:
+            async with session.get(url, headers={"Authorization": f"token {token}"}) as resp:
+                if resp.status != 200:
+                    text = await resp.text()
+                    raise OAuthError(f"User fetch failed ({resp.status}): {text}")
+                data = await resp.json()
+    except aiohttp.ClientError as exc:
+        raise OAuthError(f"User fetch network error: {exc}") from exc
+    return {
+        "login": data.get("login", ""),
+        "full_name": data.get("full_name", ""),
+        "avatar_url": data.get("avatar_url", ""),
+    }
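The `make_state` / `validate_state` pair in the module above implements a single-use CSRF token with a time-to-live. The sketch below isolates that behaviour in a small class with an injectable clock so expiry can be exercised without waiting. `StateStore` and its `clock` parameter are hypothetical illustrations and not part of the patch.

```python
import secrets
import time
from typing import Callable, Dict


class StateStore:
    """In-memory store of single-use state tokens with a TTL."""

    def __init__(self, ttl: float = 600.0, clock: Callable[[], float] = time.time):
        self._ttl = ttl
        self._clock = clock
        self._states: Dict[str, float] = {}  # token -> expiry timestamp

    def make(self) -> str:
        """Create a token and remember when it expires."""
        token = secrets.token_hex(32)
        self._states[token] = self._clock() + self._ttl
        return token

    def validate(self, token: str) -> bool:
        """Return True once for a known, unexpired token; always consume it."""
        expiry = self._states.pop(token, None)
        return expiry is not None and self._clock() < expiry


if __name__ == "__main__":
    store = StateStore()
    t = store.make()
    print(store.validate(t))  # first use
    print(store.validate(t))  # replay of the same token
```

Because `validate` pops the token before checking the expiry, a replayed or expired token is rejected and cannot be retried.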
@@ -24,7 +24,7 @@ sensitive   bool  True when the raw value must never be shown
 # Credential field names that should always be masked.
 _SECRET_KEYS = frozenset({
    "password", "token", "user_key", "api_key", "secret",
-    "smtp_password", "smtp_user",
+    "smtp_password", "smtp_user", "api_password", "access_token",
 })

 _CHANNEL_TYPE_LABELS = {
@@ -88,7 +88,7 @@ def _sanitize_channel(name, cfg):
 # Public API
 # ---------------------------------------------------------------------------

-def get_settings_sections(config: dict) -> list:
+def get_settings_sections(config: dict, threshold_checker=None) -> list:
    """Return ordered list of setting sections for the settings page.

    Each section:
@@ -181,6 +181,41 @@ def get_settings_sections(config: dict) -> list:
            "notification_channels": attrs.get("notification_channels", []),
        })

+    # ---- Threshold configurations -----------------------------------------
+    def _tc_to_row(tc):
+        return {
+            "metric": tc.metric_path,
+            "operator": tc.operator.value,
+            "warning": tc.warning,
+            "critical": tc.critical,
+            "hysteresis": tc.hysteresis,
+            "count": tc.count,
+            "enabled": tc.enabled,
+        }
+
+    threshold_config_list = []
+    if threshold_checker is not None:
+        if threshold_checker.threshold_configs:
+            for cfg_name, cfg_metrics in sorted(threshold_checker.threshold_configs.items()):
+                # For the default config use the merged effective set;
+                # for named overrides use only the explicitly defined metrics
+                # (threshold_raw_configs) so inherited defaults are not repeated.
+                if cfg_name == "default":
+                    display_metrics = cfg_metrics
+                else:
+                    display_metrics = threshold_checker.threshold_raw_configs.get(cfg_name, cfg_metrics)
+                metrics = sorted(
+                    [_tc_to_row(tc) for tc in display_metrics.values()],
+                    key=lambda m: m["metric"],
+                )
+                threshold_config_list.append({"name": cfg_name, "metrics": metrics})
+        elif threshold_checker.thresholds:
+            metrics = sorted(
+                [_tc_to_row(tc) for tc in threshold_checker.thresholds.values()],
+                key=lambda m: m["metric"],
+            )
+            threshold_config_list.append({"name": "default", "metrics": metrics})
+
    # ---- Hosts summary ----------------------------------------------------
    hosts_list = []
    for hname, hcfg in (config.get("hosts") or {}).items():
@@ -188,7 +223,7 @@ def get_settings_sections(config: dict) -> list:
            continue
        hosts_list.append({
            "name": hname,
-            "watch": bool(hcfg.get("watch", False)),
+            "watch": bool(hcfg.get("watch", True)),
            "dyndns": bool(hcfg.get("dyndns", False)),
            "owner": hcfg.get("owner", ""),
            "managers": hcfg.get("managers", []),
@@ -312,6 +347,16 @@ def get_settings_sections(config: dict) -> list:
            "hosts": hosts_list,
            "fields": [],
        },
+        {
+            "id": "thresholds",
+            "title": "Threshold Configurations",
+            "description": "Named alert threshold sets. Each defines warning/critical levels per metric.",
+            "threshold_configs": threshold_config_list,
+            "fields": [
+                field("default_threshold_config", "Default config", "text",
+                      "Threshold config used for hosts with no explicit mapping."),
+            ],
+        },
        {
            "id": "runtime",
            "title": "Runtime",
@@ -0,0 +1,199 @@
+<!DOCTYPE html>
+<html>
+  {% include 'head.html' %}
+
+  <style>
+    html, body { overflow: visible; }
+
+    .container {
+      max-width: 700px;
+      margin: 0 auto;
+    }
+
+    h1 {
+      color: #333;
+      margin-bottom: 4px;
+      font-size: 1.5em;
+    }
+
+    .subtitle {
+      color: #666;
+      margin-bottom: 24px;
+      font-size: 0.9em;
+    }
+
+    .section {
+      background: #fff;
+      border-radius: 8px;
+      box-shadow: 0 1px 6px rgba(0,0,0,0.1);
+      padding: 20px 24px;
+      margin-bottom: 20px;
+    }
+
+    .section h2 {
+      font-size: 1em;
+      font-weight: 700;
+      color: #333;
+      margin: 0 0 16px;
+      padding-bottom: 10px;
+      border-bottom: 1px solid #eee;
+      text-transform: uppercase;
+      letter-spacing: 0.5px;
+    }
+
+    .info-row {
+      display: flex;
+      align-items: baseline;
+      padding: 8px 0;
+      border-bottom: 1px solid #f5f5f5;
+      font-size: 0.9em;
+    }
+    .info-row:last-child { border-bottom: none; }
+
+    .info-label {
+      width: 160px;
+      flex-shrink: 0;
+      color: #666;
+      font-size: 0.88em;
+    }
+
+    .info-value {
+      color: #222;
+      word-break: break-all;
+    }
+
+    .info-value a {
+      color: #0066cc;
+      text-decoration: none;
+    }
+    .info-value a:hover { text-decoration: underline; }
+
+    .version-badge {
+      display: inline-block;
+      padding: 3px 12px;
+      background: #e8f0fe;
+      color: #1a73e8;
+      border-radius: 12px;
+      font-size: 0.85em;
+      font-weight: 600;
+      font-family: monospace;
+    }
+
+    .hb-logo {
+      font-size: 2.5em;
+      font-weight: 700;
+      color: #0066cc;
+      letter-spacing: -1px;
+      margin-bottom: 6px;
+    }
+
+    .hb-tagline {
+      color: #555;
+      font-size: 0.95em;
+    }
+
+    .logo-section {
+      display: flex;
+      align-items: center;
+      gap: 20px;
+      padding: 8px 0 4px;
+    }
+
+    .logo-text { flex: 1; }
+  </style>
+
+  <body>
+    {% include 'nav.html' %}
+
+    <div class="container">
+      <h1>{{ header }}</h1>
+      <p class="subtitle">Heartbeat monitoring system</p>
+
+      <div class="section">
+        <div class="logo-section">
+          <div class="logo-text">
+            <div class="hb-logo">Heartbeat</div>
+            <div class="hb-tagline">Lightweight host monitoring over UDP</div>
+          </div>
+          <span class="version-badge">v{{ hbd_version }}</span>
+        </div>
+      </div>
+
+      <div class="section">
+        <h2>Version</h2>
+        <div class="info-row">
+          <span class="info-label">Server version</span>
+          <span class="info-value">{{ hbd_version }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Python</span>
+          <span class="info-value">{{ python_version }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">License</span>
+          <span class="info-value">MIT</span>
+        </div>
+      </div>
+
+      <div class="section">
+        <h2>Runtime</h2>
+        <div class="info-row">
+          <span class="info-label">Host</span>
+          <span class="info-value">{{ server_hostname }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Started</span>
+          <span class="info-value">{{ start_time_str }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Uptime</span>
+          <span class="info-value" id="uptime-value">{{ uptime_str }}</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Hosts monitored</span>
+          <span class="info-value">{{ host_count }}</span>
+        </div>
+      </div>
+
+      <div class="section">
+        <h2>Contact &amp; Source</h2>
+        <div class="info-row">
+          <span class="info-label">Author</span>
+          <span class="info-value">Andreas Wrede</span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Email</span>
+          <span class="info-value"><a href="mailto:aew@wrede.ca">aew@wrede.ca</a></span>
+        </div>
+        <div class="info-row">
+          <span class="info-label">Repository</span>
+          <span class="info-value"><a href="https://git.wrede.ca/andreas/heartbeat" target="_blank" rel="noopener">git.wrede.ca/andreas/heartbeat</a></span>
+        </div>
+      </div>
+
+    </div>
+
+    <script>
+      (function() {
+        var startEpoch = {{ start_epoch }};
+        var el = document.getElementById('uptime-value');
+        if (!el) return;
+        function fmt(s) {
+          var d = Math.floor(s / 86400);
+          var h = Math.floor((s % 86400) / 3600);
+          var m = Math.floor((s % 3600) / 60);
+          var sec = s % 60;
+          if (d > 0) return d + 'd ' + h + 'h ' + m + 'm';
+          if (h > 0) return h + 'h ' + m + 'm ' + sec + 's';
+          return m + 'm ' + sec + 's';
+        }
+        function tick() {
+          var up = Math.floor(Date.now() / 1000 - startEpoch);
+          el.textContent = fmt(up);
+        }
+        tick();
+        setInterval(tick, 1000);
+      })();
+    </script>
+  </body>
+</html>
@@ -4,12 +4,17 @@

  <style>

+    html, body {
+      height: auto;
+      overflow-y: auto;
+    }
+
    .container {
      max-width: 1400px;
      margin: 0 auto;
    }

-    h1 { color: #333; margin-bottom: 10px; font-size: 1.5em; }
+    h1 { color: #333; margin-bottom: 5px; margin-top: 15px; font-size: 1.5em; }

    .subtitle {
      color: #666;
@@ -170,14 +175,18 @@

    .alert-hostname {
      font-weight: bold;
-      color: #333;
+      color: #0066cc;
      font-size: 1.1em;
+      text-decoration: none;
+    }
+    .alert-hostname:hover {
+      text-decoration: underline;
    }

    .alert-metric {
-      color: #666;
-      font-family: 'Courier New', monospace;
-      font-size: 0.9em;
+      color: #0066cc;
+      font-size: 1.1em;
+      font-weight: normal;
    }

    .alert-details {
@@ -400,6 +409,10 @@
        } else if (alert.threshold_value !== undefined && alert.threshold_value !== null && alert.operator) {
          valueText += ` <span class="threshold-info">(threshold: ${alert.operator} ${formatValue(alert.threshold_value)})</span>`;
        }
+        if (alert.recovery_threshold !== undefined && alert.recovery_threshold !== null) {
+          const recOp = (alert.operator === '>' || alert.operator === '>=') ? '<' : '>';
+          valueText += ` <span class="threshold-info" style="color:#888">(recovers ${recOp} ${formatValue(alert.recovery_threshold)})</span>`;
+        }
        
        // Build actions section
        let actionsHtml = '';
@@ -424,9 +437,9 @@
            <div class="alert-main">
              <div class="alert-header">
                <span class="alert-level ${level}">${alert.level}</span>
-                <span class="alert-hostname">${alert.hostname}</span>
+                <a class="alert-hostname" href="/plugins#${alert.hostname}">${alert.hostname}</a>
+                <span class="alert-metric">${(alert.metric_path.includes('.') ? alert.metric_path.slice(alert.metric_path.indexOf('.') + 1) : alert.metric_path).replace(/_status_code$/, '')}</span>
              </div>
-              <div class="alert-metric">${alert.metric_path}</div>
              <div class="alert-details">
                <span>${valueText}</span>
                <span class="alert-duration">Active for ${duration}</span>
@@ -15,6 +15,7 @@
      body {
        margin: 0;
        padding: 10px;
+        padding-top: 60px;
        background: #f5f5f5;
      }
      h1 { font-size: 1.5em; color: #333; margin: 0 0 5px; }
@@ -23,11 +24,14 @@

      /* Navigation bar — shared across all pages */
      .nav {
+        position: fixed;
+        top: 0;
+        left: 0;
+        right: 0;
+        z-index: 200;
        background: #fff;
        padding: 6px 12px;
-        margin-bottom: 10px;
        box-shadow: 0 2px 4px rgba(0,0,0,.1);
-        border-radius: 4px;
        display: flex;
        align-items: center;
        justify-content: space-between;
@@ -122,11 +126,17 @@
      }

      /* Swiss railway clock — nav */
-      .nav-clock {
+      .nav-pie {
        flex-shrink: 0;
        line-height: 0;
        margin-left: auto;
        padding: 4px 4px 4px 0;
+      }
+      #alert-pie { display: block; cursor: default; }
+      .nav-clock {
+        flex-shrink: 0;
+        line-height: 0;
+        padding: 4px 4px 4px 0;
        cursor: pointer;
      }
      #swiss-clock { display: block; }
@@ -204,7 +214,7 @@
        ctx.restore();
      }

-      hand((m + s / 60) / 60 * Math.PI * 2 - Math.PI / 2,
+      hand((sFrac >= 58.5 ? m + 1 : m) / 60 * Math.PI * 2 - Math.PI / 2,
           R * 0.88, -R * 0.12, SIZE * 0.027, '#222');           /* minute */
      hand((h + m / 60) / 12 * Math.PI * 2 - Math.PI / 2,
           R * 0.58, -R * 0.12, SIZE * 0.039, '#222');           /* hour   */
@@ -45,6 +45,7 @@
    h1 {
      color: #333;
      margin-bottom: 5px;
+      margin-top: 15px;
      font-size: 1.5em;
    }

@@ -182,11 +183,24 @@
      line-height: 1.0;
    }

-    #messages div {
+    #messages .log-entry {
      padding: 5px 0;
      border-bottom: 1px solid #f0f0f0;
+      display: flex;
+      gap: 0.5em;
+      align-items: baseline;
    }

+    .log-ts { color: #888; white-space: nowrap; }
+    .log-level { font-weight: bold; min-width: 6em; }
+    .log-host { font-weight: 600; }
+    .log-service { color: #888; }
+
+    .log-warning .log-level  { color: #b8860b; }
+    .log-critical .log-level { color: #c00; }
+    .log-recover .log-level  { color: #2a7a2a; }
+    .log-info .log-level     { color: #555; }
+
    /* Modal for connection status messages */
    .connection-modal {
      display: none;
@@ -235,6 +249,8 @@
      color: #ff9800;
      font-weight: 700;
    }
+    #ntable a.host-link { color: inherit; text-decoration: none; }
+    #ntable a.host-link:hover { text-decoration: underline; }
  </style>
  <script type="text/javascript">
    var cnt = 0;
@@ -244,11 +260,13 @@
    var HBD_VERSION = "{{ hbd_version }}";

    function hostNameHtml(data) {
+      var rawName = data.raw_name || data.name.replace(/<[^>]+>/g, '').replace('*', '').trim();
      var nameHtml = data.name;
      if (!data.hbc_version || data.hbc_version !== HBD_VERSION) {
        nameHtml += ' 🥀';
      }
-      return data.dyn ? '<b>' + nameHtml + '</b>' : nameHtml;
+      var display = data.dyn ? '<b>' + nameHtml + '</b>' : nameHtml;
+      return '<a class="host-link" href="/plugins#' + encodeURIComponent(rawName) + '">' + display + '</a>';
    }

    function setup() {
@@ -403,7 +421,7 @@
        );
        if (data.connections[i].state == "up") {
          state = '<span class="state-up">up</span>';
-          latency = Number.parseFloat(data.connections[i].rtts[0]).toFixed(2);
+          latency = String(Math.round(Number.parseFloat(data.connections[i].rtts[0])));
        } else {
          if (data.connections[i].state == "unknown") {
            state = "";
@@ -455,7 +473,20 @@
            update_table(state.data);
          } else if (state.type == "message") {
            var msgs = document.getElementById("messages");
-            msgs.insertAdjacentHTML("afterbegin", "<div>" + state.data + "</div>");
+            var msg = state.data;
+            var _d = new Date(msg.ts * 1000);
+            function _p(n) { return n < 10 ? '0' + n : '' + n; }
+            var ts_str = _d.getFullYear() + '-' + _p(_d.getMonth()+1) + '-' + _p(_d.getDate())
+                       + ' ' + _p(_d.getHours()) + ':' + _p(_d.getMinutes()) + ':' + _p(_d.getSeconds());
+            var lvl = (msg.level || "INFO").toLowerCase();
+            var html = '<div class="log-entry log-' + lvl + '">';
+            html += '<span class="log-ts">' + ts_str + '</span>';
+            html += '<span class="log-level">' + (msg.level || "") + '</span>';
+            if (msg.host) html += '<span class="log-host">' + msg.host + '</span>';
+            if (msg.service) html += '<span class="log-service">' + msg.service + '</span>';
+            html += '<span class="log-msg">' + msg.message + '</span>';
+            html += '</div>';
+            msgs.insertAdjacentHTML("afterbegin", html);
          }
          cnt++;
        };
@@ -510,7 +541,7 @@
          <tbody id="ntablebody">
            {% for host in hosts %}
            <tr class="{% if host.alert_critical_unacked > 0 or host.alert_critical_acked > 0 %}row-critical{% elif host.alert_warning_unacked > 0 or host.alert_warning_acked > 0 %}row-warning{% endif %}">
-              <td data-name="{{ host.name }}">{{ host.name }}{% if not host.hbc_version or host.hbc_version != hbd_version %} 🥀{% endif %}</td>
+              <td data-name="{{ host.name }}"><a class="host-link" href="/plugins#{{ host.raw_name | urlencode }}">{{ host.name }}{% if not host.hbc_version or host.hbc_version != hbd_version %} 🥀{% endif %}</a></td>
              <td style="text-align: center; color: #ff9800; font-weight: bold;">
                {%- set warning_unacked = host.alert_warning_unacked -%}
                {%- set warning_acked = host.alert_warning_acked -%}
@@ -9,6 +9,10 @@
    {% if current_user and current_user.admin %}
    <a href="/settings"{% if active_page == "settings" %} class="active"{% endif %}>Settings</a>
    {% endif %}
+    <a href="/about"{% if active_page == "about" %} class="active"{% endif %}>About</a>
+  </div>
+  <div class="nav-pie" title="Host alert status">
+    <canvas id="alert-pie" width="44" height="44"></canvas>
  </div>
  <div class="nav-clock" title="Click for full-screen clock">
    <canvas id="swiss-clock" width="44" height="44"></canvas>
@@ -41,4 +45,52 @@
      });
    }
  })();
+
+  function drawAlertPie(critical, warning, ok) {
+    var canvas = document.getElementById('alert-pie');
+    if (!canvas) return;
+    var ctx = canvas.getContext('2d');
+    var SIZE = canvas.width;
+    var R = SIZE / 2;
+    ctx.clearRect(0, 0, SIZE, SIZE);
+    var total = critical + warning + ok;
+    if (total === 0) {
+      ctx.beginPath();
+      ctx.arc(R, R, R - 1, 0, Math.PI * 2);
+      ctx.fillStyle = '#ccc';
+      ctx.fill();
+      return;
+    }
+    var slices = [
+      { value: critical, color: '#e53935' },
+      { value: warning,  color: '#ffb300' },
+      { value: ok,       color: '#43a047' }
+    ];
+    var start = -Math.PI / 2;
+    slices.forEach(function(s) {
+      if (s.value === 0) return;
+      var sweep = (s.value / total) * Math.PI * 2;
+      ctx.beginPath();
+      ctx.moveTo(R, R);
+      ctx.arc(R, R, R - 1, start, start + sweep);
+      ctx.closePath();
+      ctx.fillStyle = s.color;
+      ctx.fill();
+      start += sweep;
+    });
+  }
+
+  function updateAlertPie() {
+    fetch('/api/0/alert_summary').then(function(r) {
+      if (!r.ok) return;
+      return r.json();
+    }).then(function(d) {
+      if (d) drawAlertPie(d.critical || 0, d.warning || 0, d.ok || 0);
+    }).catch(function() {});
+  }
+
+  document.addEventListener('DOMContentLoaded', function() {
+    updateAlertPie();
+    setInterval(updateAlertPie, 30000);
+  });
 </script>
@@ -16,6 +16,7 @@
    h1 {
      color: #333;
      margin-bottom: 5px;
+      margin-top: 15px;
      font-size: 1.5em;
    }

@@ -130,6 +131,52 @@
      text-overflow: ellipsis;
    }

+    .host-action-btn {
+      font-size: 0.75em;
+      font-weight: bold;
+      padding: 3px 10px;
+      border-radius: 4px;
+      border: none;
+      cursor: pointer;
+      text-decoration: none;
+      white-space: nowrap;
+    }
+    .host-action-btn.update-btn {
+      background: #e3f2fd;
+      color: #1565c0;
+    }
+    .host-action-btn.update-btn:hover { background: #bbdefb; }
+    .host-action-btn.delete-btn {
+      background: #ffebee;
+      color: #c62828;
+    }
+    .host-action-btn.delete-btn:hover { background: #ffcdd2; }
+
+    /* ── Action result toast ───────────────────────────────────── */
+    #action-toast {
+      position: fixed;
+      bottom: 24px;
+      left: 50%;
+      transform: translateX(-50%) translateY(20px);
+      background: #323232;
+      color: #fff;
+      padding: 12px 22px;
+      border-radius: 6px;
+      font-size: 0.9em;
+      max-width: 480px;
+      text-align: center;
+      opacity: 0;
+      pointer-events: none;
+      transition: opacity 0.25s, transform 0.25s;
+      z-index: 9000;
+      white-space: pre-wrap;
+    }
+    #action-toast.show {
+      opacity: 1;
+      transform: translateX(-50%) translateY(0);
+    }
+    #action-toast.error { background: #c62828; }
+
    /* ── Host body ──────────────────────────────────────────────── */

    .host-body {
@@ -369,7 +416,8 @@
              <span class="host-name">{{ host.name }}</span>
            </div>

-            <div class="glance-strip" id="glance-{{ host.name }}">
+            <div class="glance-strip" id="glance-{{ host.name }}" data-owner="{{ host.owner or '' }}">
+              {% if current_user and current_user.admin and host.owner %}<span class="glance-chip neutral">{{ host.owner }}</span>{% endif %}
              <span class="glance-loading">—</span>
            </div>

@@ -378,11 +426,17 @@
              <span class="nagios-badge" id="nagios-badge-{{ host.name }}">—</span>
              {% endif %}
              <span class="os-label" id="os-label-{{ host.name }}"></span>
+              {% if host.is_owner %}
+              <button class="host-action-btn update-btn"
+                      onclick="event.stopPropagation(); hostAction(this, '/u?h={{ host.name | urlencode }}')">Update</button>
+              <button class="host-action-btn delete-btn"
+                      onclick="event.stopPropagation(); hostDelete(this, '{{ host.name }}')">Delete</button>
+              {% endif %}
            </div>
          </div>

          <div class="host-body">
-            {% set plugin_order = ['os_info','cpu_monitor','memory_monitor','disk_monitor','network_monitor','nagios_runner','filesystem_info'] %}
+            {% set plugin_order = ['os_info','cpu_monitor','memory_monitor','disk_monitor','network_monitor','zfs_monitor','nagios_runner','filesystem_info'] %}
            {% for plugin in plugin_order if plugin in host.plugins %}
            <div class="plugin-accordion collapsed"
                 data-hostname="{{ host.name }}"
@@ -427,6 +481,7 @@
      const GLANCE_PLUGINS = ['cpu_monitor','memory_monitor','disk_monitor',
                              'network_monitor','nagios_runner','os_info'];
      const SKIP_FIELDS = new Set(['id','name']);
+      const CURRENT_USER_ADMIN = {{ 'true' if current_user and current_user.admin else 'false' }};

      // ── Cache ───────────────────────────────────────────────────────────────

@@ -446,6 +501,17 @@
        return pluginCache[hostname]?.[pluginName] ?? null;
      }

+      // Return the highest Nagios exit code (0-3) in a nagios_runner data object; UNKNOWN (3) outranks CRITICAL (2).
+      function nagiosWorstStatus(data) {
+        let worst = 0;
+        for (const [k, v] of Object.entries(data || {})) {
+          if (k.endsWith('_status_code') && typeof v === 'number' && v > worst) {
+            worst = v;
+          }
+        }
+        return worst;
+      }
+
      // ── Fetch helpers ───────────────────────────────────────────────────────

      async function fetchPlugin(hostname, pluginName) {
@@ -494,6 +560,12 @@

        const chips = [];

+        // Owner (admin only, static from server)
+        const owner = strip.dataset.owner;
+        if (CURRENT_USER_ADMIN && owner) {
+          chips.push(`<span class="glance-chip neutral">${escHtml(owner)}</span>`);
+        }
+
        // CPU
        const cpu = getCache(hostname, 'cpu_monitor');
        if (cpu) {
@@ -547,13 +619,13 @@
          ? chips.join('')
          : '<span class="glance-loading">—</span>';

-        // Nagios badge
+        // Nagios badge — derive worst status from individual check codes
        const nagios = getCache(hostname, 'nagios_runner');
        if (nagosBadge && nagios) {
-          const status = (nagios.data.overall_status || '—').toUpperCase();
-          const cls = status === 'OK' ? 'ok'
-            : status === 'WARNING' ? 'warning'
-            : status === 'CRITICAL' ? 'critical' : '';
+          const worst = nagiosWorstStatus(nagios.data);
+          const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
+          const status = names[worst] || '—';
+          const cls = worst === 0 ? 'ok' : worst === 1 ? 'warning' : worst >= 2 ? 'critical' : '';
          nagosBadge.className = `nagios-badge ${cls}`;
          nagosBadge.textContent = status;
        }
@@ -662,9 +734,10 @@
            break;
          }
          case 'nagios_runner': {
-            const status = (d.overall_status || '?').toUpperCase();
-            const count = d.plugin_count;
-            text = status + (count != null ? ` — ${count} checks` : '');
+            const worst = nagiosWorstStatus(d);
+            const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
+            const codes = Object.keys(d).filter(k => k.endsWith('_status_code'));
+            text = (names[worst] || '?') + (codes.length ? ` — ${codes.length} checks` : '');
            break;
          }
          case 'filesystem_info': {
@@ -672,6 +745,19 @@
            text = `${count} filesystem${count !== 1 ? 's' : ''}`;
            break;
          }
+          case 'zfs_monitor': {
+            const pools = d.pools || {};
+            const names = Object.keys(pools);
+            if (names.length === 0) { text = 'No pools'; break; }
+            const degraded = names.filter(n => pools[n].health && pools[n].health !== 'ONLINE');
+            text = names.map(n => {
+              const p = pools[n];
+              const cap = p.capacity != null ? ` ${p.capacity.toFixed(0)}%` : '';
+              return `${n}${cap}`;
+            }).join(' · ');
+            if (degraded.length) text += ` ⚠ ${degraded.map(n => pools[n].health).join(',')}`;
+            break;
+          }
          default:
            text = 'Loaded';
        }
@@ -693,6 +779,7 @@
          case 'memory_monitor': html = renderMemoryTable(cached.data); break;
          case 'disk_monitor':   html = renderDiskTables(cached.data); break;
          case 'network_monitor':html = renderNetworkTables(cached.data); break;
+          case 'zfs_monitor':    html = renderZfsTables(cached.data); break;
          case 'nagios_runner':  html = renderNagiosTable(cached.data); break;
          case 'filesystem_info':html = renderFilesystemTable(cached.data); break;
          default:               html = renderGenericTable(cached.data); break;
@@ -1023,6 +1110,66 @@
        return html;
      }

+      function renderZfsTables(d) {
+        const pools = d.pools || {};
+        const names = Object.keys(pools);
+        if (names.length === 0) return '<div class="no-data">No ZFS pools found</div>';
+
+        const healthCls = h => {
+          if (!h || h === 'ONLINE') return 'pct-ok';
+          if (h === 'DEGRADED') return 'pct-warn';
+          return 'pct-crit';
+        };
+
+        let pt = '<table class="data-table"><thead><tr>'
+          + '<th>Pool</th><th>Health</th>'
+          + '<th class="num">Size</th><th class="num">Used</th>'
+          + '<th class="num">Free</th><th class="num">Cap %</th>'
+          + '<th class="num">Frag %</th><th class="num">Dedup</th>'
+          + '</tr></thead><tbody>';
+        for (const name of names) {
+          const p = pools[name];
+          const cap = p.capacity != null ? p.capacity : 0;
+          const capCls = cap > 90 ? 'pct-crit' : cap > 75 ? 'pct-warn' : 'pct-ok';
+          pt += `<tr>
+            <td class="iface-name">${escHtml(name)}</td>
+            <td class="${healthCls(p.health)}">${escHtml(p.health || '—')}</td>
+            <td class="num">${formatBytes(p.size || 0)}</td>
+            <td class="num">${formatBytes(p.alloc || 0)}</td>
+            <td class="num">${formatBytes(p.free || 0)}</td>
+            <td class="num ${capCls}">${cap.toFixed(1)}%</td>
+            <td class="num">${p.frag != null ? p.frag.toFixed(1) + '%' : '—'}</td>
+            <td class="num">${p.dedup != null ? p.dedup.toFixed(2) + 'x' : '—'}</td>
+          </tr>`;
+        }
+        pt += '</tbody></table>';
+
+        const hasIo = names.some(n => pools[n].read_ops != null);
+        if (!hasIo) return pt;
+
+        let iot = '<table class="data-table"><thead><tr>'
+          + '<th>Pool</th>'
+          + '<th class="num">Read ops</th><th class="num">Write ops</th>'
+          + '<th class="num">Read BW</th><th class="num">Write BW</th>'
+          + '</tr></thead><tbody>';
+        for (const name of names) {
+          const p = pools[name];
+          iot += `<tr>
+            <td class="iface-name">${escHtml(name)}</td>
+            <td class="num">${p.read_ops != null ? p.read_ops.toLocaleString() : '—'}</td>
+            <td class="num">${p.write_ops != null ? p.write_ops.toLocaleString() : '—'}</td>
+            <td class="num">${p.read_bw != null ? formatBytes(p.read_bw) : '—'}</td>
+            <td class="num">${p.write_bw != null ? formatBytes(p.write_bw) : '—'}</td>
+          </tr>`;
+        }
+        iot += '</tbody></table>';
+
+        return `<div class="flex-tables">
+          <div><div class="table-section-label">Pools</div>${pt}</div>
+          <div><div class="table-section-label">I/O (cumulative)</div>${iot}</div>
+        </div>`;
+      }
+
      function renderGenericTable(d) {
        let html = '<table class="data-table"><thead><tr><th>Field</th><th>Value</th></tr></thead><tbody>';
        for (const [k, v] of Object.entries(d)) {
@@ -1081,12 +1228,68 @@
      // ── Init ────────────────────────────────────────────────────────────────

      document.addEventListener('DOMContentLoaded', () => {
+        // If a host fragment is in the URL, expand and scroll to that host;
+        // otherwise expand the first host as before.
+        const hash = window.location.hash;
+        if (hash) {
+          const hostname = decodeURIComponent(hash.slice(1));
+          const card = document.querySelector(`.host-card[data-hostname="${CSS.escape(hostname)}"]`);
+          if (card) {
+            card.classList.remove('collapsed');
+            fetchHostGlance(hostname);
+            setTimeout(() => card.scrollIntoView({ behavior: 'smooth', block: 'start' }), 150);
+            return;
+          }
+        }
        const first = document.querySelector('.host-card');
        if (first) {
          first.classList.remove('collapsed');
          fetchHostGlance(first.dataset.hostname);
        }
      });
+      // ── Host action helpers ──────────────────────────────────────
+
+      let _toastTimer = null;
+      function showToast(msg, isError) {
+        const t = document.getElementById('action-toast');
+        t.textContent = msg;
+        t.classList.toggle('error', !!isError);
+        t.classList.add('show');
+        clearTimeout(_toastTimer);
+        _toastTimer = setTimeout(() => t.classList.remove('show'), 4000);
+      }
+
+      async function hostAction(btn, url) {
+        btn.disabled = true;
+        try {
+          const res = await fetch(url);
+          const text = await res.text();
+          showToast(text, !res.ok);
+        } catch (e) {
+          showToast('Request failed: ' + e.message, true);
+        } finally {
+          btn.disabled = false;
+        }
+      }
+
+      async function hostDelete(btn, hostname) {
+        if (!confirm('Delete host ' + hostname + '?')) return;
+        btn.disabled = true;
+        try {
+          const res = await fetch('/d?h=' + encodeURIComponent(hostname));
+          const text = await res.text();
+          showToast(text, !res.ok);
+          if (res.ok) {
+            const card = document.querySelector(`.host-card[data-hostname="${CSS.escape(hostname)}"]`);
+            if (card) card.remove();
+          } else { btn.disabled = false; }  // re-enable so a failed delete can be retried
+        } catch (e) {
+          showToast('Request failed: ' + e.message, true);
+          btn.disabled = false;
+        }
+      }
    </script>
+
+    <div id="action-toast"></div>
  </body>
 </html>
@@ -9,7 +9,7 @@
      max-width: 960px;
    }

-    h1 { color: #333; margin-bottom: 4px; font-size: 1.5em; }
+    h1 { color: #333; margin-bottom: 5px; margin-top: 15px; font-size: 1.5em; }
    .subtitle { color: #666; margin-bottom: 24px; font-size: 0.9em; }

    /* ---- Sidebar + content layout ---- */
@@ -23,7 +23,7 @@
      width: 180px;
      flex-shrink: 0;
      position: sticky;
-      top: 20px;
+      top: 60px;
    }

    .sidebar-nav a {
@@ -254,6 +254,17 @@
    .host-bool { text-align: center; }
    .dot-yes { color: #2e7d32; font-size: 1.1em; }
    .dot-no  { color: #ddd;    font-size: 1.1em; }
+
+    /* ---- Threshold configurations ---- */
+    .thresh-config { margin: 12px 20px 20px; }
+    .thresh-config-name {
+      font-weight: 600; font-size: 0.9em; color: #1a237e;
+      margin-bottom: 6px;
+    }
+    .mini-table .warn  { color: #e65100; font-weight: 600; }
+    .mini-table .crit  { color: #b71c1c; font-weight: 600; }
+    .mini-table .dim   { color: #aaa; }
+    .mini-table .metric-path { font-family: monospace; font-size: 0.88em; }
  </style>

  <body>
@@ -394,6 +405,49 @@
            {% endif %}
            {% endif %}

+            {# ---- Threshold configurations section ---- #}
+            {% if section.id == "thresholds" %}
+            {% if section.threshold_configs %}
+            {% for tc in section.threshold_configs %}
+            <div class="thresh-config">
+              <div class="thresh-config-name">{{ tc.name }}</div>
+              {% if tc.metrics %}
+              <div style="overflow-x: auto;">
+                <table class="mini-table">
+                  <thead>
+                    <tr>
+                      <th>Metric</th>
+                      <th>Op</th>
+                      <th>Warning</th>
+                      <th>Critical</th>
+                      <th>Hysteresis</th>
+                      <th>Count</th>
+                    </tr>
+                  </thead>
+                  <tbody>
+                    {% for m in tc.metrics %}
+                    <tr {% if not m.enabled %} style="opacity:0.45"{% endif %}>
+                      <td class="metric-path">{{ m.metric }}</td>
+                      <td>{{ m.operator or '>' }}</td>
+                      <td class="warn">{{ m.warning if m.warning is not none else '—' }}</td>
+                      <td class="crit">{{ m.critical if m.critical is not none else '—' }}</td>
+                      <td class="dim">{{ '%.0f%%' % (m.hysteresis * 100) if m.hysteresis else '—' }}</td>
+                      <td class="dim">{{ m.count }}</td>
+                    </tr>
+                    {% endfor %}
+                  </tbody>
+                </table>
+              </div>
+              {% else %}
+              <span class="val-empty">No thresholds defined.</span>
+              {% endif %}
+            </div>
+            {% endfor %}
+            {% else %}
+            <div class="field-row"><span class="val-empty">No threshold configurations defined.</span></div>
+            {% endif %}
+            {% endif %}
+
            {# ---- Hosts section ---- #}
            {% if section.id == "hosts" %}
            {% if section.hosts %}
@@ -9,10 +9,11 @@ This module provides a flexible threshold checking system that:
 - Supports multiple comparison operators
 """

+import asyncio
 import logging
 import time
 from enum import Enum
-from typing import Dict, Any, Optional, Tuple, Callable
+from typing import Dict, List, Any, Optional, Tuple, Callable
 from . import notify as notify_mod
 from .config import THRESHOLD_DEFAULTS

@@ -35,6 +36,7 @@ class ComparisonOperator(Enum):
    LTE = "<="      # Less than or equal
    EQ = "=="       # Equal to
    NEQ = "!="      # Not equal to
+    NAGIOS = "nagios"  # Nagios exit-code semantics: 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN


 class AlertState:
@@ -56,6 +58,7 @@ class AlertState:
        self.last_notification = None
        self.threshold_value = None  # The threshold value that triggered alert
        self.operator = None  # The comparison operator (>, <, >=, etc.)
+        self.hysteresis: Optional[float] = None  # Hysteresis fraction used for recovery
        self.formatted_message = None  # Formatted display message for UI
        self.acknowledged = False  # Whether alert has been acknowledged
        self.acknowledged_at = None  # Timestamp when acknowledged
@@ -151,6 +154,15 @@ class AlertState:
        if self.formatted_message is not None:
            result["formatted_message"] = self.formatted_message

+        # Compute and expose the recovery threshold so the UI can display it
+        if (self.hysteresis and self.threshold_value is not None
+                and self.operator is not None):
+            ha = abs(self.threshold_value * self.hysteresis)
+            if self.operator in ('>', '>='):
+                result["recovery_threshold"] = round(self.threshold_value - ha, 4)
+            elif self.operator in ('<', '<='):
+                result["recovery_threshold"] = round(self.threshold_value + ha, 4)
+
        return result
    
    def __setstate__(self, state):
@@ -158,6 +170,8 @@ class AlertState:
        self.__dict__.update(state)
        if not hasattr(self, 'consecutive_count'):
            self.consecutive_count = 0
+        if not hasattr(self, 'hysteresis'):
+            self.hysteresis = None

    def acknowledge(self):
        """Acknowledge this alert to stop reminder notifications."""
@@ -226,6 +240,16 @@ class ThresholdConfig:
        if not self.enabled:
            return AlertLevel.OK

+        # Nagios exit-code semantics: value IS the severity
+        if self.operator == ComparisonOperator.NAGIOS:
+            try:
+                code = int(value)
+            except (TypeError, ValueError):
+                return AlertLevel.UNKNOWN
+            return {0: AlertLevel.OK, 1: AlertLevel.WARNING, 2: AlertLevel.CRITICAL}.get(
+                code, AlertLevel.UNKNOWN
+            )
+
        try:
            # Convert value to float for comparison
            value = float(value)
@@ -262,6 +286,10 @@ class ThresholdConfig:
        """
        new_level = self.evaluate(value)

+        # Nagios exit codes are discrete integers — hysteresis doesn't apply
+        if self.operator == ComparisonOperator.NAGIOS:
+            return new_level
+
        # If no hysteresis, return new level
        if self.hysteresis == 0.0:
            return new_level
@@ -328,14 +356,17 @@ class ThresholdChecker:
            renotify_interval: Seconds between repeat notifications (default: 1 hour)
            journal: Optional MessageJournal instance for logging threshold events
        """
-        # Named threshold configurations: {config_name: {metric_path: ThresholdConfig}}
+        # Named threshold configurations (pre-merged: defaults + overrides): {config_name: {metric_path: ThresholdConfig}}
        self.threshold_configs = {}

+        # Raw overrides only for each named config (no defaults baked in): {config_name: {metric_path: ThresholdConfig}}
+        self.threshold_raw_configs: Dict[str, Dict[str, ThresholdConfig]] = {}
+
        # Single threshold set for backward compatibility: {metric_path: ThresholdConfig}
        self.thresholds = {}

-        # Host to config name mapping: {host_name: config_name}
-        self.host_config_mapping = {}
+        # Host to ordered list of config names: {host_name: [config_name, ...]}
+        self.host_config_mapping: Dict[str, List[str]] = {}

        # Default config name to use when no mapping exists
        self.default_config = "default"
@@ -372,6 +403,7 @@ class ThresholdChecker:
        
        # Clear old configuration
        self.threshold_configs.clear()
+        self.threshold_raw_configs.clear()
        self.thresholds.clear()
        self.host_config_mapping.clear()
        self.grace_seconds = float(config.get("grace", 2))
@@ -391,10 +423,24 @@ class ThresholdChecker:
        Supports two formats:
        1. Legacy format with direct 'thresholds' section
        2. New format with 'threshold_configs' and 'host_threshold_mapping'
+
+        In all cases, THRESHOLD_DEFAULTS are seeded into threshold_configs["default"]
+        so the Settings page always shows the built-in defaults.
+        _parse_multi_config() overwrites this with the fully-merged effective defaults.
        """
+        # Always expose built-in defaults through threshold_configs["default"] so
+        # the Settings page has something to display even in legacy/no-config mode.
+        seed: Dict[str, ThresholdConfig] = {}
+        for plugin_name, plugin_thresholds in THRESHOLD_DEFAULTS.get("thresholds", {}).items():
+            if isinstance(plugin_thresholds, dict):
+                self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=seed)
+        if seed:
+            self.threshold_configs["default"] = seed
+            self.threshold_raw_configs["default"] = {}
+
        # Check for new multi-config format
        if "threshold_configs" in config:
-            self._parse_multi_config(config)
+            self._parse_multi_config(config)  # overwrites threshold_configs["default"]
        elif "thresholds" in config:
            # Legacy single threshold configuration
            self._parse_legacy_config(config)
@@ -424,9 +470,10 @@ class ThresholdChecker:
                        self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=effective_defaults)

        self.threshold_configs["default"] = dict(effective_defaults)
+        self.threshold_raw_configs["default"] = {}
        logger.info("Registered 'default' threshold config with %d metrics", len(effective_defaults))

-        # Parse each named configuration, seeding it with effective_defaults first
+        # Parse each named configuration
        for config_name, config_data in threshold_configs.items():
            if config_name == "default":
                continue  # already handled above
@@ -440,33 +487,41 @@ class ThresholdChecker:
                continue

            logger.info("Parsing threshold configuration: %s", config_name)
-            self.threshold_configs[config_name] = dict(effective_defaults)

+            # Raw overrides only (used for multi-config layering)
+            raw_overrides: Dict[str, ThresholdConfig] = {}
            thresholds_config = config_data["thresholds"]
            for plugin_name, plugin_thresholds in thresholds_config.items():
-                if not isinstance(plugin_thresholds, dict):
-                    continue
+                if isinstance(plugin_thresholds, dict):
+                    self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=raw_overrides)
+            self.threshold_raw_configs[config_name] = raw_overrides

-                self._parse_plugin_thresholds(
-                    plugin_name,
-                    plugin_thresholds,
-                    target_dict=self.threshold_configs[config_name]
-                )
+            # Pre-merged version (defaults + overrides) for single-config fast path
+            self.threshold_configs[config_name] = dict(effective_defaults)
+            self.threshold_configs[config_name].update(raw_overrides)

-        # Parse host to config mapping from two possible sources
-        # 1. New format: hosts section with threshold_config attribute
+        # Parse host → config list mapping from two possible sources
+
+        def _normalise(value) -> List[str]:
+            """Accept a string or list; always return a list."""
+            if isinstance(value, list):
+                return [str(v) for v in value]
+            return [str(value)]
+
+        # 1. hosts section with threshold_config attribute (string or list)
        if "hosts" in config:
            hosts_config = config["hosts"]
            if isinstance(hosts_config, dict):
                for host_name, host_attrs in hosts_config.items():
                    if isinstance(host_attrs, dict) and "threshold_config" in host_attrs:
-                        self.host_config_mapping[host_name] = host_attrs["threshold_config"]
+                        self.host_config_mapping[host_name] = _normalise(host_attrs["threshold_config"])

-        # 2. Legacy format: host_threshold_mapping section (for backward compatibility)
+        # 2. Legacy host_threshold_mapping section (string values only)
        if "host_threshold_mapping" in config:
            legacy_mapping = config.get("host_threshold_mapping", {})
            if isinstance(legacy_mapping, dict):
-                self.host_config_mapping.update(legacy_mapping)
+                for host_name, value in legacy_mapping.items():
+                    self.host_config_mapping[host_name] = _normalise(value)
        
        # Set default config (first one alphabetically or explicitly set)
        self.default_config = config.get("default_threshold_config", "default")
@@ -520,10 +575,13 @@ class ThresholdChecker:
            if not isinstance(threshold_config, dict):
                continue
            
-            # Handle nested metrics (e.g., partitions./.percent)
+            # Handle nested metrics (e.g., partitions./.percent or pools.*.status)
            if metric_name == "partitions":
                self._parse_partition_thresholds(plugin_name, threshold_config, target_dict)
                continue
+            if metric_name == "pools":
+                self._parse_pool_thresholds(plugin_name, threshold_config, target_dict)
+                continue
            
            metric_path = f"{plugin_name}.{metric_name}"
            
@@ -531,11 +589,14 @@ class ThresholdChecker:
            warning = threshold_config.get("warning")
            critical = threshold_config.get("critical")
            operator = threshold_config.get("operator", ">")
-            display = threshold_config.get("display", "(threshold: {op_symbol} {threshold_value})")
-            hysteresis = threshold_config.get("hysteresis", 0.1)  # 10% default
+            # Nagios operator maps exit codes directly; no numeric thresholds needed
+            is_nagios_op = (operator == "nagios")
+            default_display = "{check_name}: {output}" if is_nagios_op else "(threshold: {op_symbol} {threshold_value})"
+            display = threshold_config.get("display", default_display)
+            hysteresis = threshold_config.get("hysteresis", 0.0 if is_nagios_op else 0.02)
            enabled = threshold_config.get("enabled", True)

-            if warning is None and critical is None:
+            if warning is None and critical is None and not is_nagios_op:
                logger.warning("No thresholds defined for %s, skipping", metric_path)
                continue
            
@@ -606,6 +667,56 @@ class ThresholdChecker:
                
                target_dict[metric_path] = threshold

+    def _parse_pool_thresholds(
+        self,
+        plugin_name: str,
+        pools: Dict[str, Any],
+        target_dict: Optional[Dict[str, ThresholdConfig]] = None,
+    ):
+        """Parse ZFS pool thresholds.  Pool names may be literal or '*' (all pools).
+
+        Config shape::
+
+            zfs_monitor:
+              pools:
+                '*':
+                  status:
+                    warning: 1
+                    critical: 2
+                    operator: '>='
+                tank:
+                  capacity:
+                    warning: 80
+                    critical: 90
+        """
+        if target_dict is None:
+            target_dict = self.thresholds
+
+        for pool_name, metrics in pools.items():
+            if not isinstance(metrics, dict):
+                continue
+            for metric_name, threshold_config in metrics.items():
+                if not isinstance(threshold_config, dict):
+                    continue
+                metric_path = f"{plugin_name}.{pool_name}.{metric_name}"
+                warning = threshold_config.get("warning")
+                critical = threshold_config.get("critical")
+                operator = threshold_config.get("operator", ">")
+                hysteresis = threshold_config.get("hysteresis", 0.02)
+                enabled = threshold_config.get("enabled", True)
+                display = threshold_config.get("display")
+                if warning is None and critical is None:
+                    continue
+                target_dict[metric_path] = ThresholdConfig(
+                    metric_path=metric_path,
+                    warning=warning,
+                    critical=critical,
+                    operator=operator,
+                    hysteresis=hysteresis,
+                    enabled=enabled,
+                    display=display,
+                )
+
    def _parse_rtt_thresholds(
        self,
        rtt_thresholds: Dict[str, Any],
@@ -635,7 +746,7 @@ class ThresholdChecker:
        warning = rtt_thresholds.get("warning")
        critical = rtt_thresholds.get("critical")
        operator = rtt_thresholds.get("operator", ">")
-        hysteresis = rtt_thresholds.get("hysteresis", 0.1)  # 10% default
+        hysteresis = rtt_thresholds.get("hysteresis", 0.02)  # 2% default
        enabled = rtt_thresholds.get("enabled", True)
        display = rtt_thresholds.get("display")
        count = rtt_thresholds.get("count", 1)
@@ -664,7 +775,10 @@ class ThresholdChecker:
        )
    
    def get_thresholds_for_host(self, host_name: str) -> Dict[str, ThresholdConfig]:
-        """Get the appropriate threshold configuration for a host.
+        """Get the effective threshold configuration for a host.
+
+        When threshold_config is a list, configs are applied left-to-right on top
+        of the default thresholds so earlier entries can be overridden by later ones.

        Args:
            host_name: Name of the host
@@ -676,23 +790,40 @@ class ThresholdChecker:
        if self.thresholds and not self.threshold_configs:
            return self.thresholds

-        # Multi-config mode: look up host-specific configuration
-        if self.threshold_configs:
-            config_name = self.host_config_mapping.get(host_name, self.default_config)
+        if not self.threshold_configs:
+            return {}

-            if config_name in self.threshold_configs:
-                return self.threshold_configs[config_name]
-            else:
+        config_names = self.host_config_mapping.get(host_name)
+
+        # No host-specific mapping → return pre-merged default
+        if not config_names:
+            return self.threshold_configs.get(self.default_config, {})
+
+        # Single config → fast path using pre-merged copy
+        if len(config_names) == 1:
+            name = config_names[0]
+            if name in self.threshold_configs:
+                return self.threshold_configs[name]
            logger.warning(
                "Threshold config '%s' not found for host '%s', using default '%s'",
-                    config_name,
-                    host_name,
-                    self.default_config
+                name, host_name, self.default_config,
            )
            return self.threshold_configs.get(self.default_config, {})

-        # No thresholds configured
-        return {}
+        # Multiple configs → start from defaults, layer raw overrides in order
+        result = dict(self.threshold_configs.get(self.default_config, {}))
+        for name in config_names:
+            if name == self.default_config:
+                continue  # defaults already the base
+            raw = self.threshold_raw_configs.get(name)
+            if raw is None:
+                logger.warning(
+                    "Threshold config '%s' not found for host '%s', skipping",
+                    name, host_name,
+                )
+            else:
+                result.update(raw)
+        return result
    
    def check_value(
        self,
@@ -760,6 +891,12 @@ class ThresholdChecker:
        elif new_level == AlertLevel.WARNING and threshold.warning is not None:
            threshold_value = threshold.warning

+        # Keep hysteresis on the state so the UI can show the recovery threshold
+        if new_level != AlertLevel.OK:
+            alert_state.hysteresis = threshold.hysteresis
+        else:
+            alert_state.hysteresis = None
+
        # Update state and check for changes
        old_level = alert_state.level
        if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
@@ -769,6 +906,36 @@ class ThresholdChecker:
            self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, None)

        return None
+    def _find_threshold(
+        self, thresholds: Dict[str, "ThresholdConfig"], metric_path: str
+    ) -> Tuple[Optional["ThresholdConfig"], Optional[str]]:
+        """Return (threshold, check_name) for *metric_path*, falling back to suffix matches.
+
+        Allows generic thresholds like ``nagios_runner.status_code`` to match
+        fully-qualified paths like ``nagios_runner.check_disk_root_status_code``.
+        The exact match is always tried first; then successive leading
+        underscore-delimited segments are stripped from the field name until
+        a match is found or no segments remain.
+
+        Returns:
+            (ThresholdConfig, None) for an exact match.
+            (ThresholdConfig, "check_disk_root") for a suffix match — the second
+            element is the stripped prefix, available as ``{check_name}`` in
+            display format templates.
+            (None, None) when no threshold is found.
+        """
+        if metric_path in thresholds:
+            return thresholds[metric_path], None
+        plugin, sep, field = metric_path.partition(".")
+        if not sep:
+            return None, None
+        parts = field.split("_")
+        for i in range(1, len(parts)):
+            candidate = plugin + "." + "_".join(parts[i:])
+            if candidate in thresholds:
+                return thresholds[candidate], "_".join(parts[:i])
+        return None, None
+
    def check_plugin_data(
        self,
        host_name: str,
@@ -797,11 +964,10 @@ class ThresholdChecker:
        for metric_name, value in data.items():
            metric_path = f"{plugin_name}.{metric_name}"

-            if metric_path not in thresholds:
+            threshold, check_name = self._find_threshold(thresholds, metric_path)
+            if threshold is None:
                continue

-            threshold = thresholds[metric_path]
-            
            # Get or create alert state
            if metric_path not in alert_states:
                alert_states[metric_path] = AlertState(metric_path)
@@ -821,13 +987,15 @@ class ThresholdChecker:
            elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                threshold_value = threshold.warning

+            alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
            # Update state and check for changes
            old_level = alert_state.level
            if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                state_changes.append((metric_path, old_level, new_level, value))
-                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
+                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data, check_name=check_name, metric_name=metric_name)
            elif new_level != AlertLevel.OK:
-                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
+                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data, check_name=check_name, metric_name=metric_name)

        # Check nested metrics (e.g., partition data in disk_monitor)
        self._check_nested_metrics(
@@ -852,6 +1020,44 @@ class ThresholdChecker:
        # Get host-specific thresholds
        thresholds = self.get_thresholds_for_host(host_name)
        
+        # ZFS pool health checks
+        if plugin_name == "zfs_monitor" and "pools" in data:
+            pools = data["pools"]
+            if isinstance(pools, dict):
+                for pool_name, pool_metrics in pools.items():
+                    if not isinstance(pool_metrics, dict):
+                        continue
+                    # Synthesize status from health string for older clients
+                    # that predate the status field.
+                    pool_metrics_effective = dict(pool_metrics)
+                    if "health" in pool_metrics and "status" not in pool_metrics:
+                        pool_metrics_effective["status"] = 0 if pool_metrics["health"] == "ONLINE" else 1
+                    for metric_name, value in pool_metrics_effective.items():
+                        # Try specific pool name first, then wildcard '*'
+                        metric_path = f"{plugin_name}.{pool_name}.{metric_name}"
+                        wildcard_path = f"{plugin_name}.*.{metric_name}"
+                        threshold = thresholds.get(metric_path) or thresholds.get(wildcard_path)
+                        if threshold is None:
+                            continue
+                        if metric_path not in alert_states:
+                            alert_states[metric_path] = AlertState(metric_path)
+                        alert_state = alert_states[metric_path]
+                        new_level = threshold.evaluate_with_hysteresis(value, alert_state.level)
+                        threshold_value = None
+                        if new_level == AlertLevel.CRITICAL and threshold.critical is not None:
+                            threshold_value = threshold.critical
+                        elif new_level == AlertLevel.WARNING and threshold.warning is not None:
+                            threshold_value = threshold.warning
+                        alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+                        pool_context = dict(pool_metrics_effective)
+                        pool_context["pool_name"] = pool_name
+                        old_level = alert_state.level
+                        if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
+                            state_changes.append((metric_path, old_level, new_level, value))
+                            self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, pool_context, metric_name=pool_name)
+                        elif new_level != AlertLevel.OK:
+                            self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, pool_context, metric_name=pool_name)
+
        # Look for partition data in disk_monitor
        if plugin_name == "disk_monitor" and "partitions" in data:
            partitions = data["partitions"]
@@ -887,6 +1093,8 @@ class ThresholdChecker:
                    elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                        threshold_value = threshold.warning

+                    alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
                    old_level = alert_state.level
                    if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                        state_changes.append((metric_path, old_level, new_level, value))
@@ -903,6 +1111,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ):
        """Trigger a notification for an alert state change.
        
@@ -925,54 +1135,52 @@ class ThresholdChecker:
        # Format operator symbol
        op_symbol = threshold.operator.value

+        # Short metric label: strip the plugin-name prefix and _status_code suffix
+        short_path = (metric_path.partition(".")[2] or metric_path).removesuffix("_status_code")
+
        # Use a display-friendly value (inf is the sentinel for "overdue")
        import math
        display_value = "overdue" if isinstance(value, float) and math.isinf(value) else value

-        # Format message
+        # Format message — for the nagios operator there is no numeric threshold_value;
+        # render the display template whenever one is available.
+        has_display = threshold_value is not None or threshold.operator == ComparisonOperator.NAGIOS
+
+        def _fmt():
+            return self._format_display(
+                threshold.display,
+                value=display_value,
+                threshold_value=threshold_value,
+                op_symbol=op_symbol,
+                plugin_data=plugin_data,
+                check_name=check_name,
+                metric_name=metric_name,
+            )
+
        if new_level == AlertLevel.OK:
            lvl = "RECOVER"
-            message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
+            message = f"{short_path} = {display_value} ({old_level.name} -> OK)"
        elif new_level == AlertLevel.WARNING:
            lvl = "WARNING"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
            else:
-                message = f"{metric_path} = {display_value}"
+                message = f"{short_path} = {display_value}"
        elif new_level == AlertLevel.CRITICAL:
            lvl = "CRITICAL"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
            else:
-                message = f"{metric_path} = {display_value}"
+                message = f"{short_path} = {display_value}"
        else:
            lvl = "UNKNOWN"
-            message = f"{metric_path} = {display_value}"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"

-        # Return the formatted threshold info for storing in AlertState
-        formatted_threshold_msg = None
-        if threshold_value is not None and new_level != AlertLevel.OK:
-            formatted_threshold_msg = self._format_display(
-                threshold.display,
-                value=display_value,
-                threshold_value=threshold_value,
-                op_symbol=op_symbol,
-                plugin_data=plugin_data
-            )
+        # Formatted threshold info stored on AlertState for the UI
+        formatted_threshold_msg = _fmt() if has_display and new_level != AlertLevel.OK else None

        return lvl, message, formatted_threshold_msg
    
@@ -987,23 +1195,28 @@ class ThresholdChecker:
        value: Any,
    ):
        """Send notification and log to journal/eventlog."""
-        try:
-            notify_mod.send_notification(
+        from . import hbdclass
+        host = hbdclass.Host.hosts.get(host_name)
+        if host is not None and not host.watched:
+            eventlog(host_name, lvl, message, service="threshold")
+            return
+        short_path = (metric_path.partition(".")[2] or metric_path).removesuffix("_status_code")
+        title = f"[{lvl}] {host_name}  {short_path}"
+        # Strip the "metric = " prefix from message so body is just the value/detail
+        prefix = short_path + " = "
+        body = message[len(prefix):] if message.startswith(prefix) else message
+        asyncio.get_event_loop().create_task(notify_mod.send_notification(
            host_name,
            notify_mod.Notification(
-                    title=f"[{lvl}] {host_name}",
-                    body=message,
+                title=title,
+                body=body,
                level=lvl,
            ),
-            )
-            logger.info("Notification sent: %s", message)
-        except Exception as e:
-            logger.error("Failed to send notification: %s", e)
+        ))
        
        # Log to journal
        if self.journal is not None:
            try:
-                import asyncio
                loop = asyncio.get_event_loop()
                loop.create_task(self.journal.log_threshold_event(
                    host_name=host_name,
@@ -1021,33 +1234,62 @@ class ThresholdChecker:
        self,
        display_format: str,
        value: Any,
-        threshold_value: float,
+        threshold_value: Optional[float],
        op_symbol: str,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> str:
        """Format the display string using available data.

-        Args:
-            display_format: Format string from threshold config
-            value: Current metric value
-            threshold_value: Threshold value that was exceeded
-            op_symbol: Comparison operator symbol
-            plugin_data: Optional dictionary of plugin data fields
+        Available template variables:
+            {value}           - current metric value
+            {threshold_value} - threshold that was exceeded
+            {op_symbol}       - comparison operator (>, <, >=, <=, ==, !=)
+            {check_name}      - prefix stripped for generic threshold match
+                                (e.g. "check_disk_root" when metric
+                                "check_disk_root_status_code" matched generic
+                                threshold "status_code")
+            {metric_name}     - field name within the plugin data dict
+            Any key from plugin_data is also available.

        Returns:
            Formatted display string
        """
+        if not display_format:
+            display_format = "(threshold: {op_symbol} {threshold_value})" if threshold_value is not None else ""
+
        # Build format context with standard variables
        format_context = {
            'value': value,
-            'threshold_value': threshold_value,
            'op_symbol': op_symbol,
        }
+        if threshold_value is not None:
+            format_context['threshold_value'] = threshold_value
+
+        # Add generic-match context variables when available
+        if check_name is not None:
+            format_context['check_name'] = check_name
+        if metric_name is not None:
+            format_context['metric_name'] = metric_name

        # Add all plugin data fields if available
        if plugin_data:
            format_context.update(plugin_data)

+        # For nagios_runner generic matches, expose the matched check's output
+        # and status as short aliases {output} and {status} so display templates
+        # don't need to use the full {check_disk_root_output} form.
+        if check_name and plugin_data:
+            if 'output' not in format_context:
+                output = plugin_data.get(f"{check_name}_output")
+                if output is not None:
+                    format_context['output'] = output
+            if 'status' not in format_context:
+                status = plugin_data.get(f"{check_name}_status")
+                if status is not None:
+                    format_context['status'] = status
+        
        try:
            # Format the display string
            return display_format.format(**format_context)
@@ -1077,17 +1319,22 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> None:
        """Handle a state-change transition with grace-period logic.

-        Transitioning INTO alert: defers the notification for grace_seconds.
+        Transitioning INTO alert (worsening): defers the notification for grace_seconds.
+        De-escalation within alert states (e.g. CRITICAL→WARNING): no new notification;
+          the metric is still alerting, so no RECOVER is sent.
        Transitioning TO OK:
          - Still in grace window (pending_since set): suppresses both the alert
            and the recovery — the spike never warranted a page.
          - Past grace: fires the RECOVER notification normally.
        """
        lvl, message, formatted_msg = self._trigger_notification(
-            host_name, metric_path, old_level, new_level, value, threshold, plugin_data
+            host_name, metric_path, old_level, new_level, value, threshold, plugin_data,
+            check_name=check_name, metric_name=metric_name,
        )
        alert_state.formatted_message = formatted_msg

@@ -1100,12 +1347,20 @@ class ThresholdChecker:
                alert_state.pending_since = None
            else:
                self._send_notification(host_name, lvl, message, metric_path, old_level, new_level, value)
-        else:
+        elif new_level.value > old_level.value:
+            # Worsening (OK→WARNING, OK→CRITICAL, WARNING→CRITICAL): schedule notification.
            alert_state.pending_since = time.time()
            logger.debug(
                "Alert deferred (%.0fs grace): %s on %s = %s",
                self.grace_seconds, metric_path, host_name, value,
            )
+        else:
+            # De-escalation within alert states (e.g. CRITICAL→WARNING): metric is still
+            # alerting but did not recover, so no new notification.
+            logger.debug(
+                "De-escalation %s→%s for %s on %s, no notification",
+                old_level.name, new_level.name, metric_path, host_name,
+            )

    def _check_pending_or_renotify(
        self,
@@ -1115,6 +1370,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> None:
        """Called when alert level is unchanged and non-OK.

@@ -1124,7 +1381,8 @@ class ThresholdChecker:
        if alert_state.pending_since is not None:
            if time.time() - alert_state.pending_since >= self.grace_seconds:
                lvl, message, formatted_msg = self._trigger_notification(
-                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data
+                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data,
+                    check_name=check_name, metric_name=metric_name,
                )
                alert_state.formatted_message = formatted_msg
                self._send_notification(
@@ -1133,7 +1391,7 @@ class ThresholdChecker:
                alert_state.pending_since = None
            # else: still within grace window, do nothing
        else:
-            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data)
+            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data, check_name=check_name, metric_name=metric_name)

    def _check_renotify(
        self,
@@ -1143,6 +1401,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ):
        """Check if we should send a repeat notification.
        
@@ -1180,6 +1440,7 @@ class ThresholdChecker:
            
            # Format operator symbol
            op_symbol = threshold.operator.value
+            short_path = (metric_path.partition(".")[2] or metric_path).removesuffix("_status_code")

            # Time to re-notify
            if threshold_value is not None:
@@ -1189,26 +1450,49 @@ class ThresholdChecker:
                    value=value,
                    threshold_value=threshold_value,
                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
+                    plugin_data=plugin_data,
+                    check_name=check_name,
+                    metric_name=metric_name,
                )
-                message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
+                body = f"{value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
            else:
-                message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
+                body = f"{value} (ongoing for {int(now - alert_state.since)}s)"
+            message = f"REMINDER ({alert_state.level.name}): {host_name} - {short_path} = {body}"

-            try:
-                notify_mod.send_notification(
+            from . import hbdclass
+            host = hbdclass.Host.hosts.get(host_name)
+            if host is None or host.watched:
+                asyncio.get_event_loop().create_task(notify_mod.send_notification(
                    host_name,
                    notify_mod.Notification(
-                        title=f"[REMINDER/{alert_state.level.name}] {host_name}",
-                        body=message,
+                        title=f"[REMINDER/{alert_state.level.name}] {host_name}  {short_path}",
+                        body=body,
                        level=alert_state.level.name,
                    ),
-                )
+                ))
+                logger.info("Re-notification sent: %s", message)
            alert_state.last_notification = now
            alert_state.notification_count += 1
-                logger.info("Re-notification sent: %s", message)
-            except Exception as e:
-                logger.error("Failed to send re-notification: %s", e)
+    
+    def purge_stale_alerts(self, hbdclass) -> None:
+        """Remove alert states that have no matching threshold configuration.
+
+        Called after startup (pickle restore) and after each config reload so
+        that alerts orphaned by configuration changes do not linger forever.
+        Alerts whose metric_path is not present in the current threshold config
+        for that host are dropped, with one INFO log line per purged alert.
+        """
+        for hostname, host in hbdclass.Host.hosts.items():
+            if not host.alert_states:
+                continue
+            configured = self.get_thresholds_for_host(hostname)
+            stale = [mp for mp in host.alert_states if self._find_threshold(configured, mp)[0] is None]
+            for mp in stale:
+                logger.info(
+                    "Purging stale alert state for %s / %s (no threshold configured)",
+                    hostname, mp,
+                )
+                del host.alert_states[mp]

    def get_active_alerts(self, alert_states: Dict[str, AlertState]) -> list:
        """
@@ -211,10 +211,11 @@ def _make_timer_callbacks(uname, host, ctx):
        connection.newstate(connection.__class__.OVERDUE, now, cfg.get("grace", 2))
        msg = f"{connection.afam} overdue"
        eventlog(uname, "CRITICAL", msg)
-        notify_mod.send_notification(
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
                uname,
                notify_mod.Notification(title=f"[CRITICAL] {uname}", body=msg, level="CRITICAL"),
-        )
+            ))
        # Track in alert_states so the Alerts Dashboard shows this
        _set_connectivity_alert(host, connection.afam, "CRITICAL")
        if threshold_checker:
@@ -335,8 +336,7 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
        # Apply user-access settings from config
        access = config_mod.get_host_access(cfg, uname)
        host.apply_access(access["owner"], access["managers"], access["monitors"])
-        if verbose:
-            print(("XX: New host, num now %s" % (len(hbdcls.Host.hosts))))
+        logger.info("New host signed on: %s (dyn=%s, access=%s)", uname, host.dyn, access)
        newh = True
    else:
        host = hbdcls.Host.hosts[uname]
@@ -350,8 +350,10 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):

    if msg.get("ID") == "HTB":
        host.doesack = msg.get("acks", -1)
-        # send ACK back
+        # send ACK back; ask client to resend plugin info when we have none yet
        rmsg = {"time": time.time()}
+        if not host.plugin_data:
+            rmsg["request_update"] = 1
        opkt = dicttos("ACK", rmsg)
        try:
            transport.sendto(opkt, addr)
@@ -368,6 +370,14 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
                           if k not in ("ID", "plugin", "id", "name")}
            # Store plugin data with timestamp
            host.add_plugin_data(plugin_name, plugin_data, timestamp=now)
+
+            # An owner reported by os_info wins; otherwise fall back to the configured owner, then the default
+            if plugin_name == "os_info":
+                config_owner = config_mod.get_host_access(cfg, uname).get("owner")
+                default_owner = config_mod.get_default_owner(cfg)
+                inferred_owner = plugin_data.get("owner", config_owner or default_owner)
+                host.owner = inferred_owner
+                logger.info("owner for %s is '%s'", uname, host.owner)
            if DEBUG > 1:
                print(f"Stored plugin data for {uname}: {plugin_name}")
            
@@ -407,10 +417,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):

    if res:
        eventlog(uname, "WARNING", res)
-        notify_mod.send_notification(
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
                uname,
                notify_mod.Notification(title=f"[WARNING] {uname}", body=res, level="WARNING"),
-        )
+            ))

    interval = int(msg.get("interval", 0) or 0)
    shutdown = msg.get("shutdown", 0)
@@ -420,10 +431,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):

    if boot:
        eventlog(uname, "INFO", "booted")
-        notify_mod.send_notification(
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
                uname,
                notify_mod.Notification(title=f"[INFO] {uname}", body=f"{host.name} booted", level="INFO"),
-        )
+            ))
    if message:
        eventlog(uname, "INFO", "msg: %s" % message, service=service)

@@ -437,13 +449,18 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
        if not newh:
            if d == 0 or lasts == "unknown":
                m = "%s is up" % (conn.afam)
+            elif d < 4:
+                # Transient blip (likely client restart) — skip log and notification
+                m = None
            else:
                m = "%s back after being %s for %s" % (conn.afam, lasts, dur(d))
+            if m:
                eventlog(uname, "RECOVER", m)
-            notify_mod.send_notification(
+                if host.watched:
+                    asyncio.create_task(notify_mod.send_notification(
                        uname,
                        notify_mod.Notification(title=f"[RECOVER] {uname}", body=m, level="RECOVER"),
-            )
+                    ))

    if boot or newh:
        host.upcount = host.doesack
@@ -453,10 +470,11 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
    if shutdown:
        m = "%s shutdown" % conn.afam
        eventlog(uname, "INFO", m)
-        notify_mod.send_notification(
+        if host.watched:
+            asyncio.create_task(notify_mod.send_notification(
                uname,
                notify_mod.Notification(title=f"[INFO] {uname}", body=m, level="INFO"),
-        )
+            ))
        conn.newstate(hbdcls.Connection.DOWN, now)
        _set_connectivity_alert(host, conn.afam, "CRITICAL")

@@ -146,9 +146,14 @@ def load_users(config: dict) -> dict:
    Returns the new ``users`` dict.
    """
    global users
+    old_users = dict(users)  # snapshot before rebuild
    users_cfg = config.get("users", {})
    if not isinstance(users_cfg, dict):
        users = {}
+        # Preserve OAuth-provisioned users (password_hash == "") that aren't in config.
+        for username, existing_user in old_users.items():
+            if not existing_user.password_hash and username not in users:
+                users[username] = existing_user
        return users

    result: dict = {}
@@ -166,6 +171,10 @@ def load_users(config: dict) -> dict:
        )

    users = result
+    # Preserve OAuth-provisioned users (password_hash == "") that aren't in config.
+    for username, existing_user in old_users.items():
+        if not existing_user.password_hash and username not in users:
+            users[username] = existing_user
    logger.info("Loaded %d user(s) from config", len(users))
    return users

@@ -187,6 +196,26 @@ def authenticate(username: str, password: str) -> "User | None":
    return None


+def provision_oauth_user(username: str, full_name: str, avatar: str) -> "User":
+    """Create or update a user sourced from an OAuth2 provider.
+
+    New users are inserted with no password_hash — they can only authenticate
+    via OAuth.  Existing users (e.g. defined in config with a password) have
+    their display name and avatar refreshed; all other attributes are preserved.
+    """
+    user = users.get(username)
+    if user is None:
+        user = User(username=username, full_name=full_name, avatar=avatar)
+        users[username] = user
+        logger.info("Provisioned OAuth user %r", username)
+    else:
+        if full_name:
+            user.full_name = full_name
+        if avatar:
+            user.avatar = avatar
+    return user
+
+
 # ---------------------------------------------------------------------------
 # Session management
 # ---------------------------------------------------------------------------
@@ -13,7 +13,8 @@ from . import data

 logger = logging.getLogger(__name__)

-_connections: set = set()
+# Map of WebSocket → User object (or None when auth is disabled)
+_connections: dict = {}
 _loop: Optional[asyncio.AbstractEventLoop] = None
 _get_hosts: Optional[Callable[[], Iterable]] = None
 _verbose: bool = False
@@ -34,30 +35,62 @@ def setup(
    _verbose = verbose


+def _user_can_see_host(user, host_name: str) -> bool:
+    """Return True if *user* may see updates for *host_name* (manager or higher)."""
+    from . import hbdclass, users as users_mod
+    if user is None or not users_mod.users_enabled():
+        return True
+    if user.admin:
+        return True
+    host = hbdclass.Host.hosts.get(host_name)
+    if host is None:
+        return False
+    return host.is_manager(user.username)
+
+
+def _get_token(request) -> str:
+    """Extract session token from request (mirrors logic in http.py)."""
+    auth = request.headers.get("Authorization", "")
+    if auth.startswith("Bearer "):
+        return auth[7:].strip()
+    token = request.headers.get("X-Auth-Token", "")
+    if token:
+        return token
+    return request.cookies.get("hbd_session", "")
+
+
 async def handler(request):
    """aiohttp WebSocket upgrade handler — register as GET /ws."""
    from aiohttp import web
+    from . import users as users_mod

    ws = web.WebSocketResponse()
    await ws.prepare(request)

-    _connections.add(ws)
+    token = _get_token(request)
+    user = users_mod.get_session_user(token) if token else None
+
+    _connections[ws] = user
    remote = request.remote
    logger.info("WebSocket connected from %s", remote)

    try:
-        # Send current host state to the new client
+        # Send current host state, filtered to hosts this user may see
        if _get_hosts:
            try:
                for h in list(_get_hosts()):
+                    host_name = h.get("raw_name") or h.get("name", "")
+                    if _user_can_see_host(user, host_name):
                        await ws.send_str(json.dumps({"type": "host", "data": h}))
            except Exception as e:
                logger.error("Error sending initial hosts: %s", e)

-        # Send recent messages
+        # Send recent messages, filtered to hosts this user may see
        if data.msgs:
            try:
                for m in data.msgs:
+                    host_name = m.get("host") if isinstance(m, dict) else None
+                    if not host_name or _user_can_see_host(user, host_name):
                        await ws.send_str(json.dumps({"type": "message", "data": m}))
            except Exception as e:
                logger.error("Error sending initial messages: %s", e)
@@ -74,7 +107,7 @@ async def handler(request):
    except Exception as e:
        logger.exception("WebSocket handler error from %s: %s", remote, e)
    finally:
-        _connections.discard(ws)
+        _connections.pop(ws, None)
        logger.info("WebSocket disconnected from %s", remote)

    return ws
@@ -83,25 +116,39 @@ async def handler(request):
 def broadcast(typ: str, payload) -> bool:
    """Thread-safe broadcast to all connected WebSocket clients.

+    For host and plugin updates, only sends to clients whose user has
+    manager-or-higher access to that host.  Other message types are
+    broadcast to all clients.
+
    Can be called from any thread; schedules sends on the event loop.
    Returns False if the loop is not running yet.
    """
    if not _loop:
        return False
+
+    # Determine the host name for access-filtered message types
+    host_name: Optional[str] = None
+    if typ in ("host", "plugin") and isinstance(payload, dict):
+        host_name = payload.get("raw_name") or payload.get("host") or payload.get("name")
+    elif typ == "message" and isinstance(payload, dict):
+        host_name = payload.get("host")
+
    jmsg = json.dumps({"type": typ, "data": payload})

    async def _send_all():
        dead = set()
-        for ws in list(_connections):
+        for ws, user in list(_connections.items()):
            try:
-                if not ws.closed:
-                    await ws.send_str(jmsg)
-                else:
+                if ws.closed:
                    dead.add(ws)
+                    continue
+                if host_name is not None and not _user_can_see_host(user, host_name):
+                    continue
+                await ws.send_str(jmsg)
            except Exception:
                dead.add(ws)
        for ws in dead:
-            _connections.discard(ws)
+            _connections.pop(ws, None)

    asyncio.run_coroutine_threadsafe(_send_all(), _loop)
    return True
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "hbd"
-version = "5.1.7"
+version = "5.2.5"
 description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
 readme = "README.md"
 requires-python = ">=3.11"
@@ -34,6 +34,9 @@ server = [
  "matrix-nio>=0.24",
 ]

+# Minimal client — hbc_mini only, no external dependencies
+mini = []
+
 # Install both client and server
 all = [
  "hbd[client,server]",
@@ -4,12 +4,14 @@ set -e
 uv version --bump patch 
 VER=$(uv  version  --short)
 sed -i".bak"  "s/__version__ = \"[0-9.]*\"\(.*\)$/__version__ = \"$VER\"\1/" hbd/__init__.py
+sed -i".bak"  "s/__version__ = \"[0-9.]*\"\(.*\)$/__version__ = \"$VER\"\1/" scripts/hbc_mini.py

 # commit pyproject.toml
-git commit -m "version $VER" pyproject.toml hbd/__init__.py
+git commit -m "version $VER" pyproject.toml hbd/__init__.py scripts/hbc_mini.py
 git push 
 # tag version
 git tag -a v$VER -m "Version $VER"
 git push --tags

 rm hbd/__init__.py.bak
+rm scripts/hbc_mini.py.bak
@@ -0,0 +1,2 @@
+hbc_mini
+hbc_mini_dbg
@@ -0,0 +1,21 @@
+CC      ?= cc
+CFLAGS  = -O2 -Wall -Wextra -std=c11
+LDFLAGS = -lz -lpthread -lm
+TARGET  = hbc_mini
+SRC     = hbc_mini.c
+
+# FreeBSD/NetBSD keep zlib in base; no extra flags needed.
+# On some NetBSD installs pthreads may need -lpthread from pkgsrc.
+
+.PHONY: all clean debug
+
+all: $(TARGET)
+
+$(TARGET): $(SRC)
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+
+debug: $(SRC)
+	$(CC) -g -fsanitize=address,undefined -o $(TARGET)_dbg $< $(LDFLAGS)
+
+clean:
+	rm -f $(TARGET) $(TARGET)_dbg
@@ -12,11 +12,14 @@
 set -e
 what=$1
 on_ha=0
+where=""
+venv=""
+[ "$2" = "HA" ] && on_ha=1
 [ -z "$what" ] && what="client"

-if [ -d /homeassistant ]; then
-    echo "cannot install in HA, running \"docker exec homeassistant $0 $@\""
-    docker exec homeassistant $0 $@
+if [ -d /homeassistant ]; then  # if running from HA command line
+    echo "HA, running \"docker exec homeassistant /config/bin/hb_install.sh $@\""
+    docker exec homeassistant /config/bin/hb_install.sh $@ HA
    rc=$?
    if [ $rc -ne 0 ]; then
        echo "Failed to install heartbeat in HA, please check the logs for more details"
@@ -24,11 +27,12 @@ if [ -d /homeassistant ]; then
    fi
    exit 0
 fi
-if [ -d /config ]; then
-    echo "Installing on HA"
+
+if [ $on_ha -eq 1 ] || [ -r /.dockerenv ] && [ -d /config/bin ]; then
+    # Installing under docker on Home Assistant OS, using /config/bin for executables and /config/venvs for virtual environments 
+    echo "Home Assistant OS detected, installing under docker"
    where="/config/bin"
    venv="/config/venvs"
-    on_ha=1
 else
    if [ ! -d $HOME/.local/bin ] && [ ! -d $HOME/bin ]; then
        echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
@@ -43,24 +47,32 @@ else
        echo "No suitable bin directory found in PATH, please add either $HOME/.local/bin or $HOME/bin to your PATH"
        exit 1
    fi
+    if [ "$what" = "mini" ]; then
+        venv=""
+    else
        venv="$HOME/venvs"
+    fi
+fi
+echo "Installing $what to $where"
+if [ ! -z "$venv" ]; then
+    echo "Using virtual environment at $venv/hbd"
 fi

-echo "Installing heartbeat $what"
-
-if [ ! -d  $venv/hbd ]; then
-    set +e
-    python3 -m pip --version > /dev/null 2>&1 
-    rc=$?
-    set -e
-    if [ $rc -ne 0 ]; then
-        # truenas does not have pip installed by default, so we need to fetch get-pip.py and install pip
+if [ "$venv" != "" ] && [ ! -d  $venv/hbd ]; then
+    arg=""
+    have_pip=$(python3 -c "import pip" >/dev/null 2>&1 && echo "Installed" || echo "Not Installed")
+    if [ "$have_pip" = "Not Installed" ]; then
+        # some systems do not have pip installed by default, so we need to fetch get-pip.py and install pip
        echo "pip is not installed, fetching get-pip.py and installing pip"
        arg="--without-pip"
    fi
    mkdir -p $venv
-    have_venv=$(python3 -c "import venv" &> /dev/null && echo "Installed" || echo "Not Installed")
+    have_venv=$(python3 -c "import venv" >/dev/null 2>&1 && echo "Installed" || echo "Not Installed")
    if [ "$have_venv" = "Not Installed" ]; then
+        if [ "$have_pip" = "Not Installed" ]; then
+            echo "python has no venv, and no pip to install virtualenv, cannot continue"
+            exit 1
+        fi
        echo "python venv module not found, installing virtualenv"
        python3 -m pip install --user virtualenv
        python3 -m virtualenv $venv/hbd --system-site-packages $arg
@@ -74,24 +86,30 @@ if [ ! -d  $venv/hbd ]; then
    deactivate
 fi

-. $venv/hbd/bin/activate
-python3 -mpip install --upgrade --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]
+if [ ! -z "$venv" ]; then
+    . $venv/hbd/bin/activate
+fi
+if [ "$what" = "mini" ]; then
+    curl -fsS -o $where/hbc_mini https://git.wrede.ca/andreas/heartbeat/raw/branch/master/scripts/hbc_mini.py
+    chmod +x $where/hbc_mini
+else
+    python3 -mpip install --upgrade --index-url https://git.wrede.ca/api/packages/andreas/pypi/simple/ --extra-index-url https://pypi.org/simple hbd[$what]
+fi

-if [ "$what" = "server" ]; then
+if [ ! -z "$venv" ]; then
+    echo "linking executables to $where"
+    if [ "$what" = "server" ]; then
        rm -f $where/hbd
        ln -sf $(which hbd) $where/hbd
-    echo "hbd installed, you can run it with \"$where/hbd\" or \"hbd\" if $where is in your PATH"
-else
+    elif [ "$what" = "client" ]; then
        rm -f $where/hbc
        ln -sf $(which hbc) $where/hbc
-    # rm -f $where/hb_install.sh
-    cp "$0" $where/hb_install.sh
-    chmod +x $where/hb_install.sh
-    if [ $on_ha -eq 1 ]; then
-        echo "restarting hbc "
-        job=$(grep run_hbc configuration.yaml | sed 's/run_hbc://')
-        $job
-    else
-        echo "hbc installed, you can run it with \"$where/hbc\" or \"hbc\" if $where is in your PATH"
    fi
+    rm -f $where/hb_install.sh
+    ln -sf $(which hb_install.sh) $where/hb_install.sh
 fi
+echo "Installation complete. To upgrade, run the following:"
+echo "    $where/hb_install.sh $what"
+echo "To install on another machine, download the install script from:"
+echo "    https://git.wrede.ca/andreas/heartbeat/raw/branch/master/scripts/hb_install.sh"
+echo "and then run: sh hb_install.sh [mini|client]"
@@ -40,6 +40,9 @@ from logging.handlers import SysLogHandler
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple

+# updated by scripts/bumpminor.sh
+__version__ = "5.2.5"
+
 # ---------------------------------------------------------------------------
 # Protocol  (mirrors hbd/common/proto.py)
 # ---------------------------------------------------------------------------
@@ -111,6 +114,7 @@ def _stodict(data: bytes) -> Dict[str, Any]:
 _DEFAULTS: Dict[str, Any] = {
    "hb_port": 50003,
    "interval": 10,
+    "owner": None,
    "plugins": {},
 }

@@ -233,7 +237,11 @@ class OSInfoPlugin(InfoPlugin):
            "machine": platform.machine(),
            "architecture": platform.architecture()[0],
            "python_version": platform.python_version(),
+            "hbc_version": __version__,
+            "hbc_type": "mini",
        }
+        if self.config.get("owner"):
+            data["owner"] = self.config["owner"]
        if platform.system() == "Linux":
            data.update(_linux_distro())
        elif platform.system() == "Darwin":
@@ -383,7 +391,6 @@ class NagiosRunnerPlugin(MonitorPlugin):

    async def _collect_metrics(self) -> Dict[str, Any]:
        results: Dict[str, Any] = {}
-        worst = 0
        for cmd_cfg in self.commands:
            name = cmd_cfg.get("name")
            command = cmd_cfg.get("command")
@@ -394,10 +401,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
            results[f"{name}_status_code"] = rc
            results[f"{name}_output"] = msg
            results.update({f"{name}_{k}": v for k, v in perf.items()})
-            worst = max(worst, rc)
-        results["overall_status"] = _NAGIOS_STATUS.get(worst, "UNKNOWN")
-        results["overall_status_code"] = worst
-        results["plugin_count"] = len(self.commands)
        return results


@@ -482,6 +485,12 @@ class CPUMonitorPlugin(MonitorPlugin):
        except Exception:
            pass

+        try:
+            with open("/proc/uptime") as fh:
+                data["uptime_seconds"] = int(float(fh.read().split()[0]))
+        except Exception:
+            pass
+
        return data


@@ -529,19 +538,41 @@ class MemoryMonitorPlugin(MonitorPlugin):
            return {}
        total = mi.get("MemTotal", 0)
        avail = mi.get("MemAvailable", mi.get("MemFree", 0))
+        free = mi.get("MemFree", 0)
+
+        # ZFS ARC is reclaimable but not included in MemAvailable; add it.
+        arc_kb = 0
+        try:
+            with open("/proc/spl/kstat/zfs/arcstats") as _f:
+                for _line in _f:
+                    _p = _line.split()
+                    if len(_p) >= 3 and _p[0] == "size":
+                        arc_kb = int(_p[2]) // 1024
+                        break
+        except (OSError, ValueError):
+            pass
+
+        avail = min(avail + arc_kb, total)
        used = total - avail
        data: Dict[str, Any] = {
-            "mem_total_kb": total,
-            "mem_used_kb": used,
-            "mem_available_kb": avail,
-            "mem_percent": round(100.0 * used / total, 1) if total else 0.0,
+            "memory_total": total * 1024,
+            "memory_used": used * 1024,
+            "memory_available": avail * 1024,
+            "memory_free": free * 1024,
+            "memory_percent": round(100.0 * used / total, 1) if total else 0.0,
        }
+        for field, key in (("Buffers", "memory_buffers"), ("Cached", "memory_cached"),
+                           ("Active", "memory_active"), ("Inactive", "memory_inactive")):
+            if field in mi:
+                data[key] = mi[field] * 1024
        stotal = mi.get("SwapTotal", 0)
        if stotal:
            sfree = mi.get("SwapFree", 0)
-            data["swap_total_kb"] = stotal
-            data["swap_used_kb"] = stotal - sfree
-            data["swap_percent"] = round(100.0 * (stotal - sfree) / stotal, 1)
+            sused = stotal - sfree
+            data["swap_total"] = stotal * 1024
+            data["swap_used"] = sused * 1024
+            data["swap_free"] = sfree * 1024
+            data["swap_percent"] = round(100.0 * sused / stotal, 1)
        return data


@@ -577,7 +608,7 @@ class DiskMonitorPlugin(MonitorPlugin):
        except Exception as e:
            self.logger.warning("df failed: %s", e)
            return {}
-        data: Dict[str, Any] = {}
+        partitions: Dict[str, Any] = {}
        for line in out.decode(errors="replace").splitlines()[1:]:
            parts = line.split()
            if len(parts) < 6:
@@ -586,14 +617,19 @@ class DiskMonitorPlugin(MonitorPlugin):
            if self.mounts and mount not in self.mounts:
                continue
            try:
-                key = re.sub(r"[^a-zA-Z0-9_]", "_", mount).strip("_") or "root"
-                data[f"{key}_total_kb"] = int(parts[1])
-                data[f"{key}_used_kb"] = int(parts[2])
-                data[f"{key}_avail_kb"] = int(parts[3])
-                data[f"{key}_percent"] = int(parts[4].rstrip("%"))
+                total_kb = int(parts[1])
+                used_kb = int(parts[2])
+                avail_kb = int(parts[3])
+                pct = int(parts[4].rstrip("%"))
+                partitions[mount] = {
+                    "total": total_kb * 1024,
+                    "used": used_kb * 1024,
+                    "free": avail_kb * 1024,
+                    "percent": pct,
+                }
            except (ValueError, IndexError):
                continue
-        return data
+        return {"partitions": partitions} if partitions else {}


 # ---------------------------------------------------------------------------
@@ -649,17 +685,18 @@ class NetworkMonitorPlugin(MonitorPlugin):
        self._prev = (now, curr)
        if dt <= 0:
            return {}
-        data: Dict[str, Any] = {}
+        interfaces: Dict[str, Any] = {}
        for iface, (rx, tx) in curr.items():
            if iface in self.skip_ifaces or iface not in prev:
                continue
            prx, ptx = prev[iface]
-            key = re.sub(r"[^a-zA-Z0-9_]", "_", iface)
-            data[f"{key}_rx_bps"] = round((rx - prx) / dt)
-            data[f"{key}_tx_bps"] = round((tx - ptx) / dt)
-            data[f"{key}_rx_bytes"] = rx
-            data[f"{key}_tx_bytes"] = tx
-        return data
+            interfaces[iface] = {
+                "bytes_recv": rx,
+                "bytes_sent": tx,
+                "bytes_recv_delta": rx - prx,
+                "bytes_sent_delta": tx - ptx,
+            }
+        return {"interfaces": interfaces} if interfaces else {}


 # ---------------------------------------------------------------------------
@@ -682,7 +719,9 @@ async def _load_plugins(cfg: Dict[str, Any]) -> List[Plugin]:
    plugins_cfg: Dict[str, Any] = cfg.get("plugins", {})
    loaded: List[Plugin] = []
    for cls in _ALL_PLUGIN_CLASSES:
-        plugin_cfg = plugins_cfg.get(cls.name) or cfg.get(cls.name, {})
+        plugin_cfg = dict(plugins_cfg.get(cls.name) or cfg.get(cls.name) or {})
+        if "owner" in cfg and "owner" not in plugin_cfg:
+            plugin_cfg["owner"] = cfg["owner"]
        plugin: Plugin = cls(config=plugin_cfg)
        try:
            ok = await plugin.initialize()
@@ -752,7 +791,7 @@ class _HeartbeatProtocol(asyncio.DatagramProtocol):
            msg_id = msg.get("ID")
            now = time.time()
            if msg_id == "ACK":
-                self._conn._handle_ack(now)
+                self._conn._handle_ack(msg, now)
            elif msg_id == "CMD":
                asyncio.create_task(_handle_command(self._conn, msg))
            elif msg_id == "UPD":
@@ -763,8 +802,7 @@ class _HeartbeatProtocol(asyncio.DatagramProtocol):
            self._log.error("datagram error: %s", e)

    def error_received(self, exc):
-        self._log.warning("protocol error on %s: %s — dropping connection", self._conn.addr, exc)
-        self._conn._dead = True
+        self._log.warning("protocol error on %s: %s — will retry", self._conn.addr, exc)
        self._conn.close()


@@ -780,6 +818,7 @@ class AsyncConnection:
        self.rtts: List[float] = [0.0]
        self._transport: Optional[asyncio.DatagramTransport] = None
        self._dead = False
+        self._request_info: asyncio.Event = asyncio.Event()
        self._log = logging.getLogger(f"hbc.conn.{addr}")

    async def open(self) -> bool:
@@ -798,12 +837,14 @@ class AsyncConnection:
            self._transport.close()
            self._transport = None

-    def _handle_ack(self, now: float):
+    def _handle_ack(self, msg: Dict[str, Any], now: float):
        rtt = (now - self.lastsend) * 1000.0
        self.rtts.append(rtt)
        if len(self.rtts) > 10:
            self.rtts.pop(0)
        self.ackcount += 1
+        if msg.get("request_update"):
+            self._request_info.set()

    async def sendto(self, msg: Dict[str, Any], msg_id: str = "HTB"):
        if self._dead:
@@ -859,7 +900,7 @@ async def _handle_update(conn: AsyncConnection):
    log.info("running installer: %s", installer)
    try:
        proc = await asyncio.create_subprocess_exec(
-            installer, "client",
+            installer, "mini",
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
        )
@@ -936,6 +977,19 @@ async def _run_monitor_group(conn: AsyncConnection, plugins: List[Plugin], inter
        await _sleep(interval)


+async def _info_refresh_loop(conn: AsyncConnection, info: List[Plugin]):
+    log = logging.getLogger("hbc.plugins")
+    while _running:
+        await conn._request_info.wait()
+        if not _running:
+            break
+        conn._request_info.clear()
+        log.info("refreshing InfoPlugins on server request")
+        for plugin in info:
+            plugin._cache = None
+        await _run_info_plugins(conn, info)
+
+
 async def _plugin_collector(conn: AsyncConnection, plugins: List[Plugin]):
    info = [p for p in plugins if isinstance(p, InfoPlugin)]
    monitor = [p for p in plugins if isinstance(p, MonitorPlugin)]
@@ -946,12 +1000,10 @@ async def _plugin_collector(conn: AsyncConnection, plugins: List[Plugin]):
    for p in monitor:
        by_interval[p.interval].append(p)

-    if by_interval:
-        await asyncio.gather(
-            *[asyncio.create_task(_run_monitor_group(conn, grp, iv))
-              for iv, grp in by_interval.items()],
-            return_exceptions=True,
-        )
+    tasks = [asyncio.create_task(_info_refresh_loop(conn, info))]
+    tasks += [asyncio.create_task(_run_monitor_group(conn, grp, iv))
+              for iv, grp in by_interval.items()]
+    await asyncio.gather(*tasks, return_exceptions=True)


 # ---------------------------------------------------------------------------
@@ -995,7 +1047,7 @@ def _reconfigure_syslog(level: int):
 # ---------------------------------------------------------------------------

 async def _async_main(args, cfg: Dict[str, Any]) -> int:
-    global _running, _shutdown_event, _active_tasks
+    global _running, _shutdown_event, _active_tasks, send_shutdown
    _running = True
    _shutdown_event = asyncio.Event()
    _active_tasks = []
@@ -1005,7 +1057,7 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
    port = cfg.get("hb_port", PORT)
    interval = cfg.get("interval", INTERVAL)

-    log.info("starting: %s -> %s port=%d interval=%ds", iam, args.hosts, port, interval)
+    log.info("hbc_mini %s on %s -> %s port=%d interval=%ds", __version__, iam, args.hosts, port, interval)

    connections: List[AsyncConnection] = []
    conn_id = 1
@@ -1026,15 +1078,18 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
        return 1

    # Boot / one-shot message
+    send_shutdown = False
    if args.boot or args.message:
        bmsg: Dict[str, Any] = {"acks": 0}
        if args.boot:
            bmsg["boot"] = 1
+            args.boot = False  # don't repeat on restart
+            send_shutdown = True
        if args.message:
            bmsg["service"] = "service"
            bmsg["msg"] = args.message
-        for c in connections:
-            await c.sendto(bmsg)
+        target = next((c for c in connections if c._transport), connections[0])
+        await target.sendto(bmsg)
        if args.message and not args.daemon:
            await asyncio.sleep(0.3)
            for c in connections:
@@ -1047,6 +1102,13 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, _stop)

+    def _sighup():
+        global _dorestart
+        _dorestart = True
+        _stop()
+
+    loop.add_signal_handler(signal.SIGHUP, _sighup)
+
    for conn in connections:
        _active_tasks.append(asyncio.create_task(_heartbeat_sender(conn, interval)))

@@ -1059,11 +1121,13 @@ async def _async_main(args, cfg: Dict[str, Any]) -> int:
        pass

    log.info("shutting down")
-    for conn in connections:
+    target = next((c for c in connections if c._transport), connections[0] if connections else None)
+    if target and send_shutdown:
        try:
-            await conn.sendto({"shutdown": 1, "acks": conn.ackcount})
+            await target.sendto({"shutdown": 1, "acks": target.ackcount})
        except Exception:
            pass
+    for conn in connections:
        conn.close()
    await asyncio.sleep(0.3)
    for plugin in plugins:
@@ -68,8 +68,7 @@ async def test_nagios_runner():
    print(f"   ✓ Collected {len(data)} data points")
    
    print(f"\n4. Results:")
-    print(f"   Overall Status: {data.get('overall_status')} (code: {data.get('overall_status_code')})")
-    print(f"   Plugins Executed: {data.get('plugin_count')}")
+    print(f"   Data points collected: {len(data)}")
    
    # Show individual plugin results
    print(f"\n5. Individual Plugin Results:")
@@ -0,0 +1,324 @@
+import time as time_mod
+from unittest.mock import AsyncMock, MagicMock, patch
+from urllib.parse import urlparse, parse_qs
+
+import pytest
+
+from hbd.server import oauth
+from hbd.server import users as users_mod
+from hbd.server.users import User
+
+
+CFG_OFF = {}
+CFG_ON = {
+    "oauth": {
+        "gitea": {
+            "url": "https://git.example.com",
+            "client_id": "cid",
+            "client_secret": "csec",
+        }
+    }
+}
+CFG_PARTIAL = {"oauth": {"gitea": {"url": "https://git.example.com"}}}
+
+
+@pytest.fixture(autouse=True)
+def clear_oauth_states():
+    oauth._states.clear()
+    yield
+    oauth._states.clear()
+
+
+@pytest.fixture(autouse=True)
+def reset_users_dict():
+    original = dict(users_mod.users)
+    yield
+    users_mod.users = original
+
+
+def test_is_enabled_when_all_keys_present():
+    assert oauth.is_enabled(CFG_ON) is True
+
+
+def test_is_enabled_false_when_no_oauth_key():
+    assert oauth.is_enabled(CFG_OFF) is False
+
+
+def test_is_enabled_false_when_partial_config():
+    assert oauth.is_enabled(CFG_PARTIAL) is False
+
+
+def test_make_state_returns_unique_tokens():
+    s1 = oauth.make_state()
+    s2 = oauth.make_state()
+    assert s1 != s2
+    assert len(s1) == 64  # 32 bytes hex
+
+
+def test_validate_state_valid():
+    state = oauth.make_state()
+    assert oauth.validate_state(state) is True
+
+
+def test_validate_state_consumed_on_use():
+    state = oauth.make_state()
+    oauth.validate_state(state)
+    assert oauth.validate_state(state) is False  # replay rejected
+
+
+def test_validate_state_unknown():
+    assert oauth.validate_state("notastate") is False
+
+
+def test_validate_state_expired(monkeypatch):
+    state = oauth.make_state()
+    # Wind expiry into the past
+    monkeypatch.setitem(oauth._states, state, time_mod.time() - 1000)
+    assert oauth.validate_state(state) is False
+
+
+def _reset_users(entries=None):
+    users_mod.users = entries or {}
+
+
+def test_provision_oauth_user_new():
+    _reset_users()
+    user = users_mod.provision_oauth_user("gituser", "Git User", "https://example.com/avatar.png")
+    assert user.username == "gituser"
+    assert user.full_name == "Git User"
+    assert user.avatar == "https://example.com/avatar.png"
+    assert user.admin is False
+    assert user.password_hash == ""
+    assert "gituser" in users_mod.users
+
+
+def test_provision_oauth_user_no_password_login():
+    _reset_users()
+    user = users_mod.provision_oauth_user("gituser", "Git User", "")
+    assert user.check_password("anything") is False
+
+
+def test_provision_oauth_user_existing_updates_profile():
+    existing = User(
+        username="alice",
+        full_name="Old Name",
+        avatar="old.png",
+        password_hash="pbkdf2:sha256:1:salt:abc",
+        admin=True,
+        notification_channels=["chan1"],
+    )
+    _reset_users({"alice": existing})
+    user = users_mod.provision_oauth_user("alice", "New Name", "new.png")
+    assert user.full_name == "New Name"
+    assert user.avatar == "new.png"
+    # Fields the OAuth profile does not supply must be preserved
+    assert user.admin is True
+    assert user.password_hash == "pbkdf2:sha256:1:salt:abc"
+    assert user.notification_channels == ["chan1"]
+
+
+def test_provision_oauth_user_does_not_overwrite_with_empty():
+    existing = User(username="bob", full_name="Bob", avatar="bob.png")
+    _reset_users({"bob": existing})
+    user = users_mod.provision_oauth_user("bob", "", "")
+    assert user.full_name == "Bob"
+    assert user.avatar == "bob.png"
+
+
+def test_provision_oauth_user_survives_config_reload():
+    _reset_users()
+    users_mod.provision_oauth_user("oauthonly", "OAuth Only", "https://example.com/a.png")
+    assert "oauthonly" in users_mod.users
+    # Reload with an empty config; the OAuth-provisioned user should survive
+    users_mod.load_users({})
+    assert "oauthonly" in users_mod.users
+
+
+def test_authorization_url_shape():
+    state = "teststate"
+    redirect_uri = "https://hbd.example.com/login/oauth/gitea/callback"
+    url = oauth.authorization_url(CFG_ON, state, redirect_uri)
+    parsed = urlparse(url)
+    qs = parse_qs(parsed.query)
+    assert parsed.scheme == "https"
+    assert parsed.netloc == "git.example.com"
+    assert parsed.path == "/login/oauth/authorize"
+    assert qs["client_id"] == ["cid"]
+    assert qs["state"] == ["teststate"]
+    assert qs["redirect_uri"] == [redirect_uri]
+    assert qs["scope"] == ["user:email"]
+    assert qs["response_type"] == ["code"]
+
+
+@pytest.mark.asyncio
+async def test_exchange_code_returns_token():
+    redirect_uri = "https://hbd.example.com/login/oauth/gitea/callback"
+    mock_response = AsyncMock()
+    mock_response.status = 200
+    mock_response.json = AsyncMock(return_value={"access_token": "tok123"})
+
+    mock_session = MagicMock()
+    mock_session.post = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+
+    with patch("hbd.server.oauth.aiohttp.ClientSession", return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_session),
+        __aexit__=AsyncMock(return_value=False),
+    )):
+        token = await oauth.exchange_code(CFG_ON, "mycode", redirect_uri)
+    assert token == "tok123"
+
+
+@pytest.mark.asyncio
+async def test_exchange_code_raises_on_error_status():
+    redirect_uri = "https://hbd.example.com/login/oauth/gitea/callback"
+    mock_response = AsyncMock()
+    mock_response.status = 401
+    mock_response.text = AsyncMock(return_value="unauthorized")
+
+    mock_session = MagicMock()
+    mock_session.post = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+
+    with patch("hbd.server.oauth.aiohttp.ClientSession", return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_session),
+        __aexit__=AsyncMock(return_value=False),
+    )):
+        with pytest.raises(oauth.OAuthError):
+            await oauth.exchange_code(CFG_ON, "badcode", redirect_uri)
+
+
+@pytest.mark.asyncio
+async def test_fetch_user_returns_profile():
+    mock_response = AsyncMock()
+    mock_response.status = 200
+    mock_response.json = AsyncMock(return_value={
+        "login": "alice",
+        "full_name": "Alice Smith",
+        "avatar_url": "https://git.example.com/avatars/alice.png",
+    })
+
+    mock_session = MagicMock()
+    mock_session.get = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+
+    with patch("hbd.server.oauth.aiohttp.ClientSession", return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_session),
+        __aexit__=AsyncMock(return_value=False),
+    )):
+        profile = await oauth.fetch_user(CFG_ON, "tok123")
+    assert profile == {
+        "login": "alice",
+        "full_name": "Alice Smith",
+        "avatar_url": "https://git.example.com/avatars/alice.png",
+    }
+
+
+@pytest.mark.asyncio
+async def test_exchange_code_raises_when_no_access_token():
+    redirect_uri = "https://hbd.example.com/login/oauth/gitea/callback"
+    mock_response = AsyncMock()
+    mock_response.status = 200
+    mock_response.json = AsyncMock(return_value={"error": "bad_request"})
+
+    mock_session = MagicMock()
+    mock_session.post = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+
+    with patch("hbd.server.oauth.aiohttp.ClientSession", return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_session),
+        __aexit__=AsyncMock(return_value=False),
+    )):
+        with pytest.raises(oauth.OAuthError):
+            await oauth.exchange_code(CFG_ON, "mycode", redirect_uri)
+
+
+@pytest.mark.asyncio
+async def test_fetch_user_raises_on_error_status():
+    mock_response = AsyncMock()
+    mock_response.status = 401
+    mock_response.text = AsyncMock(return_value="unauthorized")
+
+    mock_session = MagicMock()
+    mock_session.get = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+
+    with patch("hbd.server.oauth.aiohttp.ClientSession", return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_session),
+        __aexit__=AsyncMock(return_value=False),
+    )):
+        with pytest.raises(oauth.OAuthError):
+            await oauth.fetch_user(CFG_ON, "tok123")
+
+
+# ---------------------------------------------------------------------------
+# Integration-style tests: callback logic chain
+# ---------------------------------------------------------------------------
+
+
+def test_callback_invalid_state_rejects():
+    """Verify validate_state returns False for unknown state tokens."""
+    # validate_state() is synchronous, so this test needs no event loop.
+    fake_state = "this-is-not-a-real-state"
+    assert oauth.validate_state(fake_state) is False
+
+
+@pytest.mark.asyncio
+async def test_full_oauth_flow_chain():
+    """Integration-style test: state → exchange → fetch → provision chain."""
+    redirect_uri = "https://hbd.example.com/login/oauth/gitea/callback"
+
+    # Step 1: create a state token and validate it once
+    state = oauth.make_state()
+    assert oauth.validate_state(state) is True  # consumed; replay would return False
+
+    # Step 2: exchange code → token, then fetch the profile (both mocked)
+    mock_token_response = AsyncMock()
+    mock_token_response.status = 200
+    mock_token_response.json = AsyncMock(return_value={"access_token": "flow_token"})
+
+    mock_user_response = AsyncMock()
+    mock_user_response.status = 200
+    mock_user_response.json = AsyncMock(return_value={
+        "login": "flowuser",
+        "full_name": "Flow User",
+        "avatar_url": "https://git.example.com/avatars/flow.png",
+    })
+
+    mock_session = MagicMock()
+    mock_session.post = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_token_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+    mock_session.get = MagicMock(return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_user_response),
+        __aexit__=AsyncMock(return_value=False),
+    ))
+
+    with patch("hbd.server.oauth.aiohttp.ClientSession", return_value=AsyncMock(
+        __aenter__=AsyncMock(return_value=mock_session),
+        __aexit__=AsyncMock(return_value=False),
+    )):
+        token = await oauth.exchange_code(CFG_ON, "authcode", redirect_uri)
+        profile = await oauth.fetch_user(CFG_ON, token)
+
+    assert token == "flow_token"
+    assert profile["login"] == "flowuser"
+
+    # Step 3: provision user
+    _reset_users()
+    user = users_mod.provision_oauth_user(
+        profile["login"], profile["full_name"], profile["avatar_url"]
+    )
+    assert user.username == "flowuser"
+    assert user.check_password("anything") is False