version 5.2.1

fix: threshold and logging improvements
- threshold: fix crash when display is None (_format_display now falls back to default format string instead of calling None.format())
- threshold: shorten notification messages by stripping plugin-name prefix from metric_path (cpu_percent instead of cpu_monitor.cpu_percent)
- main: demote aiohttp.access log records from INFO to DEBUG
- udp: replace debug print with proper logger.info for new host sign-on

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:07:01 -04:00 · 2026-05-06 07:06:56 -04:00 · 2026-05-05 13:47:46 -04:00 · 2026-05-05 13:45:43 -04:00 · 2026-05-05 13:02:28 -04:00 · 2026-05-05 12:26:56 -04:00
14 changed files with 324 additions and 194 deletions
@@ -58,10 +58,11 @@ Heartbeat includes a comprehensive plugin architecture that extends monitoring b
 ### Built-in Plugins

 - `os_info`: Collects OS, kernel, distribution, and architecture information
- `cpu_monitor`: Monitors CPU usage, load average, frequency, and process counts
- `memory_monitor`: Monitors RAM and swap usage, available memory
+- `cpu_monitor`: Monitors CPU usage, load average, frequency, process counts, and uptime
+- `memory_monitor`: Monitors RAM and swap usage, available memory (ZFS ARC-aware)
 - `disk_monitor`: Monitors disk usage, I/O statistics, and filesystem metrics
 - `network_monitor`: Monitors network interface statistics, bandwidth, and connections
+- `ping_monitor`: Measures round-trip latency to configured hosts
 - `filesystem_info`: Collects mounted filesystem information (physical filesystems only by default)
 - `nagios_runner`: Executes Nagios monitoring plugins (check_disk, check_load, check_http, etc.)
 - `zfs_monitor`: Monitors ZFS pool health, capacity, fragmentation, dedup ratio, and cumulative I/O via `zpool(8)`
@@ -76,7 +77,7 @@ The `nagios_runner` plugin provides seamless integration with the vast Nagios pl
 - Validates absolute command paths at startup and warns on missing or non-executable files
 - Parses exit codes (OK/WARNING/CRITICAL/UNKNOWN)
 - Extracts performance data with thresholds
- Reports aggregated status across all configured checks
+- Reports per-check status, exit code, and output; no aggregate rollup field

 See [docs/NAGIOS_INTEGRATION.md](docs/NAGIOS_INTEGRATION.md) for complete integration guide including configuration examples and custom plugin development.
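Since the plugin now reports only per-check fields, a consumer that wants a single rollup has to derive it from the `<name>_status_code` values. The sketch below shows one way to do that in Python, following the same "highest exit code wins" rule as the dashboard helper in this commit. The function name `worst_nagios_status` and the sample data are illustrative assumptions, not part of the plugin API.

```python
from typing import Any, Dict

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def worst_nagios_status(data: Dict[str, Any]) -> int:
    """Return the highest exit code found in any `<name>_status_code` field.

    Hypothetical helper: returns 0 (OK) when no status-code field is present.
    """
    worst = 0
    for key, value in data.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_code = isinstance(value, int) and not isinstance(value, bool)
        if key.endswith("_status_code") and is_code:
            worst = max(worst, value)
    return worst


# Made-up plugin data for two configured checks.
sample = {
    "check_load_status": "OK",
    "check_load_status_code": 0,
    "check_disk_status": "WARNING",
    "check_disk_status_code": 1,
    "check_disk_output": "DISK WARNING - free space: / 15%",
}

print(STATUS_NAMES.get(worst_nagios_status(sample), "UNKNOWN"))  # WARNING
```

Note that a plain numeric comparison ranks UNKNOWN (3) above CRITICAL (2); a consumer that prefers CRITICAL to win would need its own ordering.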

@@ -181,7 +182,8 @@ thresholds:
      warning: 80.0      # Warn when CPU > 80%
      critical: 90.0     # Critical when CPU > 90%
      operator: ">"
-      hysteresis: 0.1    # 10% hysteresis to prevent flapping
+      hysteresis: 0.02   # 2% hysteresis to prevent flapping
+      display: "(threshold: {op_symbol} {threshold_value}%)"  # optional
  
  memory_monitor:
    percent:
@@ -223,7 +225,7 @@ thresholds:
    <hostname>:
      warning: <milliseconds>   # Warn when RTT > this value
      critical: <milliseconds>  # Critical when RTT > this value
-      hysteresis: 0.1           # Optional: 10% hysteresis (default)
+      hysteresis: 0.02          # Optional: 2% hysteresis (default)
 ```
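To make the hysteresis fraction concrete, the sketch below reproduces the recovery-threshold arithmetic that this commit exposes to the Alerts dashboard: the band is the crossed threshold multiplied by the hysteresis fraction, subtracted for `>`/`>=` and added for `<`/`<=`. The standalone function name `recovery_threshold` is an assumption made for illustration; in the commit the calculation lives inside `AlertState`.

```python
from typing import Optional


def recovery_threshold(
    threshold_value: float, operator: str, hysteresis: Optional[float]
) -> Optional[float]:
    """Return the value an alert must pass to recover, or None if not applicable."""
    if not hysteresis:
        return None  # no hysteresis configured (None or 0.0)
    band = abs(threshold_value * hysteresis)
    if operator in (">", ">="):
        return round(threshold_value - band, 4)
    if operator in ("<", "<="):
        return round(threshold_value + band, 4)
    return None  # ==, != and the nagios operator have no recovery band


# RTT warning at 200 ms with the 2% default: the alert clears below 196 ms.
print(recovery_threshold(200, ">", 0.02))  # 196.0
```

With the previous 10% default the same warning would only have cleared below 180 ms, which is why the smaller band recovers sooner.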

 **Example alerts:**
@@ -274,7 +276,59 @@ All plugin metrics can be thresholded:
 - **Memory**: percent, available_mb, swap_percent
 - **Disk**: Per-partition percent, free_gb, free_mb
 - **Network**: errors_total, dropped packets, connection counts
- **Nagios**: exit_code mapping (0=OK, 1=WARNING, 2=CRITICAL)
+- **Nagios**: Any field emitted by `nagios_runner` (`<name>_status_code`, `<name>_status`, `<name>_output`, performance data fields)
+
+### Display Format Templates
+
+Each threshold entry accepts an optional `display` field — a Python format string shown in notifications and on the Alerts dashboard:
+
+```yaml
+nagios_runner:
+  status_code:
+    warning: 1
+    critical: 2
+    operator: ">="
+    display: "{check_name}: exit {value} (expected < {threshold_value})"
+```
+
+Available variables:
+
+| Variable | Description |
+|---|---|
+| `{value}` | Current metric value |
+| `{threshold_value}` | Threshold that was crossed |
+| `{op_symbol}` | Comparison operator (`>`, `<`, `>=`, …); `"nagios"` for the nagios operator |
+| `{check_name}` | Prefix stripped by generic matching (see below) |
+| `{metric_name}` | Full field name within the plugin data |
+| `{output}` | For `nagios_runner` generic matches: the matched check's status text (alias for `{check_name}_output`) |
+| `{status}` | For `nagios_runner` generic matches: the matched check's status name — OK/WARNING/CRITICAL/UNKNOWN (alias for `{check_name}_status`) |
+| any plugin field | Any other field present in the plugin's data |
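As a rough illustration of how these variables are substituted, the sketch below renders a template with Python's `str.format` and falls back to the default template when `display` is `None`, which is the behaviour the commit message describes. The helper name `render_display` and the sample field values are assumptions; the server's actual `_format_display` is not reproduced here.

```python
from typing import Any, Dict, Optional

# Default template for non-nagios operators, as it appears in this commit.
DEFAULT_DISPLAY = "(threshold: {op_symbol} {threshold_value})"


def render_display(display: Optional[str], fields: Dict[str, Any]) -> str:
    """Render a display template; use the default when none is configured."""
    template = display if display is not None else DEFAULT_DISPLAY
    return template.format(**fields)


# Made-up values for a generic nagios_runner match.
fields = {
    "value": 2,
    "threshold_value": 2,
    "op_symbol": ">=",
    "check_name": "check_disk_root",
    "metric_name": "check_disk_root_status_code",
}

print(render_display("{check_name}: exit {value} (expected < {threshold_value})", fields))
print(render_display(None, fields))
```

In this sketch a template that names a variable missing from `fields` raises `KeyError`; how the server treats that case is not shown in this section.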
+
+### Generic Threshold Matching
+
+When a metric name has no exact threshold entry, the server progressively strips leading underscore-separated segments and re-tries the lookup. This lets a single generic entry cover an entire family of metrics.
+
+The classic use case is `nagios_runner`, which names each metric after the command that produced it:
+
+```
+nagios_runner.check_disk_root_status_code    → no exact match
+nagios_runner.disk_root_status_code          → no match
+nagios_runner.root_status_code               → no match
+nagios_runner.status_code                    → matched ✓
+```
+
+Configure the generic threshold once using the `nagios` operator, which maps exit codes directly to alert severity without requiring numeric warning/critical values:
+
+```yaml
+nagios_runner:
+  status_code:
+    operator: "nagios"   # 0=OK  1=WARNING  2=CRITICAL  3=UNKNOWN
+    display: "{check_name}: {output}"
+```
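The exit-code mapping performed by the `nagios` operator can be sketched as a small lookup. The `AlertLevel` enum below is a stand-in with assumed member values, and `evaluate_nagios` is a hypothetical free function; in the commit the logic sits inside `ThresholdConfig.evaluate`.

```python
from enum import Enum
from typing import Any


class AlertLevel(Enum):
    # Stand-in for the server's enum; the member values here are assumptions.
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"
    UNKNOWN = "unknown"


_NAGIOS_LEVELS = {
    0: AlertLevel.OK,
    1: AlertLevel.WARNING,
    2: AlertLevel.CRITICAL,
}


def evaluate_nagios(value: Any) -> AlertLevel:
    """Map a Nagios exit code to an alert level; anything else is UNKNOWN."""
    try:
        code = int(value)
    except (TypeError, ValueError):
        return AlertLevel.UNKNOWN
    return _NAGIOS_LEVELS.get(code, AlertLevel.UNKNOWN)


for raw in (0, 1, "2", 3, None):
    print(repr(raw), "->", evaluate_nagios(raw).name)
```

Because the value itself carries the severity, no `warning` or `critical` numbers are compared and hysteresis has nothing to act on.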
+
+The stripped prefix (`check_disk_root` in the example above) is available as `{check_name}` in the display template, so you can identify which check triggered the alert without writing a separate threshold entry per command.
+
+Exact matches always take priority. A generic entry only applies when no specific one is defined.
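The lookup order described above can be sketched in a few lines of Python. This mirrors the `_find_threshold` logic in this commit, but as a standalone function over a plain dict; the name `find_threshold` and the string placeholders used as threshold objects are assumptions for illustration.

```python
from typing import Any, Dict, Optional, Tuple


def find_threshold(
    thresholds: Dict[str, Any], metric_path: str
) -> Tuple[Optional[Any], Optional[str]]:
    """Return (threshold, check_name); check_name is None for an exact match."""
    if metric_path in thresholds:
        return thresholds[metric_path], None
    plugin, sep, field = metric_path.partition(".")
    if not sep:
        return None, None
    parts = field.split("_")
    # Strip one leading segment at a time, so the longest suffix wins.
    for i in range(1, len(parts)):
        candidate = plugin + "." + "_".join(parts[i:])
        if candidate in thresholds:
            return thresholds[candidate], "_".join(parts[:i])
    return None, None


# Placeholder strings stand in for real threshold objects.
thresholds = {
    "nagios_runner.status_code": "generic",
    "nagios_runner.check_load_status_code": "specific",
}

print(find_threshold(thresholds, "nagios_runner.check_disk_root_status_code"))
print(find_threshold(thresholds, "nagios_runner.check_load_status_code"))
```

The first call falls through to the generic entry and reports `check_disk_root` as the stripped prefix; the second hits the exact entry, so no prefix is returned. Only the field name after the first `.` is split, so underscores in the plugin name are never stripped.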

 ### Per-Host Threshold Profiles

@@ -461,12 +515,11 @@ You can also run it via the module entrypoint:
 python -m hbd.client.main your-server.example.com
 ```

-Client configuration can also be specified in YAML:
+Client configuration can also be specified in YAML (`~/.hbc.yaml`):

 ```yaml
-server: hbd.example.com
-port: 50003
-interval: 30
+hb_port: 50003        # Server port (default: 50003)
+interval: 30          # Heartbeat interval in seconds
 plugins:
  cpu_monitor:
    interval: 300      # Check every 5 minutes (default)
@@ -480,10 +533,14 @@ plugins:
  nagios_runner:
    interval: 300      # Check every 5 minutes (default)
    commands:
-      - /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
-      - /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
+      - name: check_load
+        command: /usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
+      - name: check_disk
+        command: /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
 ```

+The server hostname is always passed as a positional command-line argument; there is no `server:` config key.
+
 All monitoring plugins default to 5-minute (300 second) intervals, but can be customized as needed.

 **Connection retry:** If a server is temporarily unreachable, `hbc` retries `open()` indefinitely on every heartbeat interval. IPv6 connections that never succeeded during early startup are dropped after 3 consecutive failures (to handle hosts without IPv6 routing), while IPv4 connections always retry.
@@ -104,11 +104,6 @@ The `nagios_runner` plugin collects:
 - `{name}_{metric}_min` - Minimum value (if present)
 - `{name}_{metric}_max` - Maximum value (if present)

-**Overall:**
- `overall_status` - Worst status from all commands
- `overall_status_code` - Worst status code
- `plugin_count` - Number of Nagios plugins executed
-
 ## Configuration Options

 ```yaml
@@ -1110,33 +1110,6 @@ hosts:
  db-02:
    threshold_config: [tight_memory, db_disk]
 ```
-
-### Backward Compatibility
-
-The legacy single threshold configuration is fully supported:
-
-```yaml
-# Old format - still works
-thresholds:
-  cpu_monitor:
-    cpu_percent:
-      warning: 80.0
-      critical: 90.0
-```
-
-This is equivalent to:
-
-```yaml
-# New format
-threshold_configs:
-  default:
-    thresholds:
-      cpu_monitor:
-        cpu_percent:
-          warning: 80.0
-          critical: 90.0
-```
-
 ### Configuration Priority

 1. **Host `threshold_config` (list)**: Layer each named config's overrides left-to-right on top of the defaults
@@ -14,4 +14,4 @@ Install options:
 """

 __all__ = ["__version__"]
-__version__ = "5.1.19"
+__version__ = "5.2.1"
@@ -31,16 +31,13 @@ from hbd.client.plugin import MonitorPlugin


 # Nagios exit codes
-NAGIOS_OK = 0
-NAGIOS_WARNING = 1
-NAGIOS_CRITICAL = 2
 NAGIOS_UNKNOWN = 3

 STATUS_NAMES = {
-    NAGIOS_OK: "OK",
-    NAGIOS_WARNING: "WARNING",
-    NAGIOS_CRITICAL: "CRITICAL",
-    NAGIOS_UNKNOWN: "UNKNOWN"
+    0: "OK",
+    1: "WARNING",
+    2: "CRITICAL",
+    3: "UNKNOWN",
 }


@@ -128,52 +125,39 @@ class NagiosRunnerPlugin(MonitorPlugin):
            Dictionary with results from all plugins
        """
        results = {}
-        
-        # Track overall status (worst status wins)
-        worst_status = NAGIOS_OK
-        
+
        for cmd_config in self.commands:
            name = cmd_config.get("name")
            command = cmd_config.get("command")
-            
+
            if not name or not command:
                self.logger.warning("Skipping command with missing name or command")
                continue
-            
+
            # Execute plugin
            try:
                status_code, output, perfdata = await self._run_nagios_plugin(command)
-                
+
                # Store results
                results[f"{name}_status"] = STATUS_NAMES.get(status_code, "UNKNOWN")
                results[f"{name}_status_code"] = status_code
                results[f"{name}_output"] = output
-                
-                # Track worst status
-                if status_code > worst_status:
-                    worst_status = status_code
-                
+
                # Parse and add performance data
                if perfdata:
                    for metric_name, metric_value in perfdata.items():
                        results[f"{name}_{metric_name}"] = metric_value
-                
+
                self.logger.info(
                    f"Executed {name}: {STATUS_NAMES.get(status_code, 'UNKNOWN')} - {output[:50]}"
                )
-                
+
            except Exception as e:
                self.logger.error(f"Error running {name}: {e}", exc_info=True)
                results[f"{name}_status"] = "ERROR"
                results[f"{name}_status_code"] = NAGIOS_UNKNOWN
                results[f"{name}_output"] = str(e)
-                worst_status = NAGIOS_UNKNOWN
-        
-        # Add overall status
-        results["overall_status"] = STATUS_NAMES.get(worst_status, "UNKNOWN")
-        results["overall_status_code"] = worst_status
-        results["plugin_count"] = len(self.commands)
-        
+
        return results
    
    async def _run_nagios_plugin(
@@ -95,6 +95,12 @@ THRESHOLD_DEFAULTS = {
                'warning': 200,
                'critical': 250.0,
                'count': 3  # Optional: number of consecutive breaches before alerting
+            },
+            'nagios_runner': {
+                'status_code': {
+                    'display': '{check_name} {output}',
+                    'operator': "nagios"
+                }   
            }
        }
    }
@@ -475,6 +475,7 @@ def run(config, config_path=None):
    if config.get("debug", 0) > 0:
        log_level = logging.DEBUG
    logging.basicConfig(level=log_level)
+    logging.getLogger("aiohttp.access").setLevel(logging.DEBUG)
    load_pickled_hosts(config, hbdclass)

    notify_mod.initlog(logfile=config.get("logfile", "messages.log"))
@@ -4,7 +4,7 @@

  <style>

-    body {
+    html, body {
      height: auto;
      overflow-y: auto;
    }
@@ -175,8 +175,12 @@

    .alert-hostname {
      font-weight: bold;
-      color: #333;
+      color: #0066cc;
      font-size: 1.1em;
+      text-decoration: none;
+    }
+    .alert-hostname:hover {
+      text-decoration: underline;
    }

    .alert-metric {
@@ -405,6 +409,10 @@
        } else if (alert.threshold_value !== undefined && alert.threshold_value !== null && alert.operator) {
          valueText += ` <span class="threshold-info">(threshold: ${alert.operator} ${formatValue(alert.threshold_value)})</span>`;
        }
+        if (alert.recovery_threshold !== undefined && alert.recovery_threshold !== null) {
+          const recOp = (alert.operator === '>' || alert.operator === '>=') ? '<' : '>';
+          valueText += ` <span class="threshold-info" style="color:#888">(recovers ${recOp} ${formatValue(alert.recovery_threshold)})</span>`;
+        }
        
        // Build actions section
        let actionsHtml = '';
@@ -429,7 +437,7 @@
            <div class="alert-main">
              <div class="alert-header">
                <span class="alert-level ${level}">${alert.level}</span>
-                <span class="alert-hostname">${alert.hostname}</span>
+                <a class="alert-hostname" href="/plugins#${alert.hostname}">${alert.hostname}</a>
              </div>
              <div class="alert-metric">${alert.metric_path}</div>
              <div class="alert-details">
@@ -499,6 +499,17 @@
        return pluginCache[hostname]?.[pluginName] ?? null;
      }

+      // Return worst nagios exit code (0-3) found in a nagios_runner data object.
+      function nagiosWorstStatus(data) {
+        let worst = 0;
+        for (const [k, v] of Object.entries(data || {})) {
+          if (k.endsWith('_status_code') && typeof v === 'number' && v > worst) {
+            worst = v;
+          }
+        }
+        return worst;
+      }
+
      // ── Fetch helpers ───────────────────────────────────────────────────────

      async function fetchPlugin(hostname, pluginName) {
@@ -600,13 +611,13 @@
          ? chips.join('')
          : '<span class="glance-loading">—</span>';

-        // Nagios badge
+        // Nagios badge — derive worst status from individual check codes
        const nagios = getCache(hostname, 'nagios_runner');
        if (nagosBadge && nagios) {
-          const status = (nagios.data.overall_status || '—').toUpperCase();
-          const cls = status === 'OK' ? 'ok'
-            : status === 'WARNING' ? 'warning'
-            : status === 'CRITICAL' ? 'critical' : '';
+          const worst = nagiosWorstStatus(nagios.data);
+          const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
+          const status = names[worst] || '—';
+          const cls = worst === 0 ? 'ok' : worst === 1 ? 'warning' : worst >= 2 ? 'critical' : '';
          nagosBadge.className = `nagios-badge ${cls}`;
          nagosBadge.textContent = status;
        }
@@ -715,9 +726,10 @@
            break;
          }
          case 'nagios_runner': {
-            const status = (d.overall_status || '?').toUpperCase();
-            const count = d.plugin_count;
-            text = status + (count != null ? ` — ${count} checks` : '');
+            const worst = nagiosWorstStatus(d);
+            const names = {0:'OK', 1:'WARNING', 2:'CRITICAL', 3:'UNKNOWN'};
+            const codes = Object.keys(d).filter(k => k.endsWith('_status_code'));
+            text = (names[worst] || '?') + (codes.length ? ` — ${codes.length} checks` : '');
            break;
          }
          case 'filesystem_info': {
@@ -30,12 +30,13 @@ class AlertLevel(Enum):

 class ComparisonOperator(Enum):
    """Supported comparison operators for threshold checks."""
-    GT = ">"      # Greater than
-    GTE = ">="    # Greater than or equal
-    LT = "<"      # Less than
-    LTE = "<="    # Less than or equal
-    EQ = "=="     # Equal to
-    NEQ = "!="    # Not equal to
+    GT = ">"        # Greater than
+    GTE = ">="      # Greater than or equal
+    LT = "<"        # Less than
+    LTE = "<="      # Less than or equal
+    EQ = "=="       # Equal to
+    NEQ = "!="      # Not equal to
+    NAGIOS = "nagios"  # Nagios exit-code semantics: 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN


 class AlertState:
@@ -57,6 +58,7 @@ class AlertState:
        self.last_notification = None
        self.threshold_value = None  # The threshold value that triggered alert
        self.operator = None  # The comparison operator (>, <, >=, etc.)
+        self.hysteresis: Optional[float] = None  # Hysteresis fraction used for recovery
        self.formatted_message = None  # Formatted display message for UI
        self.acknowledged = False  # Whether alert has been acknowledged
        self.acknowledged_at = None  # Timestamp when acknowledged
@@ -151,7 +153,16 @@ class AlertState:
            result["operator"] = self.operator
        if self.formatted_message is not None:
            result["formatted_message"] = self.formatted_message
-        
+
+        # Compute and expose the recovery threshold so the UI can display it
+        if (self.hysteresis and self.threshold_value is not None
+                and self.operator is not None):
+            ha = abs(self.threshold_value * self.hysteresis)
+            if self.operator in ('>', '>='):
+                result["recovery_threshold"] = round(self.threshold_value - ha, 4)
+            elif self.operator in ('<', '<='):
+                result["recovery_threshold"] = round(self.threshold_value + ha, 4)
+
        return result
    
    def __setstate__(self, state):
@@ -159,6 +170,8 @@ class AlertState:
        self.__dict__.update(state)
        if not hasattr(self, 'consecutive_count'):
            self.consecutive_count = 0
+        if not hasattr(self, 'hysteresis'):
+            self.hysteresis = None

    def acknowledge(self):
        """Acknowledge this alert to stop reminder notifications."""
@@ -217,33 +230,43 @@ class ThresholdConfig:
    def evaluate(self, value: float) -> AlertLevel:
        """
        Evaluate a value against this threshold.
-        
+
        Args:
            value: Metric value to check
-            
+
        Returns:
            AlertLevel indicating the severity
        """
        if not self.enabled:
            return AlertLevel.OK
-        
+
+        # Nagios exit-code semantics: value IS the severity
+        if self.operator == ComparisonOperator.NAGIOS:
+            try:
+                code = int(value)
+            except (TypeError, ValueError):
+                return AlertLevel.UNKNOWN
+            return {0: AlertLevel.OK, 1: AlertLevel.WARNING, 2: AlertLevel.CRITICAL}.get(
+                code, AlertLevel.UNKNOWN
+            )
+
        try:
            # Convert value to float for comparison
            value = float(value)
        except (TypeError, ValueError):
            logger.warning("Cannot convert value %s to float for %s", value, self.metric_path)
            return AlertLevel.UNKNOWN
-        
+
        # Check critical threshold first
        if self.critical is not None:
            if self._compare(value, self.critical):
                return AlertLevel.CRITICAL
-        
+
        # Then check warning threshold
        if self.warning is not None:
            if self._compare(value, self.warning):
                return AlertLevel.WARNING
-        
+
        return AlertLevel.OK
    
    def evaluate_with_hysteresis(
@@ -262,7 +285,11 @@ class ThresholdConfig:
            New alert level considering hysteresis
        """
        new_level = self.evaluate(value)
-        
+
+        # Nagios exit codes are discrete integers — hysteresis doesn't apply
+        if self.operator == ComparisonOperator.NAGIOS:
+            return new_level
+
        # If no hysteresis, return new level
        if self.hysteresis == 0.0:
            return new_level
@@ -392,14 +419,28 @@ class ThresholdChecker:
    
    def _parse_config(self, config: Dict[str, Any]):
        """Parse threshold configuration from YAML structure.
-        
+
        Supports two formats:
        1. Legacy format with direct 'thresholds' section
        2. New format with 'threshold_configs' and 'host_threshold_mapping'
+
+        In all cases, THRESHOLD_DEFAULTS are seeded into threshold_configs["default"]
+        so the Settings page always shows the built-in defaults.
+        _parse_multi_config() overwrites this with the fully-merged effective defaults.
        """
+        # Always expose built-in defaults through threshold_configs["default"] so
+        # the Settings page has something to display even in legacy/no-config mode.
+        seed: Dict[str, ThresholdConfig] = {}
+        for plugin_name, plugin_thresholds in THRESHOLD_DEFAULTS.get("thresholds", {}).items():
+            if isinstance(plugin_thresholds, dict):
+                self._parse_plugin_thresholds(plugin_name, plugin_thresholds, target_dict=seed)
+        if seed:
+            self.threshold_configs["default"] = seed
+            self.threshold_raw_configs["default"] = {}
+
        # Check for new multi-config format
        if "threshold_configs" in config:
-            self._parse_multi_config(config)
+            self._parse_multi_config(config)  # overwrites threshold_configs["default"]
        elif "thresholds" in config:
            # Legacy single threshold configuration
            self._parse_legacy_config(config)
@@ -545,11 +586,14 @@ class ThresholdChecker:
            warning = threshold_config.get("warning")
            critical = threshold_config.get("critical")
            operator = threshold_config.get("operator", ">")
-            display = threshold_config.get("display", "(threshold: {op_symbol} {threshold_value})")
-            hysteresis = threshold_config.get("hysteresis", 0.1)  # 10% default
+            # Nagios operator maps exit codes directly; no numeric thresholds needed
+            is_nagios_op = (operator == "nagios")
+            default_display = "{check_name}: {output}" if is_nagios_op else "(threshold: {op_symbol} {threshold_value})"
+            display = threshold_config.get("display", default_display)
+            hysteresis = threshold_config.get("hysteresis", 0.0 if is_nagios_op else 0.02)
            enabled = threshold_config.get("enabled", True)
-            
-            if warning is None and critical is None:
+
+            if warning is None and critical is None and not is_nagios_op:
                logger.warning("No thresholds defined for %s, skipping", metric_path)
                continue
            
@@ -649,7 +693,7 @@ class ThresholdChecker:
        warning = rtt_thresholds.get("warning")
        critical = rtt_thresholds.get("critical")
        operator = rtt_thresholds.get("operator", ">")
-        hysteresis = rtt_thresholds.get("hysteresis", 0.1)  # 10% default
+        hysteresis = rtt_thresholds.get("hysteresis", 0.02)  # 2% default
        enabled = rtt_thresholds.get("enabled", True)
        display = rtt_thresholds.get("display")
        count = rtt_thresholds.get("count", 1)
@@ -794,6 +838,12 @@ class ThresholdChecker:
        elif new_level == AlertLevel.WARNING and threshold.warning is not None:
            threshold_value = threshold.warning

+        # Keep hysteresis on the state so the UI can show the recovery threshold
+        if new_level != AlertLevel.OK:
+            alert_state.hysteresis = threshold.hysteresis
+        else:
+            alert_state.hysteresis = None
+
        # Update state and check for changes
        old_level = alert_state.level
        if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
@@ -805,26 +855,33 @@ class ThresholdChecker:
        return None
    def _find_threshold(
        self, thresholds: Dict[str, "ThresholdConfig"], metric_path: str
-    ) -> Optional["ThresholdConfig"]:
-        """Return the threshold for *metric_path*, falling back to suffix matches.
+    ) -> Tuple[Optional["ThresholdConfig"], Optional[str]]:
+        """Return (threshold, check_name) for *metric_path*, falling back to suffix matches.

-        Allows generic thresholds like ``ping_monitor.rtt_avg`` to match
-        fully-qualified paths like ``ping_monitor.8_8_8_8_rtt_avg``.
+        Allows generic thresholds like ``nagios_runner.status_code`` to match
+        fully-qualified paths like ``nagios_runner.check_disk_root_status_code``.
        The exact match is always tried first; then successive leading
        underscore-delimited segments are stripped from the field name until
        a match is found or no segments remain.
+
+        Returns:
+            (ThresholdConfig, None) for an exact match.
+            (ThresholdConfig, "check_disk_root") for a suffix match — the second
+            element is the stripped prefix, available as ``{check_name}`` in
+            display format templates.
+            (None, None) when no threshold is found.
        """
        if metric_path in thresholds:
-            return thresholds[metric_path]
+            return thresholds[metric_path], None
        plugin, sep, field = metric_path.partition(".")
        if not sep:
-            return None
+            return None, None
        parts = field.split("_")
        for i in range(1, len(parts)):
            candidate = plugin + "." + "_".join(parts[i:])
            if candidate in thresholds:
-                return thresholds[candidate]
-        return None
+                return thresholds[candidate], "_".join(parts[:i])
+        return None, None

    def check_plugin_data(
        self,
@@ -853,37 +910,39 @@ class ThresholdChecker:
        # Check flat metrics
        for metric_name, value in data.items():
            metric_path = f"{plugin_name}.{metric_name}"
-            
-            threshold = self._find_threshold(thresholds, metric_path)
+
+            threshold, check_name = self._find_threshold(thresholds, metric_path)
            if threshold is None:
                continue
-            
+
            # Get or create alert state
            if metric_path not in alert_states:
                alert_states[metric_path] = AlertState(metric_path)
-            
+
            alert_state = alert_states[metric_path]
-            
+
            # Evaluate threshold with hysteresis
            new_level = threshold.evaluate_with_hysteresis(
                value,
                alert_state.level
            )
-            
+
            # Determine which threshold was exceeded
            threshold_value = None
            if new_level == AlertLevel.CRITICAL and threshold.critical is not None:
                threshold_value = threshold.critical
            elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                threshold_value = threshold.warning
-            
+
+            alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
            # Update state and check for changes
            old_level = alert_state.level
            if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                state_changes.append((metric_path, old_level, new_level, value))
-                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data)
+                self._apply_grace(host_name, alert_state, metric_path, old_level, new_level, value, threshold, data, check_name=check_name, metric_name=metric_name)
            elif new_level != AlertLevel.OK:
-                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data)
+                self._check_pending_or_renotify(host_name, alert_state, metric_path, value, threshold, data, check_name=check_name, metric_name=metric_name)

        # Check nested metrics (e.g., partition data in disk_monitor)
        self._check_nested_metrics(
@@ -942,7 +1001,9 @@ class ThresholdChecker:
                        threshold_value = threshold.critical
                    elif new_level == AlertLevel.WARNING and threshold.warning is not None:
                        threshold_value = threshold.warning
-                    
+
+                    alert_state.hysteresis = threshold.hysteresis if new_level != AlertLevel.OK else None
+
                    old_level = alert_state.level
                    if alert_state.update(new_level, value, threshold_value, threshold.operator.value):
                        state_changes.append((metric_path, old_level, new_level, value))
@@ -959,6 +1020,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ):
        """Trigger a notification for an alert state change.
        
@@ -980,56 +1043,54 @@ class ThresholdChecker:
        
        # Format operator symbol
        op_symbol = threshold.operator.value
-        
+
+        # Short metric label: strip the plugin-name prefix for readability
+        short_path = metric_path.partition(".")[2] or metric_path
+
        # Use a display-friendly value (inf is the sentinel for "overdue")
        import math
        display_value = "overdue" if isinstance(value, float) and math.isinf(value) else value

-        # Format message
-        if new_level == AlertLevel.OK:
-            lvl = "RECOVER"
-            message = f"{metric_path} = {display_value} ({old_level.name} -> OK)"
-        elif new_level == AlertLevel.WARNING:
-            lvl = "WARNING"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
-            else:
-                message = f"{metric_path} = {display_value}"
-        elif new_level == AlertLevel.CRITICAL:
-            lvl = "CRITICAL"
-            if threshold_value is not None:
-                threshold_info = self._format_display(
-                    threshold.display,
-                    value=display_value,
-                    threshold_value=threshold_value,
-                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
-                )
-                message = f"{metric_path} = {display_value} {threshold_info}"
-            else:
-                message = f"{metric_path} = {display_value}"
-        else:
-            lvl = "UNKNOWN"
-            message = f"{metric_path} = {display_value}"
-        
-        # Return the formatted threshold info for storing in AlertState
-        formatted_threshold_msg = None
-        if threshold_value is not None and new_level != AlertLevel.OK:
-            formatted_threshold_msg = self._format_display(
+        # Format message — for the nagios operator there is no numeric threshold_value;
+        # render the display template whenever one is available.
+        has_display = threshold_value is not None or threshold.operator == ComparisonOperator.NAGIOS
+
+        def _fmt():
+            return self._format_display(
                threshold.display,
                value=display_value,
                threshold_value=threshold_value,
                op_symbol=op_symbol,
-                plugin_data=plugin_data
+                plugin_data=plugin_data,
+                check_name=check_name,
+                metric_name=metric_name,
            )
-        
+
+        if new_level == AlertLevel.OK:
+            lvl = "RECOVER"
+            message = f"{short_path} = {display_value} ({old_level.name} -> OK)"
+        elif new_level == AlertLevel.WARNING:
+            lvl = "WARNING"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+        elif new_level == AlertLevel.CRITICAL:
+            lvl = "CRITICAL"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+        else:
+            lvl = "UNKNOWN"
+            if has_display:
+                message = f"{short_path} = {display_value} {_fmt()}"
+            else:
+                message = f"{short_path} = {display_value}"
+
+        # Formatted threshold info stored on AlertState for the UI
+        formatted_threshold_msg = _fmt() if has_display and new_level != AlertLevel.OK else None
+
        return lvl, message, formatted_threshold_msg
    
    def _send_notification(
@@ -1077,32 +1138,61 @@ class ThresholdChecker:
        self,
        display_format: str,
        value: Any,
-        threshold_value: float,
+        threshold_value: Optional[float],
        op_symbol: str,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> str:
        """Format the display string using available data.
-        
-        Args:
-            display_format: Format string from threshold config
-            value: Current metric value
-            threshold_value: Threshold value that was exceeded
-            op_symbol: Comparison operator symbol
-            plugin_data: Optional dictionary of plugin data fields
-            
+
+        Available template variables:
+            {value}           - current metric value
+            {threshold_value} - threshold that was exceeded
+            {op_symbol}       - comparison operator (>, <, >=, <=, ==, !=)
+            {check_name}      - prefix stripped for generic threshold match
+                                (e.g. "check_disk_root" when metric
+                                "check_disk_root_status_code" matched generic
+                                threshold "status_code")
+            {metric_name}     - field name within the plugin data dict
+            Any key from plugin_data is also available.
+
        Returns:
            Formatted display string
        """
+        if not display_format:
+            display_format = "(threshold: {op_symbol} {threshold_value})" if threshold_value is not None else ""
+
        # Build format context with standard variables
        format_context = {
            'value': value,
-            'threshold_value': threshold_value,
            'op_symbol': op_symbol,
        }
-        
+        if threshold_value is not None:
+            format_context['threshold_value'] = threshold_value
+
+        # Add generic-match context variables when available
+        if check_name is not None:
+            format_context['check_name'] = check_name
+        if metric_name is not None:
+            format_context['metric_name'] = metric_name
+
        # Add all plugin data fields if available
        if plugin_data:
            format_context.update(plugin_data)
+
+        # For nagios_runner generic matches, expose the matched check's output
+        # and status as short aliases {output} and {status} so display templates
+        # don't need to use the full {check_disk_root_output} form.
+        if check_name and plugin_data:
+            if 'output' not in format_context:
+                output = plugin_data.get(f"{check_name}_output")
+                if output is not None:
+                    format_context['output'] = output
+            if 'status' not in format_context:
+                status = plugin_data.get(f"{check_name}_status")
+                if status is not None:
+                    format_context['status'] = status
        
        try:
            # Format the display string
@@ -1133,6 +1223,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> None:
        """Handle a state-change transition with grace-period logic.

@@ -1145,7 +1237,8 @@ class ThresholdChecker:
          - Past grace: fires the RECOVER notification normally.
        """
        lvl, message, formatted_msg = self._trigger_notification(
-            host_name, metric_path, old_level, new_level, value, threshold, plugin_data
+            host_name, metric_path, old_level, new_level, value, threshold, plugin_data,
+            check_name=check_name, metric_name=metric_name,
        )
        alert_state.formatted_message = formatted_msg

@@ -1181,6 +1274,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]],
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ) -> None:
        """Called when alert level is unchanged and non-OK.

@@ -1190,7 +1285,8 @@ class ThresholdChecker:
        if alert_state.pending_since is not None:
            if time.time() - alert_state.pending_since >= self.grace_seconds:
                lvl, message, formatted_msg = self._trigger_notification(
-                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data
+                    host_name, metric_path, AlertLevel.OK, alert_state.level, value, threshold, plugin_data,
+                    check_name=check_name, metric_name=metric_name,
                )
                alert_state.formatted_message = formatted_msg
                self._send_notification(
@@ -1199,7 +1295,7 @@ class ThresholdChecker:
                alert_state.pending_since = None
            # else: still within grace window, do nothing
        else:
-            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data)
+            self._check_renotify(host_name, alert_state, metric_path, value, threshold, plugin_data, check_name=check_name, metric_name=metric_name)

    def _check_renotify(
        self,
@@ -1209,6 +1305,8 @@ class ThresholdChecker:
        value: Any,
        threshold: ThresholdConfig,
        plugin_data: Optional[Dict[str, Any]] = None,
+        check_name: Optional[str] = None,
+        metric_name: Optional[str] = None,
    ):
        """Check if we should send a repeat notification.
        
@@ -1246,7 +1344,8 @@ class ThresholdChecker:
            
            # Format operator symbol
            op_symbol = threshold.operator.value
-            
+            short_path = metric_path.partition(".")[2] or metric_path
+
            # Time to re-notify
            if threshold_value is not None:
                # Use display format string
@@ -1255,11 +1354,13 @@ class ThresholdChecker:
                    value=value,
                    threshold_value=threshold_value,
                    op_symbol=op_symbol,
-                    plugin_data=plugin_data
+                    plugin_data=plugin_data,
+                    check_name=check_name,
+                    metric_name=metric_name,
                )
-                message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
+                message = f"REMINDER ({alert_state.level.name}): {host_name} - {short_path} = {value} {threshold_info}, ongoing for {int(now - alert_state.since)}s"
            else:
-                message = f"REMINDER ({alert_state.level.name}): {host_name} - {metric_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
+                message = f"REMINDER ({alert_state.level.name}): {host_name} - {short_path} = {value} (ongoing for {int(now - alert_state.since)}s)"
            
            from . import hbdclass
            host = hbdclass.Host.hosts.get(host_name)
@@ -1288,7 +1389,7 @@ class ThresholdChecker:
            if not host.alert_states:
                continue
            configured = self.get_thresholds_for_host(hostname)
-            stale = [mp for mp in host.alert_states if mp not in configured]
+            stale = [mp for mp in host.alert_states if self._find_threshold(configured, mp)[0] is None]
            for mp in stale:
                logger.info(
                    "Purging stale alert state for %s / %s (no threshold configured)",
@@ -336,8 +336,7 @@ def handle_datagram(msg: dict, addr, transport, ctx: dict):
        # Apply user-access settings from config
        access = config_mod.get_host_access(cfg, uname)
        host.apply_access(access["owner"], access["managers"], access["monitors"])
-        if verbose:
-            print(("XX: New host, num now %s" % (len(hbdcls.Host.hosts))))
+        logger.info("New host signed on: %s (dyn=%s, access=%s)", uname, host.dyn, access)
        newh = True
    else:
        host = hbdcls.Host.hosts[uname]
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "hbd"
-version = "5.1.19"
+version = "5.2.1"
 description = "Heartbeat monitoring system — client (hbc) and server (hbd)"
 readme = "README.md"
 requires-python = ">=3.11"
@@ -41,7 +41,7 @@ from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple

 # updated by scripts/bumpminor.sh
-__version__ = "5.1.19"
+__version__ = "5.2.1"

 # ---------------------------------------------------------------------------
 # Protocol  (mirrors hbd/common/proto.py)
@@ -388,7 +388,6 @@ class NagiosRunnerPlugin(MonitorPlugin):

    async def _collect_metrics(self) -> Dict[str, Any]:
        results: Dict[str, Any] = {}
-        worst = 0
        for cmd_cfg in self.commands:
            name = cmd_cfg.get("name")
            command = cmd_cfg.get("command")
@@ -399,10 +398,6 @@ class NagiosRunnerPlugin(MonitorPlugin):
            results[f"{name}_status_code"] = rc
            results[f"{name}_output"] = msg
            results.update({f"{name}_{k}": v for k, v in perf.items()})
-            worst = max(worst, rc)
-        results["overall_status"] = _NAGIOS_STATUS.get(worst, "UNKNOWN")
-        results["overall_status_code"] = worst
-        results["plugin_count"] = len(self.commands)
        return results


@@ -68,8 +68,7 @@ async def test_nagios_runner():
    print(f"   ✓ Collected {len(data)} data points")
    
    print(f"\n4. Results:")
-    print(f"   Overall Status: {data.get('overall_status')} (code: {data.get('overall_status_code')})")
-    print(f"   Plugins Executed: {data.get('plugin_count')}")
+    print(f"   Data points collected: {len(data)}")
    
    # Show individual plugin results
    print(f"\n5. Individual Plugin Results:")
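
The message shortening and display-template fallback introduced in the threshold hunks above can be approximated in isolation. This is a minimal sketch, not the project's code: `short_path` and `format_display` are hypothetical standalone helpers that mirror `metric_path.partition(".")[2] or metric_path` and the default-template fallback shown in the diff. The real `_format_display` also handles `check_name`, `metric_name`, and the `{output}`/`{status}` aliases. Its error handling is not visible in this excerpt, so the `except` branch below is an assumption.

```python
from typing import Any, Dict, Optional


def short_path(metric_path: str) -> str:
    """Drop the leading plugin name: 'cpu_monitor.cpu_percent' -> 'cpu_percent'."""
    # When there is no dot, partition() yields an empty third element,
    # so fall back to the full path.
    return metric_path.partition(".")[2] or metric_path


def format_display(
    display_format: Optional[str],
    value: Any,
    threshold_value: Optional[float],
    op_symbol: str,
    plugin_data: Optional[Dict[str, Any]] = None,
) -> str:
    """Hypothetical stand-in for the diff's _format_display()."""
    if not display_format:
        # Same fallback as the diff: use a default template, or an empty
        # string when there is no numeric threshold (e.g. the nagios operator).
        if threshold_value is None:
            return ""
        display_format = "(threshold: {op_symbol} {threshold_value})"

    context: Dict[str, Any] = {"value": value, "op_symbol": op_symbol}
    if threshold_value is not None:
        context["threshold_value"] = threshold_value
    if plugin_data:
        context.update(plugin_data)

    try:
        return display_format.format(**context)
    except (KeyError, IndexError, ValueError):
        # Assumption: an unresolvable template is returned as-is, not raised.
        return display_format


if __name__ == "__main__":
    path = short_path("cpu_monitor.cpu_percent")
    print(f"{path} = 97.2 {format_display(None, 97.2, 95.0, '>')}")
```

Passing `None` as the template no longer crashes on `None.format()`; it either renders the default `(threshold: ...)` text or, with no numeric threshold, contributes nothing to the message.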
Author	SHA1	Message	Date
andreas	f3d08d1c9e	version 5.2.1 (CI: Release / release (push), successful in 5s)	2026-05-06 07:07:01 -04:00
andreas	1e4263b793	fix: threshold and logging improvements - threshold: fix crash when display is None (_format_display now falls back to default format string instead of calling None.format()) - threshold: shorten notification messages by stripping plugin-name prefix from metric_path (cpu_percent instead of cpu_monitor.cpu_percent) - main: demote aiohttp.access log records from INFO to DEBUG - udp: replace debug print with proper logger.info for new host sign-on Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 07:06:56 -04:00
andreas	e931acb9f5	version 5.2.0 (CI: Release / release (push), successful in 5s)	2026-05-05 13:47:46 -04:00
andreas	018409e71d	docs: correct README inaccuracies found during code audit - Add ping_monitor to built-in plugins list - Update cpu_monitor (uptime) and memory_monitor (ZFS ARC) descriptions - Replace "aggregated status" bullet with accurate per-check reporting note - Fix RTT hysteresis default: 0.1 → 0.02 - Fix client YAML config: remove non-existent server:/port: keys, use hb_port: - Fix nagios_runner commands format: plain strings → {name:, command:} dicts - Fix Supported Metrics: exit_code → actual <name>_status_code/<name>_status/<name>_output fields Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 13:45:43 -04:00
andreas	1824f637b4	fix: always show THRESHOLD_DEFAULTS in Settings threshold config Seed threshold_configs["default"] from THRESHOLD_DEFAULTS at the start of _parse_config() so the Settings page displays built-in defaults regardless of whether the server config uses the multi-config format, the legacy thresholds: format, or has no threshold config at all. _parse_multi_config() overwrites the seed with the fully-merged effective defaults when a threshold_configs section is present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 13:02:28 -04:00
andreas	a534c06b26	feat: nagios operator for direct exit-code severity mapping Add ComparisonOperator.NAGIOS ("nagios") that maps Nagios exit codes directly to alert levels (0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN) without requiring numeric warning/critical thresholds. Hysteresis is bypassed for discrete codes. Display template defaults to "{check_name}: {output}". _format_display() handles None threshold_value gracefully. Add nagios_runner.status_code as a built-in default threshold config so nagios checks alert out of the box. Also: fix alerts.html scrolling (override html,body), make hostname a link to /plugins#<hostname>, remove overall_status/overall_status_code/plugin_count from nagios_runner and hbc_mini, replace with computed worst-status in plugins.html via nagiosWorstStatus() helper. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 12:26:56 -04:00
andreas	d7b5c97a4e	version 5.1.21 (CI: Release / release (push), successful in 6s)	2026-05-05 11:05:48 -04:00
andreas	ae447ac4a6	feat: nagios_runner improvements and alerts page fixes - nagios_runner: remove overall_status/overall_status_code/plugin_count fields; each command still reports its own <name>_status and <name>_status_code - threshold: expose {output} and {status} aliases in display templates for nagios_runner generic matches (mapped from <check_name>_output/status) - alerts.html: fix scrolling by overriding html,body height/overflow (style.css sets both); make hostname a link to /plugins/<hostname> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 11:05:45 -04:00
andreas	d44ce3d124	version 5.1.20 (CI: Release / release (push), successful in 6s)	2026-05-05 10:48:24 -04:00
andreas	b1985d0eb2	feat: generic threshold matching for nagios_runner with {check_name} display support _find_threshold() now returns the stripped prefix ("check_name") alongside the ThresholdConfig, enabling a single generic entry (e.g. nagios_runner.status_code) to cover all per-command metrics (check_disk_root_status_code, check_load_status_code, …). The prefix is threaded through to _format_display() as {check_name}, with {metric_name} also available in display templates. purge_stale_alerts() updated to use generic matching so it does not incorrectly drop alerts on generic-matched metrics. README updated with Display Format Templates and Generic Threshold Matching sections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 10:48:17 -04:00
andreas	de778f680f	fix: reduce default hysteresis 10%→2%; show recovery threshold in alerts UI The 10% default hysteresis created an unreasonably wide recovery band: a 95% threshold would only clear once the value dropped below 85.5%, causing alerts to linger long after the metric was well below the trigger level. Change default hysteresis to 2% across all threshold parsers (plugin metrics, partitions, RTT). For a 95% threshold, recovery is now at 93.1% instead of 85.5%. Add AlertState.hysteresis field (set on every check, cleared on OK) and expose recovery_threshold in to_dict() so the Alerts dashboard can display "recovers < 93.1" alongside the trigger threshold, making the hysteresis band visible to the user. Pickle backward-compatible via __setstate__. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-04 14:47:50 -04:00
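
The recovery figures quoted in the hysteresis commit message can be checked with a few lines. This is a sketch under an assumption: it treats hysteresis as a fraction of the trigger threshold for a greater-than comparison, which reproduces the 85.5 and 93.1 values above. `recovery_threshold` is a hypothetical helper, not the project's `AlertState` code.

```python
def recovery_threshold(threshold: float, hysteresis: float) -> float:
    """Value a '>' alert must drop below before it clears (assumed formula)."""
    return threshold * (1.0 - hysteresis)


old_band = recovery_threshold(95.0, 0.10)  # previous 10% default
new_band = recovery_threshold(95.0, 0.02)  # new 2% default

print(f"10% hysteresis: recovers below {old_band:.1f}")
print(f" 2% hysteresis: recovers below {new_band:.1f}")
```

With the smaller default, the gap between the trigger level and the recovery level shrinks from 9.5 points to 1.9 points for a 95% threshold.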