andreas a534c06b26 feat: nagios operator for direct exit-code severity mapping

Add ComparisonOperator.NAGIOS ("nagios") that maps Nagios exit codes
directly to alert levels (0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN) without
requiring numeric warning/critical thresholds. Hysteresis is bypassed for
discrete codes. Display template defaults to "{check_name}: {output}".
_format_display() handles None threshold_value gracefully.

Add nagios_runner.status_code as a built-in default threshold config so
nagios checks alert out of the box.

Also: fix alerts.html scrolling (override html,body), make hostname a link
to /plugins#<hostname>, remove overall_status/overall_status_code/plugin_count
from nagios_runner and hbc_mini, replace with computed worst-status in
plugins.html via nagiosWorstStatus() helper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-05 12:26:56 -04:00

Threshold Alerting System

Overview

The Heartbeat Monitoring System includes a threshold alerting system that monitors plugin metrics and sends notifications when values cross configured thresholds. It is designed to:

Detect anomalies: Automatically identify when system metrics exceed safe operating ranges
Prevent alert fatigue: Use hysteresis to prevent notification flapping
Escalate appropriately: Support WARNING and CRITICAL severity levels
Track state: Maintain alert history and state transitions per host
Integrate seamlessly: Work with existing notification infrastructure (email, pushover, etc.)

Architecture

Components

ThresholdChecker (hbd/threshold.py)
- Main threshold checking engine
- Parses configuration
- Evaluates metrics against thresholds
- Triggers notifications on state changes
ThresholdConfig
- Individual threshold configuration
- Supports multiple comparison operators
- Implements hysteresis logic
AlertState
- Tracks current alert state per metric
- Records state transitions
- Manages notification timing
Integration Points
- UDP handler: Checks thresholds when plugin data arrives
- Host objects: Store alert states per host
- Notification system: Sends alerts via configured channels

Alert Levels

OK: Metric is within normal range
WARNING: Metric has crossed the warning threshold (first-level concern)
CRITICAL: Metric has crossed the critical threshold (requires immediate attention)
UNKNOWN: Metric value cannot be evaluated (e.g., non-numeric data)

Configuration

Basic Structure

Thresholds are configured in the YAML configuration file under the thresholds section:

thresholds:
  plugin_name:
    metric_name:
      warning: 80.0
      critical: 90.0
      operator: ">"
      hysteresis: 0.1
      display: "display format"
      enabled: true

Configuration Parameters

Required Parameters

warning: Warning threshold value (numeric)
critical: Critical threshold value (numeric)

Note: At least one of warning or critical must be specified.

Optional Parameters

operator: Comparison operator (default: ">")
- ">" - Greater than
- ">=" - Greater than or equal
- "<" - Less than
- "<=" - Less than or equal
- "==" - Equal to
- "!=" - Not equal to
hysteresis: Hysteresis percentage to prevent flapping (default: 0.1 = 10%)
- Range: 0.0 to 1.0
- Prevents rapid state transitions when value hovers near threshold
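display: Format template for the threshold details shown in alert messages; placeholders such as {op_symbol} and {threshold_value} are filled in when the message is built
- Defaults to "(threshold: {op_symbol} {threshold_value})"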
enabled: Whether this threshold is active (default: true)
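
The operator strings above map naturally onto Python's standard `operator` module. The following is a minimal sketch of how one value could be classified, not the code in hbd/threshold.py. The names `OPERATORS` and `evaluate` are hypothetical, hysteresis is ignored, and checking the critical threshold before the warning threshold is an assumption.

```python
import operator

# Hypothetical lookup table: operator string from the config -> comparison function.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate(value, warning=None, critical=None, op=">"):
    """Return "OK", "WARNING", "CRITICAL" or "UNKNOWN" for one metric value."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return "UNKNOWN"  # non-numeric data cannot be compared
    compare = OPERATORS[op]
    # Assumption: the more severe level is tested first.
    if critical is not None and compare(value, critical):
        return "CRITICAL"
    if warning is not None and compare(value, warning):
        return "WARNING"
    return "OK"


print(evaluate(85.0, warning=80.0, critical=90.0))
print(evaluate(400, warning=1000, critical=500, op="<"))
print(evaluate("n/a", warning=80.0))
```

With these inputs the sketch prints WARNING, CRITICAL and UNKNOWN in that order. Note that with `<` the critical value is the smaller of the two numbers, which is why the available-memory examples in this document list `warning: 1000` above `critical: 500`.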

Comparison Operators

Greater Than (`>`, `>=`)

Used for metrics where higher values are problematic:

cpu_monitor:
  cpu_percent:
    warning: 80.0      # Alert when CPU > 80%
    critical: 90.0     # Alert when CPU > 90%
    operator: ">"

Examples:

CPU usage percentage
Memory usage percentage
Disk usage percentage
Load average
Error counters

Less Than (`<`, `<=`)

Used for metrics where lower values are problematic:

memory_monitor:
  available_mb:
    warning: 1000      # Alert when available memory < 1GB
    critical: 500      # Alert when available memory < 500MB
    operator: "<"

Examples:

Available memory
Free disk space
Connection pool availability
Battery level

Hysteresis

Hysteresis prevents alert flapping by requiring values to improve by a certain amount before recovering from an alert state.

How It Works

When a metric crosses a threshold (e.g., CPU goes from 85% to 91%, triggering CRITICAL), hysteresis is applied when the value improves:

Threshold: 90
Hysteresis: 0.1 (10%)
Recovery threshold: 90 - (90 * 0.1) = 81

Value 91 -> CRITICAL (threshold crossed)
Value 89 -> CRITICAL (still above recovery threshold of 81)
Value 85 -> CRITICAL (still above recovery threshold)
Value 80 -> WARNING or OK (below recovery threshold, re-evaluated normally)
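
The trace above can be reproduced with a short sketch. It is an illustration under stated assumptions rather than the logic in hbd/threshold.py: `recovery_threshold` and `next_level` are hypothetical names, only the `>` operator is handled, and the recovery threshold is computed as threshold - (threshold * hysteresis), as in the example.

```python
def recovery_threshold(threshold, hysteresis):
    """Value the metric must fall to or below before leaving an alert level (">" only)."""
    return threshold - threshold * hysteresis


def next_level(value, current, warning, critical, hysteresis):
    """Return the new level for a ">" threshold, applying hysteresis on recovery."""
    # Level the value would get with no hysteresis at all.
    if value > critical:
        raw = "CRITICAL"
    elif value > warning:
        raw = "WARNING"
    else:
        raw = "OK"

    severity = {"OK": 0, "WARNING": 1, "CRITICAL": 2}
    if severity[raw] >= severity[current]:
        return raw  # same level or escalation: hysteresis is not involved

    # The value improved. Hold the current level until it clears the recovery threshold.
    held = critical if current == "CRITICAL" else warning
    if value > recovery_threshold(held, hysteresis):
        return current
    return raw


level = "OK"
for value in (91, 89, 85, 80):
    level = next_level(value, level, warning=80, critical=90, hysteresis=0.1)
    print(value, level)
```

Running it prints CRITICAL for 91, 89 and 85, and OK for 80, since 80 is below the recovery threshold of 81 and does not exceed the warning threshold of 80.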

Configuration Recommendations

Stable metrics (CPU, memory): 10-15% hysteresis
```
hysteresis: 0.1
```
Very stable metrics (disk usage): 5% hysteresis
```
hysteresis: 0.05
```
Counter metrics (errors, packets): 20% hysteresis
```
hysteresis: 0.2
```
Discrete states (exit codes): No hysteresis
```
hysteresis: 0.0
```

Plugin-Specific Configuration

CPU Monitor

cpu_monitor:
  cpu_percent:
    warning: 80.0
    critical: 90.0
    operator: ">"
    hysteresis: 0.1
  
  load_1min:
    warning: 4.0
    critical: 8.0
    operator: ">"
    hysteresis: 0.15
  
  load_5min:
    warning: 3.0
    critical: 6.0
    operator: ">"
  
  load_15min:
    warning: 2.0
    critical: 4.0
    operator: ">"

Memory Monitor

memory_monitor:
  # Percentage-based threshold
  percent:
    warning: 85.0
    critical: 95.0
    operator: ">"
  
  # Absolute value threshold (inverse - alert when LOW)
  available_mb:
    warning: 1000
    critical: 500
    operator: "<"
  
  # Swap usage
  swap_percent:
    warning: 50.0
    critical: 80.0
    operator: ">"

Disk Monitor

Disk thresholds support partition-specific configuration:

disk_monitor:
  partitions:
    /:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
        hysteresis: 0.05
      
      free_gb:
        warning: 10.0
        critical: 5.0
        operator: "<"
    
    /home:
      percent:
        warning: 85.0
        critical: 95.0
        operator: ">"
    
    /var:
      percent:
        warning: 80.0
        critical: 90.0
        operator: ">"
      
      free_gb:
        warning: 5.0
        critical: 2.0
        operator: "<"
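
Alert states are keyed by dotted paths such as `disk_monitor./.percent` (see Alert State Tracking). One way the nested `partitions` mapping could be flattened into those keys is sketched below. `iter_thresholds` is a hypothetical helper; treating any mapping that contains `warning` or `critical` as a leaf, and leaving the `partitions` key out of the path, are assumptions based on the key format shown in this document.

```python
def iter_thresholds(plugin, node, prefix=()):
    """Yield (metric_path, settings) pairs from one plugin's threshold mapping."""
    for key, child in node.items():
        if not isinstance(child, dict):
            continue  # stray scalar: nothing to descend into
        if "warning" in child or "critical" in child:
            # Leaf: this mapping holds the threshold settings themselves.
            yield ".".join((plugin, *prefix, key)), child
        elif key == "partitions":
            # Assumption: the grouping key is left out of the metric path.
            yield from iter_thresholds(plugin, child, prefix)
        else:
            yield from iter_thresholds(plugin, child, prefix + (key,))


disk_monitor = {
    "partitions": {
        "/": {
            "percent": {"warning": 80.0, "critical": 90.0, "operator": ">"},
            "free_gb": {"warning": 10.0, "critical": 5.0, "operator": "<"},
        },
        "/home": {"percent": {"warning": 85.0, "critical": 95.0, "operator": ">"}},
    }
}

for path, settings in iter_thresholds("disk_monitor", disk_monitor):
    print(path, settings["warning"], settings["critical"])
```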

Network Monitor

network_monitor:
  # Error counters
  errors_total:
    warning: 100
    critical: 1000
    operator: ">"
    hysteresis: 0.2
  
  # Dropped packets
  dropin_total:
    warning: 50
    critical: 200
    operator: ">"
  
  dropout_total:
    warning: 50
    critical: 200
    operator: ">"
  
  # Connection states
  connections_TIME_WAIT:
    warning: 1000
    critical: 5000
    operator: ">"
  
  connections_ESTABLISHED:
    warning: 500
    critical: 1000
    operator: ">"

Nagios Runner

The Nagios plugin runner reports exit codes that can be thresholded:

nagios_runner:
  exit_code:
    warning: 1       # Map Nagios WARNING to our WARNING
    critical: 2      # Map Nagios CRITICAL to our CRITICAL
    operator: ">="
    hysteresis: 0.0  # No hysteresis for exit codes

Notification Behavior

When Notifications Are Sent

Notifications are triggered on state changes:

Escalation: OK → WARNING, OK → CRITICAL, WARNING → CRITICAL
```
WARNING: webserver01 - cpu_monitor.cpu_percent = 85.0
```

Recovery: CRITICAL → WARNING, CRITICAL → OK, WARNING → OK

RECOVERED: webserver01 - cpu_monitor.cpu_percent = 70.0 (CRITICAL -> OK)

Re-notifications: Periodic reminders for ongoing alerts

REMINDER (CRITICAL): webserver01 - cpu_monitor.cpu_percent = 95.0 (ongoing for 3600s)
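
The three message shapes above can be produced by one small function. This is a sketch for illustration only: `format_notification` and `SEVERITY` are hypothetical names, the UNKNOWN level is left out because its ordering is not specified here, and the optional `display` suffix is not rendered.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def format_notification(host, metric, value, old_level, new_level, ongoing_seconds=None):
    """Build a message in the style of the three examples above."""
    subject = f"{host} - {metric} = {value}"
    if new_level == old_level:
        # No state change: periodic reminder for an ongoing alert.
        return f"REMINDER ({new_level}): {subject} (ongoing for {ongoing_seconds}s)"
    if SEVERITY[new_level] < SEVERITY[old_level]:
        return f"RECOVERED: {subject} ({old_level} -> {new_level})"
    return f"{new_level}: {subject}"


print(format_notification("webserver01", "cpu_monitor.cpu_percent", 85.0, "OK", "WARNING"))
print(format_notification("webserver01", "cpu_monitor.cpu_percent", 70.0, "CRITICAL", "OK"))
print(format_notification("webserver01", "cpu_monitor.cpu_percent", 95.0, "CRITICAL", "CRITICAL", 3600))
```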

Notification Frequency

State changes: Immediate notification
Re-notifications: Controlled by threshold_renotify_interval (default: 3600 seconds = 1 hour)

threshold_renotify_interval: 3600  # Re-notify every hour for ongoing alerts
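
The timing rule can be summarized in a few lines. The sketch below assumes that reminders apply to any level other than OK and that the interval is measured from the last notification; `notification_due` is a hypothetical name, not part of the ThresholdChecker API.

```python
import time


def notification_due(old_level, new_level, last_notification, renotify_interval=3600, now=None):
    """Return "change", "reminder" or None for one threshold check."""
    now = time.time() if now is None else now
    if new_level != old_level:
        return "change"  # state changes notify immediately
    if new_level == "OK":
        return None  # nothing to remind about
    if last_notification is None or now - last_notification >= renotify_interval:
        return "reminder"
    return None


# An alert that last notified at t=1000 is reminded again at t=4600, not before.
print(notification_due("CRITICAL", "CRITICAL", 1000, now=4599))
print(notification_due("CRITICAL", "CRITICAL", 1000, now=4600))
```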

Notification Channels

The system supports centralized notification channel definitions, allowing different hosts to use different notification providers and credentials. This provides fine-grained control over who gets notified about what.

Supported Channel Types

Email (via SMTP)
Pushover (mobile notifications)
Signal (via signal-cli)
Mattermost (team chat webhooks)

Centralized Channel Configuration

Define notification channels once in the configuration file:

notification_channels:
  # Signal notifications
  signal_ops:
    type: signal
    cli_path: /usr/local/bin/signal-cli
    user: +1234567890
    recipient: +1234567890
  
  # Email notifications
  email_ops:
    type: email
    recipients: [ops@example.com, alerts@example.com]
    sender: heartbeat@example.com
    smtp_server: smtp.example.com
    smtp_port: 587
    smtp_user: heartbeat@example.com
    smtp_password: your-smtp-password
  
  # Pushover notifications
  pushover_urgent:
    type: pushover
    token: your-pushover-app-token
    user: your-pushover-user-key
  
  # Mattermost notifications
  mattermost_devops:
    type: mattermost
    host: mattermost.example.com
    token: your-webhook-token
    channel: devops-alerts
    username: heartbeat-bot
    icon: https://example.com/heartbeat-icon.png

# Default channels for hosts that don't specify channels
default_notification_channels: [email_ops]

Per-Host Channel Assignment

Assign notification channels to specific hosts in the hosts section:

hosts:
  # Critical server - multiple notification channels
  prod-web-01:
    threshold_config: high_sensitivity
    watch: true
    notification_channels: [signal_ops, pushover_urgent, email_ops]
    dyndns: false
  
  # Database server - ops team only
  prod-db-01:
    threshold_config: database
    watch: true
    notification_channels: [signal_ops, email_ops]
    dyndns: false
  
  # Development server - email only
  dev-server-01:
    threshold_config: low_sensitivity
    watch: false
    notification_channels: [email_ops]
    dyndns: false
  
  # Uses default_notification_channels if not specified
  test-server-01:
    threshold_config: default
    watch: false
    dyndns: false

Watched Hosts

Only hosts with watch: true in the hosts section will trigger notifications:

hosts:
  webserver01:
    watch: true
    notification_channels: [email_ops]
  
  database01:
    watch: true
    notification_channels: [signal_ops, email_ops]
  
  mailserver:
    watch: true
    notification_channels: [pushover_urgent]

Hosts not marked for watching will still have thresholds checked and alert states tracked, but won't send notifications.
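
Putting the watch flag, per-host channels and default channels together, channel selection for a host might look like the sketch below. `channels_for` is a hypothetical helper and `staging-01` is a made-up host; treating a missing `watch` key as false and skipping channel names that are not defined are assumptions.

```python
def channels_for(hostname, config):
    """Return the channel names to notify for a host, or [] if it is not watched."""
    host = config.get("hosts", {}).get(hostname, {})
    if not host.get("watch", False):
        return []  # thresholds are still checked, but nothing is sent
    names = host.get("notification_channels") or config.get("default_notification_channels", [])
    known = config.get("notification_channels", {})
    # Assumption: names that do not match a defined channel are skipped.
    return [name for name in names if name in known]


config = {
    "notification_channels": {
        "email_ops": {"type": "email"},
        "signal_ops": {"type": "signal"},
    },
    "default_notification_channels": ["email_ops"],
    "hosts": {
        "prod-db-01": {"watch": True, "notification_channels": ["signal_ops", "email_ops"]},
        "dev-server-01": {"watch": False, "notification_channels": ["email_ops"]},
        "staging-01": {"watch": True},  # hypothetical host with no channels of its own
    },
}

for name in config["hosts"]:
    print(name, channels_for(name, config))
```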

Alert State Tracking

Each host maintains alert states for all monitored metrics:

host.alert_states = {
    "cpu_monitor.cpu_percent": AlertState(level=WARNING, since=1234567890),
    "memory_monitor.percent": AlertState(level=CRITICAL, since=1234567800),
    "disk_monitor./.percent": AlertState(level=OK, since=1234567700),
}

Alert states persist in memory and are saved with host data (pickle).

Alert State Information

Each AlertState tracks:

level: Current alert level (OK, WARNING, CRITICAL, UNKNOWN)
since: Timestamp when current state started
last_value: Most recent metric value
last_check: Timestamp of last threshold check
notification_count: Number of notifications sent for this alert
last_notification: Timestamp of last notification
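
As a rough picture of the fields listed above, an alert state could be modelled as a dataclass. This is a sketch, not the class defined in hbd/threshold.py: the numeric enum values, the field defaults and the `update` method are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlertLevel(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


@dataclass
class AlertState:
    level: AlertLevel = AlertLevel.OK
    since: float = 0.0                         # when the current level began
    last_value: Optional[float] = None         # most recent metric value
    last_check: float = 0.0                    # time of the last threshold check
    notification_count: int = 0                # notifications sent for this alert
    last_notification: Optional[float] = None  # time of the last notification

    def update(self, level, value, now=None):
        """Record one check; return True if the level changed."""
        now = time.time() if now is None else now
        self.last_value = value
        self.last_check = now
        if level is self.level:
            return False
        self.level = level
        self.since = now
        self.notification_count = 0  # assumption: the counter restarts with each new level
        return True


state = AlertState()
print(state.update(AlertLevel.WARNING, 85.0, now=100.0), state.level.name, state.since)
```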

Querying Alert States

Via HTTP API (future enhancement):

GET /api/hosts/webserver01/alerts

Response:

{
  "active_alerts": [
    {
      "metric": "cpu_monitor.cpu_percent",
      "level": "WARNING",
      "since": 1234567890,
      "value": 85.0,
      "duration": 300
    }
  ],
  "summary": {
    "ok": 15,
    "warning": 1,
    "critical": 0
  }
}

Testing

A comprehensive test suite is provided in test_threshold.py:

python test_threshold.py

Tests cover:

Threshold configuration and parsing
All comparison operators
Hysteresis functionality
Alert state tracking
State change detection
Notification triggering
Nested metrics (partitions)
Alert summaries

Best Practices

1. Start Conservative

Begin with higher thresholds to avoid alert fatigue:

cpu_monitor:
  cpu_percent:
    warning: 85.0    # Start higher
    critical: 95.0   # Very high for critical

Adjust downward based on observed behavior.

2. Consider Workload Patterns

Different systems have different normal ranges:

Web servers (bursty traffic):

cpu_percent:
  warning: 80.0
  critical: 90.0
  hysteresis: 0.15  # Higher hysteresis for burstiness

Database servers (steady load):

cpu_percent:
  warning: 70.0
  critical: 85.0
  hysteresis: 0.1   # Lower hysteresis for steady metrics

3. Use Appropriate Operators

Match the operator to the metric:

| Metric Type | Example | Operator | Reason |
|---|---|---|---|
| Resource usage | CPU%, Memory% | `>` | Alert when high |
| Available resources | Free memory, Free disk | `<` | Alert when low |
| Error counters | Network errors | `>` | Alert when increasing |
| Health checks | Nagios exit code | `>=` | Map to standard codes |

4. Align with Monitoring Intervals

Ensure threshold checks align with plugin collection intervals:

plugins:
  cpu_monitor:
    interval: 300    # Check every 5 minutes

thresholds:
  cpu_monitor:
    cpu_percent:
      warning: 80.0
      # Will be checked every 5 minutes

5. Test Before Production

1. Start with disabled thresholds:
   ```
   enabled: false
   ```
2. Observe metric ranges over a week
3. Set thresholds based on observed data
4. Enable gradually:
   ```
   enabled: true
   ```
5. Monitor for false positives

6. Document Baseline Values

Keep a record of normal operating ranges:

# Production web server baseline (observed over 30 days):
# CPU: 20-40% normal, 60% peak
# Memory: 60-70% normal, 80% peak
# Disk /: 40-50% usage, growing 2%/month

cpu_monitor:
  cpu_percent:
    warning: 75.0   # Above peak + margin
    critical: 90.0  # Danger zone

7. Layer Alerts

Use WARNING for early notification, CRITICAL for immediate action:

disk_monitor:
  partitions:
    /:
      percent:
        warning: 75.0    # Early warning: "check in next few days"
        critical: 90.0   # Critical: "act now before outage"

Troubleshooting

No Notifications Being Sent

Check if host is watched:
```
hosts:
  your-host-name:
    watch: true
```

Verify notification configuration:

notification_channels:
  email_ops:
    type: email
    recipients: [admin@example.com]
    smtp_server: smtp.example.com

Check threshold configuration:

# Look for parsing errors in server logs
grep "threshold" /var/log/heartbeat/hbd.log

Verify metric names:
- Metric names must match exactly (case-sensitive)
- Check journal or logs for actual metric names

Too Many Alerts (Flapping)

Increase hysteresis:

hysteresis: 0.2  # Increase from 0.1 to 0.2 (20%)

Adjust thresholds:
```
warning: 85.0  # Increase from 80.0
```

Increase renotification interval:

threshold_renotify_interval: 7200  # 2 hours instead of 1

Alerts Not Triggering

Check threshold operator:

# For available memory (alert when LOW):
operator: "<"   # NOT ">"

Verify numeric values:
- Ensure metric values are numeric
- Check for unit mismatches (MB vs GB)
Check if threshold is enabled:
```
enabled: true  # NOT false
```
Review hysteresis settings:
- Very high hysteresis may prevent state changes
- Try reducing or disabling temporarily

Alert State Not Recovering

Check recovery threshold calculation:

Threshold: 90
Hysteresis: 0.1
Recovery: 90 - (90 * 0.1) = 81

Value must drop below 81 to recover

Temporarily disable hysteresis:
```
hysteresis: 0.0
```

Monitor actual metric values:

# Check journal for actual values
grep "cpu_percent" /var/log/heartbeat/messages.journal | tail -20

Advanced Topics

Custom Notification Callbacks

The ThresholdChecker supports custom notification functions:

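import logging

from hbd.threshold import ThresholdChecker

logger = logging.getLogger(__name__)


def custom_notifier(message):
    # pagerduty and metrics stand in for your own client objects;
    # they are placeholders, not part of the heartbeat package.

    # Send to incident management system
    pagerduty.trigger(message)

    # Log to custom system
    logger.critical(message)

    # Update dashboard
    metrics.alert_count.inc()


# config is the parsed YAML configuration
checker = ThresholdChecker(
    config=config,
    notification_callback=custom_notifier
)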

Programmatic Access

Query alert states programmatically:

import time

# Get all active alerts for a host
active = threshold_checker.get_active_alerts(host.alert_states)

for alert in active:
    elapsed = time.time() - alert.since
    print(f"{alert.metric_path}: {alert.level.name} for {elapsed:.0f}s")

# Get alert summary
summary = threshold_checker.get_alert_summary(host.alert_states)
print(f"WARNING: {summary['warning']}, CRITICAL: {summary['critical']}")

Integration with External Systems

Through a custom notification callback, threshold violations can be forwarded to external systems such as:

PagerDuty: Incident creation and escalation
OpsGenie: On-call scheduling and routing
ServiceNow: Ticket creation
Grafana: Dashboard annotations
Elasticsearch: Alert indexing and analysis

Future Enhancements

Planned features:

Composite thresholds: Alert based on multiple metrics

composite:
  high_load_with_low_memory:
    conditions:
      - cpu_monitor.load_1min > 8.0
      - memory_monitor.available_mb < 500

Time-based thresholds: Different thresholds by time of day

schedule:
  business_hours:
    warning: 70.0
  off_hours:
    warning: 85.0

Rate-of-change thresholds: Alert on rapid changes

rate_of_change:
  metric: cpu_percent
  period: 300
  threshold: 30.0  # Alert if changes >30% in 5 minutes

Alert grouping: Combine related alerts

groups:
  disk_critical:
    metrics:
      - disk_monitor./.percent
      - disk_monitor./var.percent
    action: single_notification

Maintenance windows: Suppress alerts during planned maintenance

maintenance:
  - host: webserver01
    start: 2024-01-15T02:00:00Z
    end: 2024-01-15T04:00:00Z

Multi-Threshold Configuration

Support for multiple named threshold configurations with per-host mapping and composable layering.

Overview

The multi-threshold feature allows you to:

Define multiple named threshold configurations
Assign one or more configurations to each host
Compose configurations by layering — each named config's overrides are applied in order on top of the defaults
Use different sensitivity levels for different environments

Configuration Structure

Named configurations are defined under threshold_configs. Each host selects which ones to use via threshold_config in the hosts section (a string for a single config, or a list to layer multiple):

# Optional: set the default configuration name (defaults to "default")
default_threshold_config: "default"

threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

  high_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 60.0
          critical: 75.0

  low_sensitivity:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0
          critical: 95.0

hosts:
  prod-web-01:
    threshold_config: high_sensitivity   # single config

  dev-server-01:
    threshold_config: low_sensitivity

  # Hosts with no threshold_config use default_threshold_config

Composable Configurations (list form)

threshold_config can be a list. Configs are applied left to right: the defaults are the base, then each named config's overrides are layered on top. Later entries in the list win on any metric they define.

threshold_configs:
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}
      disk_monitor:
        partitions:
          /:
            percent: {warning: 80, critical: 90}

  # Tighter CPU limits for busy servers
  high_cpu_load:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

  # Tighter disk limits for data-heavy servers
  busy_disk:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent: {warning: 70, critical: 85}

hosts:
  # Gets default thresholds only
  web-01:
    threshold_config: default

  # Gets tighter CPU limits, default memory and disk
  build-server:
    threshold_config: high_cpu_load

  # Layers both: tighter CPU AND tighter disk, default memory
  db-01:
    threshold_config: [high_cpu_load, busy_disk]

  # Three layers: busy_disk overrides high_cpu_load if they conflict
  storage-01:
    threshold_config: [default, high_cpu_load, busy_disk]

How layering works:

Starting from the default thresholds:

| Layer | Applied config | Effect |
|---|---|---|
| Base | `default` | all default thresholds |
| +1 | `high_cpu_load` | cpu_percent overridden to 60/75 |
| +2 | `busy_disk` | disk percent overridden to 70/85; cpu_percent stays at 60/75 |

Each named config only overrides the metrics it explicitly defines. Metrics not mentioned in a config inherit from the layers beneath.
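
The layering rule can be expressed as a recursive merge. The sketch below is one possible reading of it, not the project's implementation: `is_metric`, `layer` and `resolve` are hypothetical names, and it assumes that an override replaces a metric definition as a whole rather than merging individual fields such as `hysteresis`.

```python
import copy


def is_metric(node):
    """A mapping that sets warning or critical is treated as one metric definition."""
    return isinstance(node, dict) and ("warning" in node or "critical" in node)


def layer(base, override):
    """Return a copy of base with override applied on top."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and not is_metric(value) and isinstance(merged.get(key), dict):
            merged[key] = layer(merged[key], value)  # descend into plugin or partition
        else:
            merged[key] = copy.deepcopy(value)  # replace the metric definition as a whole
    return merged


def resolve(names, threshold_configs, default_name="default"):
    """Layer the named configs, left to right, on top of the default config."""
    result = copy.deepcopy(threshold_configs[default_name]["thresholds"])
    for name in names:
        result = layer(result, threshold_configs[name]["thresholds"])
    return result


threshold_configs = {
    "default": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}},
        "memory_monitor": {"memory_percent": {"warning": 85, "critical": 95}},
        "disk_monitor": {"partitions": {"/": {"percent": {"warning": 80, "critical": 90}}}},
    }},
    "high_cpu_load": {"thresholds": {
        "cpu_monitor": {"cpu_percent": {"warning": 60, "critical": 75}},
    }},
    "busy_disk": {"thresholds": {
        "disk_monitor": {"partitions": {"/": {"percent": {"warning": 70, "critical": 85}}}},
    }},
}

db_01 = resolve(["high_cpu_load", "busy_disk"], threshold_configs)
print(db_01["cpu_monitor"]["cpu_percent"])
print(db_01["memory_monitor"]["memory_percent"])
print(db_01["disk_monitor"]["partitions"]["/"]["percent"])
```

For `db-01` this yields CPU limits of 60/75, the default memory limits of 85/95, and disk limits of 70/85. The stored configs are deep-copied, so resolving one host never changes the thresholds seen by another.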

Use Cases

1. Environment-Based Thresholds

Different thresholds for production vs. development:

threshold_configs:
  production:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0   # Alert earlier in production
          critical: 85.0

  development:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 90.0   # More relaxed for dev
          critical: 98.0

hosts:
  prod-web-01:
    threshold_config: production
  prod-web-02:
    threshold_config: production
  dev-web-01:
    threshold_config: development
  dev-web-02:
    threshold_config: development

2. Server Role-Based Thresholds

Different thresholds based on server function:

threshold_configs:
  webserver:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 80.0
          critical: 90.0

  database:
    thresholds:
      cpu_monitor:
        cpu_percent:
          warning: 70.0
          critical: 85.0
      memory_monitor:
        memory_percent:
          warning: 90.0   # Databases can use high memory
          critical: 97.0
      disk_monitor:
        partitions:
          /var/lib/mysql:
            percent:
              warning: 75.0
              critical: 85.0

  cache:
    thresholds:
      memory_monitor:
        memory_percent:
          warning: 95.0   # Redis/Memcached can use very high memory
          critical: 99.0

hosts:
  web-01:
    threshold_config: webserver
  web-02:
    threshold_config: webserver
  db-01:
    threshold_config: database
  db-02:
    threshold_config: database
  redis-01:
    threshold_config: cache
  memcached-01:
    threshold_config: cache

3. Sensitivity Levels

Different sensitivity for critical vs. non-critical systems:

threshold_configs:
  critical:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 70.0
              critical: 80.0
              hysteresis: 0.15

  standard:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 85.0
              critical: 95.0
              hysteresis: 0.1

  relaxed:
    thresholds:
      disk_monitor:
        partitions:
          /:
            percent:
              warning: 90.0
              critical: 98.0
              hysteresis: 0.05

hosts:
  payment-gateway:
    threshold_config: critical
  auth-server:
    threshold_config: critical
  web-01:
    threshold_config: standard
  web-02:
    threshold_config: standard
  test-server:
    threshold_config: relaxed

4. Composable Profiles

Build host-specific thresholds by combining small, focused configs:

threshold_configs:
  # Baseline — everything at default levels
  default:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 80, critical: 90}
      memory_monitor:
        memory_percent: {warning: 85, critical: 95}

  # Overlay: tighter CPU only
  tight_cpu:
    thresholds:
      cpu_monitor:
        cpu_percent: {warning: 60, critical: 75}

  # Overlay: tighter memory only
  tight_memory:
    thresholds:
      memory_monitor:
        memory_percent: {warning: 70, critical: 85}

  # Overlay: extra disk partition for database servers
  db_disk:
    thresholds:
      disk_monitor:
        partitions:
          /var/lib/postgresql:
            percent: {warning: 75, critical: 88}

hosts:
  # Plain web server
  web-01:
    threshold_config: default

  # Build server: tight CPU, default memory and disk
  build-01:
    threshold_config: tight_cpu

  # Database: tight CPU + tight memory + extra disk partition
  db-01:
    threshold_config: [tight_cpu, tight_memory, db_disk]

  # Replica database: tight memory + extra disk, normal CPU
  db-02:
    threshold_config: [tight_memory, db_disk]

Configuration Priority

1. Host threshold_config (list): Layer each named config's overrides left to right on top of the defaults
2. Host threshold_config (string): Use that single named config directly
3. host_threshold_mapping (legacy): Same as a string threshold_config; lists are not supported
4. default_threshold_config: Used for hosts with no mapping
5. First alphabetically: If the default config is not found, use the first config alphabetically
6. Legacy thresholds section: Used when threshold_configs is absent entirely
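
The lookup order can be read as a chain of fallbacks. The following sketch only decides which config names apply to a host; it does not merge them. `config_names_for` is a hypothetical helper, and the host names `legacy-01` and `test-01` in the example are made up.

```python
def config_names_for(hostname, config):
    """Return the named configs that apply to a host, in layering order.

    Returns None when threshold_configs is absent and the flat legacy
    thresholds section should be used instead.
    """
    named = config.get("threshold_configs")
    if not named:
        return None

    choice = config.get("hosts", {}).get(hostname, {}).get("threshold_config")
    if choice is None:
        choice = config.get("host_threshold_mapping", {}).get(hostname)  # legacy key
    if isinstance(choice, str):
        return [choice]
    if choice:
        return list(choice)

    default = config.get("default_threshold_config", "default")
    if default in named:
        return [default]
    return [sorted(named)[0]]  # fall back to the first name alphabetically


config = {
    "threshold_configs": {"default": {}, "high_sensitivity": {}, "low_sensitivity": {}},
    "host_threshold_mapping": {"legacy-01": "low_sensitivity"},
    "hosts": {
        "prod-web-01": {"threshold_config": "high_sensitivity"},
        "db-01": {"threshold_config": ["high_sensitivity", "low_sensitivity"]},
        "test-01": {},
    },
}

for host in ("prod-web-01", "db-01", "legacy-01", "test-01"):
    print(host, config_names_for(host, config))
```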

Backward Compatibility

The legacy host_threshold_mapping top-level key and the flat thresholds section are still fully supported:

# Still works — equivalent to hosts: {prod-web-01: {threshold_config: high_sensitivity}}
host_threshold_mapping:
  prod-web-01: high_sensitivity

# Still works — equivalent to threshold_configs: {default: {thresholds: ...}}
thresholds:
  cpu_monitor:
    cpu_percent: {warning: 80, critical: 90}
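
The two equivalences noted in the comments above can be made concrete by rewriting a legacy configuration into the newer shape. This is a sketch under assumptions: `normalize` is a hypothetical function, and letting an existing `hosts` entry take precedence over `host_threshold_mapping` follows the order given under Configuration Priority.

```python
def normalize(config):
    """Return a copy of config expressed with threshold_configs and hosts only."""
    result = dict(config)  # shallow copy: the caller's top-level dict is left as-is
    if "threshold_configs" not in result and "thresholds" in result:
        result["threshold_configs"] = {"default": {"thresholds": result.pop("thresholds")}}
    hosts = {name: dict(settings) for name, settings in result.get("hosts", {}).items()}
    for name, config_name in result.pop("host_threshold_mapping", {}).items():
        # An explicit threshold_config under hosts wins over the legacy mapping.
        hosts.setdefault(name, {}).setdefault("threshold_config", config_name)
    result["hosts"] = hosts
    return result


legacy = {
    "host_threshold_mapping": {"prod-web-01": "high_sensitivity"},
    "thresholds": {"cpu_monitor": {"cpu_percent": {"warning": 80, "critical": 90}}},
}

converted = normalize(legacy)
print(converted["hosts"])
print(converted["threshold_configs"])
```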
