Dealing with False Ping Alerts in LogicMonitor: Building a Fallback Ping Monitoring Script for Production VMs

Problem Statement

On Friday 6th September 2024 at 21:31, we received an alert from LogicMonitor indicating that one of our production web app servers (Tomcat#3) was down, with the message: "The host Tomcat#3 (i-xxxxxxx) is down". Shortly afterwards, however, we connected to the VM over SSH and confirmed that the server was fully operational. So what went wrong?

Root Cause Investigation and Resolution

Immediately after the alert, we raised a ticket with our IT service provider, which manages our cloud production workloads. Their investigation produced the following findings:

  1. Ping Data: Ping data from the server stopped reporting at 16:09 on the same day.
  2. Other Metrics: All other performance metrics from the server were reported normally, indicating that the server was functioning properly aside from the ping data issue.
  3. Environment Consistency: Other servers in the same environment (e.g., Tomcat#1) continued to report ping data normally, indicating that the issue was isolated to Tomcat#3.

Upon further analysis by our service provider, it was determined that the ping data had stopped reporting at 16:09, resulting in a delayed false alert at 21:31. The root cause of the issue was traced to a malfunction in the service provider's monitoring tool, which failed to capture ping data properly. Once the issue with the monitoring tool's collector was resolved on Monday, September 9th, ping monitoring resumed, and the false alert was cleared.

Graph Explanation

The graph below shows that the ping metrics, including round-trip time and packet counts, were being reported normally up until 16:09 on September 6th. After that, no new ping data was collected, confirming the issue was related to the monitoring service rather than the server itself.

[Figure: ping-data-loss-webserver#03]

In-House Alternative Solution: Fallback Ping Monitoring Script

To reduce reliance on third-party monitoring services, I created a fallback ping monitoring script. Running on a separate VM, it pings target servers and sends email alerts if they don’t respond after several attempts. It’s a simple fallback solution for basic connectivity checks.

Environment setup:

This fallback ping monitoring solution was implemented and tested in an AWS environment. The setup included 3 EC2 instances:

  • 1 EC2 instance acting as the ping-monitoring server (our VM, not a service provider’s monitoring system).
  • 2 EC2 instances acting as the target hosts.

Implementation steps

Steps 1 through 3 are the prerequisites before running the ping-failure-alert.sh script on our dedicated VM, which is acting as the monitoring server (not the service provider's monitoring system).

Step 1 - Allow ICMP Through the Firewall

  • Monitoring Server (our VM): Allow outbound ICMP traffic (ping) for all destinations.
  • Target Servers: Allow inbound ICMP from the monitoring server’s IP.
  • Test Connectivity: Run ping <target_host_public_IP> from the monitoring server to ensure reachability.
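
The reachability test can be run against every target in one go. The following is a minimal sketch, not part of the original setup: check_targets and the PING_CMD override are hypothetical names introduced here, and the -W 2 reply timeout assumes the Linux iputils version of ping.

```shell
#!/bin/bash
# Sketch only: check_targets and PING_CMD are illustrative names, not part of
# the original script. PING_CMD lets the ping binary be swapped out for a stub.
PING_CMD=${PING_CMD:-ping}

# Send one ICMP echo request to each address given as an argument and print
# OK or FAIL per target. Returns non-zero if any target did not reply.
check_targets() {
    local ip status=0
    for ip in "$@"; do
        # -c 1: single echo request; -W 2: wait at most 2 seconds (iputils ping)
        if "$PING_CMD" -c 1 -W 2 "$ip" > /dev/null 2>&1; then
            echo "OK   $ip"
        else
            echo "FAIL $ip"
            status=1
        fi
    done
    return "$status"
}
```

After loading the function into a shell on the monitoring server, call it with the target addresses as arguments. A FAIL line at this stage usually points to a firewall or security group rule rather than to the host itself.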

Step 2 - Install the mailx Package to Enable Email Notifications

mailx is a command-line email client used in Unix-like systems to send and receive emails directly from the terminal or within scripts.

  • For Debian/Ubuntu: sudo apt-get install mailx (if apt reports that mailx is a virtual package, install one of the providers it lists)
  • For RHEL/CentOS: sudo yum install mailx
  • For Fedora: sudo dnf install mailx

Step 3 - Gmail Configuration for SMTP Authentication

To ensure that the script can send email notifications using Gmail’s SMTP service, you need to configure Gmail accordingly. Follow these steps:

  1. Turn on 2-Step Verification for the Google account; App Passwords are only available once it is enabled.
  2. Create an App Password under Google Account > Security > App Passwords.
  3. Update /etc/mail.rc on the Monitoring server with the following configuration:
set smtp=smtps://smtp.gmail.com:465
set smtp-auth=login
# provide the main Gmail address
set smtp-auth-user=<your_email_id>@gmail.com
# do not leave any spaces between characters
set smtp-auth-password=<your_generated_app_password>
# note: this setting disables TLS certificate verification
set ssl-verify=ignore
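
Once /etc/mail.rc is in place, it is worth sending a test message before relying on the configuration for alerts. This is a minimal sketch, assuming the mailx installed in Step 2 reads /etc/mail.rc; send_test_mail and the MAIL_CMD override are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Sketch only: send_test_mail and MAIL_CMD are illustrative names.
# MAIL_CMD defaults to mailx and can be replaced by a stub for a dry run.
MAIL_CMD=${MAIL_CMD:-mailx}

# Send a short test message to the address passed as the first argument.
# The return status is that of the mail command.
send_test_mail() {
    local recipient=$1
    if [ -z "$recipient" ]; then
        echo "usage: send_test_mail <recipient>" >&2
        return 2
    fi
    echo "Test message from $(uname -n): mail configuration check." \
        | "$MAIL_CMD" -s "[TEST] - Ping monitoring mail check" "$recipient"
}
```

If the message does not arrive, the usual suspects are a mistyped app password or blocked outbound traffic to smtp.gmail.com on port 465.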

Step 4 - Ping Monitoring Script

The bash script below pings the servers and sends alerts via email if they are unreachable.

  • Create the script file and open it for editing (sudo nano ping-failure-alert.sh).
  • Ensure the script is executable (sudo chmod +x ping-failure-alert.sh).
  • Restrict access to root only for security (sudo chown root:root ping-failure-alert.sh and sudo chmod 700 ping-failure-alert.sh).
#!/bin/bash
#################################################################################
# Script Name: ping-failure-alert.sh
# Description: This script pings a predefined list of server IP addresses to check
#              their network connectivity. If any servers fail to respond after
#              a specified number of attempts and interval, an email notification
#              is sent.
# Author: Mickael Asghar
# Created on: 07/06/2024
# Updated on: 07/06/2024
#################################################################################

# Email Configuration - Define email addresses here
recipient_email="ping-monitoring@abc.com"  # Define the primary recipient
cc_recipients=("contact1@abc.com" "contact2@abc.com")  # Define CC recipients

# Convert CC recipients array to a comma-separated string
cc_list=$(IFS=','; echo "${cc_recipients[*]}")

# Associative array of server IP addresses and their hostnames (requires bash 4+)
declare -A ping_targets=(
    ["10.15.30.40"]="target_host_1"    # adjust IP address and hostname
    ["50.60.70.80"]="target_host_2"    # adjust IP address and hostname
    ["90.95.100.110"]="target_host_3"  # adjust IP address and hostname
)

# Retry settings
retry_count=3       # Number of ping attempts per host
retry_interval=30   # Interval in seconds between attempts

# Initialize a variable to store failed hosts
failed_hosts=""

# Function to ping a host: sends one echo request and returns ping's exit status
ping_host() {
    local ip=$1
    ping -c 1 "$ip" > /dev/null 2>&1
}

# Loop through each target and attempt to ping
for ip in "${!ping_targets[@]}"  # Loop over keys of the associative array
do
    hostname=${ping_targets[$ip]}  # Assign hostname from the associative array
    success=false

    for attempt in $(seq 1 "$retry_count")
    do
        echo "Pinging $hostname ($ip) (Attempt $attempt of $retry_count)..."
        if ping_host "$ip"; then
            echo "$hostname ($ip) is reachable."
            success=true
            break
        else
            echo "$hostname ($ip) is not reachable. Waiting $retry_interval seconds before retrying..."
            sleep "$retry_interval"
        fi
    done

    if [ "$success" = false ]; then
        current_datetime=$(date "+%d/%m/%Y %H:%M:%S")  # Get the current date and time
        echo "$hostname ($ip) failed to respond after $retry_count attempts."
        failed_hosts+="Date: $current_datetime - $hostname ($ip).\n"  # Append formatted string
    fi
done

# Check if any host failed to respond and send an email if so
if [ -n "$failed_hosts" ]; then
    message="The following hosts failed to respond after $retry_count attempts with a $retry_interval second interval between attempts:\n$failed_hosts"
    # echo -e expands the \n sequences collected above into real line breaks
    echo -e "$message" | mailx -s "[ALERT] - Ping Failure Notification" -c "$cc_list" "$recipient_email"
else
    echo "All hosts responded successfully within $retry_count attempts."
fi

Key Settings:

  • Retry Attempts (retry_count): The script will try to ping each server 3 times before declaring it unreachable.
  • Interval Between Retries (retry_interval): There is a 30-second interval between retries. This ensures that short downtimes (e.g., server reboot) will not trigger immediate false alerts.
  • Adjust the following values to meet your requirements: recipient_email, cc_recipients, and the IP addresses and hostnames in ping_targets.
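
Because hosts are checked one after another, these settings also determine how long a single run can take when several hosts are down. The rough calculation below assumes each unanswered ping takes about ping_timeout seconds to give up. Both worst_case_seconds and ping_timeout are hypothetical names, and the script above does not set a timeout itself.

```shell
#!/bin/bash
# Sketch only: worst_case_seconds is an illustrative helper, and ping_timeout
# is an assumed per-attempt timeout that the original script does not define.

# Worst case for one run: every host is down, so each host costs
# retry_count * (ping_timeout + retry_interval) seconds, checked sequentially.
worst_case_seconds() {
    local hosts=$1 retry_count=$2 retry_interval=$3 ping_timeout=$4
    echo $(( hosts * retry_count * (ping_timeout + retry_interval) ))
}

# Three hosts, 3 attempts, 30 s interval, assumed 10 s timeout per ping
worst_case_seconds 3 3 30 10   # prints 360
```

Under these assumptions a run with all three example hosts down takes about six minutes, which is worth keeping in mind when choosing how often to schedule the script.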

Step 5 - Set up the Script as a Cron Job

To automate the script, run it as a cron job every 5 minutes:

  1. Open Crontab: sudo crontab -e
  2. Add the Cron Job:
*/5 * * * * /path/to/ping-failure-alert.sh
  3. Replace /path/to/ping-failure-alert.sh with the actual path to your script.
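
A run that has to retry several unreachable hosts may still be in progress when cron starts the next one. One way to guard against overlapping runs is a lock taken at the top of the script. The sketch below uses an atomic mkdir as the lock, which is a common shell idiom rather than anything from the original script; the lock path and function names are assumptions.

```shell
#!/bin/bash
# Sketch only: LOCK_DIR, acquire_lock and release_lock are illustrative names.
# mkdir is atomic, so only one process can create the directory and hold the lock.
LOCK_DIR=${LOCK_DIR:-/var/lock/ping-failure-alert.lock}

# Succeeds only if the lock directory did not exist yet.
acquire_lock() {
    mkdir "$LOCK_DIR" 2> /dev/null
}

# Removes the lock directory so the next run can proceed.
release_lock() {
    rmdir "$LOCK_DIR" 2> /dev/null
}

# Intended use near the top of ping-failure-alert.sh:
#   acquire_lock || { echo "Previous run still active, skipping."; exit 0; }
#   trap release_lock EXIT   # remove the lock when the script exits
```

If the script is killed with SIGKILL, the EXIT trap does not run and the lock directory has to be removed by hand before the next run can start.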

Avoiding Unnecessary Alerts

With 3 attempts and a 30-second wait after each failed attempt, the script spends about 90 seconds on a host (plus the time each ping takes to time out) before declaring it unreachable. This reduces unnecessary alerts for brief downtimes, like reboots.
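
Since the script runs on a schedule, a host that stays down produces a new email on every run. One possible refinement, sketched here under assumptions and not part of the original script, is to remember which hosts have already been reported and alert only once per outage. STATE_DIR, should_alert and mark_recovered are hypothetical names.

```shell
#!/bin/bash
# Sketch only: STATE_DIR, should_alert and mark_recovered are illustrative names.
# One marker file per host records that an outage has already been reported.
STATE_DIR=${STATE_DIR:-/var/tmp/ping-failure-alert}

# Return 0 the first time a host is seen as down, non-zero on later runs
# while its marker file still exists.
should_alert() {
    local ip=$1
    mkdir -p "$STATE_DIR"
    if [ -e "$STATE_DIR/$ip.down" ]; then
        return 1
    fi
    : > "$STATE_DIR/$ip.down"
    return 0
}

# Clear the marker once the host answers again, so a new outage alerts again.
mark_recovered() {
    local ip=$1
    rm -f "$STATE_DIR/$ip.down"
}
```

In the main loop, the failure branch would append to failed_hosts only when should_alert "$ip" succeeds, and the success branch would call mark_recovered "$ip".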

Conclusion

The alert turned out to be a false positive caused by the monitoring collector: the server was never down, it had only stopped reporting ping data. A fallback ping monitoring script provides an independent connectivity check, so a fault in an external monitoring service is easier to tell apart from a real outage. This backup system is lightweight, customisable, and independent of the main monitoring service.