Dealing with False Ping Alerts in LogicMonitor: Building a Fallback Ping Monitoring Script for Production VMs
Problem Statement
On Friday 6th September 2024 at 21:31, we received an alert from LogicMonitor indicating that one of our production web app servers (Tomcat#3) was down, with the message: "The host Tomcat#3 (i-xxxxxxx) is down". However, shortly after receiving the alert, we connected to the VM over SSH and confirmed that the server was fully operational. So what went wrong?
Root Cause Investigation and Resolution
A ticket was promptly raised with our IT service provider, responsible for managing our Cloud Production workloads, immediately following the alert. Their investigation provided the following findings:
1. Ping Data: Ping data from the server stopped reporting at 16:09 on the same day.
2. Other Metrics: All other performance metrics from the server were reported normally, indicating that the server was functioning properly aside from the ping data issue.
3. Environment Consistency: Other servers in the same environment (e.g., Tomcat#1) continued to report ping data normally, indicating that the issue was isolated to Tomcat#3.
Our service provider concluded that the loss of ping data at 16:09 led to the delayed false alert at 21:31. The root cause was traced to a malfunction in the collector of the service provider's monitoring tool, which failed to capture ping data properly. Once the collector issue was resolved on Monday 9th September 2024, ping monitoring resumed and the false alert was cleared.
Graph Explanation
The graph below shows that the ping metrics, including round-trip time and packet counts, were being reported normally up until 16:09 on September 6th. After that, no new ping data was collected, confirming the issue was related to the monitoring service rather than the server itself.
In-House Alternative Solution: Fallback Ping Monitoring Script
To reduce reliance on third-party monitoring services, I created a fallback ping monitoring script. Running on a separate VM, it pings target servers and sends email alerts if they don’t respond after several attempts. It’s a simple fallback solution for basic connectivity checks.
Environment setup:
This fallback ping monitoring solution was implemented and tested in an AWS environment. The setup included 3 EC2 instances:
- 1 EC2 instance acting as the ping-monitoring server (our VM, not a service provider’s monitoring system).
- 2 EC2 instances as the target hosts.
Implementation steps
Steps 1 through 3 are the prerequisites before running the ping-failure-alert.sh script on our dedicated VM, which is acting as the monitoring server (not the service provider's monitoring system).
Step 1 - Allow ICMP Through the Firewall
- Monitoring Server (our VM): Allow outbound ICMP traffic (ping) for all destinations.
- Target Servers: Allow inbound ICMP from the monitoring server’s IP.
- Test Connectivity: Run ping <target_host_public_IP> from the monitoring server to ensure reachability.
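Since the environment is AWS, the inbound ICMP rule on the target servers would typically be a security group rule. The sketch below only prints the AWS CLI command for review rather than running it. The security group ID and monitoring server IP are hypothetical placeholders, and it assumes the AWS CLI is installed and configured; `--port -1` is intended to cover all ICMP types and codes.

```shell
#!/bin/bash
# Print (not run) the AWS CLI call that would allow inbound ICMP
# from a single monitoring server IP into a target security group.
icmp_ingress_cmd() {
    local sg_id=$1 monitor_ip=$2
    echo "aws ec2 authorize-security-group-ingress --group-id $sg_id --protocol icmp --port -1 --cidr $monitor_ip/32"
}

# Hypothetical values: replace with your own security group and IP
icmp_ingress_cmd "sg-0123456789abcdef0" "203.0.113.10"
```

Review the printed command, then run it from a shell with suitable IAM permissions. Restricting the source to a /32 keeps the targets from answering pings from anywhere else.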
Step 2 - Install mailx package to enable email notification
mailx is a command-line email client used in Unix-like systems to send and receive emails directly from the terminal or within scripts.
- For Debian/Ubuntu:
sudo apt-get install mailx
- For RHEL/CentOS:
sudo yum install mailx
- For Fedora:
sudo dnf install mailx
Step 3 - Gmail Configuration for SMTP Authentication
To ensure that the script can send email notifications using Gmail’s SMTP service, you need to configure Gmail accordingly. Follow these steps:
- Make sure Two-Step Verification is enabled on the Google account; App Passwords are only available when it is on.
- Create an App Password under Google Account > Security > App Passwords.
- Update /etc/mail.rc on the monitoring server with the following configuration:
set smtp=smtps://smtp.gmail.com:465
set smtp-auth=login
# Provide the main Gmail address
set smtp-auth-user=<your_email_id>@gmail.com
# Do not leave any spaces between characters
set smtp-auth-password=<your_generated_app_password>
set ssl-verify=ignore
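Before relying on email alerts, it can help to confirm that the configuration file actually contains the SMTP settings listed above. The following is a minimal sketch; `check_mailrc` is a hypothetical helper name, and it only checks that each setting is present, not that the credentials are valid.

```shell
#!/bin/bash
# Report which of the SMTP settings from Step 3 are missing from a mailx rc file.
# Returns 0 if all are present, 1 if any is missing, 2 if the file is unreadable.
check_mailrc() {
    local rc_file=$1 key missing=0
    if [ ! -r "$rc_file" ]; then
        echo "cannot read $rc_file"
        return 2
    fi
    for key in smtp smtp-auth smtp-auth-user smtp-auth-password; do
        # Match lines of the form "set <key>=..."
        if ! grep -Eq "^[[:space:]]*set[[:space:]]+${key}=" "$rc_file"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (run manually on the monitoring server):
#   check_mailrc /etc/mail.rc && echo "mail.rc looks complete"
#   echo "Test body" | mailx -s "mailx SMTP test" <your_email_id>@gmail.com
```

Sending one test message by hand, as in the last comment, is the quickest way to confirm that the App Password is accepted before the script depends on it.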
Step 4 - Ping Monitoring Script
The bash script below pings the servers and sends alerts via email if they are unreachable.
- Create the script file and open it for editing (sudo nano ping-failure-alert.sh).
- Ensure the script is executable (sudo chmod +x ping-failure-alert.sh).
- Restrict access to root only for security (sudo chown root:root ping-failure-alert.sh and sudo chmod 700 ping-failure-alert.sh).
#!/bin/bash
#################################################################################
# Script Name: ping-failure-alert.sh
# Description: This script pings a predefined list of server IP addresses to check
#              their network connectivity. If any servers fail to respond after
#              a specified number of attempts and interval, an email notification
#              is sent.
# Author: Mickael Asghar
# Created on: 07/06/2024
# Updated on: 07/06/2024
#################################################################################

# Email Configuration - Define email addresses here
recipient_email="ping-monitoring@abc.com"             # Define the primary recipient
cc_recipients=("contact1@abc.com" "contact2@abc.com") # Define CC recipients

# Convert CC recipients array to a comma-separated string
cc_list=$(IFS=','; echo "${cc_recipients[*]}")

# Associative array of server IP addresses and their hostnames
declare -A ping_targets=(
    ["10.15.30.40"]="target_host_1"   # adjust IP address and hostname
    ["50.60.70.80"]="target_host_2"   # adjust IP address and hostname
    ["90.95.100.110"]="target_host_3" # adjust IP address and hostname
)

# Retry settings
retry_count=3     # Number of retry attempts
retry_interval=30 # Interval in seconds between retries

# Initialize a variable to store failed hosts
failed_hosts=""

# Function to ping a host; returns the exit status of ping
ping_host() {
    local ip=$1
    ping -c 1 "$ip" > /dev/null 2>&1
}

# Loop through each target and attempt to ping
for ip in "${!ping_targets[@]}" # Loop over keys of the associative array
do
    hostname=${ping_targets[$ip]} # Assign hostname from the associative array
    success=false

    for attempt in $(seq 1 "$retry_count")
    do
        echo "Pinging $hostname ($ip) (Attempt $attempt of $retry_count)..."
        if ping_host "$ip"; then
            echo "$hostname ($ip) is reachable."
            success=true
            break
        else
            echo "$hostname ($ip) is not reachable. Waiting $retry_interval seconds before retrying..."
            sleep "$retry_interval"
        fi
    done

    if [ "$success" = false ]; then
        current_datetime=$(date "+%d/%m/%Y %H:%M:%S") # Get the current date and time
        echo "$hostname ($ip) failed to respond after $retry_count attempts."
        failed_hosts+="Date: $current_datetime - $hostname ($ip).\n" # Append formatted string
    fi
done

# Check if any host failed to respond and send an email if so
if [ -n "$failed_hosts" ]; then
    message="The following hosts failed to respond after $retry_count attempts with a $retry_interval second interval between attempts:\n$failed_hosts"
    echo -e "$message" | mailx -s "[ALERT] - Ping Failure Notification" -c "$cc_list" "$recipient_email"
else
    echo "All hosts responded successfully after $retry_count attempts."
fi
Key Settings:
- Retry Attempts (retry_count): The script will try to ping each server 3 times before declaring it unreachable.
- Interval Between Retries (retry_interval): There is a 30-second interval between retries. This ensures that short downtimes (e.g., server reboot) will not trigger immediate false alerts.
- Adjust values accordingly to meet your requirements: recipient_email, cc_recipients, target_host_IP, target_hostname
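The retry behaviour described by retry_count and retry_interval can also be isolated into a small reusable function, which makes it easy to test without touching the network. This is a sketch rather than the script's actual implementation: `retry` is a hypothetical helper, and unlike the script above it skips the sleep after the final failed attempt.

```shell
#!/bin/bash
# Run a command up to max_attempts times, sleeping between failed attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    local max_attempts=$1 interval=$2
    shift 2
    local attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        if "$@"; then
            return 0
        fi
        # No point waiting once the last attempt has failed
        if ((attempt < max_attempts)); then
            sleep "$interval"
        fi
    done
    return 1
}

# Usage with the settings from the script (3 attempts, 30 seconds apart):
#   retry 3 30 ping -c 1 10.15.30.40 > /dev/null 2>&1 || echo "unreachable"
```

Because the command is passed as arguments, the same function can be exercised with `true` and `false` and an interval of 0 to check the logic in a few milliseconds.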
Step 5 - Set up the Script as a Cron Job
To automate the script, run it as a cron job every 5 minutes:
- Open Crontab: sudo crontab -e
- Add the Cron Job: */5 * * * * /path/to/ping-failure-alert.sh
- Replace /path/to/ping-failure-alert.sh with the actual path to your script.
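Cron starts a new run every 5 minutes whether or not the previous one has finished. Each unreachable host adds roughly 90 seconds of waiting, so a run with several unreachable targets can outlast the schedule and overlap with the next one. One possible guard is a simple lock, sketched below using an atomic `mkdir`. `run_exclusive` and the lock path are hypothetical names. A run that is killed mid-way leaves the lock directory behind, so where `flock` from util-linux is available it may be the more robust choice.

```shell
#!/bin/bash
# Run a command only if no other run holds the lock directory.
# mkdir is atomic: it fails if the directory already exists.
run_exclusive() {
    local lock_dir=$1
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still active, skipping"
        return 0
    fi
    "$@"
    local status=$?
    rmdir "$lock_dir"   # Release the lock once the command has finished
    return "$status"
}

# Usage from a wrapper script called by cron (paths are placeholders):
#   run_exclusive /tmp/ping-failure-alert.lock /path/to/ping-failure-alert.sh
```

Skipping a run rather than queueing it keeps alerts from piling up; the next scheduled run picks up the check as soon as the lock is released.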
Avoiding Unnecessary Alerts
With 3 retry attempts and a 30-second interval, the script waits ~90 seconds before declaring a server unreachable. This reduces unnecessary alerts for brief downtimes, like reboots.
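The ~90 second figure is per host, and the script checks hosts one after another, so the total run time grows with the number of unreachable targets. The sketch below works through that arithmetic. `worst_case_seconds` is a hypothetical helper, and the per-attempt ping wait is an assumed parameter, since how long an unanswered ping blocks depends on the ping implementation and its flags.

```shell
#!/bin/bash
# Estimate the worst-case runtime of one monitoring run, in seconds.
# Arguments: host count, retry_count, retry_interval, per-attempt ping wait.
worst_case_seconds() {
    local hosts=$1 retries=$2 interval=$3 ping_wait=$4
    echo $(( hosts * retries * (interval + ping_wait) ))
}

# One unreachable host, ignoring the time spent waiting on ping itself
worst_case_seconds 1 3 30 0   # prints 90

# All three example targets unreachable, assuming a 5-second wait per ping
worst_case_seconds 3 3 30 5   # prints 315
```

Under that assumed 5-second wait, three unreachable hosts take 315 seconds, slightly longer than the 5-minute cron interval, which is worth keeping in mind when adding more targets or raising retry_interval.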
Conclusion
The investigation showed that the server was never down: the alert was a false positive caused by the monitoring collector losing ping data. A fallback ping monitoring script gives us an independent connectivity check, so we are not misled by false positives from an external service. It is lightweight, customisable, and independent of the main monitoring service.