Monday, March 18, 2013

Log Monitoring Script

Scenario:
Our production application suffered from severe slowness and hangs. After a deep analysis, we traced the problem to a “timeout” issue. My teammates then asked me to write a simple shell script that checks the last 300 lines of each application server's log every 15 seconds. In addition to the monitoring, the script also needs to take thread dumps of the application servers and of the Terracotta servers, which are located on different machines. Finally, the thread dumps need to be mailed to specific people or groups.

We built this on top of an existing script of ours, which is available at the following link:
http://www.industryvertical.co.in/2013/01/script-thread-dump-of-multiple-servers.html

Cron Job:
To monitor the WebLogic server logs every 15 seconds, we set up a new cron job on the production Unix servers.
image
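Standard cron schedules no finer than one minute, so a 15-second interval is usually approximated by starting a small wrapper once per minute and letting it repeat the check. Below is a minimal sketch of that idea. The helper name `run_repeatedly` and the wrapper file name `run_every_15s.sh` are assumptions made for illustration and are not part of the original setup.

```shell
#!/bin/bash
# Hypothetical helper: run a command COUNT times, INTERVAL seconds apart.
run_repeatedly() {
    local count=$1 interval=$2
    shift 2
    local n=1
    while [ "$n" -le "$count" ]; do
        "$@"
        # No sleep after the final run, so the wrapper ends before cron starts the next one.
        if [ "$n" -lt "$count" ]; then
            sleep "$interval"
        fi
        n=$((n + 1))
    done
}

# Assumed crontab entry, started once per minute:
#   * * * * * /bin/bash /opt/thiru/Monitoring/run_every_15s.sh
# Assumed call inside that wrapper (4 runs, 15 seconds apart):
#   run_repeatedly 4 15 /bin/bash /opt/thiru/Monitoring/getThreadDump.sh
```

If a single check takes longer than a few seconds, the four runs may spill into the next minute, so the interval should be treated as approximate.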
Mailing List:
The script sends the thread dump as an attachment to the following people on the mailing list:
person1@kshop.com;
Person2@kshop.com;
Person3@nielsen.com;
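Command-line mailers such as mutt expect each recipient as a separate argument, not as one semicolon-separated string. The following sketch turns a list written like the one above into one address per line. The helper name `split_recipients` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: convert "a@x.com; b@y.com;" into one address per line.
split_recipients() {
    printf '%s\n' "$1" \
        | tr ';' '\n' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Possible usage (bash 4+ for mapfile), shown as comments only:
#   mapfile -t recipients < <(split_recipients "person1@kshop.com; Person2@kshop.com;")
#   mutt -s "subject" -a /path/to/dump.out -- "${recipients[@]}" < message_body
```

Keeping the addresses in an array means each one reaches the mailer as its own argument, with no stray semicolons attached.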
Script:
This script looks for the error message "put timed out" in the last 300 lines of the WebLogic server logs every 15 seconds.
Steps involved:
1. Find the latest log file of each server and grep its last 300 lines for the pattern "put timed out".
2. If the pattern is found, take a thread dump of that WebLogic server, and also trigger thread dumps of the active and passive Terracotta servers.
3. The Terracotta thread dump scripts are placed on the respective Terracotta boxes and are called through ssh (using RSA key-based authentication, i.e. ssh-rsa keys).
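Step 1 can be sketched in isolation as a small function that only answers whether the pattern occurs in the tail of a log file. The function name `log_has_pattern` and its optional third argument are assumptions for illustration; the full script further down does the same check inline.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if PATTERN occurs, case-insensitively,
# in the last LINES lines of FILE. LINES defaults to 300.
log_has_pattern() {
    local file=$1 pattern=$2 lines=${3:-300}
    # Return 2 for an unreadable file so callers can tell it apart from "no match".
    [ -r "$file" ] || return 2
    tail -n "$lines" "$file" | grep -qi -- "$pattern"
}

# Possible usage, shown as comments only:
#   if log_has_pattern "$LOG_FILE" "put timed out"; then
#       echo "Match!!"
#   fi
```

Using `grep -q` keeps the matching lines out of the cron output while still setting the exit status that the `if` relies on.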
Implementation Part:
#!/bin/bash 
#Purpose: Script to check timeout in server logs 
#Program Name: getThreadDump.sh

clear 
set_environment_variables() 
{ 
DATE=`date +"%Y%m%d_%H%M"`
SERVER_HOME=/opt/thiru/Oracle/Middleware/user_projects/domains/prod/servers 
JAVA_HOME=/opt/thiru/Oracle/Middleware/jdk1.6.0_26/ 
}

create_servers_list() 
{ 
        ls  ${SERVER_HOME}|grep -v Admin|grep -v logtail|grep -v domain_bak > /opt/thiru/Monitoring/list_of_servers.txt 
}

get_server_logs() 
{ 
  NO_OF_SERVERS=$(cat /opt/thiru/Monitoring/list_of_servers.txt|wc -l)
        i=1
        while [ $i -le $NO_OF_SERVERS ]
        do
        SERVER_NAME=$(cat /opt/thiru/Monitoring/list_of_servers.txt|tr '\n' ':'|cut -d":" -f$i)
         echo "Getting Server Logs for ${SERVER_NAME}"
         # Pick the most recently modified log file of this server (ls -tr lists the oldest first)
         ls -tr ${SERVER_HOME}/${SERVER_NAME}/logs/${SERVER_NAME}.* | tail -n 1 > /opt/thiru/Monitoring/last_one_server_log_files.txt
         FILE_NAME_1=$(cat /opt/thiru/Monitoring/last_one_server_log_files.txt|tr '\n' ':'|cut -d":" -f1)
         tail -n 300 ${FILE_NAME_1}| grep -i "put timed out"
         if [ $? -eq 0 ];
         then
         echo "Match!!"
         get_thread_dump_box1 ${SERVER_NAME}
         get_thread_dump_tc
         else
         echo "Not Matched!!"
         fi
         i=`expr $i + 1`
        done
}

get_thread_dump_box1() 
{ 
         echo "Getting Process ID of ${SERVER_NAME}"
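         # Exclude the grep process itself; filtering on a 00:00:00 CPU time could also hide a freshly started server
         ps -ef | grep -i Dweblogic.Name=${SERVER_NAME} | grep -v grep | awk '{print $2}' > /opt/thiru/Monitoring/process_id.txt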

         PID=$(cat /opt/thiru/Monitoring/process_id.txt|tr '\n' ':'|cut -d":" -f1)
         echo "Taking ThreadDump for ${SERVER_NAME} with Process ID ${PID}"
         ${JAVA_HOME}/bin/jstack $PID > /opt/thiru/Monitoring/${SERVER_NAME}.out
         FILE_NAME=${SERVER_NAME}.out
         echo ${FILE_NAME}
         send_mail ${FILE_NAME} 
}

get_thread_dump_box2() 
{ 
        echo "Checking in 192.168.56.2 Box!!"
        ssh 192.168.56.2 /bin/bash /opt/thiru/Monitoring/getThreadDump.sh 
}

get_thread_dump_box3() 
{ 
        echo "Checking in 192.168.56.3 Box!!"
        ssh 192.168.56.3 /bin/bash /opt/thiru/Monitoring/getThreadDump.sh 
}

get_thread_dump_tc() 
{ 
        echo "Getting Thread dump of Active Terracotta Server!!"
        ssh 10.7.21.21 /bin/bash /opt/thiru/Monitoring/getThreadDumpTC1.sh 
         echo "Getting Thread dump of Passive Terracotta Server!!"
        ssh 10.7.21.22 /bin/bash /opt/thiru/Monitoring/getThreadDumpTC2.sh 
}

send_mail() 
{ 
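        Mail_Subject="Error_Message:put_timed_out"
        # mutt takes recipients as separate arguments, so list them space-separated without semicolons
        to_email="person1@kshop.com Person2@kshop.com Person3@nielsen.com"
        # "--" ends the attachment list; recent mutt versions require it after -a
        mutt -s "$Mail_Subject" -a /opt/thiru/Monitoring/${FILE_NAME} -- $to_email < /opt/thiru/Monitoring/message_body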
}

clear_event() 
{ 
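        # -f keeps the cleanup quiet when a file was never created (e.g. no thread dump was taken)
        rm -f /opt/thiru/Monitoring/last_one_server_log_files.txt
        rm -f /opt/thiru/Monitoring/list_of_servers.txt
        rm -f /opt/thiru/Monitoring/process_id.txt
        rm -f /opt/thiru/Monitoring/*.out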
}

set_environment_variables 
create_servers_list 
get_server_logs 
get_thread_dump_box2 
get_thread_dump_box3 
clear_event
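
The server loop in create_servers_list and get_server_logs re-reads the list file for every index. As an alternative sketch, the same traversal can be written with a `while read` loop and no temporary files. The function names `list_servers`, `newest_log` and `for_each_server` are hypothetical; the filter patterns are the ones used in the script above.

```shell
#!/bin/bash
# Hypothetical helper: print server directory names, one per line,
# skipping the same entries the original script filters out.
list_servers() {
    local server_home=$1
    ls "$server_home" | grep -v -e Admin -e logtail -e domain_bak
}

# Hypothetical helper: print the most recently modified log file of one server.
newest_log() {
    local server_home=$1 server=$2
    # ls -tr lists the oldest file first, so the last line is the newest file.
    ls -tr "$server_home/$server/logs/$server".* 2>/dev/null | tail -n 1
}

# Walk all servers and print "<server>: <newest log file>" for each.
for_each_server() {
    local server_home=$1
    list_servers "$server_home" | while read -r server; do
        echo "$server: $(newest_log "$server_home" "$server")"
    done
}
```

Inside the loop body, the `echo` could be replaced by the "put timed out" check and the thread dump call from the script above.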

Mail with attached Thread Dump:

image

I hope this script helps someone resolve a similar issue. Please post a comment if you have any questions!
