Monday, March 18, 2013
Log Monitoring Script
Posted by Thirunavukkarasu Muthuswamy at Monday, March 18, 2013
Labels: Oracle Weblogic Application Server, Script, Unix
Scenario:
Our production application suffered critical slowness and hangs. After a deep analysis we traced the problem to a “timeout” issue. My teammates then asked me to write a simple shell script that monitors the last 300 lines of the application server logs every 15 seconds. In addition to monitoring, the script also had to take thread dumps of the application servers and the Terracotta servers, which are located across distributed machines, and finally mail those thread dumps to specific people or groups.
We built it on top of an existing script of ours, which is available at the following link:
http://www.industryvertical.co.in/2013/01/script-thread-dump-of-multiple-servers.html
Cron Job:
To monitor the WebLogic server logs every 15 seconds, we set up a new cron job on the production Unix servers.
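Standard cron schedules at one-minute granularity, so a 15-second interval needs a small workaround. The following is only a sketch of one common approach, not necessarily the setup used here: cron starts a wrapper once a minute, and the wrapper runs the monitoring script four times, 15 seconds apart. The function name run_repeatedly and the wrapper file name run_every_15s.sh are hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper (run_every_15s.sh): cron starts it once a minute and
# it runs the given command COUNT times, sleeping INTERVAL seconds in between.

# Usage: run_repeatedly COUNT INTERVAL COMMAND [ARGS...]
run_repeatedly() {
    local count="$1"
    local interval="$2"
    shift 2
    local n=1
    while [ "$n" -le "$count" ]
    do
        "$@"                      # run the monitored command
        if [ "$n" -lt "$count" ]; then
            sleep "$interval"     # no sleep after the final run
        fi
        n=$((n + 1))
    done
}

# Assumed crontab entry, one start per minute:
#   * * * * * /bin/bash /opt/thiru/Monitoring/run_every_15s.sh
# with this as the last line of the wrapper:
#   run_repeatedly 4 15 /bin/bash /opt/thiru/Monitoring/getThreadDump.sh
```

Note that the sleeps are added on top of each run's own duration, so if one pass of the monitoring script takes long, the fourth run can overlap with the wrapper that cron starts in the next minute.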
Mailing List:
The script sends the thread dump as an attachment to the following people on the mailing list:
person1@kshop.com
Person2@kshop.com
Person3@nielsen.com
Script:
This script checks the last 300 lines of the WebLogic server logs every 15 seconds for the error message "put timed out".
Steps involved:
1. Get the latest log file of each server and grep it for the pattern "put timed out".
2. If the pattern is found, take a thread dump of that WebLogic server and also take thread dumps of the active and passive Terracotta servers.
3. The Terracotta thread dump scripts are placed on the respective Terracotta boxes and are called through ssh. (We used RSA key-based ssh authentication for this.)
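The check in steps 1 and 2 can be isolated into a small helper, which makes it easy to try against a sample log before wiring it into cron. This is a minimal sketch; the function name log_has_pattern and its optional third argument are assumptions for illustration and are not part of the original script.

```shell
#!/bin/bash
# Usage: log_has_pattern FILE PATTERN [LINES]
# Exits 0 when PATTERN occurs (case-insensitively) in the last LINES lines
# of FILE, and non-zero otherwise. LINES defaults to 300.
log_has_pattern() {
    local file="$1"
    local pattern="$2"
    local lines="${3:-300}"
    # grep's exit status is the status of the whole pipeline.
    tail -n "$lines" "$file" | grep -i -- "$pattern" > /dev/null
}

# Example (the log path is a placeholder):
#   if log_has_pattern /path/to/server.log "put timed out"; then
#       echo "Match!!"
#   fi
```

Only the tail of the file is searched, so an old occurrence that has already scrolled out of the last 300 lines does not trigger another thread dump.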
Implementation Part:
#!/bin/bash
# Purpose: Script to check timeout in server logs
# Program Name: getThreadDump.sh

set_environment_variables() {
    DATE=$(date +"%Y%m%d_%H%M")
    SERVER_HOME=/opt/thiru/Oracle/Middleware/user_projects/domains/prod/servers
    JAVA_HOME=/opt/thiru/Oracle/Middleware/jdk1.6.0_26/
}

# List the managed servers, skipping the Admin server and non-server directories.
create_servers_list() {
    ls ${SERVER_HOME} | grep -v Admin | grep -v logtail | grep -v domain_bak > /opt/thiru/Monitoring/list_of_servers.txt
}

# Check the last 300 lines of each server's newest log for "put timed out".
get_server_logs() {
    NO_OF_SERVERS=$(wc -l < /opt/thiru/Monitoring/list_of_servers.txt)
    i=1
    while [ $i -le $NO_OF_SERVERS ]
    do
        # Pick the i-th server name from the list.
        SERVER_NAME=$(sed -n "${i}p" /opt/thiru/Monitoring/list_of_servers.txt)
        echo "Getting Server Logs for ${SERVER_NAME}"
        # ls -tr sorts oldest first, so the last entry is the newest log file.
        ls -tr ${SERVER_HOME}/${SERVER_NAME}/logs/${SERVER_NAME}.* | tail -n 1 > /opt/thiru/Monitoring/last_one_server_log_files.txt
        FILE_NAME_1=$(head -n 1 /opt/thiru/Monitoring/last_one_server_log_files.txt)
        if tail -n 300 "${FILE_NAME_1}" | grep -i "put timed out"
        then
            echo "Match!!"
            get_thread_dump_box1 "${SERVER_NAME}"
            get_thread_dump_tc
        else
            echo "Not Matched!!"
        fi
        i=$((i + 1))
    done
}

# Take a thread dump of the given local WebLogic server and mail it.
get_thread_dump_box1() {
    SERVER_NAME=$1
    echo "Getting Process ID of ${SERVER_NAME}"
    # The [D] keeps grep from matching its own entry in the process list.
    ps -ef | grep "[D]weblogic.Name=${SERVER_NAME}" | awk '{print $2}' > /opt/thiru/Monitoring/process_id.txt
    PID=$(head -n 1 /opt/thiru/Monitoring/process_id.txt)
    if [ -z "${PID}" ]
    then
        echo "No running process found for ${SERVER_NAME}"
        return 1
    fi
    echo "Taking ThreadDump for ${SERVER_NAME} with Process ID ${PID}"
    ${JAVA_HOME}/bin/jstack ${PID} > /opt/thiru/Monitoring/${SERVER_NAME}.out
    FILE_NAME=${SERVER_NAME}.out
    echo ${FILE_NAME}
    send_mail ${FILE_NAME}
}

get_thread_dump_box2() {
    echo "Checking in 192.168.56.2 Box!!"
    ssh 192.168.56.2 /bin/bash /opt/thiru/Monitoring/getThreadDump.sh
}

get_thread_dump_box3() {
    echo "Checking in 192.168.56.3 Box!!"
    ssh 192.168.56.3 /bin/bash /opt/thiru/Monitoring/getThreadDump.sh
}

get_thread_dump_tc() {
    echo "Getting Thread dump of Active Terracotta Server!!"
    ssh 10.7.21.21 /bin/bash /opt/thiru/Monitoring/getThreadDumpTC1.sh
    echo "Getting Thread dump of Passive Terracotta Server!!"
    ssh 10.7.21.22 /bin/bash /opt/thiru/Monitoring/getThreadDumpTC2.sh
}

# Mail the given thread dump file as an attachment.
send_mail() {
    FILE_NAME=$1
    Mail_Subject="Error_Message:put_timed_out"
    to_email="person1@kshop.com Person2@kshop.com Person3@nielsen.com"
    # "--" ends mutt's attachment list; the mail body is read from message_body.
    mutt -s "${Mail_Subject}" -a /opt/thiru/Monitoring/${FILE_NAME} -- ${to_email} < /opt/thiru/Monitoring/message_body
}

# Remove the temporary files created during this run.
clear_event() {
    rm -f /opt/thiru/Monitoring/last_one_server_log_files.txt
    rm -f /opt/thiru/Monitoring/list_of_servers.txt
    rm -f /opt/thiru/Monitoring/process_id.txt
    rm -f /opt/thiru/Monitoring/*.out
}

set_environment_variables
create_servers_list
get_server_logs
get_thread_dump_box2
get_thread_dump_box3
clear_event
Mail with attached Thread Dump:
I hope this script helps someone resolve a similar issue. Please post a comment if you have any questions!