WMSMonitor DB-Analyzer (Available from release 2.0)
The DB analyzer is a daemon that periodically checks the WMSMonitor database for new data, keeps track of the status of every monitored instance, and notifies that status to a NAGIOS server. The NAGIOS server must be configured to accept WMSMonitor notifications.
The main purpose of the DB analyzer is to send notifications to NAGIOS, which in turn can send email, interact with the SMS gateway, and keep a database of the alarm history.
This is a cheap way to implement a robust notification service for WMSMonitor. We are however working to implement a standalone notification service for the DB analyzer.
It is also possible to define groups of instances, so that dedicated notifications are sent to NAGIOS about the group as a whole and not only about the single instances.
This is particularly convenient when a site has multiple instances dedicated to one VO.
All the executables needed to start the DB analyzer are already present on the data_collector, under the usual directory /root/wmsmon.
The analyzer uses the same wmsmon_site-info.def file to obtain the DB connection parameters.
Configuration and start of the DB Analyzer
The DB analyzer is implemented in Python, and many parameters are still hardcoded in the Python executable, so they must be changed by editing the executable itself.
We will provide an installation script in the next WMSMonitor releases.
The analyzer sends NAGIOS notifications for a service named MON-WMS or MON-LB, which should be configured in NAGIOS as a service of the WMS (or LB) host.
A typical notification is the following (where gstore.cnaf.infn.it is the NAGIOS server):
In case of an LB:
echo "lb010;MON-LB;0;lb010.cnaf.infn.it STATUS is OK - " | send_nsca -H gstore.cnaf.infn.it -d ';' -c /etc/nagios/send_nsca.cfg
In case of a WMS:
echo "devel14;MON-WMS;2;devel14.cnaf.infn.it STATUS is CRITICAL - At least daemon LM is dead!" | send_nsca -H gstore.cnaf.infn.it -d ';' -c /etc/nagios/send_nsca.cfg
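The two commands above share one message layout: short host name, service name, return code and free text, joined by the ';' delimiter passed to send_nsca with -d. A minimal Python sketch of how such a message could be built and piped to send_nsca follows. The function names are hypothetical and are not taken from the analyzer source; the numeric codes are the standard NAGIOS ones (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Standard NAGIOS return codes.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def build_nsca_message(short_name, fqdn, service, status, detail="", delimiter=";"):
    """Return one passive-check line: host;service;code;text (hypothetical helper)."""
    code = NAGIOS_CODES[status]
    text = "%s STATUS is %s - %s" % (fqdn, status, detail)
    return delimiter.join([short_name, service, str(code), text])


def send_to_nagios(message, nagios_host,
                   config="/etc/nagios/send_nsca.cfg", delimiter=";"):
    """Pipe the message to send_nsca, as the echo | send_nsca commands do."""
    cmd = ["send_nsca", "-H", nagios_host, "-d", delimiter, "-c", config]
    # echo appends a newline, so do the same here.
    result = subprocess.run(cmd, input=message + "\n", text=True, check=False)
    return result.returncode


if __name__ == "__main__":
    # Only build and print the message; sending requires a reachable NSCA server.
    print(build_nsca_message("lb010", "lb010.cnaf.infn.it", "MON-LB", "OK"))
```

Keeping message construction separate from the send_nsca call makes the format easy to check without a NAGIOS server at hand.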
Four kinds of notification can be sent for any single instance: OK, WARNING, CRITICAL and UNKNOWN, defined as follows:
OK: no problem found in the DB for that specific instance
WARNING: problems are found but they are not critical, e.g. the queues of internal WMS/LB components are growing but are not yet too long, or a file system occupancy is between 80% and 90%
CRITICAL: something bad was found in the DB about the instance, e.g. the queues of internal WMS/LB components hold more than 3000 entries, or a file system occupancy is greater than 90%
UNKNOWN: the latest data about an instance are too old to give a reliable status
NOTE that the analyzer is able to associate an LB with a WMS from the information stored in the DB. The status of the LB affects the status of the WMS, but not vice versa: the worse of the WMS and LB statuses is notified for the WMS. For example, if the LB is in WARNING and the WMS itself is OK, the notification for the WMS will be WARNING.
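The per-instance rules can be sketched in Python as follows. This is only an illustration under stated assumptions, not the analyzer's actual code: the 3000-entry and 80%/90% limits come from the text, while QUEUE_WARNING, MAX_AGE_SECONDS, both function names and the treatment of UNKNOWN when combining a WMS with its LB are hypothetical. The text describes the queue WARNING condition as queues that are increasing, which is simplified here to a fixed threshold.

```python
# Severity order used to pick the worse of two statuses.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}

QUEUE_CRITICAL = 3000    # from the text: more than 3000 queue entries
FS_WARNING = 80.0        # from the text: occupancy between 80% and 90%
FS_CRITICAL = 90.0       # from the text: occupancy greater than 90%
QUEUE_WARNING = 1000     # ASSUMPTION: stand-in for "increasing but not too high"
MAX_AGE_SECONDS = 3600   # ASSUMPTION: the text does not say how old is too old


def instance_status(queue_len, fs_percent, data_age_seconds):
    """Classify one instance from its latest database values."""
    if data_age_seconds > MAX_AGE_SECONDS:
        return "UNKNOWN"
    if queue_len > QUEUE_CRITICAL or fs_percent > FS_CRITICAL:
        return "CRITICAL"
    if queue_len > QUEUE_WARNING or fs_percent >= FS_WARNING:
        return "WARNING"
    return "OK"


def wms_notified_status(wms_status, lb_status):
    """Return the status notified for a WMS, given its associated LB."""
    # ASSUMPTION: an UNKNOWN WMS is reported as UNKNOWN, and an UNKNOWN LB
    # is ignored; the text only defines the OK/WARNING/CRITICAL cases.
    if wms_status == "UNKNOWN":
        return "UNKNOWN"
    if lb_status not in SEVERITY:
        return wms_status
    return max(wms_status, lb_status, key=SEVERITY.get)


if __name__ == "__main__":
    wms = instance_status(queue_len=10, fs_percent=40.0, data_age_seconds=60)
    lb = instance_status(queue_len=10, fs_percent=85.0, data_age_seconds=60)
    print(wms, lb, wms_notified_status(wms, lb))
```

The LB status is passed into the WMS function but never the other way round, which mirrors the one-way dependency described in the NOTE.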
NAGIOS should be configured to handle all these notifications. For example, the CNAF NAGIOS is configured to notify by mail every status change on any instance.
The DB analyzer also sends notifications about groups of instances.
Groups are discovered from the WMSMonitor DB; they reflect the group reported in the third column of the wmslist.conf file.
Notifications are sent to NAGIOS for each group following these rules:
OK: no problem found in the DB for that specific group
WARNING: less than 50% of the group instances are in critical status.
CRITICAL: more than 50% of the group instances are in critical status.
UNKNOWN: the latest data about an instance are too old to have a reliable status.
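A minimal Python sketch of these group rules is given below. The function name is hypothetical, and several points the rules leave open are filled with labeled assumptions: exactly 50% of instances in CRITICAL is treated as WARNING, a group with no CRITICAL instance is reported as OK, and a single UNKNOWN instance (or an empty group) makes the whole group UNKNOWN.

```python
def group_status(instance_statuses):
    """Derive a group status from the statuses of its instances."""
    total = len(instance_statuses)
    # ASSUMPTION: an empty group or any UNKNOWN member gives UNKNOWN.
    if total == 0 or "UNKNOWN" in instance_statuses:
        return "UNKNOWN"
    critical = sum(1 for status in instance_statuses if status == "CRITICAL")
    # ASSUMPTION: instance-level WARNINGs do not count as group problems.
    if critical == 0:
        return "OK"
    # More than 50% critical; integer arithmetic avoids float rounding.
    if 2 * critical > total:
        return "CRITICAL"
    # Less than 50% critical; ASSUMPTION: exactly 50% also lands here.
    return "WARNING"


if __name__ == "__main__":
    print(group_status(["OK", "CRITICAL", "OK"]))
    print(group_status(["CRITICAL", "CRITICAL", "OK"]))
```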
It is possible to configure subgroups for any group by editing the file /root/groupfile. For example, to create the subgroups ANALYSIS and PROD for the CMS VO, the groupfile looks like:
#cat /root/groupfile
wms001.your_domain cms PROD
wms002.your_domain cms ANALYSIS
wms003.your_domain cms ANALYSIS
wms004.your_domain cms PROD
wms005.your_domain cms ANALYSIS
In this way notifications are sent for each subgroup and not for the group itself. By default, notifications are sent for NAGIOS services called GROUP-SUBGROUP-WMS, belonging to the WMSMonitor server host.
As for single instances, NAGIOS should be configured to handle (sub)group notifications.
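To make the groupfile layout concrete, the following Python sketch parses lines of the form "host group subgroup" and derives a service name of the GROUP-SUBGROUP-WMS form. It is an illustration only: the function names are hypothetical, the skipping of blank and '#' lines is an assumption, and so is the choice to keep group and subgroup in the case used in the file.

```python
from collections import defaultdict

# Sample content matching the groupfile example above.
SAMPLE_GROUPFILE = [
    "wms001.your_domain cms PROD",
    "wms002.your_domain cms ANALYSIS",
    "wms003.your_domain cms ANALYSIS",
    "wms004.your_domain cms PROD",
    "wms005.your_domain cms ANALYSIS",
]


def parse_groupfile(lines):
    """Map (group, subgroup) to the list of hosts assigned to it."""
    subgroups = defaultdict(list)
    for line in lines:
        line = line.strip()
        # ASSUMPTION: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed lines rather than guess their meaning
        host, group, subgroup = fields
        subgroups[(group, subgroup)].append(host)
    return dict(subgroups)


def service_name(group, subgroup):
    """Build a NAGIOS service name of the form GROUP-SUBGROUP-WMS."""
    # ASSUMPTION: names are used exactly as written in the groupfile.
    return "%s-%s-WMS" % (group, subgroup)


if __name__ == "__main__":
    for (group, subgroup), hosts in sorted(parse_groupfile(SAMPLE_GROUPFILE).items()):
        print(service_name(group, subgroup), hosts)
```

Each (group, subgroup) pair then corresponds to one NAGIOS service that has to be defined on the WMSMonitor server host.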
Before starting the analyzer you should configure the hostname of the NAGIOS server. This must be done by hand, editing the file /root/wmsmon/bin/analyzer-utils.py and replacing the string "gstore.cnaf.infn.it" with your NAGIOS server hostname.
Now you are ready to start the analyzer as a normal Linux background process:
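If you prefer not to edit the file in an editor, the substitution can be scripted. The Python sketch below is a hypothetical helper, not part of WMSMonitor: it replaces every occurrence of the default hostname in the given file and keeps a .bak copy of the original. The hostname nagios.example.org is a placeholder.

```python
import shutil


def set_nagios_host(path, new_host, old_host="gstore.cnaf.infn.it"):
    """Replace old_host with new_host in the file; return the number of hits."""
    with open(path) as handle:
        content = handle.read()
    count = content.count(old_host)
    if count:
        # Keep a backup so the original file can be restored.
        shutil.copy2(path, path + ".bak")
        with open(path, "w") as handle:
            handle.write(content.replace(old_host, new_host))
    return count


if __name__ == "__main__":
    # Placeholder hostname; replace it with your own NAGIOS server.
    hits = set_nagios_host("/root/wmsmon/bin/analyzer-utils.py", "nagios.example.org")
    print("replaced %d occurrence(s)" % hits)
```

A return value of 0 means the default hostname was not found, for example because the file has already been edited.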
#/root/wmsmon/bin/wmsmon-db-analyzer.py > /var/log/wmsmon-db-analyzer.log 2>&1 &
NOTE that the analyzer logs to stdout.
It is advisable to set up log rotation for /var/log/wmsmon-db-analyzer.log. You just need to add the following lines to /etc/wmsmon_logrotate.conf:
/var/log/wmsmon-db-analyzer.log {
copytruncate
rotate 10
size = 100M
missingok
nomail
}
In case of problems running the analyzer please contact wmsmon<at>cnaf.infn.it.