Nagios WMS probe
Tests the WMS service by submitting jobs to CEs. It is based on the
python-GridMon library developed by the
SAM team. Details about the command line parameters can be found
here.
Installation
This probe needs to be installed on a WMS User Interface because it uses the wms-cli commands to monitor the WMS.
Dependencies
| python | >= 2.4 |
| python-ldap | |
| openssl | >= 0.9.8e-12 |
| nagios-submit-conf | >= 0.2 |
| python-GridMon | >= 1.1.10 |
The last two RPMs can be installed from this repository:
[egi-sam]
name=EGI SAM repo
baseurl=http://repository.egi.eu/sw/production/sam/1/$basearch
enabled=1
gpgcheck=0
protect=1
priority=10
WMS Metrics
- emi.wms.WMS-JobState. Submits grid job to CE(s) via WMS under test. Accepts passive check updates from emi.wms.WMS-JobMonit.
- emi.wms.WMS-JobMonit. Monitors submitted grid jobs.
- emi.wms.WMS-JobCancel. Cancels the grid job.
- emi.wms.WMS-JobSubmit. Passive check. Holds terminal status of job submission.
emi.wms.WMS-JobState
This metric is used to submit jobs through the WMS under test. It accepts these parameters:
--jdl-templ <file> JDL template file (full path).
Default: <emi.wms.ProbesLocation>/WMS-jdl.template
--jdl-retrycount <val> JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val> JDL ShallowRetryCount (Default: 1).
--ces-file <file> File with list of CEs. Two schemes [file:] or http:
(Default: /var/lib/gridprobes/<vo>/GoodCEs)
--prev-status <0-3> Previous Nagios status of the metric.
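A minimal Python 3 sketch of how a CE list could be read under both schemes accepted by --ces-file. The function name load_ce_hostnames and the file layout (one hostname per line, '#' comments ignored) are assumptions made for illustration; the probe's actual parsing is not described here.

```python
import os
from urllib.request import urlopen


def load_ce_hostnames(location):
    """Return CE hostnames read from a local path, a file: URL or an http: URL.

    Assumption: one hostname per line; blank lines and '#' comments are skipped.
    """
    if location.startswith(("http:", "https:")):
        with urlopen(location, timeout=30) as response:
            text = response.read().decode("utf-8")
    else:
        # The "file:" prefix is optional, so strip it when present.
        path = location[len("file:"):] if location.startswith("file:") else location
        if not os.path.exists(path):
            return []  # no list available
        with open(path) as handle:
            text = handle.read()

    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts
```

For example, load_ce_hostnames("/var/lib/gridprobes/dteam/GoodCEs") would return the hostnames stored in the default file for the dteam VO, or an empty list if the file is missing.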
GoodCEs
If the file "GoodCEs" exists, all CEs listed in it are OR'ed in the resulting Requirements ClassAd expression, e.g.:
Requirements = (other.GlueCEInfoHostName == "ce106.cern.ch") || (other.GlueCEInfoHostName == "creamce.gina.sara.nl")
This file can be populated automatically using the gather_healthy_nodes script from the hr.srce grid-monitoring probes.
If the file does not exist, a very general requirement:
Requirements = true
is used for match-making.
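The rule described in this section can be sketched in Python 3 as follows. The function name build_requirements is hypothetical and the code is an illustration of the OR'ing rule, not the probe's own implementation.

```python
def build_requirements(ce_hostnames):
    """Build the JDL Requirements expression from a list of CE hostnames.

    With no hostnames, fall back to the general requirement "true".
    """
    if not ce_hostnames:
        return "true"
    clauses = ['(other.GlueCEInfoHostName == "%s")' % host for host in ce_hostnames]
    return " || ".join(clauses)


if __name__ == "__main__":
    print("Requirements = " + build_requirements(["ce106.cern.ch", "creamce.gina.sara.nl"]))
```

Called with the two hostnames shown, the function produces the same expression as the example in this section.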
JDL template
This is the default JDL template used for submission:
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = <jdlReqCEInfoHostName>;
At the moment the jdlExecutable parameter is hardcoded, and the script /bin/hostname is used.
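To illustrate how the placeholders in the template are meant to be filled, here is a small Python 3 sketch based on plain string substitution. The helper render_jdl is hypothetical, and the values passed in are the defaults documented on this page; the probe may perform the substitution differently.

```python
JDL_TEMPLATE = """Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = <jdlReqCEInfoHostName>;
"""


def render_jdl(template, values):
    """Replace every <name> placeholder with the matching value."""
    for name, value in values.items():
        template = template.replace("<%s>" % name, str(value))
    return template


if __name__ == "__main__":
    print(render_jdl(JDL_TEMPLATE, {
        "jdlExecutable": "/bin/hostname",   # hardcoded executable
        "jdlRetryCount": 0,                 # default of --jdl-retrycount
        "jdlShallowRetryCount": 1,          # default of --jdl-shallowretrycount
        "jdlReqCEInfoHostName": "true",     # general requirement
    }))
```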
emi.wms.WMS-JobMonit
Monitors submitted grid jobs. The implementation is threaded, with one thread per monitored resource and a maximum of 10 threads. While a job is not in a terminal state, the metric passively updates emi.wms.WMS-JobState with the latest state of the job according to the WMS. When a job enters a terminal state or is cancelled, the metric updates both emi.wms.WMS-JobState and emi.wms.WMS-JobSubmit with the final job status. These two metrics are updated (as passive checks) either via the Nagios command file or via NSCA. emi.wms.WMS-JobSubmit is the metric that goes to the Metric Store Database.
The emi.wms.WMS-JobMonit metric accepts these parameters:
--timeout-job-global <sec> Global timeout for jobs. Job will be cancelled and dropped
if it is not in terminal state by that time. (Default: 3600)
--timeout-job-waiting <sec> Time allowed for a job to stay in Waiting with 'no compatible
resources'. (Default: 2700)
--hosts <h1,h2,..> Comma-separated list of CE hostnames to run monitor on.
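The two timeouts can be read as the following decision rule, sketched in Python 3. Everything here is an assumption made for illustration: the function name timeout_action, the set of terminal states (taken from the state names that appear in the probe's performance data), the way the elapsed times are measured, and the returned labels. What the probe does after the waiting timeout expires is not covered on this page.

```python
# Assumption: these are the states treated as terminal.
TERMINAL_STATES = {"DONE", "ABORTED", "CANCELLED", "CLEARED"}


def timeout_action(status, reason, age_sec, waiting_sec,
                   timeout_global=3600, timeout_waiting=2700):
    """Decide whether a monitored job has exceeded one of its timeouts.

    age_sec     -- seconds since the job was submitted (assumed reference point)
    waiting_sec -- seconds the job has spent in the Waiting state
    Returns "cancel", "waiting-timeout" or None (labels are illustrative).
    """
    state = status.upper()
    if state in TERMINAL_STATES:
        return None
    if age_sec > timeout_global:
        return "cancel"  # job is cancelled and dropped
    if (state == "WAITING"
            and "no compatible resources" in reason.lower()
            and waiting_sec > timeout_waiting):
        return "waiting-timeout"
    return None
```

With the default values, a job still in Running after 3601 seconds yields "cancel", while a job in Waiting with 'no compatible resources' for 2800 seconds yields "waiting-timeout".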
Test
To test the probe you have to create a valid proxy.
State + Monit
First you have to "submit" a jdl:
/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobState
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobState
OK:
OK:
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
The job identifier has been saved in the following file:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobID
==========================================================================
Then you can "monitor" the job:
/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobMonit --pass-check-dest active
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
metric results >>> <cream-45.pd.infn.it,emi.wms.WMS-JobState-dteam>
OK: [Scheduled] https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
OK: [Scheduled] https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
glite-wms-job-status https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
Current Status: Scheduled
Status Reason: unavailable
Destination: cccreamceli09.in2p3.fr:8443/cream-sge-long
Submitted: Mon Nov 7 16:38:01 2011 CET
==========================================================================
OK: Jobs processed - 1
OK: Jobs processed - 1
[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; SCHEDULED=1;; SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
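The last line of the output above carries Nagios performance data after the '|' character, as label=value;warn;crit items. A minimal Python 3 sketch for turning that line into a dictionary follows; the function name parse_perfdata is hypothetical, and the parser only handles plain integer values as they appear in this output.

```python
def parse_perfdata(line):
    """Return {label: value} from the performance data part of a plugin output line."""
    _, _, perfdata = line.partition("|")
    counters = {}
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        # The value is the first ';'-separated field; warn/crit thresholds follow.
        counters[label] = int(rest.split(";")[0])
    return counters


if __name__ == "__main__":
    sample = ("[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; SCHEDULED=1;; "
              "SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; "
              "MISSED=0;; UNDETERMINED=0;; unknown=0;1;2")
    print(parse_perfdata(sample))
```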
If you run the emi.wms.WMS-JobState metric again, its output shows the last status seen by the emi.wms.WMS-JobMonit metric:
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobState
OK: Active job - Scheduled [2011-11-07T15:38:28Z]
OK: Active job - Scheduled [2011-11-07T15:38:28Z]
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Active job - Scheduled [2011-11-07T15:38:28Z]
Finally, once the job has finished, running the emi.wms.WMS-JobMonit metric should also trigger the retrieval of the job output (get_output):
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
metric results >>> <cream-45.pd.infn.it,emi.wms.WMS-JobSubmit-dteam>
OK: success.
glite-wms-job-output --noint --nosubdir --dir /var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobOutput https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ 2>&1
Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server
Warning - option --nosubdir specified:
output files with same name will be overridden
Warning - Directory already exists:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobOutput
================================================================================
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
have been successfully retrieved and stored in the directory:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobOutput
================================================================================
metric results >>> <cream-45.pd.infn.it,emi.wms.WMS-JobState-dteam>
OK: success.
OK: Jobs processed - 1
OK: Jobs processed - 1
Done : 1|jobs_processed=1;; DONE=1;; RUNNING=0;; SCHEDULED=0;; SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
You can verify that it worked correctly by checking the status of the job:
[ale@cream-48 ~]$ glite-wms-job-status https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
Current Status: Cleared
Status Reason: user retrieved output sandbox
Destination: cccreamceli09.in2p3.fr:8443/cream-sge-long
Submitted: Mon Nov 7 16:38:01 2011 CET
==========================================================================
State + Monit + Cancel
First you have to "submit" a jdl:
/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobState
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobState
OK:
OK:
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
The job identifier has been saved in the following file:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobID
==========================================================================
As before you can "monitor" the job:
/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobMonit --pass-check-dest active
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
metric results >>> <cream-45.pd.infn.it,emi.wms.WMS-JobState-dteam>
OK: [Scheduled] https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
OK: [Scheduled] https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
glite-wms-job-status https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
Current Status: Scheduled
Status Reason: unavailable
Destination: t2-ce-01.lnl.infn.it:8443/cream-lsf-cert1
Submitted: Mon Nov 7 16:51:29 2011 CET
==========================================================================
OK: Jobs processed - 1
OK: Jobs processed - 1
[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; SCHEDULED=1;; SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
Before it finishes you can "cancel" it:
/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobCancel
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobCancel
OK: job cancelled
OK: job cancelled
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Job cancellation request sent:
glite-wms-job-cancel --noint -i /var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobID
Job bookkeeping files deleted.
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
OK: no active jobs [2011-11-07T16:51:53Z]
OK: no active jobs [2011-11-07T16:51:53Z]|jobs_processed=0;;
You can verify that it worked correctly by checking the status of the job:
[ale@cream-48 ~]$ glite-wms-job-status https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
Current Status: Cancelled
Destination: t2-ce-01.lnl.infn.it:8443/cream-lsf-cert1
Submitted: Mon Nov 7 16:51:29 2011 CET
==========================================================================
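The manual test sequences above can also be driven from a script. The following Python 3 sketch assembles the probe command lines using only the options shown on this page; the helper names probe_command and run_probe are hypothetical, and the mapping of exit codes follows the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

PROBE = "/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe"
NAGIOS_STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def probe_command(metric, vo, proxy, wms_host, extra=()):
    """Build the argument list for one run of the probe."""
    return [PROBE, "--vo", vo, "-x", proxy, "-H", wms_host, "-m", metric] + list(extra)


def run_probe(metric, vo, proxy, wms_host, extra=()):
    """Run the probe and return (status name, first line of output)."""
    result = subprocess.run(probe_command(metric, vo, proxy, wms_host, extra),
                            capture_output=True, text=True)
    lines = result.stdout.splitlines()
    first_line = lines[0] if lines else ""
    return NAGIOS_STATUS.get(result.returncode, "UNKNOWN"), first_line


if __name__ == "__main__":
    # Submit, then monitor, mirroring the "State + Monit" test.
    args = ("dteam", "/tmp/x509up_u501", "cream-45.pd.infn.it")
    print(run_probe("emi.wms.WMS-JobState", *args))
    print(run_probe("emi.wms.WMS-JobMonit", *args, extra=["--pass-check-dest", "active"]))
```

The script must run on the WMS User Interface where the probe is installed, with a valid proxy at the given path.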