Nagios WMS probe

Tests the WMS service by submitting jobs to CEs. It is based on the python-GridMon library developed by the SAM team. Details about the command-line parameters can be found here.

Installation

This probe needs to be installed on a WMS User Interface because it uses the wms-cli commands to monitor the WMS.

Dependencies

python >= 2.4
python-ldap  
openssl >= 0.9.8e-12
nagios-submit-conf >= 0.2
python-GridMon >= 1.1.10

The last two RPMs can be installed from this repository:

[egi-sam]
name=EGI SAM repo
baseurl=http://repository.egi.eu/sw/production/sam/1/$basearch
enabled=1
gpgcheck=0
protect=1
priority=10

WMS Metrics

  1. emi.wms.WMS-JobState. Submits grid job to CE(s) via WMS under test. Accepts passive check updates from emi.wms.WMS-JobMonit.
  2. emi.wms.WMS-JobMonit. Monitors submitted grid jobs.
  3. emi.wms.WMS-JobCancel. Cancels grid job.
  4. emi.wms.WMS-JobSubmit. Passive check. Holds terminal status of job submission.

emi.wms.WMS-JobState

This metric is used to submit jobs through the WMS under test. It accepts these parameters:

--jdl-templ <file>              JDL template file (full path). 
                                Default: <emi.wms.ProbesLocation>/WMS-jdl.template
--jdl-retrycount <val>          JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val>   JDL ShallowRetryCount (Default: 1).
--ces-file <file>               File with list of CEs. Two schemes are accepted: [file:] or http:
                                (Default: /var/lib/gridprobes/<vo>/GoodCEs)
--prev-status <0-3>             Previous Nagios status of the metric.

GoodCEs

If the file "GoodCEs" exists, all CEs from the file are OR'ed in the resulting Requirements ClassAd expression, e.g.:

Requirements = (other.GlueCEInfoHostName == "ce106.cern.ch") || (other.GlueCEInfoHostName == "creamce.gina.sara.nl")

This file can be populated automatically using the script gather_healthy_nodes from the hr.srce grid-monitoring probes.

If the file does not exist, the very general requirement Requirements = true is used for match-making.
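
The way the CE list turns into a Requirements expression can be sketched in Python as follows. This is only an illustration, not the probe's actual code: the function name build_requirements is hypothetical, and reading the hostnames out of the GoodCEs file is left out because the file layout is not described here.

```python
def build_requirements(ce_hosts):
    """Return a JDL Requirements expression for the given CE hostnames.

    Hypothetical helper: it only mirrors the OR'ed expression shown above.
    """
    hosts = [h.strip() for h in ce_hosts if h.strip()]
    if not hosts:
        # No CEs available: fall back to the general requirement.
        return "true"
    clauses = ['(other.GlueCEInfoHostName == "%s")' % h for h in hosts]
    return " || ".join(clauses)


print("Requirements = " + build_requirements(["ce106.cern.ch", "creamce.gina.sara.nl"]))
```

With an empty list the helper returns true, which corresponds to the general requirement used for match-making.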

JDL template

This is the default jdl template used for submission:

Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = <jdlReqCEInfoHostName>;

At the moment the parameter jdlExecutable is hardcoded, and the script /bin/hostname is used.
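
A minimal Python sketch of how the placeholders in the template could be filled in is shown below. The substitution mechanism and the function name fill_template are assumptions made for illustration; only the placeholder names, the default retry counts and /bin/hostname come from this page.

```python
TEMPLATE = """Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = <jdlReqCEInfoHostName>;
"""


def fill_template(template, values):
    """Replace every <name> placeholder with the matching value (hypothetical helper)."""
    out = template
    for name, value in values.items():
        out = out.replace("<%s>" % name, str(value))
    return out


jdl = fill_template(TEMPLATE, {
    "jdlExecutable": "/bin/hostname",   # hardcoded executable
    "jdlRetryCount": 0,                 # default of --jdl-retrycount
    "jdlShallowRetryCount": 1,          # default of --jdl-shallowretrycount
    "jdlReqCEInfoHostName": "true",     # general requirement when no GoodCEs file exists
})
print(jdl)
```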

emi.wms.WMS-JobMonit

Monitors submitted grid jobs. The implementation is threaded, with one thread per monitored resource and a maximum of 10 threads.

While a job is not in a terminal state, the metric passively updates emi.wms.WMS-JobState with the latest job state reported by the WMS. When the job enters a terminal state or is cancelled, the metric updates both emi.wms.WMS-JobState and emi.wms.WMS-JobSubmit with the final job status. These two metrics are updated (as passive checks) via either the Nagios command file or NSCA. emi.wms.WMS-JobSubmit is the metric that goes to the Metric Store Database.

It accepts these parameters:

--timeout-job-global <sec>   Global timeout for jobs. Job will be cancelled and dropped 
                             if it is not in terminal state by that time. (Default: 3600)
--timeout-job-waiting <sec>  Time allowed for a job to stay in Waiting with 'no compatible 
                             resources'. (Default: 2700)
--hosts <h1,h2,..>           Comma-separated list of CE hostnames to run monitor on.
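
The effect of the two timeouts can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the probe's implementation: the set of terminal states, the function name next_action, its return labels, and the choice to cancel a job that exceeds the Waiting timeout are all assumptions.

```python
# Assumption: these are the states treated as terminal.
TERMINAL_STATES = {"Done", "Aborted", "Cancelled", "Cleared"}


def next_action(state, age_sec, waiting_no_match_sec=0,
                timeout_global=3600, timeout_waiting=2700):
    """Decide what a monitor pass would do with one job (hypothetical helper).

    age_sec: seconds since the job was submitted.
    waiting_no_match_sec: seconds spent in Waiting with 'no compatible resources'.
    """
    if state in TERMINAL_STATES:
        return "report-final"      # update JobState and JobSubmit
    if age_sec >= timeout_global:
        return "cancel"            # --timeout-job-global exceeded
    if state == "Waiting" and waiting_no_match_sec >= timeout_waiting:
        return "cancel"            # --timeout-job-waiting exceeded (assumed action)
    return "report-active"         # update JobState only


print(next_action("Scheduled", 60))
print(next_action("Running", 4000))
```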

Test

To test the probe you have to create a valid proxy.

State + Monit

First you have to "submit" a jdl:

/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobState

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobState
OK:
OK:
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL


Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ

The job identifier has been saved in the following file:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobID

==========================================================================
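
When the probe is driven from a script rather than typed by hand, the command line above can be assembled as an argument list. The sketch below uses only the options shown on this page; the helper name probe_command is hypothetical.

```python
PROBE = "/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe"


def probe_command(vo, proxy, host, metric, extra=()):
    """Build the argument list for one probe invocation (hypothetical helper)."""
    return [PROBE, "--vo", vo, "-x", proxy, "-H", host, "-m", metric] + list(extra)


cmd = probe_command("dteam", "/tmp/x509up_u501", "cream-45.pd.infn.it",
                    "emi.wms.WMS-JobState")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.call on a User Interface where the probe is installed; extra options such as --pass-check-dest active go in the extra argument.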

Then you can "monitor" the job:

/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobMonit --pass-check-dest active

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
metric results >>> <cream-45.pd.infn.it,emi.wms.WMS-JobState-dteam>
OK: [Scheduled] https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
OK: [Scheduled] https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
glite-wms-job-status https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ


======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
Current Status:     Scheduled
Status Reason:      unavailable
Destination:        cccreamceli09.in2p3.fr:8443/cream-sge-long
Submitted:          Mon Nov  7 16:38:01 2011 CET
==========================================================================
OK: Jobs processed - 1
OK: Jobs processed - 1
[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; SCHEDULED=1;; SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
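
The last line of this output carries Nagios performance data after the "|" character. A minimal Python sketch for extracting the per-state counters is given below; the function name parse_perfdata is hypothetical, and the sketch assumes integer values as in the output above.

```python
def parse_perfdata(line):
    """Return {label: value} from the performance data after '|' (hypothetical helper)."""
    _, _, perf = line.partition("|")
    counters = {}
    for token in perf.split():
        label, _, rest = token.partition("=")
        # Each token is label=value;warn;crit, only the value is kept.
        counters[label] = int(rest.split(";")[0])
    return counters


sample = ("[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; SCHEDULED=1;; "
          "SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; "
          "MISSED=0;; UNDETERMINED=0;; unknown=0;1;2")
counters = parse_perfdata(sample)
print(counters["SCHEDULED"], counters["jobs_processed"])
```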

If you execute the emi.wms.WMS-JobState metric again, its output shows the last status seen by the emi.wms.WMS-JobMonit metric:

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobState 
OK: Active job - Scheduled [2011-11-07T15:38:28Z]
OK: Active job - Scheduled [2011-11-07T15:38:28Z]
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Active job - Scheduled [2011-11-07T15:38:28Z]

Finally, once the job has finished, executing the emi.wms.WMS-JobMonit metric should also trigger a get_output (glite-wms-job-output):

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
metric results >>> <cream-45.pd.infn.it,emi.wms.WMS-JobSubmit-dteam>
OK: success.

glite-wms-job-output --noint --nosubdir --dir /var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobOutput https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ 2>&1

Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server


Warning - option --nosubdir specified: 
output files with same name will be overridden


Warning - Directory already exists: 
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobOutput


================================================================================

         JOB GET OUTPUT OUTCOME

Output sandbox files for the job:
https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
have been successfully retrieved and stored in the directory:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobOutput

================================================================================
metric results >>> <cream-45.pd.infn.it,emi.wms.WMS-JobState-dteam>
OK: success.

OK: Jobs processed - 1
OK: Jobs processed - 1
Done : 1|jobs_processed=1;; DONE=1;; RUNNING=0;; SCHEDULED=0;; SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2

You can verify that it works correctly by checking the status of the job.

[ale@cream-48 ~]$ glite-wms-job-status https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ

======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
Current Status:     Cleared 
Status Reason:      user retrieved output sandbox
Destination:        cccreamceli09.in2p3.fr:8443/cream-sge-long
Submitted:          Mon Nov  7 16:38:01 2011 CET
==========================================================================

State + Monit + Cancel

First you have to "submit" a jdl:

/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobState

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobState
OK:
OK:
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL


Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A

The job identifier has been saved in the following file:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobID

==========================================================================

As before you can "monitor" the job:

/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobMonit --pass-check-dest active

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
metric results >>> <cream-45.pd.infn.it,emi.wms.WMS-JobState-dteam>
OK: [Scheduled] https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
OK: [Scheduled] https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
glite-wms-job-status https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A


======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
Current Status:     Scheduled 
Status Reason:      unavailable
Destination:        t2-ce-01.lnl.infn.it:8443/cream-lsf-cert1
Submitted:          Mon Nov  7 16:51:29 2011 CET
==========================================================================
OK: Jobs processed - 1
OK: Jobs processed - 1
[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; SCHEDULED=1;; SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2

Before it finishes you can "cancel" it:

/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobCancel

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobCancel
OK: job cancelled
OK: job cancelled
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Job cancellation request sent:
glite-wms-job-cancel --noint  -i /var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobID
Job bookkeeping files deleted.
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
OK: no active jobs [2011-11-07T16:51:53Z]
OK: no active jobs [2011-11-07T16:51:53Z]|jobs_processed=0;;

You can verify that it works correctly by checking the status of the job.

[ale@cream-48 ~]$ glite-wms-job-status https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A


======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
Current Status:     Cancelled 
Destination:        t2-ce-01.lnl.infn.it:8443/cream-lsf-cert1
Submitted:          Mon Nov  7 16:51:29 2011 CET
==========================================================================
Topic revision: r6 - 2011-11-07 - AlessioGianelle