--wms <wms> | WMS to be used for job submission. If not given, default WMProxy end-points defined on the UI will be used. |
--jdl-templ <file> | JDL template file (full path). Default: <emi.cream.ProbesLocation>/CREAM-jdl.template |
--jdl-retrycount <val> | JDL RetryCount (Default: 0). |
--jdl-shallowretrycount <val> | JDL ShallowRetryCount (Default: 1). |
--timeout-job-discard <sec> | Discard job after the timeout. (Default: 21600) |
--prev-status <0-3> | Previous Nagios status of the metric. |
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
Arguments = "<jdlArguments>";
InputSandbox = {"<jdlInputSandboxExecutable>", "<jdlInputSandboxTarball>"};
OutputSandbox = {"gridjob.out","wnlogs.tgz"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
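The placeholders in the template (`<jdlExecutable>`, `<jdlRetryCount>`, and so on) are filled in by the metric before submission. How the probe does this internally is not described here; the following is only a minimal Python sketch of such a substitution. The function name `fill_jdl`, the shortened template, and the sample values (including the destination `/queue/example`) are assumptions made for illustration.

```python
import re

# Shortened copy of the JDL template; placeholders are written as <jdlName>.
JDL_TEMPLATE = '''Type="Job";
Executable = "<jdlExecutable>";
Arguments = "<jdlArguments>";
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
'''

# Matches placeholders such as <jdlExecutable> or <jdlRetryCount>.
PLACEHOLDER = re.compile(r"<(jdl[A-Za-z]+)>")


def fill_jdl(template, values):
    """Replace every <jdlXxx> placeholder; raise KeyError if a value is missing."""
    def replace(match):
        return str(values[match.group(1)])
    return PLACEHOLDER.sub(replace, template)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the result of the substitution.
    print(fill_jdl(JDL_TEMPLATE, {
        "jdlExecutable": "nagrun.sh",
        "jdlArguments": "-v dteam -d /queue/example",
        "jdlRetryCount": 0,
        "jdlShallowRetryCount": 1,
        "jdlReqCEInfoHostName": "cream-19.pd.infn.it",
    }))
```

Raising `KeyError` on a missing value keeps a half-filled JDL from being submitted by accident.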
nagrun.sh
(see below). The arguments are dynamically composed by the metrics, which translate the ones given to the probes. In particular, results from the worker nodes are sent via a message transfer agent (MTA) to Message Brokers. The code for the MTA, mta-simple, is located under <WN_codebase>/bin/ and its implementation in <WN_codebase>/lib/python2.3/site-packages/mig.
The MTA:
handle_service_check
invoked by Nagios after execution of each check. The parameters to manage the Message Broker are:
--mb-destination <dest> | Mandatory parameter (if --no-mb is not specified). The destination queue/topic on Message Broker to publish to. |
--mb-uri <URI> | Message Broker URI. If not given, MB discovery will be performed on WN to find a working MB. Format for <URI>: [failover://(]<uri>,[...][)] where <uri> is stomp://FQDN:port/ or http://FQDN/message. (Default: service discovery on WN.) |
--mb-network <net> | Brokers network for broker discovery on WN. (Default: PROD) |
--mb-no-discovery | Do not do broker discovery on WN. If a given <URI> is not accessible, the WN part of the framework will exit with UNKNOWN. (Default: if <URI> is not given or not accessible from WN, perform broker discovery in <net>.) |
--mb-choice <best or random> | How to choose MB on WN. 'best' - min response time. (Default: best) |
--no-mb | Do not send results messages to Message Broker |
--no-mb
disables message transfer; in that case the result e-mail messages can be found in the job's output file wnlogs.tgz.
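The `--mb-uri` format above allows either a single broker URI or a `failover://(...)` wrapper around a comma-separated list. As an illustration only, here is a small Python sketch that splits such a value into individual broker URIs. The function name `parse_mb_uri` and the host names and port in the usage example are hypothetical; the framework's own parsing may differ.

```python
def parse_mb_uri(value):
    """Return the list of broker URIs contained in an --mb-uri value.

    Accepts a single <uri>, a comma-separated list, or the same list
    wrapped in failover://( ... ).
    """
    value = value.strip()
    prefix = "failover://("
    if value.startswith(prefix) and value.endswith(")"):
        value = value[len(prefix):-1]
    uris = [part.strip() for part in value.split(",") if part.strip()]
    for uri in uris:
        # Only the two schemes named in the option description are accepted.
        if not uri.startswith(("stomp://", "http://")):
            raise ValueError("unsupported broker URI: %s" % uri)
    return uris


if __name__ == "__main__":
    # Hypothetical brokers, for illustration only.
    print(parse_mb_uri(
        "failover://(stomp://mb1.example.org:6163/,http://mb2.example.org/message)"
    ))
```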
OutputSandbox = {"gridjob.out","wnlogs.tgz"};
gridjob.out
contains logging output of WN job as seen by WMS job wrapper. I.e., stdout and stderr from the testing framework launching script on WN (nagrun.sh).
wnlogs.tgz
contains the following directories from WN: /
--add-wntar-nag <d1,d2,..> | Comma-separated list of top-level directories with a Nagios-compliant directory structure to be added to the tarball sent to WN. |
--add-wntar-nag-nosam | Instructs the metric not to include standard SAM WN probes and their Nagios config to WN tarball. (Default: WN probes are included) |
--add-wntar-nag-nosamcfg | Instructs the metric not to include Nagios configuration for SAM WN probes to WN tarball. The probes themselves and respective Python packages, however, will be included. |
--wnjob-location <dir> | Full path to the directory containing the WN scheduler. (Default: <emi.cream.ProbesLocation>/wnjob) |
|-- etc
|   `-- wn.d
|       `-- org.my
|           |-- commands.cfg
|           `-- services.cfg
`-- probes
    `-- org.my
        |-- check_A
        |-- check_B
        `-- checks_lib.sh
$USER3$
macro defining the path to the probes directory, for example:
define command{
    command_name check_A1
    command_line $USER3$/org.my/check_A
}
<wnjobWorkDir>
will be substituted with the job's working directory on WN. Handy if your check requires and creates a working directory. Possible usage (assumes -w instructs check_A to create <wnjobWorkDir>/.mygridprobes directory):
define command{
    command_name check_A2
    command_line $USER3$/org.my/check_A -w <wnjobWorkDir>/.mygridprobes
}
For this particular part of Nagios objects configuration and macros, please see the Nagios documentation for resources configuration.
The following parameters control the timeout and the verbosity on the WNs:
--timeout-wnjob-global <sec> | Global timeout for a job on WN. (Default: 600) |
--wn-verb <0-3> | Metrics verbosity level on WN. [-v <VERBOSITY>] (Default: 0) |
--wn-verb-fw <0-3> | Framework verbosity level on WN (Default: 1) |
--wn-verb
on WN the given value is substituted for <VERBOSITY> in Nagios metric definition *.cfg files.
--wn-verb-fw
is used by nagrun.sh for setting its debugging output as well as the logging and debugging output of the message publishing client (Message Transfer Agent - MTA) and Nagios.
usage: nagrun.sh -v <vo> -d <dest> [-b <broker_uri>] [-n <broker_network>]
                 [-t <timeout>] [-w <fw_verb>] [-z <metric_verb>]
                 [-f <fqan>] [-i <host:port,..>] -B -R -N -h -m
-v and -d (if not -m) are mandatory parameters.
Defaults:
<broker_network> - PROD
<timeout> - 600 sec
<metrics_verb> - 0
<fw_verb> - 1 (2 - messages, 3 - Nagios config/stats/debug)
-f <fqan> - VOMS FQAN
-B - don't do broker discovery
-R - take MB randomly; by default sort by min response time
-N - don't run WN tests
-m - don't use mta service to transfer messages
In most cases the parameters are translations of the corresponding ones given to the emi.cream.CREAMCE-JobState metric.
emi.cream.CREAMCE-JobState | nagrun.sh |
---|---|
--mb-network <net> Brokers network for broker discovery on WN. | -n <broker_network> |
--mb-uri <URI> Message Broker URI. | -b <broker_uri> |
--mb-destination <dest> queue/topic on MB to publish to. | -d <dest> |
--mb-no-discovery Do not do broker discovery on WN. | -B |
--mb-choice <best or random> How to choose MB on WN. | -R |
--vo <name> Virtual Organization. | -v <vo> |
--vo-fqan <name> VOMS primary attribute as FQAN. | -f <fqan> |
--wn-verb <0-3> Metrics verbosity level on WN. | -z <metric_verb> |
--wn-verb-fw <0-3> Framework verbosity level on WN. | -w <fw_verb> |
--timeout-wnjob-global <sec> Global timeout for a job on WN. | -t <timeout> |
--no-mb | -m |
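The mapping in this table can be expressed as a simple lookup. The sketch below is not the metric's actual code; it only shows, under the assumption that options arrive as a dictionary, how the emi.cream.CREAMCE-JobState options could be translated into a nagrun.sh argument list. The function name `nagrun_arguments` and the destination `/queue/example` are hypothetical.

```python
# Options that carry a value, mapped to the nagrun.sh switch from the table.
VALUE_OPTIONS = {
    "--mb-network": "-n",
    "--mb-uri": "-b",
    "--mb-destination": "-d",
    "--vo": "-v",
    "--vo-fqan": "-f",
    "--wn-verb": "-z",
    "--wn-verb-fw": "-w",
    "--timeout-wnjob-global": "-t",
}

# Boolean options, mapped to a bare nagrun.sh switch.
FLAG_OPTIONS = {
    "--mb-no-discovery": "-B",
    "--no-mb": "-m",
}


def nagrun_arguments(options):
    """Translate a dict of metric options into a list of nagrun.sh arguments.

    Flags are expected as True/False; options not in the table are ignored.
    """
    args = []
    for name, value in options.items():
        if name in VALUE_OPTIONS:
            args += [VALUE_OPTIONS[name], str(value)]
        elif name in FLAG_OPTIONS:
            if value:
                args.append(FLAG_OPTIONS[name])
        elif name == "--mb-choice":
            # 'best' is the nagrun.sh default; only 'random' needs -R.
            if value == "random":
                args.append("-R")
    return args


if __name__ == "__main__":
    print(nagrun_arguments({
        "--vo": "dteam",
        "--mb-destination": "/queue/example",  # hypothetical destination
        "--mb-choice": "random",
        "--timeout-wnjob-global": 600,
    }))
```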
activejob.map
files) and updates the states of the emi.cream.CREAMCE-JobState and emi.cream.CREAMCE-JobMonit metrics. It acts as a babysitter for all grid jobs submitted by emi.cream.CREAMCE-JobState. emi.cream.CREAMCE-JobState and emi.cream.CREAMCE-JobMonit are updated (as passive checks) either via the Nagios command file or NSCA. It accepts these parameters:
--timeout-job-global <sec> | Global timeout for jobs. Job will be cancelled and dropped if it is not in terminal state by that time. (Default: 3300) |
--timeout-job-waiting <sec> | Time allowed for a job to stay in Waiting with 'no compatible resources'. (Default: 2700) |
--timeout-job-discard <sec> | Discard job after the timeout. (Default: 21600) |
--timeout-job-schedrun <sec> | Scheduled/Running states timeout. (Default: 19800) |
--hosts <h1,h2,..> | Comma-separated list of CE hostnames to run the monitor on. |
emi.wn
these probes are performed on the worker node using the wrapper samtest-run:
/bin/csh -c "env|sort" > env-csh.txt
and then checking whether the variable PATH is defined. It accepts only one parameter: debug.
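The check described above can be sketched in Python as follows. This is an illustration of the idea, not the probe's own script: the function names are hypothetical, and the choice of Nagios status 2 (CRITICAL) for a failure is an assumption, since the status returned by the real probe is not stated here.

```python
import subprocess


def defined_variables(env_dump):
    """Return the set of variable names found in 'env' style output."""
    names = set()
    for line in env_dump.splitlines():
        name, sep, _value = line.partition("=")
        if sep and name:
            names.add(name)
    return names


def check_csh(csh="/bin/csh"):
    """Run 'env|sort' under csh and verify that PATH is defined.

    Returns (status, message) using Nagios status codes: 0 OK, 2 CRITICAL.
    The CRITICAL mapping is an assumption of this sketch.
    """
    try:
        result = subprocess.run(
            [csh, "-c", "env|sort"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 2, "CRITICAL: cannot run %s: %s" % (csh, exc)
    if result.returncode != 0:
        return 2, "CRITICAL: %s exited with %d" % (csh, result.returncode)
    if "PATH" not in defined_variables(result.stdout):
        return 2, "CRITICAL: PATH is not defined under csh"
    return 0, "OK"


if __name__ == "__main__":
    print(check_csh())
```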
Example of a message sent as output:
serviceURI: cream-30.pd.infn.it:8443/cream-pbs-creamtest2
hostName: localhost.localdomain
serviceFlavour: CE
siteName: FAKE-SITE
metricStatus: OK
metricName: emi.wn.WN-Csh
summaryData: OK
gatheredAt: cream-wn-030.pn.pd.infn.it
timestamp: 2011-09-22T15:25:36Z
nagiosName: emi.wn.WN-Csh-dteam
role: site
voName: dteam
serviceType: emi.wn.WN
detailsData: Checking if CSH works\nTest: OK.\n
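Messages like the one above are made of `key: value` fields, with newlines inside `detailsData` written as a literal `\n`. Assuming one field per line, a reader of these messages could be sketched in Python as below. The function names are hypothetical and the sample is a shortened copy of the message above.

```python
def parse_message(text):
    """Parse 'key: value' lines into a dict; lines without ': ' are skipped."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so values may contain colons.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def details_lines(fields):
    """Expand the literal '\\n' sequences of detailsData into separate lines."""
    details = fields.get("detailsData", "")
    return [part for part in details.split("\\n") if part]


if __name__ == "__main__":
    sample = (
        "serviceURI: cream-30.pd.infn.it:8443/cream-pbs-creamtest2\n"
        "metricStatus: OK\n"
        "metricName: emi.wn.WN-Csh\n"
        "timestamp: 2011-09-22T15:25:36Z\n"
        "detailsData: Checking if CSH works\\nTest: OK.\\n\n"
    )
    message = parse_message(sample)
    print(message["metricName"], message["metricStatus"])
    print(details_lines(message))
```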
The lcg-version and glite-version commands and a cat of the /etc/emi-version file are tried; if they are not available, the script exits with an error.
Example of a message sent as output:
serviceURI: cream-30.pd.infn.it:8443/cream-pbs-creamtest2
hostName: localhost.localdomain
serviceFlavour: CE
siteName: FAKE-SITE
metricStatus: OK
metricName: emi.wn.WN-SoftVer
summaryData: OK: EMI 1.2.0-1
gatheredAt: cream-wn-030.pn.pd.infn.it
timestamp: 2011-09-22T15:25:37Z
nagiosName: emi.wn.WN-SoftVer-dteam
role: site
voName: dteam
serviceType: emi.wn.WN
detailsData: Installed software version\n+ type=unknow\n+ mwver=error\n+ type -f glite-version\n/home/dteam009/home_cre30_262167654/CREAM262167654/nagios/probes/emi.wn/sam/WN-softver: line 31: type: glite-version: not found\n+ '[' -f /etc/emi-version ']'\n+ type=EMI\n++ cat /etc/emi-version\n+ mwver=1.2.0-1\n+ set +x\nVersion pattern: ^2\\.[456789]OR^3\\.OR^1\\.\nDeducted middleware version: EMI 1.2.0-1\n
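The `detailsData` above reports a version pattern of the form `^2\.[456789]OR^3\.OR^1\.`. How the probe evaluates it is not documented here; one plausible reading, assumed in this sketch, is that `OR` separates alternative regular expressions and a version is accepted if any of them matches. The function name `version_matches` is hypothetical.

```python
import re


def version_matches(version, pattern):
    """Return True if any 'OR'-separated regular expression matches the version.

    Treating 'OR' as a separator between alternatives is an assumption
    based on the pattern shown in the example message.
    """
    return any(re.search(alternative, version)
               for alternative in pattern.split("OR"))


if __name__ == "__main__":
    pattern = r"^2\.[456789]OR^3\.OR^1\."
    for version in ("1.2.0-1", "3.1.0", "2.3.0"):
        print(version, version_matches(version, pattern))
```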
$GLITE_WMS_RB_BROKERINFO
, $GLITE_WL_RB_BROKERINFO
or $EDG_WL_RB_BROKERINFO
variables
edg-brokerinfo getCE
or glite-brokerinfo getCE
command respectively. If the exit value of that command is different from 0, the test fails.
serviceURI: cream-30.pd.infn.it:8443/cream-pbs-creamtest2
hostName: localhost.localdomain
serviceFlavour: CE
siteName: FAKE-SITE
metricStatus: OK
metricName: emi.wn.WN-Bi
summaryData: OK: getCE: cream-30.pd.infn.it:8443/cream-pbs-creamtest2
gatheredAt: cream-wn-030.pn.pd.infn.it
timestamp: 2011-09-22T15:25:35Z
nagiosName: emi.wn.WN-Bi-dteam
role: site
voName: dteam
serviceType: emi.wn.WN
detailsData: Checking if BrokerInfo works\nBrokerInfo file: /home/dteam009/home_cre30_262167654/CREAM262167654/.BrokerInfo\n+ ls -l /home/dteam009/home_cre30_262167654/CREAM262167654/.BrokerInfo\n-rw-r--r-- 1 dteam009 dteam 367 Sep 22 17:25 /home/dteam009/home_cre30_262167654/CREAM262167654/.BrokerInfo\n+ set +x\nCheck if we can get the name of CE using glite-brokerinfo command\n+ glite-brokerinfo -v getCE\nBrokerInfo::getBIFileName(): /home/dteam009/home_cre30_262167654/CREAM262167654/.BrokerInfo\nBrokerInfo::getCE(): \n -> cream-30.pd.infn.it:8443/cream-pbs-creamtest2\n -> BI_SUCCESS\n+ result=0\n+ set +x\n
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.cream/CREAMCE-probe --vo dteam -x /tmp/x509up_u501 -H cream-19.pd.infn.it -m emi.cream.CREAMCE-JobState --wms cream-45.pd.infn.it --no-mb
OK: [Submitted]
OK: [Submitted]
Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://cream-45.pd.infn.it:9000/fvNWcPJ6nVXAQJqhuU29-g
The job identifier has been saved in the following file:
/var/lib/gridprobes/dteam/emi.cream/CREAMCE/cream-19.pd.infn.it/jobID
==========================================================================
You can "monitor" the job:
/usr/libexec/grid-monitoring/probes/emi.cream/CREAMCE-probe --vo <vo> -x <path of the proxy> -H <CREAM-ce hostname> -m emi.cream.CREAMCE-JobMonit --pass-check-dest active
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.cream/CREAMCE-probe --vo dteam -x /tmp/x509up_u501 -H cream-19.pd.infn.it -m emi.cream.CREAMCE-JobMonit --pass-check-dest active
metric results >>>
<cream-19.pd.infn.it,emi.cream.CREAMCE-JobState-dteam>
OK: [Scheduled] https://cream-45.pd.infn.it:9000/fvNWcPJ6nVXAQJqhuU29-g
OK: [Scheduled] https://cream-45.pd.infn.it:9000/fvNWcPJ6nVXAQJqhuU29-g
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
glite-wms-job-status https://cream-45.pd.infn.it:9000/fvNWcPJ6nVXAQJqhuU29-g
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/fvNWcPJ6nVXAQJqhuU29-g
Current Status: Scheduled
Status Reason: unavailable
Destination: cream-19.pd.infn.it:8443/cream-lsf-creamcert2
Submitted: Wed Nov 9 16:43:31 2011 CET
==========================================================================
OK: Jobs processed - 1
OK: Jobs processed - 1
[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; SCHEDULED=1;; SUBMITTED=0;; READY=0;; WAITING=0;; WAITING-CANCELLED=0;; WAITING-CANCEL=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
Before it finishes you can "cancel" it:
/usr/libexec/grid-monitoring/probes/emi.cream/CREAMCE-probe --vo <vo> -x <path of the proxy> -H <CREAM-ce hostname> -m emi.cream.CREAMCE-JobCancel
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.cream/CREAMCE-probe --vo dteam -x /tmp/x509up_u501 -H cream-19.pd.infn.it -m emi.cream.CREAMCE-JobCancel
OK: job cancelled
OK: job cancelled
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Job cancellation request sent:
glite-wms-job-cancel --noint -i /var/lib/gridprobes/dteam/emi.cream/CREAMCE/cream-19.pd.infn.it/jobID
Job bookkeeping files deleted.
You can verify that the cancellation worked by checking the status of the job.
[ale@cream-48 ~]$ glite-wms-job-status https://cream-45.pd.infn.it:9000/fvNWcPJ6nVXAQJqhuU29-g
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/fvNWcPJ6nVXAQJqhuU29-g
Current Status: Cancelled
Destination: cream-19.pd.infn.it:8443/cream-lsf-creamcert2
Submitted: Wed Nov 9 16:43:31 2011 CET
==========================================================================
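The emi.cream.CREAMCE-JobMonit output shown earlier ends with Nagios performance data after the `|` character (`jobs_processed=1;; DONE=0;; ...`). For scripts that post-process such output, a minimal Python sketch of extracting the counters is given below. It assumes unquoted labels and unitless numeric values, as in that example; the function name `parse_perfdata` is hypothetical.

```python
def parse_perfdata(output_line):
    """Extract label -> value pairs from the part of a line after '|'.

    Each item has the form label=value;warn;crit; only the value is kept.
    Labels with spaces or values with units are not handled by this sketch.
    """
    _text, sep, perfdata = output_line.partition("|")
    counters = {}
    if not sep:
        return counters
    for item in perfdata.split():
        label, equals, rest = item.partition("=")
        if not equals:
            continue
        counters[label] = float(rest.split(";")[0])
    return counters


if __name__ == "__main__":
    line = ("[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; "
            "SCHEDULED=1;; unknown=0;1;2")
    print(parse_perfdata(line))
```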