EMI installation guide: CREAM install how-to
The local repository is available here. Once finished with the repository setup, make sure to also install and configure NTP (a sketch follows the repository commands below); the sample ntp.conf can be copied from another working node.
rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://emisoft.web.cern.ch/emisoft/dist/EMI/1/sl5/x86_64/base/emi-release-1.0.0-1.sl5.noarch.rpm
yum install ./emi-release-1.0.0-1.sl5.noarch.rpm
wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/EGI-trustanchors.repo -O /etc/yum.repos.d/EGI-trustanchors.repo
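For the NTP step mentioned above, a minimal sketch (the source node for ntp.conf is a placeholder, substitute any working node):
yum install ntp
scp some-working-node.hep.kbfi.ee:/etc/ntp.conf /etc/ntp.conf
chkconfig ntpd on
service ntpd start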
Disable Yum automatic updates using this script: http://forge.cnaf.infn.it/frs/download.php/101/disable_yum.sh
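A minimal sketch of fetching and running it (assuming the script's defaults are acceptable):
wget http://forge.cnaf.infn.it/frs/download.php/101/disable_yum.sh
sh ./disable_yum.sh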
The CREAM CE node requires the host certificate/key files to be installed. Make sure to place the two files (hostkey.pem and hostcert.pem) on the target node in the /etc/grid-security directory, then set the proper mode and ownership:
chown root.root /etc/grid-security/hostcert.pem
chown root.root /etc/grid-security/hostkey.pem
chmod 600 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem
Install the Yum protectbase plugin, the CA meta-package, and xml-commons-apis (a dependency needed by the JDK-based CREAM components):
yum install yum-protectbase.noarch
yum install ca-policy-egi-core
yum install xml-commons-apis
The custom install of our own Torque is on torque.hep.kbfi.ee; take the file torque-package-clients-linux-x86_64.sh from /root/torque-2.5.5/ and execute it on the new CREAM CE with --install (a sketch follows the package list below). It installs the relevant Torque client libraries and commands into /usr/bin. Instead of installing the emi-torque-utils metapackage, install the following:
yum install lcg-info-dynamic-pbs glite-apel-pbs glite-apel-core glite-yaim-torque-utils lcg-info-dynamic-scheduler-generic lcg-info-dynamic-scheduler-pbs glite-yaim-core
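A minimal sketch of installing the Torque clients mentioned above (the destination path on the new CE is just an example):
scp torque.hep.kbfi.ee:/root/torque-2.5.5/torque-package-clients-linux-x86_64.sh /root/
sh /root/torque-package-clients-linux-x86_64.sh --install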
Now, to set up host-based authentication, log in to torque.hep.kbfi.ee, go to ssh-keys, add the new CREAM CE to the list of relevant hosts and rerun the keyscan. Once done, distribute the ssh_known_hosts file to all relevant nodes (all CREAM CEs, all WNs and Torque itself), then add the new CREAM CE to /etc/hosts.equiv as well. The /etc/ssh/shosts.equiv file has to be distributed to the new CREAM CE, and the sshd_config can be taken from some other CREAM CE; the important settings are:
UsePAM no
HostbasedAuthentication yes
IgnoreUserKnownHosts yes
IgnoreRhosts yes
You also need to add all WNs to the /etc/hosts file with both FQDN and short names to make sure sshd picks the mappings up correctly. Once you restart sshd, host-based auth should work (to actually test it you need the user accounts, which are created during CREAM configuration, so be patient). At this point qstat should already give you the Torque status information.
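A hypothetical example of an /etc/hosts entry, FQDN first and then the short name (the address and names are made up):
192.168.1.101   wn-1-1.hep.kbfi.ee   wn-1-1
Then restart sshd and check that Torque answers:
service sshd restart
qstat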
Install the meta package:
yum install emi-cream-ce
Now, before configuration, make sure the following group IDs are free: 155 and 156. On SL5.6 they were taken by default by unrelated system groups (avahi etc.).
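A quick way to check (the command prints nothing for a GID that is free):
getent group 155 156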
And configure it:
/opt/glite/yaim/bin/yaim -c -s site-info.def -n creamCE -n TORQUE_utils
For future reconfigurations it should be enough to use just -n creamCE, without TORQUE_utils. For the site-info.def, take the file from mars.hep.kbfi.ee:siteinfo/. Change any reference to mars.hep.kbfi.ee to the new hostname, as well as:
BLAH_JOBID_PREFIX=crmrs_
where the prefix has to be of the form crxxx_ and the xxx part can be altered. This makes it possible to identify later which CREAM CE submitted a given job to the Torque server.
Also, to fix GIP add the following to /etc/lcg-info-dynamic-scheduler.conf:
[LRMS]
lrms_backend_cmd: /usr/libexec/lrmsinfo-pbs

[Scheduler]
cycle_time : 0
Here are comments and common bugs that I've found.
CMS needs non-pool SGM accounts, so one has to modify /etc/grid-security/grid-mapfile to remove the pool-account mapping and replace it with ordinary accounts (.sgmcms should be replaced with sgmcms000 etc.). Only the common cms accounts stay as pool accounts.
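A hypothetical before/after for one grid-mapfile line (the FQAN shown is only an illustration):
# before: pool account
"/cms/Role=lcgadmin" .sgmcms
# after: static account
"/cms/Role=lcgadmin" sgmcms000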
Add the following to /etc/fstab and make sure the mount points exist and are empty (in the case of /opt/edg/var/info-torque you have to create it first and clean out the /opt/edg/var/info directory):
torque.hep.kbfi.ee:/var/spool/torque/server_logs /var/spool/torque/server_logs nfs users,rw 0 0
torque.hep.kbfi.ee:/var/spool/torque/server_priv/accounting /var/spool/torque/server_priv/accounting nfs users,rw 0 0
torque.hep.kbfi.ee:/opt/edg/var/info /opt/edg/var/info-torque nfs users,rw 0 0
then do a mount -a to attach them. The reason is that the software tags have to be shared among all the CREAM CEs, and the Torque logs have to be available for job parsing.
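A quick way to attach and verify them:
mount -a
mount | grep torque.hep.kbfi.ee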
Now, to make sure all the tags are synchronized across the CEs, copy the /usr/local/bin/sync-tags.sh script from one of the working CREAM CEs into /usr/local/bin on your new CE, and also copy /etc/cron.d/sync-tags into the new CE's cron.d directory. Afterwards it's a good idea to change the time in the cron definition so that not all CEs do the sync at the same time.
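A hypothetical sketch of the cron entry after shifting the minute (the real file is whatever was copied from the working CE):
# /etc/cron.d/sync-tags -- pick a different minute on each CE
17 * * * * root /usr/local/bin/sync-tags.sh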
First create a test.sh (a simple id -a; sleep 300 will do) under one of the users created (e.g. cms011) and try, as that user, to qsub it to a queue (a sketch follows the submission check below). If the job runs through nicely and you get the output back, then PBS itself works. Next, check from a UI machine whether the new CREAM CE has submissions enabled:
glite-ce-allowed-submission xxx.hep.kbfi.ee:8443
the answer should be that submissions are enabled.
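For the local PBS check mentioned above, a minimal sketch run as one of the created users; the queue name short is an assumption taken from the submission example below:
su - cms011
cat > test.sh <<'EOF'
#!/bin/sh
id -a
sleep 300
EOF
qsub -q short test.sh
qstat -u cms011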
Once done prepare a simple test.jdl:
[ executable="/bin/sleep"; arguments="300"; ]
and try submitting it to the cluster (pass the job ID returned by glite-ce-job-submit to glite-ce-job-status):
glite-ce-job-submit -a -r xxx.hep.kbfi.ee:8443/cream-pbs-short test.jdl
glite-ce-job-status https://mars.hep.kbfi.ee:8443/CREAM084384886
It's also good to test as an SGM user if you have the ability. More useful tests here.
Also, make sure that BDII is ok: http://gstat.egi.eu/gstat/site/T2_Estonia/treeview/bdii_site/io.hep.kbfi.ee/
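A quick local check, assuming the resource BDII on the new CE listens on the standard port 2170:
ldapsearch -x -h xxx.hep.kbfi.ee -p 2170 -b o=grid | head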