GUMS is the user mapping service used in OSG; both the gLite and SRM doors developed for Hadoop use it for user mapping. The easiest way to install GUMS is through the VDT package installation. Follow the instructions here:
http://vdt.cs.wisc.edu/vdt/documentation.html
to install the VDT version of GUMS. The basic method is the following:
# mkdir -p /opt/osg
# cd /opt/osg
# wget http://vdt.cs.wisc.edu/software/pacman/3.28/pacman-3.28.tar.gz
# tar zxf pacman*
# cd pacman-3.28
# source setup.sh
# cd /opt/osg
# mkdir vdt
# pacman -get http://vdt.cs.wisc.edu/vdt_200_cache:GUMS
You then need to make sure that all the necessary services are up and running, and will come back up after a reboot:
# cd $VDT_LOCATION
# source setup.sh
# vdt-control --enable fetch-crl
# vdt-control --enable vdt-rotate-logs
# vdt-control --enable mysql5
# vdt-control --enable apache
# vdt-control --enable tomcat-55
Now you need to get and install the CA certificates, and also put httpcert.pem and httpkey.pem for the web service into /etc/grid-security/http (I know, I know ... a self-signed certificate will do here, as this is essentially for management only). One way to do so is:
# source $VDT_LOCATION/setup.sh
# $VDT_LOCATION/vdt/bin/vdt-ca-manage setupca --location root --url osg
# mkdir -p /etc/grid-security/http
# cd /etc/grid-security/http
# openssl req -new -x509 -nodes -out httpcert.pem -keyout httpkey.pem
Once done:
# vdt-control --on
Next you need to set up the Administrator user on the GUMS server. To do so:
# cd $VDT_LOCATION/tomcat/v55/webapps/gums/WEB-INF/scripts
# ./gums-add-mysql-admin "<Your_DN_Here>"
Once that is all done, you can happily point your favorite browser, with said certificate loaded, to the following URL: https://your.host.name:8443/gums/
and you should be able to log in. Setting up the actual DN mappings is a separate topic.
The easiest way to go is to get the sample gums.config from this site, change your MySQL connection path, and in the hostToGroupMappings change the domain and host names to the ones you will use. The logic is that queries are answered according to which host they come from, which DN they are for, and which role is requested. There are plenty of different mappers that do this; some googling and common sense based on the sample config file should get you started :) Once you have the config file, replace your current one at
$VDT_LOCATION/tomcat/v55/webapps/gums/WEB-INF/config/gums.config
and you should be done. Try updating the VO member lists, and also try out the different mapping options (user to DN and vice versa).
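For orientation, here is a heavily trimmed illustration of what such a gums.config can look like. The element and attribute names follow the GUMS 1.3 sample config, but the hostnames, VO name, account, and MySQL settings below are made-up placeholders, so start from the real sample file rather than from this sketch:

```xml
<?xml version='1.0' encoding='UTF-8'?>
<gums version='1.3'>
  <persistenceFactories>
    <!-- MySQL connection: adjust host, database, user, and password -->
    <hibernatePersistenceFactory name='mysql'
        hibernate.connection.url='jdbc:mysql://gums.example.org:3306/GUMS_1_1'
        hibernate.connection.username='gums'
        hibernate.connection.password='secret'/>
  </persistenceFactories>
  <vomsUserGroups>
    <!-- Pulls the member DN list from the VO's VOMS server -->
    <vomsUserGroup name='myvo' persistenceFactory='mysql'
        url='https://voms.example.org:8443/voms/myvo/services/VOMSAdmin'/>
  </vomsUserGroups>
  <accountMappers>
    <!-- Maps everyone in the group to one shared account -->
    <groupAccountMapper name='myvoAccount' accountName='myvo001'/>
  </accountMappers>
  <groupToAccountMappings>
    <groupToAccountMapping name='myvoMap'
        userGroups='myvo' accountMappers='myvoAccount'/>
  </groupToAccountMappings>
  <hostToGroupMappings>
    <!-- Decides which mappings apply, keyed on the certificate DN of
         the host that asks; adjust the domain pattern to your site -->
    <hostToGroupMapping cn='*.example.org'
        groupToAccountMappings='myvoMap'/>
  </hostToGroupMappings>
</gums>
```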
One thing of note: GUMS runs as the daemon user, so for it to be able to fetch the VO listings it needs read access to the hostkey.pem file. I have solved this by giving ownership of the key to that user:
# chown daemon /etc/grid-security/hostkey.pem
As this is documented well enough on the OSG pages, I will not cover the installation here; just follow the instructions at https://twiki.grid.iu.edu/bin/view/Storage/Hadoop.
The only things worth a mention are:
This is a bit trickier, as there are no Solaris packages, and most of the scripts have been written with Linux in mind, so some of them don't work out of the box. I have also discovered that Hadoop requires Java version 6; Solaris nodes with version 5 didn't work due to a Java class version mismatch.
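Since the class version mismatch only shows up once the daemon tries to start, it is worth checking the JVM up front. A small sketch of such a check (the `1.6.0_x` version-string format is what `java -version` prints; the helper names here are mine, not part of any Hadoop or OSG tooling):

```shell
#!/bin/sh
# Quick sanity check that a node's JVM is new enough for Hadoop.

java_major() {
    # Sun version strings look like "1.6.0_24"; the second dot-separated
    # field is the major version Hadoop cares about.
    echo "$1" | cut -d. -f2
}

check_java6() {
    if [ "$(java_major "$1")" -ge 6 ]; then
        echo "OK: Java $1 is new enough for Hadoop"
    else
        echo "FAIL: Java $1 is too old; Hadoop needs Java 6"
    fi
}

# On a live node you would feed in the real version, e.g.:
#   check_java6 "$(java -version 2>&1 | sed -n 's/.*"\(.*\)".*/\1/p')"
check_java6 "1.6.0_24"   # prints the OK line
check_java6 "1.5.0_22"   # prints the FAIL line
```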
First off, perform the installation on some Linux node so you can copy the files from that machine. Once this is done, copy the following folders/files over to the Solaris nodes, into the same locations:
/etc/hadoop
/etc/sysconfig/hadoop
/usr/bin/getPoolSize.sh
/usr/bin/hadoop*
/usr/share/java/hadoop
Once done, copy from here the start/stop script that works under Solaris (note: this is a stripped-down version that assumes a datanode only). Create a user called hadoop (or whatever user you chose on your Linux node), create the data directories that you already configured in your /etc/sysconfig/hadoop file on Linux (note: it can be a longer list of directories; Hadoop will ignore folders that don't exist on a particular node), and give all the relevant folders the right permissions (e.g. ownership to the hadoop user).
Also create /var/run/hadoop and /var/log/hadoop and give their ownership to the hadoop user. I also seemed to need to set
HADOOP_CONF_DIR=/etc/hadoop
in the hadoop user's bash profile, as it wasn't set elsewhere.
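The user/directory steps above can be sketched as a small script. The data directory paths below are placeholders for whatever you configured in /etc/sysconfig/hadoop, and the script is mine, not part of the Hadoop packaging; run it as root on the Solaris node:

```shell
#!/bin/sh
# Illustrative sketch of the directory setup for a Solaris datanode.
# DATA_DIRS must match the (possibly longer) comma-separated list in your
# /etc/sysconfig/hadoop; the example paths below are placeholders.

HADOOP_USER=${HADOOP_USER:-hadoop}
DATA_DIRS=${DATA_DIRS:-/hadoop/data1,/hadoop/data2}

make_owned_dir() {
    mkdir -p "$1"
    # chown only when the target user exists, so a dry run stays harmless
    id "$HADOOP_USER" >/dev/null 2>&1 && chown "$HADOOP_USER" "$1"
    return 0
}

# Only touch system paths when actually running as root.
if [ "$(id -u)" -eq 0 ]; then
    # Data directories from /etc/sysconfig/hadoop (comma-separated list)
    for d in $(echo "$DATA_DIRS" | tr ',' ' '); do
        make_owned_dir "$d"
    done
    # Runtime and log directories mentioned above
    for d in /var/run/hadoop /var/log/hadoop; do
        make_owned_dir "$d"
    done
fi
```

On a real node you would override the placeholder list with your own, e.g. `DATA_DIRS=/data1,/data2 sh prepare_dirs.sh`.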
Once this is all done, you can, as root, run
/etc/init.d/hadoop start
If all goes well, you should have your Hadoop datanode running. If not, well, then back to debugging :) I have already used this procedure twice to install new Solaris nodes (both OpenSolaris and Solaris 10).