The idea behind testing Hadoop (or more precisely its HDFS component) as a distributed filesystem is that it was designed from the ground up to run on commodity hardware, and hence has built-in support for replication, load balancing, node decommissioning and so on.
Rather than repeating all the different Hadoop components here and the reasons for doing things one way or another, I'll just link you to the OSG Hadoop planning guide.
So to perform a reasonable test I created a test cluster of five nodes. To get Hadoop installed, all I needed to do was add the Caltech repository to my YUM configuration (easily done with an rpm -ivh http://blaah/blah.rpm command) and install it via yum install hadoop.
To set up Hadoop one needs at least two components: a namenode and a datanode. However, to get any reasonable benefit, at least a few datanodes are needed, so in my configuration I created one namenode and four datanodes. As I used test cluster machines, these systems featured just a single 160 GB SATA disk each and didn't perform faster than ca. 25-35 MB/s locally.
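To make that concrete, the heart of the configuration is just a handful of properties in hadoop-site.xml. The following is a minimal sketch, assuming the namenode runs on test01 and using the property names from the Hadoop 0.19/0.20 era; the port and the local paths are my assumptions, not values taken from the guide:

<?xml version="1.0"?>
<configuration>
  <!-- the namenode URI every datanode and client connects to (assumed host:port) -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://test01:9000</value>
  </property>
  <!-- how many copies HDFS keeps of each block; 2 is a sane choice with 4 datanodes -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- local disk paths for datanode block storage and namenode metadata (assumed) -->
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/namenode</value>
  </property>
</configuration>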
The instructions for setting up Hadoop can be found here; they cover not only Hadoop itself but also the FUSE configuration. Through FUSE one can later mount the HDFS filesystem directly at the OS level and use it like almost any other POSIX filesystem. Setting up the datanodes and the namenode outside OSG didn't require any modifications or alterations to the guide, so I will not detail it. Effectively, once you have configured Hadoop on one node (defining the namenode's host name and the ports to use, as well as the replication factor), the other nodes are very simple to install:
# ssh test02
# rpm -ivh http://newman.ultralight.org/repos/hadoop/5/i386/caltech-hadoop-5-1.noarch.rpm
# yum install -y hadoop
# scp -r test01:/etc/hadoop /etc/
# service hadoop-firstboot start
# service hadoop start
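Once a node is up, the FUSE mount mentioned earlier is essentially a one-liner. A minimal sketch, assuming the namenode at test01:9000 and /mnt/hadoop as the mount point (the wrapper name hadoop-fuse-dfs differs between Hadoop packagings, so treat it as an assumption):

# mkdir -p /mnt/hadoop
# hadoop-fuse-dfs dfs://test01:9000 /mnt/hadoop
# df -h /mnt/hadoop

After that, ordinary tools like ls, cp and df operate on HDFS as if it were any local filesystem, POSIX corner cases aside.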
It really is as easy as that. However, since we want Hadoop not just for local filesystem access but as a fully functional storage element for our Tier 2 center, we also need two additional components:
* GridFTP door
* SRMv2 doorway
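Both doors are packaged in the same repository family, so in principle their installation is another pair of yum one-liners. A sketch, assuming gridftp-hdfs as the package name for the GridFTP-hdfs door mentioned below, and BeStMan as the SRMv2 implementation (the pairing OSG typically uses on top of HDFS):

# yum install -y gridftp-hdfs
# yum install -y bestman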
This is also the place where the instructions for OSG and those for Europe start to deviate. Namely, every center in the US has a central DN-to-user mapping service called GUMS, which pretty much everyone has set up beforehand and for which pre-created templates can be used anyway. In the EU, however, we don't deploy such a server (SCAS is supposed to start doing something similar, but its integration with GridFTP-hdfs isn't there yet).
So, to use any Grid related services one needs to first install the GUMS server. The easiest way to do this is to install it through the OSG deployment tool pacman.
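To give a flavour of what that looks like, a pacman install is roughly of the following form; the OSG cache URL and the gums package label are assumptions from the OSG 1.0 era, so double-check them against the current OSG documentation:

# mkdir -p /opt/gums && cd /opt/gums
# pacman -get http://software.grid.iu.edu/osg-1.0:gums

(to be continued)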