The idea behind testing Hadoop (or more precisely its HDFS component) as a distributed filesystem is that it was designed from the ground up to run on commodity hardware, and hence has built-in support for replication, load balancing, node decommissioning and so on.
Rather than repeating all the different Hadoop components here and the reasons for doing things one way or another, I'll just link you to the OSG Hadoop planning guide.
So to perform a reasonable test I created a test cluster of five nodes. To get Hadoop installed, all I needed to do was add the Caltech repository to my YUM configuration (easily done with an rpm -ivh http://blaah/blah.rpm command) and install it via yum install hadoop.
To set up Hadoop one needs at least two components: a namenode and a datanode. However, to get any reasonable benefit, at least a few datanodes are needed, so in my configuration I created one namenode and four datanodes. As I used test cluster machines, these systems featured just a single 160 GB SATA disk each and didn't perform faster than ca. 25-35 MB/s locally.
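To make that concrete, the heart of the configuration is just a handful of properties in hadoop-site.xml. The following is a minimal sketch, assuming the namenode runs on test01 and using the property names from the Hadoop 0.19/0.20 era; the port and the local paths are my assumptions, not values taken from the guide:

<?xml version="1.0"?>
<configuration>
  <!-- the namenode URI every datanode and client connects to (assumed host:port) -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://test01:9000</value>
  </property>
  <!-- how many copies HDFS keeps of each block; 2 is a sane choice with 4 datanodes -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- local disk paths for datanode block storage and namenode metadata (assumed) -->
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/namenode</value>
  </property>
</configuration>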
The instructions for setting up Hadoop can be found here; they cover not only Hadoop itself but also the FUSE configuration. Through FUSE one can later mount the HDFS filesystem directly at the OS level and use it like almost any other POSIX filesystem. Setting up the datanodes and the namenode outside OSG didn't require any modifications or alterations to the guide, so I will not detail it. Effectively, once you have configured Hadoop on one node (defining the namenode's host name and the ports to use, as well as the replication factor), the other nodes are very simple to install:
# ssh test02
# rpm -ivh http://newman.ultralight.org/repos/hadoop/5/i386/caltech-hadoop-5-1.noarch.rpm
# yum install -y hadoop
# scp -r test01:/etc/hadoop /etc/
# service hadoop-firstboot start
# service hadoop start
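Once a node is up, the FUSE mount mentioned earlier is essentially a one-liner. A minimal sketch, assuming the namenode at test01:9000 and /mnt/hadoop as the mount point (the wrapper name hadoop-fuse-dfs differs between Hadoop packagings, so treat it as an assumption):

# mkdir -p /mnt/hadoop
# hadoop-fuse-dfs dfs://test01:9000 /mnt/hadoop
# df -h /mnt/hadoop

After that, ordinary tools like ls, cp and df operate on HDFS as if it were any local filesystem, POSIX corner cases aside.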
It really is as easy as that. However, since we want Hadoop not just for local filesystem access but as a fully functional storage element for our Tier 2 center, we also need two additional components:
* GridFTP door
* SRMv2 doorway
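Both doors are packaged in the same repository family, so in principle their installation is another pair of yum one-liners. A sketch, assuming gridftp-hdfs as the package name for the GridFTP-hdfs door mentioned below, and BeStMan as the SRMv2 implementation (the pairing OSG typically uses on top of HDFS):

# yum install -y gridftp-hdfs
# yum install -y bestman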
This is also the place where the instructions for OSG and those for Europe start to deviate. Namely, every center in the US has a central DN-to-user mapping service called GUMS, which pretty much everyone has set up beforehand and for which pre-created templates can be used anyway. In the EU, however, we don't deploy such a server (SCAS is supposed to start doing something similar, but its integration with GridFTP-hdfs isn't there yet).
So, to use any Grid related services one needs to first install the GUMS server. The easiest way to do this is to install it through the OSG deployment tool pacman.
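To give a flavour of what that looks like, a pacman install is roughly of the following form; the OSG cache URL and the gums package label are assumptions from the OSG 1.0 era, so double-check them against the current OSG documentation:

# mkdir -p /opt/gums && cd /opt/gums
# pacman -get http://software.grid.iu.edu/osg-1.0:gums

(to be continued)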