<<TableOfContents>>
== Preface ==
We work with Apache Hadoop release 1.0.4 from http://hadoop.apache.org/, the stable release as of February 2013.<<BR>>
In our setup the ''secondarynamenode'' runs on a different machine than the ''namenode''. Both the namenode and the secondarynamenode also act as ''datanodes'' and ''tasktrackers''. The ''jobtracker'' runs on the same machine as the namenode. For data storage we use a 16 TB RAID 6 disk array mounted under /dcache.<<BR>>
Once all configuration is done, we format the namenode (only on first setup, as formatting erases HDFS)
{{{
  hadoop namenode -format
}}}
and then start Hadoop on the namenode
{{{
  start-all.sh
}}}
Now we can import some data into ''HDFS''
{{{
  hadoop dfs -copyFromLocal /data/billing-2012 /billing-2012
}}}
and start a MapReduce job
{{{
  hadoop jar billing-job.jar -Din=/billing-2012 -Dout=/b12.out
}}}


== Configuration on all machines in cluster ==
 * We have to add a user ''hadoop'' in group ''hadoop'' on all machines in the cluster:
{{{
  groupadd -g 790 hadoop
}}}
{{{
  useradd --comment "Hadoop" --shell /bin/zsh -m -r -g 790 -G hadoop --home /usr/local/hadoop hadoop
}}}

 * In ''~/.zshrc'' we have to add some variables:
{{{
  export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.0.4
  export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
  export PATH=$PATH:$HADOOP_INSTALL/bin
}}}
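A quick sanity check that the variables resolve as intended (a sketch; the paths are the ones from this page):
{{{
  export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.0.4
  export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
  export PATH=$PATH:$HADOOP_INSTALL/bin
  # should print the conf directory of the unpacked release
  echo $HADOOP_CONF_DIR
}}}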

== Configuration of Hadoop framework ==
=== conf/hadoop-env.sh ===
The following lines are to be added in hadoop-env.sh:
 * Setting JAVA_HOME
{{{
  export JAVA_HOME=/etc/alternatives/jre_oracle
}}}

 * Setting cluster members
{{{
  export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
}}}

 * Setting the host and directory from which the Hadoop configuration is rsync'd to the slaves at daemon start
{{{
  export HADOOP_MASTER=ssu03:/usr/local/hadoop/hadoop-1.0.4
}}}

=== conf/core-site.xml ===
We set the following ''properties'':
{{{
 <property>
  <name>fs.default.name</name>
  <value>hdfs://ssu03</value>
 </property>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/dcache/hadoop/tmp</value>
 </property>
 <property>
  <name>fs.inmemory.size.mb</name>
  <value>200</value>
 </property>
 <property>
  <name>io.sort.factor</name>
  <value>100</value>
 </property>
 <property>
  <name>io.sort.mb</name>
  <value>200</value>
 </property>
}}}

=== conf/hdfs-site.xml ===
We set the following ''properties'':
{{{
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/dcache/hadoop/tmp</value>
 </property>
 <property>
  <name>dfs.data.dir</name>
  <value>/dcache/hadoop/hdfs/data</value>
 </property>
 <property>
  <name>dfs.name.dir</name>
  <value>/dcache/hadoop/hdfs/name</value>
 </property>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://ssu03</value>
 </property>
 <property>
  <name>dfs.hosts</name>
  <value>${HADOOP_CONF_DIR}/slaves</value>
 </property>
 <property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication</description>
 </property>
 <property>
  <name>dfs.secondary.http.address</name>
  <value>ssu04:50090</value>
 </property>
 <property>
  <name>fs.checkpoint.dir</name>
  <value>/dcache/hadoop/secondary</value>
 </property>
 <property>
  <name>dfs.http.address</name>
  <value>ssu03:50090</value>
 </property>
}}}
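With ''dfs.replication'' set to 3, every block is stored three times, so usable capacity is roughly a third of the raw capacity. A back-of-envelope check for this cluster (numbers taken from this page: four datanodes, 16 TB under /dcache on each):
{{{
  datanodes=4          # ssu01, ssu03, ssu04, ssu05
  raw_per_node_tb=16   # RAID 6 array mounted under /dcache
  replication=3        # dfs.replication
  raw_tb=$((datanodes * raw_per_node_tb))
  echo "raw: ${raw_tb} TB, usable: ~$((raw_tb / replication)) TB"
}}}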

=== conf/mapred-site.xml ===
We set the following ''properties''
 * We have to create the ''mapred/system'' directory on all machines:
{{{
  mkdir -p /dcache/hadoop/mapred/system
}}}
{{{
 <property>
  <name>mapred.system.dir</name>
  <value>/dcache/hadoop/mapred/system</value>
 </property>
 <property>
  <name>mapred.job.tracker</name>
  <value>ssu03:9001</value>
 </property>
 <property>
  <name>mapred.hosts</name>
  <value>${HADOOP_CONF_DIR}/slaves</value>
 </property>
 <property>
  <name>dfs.hosts</name>
  <value>${HADOOP_CONF_DIR}/slaves</value>
 </property>
}}}

=== conf/masters ===
We have to add the host name of the machine running the ''secondarynamenode'' (the ''masters'' file lists secondary namenodes, not the namenode/jobtracker)
{{{
    ssu04
}}}

=== conf/slaves ===
We have to add the host names of all datanodes/tasktrackers
{{{
    ssu01
    ssu03
    ssu04
    ssu05
}}}
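Since ''dfs.hosts'' and ''mapred.hosts'' point at this same file, a host must be listed here both to be started by start-all.sh and to be accepted by the namenode and jobtracker. A small sketch of the file format (plain host names, one per line; written to a scratch path here rather than the real conf directory):
{{{
  slaves=$(mktemp)
  printf '%s\n' ssu01 ssu03 ssu04 ssu05 > "$slaves"
  echo $(wc -l < "$slaves") slave entries
}}}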
== ssh settings ==
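start-all.sh starts the daemons on every host listed in conf/slaves via ssh, so the ''hadoop'' user needs passphrase-less key login from the namenode to all cluster machines. A sketch of the key handling (run against a scratch directory here so it is safe to replay; on the real cluster the public key is appended to ~hadoop/.ssh/authorized_keys on every slave):
{{{
  d=$(mktemp -d)
  # generate a passphrase-less RSA key pair for the hadoop user
  ssh-keygen -q -t rsa -N '' -f "$d/id_rsa"
  # authorize the public key (on the real hosts: ~hadoop/.ssh/authorized_keys)
  cat "$d/id_rsa.pub" >> "$d/authorized_keys"
  chmod 600 "$d/authorized_keys"
  echo "authorized keys: $(wc -l < "$d/authorized_keys")"
}}}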


== Ports to open for datanode communication ==


== node commissioning ==


== node decommissioning ==
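A sketch of how decommissioning typically works in Hadoop 1.x (property and command from the stock distribution; the exclude-file path is an assumption mirroring our slaves file): add a ''dfs.hosts.exclude'' property to hdfs-site.xml pointing at a plain list of hosts to retire,
{{{
 <property>
  <name>dfs.hosts.exclude</name>
  <value>${HADOOP_CONF_DIR}/excludes</value>
 </property>
}}}
then, after adding a host name to that file, run {{{hadoop dfsadmin -refreshNodes}}}; the namenode re-replicates the node's blocks elsewhere before marking it decommissioned.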

Hadoop Cluster Kickstart (last edited 2013-02-12 11:56:04 by AndreasKnoepke)