Hadoop Cluster Kickstart

Preface

We use Apache Hadoop release 1.0.4 from http://hadoop.apache.org/, which was the stable version as of February 2013.
In our setup the secondarynamenode runs on a different machine from the namenode. Both the namenode and the secondarynamenode also act as datanodes and tasktrackers. The jobtracker runs on the same machine as the namenode. For data storage we used a 16 TB RAID 6 disk array mounted under /dcache.

Configuration on all machines in cluster

  • Add the user hadoop, in group hadoop, on all machines in the cluster:

  groupadd -g 790 hadoop

  useradd --comment "Hadoop" --shell /bin/zsh -m -r -g 790 -G hadoop --home /usr/local/hadoop hadoop
  • In zshrc, add the following variables:

  export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.0.4
  export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
  export PATH=$PATH:$HADOOP_INSTALL/bin
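
After logging in again as the hadoop user, it can be worth confirming that the three variables took effect. The following is a minimal sketch; check_hadoop_env is a hypothetical helper name, not something shipped with Hadoop, and it only tests that the directories exist and that PATH contains the bin directory.

```shell
# Hypothetical helper (not part of Hadoop): checks the three settings above.
check_hadoop_env() {
    if [ ! -d "${HADOOP_INSTALL:-}" ]; then
        echo "HADOOP_INSTALL is unset or not a directory" >&2
        return 1
    fi
    if [ ! -d "${HADOOP_CONF_DIR:-}" ]; then
        echo "HADOOP_CONF_DIR is unset or not a directory" >&2
        return 1
    fi
    # PATH must contain $HADOOP_INSTALL/bin as one of its colon-separated entries.
    case ":$PATH:" in
        *":$HADOOP_INSTALL/bin:"*) ;;
        *) echo "PATH does not contain $HADOOP_INSTALL/bin" >&2; return 1 ;;
    esac
    echo "environment ok"
}
```

Calling check_hadoop_env in a fresh shell prints "environment ok" when all three settings are in place and returns a non-zero status otherwise.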

Configuration of Hadoop framework

conf/hadoop-env.sh

Add the following lines to hadoop-env.sh:

  • Setting JAVA_HOME

  export JAVA_HOME=/etc/alternatives/jre_oracle
  • Setting cluster members

  export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
  • Setting the host:path from which the Hadoop installation is rsync'd

  export HADOOP_MASTER=ssu03:/usr/local/hadoop/hadoop-1.0.4
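
If JAVA_HOME does not point at a Java installation, the Hadoop scripts cannot start a JVM, so a quick check on each machine may save some debugging. This is a sketch only; check_java_home is a hypothetical helper name, and it tests nothing beyond the presence of an executable bin/java.

```shell
# Hypothetical helper: verify that a candidate JAVA_HOME contains bin/java.
check_java_home() {
    # $1 = directory intended for JAVA_HOME
    if [ -x "$1/bin/java" ]; then
        echo "ok: $1/bin/java"
    else
        echo "no executable java under $1/bin" >&2
        return 1
    fi
}

# Example call with the path used in this setup:
#   check_java_home /etc/alternatives/jre_oracle
```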

conf/core-site.xml

We set the following properties:

        <property>
                <name>fs.default.name</name>
                <value>hdfs://ssu03</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/dcache/hadoop/tmp</value>
        </property>
        <property>
                <name>fs.inmemory.size.mb</name>
                <value>200</value>
        </property>
        <property>
                <name>io.sort.factor</name>
                <value>100</value>
        </property>
        <property>
                <name>io.sort.mb</name>
                <value>200</value>
        </property>
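
To confirm which value a site file actually carries on a given machine, a small lookup helper can be handy. The sketch below is not an XML parser: get_property is a hypothetical name, and it assumes each <name> and <value> element sits on its own line, as in the listing above.

```shell
# Hypothetical helper: print the value of one property from a *-site.xml file.
# Assumes <name>...</name> and <value>...</value> each occupy a single line.
get_property() {
    # $1 = path to the site file, $2 = property name
    awk -v want="$2" '
        /<name>/ {
            name = $0
            sub(/.*<name>/, "", name)
            sub(/<\/name>.*/, "", name)
        }
        /<value>/ && name == want {
            value = $0
            sub(/.*<value>/, "", value)
            sub(/<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$1"
}

# Example call:
#   get_property "$HADOOP_CONF_DIR/core-site.xml" fs.default.name
```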

conf/hdfs-site.xml

We set the following properties:

        <property>
                <name>hadoop.tmp.dir</name>
                <value>/dcache/hadoop/tmp</value>
        </property>
        <property>
                <name>dfs.data.dir</name>
                <value>/dcache/hadoop/hdfs/data</value>
        </property>
        <property>
                <name>dfs.name.dir</name>
                <value>/dcache/hadoop/hdfs/name</value>
        </property>
        <property>
                <name>fs.default.name</name>
                <value>hdfs://ssu03</value>
        </property>
        <property>
                <name>dfs.hosts</name>
                <value>${HADOOP_CONF_DIR}/slaves</value>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>3</value>
                <description>Default block replication</description>
        </property>
        <property>
                <name>dfs.secondary.http.address</name>
                <value>ssu04:50090</value>
        </property>
        <property>
                <name>fs.checkpoint.dir</name>
                <value>ssu04:/dcache/hadoop/secondary</value>
        </property>
        <property>
                <name>dfs.http.address</name>
                <value>ssu03:50090</value>
        </property>
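
The local directories named in hadoop.tmp.dir, dfs.data.dir and dfs.name.dir can be created up front. The following is a minimal sketch under the assumption that all three live below one base directory, as they do here; make_hadoop_dirs is a hypothetical helper name.

```shell
# Hypothetical helper: create the local directories referenced in the site files.
make_hadoop_dirs() {
    # $1 = base directory; defaults to the /dcache/hadoop prefix used above
    base="${1:-/dcache/hadoop}"
    mkdir -p "$base/tmp" "$base/hdfs/data" "$base/hdfs/name"
}

# Example call with the default base:
#   make_hadoop_dirs
```

The sketch leaves ownership untouched; the directories would still need to be writable by the hadoop user.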

conf/mapred-site.xml

We set the following properties:

  • Create the mapred/system directory on all machines:

  mkdir -p /dcache/hadoop/mapred/system

        <property>
                <name>mapred.system.dir</name>
                <value>/dcache/hadoop/mapred/system</value>
        </property>
        <property>
                <name>mapred.job.tracker</name>
                <value>ssu03:9001</value>
        </property>
        <property>
                <name>mapred.hosts</name>
                <value>${HADOOP_CONF_DIR}/slaves</value>
        </property>
        <property>
                <name>dfs.hosts</name>
                <value>${HADOOP_CONF_DIR}/slaves</value>
        </property>

conf/masters

Add the host name of our namenode/jobtracker:

    ssu03

conf/slaves

Add the host names of all datanodes/tasktrackers:

    ssu03
    ssu01
    ssu04
    ssu05
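
Because the slaves file is a plain list of host names, it is easy to reuse for per-host checks. The sketch below prints the hosts with comments and blank lines removed; list_slaves is a hypothetical helper name, and the ssh loop in the trailing comment assumes that key-based login for the hadoop user is already working.

```shell
# Hypothetical helper: print one host name per line from a slaves file,
# ignoring '#' comments, blank lines and surrounding whitespace.
list_slaves() {
    # $1 = path to the slaves file
    sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$1" | awk '{ print $1 }'
}

# Example use (assumes passwordless ssh to every host):
#   for host in $(list_slaves "$HADOOP_CONF_DIR/slaves"); do
#       ssh "$host" hostname
#   done
```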

SSH settings

Ports to open for datanode communication

Node commissioning

Node decommissioning

Hadoop Cluster Kickstart (last edited 2013-02-12 11:56:04 by AndreasKnoepke)