== Preface ==
We work with Apache Hadoop release 1.0.4 from http://hadoop.apache.org/, which was the stable release as of February 2013.

In our setup the ''secondarynamenode'' runs on a different machine than the ''namenode''. Both the namenode and the secondarynamenode also act as ''datanodes'' and ''tasktrackers''. The ''jobtracker'' runs on the same machine as the namenode. For data storage we use a 16 TB RAID 6 disk array mounted under /dcache.

Once all configuration is done, we first format the namenode (only once, before the first start):
{{{
hadoop namenode -format
}}}
Then we start hadoop on the namenode:
{{{
start-all.sh
}}}
Now we can import some data into ''HDFS''
{{{
hadoop dfs -copyFromLocal /data/billing-2012 /billing-2012
}}}
and start a mapreduce job
{{{
hadoop jar billing-job.jar -Din=/billing-2012 -Dout=/b12.out
}}}
== Configuration on all machines in cluster ==
* We have to add a user ''hadoop'' in group ''hadoop'' on all machines in the cluster:
{{{
groupadd -g 790 hadoop
}}}
{{{
useradd --comment "Hadoop" --shell /bin/zsh -m -r -g 790 -G hadoop --home /usr/local/hadoop hadoop
}}}
* In ~/.zshrc we have to add some variables:
{{{
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.0.4
export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
export PATH=$PATH:$HADOOP_INSTALL/bin
}}}
== Configuration of Hadoop framework ==
=== conf/hadoop-env.sh ===
The following lines are added to hadoop-env.sh:
* Setting JAVA_HOME
{{{
export JAVA_HOME=/etc/alternatives/jre_oracle
}}}
* Setting cluster members
{{{
export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
}}}
* Setting the location from which the hadoop configuration is rsync'd to the slaves
{{{
export HADOOP_MASTER=ssu03:/usr/local/hadoop/hadoop-1.0.4
}}}
=== conf/core-site.xml ===
We changed the following ''properties'':
{{{
fs.default.name
hdfs://ssu03
hadoop.tmp.dir
/dcache/hadoop/tmp
fs.inmemory.size.mb
200
io.sort.factor
100
io.sort.mb
200
}}}
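As a sketch, the list above corresponds to the following entries in core-site.xml (property values taken verbatim from our list):
{{{
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ssu03</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/dcache/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.inmemory.size.mb</name>
    <value>200</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>100</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
</configuration>
}}}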
=== conf/hdfs-site.xml ===
We changed the following ''properties'':
{{{
hadoop.tmp.dir
/dcache/hadoop/tmp
dfs.data.dir
/dcache/hadoop/hdfs/data
dfs.name.dir
/dcache/hadoop/hdfs/name
fs.default.name
hdfs://ssu03
dfs.hosts
${HADOOP_CONF_DIR}/slaves
dfs.replication
3
dfs.secondary.http.address
ssu04:50090
fs.checkpoint.dir
/dcache/hadoop/secondary
dfs.http.address
ssu03:50070
}}}
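For the main storage-related properties, a sketch of the corresponding hdfs-site.xml entries looks like this. Note that shell variables such as $HADOOP_CONF_DIR are not expanded inside Hadoop's XML (only other configuration and Java system properties are), so the slaves path is spelled out; the remaining properties from the list are added with the same <property> pattern:
{{{
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/dcache/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/dcache/hadoop/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <!-- shell variables are not expanded here, so the path is written out -->
    <name>dfs.hosts</name>
    <value>/usr/local/hadoop/hadoop-1.0.4/conf/slaves</value>
  </property>
</configuration>
}}}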
=== conf/mapred-site.xml ===
We changed the following ''properties''.
* We have to create the ''mapred/system'' directory on all machines (the path must match the mapred.system.dir property below):
{{{
mkdir -p /dcache/hadoop/mapred/system
}}}
{{{
mapred.system.dir
/dcache/hadoop/mapred/system
mapred.job.tracker
ssu03:9001
mapred.hosts
${HADOOP_CONF_DIR}/slaves
}}}
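As with the other files, a sketch of the resulting mapred-site.xml (the slaves path spelled out because shell variables are not expanded in the XML):
{{{
<configuration>
  <property>
    <name>mapred.system.dir</name>
    <value>/dcache/hadoop/mapred/system</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>ssu03:9001</value>
  </property>
  <property>
    <name>mapred.hosts</name>
    <value>/usr/local/hadoop/hadoop-1.0.4/conf/slaves</value>
  </property>
</configuration>
}}}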
=== conf/masters ===
Despite its name, this file lists the host running the ''secondarynamenode'' (ssu04 in our setup), not the namenode/jobtracker:
{{{
ssu04
}}}
=== conf/slaves ===
We have to add the host names of all datanodes/tasktrackers:
{{{
ssu01
ssu03
ssu04
ssu05
}}}
== ssh settings ==
== Ports to open for datanode communication ==
== node commissioning ==
== node decommissioning ==