01) Install OpenBSD 6.2, syspatch.
02) Install required packages:
pkg_add -v bash jdk lynx rsync cssh
( cssh, rsync and lynx are only needed on the client you use to manage the cluster, not on the nodes, but I prefer to have everything handy during experiments )
03) Add a user for hadoop (if you haven't already done so during the install):
useradd -m hadoop ; passwd hadoop
04) Configure the user to use the bash shell:
usermod -s /usr/local/bin/bash hadoop
( bash is listed as a requirement in the official Apache docs; I tried using only OpenBSD's default pdksh and everything seemed to work the same anyway... ymmv )
05) Configure the file /etc/hosts with all the names of the machines of the cluster:
cat /etc/hosts
----------------------------------------------
127.0.0.1 localhost
::1 localhost
192.168.0.160 node00.sick-net node00
192.168.0.161 node01.sick-net node01
192.168.0.162 node02.sick-net node02
192.168.0.163 node03.sick-net node03
192.168.0.164 node04.sick-net node04
192.168.0.165 node05.sick-net node05
192.168.0.166 node06.sick-net node06
----------------------------------------------
06) From now on you can do everything on all hosts at the same time with this command:
cssh hadoop@node01 hadoop@node02 hadoop@node03 hadoop@node04 hadoop@node05 hadoop@node06
( to use terminus font size 14 do: cssh -f -*-terminus-*-*-*-*-14-*-*-*-*-*-*-* hadoop@node01 hadoop@node02 hadoop@node03 hadoop@node04 hadoop@node05 hadoop@node06 )
( of course the best way to manage multiple machines, up to hundreds or thousands, would be to use the OpenBSD autoinstall facility and then a tool like ansible, puppet, chef and the like... but that's out of the scope of this howto )
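( if you prefer plain ssh over cssh, the same "one command everywhere" idea can be sketched as a tiny shell function; the NODES list and the RUN=echo dry-run convention are my own assumptions, not part of cssh )

```shell
# Run one command on every node in sequence over ssh.
# Set RUN=echo to preview the ssh invocations without executing them.
NODES="node01 node02 node03 node04 node05 node06"
run_all() {
    cmd="$1"
    for n in $NODES; do
        echo "== $n =="
        ${RUN:-ssh} "hadoop@$n" "$cmd"
    done
}
```

( usage: run_all uptime , or RUN=echo run_all uptime to see what would be executed )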
07) As user hadoop create private keys and upload public keys to other nodes:
ssh-keygen -t ed25519
(when it asks for file and passphrase just press enter)
mv .ssh/id_ed25519.pub .ssh/`hostname -s`.pub
scp .ssh/*pub hadoop@node01:.ssh/
scp .ssh/*pub hadoop@node02:.ssh/
scp .ssh/*pub hadoop@node03:.ssh/
scp .ssh/*pub hadoop@node04:.ssh/
scp .ssh/*pub hadoop@node05:.ssh/
scp .ssh/*pub hadoop@node06:.ssh/
cat .ssh/*pub >> .ssh/authorized_keys
( or just add the namenode public key: cat .ssh/node01.pub >> .ssh/authorized_keys )
( add any other key you may find useful for managing the nodes; I usually also add the key of the client I run cssh from )
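( the six scp lines above can be rolled into a loop; this is just a sketch, and the helper name push_pubkeys plus the CP=echo dry-run variable are my own invention )

```shell
# Copy every public key in a directory to the .ssh/ of each node in a list.
# Set CP=echo to preview the scp commands without copying anything.
push_pubkeys() {
    nodes="$1"
    keydir="${2:-$HOME/.ssh}"
    for n in $nodes; do
        ${CP:-scp} "$keydir"/*.pub "hadoop@$n:.ssh/"
    done
}
```

( usage: push_pubkeys "node01 node02 node03 node04 node05 node06" )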
08) Return to the home of the user:
cd /home/hadoop
09) Configure all the environment variables in the files .profile .bashrc .bash_profile to look like this:
egrep -i '(java|jdk|hadoop|zoo|PS)' .profile
--------------------------------------------------------
PATH=$HOME/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin:/usr/local/bin:/usr/local/sbin:/usr/games:/usr/local/jdk-1.8.0/bin:$HOME/hadoop-3.0.0-beta1/bin:$HOME/hadoop-3.0.0-beta1/sbin:$HOME/zookeeper-3.5.3-beta/bin:.
export JAVA_HOME=/usr/local/jdk-1.8.0
export HADOOP_HOME=/home/hadoop/hadoop-3.0.0-beta1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_NAMENODE_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -Xms1G -Xmx1G -XX:NewSize=128M -XX:MaxNewSize=128M $HDFS_NAMENODE_OPTS"
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export ZOOKEEPER_HOME=/home/hadoop/zookeeper-3.5.3-beta
PS1="[`hostname`]$ "
--------------------------------------------------------
cp .profile .ssh/environment ; cp .profile .bashrc ; cp .bashrc .bash_profile
( then remove the line "export PATH HOME TERM" from .ssh/environment or sshd will complain )
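( the copy-then-prune of .ssh/environment can be done in one step; a sketch, where the helper name mk_ssh_env is my own )

```shell
# Build an sshd-friendly environment file from a profile,
# dropping the "export PATH HOME TERM" line that sshd rejects.
mk_ssh_env() {
    src="$1"   # e.g. .profile
    dst="$2"   # e.g. .ssh/environment
    grep -v '^export PATH HOME TERM' "$src" > "$dst"
}
```

( usage: mk_ssh_env .profile .ssh/environment )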
10) Become root, then disable ddb panic, configure the sshd daemon to allow the user environment, and raise some limits in /etc/login.conf (then of course relogin):
su -l
echo "ddb.panic=0" > /etc/sysctl.conf
sysctl -w ddb.panic=0
echo "PermitUserEnvironment yes" >> /etc/ssh/sshd_config ; kill -HUP `cat /var/run/sshd.pid`
cp /etc/login.conf /etc/login.conf.orig
edit /etc/login.conf and in the staff class change :datasize-cur=1536M:\ to :datasize-cur=4096M:\
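( if you prefer a non-interactive edit, something like the following sed one-liner works; it assumes the stock 1536M value is still in place, and the function name bump_datasize is my own )

```shell
# In-place edit of login.conf, keeping a .bak backup; run as root.
# Assumes the staff class still carries the default 1536M datasize-cur.
bump_datasize() {
    sed -i.bak 's/:datasize-cur=1536M:/:datasize-cur=4096M:/' "$1"
}
```

( usage: bump_datasize /etc/login.conf )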
exit
11) Download the hadoop distribution, unpack it, and configure the file hadoop-env.sh like this:
cd /home/hadoop; ftp http://mirror.nohup.it/apache/hadoop/common/hadoop-3.0.0-beta1/hadoop-3.0.0-beta1.tar.gz
tar xvfz hadoop-3.0.0-beta1.tar.gz
cd /home/hadoop/hadoop-3.0.0-beta1/etc/hadoop ; cp hadoop-env.sh hadoop-env.sh.orig
tail -n5 hadoop-env.sh
--------------------------------------------
HADOOP_HOME=/home/hadoop/hadoop-3.0.0-beta1
export HADOOP_HOME
JAVA_HOME=/usr/local/jdk-1.8.0
export JAVA_HOME
--------------------------------------------
12) Now let's test if hadoop works locally:
cd /home/hadoop/hadoop-3.0.0-beta1
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1.jar grep input output 'dfs[a-z.]+'
cat output/*
----------------
1 dfsadmin
----------------
13) Download and unpack zookeeper:
cd /home/hadoop
ftp https://archive.apache.org/dist/zookeeper/zookeeper-3.5.3-beta/zookeeper-3.5.3-beta.tar.gz
tar xvfz zookeeper-3.5.3-beta.tar.gz
( even if this file is named .tar.gz, I had to use plain tar xvf to unpack it; maybe they made a mistake when creating the package? )
14) Now we will configure the machines to act as a cluster, roughly following this layout:
node01: NameNode (active) ZooKeeper JournalNode
node02: NameNode (standby) ZooKeeper JournalNode
node03: NameNode (standby) ZooKeeper JournalNode
node04: DataNode
node05: DataNode
node06: DataNode
( this way we achieve enough redundancy to lose any 2 datanodes at any time and still continue working without losing data; more datanodes can easily be added later )
15) On all nodes (or only on node01, then scp the config files to the others):
cp hadoop-3.0.0-beta1/etc/hadoop/core-site.xml hadoop-3.0.0-beta1/etc/hadoop/core-site.xml.orig
vi hadoop-3.0.0-beta1/etc/hadoop/core-site.xml
--------------------------------------
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://hadoop</value></property>
  <property><name>ha.zookeeper.quorum</name><value>node01.sick-net:2181,node02.sick-net:2181,node03.sick-net:2181</value></property>
</configuration>
--------------------------------------
cp hadoop-3.0.0-beta1/etc/hadoop/hdfs-site.xml hadoop-3.0.0-beta1/etc/hadoop/hdfs-site.xml.orig
vi hadoop-3.0.0-beta1/etc/hadoop/hdfs-site.xml
-------------------------------------------------------------------------------------------------------------
<configuration>
  <property><name>dfs.namenode.name.dir</name><value>file:///home/hadoop/data/namenode</value></property>
  <property><name>dfs.journalnode.edits.dir</name><value>/home/hadoop/data/jn</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.permissions</name><value>false</value></property>
  <property><name>dfs.nameservices</name><value>hadoop</value></property>
  <property><name>dfs.ha.namenodes.hadoop</name><value>node01,node02,node03</value></property>
  <property><name>dfs.namenode.rpc-address.hadoop.node01</name><value>node01.sick-net:9000</value></property>
  <property><name>dfs.namenode.rpc-address.hadoop.node02</name><value>node02.sick-net:9000</value></property>
  <property><name>dfs.namenode.rpc-address.hadoop.node03</name><value>node03.sick-net:9000</value></property>
  <property><name>dfs.namenode.http-address.hadoop.node01</name><value>node01.sick-net:50070</value></property>
  <property><name>dfs.namenode.http-address.hadoop.node02</name><value>node02.sick-net:50070</value></property>
  <property><name>dfs.namenode.http-address.hadoop.node03</name><value>node03.sick-net:50070</value></property>
  <property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://node01.sick-net:8485;node02.sick-net:8485;node03.sick-net:8485/hadoop</value></property>
  <property><name>dfs.client.failover.proxy.provider.hadoop</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
  <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
  <property><name>ha.zookeeper.quorum</name><value>node01.sick-net:2181,node02.sick-net:2181,node03.sick-net:2181</value></property>
  <property><name>dfs.ha.fencing.methods</name><value>sshfence(hadoop)</value></property>
  <property><name>dfs.ha.fencing.ssh.private-key-files</name><value>/home/hadoop/.ssh/id_ed25519</value></property>
</configuration>
-------------------------------------------------------------------------------------------------------------
16) Go to the zookeeper conf directory (on all nodes, or at least on those that will run it):
cd /home/hadoop ; cd zookeeper-3.5.3-beta/conf/
cp zoo_sample.cfg zoo.cfg
cd /home/hadoop; mkdir -p data/zookeeper
vi zookeeper-3.5.3-beta/conf/zoo.cfg
----------------------------------------------------------------------------
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/home/hadoop/data/zookeeper
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1=node01.sick-net:2888:3888
server.2=node02.sick-net:2888:3888
server.3=node03.sick-net:2888:3888
----------------------------------------------------------------------------
cp zookeeper-3.5.3-beta/bin/zkEnv.sh zookeeper-3.5.3-beta/bin/zkEnv.sh.orig
edit zookeeper-3.5.3-beta/bin/zkEnv.sh and change ZK_SERVER_HEAP="${ZK_SERVER_HEAP:-1000}" to ZK_SERVER_HEAP="${ZK_SERVER_HEAP:-256}"
echo 'export JVMFLAGS="-Xms128m -Xmx256m"' > zookeeper-3.5.3-beta/conf/java.env
17) On all the nodes create the directory /home/hadoop/data:
mkdir /home/hadoop/data
18) On the namenodes create the directories /home/hadoop/data/namenode and /home/hadoop/data/jn:
mkdir /home/hadoop/data/namenode ; mkdir /home/hadoop/data/jn
19) On storage nodes create the directory /home/hadoop/data/datanode:
mkdir /home/hadoop/data/datanode
20) On storage nodes modify hdfs-site.xml adding only this property:
------------------------------------------------
  <property><name>dfs.datanode.data.dir</name><value>file:///home/hadoop/data/datanode</value></property>
------------------------------------------------
21) On the namenodes create the myid file under the ZooKeeper dataDir, each node with a different, increasing id:
on node01: echo "1" > data/zookeeper/myid
on node02: echo "2" > data/zookeeper/myid
on node03: echo "3" > data/zookeeper/myid
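( the three myid files can also be written from a single management host over ssh; a sketch, where the function name set_myids and the RUN=echo preview convention are my own assumptions )

```shell
# Assign a distinct ZooKeeper id to each quorum node from one host.
# Set RUN=echo to preview the ssh commands without executing them.
set_myids() {
    i=1
    for n in node01 node02 node03; do
        ${RUN:-ssh} "hadoop@$n" "mkdir -p data/zookeeper && echo $i > data/zookeeper/myid"
        i=$((i + 1))
    done
}
```

( usage: set_myids , run as the user whose key is authorized on the nodes )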
22) Start journalnode and zookeeper server on node01 node02 node03 and use jps to check if they started:
hdfs --daemon start journalnode
zkServer.sh start
jps
23) Format and start namenode and zookeeper failover controller on node01:
hdfs namenode -format
hdfs zkfc -formatZK
hdfs --daemon start namenode
hdfs --daemon start zkfc
jps
24) Execute these commands on node02 to copy over the hdfs metadata and then start the namenode and the zookeeper failover controller:
hdfs namenode -bootstrapStandby
hdfs --daemon start namenode
hdfs --daemon start zkfc
jps
25) Execute these commands on node03 to copy over the hdfs metadata and then start the namenode and the zookeeper failover controller:
hdfs namenode -bootstrapStandby
hdfs --daemon start namenode
hdfs --daemon start zkfc
jps
26) Start the data node on node04 node05 node06:
hdfs --daemon start datanode
jps
27) To stop the cluster:
ssh hadoop@node06 "bash -ls -c 'hdfs --daemon stop datanode'"
ssh hadoop@node05 "bash -ls -c 'hdfs --daemon stop datanode'"
ssh hadoop@node04 "bash -ls -c 'hdfs --daemon stop datanode'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon stop zkfc'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon stop namenode'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon stop zkfc'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon stop namenode'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon stop zkfc'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon stop namenode'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon stop journalnode'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon stop journalnode'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon stop journalnode'"
ssh hadoop@node03 "bash -ls -c 'zkServer.sh stop'"
ssh hadoop@node02 "bash -ls -c 'zkServer.sh stop'"
ssh hadoop@node01 "bash -ls -c 'zkServer.sh stop'"
28) To start the cluster (for example after a reboot):
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon start journalnode'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon start journalnode'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon start journalnode'"
ssh hadoop@node01 "bash -ls -c 'zkServer.sh start'"
ssh hadoop@node02 "bash -ls -c 'zkServer.sh start'"
ssh hadoop@node03 "bash -ls -c 'zkServer.sh start'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon start namenode'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon start zkfc'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon start namenode'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon start zkfc'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon start namenode'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon start zkfc'"
ssh hadoop@node04 "bash -ls -c 'hdfs --daemon start datanode'"
ssh hadoop@node05 "bash -ls -c 'hdfs --daemon start datanode'"
ssh hadoop@node06 "bash -ls -c 'hdfs --daemon start datanode'"
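( the two litanies above can be collapsed into a pair of functions; a sketch with helper names (each, cluster_start, cluster_stop, NN_NODES, DN_NODES) of my own invention, and note it starts all namenodes before all zkfc daemons, a slightly different order from the per-node list above )

```shell
# Start/stop the whole cluster from the management host.
# Set RUN=echo to preview the ssh commands without executing them.
NN_NODES="node01 node02 node03"   # namenode / journalnode / zookeeper hosts
DN_NODES="node04 node05 node06"   # datanode hosts
each() {
    # run one remote command on every node in a list
    for n in $2; do
        ${RUN:-ssh} "hadoop@$n" "bash -ls -c '$1'"
    done
}
cluster_start() {
    each 'hdfs --daemon start journalnode' "$NN_NODES"
    each 'zkServer.sh start' "$NN_NODES"
    each 'hdfs --daemon start namenode' "$NN_NODES"
    each 'hdfs --daemon start zkfc' "$NN_NODES"
    each 'hdfs --daemon start datanode' "$DN_NODES"
}
cluster_stop() {
    each 'hdfs --daemon stop datanode' "$DN_NODES"
    each 'hdfs --daemon stop zkfc' "$NN_NODES"
    each 'hdfs --daemon stop namenode' "$NN_NODES"
    each 'hdfs --daemon stop journalnode' "$NN_NODES"
    each 'zkServer.sh stop' "$NN_NODES"
}
```

( usage: cluster_start after a reboot, cluster_stop before maintenance )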
29) To check the cluster status connect with a browser to: http://node01.sick-net:50070
and run these commands from the console:
hdfs haadmin -getServiceState node01
hdfs haadmin -getServiceState node02
hdfs haadmin -getServiceState node03
30) To use the hdfs filesystem:
hadoop fs -ls /
hadoop fs -mkdir /test/
touch file.txt
hadoop fs -put file.txt /test/
hadoop fs -ls /test/
hadoop fs -df -h
hdfs fsck / -files -showprogress
31) To enable erasure coding on a directory (the default policy requires more than 3 datanodes, but we are just experimenting):
hadoop fs -mkdir /erasure/
hdfs ec -getPolicy -path /erasure
(should reply: The erasure coding policy of /erasure is unspecified )
hdfs ec -setPolicy -path /erasure -policy RS-6-3-1024k
(should reply: Set erasure coding policy RS-6-3-1024k on /erasure )
hdfs ec -getPolicy -path /erasure
(should reply: RS-6-3-1024k )
32) To install a client outside of the cluster:
egrep -i '(java|jdk|hadoop|zoo|PS)' .profile
--------------------------------------------------------
PATH=$HOME/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin:/usr/local/bin:/usr/local/sbin:/usr/games:/usr/local/jdk-1.8.0/bin:$HOME/hadoop-3.0.0-beta1/bin:$HOME/hadoop-3.0.0-beta1/sbin:$HOME/zookeeper-3.5.3-beta/bin:.
export JAVA_HOME=/usr/local/jdk-1.8.0
export HADOOP_HOME=/home/user/hadoop-3.0.0-beta1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
--------------------------------------------------------
cd /home/user; ftp http://mirror.nohup.it/apache/hadoop/common/hadoop-3.0.0-beta1/hadoop-3.0.0-beta1.tar.gz
tar xvfz hadoop-3.0.0-beta1.tar.gz
cd /home/user/hadoop-3.0.0-beta1/etc/hadoop ; cp hadoop-env.sh hadoop-env.sh.orig
tail -n5 hadoop-env.sh
--------------------------------------------
HADOOP_HOME=/home/user/hadoop-3.0.0-beta1
export HADOOP_HOME
JAVA_HOME=/usr/local/jdk-1.8.0
export JAVA_HOME
--------------------------------------------
( and repeat step 15 for the client if you didn't do it before! )
33) To test it we can try to synchronize a directory full of iso images:
hadoop fs -mkdir /user/
hadoop distcp file:///home/user/iso/linux hdfs://hadoop/user/
34) Obligatory screenshot pr0n:
http://www.sickness.it/openbsd_hadoop01.png
http://www.sickness.it/openbsd_hadoop02.png
http://www.sickness.it/openbsd_hadoop03.png
!!!KEEP IN MIND!!!
With this HA setup you can lose any 2 nodes without losing data and the cluster will keep working. However, the cluster is running in *insecure* mode: any user can connect, impersonate other users, and delete or modify data, so it is strongly advised to configure secure mode with Kerberos!