01) Install OpenBSD 6.2 and apply the latest patches with syspatch.

02) Install the required packages:

pkg_add -v bash jdk lynx rsync cssh

( cssh, rsync and lynx only need to be installed on the client you use to manage the cluster, they are not needed on the nodes, but I prefer to have everything handy during experiments )

03) Add a user for hadoop (if you haven't already done so during the install):

useradd -m hadoop ; passwd hadoop

04) Configure the user to use the bash shell:

usermod -s /usr/local/bin/bash hadoop

( bash is listed as a requirement in the official Apache docs; I tried to use only OpenBSD's default pdksh and everything seemed to work the same, anyway... ymmv )

05) Configure the file /etc/hosts with the names of all the machines of the cluster:

cat /etc/hosts
----------------------------------------------
127.0.0.1       localhost
::1             localhost
192.168.0.160   node00.sick-net node00
192.168.0.161   node01.sick-net node01
192.168.0.162   node02.sick-net node02
192.168.0.163   node03.sick-net node03
192.168.0.164   node04.sick-net node04
192.168.0.165   node05.sick-net node05
192.168.0.166   node06.sick-net node06
----------------------------------------------

06) From now on you can do everything on all hosts at the same time with this command:

cssh hadoop@node01 hadoop@node02 hadoop@node03 hadoop@node04 hadoop@node05 hadoop@node06

( to use terminus font size 14 do:
cssh -f -*-terminus-*-*-*-*-14-*-*-*-*-*-*-* hadoop@node01 hadoop@node02 hadoop@node03 hadoop@node04 hadoop@node05 hadoop@node06 )

( of course the best way to manage multiple machines, up to hundreds and thousands, would be to use the OpenBSD autoinstall facility and then some program like ansible, puppet, chef and the like... but that is out of the scope of this howto )

07) As user hadoop create the private keys and upload the public keys to the other nodes:

ssh-keygen -t ed25519
(when it asks for file and passphrase just press enter)
mv .ssh/id_ed25519.pub .ssh/`hostname -s`.pub
scp .ssh/*pub hadoop@node01:.ssh/
scp .ssh/*pub hadoop@node02:.ssh/
scp .ssh/*pub hadoop@node03:.ssh/
scp .ssh/*pub hadoop@node04:.ssh/
scp .ssh/*pub hadoop@node05:.ssh/
scp .ssh/*pub hadoop@node06:.ssh/
cat .ssh/*pub >> .ssh/authorized_keys

( or just add the namenode public key: cat .ssh/node01.pub >> .ssh/authorized_keys )
( add any other key you may find useful for managing them, I usually also add the key of the client I run cssh from )

08) Return to the home of the user:

cd /home/hadoop

09) Configure the environment variables in the files .profile .bashrc .bash_profile to look like this:

egrep -i '(java|jdk|hadoop|zoo|PS)' .profile
--------------------------------------------------------
PATH=$HOME/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin:/usr/local/bin:/usr/local/sbin:/usr/games:/usr/local/jdk-1.8.0/bin:$HOME/hadoop-3.0.0-beta1/bin:$HOME/hadoop-3.0.0-beta1/sbin:$HOME/zookeeper-3.5.3-beta/bin:.
export JAVA_HOME=/usr/local/jdk-1.8.0
export HADOOP_HOME=/home/hadoop/hadoop-3.0.0-beta1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_NAMENODE_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -Xms1G -Xmx1G -XX:NewSize=128M -XX:MaxNewSize=128M $HDFS_NAMENODE_OPTS"
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export ZOOKEEPER_HOME=/home/hadoop/zookeeper-3.5.3-beta
PS1="[`hostname`]$ "
--------------------------------------------------------

cp .profile .ssh/environment ; cp .profile .bashrc ; cp .bashrc .bash_profile

( then remove "export PATH HOME TERM" from .ssh/environment or it will complain )

10) Become root, then disable ddb panic, configure the sshd daemon to allow the user environment and raise some limits in /etc/login.conf (then of course relogin):

su -l
echo "ddb.panic=0" >> /etc/sysctl.conf
sysctl -w ddb.panic=0
echo "PermitUserEnvironment yes" >> /etc/ssh/sshd_config ; kill -HUP `cat /var/run/sshd.pid`
cp /etc/login.conf /etc/login.conf.orig

edit /etc/login.conf and in the staff class change:
:datasize-cur=1536M:\
to:
:datasize-cur=4096M:\

exit

11) Download the hadoop distribution, unpack it, and configure the file hadoop-env.sh like this:

cd /home/hadoop
ftp http://mirror.nohup.it/apache/hadoop/common/hadoop-3.0.0-beta1/hadoop-3.0.0-beta1.tar.gz
tar xvfz hadoop-3.0.0-beta1.tar.gz
cd /home/hadoop/hadoop-3.0.0-beta1/etc/hadoop ; cp hadoop-env.sh hadoop-env.sh.orig

tail -n5 hadoop-env.sh
--------------------------------------------
HADOOP_HOME=/home/hadoop/hadoop-3.0.0-beta1
export HADOOP_HOME
JAVA_HOME=/usr/local/jdk-1.8.0
export JAVA_HOME
--------------------------------------------

12) Now let's test if hadoop works locally:

cd /home/hadoop/hadoop-3.0.0-beta1
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1.jar grep input output 'dfs[a-z.]+'
cat output/*
----------------
1       dfsadmin
----------------

13) Download and unpack zookeeper:

cd /home/hadoop
ftp https://archive.apache.org/dist/zookeeper/zookeeper-3.5.3-beta/zookeeper-3.5.3-beta.tar.gz
tar xvf zookeeper-3.5.3-beta.tar.gz

( even though this file is named .tar.gz, I had to use plain tar xvf to unpack it; maybe they made a mistake in the package creation? )
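
( before wiring everything into a cluster it is worth checking that passwordless ssh and the environment are really in place on every node; a minimal sketch, assuming the node names and the hadoop user from the steps above: )

for n in node01 node02 node03 node04 node05 node06; do
    echo "== $n =="
    ssh hadoop@$n "bash -ls -c 'hadoop version | head -n1'"
done

( every node should answer with a line like "Hadoop 3.0.0-beta1" without asking for a password; if one asks, recheck steps 07, 09 and 10 on that node )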
14) Now we will configure the machines to act as a cluster, roughly following this layout:

node01: NameNode ZooKeeper JournalNode
node02: NameNode ZooKeeper JournalNode
node03: NameNode ZooKeeper JournalNode
node04: DataNode
node05: DataNode
node06: DataNode

( only one of the three namenodes is active at any time, the other two stay in standby; this redundancy lets us lose any 2 datanodes at any time and keep working without losing data, and more datanodes can easily be added later )

15) On all nodes (or only on node01, the first one, and then scp the config files to the others, as sketched below):

cp hadoop-3.0.0-beta1/etc/hadoop/core-site.xml hadoop-3.0.0-beta1/etc/hadoop/core-site.xml.orig
vi hadoop-3.0.0-beta1/etc/hadoop/core-site.xml
--------------------------------------
<configuration>
 <property><name>fs.defaultFS</name><value>hdfs://hadoop</value></property>
 <property><name>ha.zookeeper.quorum</name><value>node01.sick-net:2181,node02.sick-net:2181,node03.sick-net:2181</value></property>
</configuration>
--------------------------------------

cp hadoop-3.0.0-beta1/etc/hadoop/hdfs-site.xml hadoop-3.0.0-beta1/etc/hadoop/hdfs-site.xml.orig
vi hadoop-3.0.0-beta1/etc/hadoop/hdfs-site.xml
-------------------------------------------------------------------------------------------------------------
<configuration>
 <property><name>dfs.namenode.name.dir</name><value>file:///home/hadoop/data/namenode</value></property>
 <property><name>dfs.journalnode.edits.dir</name><value>/home/hadoop/data/jn</value></property>
 <property><name>dfs.replication</name><value>3</value></property>
 <property><name>dfs.permissions</name><value>false</value></property>
 <property><name>dfs.nameservices</name><value>hadoop</value></property>
 <property><name>dfs.ha.namenodes.hadoop</name><value>node01,node02,node03</value></property>
 <property><name>dfs.namenode.rpc-address.hadoop.node01</name><value>node01.sick-net:9000</value></property>
 <property><name>dfs.namenode.rpc-address.hadoop.node02</name><value>node02.sick-net:9000</value></property>
 <property><name>dfs.namenode.rpc-address.hadoop.node03</name><value>node03.sick-net:9000</value></property>
 <property><name>dfs.namenode.http-address.hadoop.node01</name><value>node01.sick-net:50070</value></property>
 <property><name>dfs.namenode.http-address.hadoop.node02</name><value>node02.sick-net:50070</value></property>
 <property><name>dfs.namenode.http-address.hadoop.node03</name><value>node03.sick-net:50070</value></property>
 <property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://node01.sick-net:8485;node02.sick-net:8485;node03.sick-net:8485/hadoop</value></property>
 <property><name>dfs.client.failover.proxy.provider.hadoop</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
 <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
 <property><name>ha.zookeeper.quorum</name><value>node01.sick-net:2181,node02.sick-net:2181,node03.sick-net:2181</value></property>
 <property><name>dfs.ha.fencing.methods</name><value>sshfence(hadoop)</value></property>
 <property><name>dfs.ha.fencing.ssh.private-key-files</name><value>/home/hadoop/.ssh/id_ed25519</value></property>
</configuration>
-------------------------------------------------------------------------------------------------------------
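
( if you edited the files only on node01, a minimal sketch for pushing them to the other nodes, assuming the hostnames from /etc/hosts above: )

cd /home/hadoop/hadoop-3.0.0-beta1/etc/hadoop
for n in node02 node03 node04 node05 node06; do
    scp core-site.xml hdfs-site.xml hadoop@$n:hadoop-3.0.0-beta1/etc/hadoop/
done

( remember that the datanodes also get the extra dfs.datanode.data.dir property of step 20, so their hdfs-site.xml will need that additional edit afterwards )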
16) Go to the zookeeper conf directory (on all nodes, or at least on all the nodes that will run it):

cd /home/hadoop ; cd zookeeper-3.5.3-beta/conf/
cp zoo_sample.cfg zoo.cfg
cd /home/hadoop ; mkdir -p data/zookeeper

vi zookeeper-3.5.3-beta/conf/zoo.cfg
----------------------------------------------------------------------------
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/home/hadoop/data/zookeeper
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1=node01.sick-net:2888:3888
server.2=node02.sick-net:2888:3888
server.3=node03.sick-net:2888:3888
----------------------------------------------------------------------------

cp zookeeper-3.5.3-beta/bin/zkEnv.sh zookeeper-3.5.3-beta/bin/zkEnv.sh.orig

edit zookeeper-3.5.3-beta/bin/zkEnv.sh and change:
ZK_SERVER_HEAP="${ZK_SERVER_HEAP:-1000}"
to:
ZK_SERVER_HEAP="${ZK_SERVER_HEAP:-256}"

echo 'export JVMFLAGS="-Xms128m -Xmx256m"' > zookeeper-3.5.3-beta/conf/java.env

17) On all the nodes create the directory /home/hadoop/data:

mkdir /home/hadoop/data

18) On the name nodes create the directories /home/hadoop/data/namenode and /home/hadoop/data/jn:

mkdir /home/hadoop/data/namenode ; mkdir /home/hadoop/data/jn

19) On the storage nodes create the directory /home/hadoop/data/datanode:

mkdir /home/hadoop/data/datanode

20) On the storage nodes modify hdfs-site.xml adding only this property:

------------------------------------------------
<property><name>dfs.datanode.data.dir</name><value>file:///home/hadoop/data/datanode</value></property>
------------------------------------------------

21) On the zookeeper nodes (node01, node02, node03) create the myid file in the zookeeper data directory, with a different, increasing id on each node:

echo "1" > data/zookeeper/myid   (on node01)
echo "2" > data/zookeeper/myid   (on node02)
echo "3" > data/zookeeper/myid   (on node03)

22) Start the journalnode and the zookeeper server on node01 node02 node03 and use jps to check that they started:

hdfs --daemon start journalnode
zkServer.sh start
jps

23) Format and start the namenode and the zookeeper failover controller on node01:

hdfs namenode -format
hdfs zkfc -formatZK
hdfs --daemon start namenode
hdfs --daemon start zkfc
jps

24) Execute these commands on node02 to copy over the hdfs metadata and then start the namenode and the zookeeper failover controller:

hdfs namenode -bootstrapStandby
hdfs --daemon start namenode
hdfs --daemon start zkfc
jps

25) Execute these commands on node03 to copy over the hdfs metadata and then start the namenode and the zookeeper failover controller:

hdfs namenode -bootstrapStandby
hdfs --daemon start namenode
hdfs --daemon start zkfc
jps

26) Start the data node on node04 node05 node06:

hdfs --daemon start datanode
jps

27) To stop the cluster:

ssh hadoop@node06 "bash -ls -c 'hdfs --daemon stop datanode'"
ssh hadoop@node05 "bash -ls -c 'hdfs --daemon stop datanode'"
ssh hadoop@node04 "bash -ls -c 'hdfs --daemon stop datanode'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon stop zkfc'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon stop namenode'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon stop zkfc'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon stop namenode'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon stop zkfc'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon stop namenode'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon stop journalnode'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon stop journalnode'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon stop journalnode'"
ssh hadoop@node03 "bash -ls -c 'zkServer.sh stop'"
ssh hadoop@node02 "bash -ls -c 'zkServer.sh stop'"
ssh hadoop@node01 "bash -ls -c 'zkServer.sh stop'"

28) To start the cluster (for example after a reboot):

ssh hadoop@node01 "bash -ls -c 'hdfs --daemon start journalnode'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon start journalnode'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon start journalnode'"
ssh hadoop@node01 "bash -ls -c 'zkServer.sh start'"
ssh hadoop@node02 "bash -ls -c 'zkServer.sh start'"
ssh hadoop@node03 "bash -ls -c 'zkServer.sh start'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon start namenode'"
ssh hadoop@node01 "bash -ls -c 'hdfs --daemon start zkfc'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon start namenode'"
ssh hadoop@node02 "bash -ls -c 'hdfs --daemon start zkfc'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon start namenode'"
ssh hadoop@node03 "bash -ls -c 'hdfs --daemon start zkfc'"
ssh hadoop@node04 "bash -ls -c 'hdfs --daemon start datanode'"
ssh hadoop@node05 "bash -ls -c 'hdfs --daemon start datanode'"
ssh hadoop@node06 "bash -ls -c 'hdfs --daemon start datanode'"

29) To check the cluster status connect with a browser to:

http://node01.sick-net:50070

and give these commands from the console:

hdfs haadmin -getServiceState node01
hdfs haadmin -getServiceState node02
hdfs haadmin -getServiceState node03

30) To use the hdfs filesystem:

hadoop fs -ls /
hadoop fs -mkdir /test/
touch file.txt
hadoop fs -put file.txt /test/
hadoop fs -ls /test/
hadoop fs -df -h
hdfs fsck / -files -showprogress

31) To enable erasure coding on a directory (the default policy really wants more than our 3 datanodes, but we are just experimenting):

hadoop fs -mkdir /erasure/
hdfs ec -getPolicy -path /erasure
(should reply: The erasure coding policy of /erasure is unspecified)
hdfs ec -setPolicy -path /erasure -policy RS-6-3-1024k
(should reply: Set erasure coding policy RS-6-3-1024k on /erasure)
hdfs ec -getPolicy -path /erasure
(should reply: RS-6-3-1024k)

32) To install a client outside of the cluster:

egrep -i '(java|jdk|hadoop|zoo|PS)' .profile
--------------------------------------------------------
PATH=$HOME/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin:/usr/local/bin:/usr/local/sbin:/usr/games:/usr/local/jdk-1.8.0/bin:$HOME/hadoop-3.0.0-beta1/bin:$HOME/hadoop-3.0.0-beta1/sbin:$HOME/zookeeper-3.5.3-beta/bin:.
export JAVA_HOME=/usr/local/jdk-1.8.0
export HADOOP_HOME=/home/user/hadoop-3.0.0-beta1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
--------------------------------------------------------

cd /home/user
ftp http://mirror.nohup.it/apache/hadoop/common/hadoop-3.0.0-beta1/hadoop-3.0.0-beta1.tar.gz
tar xvfz hadoop-3.0.0-beta1.tar.gz
cd /home/user/hadoop-3.0.0-beta1/etc/hadoop ; cp hadoop-env.sh hadoop-env.sh.orig

tail -n5 hadoop-env.sh
--------------------------------------------
HADOOP_HOME=/home/user/hadoop-3.0.0-beta1
export HADOOP_HOME
JAVA_HOME=/usr/local/jdk-1.8.0
export JAVA_HOME
--------------------------------------------

( and repeat step 15 for the client if you didn't do it before! )

33) To test it we can try to synchronize a directory full of iso images:

hadoop fs -mkdir /user/
hadoop distcp file:///home/user/iso/linux hdfs://hadoop/user/

34) Obligatory screenshot pr0n:

http://www.sickness.it/openbsd_hadoop01.png
http://www.sickness.it/openbsd_hadoop02.png
http://www.sickness.it/openbsd_hadoop03.png

!!!KEEP IN MIND!!!
With this HA setup you can lose any 2 nodes without losing data and the cluster will keep working, but the cluster is running in *insecure* mode: any user can connect, impersonate other users and delete or modify data, so it is strongly advised to configure it in secure mode with kerberos!
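
( after a full start it is handy to check which JVMs ended up running where; a minimal sketch, reusing the same bash -ls -c trick as above: )

for n in node01 node02 node03 node04 node05 node06; do
    echo "== $n =="
    ssh hadoop@$n "bash -ls -c 'jps'"
done

( node01 node02 node03 should list NameNode, DFSZKFailoverController, JournalNode and QuorumPeerMain, while node04 node05 node06 should list DataNode )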