Hadoop 3.2.0: Installation

I would like to share what I have learned and applied in the hopes that it will help someone else configure their system. The deployment described here is one NameNode and one or more DataNodes on Ubuntu 16.04, assuming each node has 5 CPUs and 13GB of RAM. I will include every command used in this tutorial, right down to the very basics, for those who are new to Ubuntu.

NOTE: Sometimes you may have to put "sudo" in front of a command. I use nano throughout this article because it is beginner friendly, but you can use any editor you prefer (e.g. vi). Also, this article does not take SSL, Kerberos, etc. into consideration; for all purposes here Hadoop will be open, without requiring a login.

Additional Setup/Configurations to Consider:

Zookeeper: It is also a good idea to use ZooKeeper to synchronize your configuration.

Secondary NameNode: This should run on a separate server, and its function is to take checkpoints of the NameNode's file system.

Rack Awareness: Fault tolerance to ensure blocks are placed as evenly as possible on different racks, if they are available.

Apply the following to all NameNode and DataNodes unless otherwise directed:

Hadoop User:
For this example we will just use hduser as our group and user for simplicity's sake.
The "-a" option on usermod appends the user to the group given with "-G".

sudo addgroup hduser
# create the hduser user in the hduser group (needed later when we run "su hduser")
sudo adduser --ingroup hduser hduser
sudo gpasswd -a $USER sudo
sudo usermod -a -G sudo hduser
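
If you want to confirm that the group and user were set up correctly, a quick optional check (assuming the hduser names used above):

#Show the uid, gid and group memberships for hduser
id hduser
#Confirm hduser is a member of the sudo group
groups hduser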

Install JDK:

apt-get update
apt-get upgrade
apt-get install default-jdk

Install SSH:

apt-get install ssh
which ssh
which sshd

These two commands will check that ssh installed correctly and will return “/usr/bin/ssh” and “/usr/bin/sshd”

java -version

Use this to verify that Java installed correctly; it should return something like the following.

openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

System Configuration

nano ~/.bashrc

The .bashrc file is a script that is executed whenever a terminal session is started.
Add the following line to the end and save; it tells Java to prefer IPv4, which is what Hadoop expects.

export _JAVA_OPTIONS='-XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true'

source ~/.bashrc

sysctl.conf

Disable IPv6, as it causes issues in getting your server up and running.

nano /etc/sysctl.conf

Add the following to the end and save

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
#Change eth0 to what ifconfig has
net.ipv6.conf.eth0.disable_ipv6 = 1

Save and close sysctl.conf, then apply the change, verify it, and reboot:

sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
reboot

If disabling IPv6 was successful, the cat command above should return "1".
Sometimes you can hit the open file descriptor and open file limits. If you do encounter this issue, you may have to raise the ulimit and file descriptor limits. I have set some values below as an example, but you will have to figure out the best numbers for your specific case.

If you get "cannot stat /proc/sys/-p: No such file or directory", then you need to add /sbin/ to your PATH.

sudo nano ~/.bashrc
export PATH=$PATH:/sbin/
nano /etc/sysctl.conf

fs.file-max = 500000

sysctl -p

limits.conf

nano /etc/security/limits.conf

* soft nofile 60000
* hard nofile 60000

 reboot

Test Limits

You can now test the limits you applied to make sure they took.

ulimit -a
more /proc/sys/fs/file-max
more /proc/sys/fs/file-nr
lsof | wc -l

file-max: The maximum number of file handles the kernel will allocate
file-nr: How many file handles are currently allocated and in use
lsof | wc -l: How many files are currently open

You might be wondering why we installed ssh at the beginning. That is because Hadoop uses ssh to access its nodes. We need to eliminate the password requirement by setting up SSH keys. If asked for a filename just leave it blank and confirm with enter.

su hduser

Run this if you are not already logged in as the user we created in the Hadoop user section.

ssh-keygen -t rsa -P ""

You will get the below example as well as the fingerprint and randomart image.

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory ‘/home/hduser/.ssh’.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

If you get "No such file or directory", check the .ssh directory for the actual public key filename; by default it is "id_rsa.pub".

This will add the newly created key to the list of authorized keys so that Hadoop can use SSH without prompting for a password.
Now we check that it worked by running "ssh localhost". When asked whether you want to continue connecting, type "yes" and press enter; localhost will be permanently added to the list of known hosts.
Once we have done this on the NameNode and all DataNodes, run the following commands from the NameNode to each DataNode.

ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@DATANODEHOSTNAME
ssh DATANODEHOSTNAME

/etc/hosts Update

We need to update the hosts file.

sudo nano /etc/hosts

#Comment out line "127.0.0.1 localhost"

127.0.0.1 HOSTNAME localhost

Now we are getting to the part we have been waiting for.

Hadoop Installation:

NAMENODE: You will see this placeholder in the config files below. It can be the hostname, the static IP, or 0.0.0.0 so that all TCP ports are bound to all IPs of the server. You should also note that the masters and slaves files later in this tutorial can still use the hostname.
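
If you are not sure which hostname or static IP to use for the NAMENODE placeholder, the following standard Ubuntu commands will show them. This is only a convenience check and not part of the Hadoop setup itself:

#Show this machine's hostname (and fully qualified name, if configured)
hostname
hostname -f
#Show the IP addresses assigned to this machine
hostname -I
ip addr show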

Note: You could run rsync after setting up the initial Name Node configuration to copy it to each Data Node if you want. This would save initial Hadoop setup time. You do that by running the following command:

rsync -a /usr/local/hadoop/ hduser@DATANODEHOSTNAME:/usr/local/hadoop/

Download & Extract:

wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz 
tar xvzf hadoop-3.2.0.tar.gz
sudo mv hadoop-3.2.0/ /usr/local/hadoop
chown -R hduser:hduser /usr/local/hadoop
update-alternatives --config java

The above downloads and extracts Hadoop, moves the extracted directory to /usr/local/hadoop, switches ownership to hduser if it does not already own the new directory,
and tells us the path where Java was installed so that we can set the JAVA_HOME environment variable. It should return something like the following:

There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

nano ~/.bashrc

Add the following to the end of the file. Make sure to do this on Name Node and all Data Nodes:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_HOME=$HADOOP_INSTALL

export HDFS_NAMENODE_USER=hduser
export HDFS_DATANODE_USER=hduser
export HDFS_SECONDARYNAMENODE_USER=hduser

#HADOOP VARIABLES END

source ~/.bashrc
javac -version
which javac
readlink -f /usr/bin/javac

This validates that the bashrc update worked!
javac -version should return "javac 1.8.0_171" or something similar
which javac should return "/usr/bin/javac"
readlink should return "/usr/lib/jvm/java-8-openjdk-amd64/bin/javac"

Memory Tools

There is an application from HortonWorks you can download which can help get you started on how to set up memory utilization for YARN. I found it a great starting point, but you will need to tweak it for your specific case.

wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
cd hdp_manual_install_rpm_helper_files-2.6.0.3.8/
sudo apt-get install python2.7
python2.7 scripts/yarn-utils.py -c 5 -m 13 -d 1 -k False

-c is how many cores you have
-m is how much memory you have (in GB)
-d is how many disks you have
-k is False if you are not running HBase, True if you are.

After the script is run it will give you guidelines on yarn/mapreduce settings. Remember they are guidelines; tweak as needed.
Now the real fun begins!!! Remember that these settings are what worked for me and you may need to adjust them.

hadoop-env.sh

nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

You will see JAVA_HOME near the beginning of the file; you will need to change it to where Java is installed on your system.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,DRFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,RFAAUDIT} $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS=$HADOOP_NAMENODE_OPTS
export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"

mkdir -p /app/hadoop/tmp

This is the temp directory hadoop uses

chown hduser:hduser /app/hadoop/tmp

core-site.xml

See the Apache Hadoop documentation for core-site.xml for more details.

nano /usr/local/hadoop/etc/hadoop/core-site.xml

This file contains configuration properties that Hadoop uses when starting up. By default its configuration element is empty; it will need to be changed as follows.

<configuration>
      <property>
            <name>fs.defaultFS</name>
            <value>hdfs://NAMENODE:54310</value>
            <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
      </property>
      <property>
            <name>hadoop.tmp.dir</name>
            <value>/app/hadoop/tmp</value>
      </property>
      <property>
            <name>hadoop.proxyuser.hduser.hosts</name>
            <value>*</value>
      </property>
      <property>
            <name>hadoop.proxyuser.hduser.groups</name>
            <value>*</value>
      </property>
</configuration>
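
Once the Hadoop environment variables from the bashrc section are loaded, you can sanity check that this file is being picked up; the value returned should match the fs.defaultFS you set above:

#Print the configured default filesystem URI
hdfs getconf -confKey fs.defaultFS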

yarn-site.xml

See the Apache Hadoop documentation for yarn-site.xml for more details. By default its configuration element is empty; it will need to be changed as follows.

nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
      <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
      </property>
      <property>
            <name>yarn.resourcemanager.scheduler.class</name>
            <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
      </property>
      <property>
            <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>12288</value>
            <final>true</final>
      </property>
      <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>4096</value>
            <final>true</final>
      </property>
      <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>12288</value>
            <final>true</final>
      </property>
      <property>
            <name>yarn.app.mapreduce.am.resource.mb</name>
            <value>4096</value>
      </property>
      <property>
            <name>yarn.app.mapreduce.am.command-opts</name>
            <value>-Xmx3276m</value>
      </property>
      <property>
            <name>yarn.nodemanager.local-dirs</name>
            <value>/app/hadoop/tmp/nm-local-dir</value>
      </property>
      <!--LOG-->
      <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
      </property>
      <property>
            <description>Where to aggregate logs to.</description>
            <name>yarn.nodemanager.remote-app-log-dir</name>
            <value>/tmp/yarn/logs</value>
      </property>
      <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>604800</value>
      </property>
      <property>
            <name>yarn.log-aggregation.retain-check-interval-seconds</name>
            <value>86400</value>
      </property>
      <property>
            <name>yarn.log.server.url</name>
            <value>http://NAMENODE:19888/jobhistory/logs/</value>
      </property>
      
      <!--URLs-->
      <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>${yarn.resourcemanager.hostname}:8025</value>
      </property>
      <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>${yarn.resourcemanager.hostname}:8030</value>
      </property>
      <property>
            <name>yarn.resourcemanager.address</name>
            <value>${yarn.resourcemanager.hostname}:8050</value>
      </property>
      <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value>${yarn.resourcemanager.hostname}:8033</value>
      </property>
      <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>${yarn.nodemanager.hostname}:8088</value>
      </property>
      <property>
            <name>yarn.nodemanager.hostname</name>
            <value>0.0.0.0</value>
      </property>
      <property>
            <name>yarn.nodemanager.address</name>
            <value>${yarn.nodemanager.hostname}:0</value>
      </property>
      <property>
            <name>yarn.nodemanager.webapp.address</name>
            <value>${yarn.nodemanager.hostname}:8042</value>
      </property>
</configuration>
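
Later, after the cluster is up and running (see the start section below), you can confirm that the NodeManagers registered and that the memory values above took effect. A small hedged check, using the same NAMENODE placeholder as the rest of this tutorial:

#List the NodeManagers known to the ResourceManager
yarn node -list
#Query the ResourceManager REST API for cluster totals (memory, vcores, nodes)
curl http://NAMENODE:8088/ws/v1/cluster/metrics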


mapred-site.xml

See the Apache Hadoop documentation for mapred-site.xml for more details. If the /usr/local/hadoop/etc/hadoop/ folder contains only a mapred-site.xml.template file, it has to be copied/renamed to mapred-site.xml (Hadoop 3.x typically ships mapred-site.xml directly, in which case you can skip the cp below). By default its configuration element is empty; it will need to be changed as follows.

cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
      <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
      </property>
      <property>
            <name>mapreduce.jobhistory.address</name>
            <value>0.0.0.0:10020</value>
      </property>
      <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>0.0.0.0:19888</value>
      </property>
      <property>
            <name>mapreduce.jobtracker.address</name>
            <value>0.0.0.0:54311</value>
      </property>
      <property>
            <name>mapreduce.jobhistory.admin.address</name>
            <value>0.0.0.0:10033</value>
      </property>
      <!-- Memory and concurrency tuning -->
      <property>
            <name>mapreduce.map.memory.mb</name>
            <value>4096</value>
      </property>
      <property>
            <name>mapreduce.map.java.opts</name>
            <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
      </property>
      <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>4096</value>
      </property>
      <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
      </property>
      <property>
            <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
            <value>0.5</value>
      </property>
      <property>
            <name>mapreduce.task.io.sort.mb</name>
            <value>600</value>
      </property>
      <property>
            <name>mapreduce.task.io.sort.factor</name>
            <value>1638</value>
      </property>
      <property>
            <name>mapreduce.map.sort.spill.percent</name>
            <value>0.50</value>
      </property>
      <property>
            <name>mapreduce.map.speculative</name>
            <value>false</value>
      </property>
      <property>
            <name>mapreduce.reduce.speculative</name>
            <value>false</value>
      </property>
      <property>
            <name>mapreduce.task.timeout</name>
            <value>1800000</value>
      </property>
</configuration>

yarn-env.sh

nano /usr/local/hadoop/etc/hadoop/yarn-env.sh

Change, uncomment, or add the following:

JAVA_HEAP_MAX=-Xmx2000m
HADOOP_OPTS="$HADOOP_OPTS -server -Dhadoop.log.dir=$YARN_LOG_DIR"
HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

Master

Add the namenode hostname.

nano /usr/local/hadoop/etc/hadoop/masters

APPLY THE FOLLOWING TO THE NAMENODE ONLY

Slaves

Add the namenode hostname and all datanode hostnames. Note that in Hadoop 3.x the slaves file was renamed to workers; if your etc/hadoop directory has a workers file instead of slaves, edit that one (here and in the DataNode section below).

nano /usr/local/hadoop/etc/hadoop/slaves

hdfs-site.xml

See the Apache Hadoop documentation for hdfs-site.xml for more details. By default its configuration element is empty; it will need to be changed. The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured on each host in the cluster. Before editing this file, we need to create the namenode directory.

mkdir -p /usr/local/hadoop_store/data/namenode
chown -R hduser:hduser /usr/local/hadoop_store
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
      <property>
            <name>dfs.replication</name>
            <value>3</value>
            <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
      </property>
      <property>
            <name>dfs.permissions</name>
            <value>false</value>
      </property>
      <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/usr/local/hadoop_store/data/namenode</value>
      </property>
      <property>
            <name>dfs.datanode.use.datanode.hostname</name>
            <value>false</value>
      </property>
      <property>
            <name>dfs.blocksize</name>
            <value>128m</value>
      </property>
      <property>
            <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
            <value>false</value>
      </property>
      
      <!-- URL -->
      <property>
            <name>dfs.namenode.http-address</name>
            <value>${dfs.namenode.http-bind-host}:50070</value>
            <description>Your NameNode hostname for http access.</description>
      </property>
      <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>${dfs.namenode.http-bind-host}:50090</value>
            <description>Your Secondary NameNode hostname for http access.</description>
      </property>
      <property>
            <name>dfs.datanode.http.address</name>
            <value>${dfs.namenode.http-bind-host}:50075</value>
      </property>
      <property>
            <name>dfs.datanode.address</name>
            <value>${dfs.namenode.http-bind-host}:50076</value>
      </property>
      <property>
            <name>dfs.namenode.http-bind-host</name>
            <value>0.0.0.0</value>
      </property>
      <property>
            <name>dfs.namenode.rpc-bind-host</name>
            <value>0.0.0.0</value>
      </property>
      <property>
             <name>dfs.namenode.servicerpc-bind-host</name>
             <value>0.0.0.0</value>
      </property>
</configuration>
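
Once HDFS has been formatted and started (see below), you can sanity check a couple of the values above from the command line:

#Print the NameNode(s) the configuration points at
hdfs getconf -namenodes
#Print the configured default replication factor
hdfs getconf -confKey dfs.replication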

APPLY THE FOLLOWING TO THE DATANODE(s) ONLY

Slaves

Add only that DataNode's hostname.

nano /usr/local/hadoop/etc/hadoop/slaves

hdfs-site.xml

The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured on each host in the cluster. Before editing this file, we need to create the datanode directory.
By default its configuration element is empty; it will need to be changed as follows.

mkdir -p /usr/local/hadoop_store/data/datanode
chown -R hduser:hduser /usr/local/hadoop_store
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
      <property>
            <name>dfs.replication</name>
            <value>3</value>
            <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
      </property>
      <property>
            <name>dfs.permissions</name>
            <value>false</value>
      </property>
      <property>
            <name>dfs.blocksize</name>
            <value>128m</value>
      </property>
      <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/usr/local/hadoop_store/data/datanode</value>
      </property>
      <property>
            <name>dfs.datanode.use.datanode.hostname</name>
            <value>false</value>
      </property>
      <property>
            <name>dfs.namenode.http-address</name>
            <value>${dfs.namenode.http-bind-host}:50070</value>
            <description>Your NameNode hostname for http access.</description>
      </property>
      <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>${dfs.namenode.http-bind-host}:50090</value>
            <description>Your Secondary NameNode hostname for http access.</description>
      </property>
      <property>
            <name>dfs.datanode.http.address</name>
            <value>${dfs.namenode.http-bind-host}:50075</value>
      </property>
      <property>
            <name>dfs.datanode.address</name>
            <value>${dfs.namenode.http-bind-host}:50076</value>
      </property>
      <property>
            <name>dfs.namenode.http-bind-host</name>
            <value>0.0.0.0</value>
      </property>
      <property>
            <name>dfs.namenode.rpc-bind-host</name>
            <value>0.0.0.0</value>
      </property>
      <property>
             <name>dfs.namenode.servicerpc-bind-host</name>
             <value>0.0.0.0</value>
      </property>
</configuration>

If you have the Ubuntu firewall (ufw) enabled, you need to allow the necessary ports through it.

sudo ufw allow 50070
sudo ufw allow 8088
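
50070 and 8088 cover the NameNode and ResourceManager web UIs. Depending on which of the other ports used in the configs above (and listed in the Web Ports section at the end) you need to reach from other machines, you may also have to open those, for example:

#HDFS NameNode RPC port from core-site.xml
sudo ufw allow 54310
#DataNode and Secondary NameNode web UIs
sudo ufw allow 50075
sudo ufw allow 50090
#Job history web UI
sudo ufw allow 19888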

Format Cluster:
Only do this if NO data is present. All data will be destroyed when the following is done.
This is to be done on NAMENODE ONLY!

hdfs namenode -format

Start The Cluster:
You can now start the cluster.
You do this from the NAMENODE ONLY.

start-dfs.sh
start-yarn.sh
mapred --config $HADOOP_CONF_DIR --daemon start historyserver

If the above three commands didn't work, something went wrong, as they should have been found in the /usr/local/hadoop/sbin/ directory that we added to PATH earlier.
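
For completeness, the matching stop commands (needed before any reboot, as noted in the warning below) are the mirror image of the start commands and are also run from the NameNode:

#Stop the cluster in the reverse order of starting it
mapred --config $HADOOP_CONF_DIR --daemon stop historyserver
stop-yarn.sh
stop-dfs.sh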

Cron Job:
You should probably setup a cron job to start the cluster when you reboot.

crontab -e

@reboot /usr/local/hadoop/sbin/start-dfs.sh > /home/hduser/dfs-start.log 2>&1
@reboot /usr/local/hadoop/sbin/start-yarn.sh > /home/hduser/yarn-start.log 2>&1
@reboot /usr/local/hadoop/bin/mapred --config $HADOOP_CONF_DIR --daemon start historyserver > /home/hduser/history-start.log 2>&1

Verification:
To check that everything is working as it should, run "jps" on the NAMENODE. It should return something like the following, where the pids will be different:

jps

You could also run "netstat -plten | grep java" or "lsof -i :50070" and "lsof -i :8088".

Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
12007 SecondaryNameNode
13090 Jps
12796 JobHistoryServer
12261 ResourceManager
11653 NameNode
12397 NodeManager
11792 DataNode

You can check the DATA NODES by ssh into each one and running “jps”. It should return something like the following where the pid will be different:

Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
3218 Jps
2215 NodeManager
2411 DataNode

If for any reason one of the services is not running, you need to review the logs. They can be found in /usr/local/hadoop/logs/. If it's the ResourceManager that isn't running, look at the log file that has "yarn" and "resourcemanager" in its name.
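
For example, to quickly look at the most recent ResourceManager log entries you could do something like the following (the exact file name includes the user and hostname, hence the wildcard):

#Show the tail end of the ResourceManager log
tail -n 100 /usr/local/hadoop/logs/*resourcemanager*.log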

WARNING:
Never reboot the system without first stopping the cluster. Once the cluster has shut down it is safe to reboot. Also, if you configured the @reboot cron job, make sure the DataNode machines are up and running before the NameNode starts, so that the NameNode's start scripts can automatically start the DataNode services for you.

Web Ports:

NameNode

  • 50070: HDFS Namenode
  • 50075: HDFS Datanode
  • 50090: HDFS Secondary Namenode
  • 8088: Resource Manager
  • 19888: Job History

DataNode

  • 50075: HDFS Datanode

NetStat

To check which IP each Hadoop port is listening on, run the following.

sudo netstat -ltnp

Port Check

If for some reason you are having issues connecting to a Hadoop port, run the following command while you try to connect on that port.

sudo tcpdump -n -tttt -i eth1 port 50070
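
As a final end-to-end check you can run one of the MapReduce examples that ship with Hadoop; if it completes, HDFS, YARN and the job history pieces are all wired together. A minimal sketch, assuming the default jar location inside the 3.2.0 tarball:

#Run the bundled pi estimator with 2 maps and 10 samples each
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar pi 2 10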

References

I used a lot of different resources and reference material for this, including various blog posts about memory utilization; unfortunately I did not save all of the relevant links.

Kafka: Kerberize/SSL

In this tutorial I will show you how to use Kerberos/SSL with Kafka. I will use self-signed certs for this example. Before you begin, ensure you have installed a Kerberos server and Kafka.

If you don't want to use the built-in ZooKeeper you can set up your own. To do that, follow this tutorial.

This assumes your hostname is “hadoop”

Create Kerberos Principals

cd /etc/security/keytabs/

sudo kadmin.local

#You can list principals
listprincs

#Create the following principals
addprinc -randkey kafka/hadoop@REALM.CA
addprinc -randkey zookeeper/hadoop@REALM.CA

#Create the keytab files.
#You will need these for Kafka and ZooKeeper to be able to log in
xst -k kafka.service.keytab kafka/hadoop@REALM.CA
xst -k zookeeper.service.keytab zookeeper/hadoop@REALM.CA
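
Optionally, you can verify the keytabs actually work before wiring them into Kafka by authenticating with them directly:

#Obtain a ticket using the kafka keytab, list it, then throw it away
sudo kinit -kt /etc/security/keytabs/kafka.service.keytab kafka/hadoop@REALM.CA
klist
kdestroy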

Set Keytab Permissions/Ownership

sudo chown root:hadoopuser /etc/security/keytabs/*
sudo chmod 750 /etc/security/keytabs/*

Hosts Update

sudo nano /etc/hosts

#Remove 127.0.1.1 line

#Change 127.0.0.1 to the following
127.0.0.1 realm.ca hadoop localhost

Ubuntu Firewall

sudo ufw disable

SSL

Setup SSL Directories if you have not previously done so.

sudo mkdir -p /etc/security/serverKeys
sudo chown -R root:hadoopuser /etc/security/serverKeys/
sudo chmod 755 /etc/security/serverKeys/

cd /etc/security/serverKeys

Setup Keystore

sudo keytool -genkey -alias NAMENODE -keyalg RSA -keysize 1024 -dname "CN=NAMENODE,OU=ORGANIZATION_UNIT,C=canada" -keypass PASSWORD -keystore /etc/security/serverKeys/keystore.jks -storepass PASSWORD
sudo keytool -export -alias NAMENODE -keystore /etc/security/serverKeys/keystore.jks -rfc -file /etc/security/serverKeys/NAMENODE.csr -storepass PASSWORD

Setup Truststore

sudo keytool -import -noprompt -alias NAMENODE -file /etc/security/serverKeys/NAMENODE.csr -keystore /etc/security/serverKeys/truststore.jks -storepass PASSWORD

Generate Self Signed Certificate

sudo openssl genrsa -out /etc/security/serverKeys/NAMENODE.key 2048

sudo openssl req -x509 -new -key /etc/security/serverKeys/NAMENODE.key -days 300 -out /etc/security/serverKeys/NAMENODE.pem

sudo keytool -keystore /etc/security/serverKeys/keystore.jks -alias NAMENODE -certreq -file /etc/security/serverKeys/NAMENODE.cert -storepass PASSWORD -keypass PASSWORD

sudo openssl x509 -req -CA /etc/security/serverKeys/NAMENODE.pem -CAkey /etc/security/serverKeys/NAMENODE.key -in /etc/security/serverKeys/NAMENODE.cert -out /etc/security/serverKeys/NAMENODE.signed -days 300 -CAcreateserial

Setup File Permissions

sudo chmod 440 /etc/security/serverKeys/*
sudo chown root:hadoopuser /etc/security/serverKeys/*
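
Optionally, sanity check the keystore, truststore and the self signed certificate before pointing Kafka at them (PASSWORD is the same placeholder used above):

#List the entries in the keystore and truststore
sudo keytool -list -v -keystore /etc/security/serverKeys/keystore.jks -storepass PASSWORD
sudo keytool -list -v -keystore /etc/security/serverKeys/truststore.jks -storepass PASSWORD
#Show the subject and validity dates of the self signed certificate
sudo openssl x509 -in /etc/security/serverKeys/NAMENODE.pem -noout -subject -dates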

Edit server.properties Config

cd /usr/local/kafka/config

sudo nano server.properties

#Edit or Add the following properties.
ssl.endpoint.identification.algorithm=HTTPS
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
ssl.key.password=PASSWORD
ssl.keystore.location=/etc/security/serverKeys/keystore.jks
ssl.keystore.password=PASSWORD
ssl.truststore.location=/etc/security/serverKeys/truststore.jks
ssl.truststore.password=PASSWORD
listeners=SASL_SSL://:9094
security.inter.broker.protocol=SASL_SSL
ssl.client.auth=required
authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
ssl.keystore.type=JKS
ssl.truststore.type=JKS
sasl.kerberos.service.name=kafka
zookeeper.connect=hadoop:2181
sasl.mechanism.inter.broker.protocol=GSSAPI
sasl.enabled.mechanisms=GSSAPI
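
Once the broker is running with this configuration, a quick hedged check that it is actually answering TLS on port 9094 (using the hadoop hostname from above):

#Attempt a TLS handshake against the SASL_SSL listener
openssl s_client -connect hadoop:9094 -tls1_2 < /dev/null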

Edit zookeeper.properties Config

sudo nano zookeeper.properties

#Edit or Add the following properties.

server.1=hadoop:2888:3888
clientPort=2181
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=SASL
jaasLoginRenew=3600000

Edit producer.properties Config

sudo nano producer.properties

bootstrap.servers=hadoop:9094
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=/etc/security/serverKeys/truststore.jks
ssl.truststore.password=PASSWORD
ssl.keystore.location=/etc/security/serverKeys/keystore.jks
ssl.keystore.password=PASSWORD
ssl.key.password=PASSWORD
sasl.mechanism=GSSAPI

Edit consumer.properties Config

sudo nano consumer.properties

zookeeper.connect=hadoop:2181
bootstrap.servers=hadoop:9094
group.id=securing-kafka-group
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=/etc/security/serverKeys/truststore.jks
ssl.truststore.password=PASSWORD
sasl.mechanism=GSSAPI

Add zookeeper_jaas.conf Config

sudo nano zookeeper_jaas.conf

Server {
  com.sun.security.auth.module.Krb5LoginModule required
  debug=true
  useKeyTab=true
  keyTab="/etc/security/keytabs/zookeeper.service.keytab"
  storeKey=true
  useTicketCache=true
  refreshKrb5Config=true
  principal="zookeeper/hadoop@REALM.CA";
};

Add kafkaserver_jaas.conf Config

sudo nano kafkaserver_jaas.conf

KafkaServer {
    com.sun.security.auth.module.Krb5LoginModule required
    debug=true
    useKeyTab=true
    storeKey=true
    refreshKrb5Config=true
    keyTab="/etc/security/keytabs/kafka.service.keytab"
    principal="kafka/hadoop@REALM.CA";
};

KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useTicketCache=true
    refreshKrb5Config=true
    debug=true
    useKeyTab=true
    storeKey=true
    keyTab="/etc/security/keytabs/kafka.service.keytab"
    principal="kafka/hadoop@REALM.CA";
};

Edit kafka-server-start.sh

cd /usr/local/kafka/bin/

sudo nano kafka-server-start.sh

jaas="$base_dir/../config/kafkaserver_jaas.conf"

export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=$jaas"

Edit zookeeper-server-start.sh

sudo nano zookeeper-server-start.sh

jaas="$base_dir/../config/zookeeper_jaas.conf"

export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=$jaas"
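
With the JAAS configs in place and the start scripts edited, you should now be able to start ZooKeeper and then the broker using the stock scripts (the -daemon flag runs them in the background):

cd /usr/local/kafka/bin/
./zookeeper-server-start.sh -daemon ../config/zookeeper.properties
./kafka-server-start.sh -daemon ../config/server.properties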

Kafka-ACL

cd /usr/local/kafka/bin/

#Grant topic access and cluster access
./kafka-acls.sh  --operation All --allow-principal User:kafka --authorizer-properties zookeeper.connect=hadoop:2181 --add --cluster
./kafka-acls.sh  --operation All --allow-principal User:kafka --authorizer-properties zookeeper.connect=hadoop:2181 --add --topic TOPIC

#Grant all groups for a specific topic
./kafka-acls.sh --operation All --allow-principal User:kafka --authorizer-properties zookeeper.connect=hadoop:2181 --add --topic TOPIC --group *

#If you want to remove cluster access
./kafka-acls.sh --authorizer-properties zookeeper.connect=hadoop:2181 --remove --cluster

#If you want to remove topic access
./kafka-acls.sh --authorizer-properties zookeeper.connect=hadoop:2181 --remove --topic TOPIC

#List access for cluster
./kafka-acls.sh --list --authorizer-properties zookeeper.connect=hadoop:2181 --cluster

#List access for topic
./kafka-acls.sh --list --authorizer-properties zookeeper.connect=hadoop:2181 --topic TOPIC

kafka-console-producer.sh

If you want to test using the console producer you need to make these changes.

cd /usr/local/kafka/bin/
nano kafka-console-producer.sh

#Add the below before the last line

base_dir=$(dirname $0)
jaas="$base_dir/../config/kafkaserver_jaas.conf"
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=$jaas"


#Now you can run the console producer
./kafka-console-producer.sh --broker-list hadoop:9094 --topic TOPIC --producer.config ../config/producer.properties

kafka-console-consumer.sh

If you want to test using the console consumer you need to make these changes.

cd /usr/local/kafka/bin/
nano kafka-console-consumer.sh

#Add the below before the last line

base_dir=$(dirname $0)
jaas="$base_dir/../config/kafkaserver_jaas.conf"
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=$jaas"


#Now you can run the console consumer
./kafka-console-consumer.sh --bootstrap-server hadoop:9094 --topic TOPIC --consumer.config ../config/consumer.properties --from-beginning

References

https://www.confluent.io/blog/apache-kafka-security-authorization-authentication-encryption/
https://github.com/confluentinc/securing-kafka-blog/blob/master/manifests/default.pp

Hive & Java: Connect to Remote Kerberos Hive using KeyTab

In this tutorial I will show you how to connect to a remote Kerberized Hive cluster using Java. If you haven't installed Hive yet, follow the tutorial.

Import SSL Cert to Java:

Follow this tutorial on "Installing unlimited strength encryption Java libraries".

If you are on Windows, do the following:

#Import it
"C:\Program Files\Java\jdk1.8.0_171\bin\keytool" -import -file hadoop.csr -keystore "C:\Program Files\Java\jdk1.8.0_171\jre\lib\security\cacerts" -alias "hadoop"

#Check it
"C:\Program Files\Java\jdk1.8.0_171\bin\keytool" -list -v -keystore "C:\Program Files\Java\jdk1.8.0_171\jre\lib\security\cacerts"

#If you want to delete it
"C:\Program Files\Java\jdk1.8.0_171\bin\keytool" -delete -alias hadoop -keystore "C:\Program Files\Java\jdk1.8.0_171\jre\lib\security\cacerts"

POM.xml:

<dependency>
	<groupId>org.apache.hive</groupId>
	<artifactId>hive-jdbc</artifactId>
	<version>2.3.3</version>
	<exclusions>
		<exclusion>
			<groupId>jdk.tools</groupId>
			<artifactId>jdk.tools</artifactId>
		</exclusion>
	</exclusions>
</dependency>

Imports:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

Connect:

// Setup the configuration object.
final Configuration config = new Configuration();

config.set("fs.defaultFS", "swebhdfs://hadoop:50470");
config.set("hadoop.security.authentication", "kerberos");
config.set("hadoop.rpc.protection", "integrity");

System.setProperty("https.protocols", "TLSv1,TLSv1.1,TLSv1.2");
System.setProperty("java.security.krb5.conf", "C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\lib\\security\\krb5.conf");
System.setProperty("java.security.krb5.realm", "REALM.CA");
System.setProperty("java.security.krb5.kdc", "REALM.CA");
System.setProperty("sun.security.krb5.debug", "true");
System.setProperty("javax.net.debug", "all");
System.setProperty("javax.net.ssl.keyStorePassword","changeit");
System.setProperty("javax.net.ssl.keyStore","C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\lib\\security\\cacerts");
System.setProperty("javax.net.ssl.trustStore", "C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\lib\\security\\cacerts");
System.setProperty("javax.net.ssl.trustStorePassword","changeit");
System.setProperty("javax.security.auth.useSubjectCredsOnly", "false");

UserGroupInformation.setConfiguration(config);
UserGroupInformation.setLoginUser(UserGroupInformation.loginUserFromKeytabAndReturnUGI("hive/hadoop@REALM.CA", "c:\\data\\hive.service.keytab"));

System.out.println(UserGroupInformation.getLoginUser());
System.out.println(UserGroupInformation.getCurrentUser());

//Add the hive driver
Class.forName("org.apache.hive.jdbc.HiveDriver");

//Connect to hive jdbc
Connection connection = DriverManager.getConnection("jdbc:hive2://hadoop:10000/default;principal=hive/hadoop@REALM.CA");
Statement statement = connection.createStatement();

//Create a table
String createTableSql = "CREATE TABLE IF NOT EXISTS "
		+" employee ( eid int, name String, "
		+" salary String, designation String)"
		+" COMMENT 'Employee details'"
		+" ROW FORMAT DELIMITED"
		+" FIELDS TERMINATED BY '\t'"
		+" LINES TERMINATED BY '\n'"
		+" STORED AS TEXTFILE";

System.out.println("Creating Table: " + createTableSql);
statement.executeUpdate(createTableSql);

//Show all the tables to ensure we successfully added the table
String showTablesSql = "show tables";
System.out.println("Show All Tables: " + showTablesSql);
ResultSet res = statement.executeQuery(showTablesSql);

while (res.next()) {
	System.out.println(res.getString(1));
}

//Drop the table
String dropTablesSql = "DROP TABLE IF EXISTS employee";

System.out.println("Dropping Table: " + dropTablesSql);
statement.executeUpdate(dropTablesSql);

System.out.println("Finish!");

ElasticSearch Installation

Installing ElasticSearch is really straightforward. I will be using Ubuntu 16.04 for this installation. Note that the commands below use the .rpm package; on Ubuntu you may prefer the equivalent .deb package, installed with dpkg -i, instead.

Java 8

java -version
#if not installed run the following
sudo apt-get install openjdk-8-jdk

Download

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.3.rpm

Directories

It is recommended to change the log and data directories from their default locations.

#create log and data directory
sudo mkdir /my/dir/log/elasticsearch
sudo mkdir /my/dir/elasticsearch

# Change owner
sudo chown -R elasticsearch /my/dir/log/elasticsearch
sudo chown -R elasticsearch /my/dir/elasticsearch

Install

sudo rpm -ivh elasticsearch-6.2.3.rpm

Change Settings

sudo vi /etc/elasticsearch/elasticsearch.yml

#Change the following settings
#----------SETTINGS-----------------
cluster.name: logsearch
node.name: ##THE_HOST_NAME##
node.master: true #The node is master eligable
node.data: true #Hold data and perform data related operations
path.data: /my/dir/elasticsearch
path.logs: /my/dir/log/elasticsearch
network.host: ##THE_HOST_NAME##
http.port: 9200
discovery.zen.ping.unicast.hosts: ["##THE_HOST_NAME##"]
#----------SETTINGS-----------------

Start/Stop/Status ElasticSearch

sudo service elasticsearch start
sudo service elasticsearch stop
sudo service elasticsearch status

Rest API

http://localhost:9200
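
You can confirm ElasticSearch is up by hitting the REST API with curl (replace localhost with the host name you configured in network.host):

#Basic node information
curl http://localhost:9200
#Cluster health summary
curl http://localhost:9200/_cluster/health?pretty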

 

NiFi Installation (Basic)

In this tutorial I will guide you through installing NiFi on Ubuntu 16.04 and setting it up to run as a service. We will assume you have a user called "hduser".

Install Java 8

sudo apt-get install openjdk-8-jdk

Install NiFi

wget http://mirror.dsrg.utoronto.ca/apache/nifi/1.8.0/nifi-1.8.0-bin.tar.gz
tar -xzf nifi-1.8.0-bin.tar.gz
sudo mv nifi-1.8.0/ /usr/local/nifi

Set Ownership:

 sudo chown -R hduser:hduser /usr/local/nifi

Setup .bashrc:

 sudo nano ~/.bashrc

Add the following to the end of the file.

#NIFI VARIABLES START
export NIFI_HOME=/usr/local/nifi
export NIFI_CONF_DIR=/usr/local/nifi/conf
export PATH=$PATH:$NIFI_HOME/bin
#NIFI VARIABLES STOP

 source ~/.bashrc

Install NiFi As Service

cd /usr/local/nifi/bin
sudo ./nifi.sh install
reboot

Start/Stop/Status Service

sudo service nifi start
sudo service nifi stop
sudo service nifi status

Your site is now available at http://localhost:8080/nifi

Uninstall

sudo rm /etc/rc2.d/S65nifi
sudo rm /etc/init.d/nifi
sudo rm /etc/rc2.d/K65nifi

sudo rm -R /usr/local/nifi/

Avro & Python: How to Schema, Write, Read

I have been experimenting with Apache Avro and Python. Below is what I have learned thus far.

Pip Install

At the time of this writing I am using version 1.8.2.

pip install avro-python3

Schema

There are so many different ways to work with the schema definition. There are primitive and complex types. You can find way more documentation on the schema definition here.

import json
import avro.schema

my_schema = avro.schema.Parse(json.dumps(
{
    'namespace': 'test.avro',
    'type': 'record',
    'name': 'MY_NAME',
    'fields': [
        {'name': 'name_1', 'type': 'int'},
        {'name': 'name_2', 'type': {'type': 'array', 'items': 'float'}},
        {'name': 'name_3', 'type': 'float'},
    ]
}))

Method 1

Write

from avro.datafile import DataFileWriter
from avro.io import DatumWriter
import io

#write binary
file = open(filename, 'wb')

datum_writer = DatumWriter()
fwriter = DataFileWriter(file, datum_writer, my_schema)
fwriter.append({'name_1': 645645, 'name_2': [5.6,34.7], 'name_3': 644.5645})
fwriter.close()

Write Deflate

from avro.datafile import DataFileWriter
from avro.io import DatumWriter

#write binary
file = open(filename, 'wb')

datum_writer = DatumWriter()
fwriter = DataFileWriter(file, datum_writer, my_schema, codec = 'deflate')
fwriter.append({'name_1': 645645, 'name_2': [5.6,34.7], 'name_3': 644.5645})
fwriter.close()

Append

from avro.datafile import DataFileWriter
from avro.io import DatumWriter
import io

#append binary
file = open(filename, 'a+b')

datum_writer = DatumWriter()
#Notice that the schema is not passed to the DataFileWriter. This is because you are appending to an existing avro file
fwriter = DataFileWriter(file, datum_writer)
fwriter.append({'name_1': 645675, 'name_2': [5.6,34.9], 'name_3': 649.5645})
fwriter.close()

Read Schema

from avro.datafile import DataFileReader
from avro.io import DatumReader

file = open(filename, 'rb')
datum_reader = DatumReader()
file_reader = DataFileReader(file, datum_reader)

print(file_reader.meta)

Read

from avro.datafile import DataFileReader
from avro.io import DatumReader

#read binary
fd = open(filename, 'rb')
datum_reader = DatumReader()
file_reader = DataFileReader(fd, datum_reader)

for datum in file_reader:
	print(datum['name_1'])
	print(datum['name_2'])
	print(datum['name_3'])
file_reader.close()

Method 2

Write/Append BinaryEncoder

import io
from avro.io import DatumWriter, BinaryEncoder

#write binary (use this to create a new file)
file = open(filename, 'wb')
#or append binary (uncomment this and comment out the line above to add to an existing file)
#file = open(filename, 'a+b')
bytes_writer = io.BytesIO()
encoder = BinaryEncoder(bytes_writer)
writer_binary = DatumWriter(my_schema)
writer_binary.write({'name_1': 645645, 'name_2': [5.6,34.7], 'name_3': 644.5645}, encoder)
file.write(bytes_writer.getvalue())
file.close()

Read BinaryDecoder

import io
from avro.io import DatumReader, BinaryDecoder

file = open(filename, 'rb')
bytes_reader = io.BytesIO(file.read())
decoder = BinaryDecoder(bytes_reader)
reader = DatumReader(my_schema)

while True:
	try:
		rec = reader.read(decoder)
		print(rec['name_1'])
		print(rec['name_2'])
		print(rec['name_3'])
	except:
		break

PIG: Testing

Apache PIG analyzes large data sets. There are a variety of ways of processing data using it. As I learn more about it I will put use cases below.

JSON:

REGISTER 'hdfs:///elephant-bird-core-4.15.jar';
REGISTER 'hdfs:///elephant-bird-hadoop-compat-4.15.jar';
REGISTER 'hdfs:///elephant-bird-pig-4.15.jar';
REGISTER 'hdfs:///json-simple-1.1.1.jar';

loadedJson = LOAD '/hdfs_dir/MyFile.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
rec = FOREACH loadedJson GENERATE json#'my_key' as (m:chararray);
DESCRIBE rec;
DUMP rec;

--Store the results in a hdfs dir. You can have HIVE query that directory
STORE rec INTO '/hdfs_dir' USING PigStorage();

Python: Connect To Hadoop

We can connect to Hadoop from Python using the PyWebHdfs package. For the purposes of this post we will use version 0.4.1. You can see all of its APIs in the package documentation.

To build a connection to Hadoop you first need to import it.

from pywebhdfs.webhdfs import PyWebHdfsClient

Then you build the connection like this.

HDFS_CONNECTION = PyWebHdfsClient(host=##HOST##, port='50070', user_name=##USER##)

To list the contents of a directory you do this.

HDFS_CONNECTION.list_dir(##HADOOP_DIR##)

Pulling a single file down from Hadoop is straightforward. Notice how we import "FileNotFound"; that is important when pulling a file in. You don't strictly need it, but "read_file" will raise that exception if the file is not found, so by default we should always include it.

from pywebhdfs.errors import FileNotFound

try:
	file_data = HDFS_CONNECTION.read_file(##FILENAME##)
except FileNotFound as e:
	print(e)
except Exception as e:
	print(e)

Build a Java Map Reduce Application

I will attempt to explain how to set up a Mapper, Reducer, Combiner, PathFilter, Partitioner and OutputFormat using Java Eclipse with Maven. If you need to know how to install Eclipse go here. Remember that these are not complete code, just snippets to get you going.

A starting point I used was this tutorial; however, it was built using older Hadoop code.

Mapper: Maps input key/value pairs to a set of intermediate key/value pairs.
Reducer: Reduces a set of intermediate values which share a key to a smaller set of values.
Partitioner: http://www.tutorialspoint.com/map_reduce/map_reduce_partitioner.htm
Combiner: http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm

First you will need to create a maven project. You can follow any tutorial on how to do that if you don’t know how.

pom.xml:

<properties>
      <hadoop.version>2.7.2</hadoop.version>
</properties>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoop.version}</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-api</artifactId>
      <version>${hadoop.version}</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-common</artifactId>
      <version>${hadoop.version}</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-auth</artifactId>
      <version>${hadoop.version}</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-nodemanager</artifactId>
      <version>${hadoop.version}</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
      <version>${hadoop.version}</version>
</dependency>

Job Driver:

public class JobDriver extends Configured implements Tool {
      private Configuration conf;
      private static String hdfsURI = "hdfs://localhost:54310";

      public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new JobDriver(), args);
            System.exit(res);
      }

      @Override
      public int run(String[] args) throws Exception {
            BasicConfigurator.configure();
            conf = this.getConf();

            //The paths for the configuration
            final String HADOOP_HOME = System.getenv("HADOOP_HOME");
            conf.addResource(new Path(HADOOP_HOME, "etc/hadoop/core-site.xml"));
            conf.addResource(new Path(HADOOP_HOME, "etc/hadoop/hdfs-site.xml"));
            conf.addResource(new Path(HADOOP_HOME, "etc/hadoop/yarn-site.xml"));
            hdfsURI = conf.get("fs.defaultFS");

            Job job = Job.getInstance(conf, YOURJOBNAME);
            //You can setup additional configuration information by doing the below.
            job.getConfiguration().set("NAME", "VALUE");

            job.setJarByClass(JobDriver.class);

            //If you are going to use a mapper class
            job.setMapperClass(MAPPERCLASS.class);

            //If you are going to use a combiner class
            job.setCombinerClass(COMBINERCLASS.class);

            //If you plan on splitting the output
            job.setPartitionerClass(PARTITIONERCLASS.class);
            job.setNumReduceTasks(NUMOFREDUCERS);

            //if you plan on use a reducer
            job.setReducerClass(REDUCERCLASS.class);

            //You need to set the output key and value types. We will just use Text for this example
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            //If you want to use an input filter class
            FileInputFormat.setInputPathFilter(job, INPUTPATHFILTER.class);

            //You must setup what the input path is for the files you want to parse. It takes either string or Path
            FileInputFormat.setInputPaths(job, inputPaths);

            //Once you parse the data you must put it somewhere.
            job.setOutputFormatClass(OUTPUTFORMATCLASS.class);
            FileOutputFormat.setOutputPath(job, new Path(OUTPUTPATH));

            return job.waitForCompletion(true) ? 0 : 1;
      }
}

INPUTPATHFILTER:

public class InputPathFilter extends Configured implements PathFilter {
      Configuration conf;
      FileSystem fs;
      Pattern includePattern = null;
      Pattern excludePattern = null;

      @Override
      public void setConf(Configuration conf) {
            this.conf = conf;

            if (conf != null) {
                  try {
                        fs = FileSystem.get(conf);

                        //If you want you can always pass in regex patterns from the job driver class and filter that way. Up to you!
                        if (conf.get("file.includePattern") != null)
                              includePattern = conf.getPattern("file.includePattern", null);

                        if (conf.get("file.excludePattern") != null)
                              excludePattern = conf.getPattern("file.excludePattern", null);
                  } catch (IOException e) {
                        e.printStackTrace();
                  }
            }
      }

      @Override
      public boolean accept(Path path) {
            //Here you could filter based on your include or exclude regex or file size.
            //Remember if you have sub directories you have to return true for that

            if (fs.isDirectory(path)) {
                  return true;
            }
            else {
                  //You can also do this to get the file size in case you want to do anything when files are a certain size, etc
                  FileStatus file = fs.getFileStatus(path);
                  String size = FileUtils.byteCountToDisplaySize(file.getLen());

                  //You can also move files in this section
                  boolean move_success = fs.rename(path, new Path(NEWPATH + path.getName()));

                  //Return a boolean based on your include/exclude logic; returning true accepts the file
                  return true;
            }
      }
}

MAPPERCLASS:

//Remember at the beginning I said we will use key and value as Text. That is the second part of the extends mapper
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
      //Do whatever setup you would like. Remember in the job drive you could set things to configuration well you can access them here now
      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
            super.setup(context);
            Configuration conf = context.getConfiguration();
      }

      //This is the main map method.
      @Override
      public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            //This will get the file name you are currently processing if you want. However not necessary.
            String filename = ((FileSplit) context.getInputSplit()).getPath().toString();

            //Do whatever you want in the mapper. The context is what you print out to.

            //If you want to embed javascript go <a href="http://www.gaudreault.ca/java-embed-javascript/" target="_blank">here</a>.
            //If you want to embed Python go <a href="http://www.gaudreault.ca/java-embed-python/" target="_blank">here</a>.
      }
}

If you decided to embed Python or JavaScript you will need these scripts as an example. map_python and map

COMBINERCLASS:

public class MyCombiner extends Reducer<Text, Text, Text, Text> {
      //Do whatever setup you would like. Remember in the job drive you could set things to configuration well you can access them here now
      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
            super.setup(context);
            Configuration conf = context.getConfiguration();
      }

      @Override
      protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException {
            //Do whatever you want in the mapper. The context is what you print out to.
            //If you want to embed javascript go <a href="http://www.gaudreault.ca/java-embed-javascript/" target="_blank">here</a>.
            //If you want to embed Python go <a href="http://www.gaudreault.ca/java-embed-python/" target="_blank">here</a>.
      }
}

If you decided to embed Python or JavaScript you will need these scripts as an example. combiner_python and combiner_js

REDUCERCLASS:

public class MyReducer extends Reducer<Text, Text, Text, Text> {
      //Do whatever setup you would like. Remember in the job drive you could set things to configuration well you can access them here now
      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
            super.setup(context);
            Configuration conf = context.getConfiguration();
      }

      @Override
      protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException {
            //Do whatever you want in the mapper. The context is what you print out to.
            //If you want to embed javascript go <a href="http://www.gaudreault.ca/java-embed-javascript/" target="_blank">here</a>.
            //If you want to embed Python go <a href="http://www.gaudreault.ca/java-embed-python/" target="_blank">here</a>.
      }
}

If you decided to embed Python or JavaScript you will need these scripts as an example. reduce_python and reduce_js

PARTITIONERCLASS:

public class MyPartitioner extends Partitioner<Text, Text> implements Configurable
{
      private Configuration conf;

      @Override
      public Configuration getConf() {
            return conf;
      }

      //Do whatever setup you would like. Remember in the job drive you could set things to configuration well you can access them here now
      @Override
      public void setConf(Configuration conf) {
            this.conf = conf;
      }

      @Override
      public int getPartition(Text key, Text value, int numReduceTasks)
      {
            Integer partitionNum = 0;

            //Do whatever logic you would like to figure out the way you want to partition.
            //If you want to embed javascript go <a href="http://www.gaudreault.ca/java-embed-javascript/" target="_blank">here</a>.
            //If you want to embed Python go <a href="http://www.gaudreault.ca/java-embed-python/" target="_blank">here</a>.

            return partitionNum;
      }
}

If you decided to embed Python or JavaScript you will need these scripts as an example. partitioner_python and partitioner_js

OUTPUTFORMATCLASS:

import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyOutputFormat<K, V> extends FileOutputFormat<K, V> {
      protected static int outputCount = 0;

      protected static class JsonRecordWriter<K, V> extends RecordWriter<K, V> {
            protected DataOutputStream out;

            public JsonRecordWriter(DataOutputStream out) throws IOException {
                  this.out = out;
            }

            @Override
            public void close(TaskAttemptContext arg0) throws IOException, InterruptedException {
                  //WRITE_WHATEVER_YOU_WANT is a placeholder for any closing output (footer, closing bracket, etc.) you need to emit.
                  out.writeBytes(WRITE_WHATEVER_YOU_WANT);
                  out.close();
            }

            @Override
            public void write(K key, V value) throws IOException, InterruptedException {
                  //Write the value here.
                  //You could also send it to a database if you wanted. It is up to you how you want to deal with it.
            }
      }

      @Override
      public RecordWriter<K, V> getRecordWriter(TaskAttemptContext tac) throws IOException, InterruptedException {
            Configuration conf = tac.getConfiguration();
            //"mapred.reduce.tasks" is the old property name; newer releases call it "mapreduce.job.reduces".
            Integer numReducers = conf.getInt("mapred.reduce.tasks", 0);
            //You can set the output filename in the config from the job driver if you want.
            String outputFileName = conf.get("outputFileName");
            outputCount++;

            //If you used a partitioner you need to split out the content, so you should break the output filename into parts.
            if (numReducers > 1) {
                  //Do whatever logic you need in order to get unique filenames per split.
            }

            Path file = FileOutputFormat.getOutputPath(tac);
            Path fullPath = new Path(file, outputFileName);
            FileSystem fs = file.getFileSystem(conf);
            FSDataOutputStream fileout = fs.create(fullPath);
            return new JsonRecordWriter<K, V>(fileout);
      }
}
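
To make the RecordWriter part of the skeleton a little more concrete, here is a minimal sketch of a writer that emits one tab-separated key/value line per record. The class name and the Text key/value types are my own choices, not something the template above requires.

import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TabRecordWriter extends RecordWriter<Text, Text> {
      private final DataOutputStream out;

      public TabRecordWriter(DataOutputStream out) {
            this.out = out;
      }

      @Override
      public void write(Text key, Text value) throws IOException, InterruptedException {
            //One tab-separated line per record.
            out.writeBytes(key.toString() + "\t" + value.toString() + "\n");
      }

      @Override
      public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            out.close();
      }
}

If you used a partitioner and need unique filenames per reducer, one option inside getRecordWriter() is to append the task id to the configured name, for example outputFileName + "-" + tac.getTaskAttemptID().getTaskID().getId().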

Hadoop: Commands

Below is a list of all the commands I have had to use while working with Hadoop. If you have any others that are not listed here, or updates to the ones below, please feel free to add them.

Move Files:

 hadoop fs -mv /OLD_DIR/* /NEW_DIR/

Sort files by size. Note this is for viewing information on the terminal only; it has no effect on the files or the way they are displayed via the web UI:

 hdfs fsck /logs/ -files | grep "/FILE_DIR/" | grep -v "<dir>" | gawk '{print $2, $1;}' | sort -n

Display file system information for a directory:

 hdfs fsck /FILE_dir/ -files

Remove folder with all files in it:

 hadoop fs -rm -R hdfs:///DIR_TO_REMOVE

Make folder:

 hadoop fs -mkdir hdfs:///NEW_DIR

Remove one file:

 hadoop fs -rm hdfs:///DIR/FILENAME.EXTENSION

Copy all file from directory outside of HDFS to HDFS:

 hadoop fs -copyFromLocal LOCAL_DIR hdfs:///DIR

Copy files from HDFS to local directory:

 hadoop fs -copyToLocal hdfs:///DIR/REGPATTERN LOCAL_DIR

Kill a running MR job:

 hadoop job -kill job_1461090210469_0003

You could also do that via the web UI on port 8088

Kill yarn application:

 yarn application -kill application_1461778722971_0001

Check status of DATANODES. Check “Under Replicated blocks” field. If you have any you should probably rebalance:

 hadoop dfsadmin -report

Number of files in HDFS directory:

 hadoop fs -count -q hdfs:///DIR

-q is optional. It gives the columns QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME

Rename directory:

 hadoop fs -mv hdfs:///OLD_NAME hdfs:///NEW_NAME

Change replication factor on files:

 hadoop fs -setrep -R 3 hdfs:///DIR

3 is the replication factor.
You can also target a single file instead of a directory if you want

Get yarn log. You can also view via web ui 8088:

 yarn logs -applicationId application_1462141864581_0016

Refresh Nodes:

 hadoop dfsadmin -refreshNodes

Report of blocks and their locations:

 hadoop fsck / -files -blocks -locations

Find out where a particular file is located with blocks:

 hadoop fsck /DIR/FILENAME -files -locations -blocks

Fix under-replicated blocks. The first command gets the blocks that are under-replicated. The second sets replication to 2 for those files. You might have to restart the dfs to see a change from dfsadmin -report:

 hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files

for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 2 $hdfsfile; done

Show all the classpath entries associated with Hadoop:

 hadoop classpath

Hadoop 2.9.1: Installation

I have been working with Hadoop 2.9.1 for over a year and have learned much on the installation of Hadoop in a multi node cluster environment. I would like to share what I have learned and applied in the hopes that it will help someone else configure their system. The deployment I have done is to have a Name Node and 1-* DataNodes on Ubuntu 16.04 assuming 5 cpu and 13GB RAM. I will put all commands used in this tutorial right down to the very basics for those that are new to Ubuntu.

NOTE: Sometimes you may have to use “sudo” in front of the command. I also use nano for this article for beginners but you can use any editor you prefer (ie: vi). Also this article does not take into consideration any SSL, kerberos, etc. For all purposes here Hadoop will be open without having to login, etc.

Additional Setup/Configurations to Consider:

Zookeeper: It is also a good idea to use ZooKeeper to synchronize your configuration.

Secondary NameNode: This should be done on a separate server and its function is to take checkpoints of the NameNode's file system.

Rack Awareness: Fault tolerance to ensure blocks are placed as evenly as possible on different racks if they are available.

Apply the following to all NameNode and DataNodes unless otherwise directed:

Hadoop User:
For this example we will just use hduser as our group and user for simplicity's sake.
The "-a" flag on usermod appends the user to the group supplied with "-G".

addgroup hduser
#If the hduser user does not already exist, create it first with: adduser --ingroup hduser hduser
sudo gpasswd -a $USER sudo
usermod -a -G sudo hduser

Install JDK:

apt-get update
apt-get upgrade
apt-get install default-jdk

Install SSH:

apt-get install ssh
which ssh
which sshd

These two commands will check that ssh installed correctly and will return “/usr/bin/ssh” and “/usr/bin/sshd”

java -version

You use this to verify that java installed correctly and will return something like the following.

openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

System Configuration

nano ~/.bashrc

The .bashrc is a script that is executed when a terminal session is started.
Add the following line to the end and save; it tells Java to prefer IPv4, which Hadoop needs.

export _JAVA_OPTIONS='-XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true'

source ~/.bashrc

sysctl.conf

Disable ipv6 as it causes issues in getting your server up and running.

nano /etc/sysctl.conf

Add the following to the end and save

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
#Change eth0 to what ifconfig has
net.ipv6.conf.eth0.disable_ipv6 = 1

Save and close the file, then apply and verify the settings:

sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
reboot

If all of the above IPv6-disabling configuration was successful, you should get "1" returned.
Sometimes you can hit the open file descriptor and open file limits. If you do encounter this issue you might have to raise the ulimit and descriptor limit. For this example I have set some values, but you will have to figure out the best numbers for your specific case.

If you get “cannot stat /proc/sys/-p: No such file or directory”, then you need to add /sbin/ to PATH.

sudo nano ~/.bashrc
export PATH=$PATH:/sbin/
nano /etc/sysctl.conf

fs.file-max = 500000

sysctl -p

limits.conf

nano /etc/security/limits.conf

* soft nofile 60000
* hard nofile 60000

 reboot

Test Limits

You can now test the limits you applied to make sure they took.

ulimit -a
more /proc/sys/fs/file-max
more /proc/sys/fs/file-nr
lsof | wc -l

file-max: The system-wide limit on open file handles
file-nr: How many file handles are currently being used
lsof | wc -l: How many files are currently open

You might be wondering why we installed ssh at the beginning. That is because Hadoop uses ssh to access its nodes. We need to eliminate the password requirement by setting up ssh key pairs. If asked for a filename just leave it blank and confirm with enter.

su hduser

Run this if you are not already logged in as the user we created in the Hadoop User section.

ssh-keygen -t rsa -P ""

You will get output like the example below, as well as the key fingerprint and a randomart image.

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory ‘/home/hduser/.ssh’.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

If you get “No such file or directory”, check the .ssh directory for the actual name of the public key file; it will most likely be “id_rsa.pub”.

This will add the newly created key to the list of authorized keys so that Hadoop can use SSH without prompting for a password.
Now we check that it worked by running “ssh localhost”. When asked whether you want to continue connecting, type “yes” and press enter. localhost will be permanently added to the list of known hosts.
Once we have done this on the Name Node and all Data Nodes, run the following commands from the Name Node for each Data Node.

ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@DATANODEHOSTNAME
ssh DATANODEHOSTNAME

/etc/hosts Update

We need to update the hosts file.

sudo nano /etc/hosts

#Comment out line "127.0.0.1 localhost"

127.0.0.1 HOSTNAME localhost

Now we are getting to the part we have been waiting for.

Hadoop Installation:

NAMENODE: You will see this in the config files below. It can be the hostname, the static IP, or 0.0.0.0 so that all TCP ports will be bound to all IPs of the server. You should also note that the masters and slaves files later on in this tutorial can still use the hostname.

Note: You could run rsync after setting up the initial Name Node configuration to copy it to each Data Node if you want. This would save initial Hadoop setup time. You do that by running the following command:

rsync -a /usr/local/hadoop/ hduser@DATANODEHOSTNAME:/usr/local/hadoop/

Download & Extract:

wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.9.1/hadoop-2.9.1.tar.gz
tar xvzf hadoop-2.9.1.tar.gz
sudo mv hadoop-2.9.1/ /usr/local/hadoop
chown -R hduser:hduser /usr/local/hadoop
update-alternatives --config java

Basically the above downloads and extracts Hadoop, moves the extracted directory to /usr/local/hadoop, switches ownership to hduser if it doesn't already own the newly created directory,
and tells us the path where Java has been installed so we can set the JAVA_HOME environment variable. It should return something like the following:

There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

nano ~/.bashrc

Add the following to the end of the file. Make sure to do this on Name Node and all Data Nodes:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_HOME=$HADOOP_INSTALL
#HADOOP VARIABLES END

source ~/.bashrc
javac -version
which javac
readlink -f /usr/bin/javac

This basically validates that the .bashrc update worked!
javac should return “javac 1.8.0_171” or something similar
which javac should return “/usr/bin/javac”
readlink should return “/usr/lib/jvm/java-8-openjdk-amd64/bin/javac”

Memory Tools

There is an application from HortonWorks you can download which can help get you started on how you should set up memory utilization for yarn. I found it a great starting point, but you will need to tweak it for your specific case.

wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
cd hdp_manual_install_rpm_helper_files-2.6.0.3.8/
sudo apt-get install python2.7
python2.7 scripts/yarn-utils.py -c 5 -m 13 -d 1 -k False

-c is for how many cores you have
-m is for how much memory you have (in GB)
-d is for how many disks you have
-k is for whether you are running HBASE: False if you are not, True if you are.

After the script has run it will give you guidelines on the yarn/mapreduce settings; the configuration files below show an example of the result. Remember they are guidelines. Tweak as needed.
Now the real fun begins!!! Remember that these settings are what worked for me and you may need to adjust them.

hadoop-env.sh

nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

You will see JAVA_HOME near the beginning of the file; you will need to change that to where Java is installed on your system.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,DRFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,RFAAUDIT} $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS=$HADOOP_NAMENODE_OPTS
export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"

mkdir -p /app/hadoop/tmp

This is the temp directory hadoop uses

chown hduser:hduser /app/hadoop/tmp

core-site.xml

Refer to the official Hadoop documentation for this file.

nano /usr/local/hadoop/etc/hadoop/core-site.xml

This file contains configuration properties that Hadoop uses when starting up. By default its <configuration> element is empty; this will need to be changed.

<configuration>
      <property>
            <name>fs.defaultFS</name>
            <value>hdfs://NAMENODE:54310</value>
            <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
      </property>
      <property>
            <name>hadoop.tmp.dir</name>
            <value>/app/hadoop/tmp</value>
      </property>
      <property>
            <name>hadoop.proxyuser.hduser.hosts</name>
            <value>*</value>
      </property>
      <property>
            <name>hadoop.proxyuser.hduser.groups</name>
            <value>*</value>
      </property>
</configuration>

yarn-site.xml

Refer to the official Hadoop documentation for this file.

nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
      <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
      </property>
      <property>
            <name>yarn.resourcemanager.scheduler.class</name>
            <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
      </property>
      <property>
            <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>12288</value>
            <final>true</final>
      </property>
      <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>4096</value>
            <final>true</final>
      </property>
      <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>12288</value>
            <final>true</final>
      </property>
      <property>
            <name>yarn.app.mapreduce.am.resource.mb</name>
            <value>4096</value>
      </property>
      <property>
            <name>yarn.app.mapreduce.am.command-opts</name>
            <value>-Xmx3276m</value>
      </property>
      <property>
            <name>yarn.nodemanager.local-dirs</name>
            <value>/app/hadoop/tmp/nm-local-dir</value>
      </property>
      <!--LOG-->
      <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
      </property>
      <property>
            <description>Where to aggregate logs to.</description>
            <name>yarn.nodemanager.remote-app-log-dir</name>
            <value>/tmp/yarn/logs</value>
      </property>
      <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>604800</value>
      </property>
      <property>
            <name>yarn.log-aggregation.retain-check-interval-seconds</name>
            <value>86400</value>
      </property>
      <property>
            <name>yarn.log.server.url</name>
            <value>http://NAMENODE:19888/jobhistory/logs/</value>
      </property>
      
      <!--URLs-->
      <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>NAMENODE:8025</value>
      </property>
      <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>NAMENODE:8030</value>
      </property>
      <property>
            <name>yarn.resourcemanager.address</name>
            <value>NAMENODE:8050</value>
      </property>
      <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value>NAMENODE:8033</value>
      </property>
      <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>NAMENODE:8088</value>
      </property>
</configuration>

By default the <configuration> element in this file is empty; it will need to be changed as shown above.

mapred-site.xml

Refer to the official Hadoop documentation for this file. By default, the /usr/local/hadoop/etc/hadoop/ folder contains a mapred-site.xml.template file, which has to be copied/renamed to mapred-site.xml. By default its <configuration> element is empty; this will need to be changed.

cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
      <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
      </property>
      <property>
            <name>mapreduce.jobhistory.address</name>
            <value>NAMENODE:10020</value>
      </property>
      <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>NAMENODE:19888</value>
      </property>
      <property>
            <name>mapreduce.jobtracker.address</name>
            <value>NAMENODE:54311</value>
      </property>
      <!-- Memory and concurrency tuning -->
      <property>
            <name>mapreduce.map.memory.mb</name>
            <value>4096</value>
      </property>
      <property>
            <name>mapreduce.map.java.opts</name>
            <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
      </property>
      <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>4096</value>
      </property>
      <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
      </property>
      <property>
            <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
            <value>0.5</value>
      </property>
      <property>
            <name>mapreduce.task.io.sort.mb</name>
            <value>600</value>
      </property>
      <property>
            <name>mapreduce.task.io.sort.factor</name>
            <value>1638</value>
      </property>
      <property>
            <name>mapreduce.map.sort.spill.percent</name>
            <value>0.50</value>
      </property>
      <property>
            <name>mapreduce.map.speculative</name>
            <value>false</value>
      </property>
      <property>
            <name>mapreduce.reduce.speculative</name>
            <value>false</value>
      </property>
      <property>
            <name>mapreduce.task.timeout</name>
            <value>1800000</value>
      </property>
</configuration>

yarn-env.sh

nano /usr/local/hadoop/etc/hadoop/yarn-env.sh

Change or uncomment or add the following:

JAVA_HEAP_MAX=-Xmx2000m
YARN_OPTS="$YARN_OPTS -server -Dhadoop.log.dir=$YARN_LOG_DIR"
YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"

Master

Add the namenode hostname.

nano /usr/local/hadoop/etc/hadoop/masters

APPLY THE FOLLOWING TO THE NAMENODE ONLY

Slaves

Add the NameNode hostname and all DataNode hostnames.

nano /usr/local/hadoop/etc/hadoop/slaves

hdfs-site.xml

Refer to the official Hadoop documentation for this file. By default its <configuration> element is empty; this will need to be changed. The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. Before editing this file, we need to create the namenode directory.

mkdir -p /usr/local/hadoop_store/data/namenode
chown -R hduser:hduser /usr/local/hadoop_store
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
      <property>
            <name>dfs.replication</name>
            <value>3</value>
            <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
      </property>
      <property>
            <name>dfs.permissions</name>
            <value>false</value>
      </property>
      <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/usr/local/hadoop_store/data/namenode</value>
      </property>
      <property>
            <name>dfs.datanode.use.datanode.hostname</name>
            <value>false</value>
      </property>
      <property>
            <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
            <value>false</value>
      </property>
      <property>
            <name>dfs.namenode.http-address</name>
            <value>NAMENODE:50070</value>
            <description>Your NameNode hostname for http access.</description>
      </property>
      <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>SECONDARYNAMENODE:50090</value>
            <description>Your Secondary NameNode hostname for http access.</description>
      </property>
      <property>
            <name>dfs.blocksize</name>
            <value>128m</value>
      </property>
      <property>
            <name>dfs.namenode.http-bind-host</name>
            <value>0.0.0.0</value>
      </property>
      <property>
            <name>dfs.namenode.rpc-bind-host</name>
            <value>0.0.0.0</value>
      </property>
      <property>
             <name>dfs.namenode.servicerpc-bind-host</name>
             <value>0.0.0.0</value>
      </property>
</configuration>

APPLY THE FOLLOWING TO THE DATANODE(s) ONLY

Slaves

Add only that DataNode's hostname.

nano /usr/local/hadoop/etc/hadoop/slaves

hdfs-site.xml

The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. Before editing this file, we need to create the datanode directory.
By default its <configuration> element is empty; this will need to be changed.

mkdir -p /usr/local/hadoop_store/data/datanode
chown -R hduser:hduser /usr/local/hadoop_store
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
      <property>
            <name>dfs.replication</name>
            <value>3</value>
            <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
      </property>
      <property>
            <name>dfs.permissions</name>
            <value>false</value>
      </property>
      <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/usr/local/hadoop_store/data/datanode</value>
      </property>
      <property>
            <name>dfs.datanode.use.datanode.hostname</name>
            <value>false</value>
      </property>
      <property>
            <name>dfs.namenode.http-address</name>
            <value>NAMENODE:50070</value>
            <description>Your NameNode hostname for http access.</description>
      </property>
      <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>SECONDARYNAMENODE:50090</value>
            <description>Your Secondary NameNode hostname for http access.</description>
      </property>
      <property>
            <name>dfs.datanode.http.address</name>
            <value>DATANODE:50075</value>
      </property>
      <property>
            <name>dfs.blocksize</name>
            <value>128m</value>
      </property>
</configuration>

If you have the Ubuntu firewall on, you need to allow all of the necessary ports through it.

sudo ufw allow 50070
sudo ufw allow 8088

Format Cluster:
Only do this if NO data is present. All data will be destroyed when the following is done.
This is to be done on NAMENODE ONLY!

hdfs namenode -format

Start The Cluster:
You can now start the cluster.
You do this from the NAMENODE ONLY.

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver

If the above three commands didn't work, something went wrong, as they should have been found in the /usr/local/hadoop/sbin/ directory.

Cron Job:
You should probably setup a cron job to start the cluster when you reboot.

crontab -e

@reboot /usr/local/hadoop/sbin/start-dfs.sh > /home/hduser/dfs-start.log 2>&1
@reboot /usr/local/hadoop/sbin/start-yarn.sh > /home/hduser/yarn-start.log 2>&1
@reboot /usr/local/hadoop/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver > /home/hduser/history-start.log 2>&1

Verification:
To check that everything is working as it should, run “jps” on the NAMENODE. It should return something like the following, where the pids will be different:

jps

You could also run “netstat -plten | grep java” or “lsof -i :50070” and “lsof -i :8088”.

Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
2596 SecondaryNameNode
3693 Jps
1293 JobHistoryServer
1317 ResourceManager
1840 NameNode
1743 NodeManager
2351 DataNode

You can check the DATA NODES by sshing into each one and running “jps”. It should return something like the following, where the pids will be different:

Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
3218 Jps
2215 NodeManager
2411 DataNode

If for any reason one of the services is not running, you need to review the logs. They can be found at /usr/local/hadoop/logs/. If it's the ResourceManager that isn't running, then look at the log file that has “yarn” and “resourcemanager” in its name.
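
If you would also like to verify HDFS from code rather than just with jps and the web UI, the following is a minimal sketch of a client that connects to the fs.defaultFS address configured in core-site.xml above and lists the root directory. The class name is my own, and you will need the Hadoop jars on the classpath (see “hadoop classpath” earlier) to compile and run it.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
      public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            //NAMENODE:54310 is the fs.defaultFS value set in core-site.xml above.
            FileSystem fs = FileSystem.get(new URI("hdfs://NAMENODE:54310"), conf);
            //List the HDFS root directory; if this prints without errors the NameNode is answering RPC calls.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                  System.out.println(status.getPath());
            }
            fs.close();
      }
}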

WARNING:
Never reboot the system without first stopping the cluster. Once the cluster has shut down it is safe to reboot. Also, if you configured an @reboot cron job, you should make sure the DATANODES are up and running before starting the NAMENODE; that way it automatically starts the DATANODE services for you.

Web Ports:

NameNode

  • 50070: HDFS Namenode
  • 50075: HDFS Datanode
  • 50090: HDFS Secondary Namenode
  • 8088: Resource Manager
  • 19888: Job History

DataNode

  • 50075: HDFS Datanode

NetStat

To check which IP each of the Hadoop ports is listening on, run the following.

sudo netstat -ltnp

Port Check

If for some reason you are having issues connecting to a Hadoop port, then run the following command as you try to connect via the port (change eth1 to whatever interface ifconfig shows).

sudo tcpdump -n -tttt -i eth1 port 50070

References

I used a lot of different resources and reference material on this. However, I did not save all the relevant links I used. Below are just a few I used. There were various blog posts about memory utilization, etc.