I have been working with Hadoop 2.9.1 for over a year and have learned a great deal about installing Hadoop in a multi-node cluster environment. I would like to share what I have learned and applied in the hope that it will help someone else configure their own system. The deployment described here consists of one NameNode and one or more DataNodes on Ubuntu 16.04, assuming 5 CPUs and 13 GB of RAM per server. I will include every command used in this tutorial, right down to the very basics, for those who are new to Ubuntu.
NOTE: Sometimes you may have to prefix a command with "sudo". I also use nano throughout this article for the benefit of beginners, but you can use any editor you prefer (e.g. vi). This article also does not cover SSL, Kerberos, etc.; for our purposes here, Hadoop will be open and accessible without a login.
Additional Setup/Configurations to Consider:
Zookeeper: It is also a good idea to use ZooKeeper to keep your configuration synchronized across nodes.
Secondary NameNode: This should run on a separate server, and its function is to take checkpoints of the NameNode's file system.
Rack Awareness: Fault tolerance to ensure blocks are placed as evenly as possible across different racks, if they are available (see the sketch after this list).
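Rack awareness, for example, is typically enabled by pointing the net.topology.script.file.name property in core-site.xml at a script that maps each node's IP or hostname to a rack path. Below is a minimal sketch of such a script; the path, subnets, and rack names are hypothetical and would need to match your own layout.

# /usr/local/hadoop/etc/hadoop/rack-topology.sh (hypothetical path)
# Hadoop passes one or more IPs/hostnames as arguments and expects one rack path per argument.
for node in "$@"; do
  case "$node" in
    10.0.1.*) echo "/rack1" ;;    # example subnet-to-rack mapping
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done

The script must be executable and referenced from core-site.xml via the net.topology.script.file.name property.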
Apply the following to all NameNode and DataNodes unless otherwise directed:
Hadoop User:
For this example we will just use hduser as our group and user, for simplicity's sake.
The "-a" option on usermod appends the user to the group(s) supplied with "-G".
addgroup hduser
sudo gpasswd -a $USER sudo
usermod -a -G sudo hduser
Install JDK:
apt-get update
apt-get upgrade
apt-get install default-jdk
Install SSH:
apt-get install ssh
which ssh
which sshd
The two "which" commands verify that SSH installed correctly; they should return "/usr/bin/ssh" and "/usr/bin/sshd".
java -version
You use this to verify that java installed correctly and will return something like the following.
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
System Configuration
nano ~/.bashrc
The .bashrc is a script that is executed when a terminal session is started.
Add the following line to the end of the file and save; it tells Java to prefer IPv4, which Hadoop uses.
export _JAVA_OPTIONS='-XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true'
source ~/.bashrc
sysctl.conf
Disable IPv6, as it can cause issues in getting your server up and running.
nano /etc/sysctl.conf
Add the following to the end and save
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
# Change eth0 to the interface name that ifconfig shows
net.ipv6.conf.eth0.disable_ipv6 = 1
Save and close sysctl.conf, then apply the changes and verify:
sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
reboot
If the IPv6 configuration above was applied successfully, the cat command should return "1".
Sometimes you can hit the open file and file descriptor limits. If you encounter this issue, you may have to raise the ulimit and descriptor limits. For this example I have set some values, but you will have to work out the best numbers for your specific case.
If you get "cannot stat /proc/sys/-p: No such file or directory", then you need to add /sbin/ to your PATH.
sudo nano ~/.bashrc
export PATH=$PATH:/sbin/
nano /etc/sysctl.conf
fs.file-max = 500000
sysctl -p
limits.conf
nano /etc/security/limits.conf
* soft nofile 60000
* hard nofile 60000
reboot
Test Limits
You can now test the limits you applied to make sure they took.
ulimit -a
more /proc/sys/fs/file-max
more /proc/sys/fs/file-nr
lsof | wc -l
file-max: Current open file descriptor limit
file-nr: How many file descriptors are currently being used
lsof | wc -l: How many files are currently open
You might be wondering why we installed SSH at the beginning. That is because Hadoop uses SSH to access its nodes. We need to eliminate the password requirement by setting up SSH keys. If asked for a filename, just leave it blank and confirm with Enter.
su hduser
If not already logged in as the user we created in the Hadoop user section.
ssh-keygen -t rsa -P ""
You will get the below example as well as the fingerprint and randomart image.
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory ‘/home/hduser/.ssh’.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
If you get "No such file or directory", double-check the filename: look in the .ssh directory, where the public key will most likely be named "id_rsa.pub".
This will add the newly created key to the list of authorized keys so that Hadoop can use SSH without prompting for a password.
Now check that it worked by running "ssh localhost". When asked whether you want to continue connecting, type "yes" and press Enter; localhost will be permanently added to the list of known hosts.
Once this has been done on the NameNode and every DataNode, run the following commands from the NameNode for each DataNode.
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@DATANODEHOSTNAME
ssh DATANODEHOSTNAME
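If you have more than a couple of DataNodes, a small loop saves some typing. This is only a convenience sketch; datanode1, datanode2, and datanode3 are placeholder hostnames to replace with your own.

# Copy the NameNode's public key to each DataNode and test the connection.
for node in datanode1 datanode2 datanode3; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub "hduser@${node}"
  ssh "hduser@${node}" hostname    # should print the DataNode hostname without asking for a password
done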
/etc/hosts Update
We need to update the hosts file.
sudo nano /etc/hosts
# Comment out the line "127.0.0.1 localhost"
127.0.0.1 HOSTNAME localhost
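If your nodes are not resolvable through DNS, it also helps to list every cluster member in /etc/hosts on each machine. The addresses and hostnames below are purely hypothetical; substitute your own static IPs and hostnames.

# Example /etc/hosts entries (hypothetical IPs and hostnames)
10.0.0.10   namenode
10.0.0.11   datanode1
10.0.0.12   datanode2
10.0.0.13   datanode3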
Now we are getting to the part we have been waiting for.
Hadoop Installation:
NAMENODE: You will see this placeholder in the config files below. It can be the hostname, the static IP, or 0.0.0.0 so that all TCP ports are bound to all IPs of the server. You should also note that the masters and slaves files later in this tutorial can still use the hostname.
Note: After setting up the initial NameNode configuration, you could rsync it to each DataNode if you want. This would save initial Hadoop setup time. You do that by running the following command:
rsync -a /usr/local/hadoop/ hduser@DATANODEHOSTNAME:/usr/local/hadoop/
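As with the SSH keys, you can push the configured Hadoop directory to several DataNodes in one pass. Again, the hostnames below are placeholders.

# Sync the NameNode's Hadoop directory to each DataNode (hypothetical hostnames).
for node in datanode1 datanode2 datanode3; do
  rsync -a /usr/local/hadoop/ "hduser@${node}:/usr/local/hadoop/"
done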
Download & Extract:
wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.9.1/hadoop-2.9.1.tar.gz
tar xvzf hadoop-2.9.1.tar.gz
sudo mv hadoop-2.9.1/ /usr/local/hadoop
chown -R hduser:hduser /usr/local/hadoop
update-alternatives --config java
The above downloads and extracts Hadoop, moves the extracted directory to /usr/local/hadoop, switches ownership to hduser if it does not already own the new directory, and then tells us the path where Java is installed so we can set the JAVA_HOME environment variable. The last command should return something like the following:
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
nano ~/.bashrc
Add the following to the end of the file. Make sure to do this on Name Node and all Data Nodes:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_HOME=$HADOOP_INSTALL
#HADOOP VARIABLES END
source ~/.bashrc
javac -version
which javac
readlink -f /usr/bin/javac
This validates that the .bashrc update worked:
javac -version should return "javac 1.8.0_171" or something similar
which javac should return “/usr/bin/javac”
readlink should return “/usr/lib/jvm/java-8-openjdk-amd64/bin/javac”
Memory Tools
There is a utility from Hortonworks you can download that helps you get started with memory utilization settings for YARN. I found it a great starting point, but you will need to tweak the results for your specific case.
wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
cd hdp_manual_install_rpm_helper_files-2.6.0.3.8/
sudo apt-get install python2.7
python2.7 scripts/yarn-utils.py -c 5 -m 13 -d 1 -k False
-c is for how many cores you have
-m is for how much memory you have
-d is for how many disks you have
-k is for whether you are running HBase: False if you are not, True if you are.
After the script runs it will give you guidelines for the YARN/MapReduce settings; the sketch below shows how the numbers fit together. Remember, they are guidelines. Tweak as needed.
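To give a feel for how the numbers hang together, here is a rough sketch of the arithmetic behind the values used later in this tutorial (3 containers of 4 GB on a 13 GB node, with JVM heaps at roughly 80% of the container size). This only illustrates the relationships between the settings; it is not the Hortonworks script itself.

# Rough illustration of how the YARN/MapReduce memory values below relate.
CONTAINER_MB=4096                        # yarn.scheduler.minimum-allocation-mb
CONTAINERS=3                             # containers that fit on this node
NODE_MB=$((CONTAINER_MB * CONTAINERS))   # 12288 -> yarn.nodemanager.resource.memory-mb
HEAP_MB=$((CONTAINER_MB * 8 / 10))       # 3276  -> -Xmx for the map/reduce/AM JVMs
echo "nodemanager: ${NODE_MB} MB, container: ${CONTAINER_MB} MB, JVM heap: -Xmx${HEAP_MB}m"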
Now the real fun begins!!! Remember that these settings are what worked for me and you may need to adjust them.
hadoop-env.sh
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
You will see JAVA_HOME near the beginning of the file; change it to point to where Java is installed on your system.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,DRFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,RFAAUDIT} $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS=$HADOOP_NAMENODE_OPTS
export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"
mkdir -p /app/hadoop/tmp
This is the temp directory hadoop uses
chown hduser:hduser /app/hadoop/tmp
core-site.xml
nano /usr/local/hadoop/etc/hadoop/core-site.xml
This file contains configuration properties that Hadoop uses when starting up. By default it contains only an empty <configuration> element; this will need to be changed as follows.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://NAMENODE:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.groups</name>
    <value>*</value>
  </property>
</configuration>
yarn-site.xml
By default this file contains only an empty <configuration> element; it will need to be changed as follows.
nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>4096</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>12288</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx3276m</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/app/hadoop/tmp/nm-local-dir</value>
  </property>
  <!--LOG-->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/yarn/logs</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>86400</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://NAMENODE:19888/jobhistory/logs/</value>
  </property>
  <!--URLs-->
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>NAMENODE:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>NAMENODE:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>NAMENODE:8050</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>NAMENODE:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>NAMENODE:8088</value>
  </property>
</configuration>
mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains a mapred-site.xml.template file, which has to be copied (or renamed) to mapred-site.xml. Out of the box it contains only an empty <configuration> element; this will need to be changed as follows.
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>NAMENODE:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>NAMENODE:19888</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>NAMENODE:54311</value>
  </property>
  <!-- Memory and concurrency tuning -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
    <value>0.5</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>600</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>1638</value>
  </property>
  <property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.50</value>
  </property>
  <property>
    <name>mapreduce.map.speculative</name>
    <value>false</value>
  </property>
  <property>
    <name>mapreduce.reduce.speculative</name>
    <value>false</value>
  </property>
  <property>
    <name>mapreduce.task.timeout</name>
    <value>1800000</value>
  </property>
</configuration>
yarn-env.sh
nano /usr/local/hadoop/etc/hadoop/yarn-env.sh
Change, uncomment, or add the following:
JAVA_HEAP_MAX=-Xmx2000m
YARN_OPTS="$YARN_OPTS -server -Dhadoop.log.dir=$YARN_LOG_DIR"
YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"
Master
Add the namenode hostname.
nano /usr/local/hadoop/etc/hadoop/masters
APPLY THE FOLLOWING TO THE NAMENODE ONLY
Slaves
Add the NameNode hostname and all DataNode hostnames.
nano /usr/local/hadoop/etc/hadoop/slaves
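As a concrete illustration, with hypothetical hostnames the masters file would contain just the NameNode, and the slaves file would list every node that should run DataNode/NodeManager services:

# /usr/local/hadoop/etc/hadoop/masters (hypothetical hostname)
namenode

# /usr/local/hadoop/etc/hadoop/slaves (hypothetical hostnames)
namenode
datanode1
datanode2
datanode3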
hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster. By default it contains only an empty <configuration> element; this will need to be changed. Before editing the file, we need to create the namenode directory.
mkdir -p /usr/local/hadoop_store/data/namenode
chown -R hduser:hduser /usr/local/hadoop_store
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>NAMENODE:50070</value>
    <description>Your NameNode hostname for http access.</description>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>SECONDARYNAMENODE:50090</value>
    <description>Your Secondary NameNode hostname for http access.</description>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>128m</value>
  </property>
  <property>
    <name>dfs.namenode.http-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
</configuration>
APPLY THE FOLLOWING TO THE DATANODE(s) ONLY
Slaves
Add only that DataNode's hostname.
nano /usr/local/hadoop/etc/hadoop/slaves
hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster. By default it contains only an empty <configuration> element; this will need to be changed. Before editing the file, we need to create the datanode directory.
mkdir -p /usr/local/hadoop_store/data/datanode
chown -R hduser:hduser /usr/local/hadoop_store
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/data/datanode</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>NAMENODE:50070</value>
    <description>Your NameNode hostname for http access.</description>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>SECONDARYNAMENODE:50090</value>
    <description>Your Secondary NameNode hostname for http access.</description>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>DATANODE:50075</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>128m</value>
  </property>
</configuration>
If you have the Ubuntu firewall (ufw) enabled, you need to allow all the necessary ports through it.
sudo ufw allow 50070
sudo ufw allow 8088
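The two rules above only cover the NameNode and ResourceManager web UIs. Depending on which ports you kept in the configuration files above, you may need to open others as well; the sketch below uses the ports referenced in this tutorial, so trim it to what your nodes actually need.

# Open the other ports referenced in this tutorial's configuration files.
for port in 54310 54311 50075 50090 8025 8030 8033 8050 10020 19888; do
  sudo ufw allow "$port"
done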
Format Cluster:
Only do this if NO data is present. All data will be destroyed when the following is done.
This is to be done on NAMENODE ONLY!
hdfs namenode -format
Start The Cluster:
You can now start the cluster.
You do this from the NAMENODE ONLY.
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
If the above three commands are not found, something went wrong; they live in the /usr/local/hadoop/sbin/ directory and should be picked up through the PATH we set earlier.
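Once the daemons are up, a quick smoke test helps confirm that HDFS and YARN are actually usable. These are standard Hadoop commands; the example jar path assumes a default 2.9.1 layout, so verify it on your system.

# Check that all DataNodes have registered with the NameNode.
hdfs dfsadmin -report

# Write and read something trivial in HDFS.
hdfs dfs -mkdir -p /user/hduser
echo "hello hadoop" | hdfs dfs -put - /user/hduser/smoke-test.txt
hdfs dfs -cat /user/hduser/smoke-test.txt

# Run a tiny example MapReduce job through YARN.
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.1.jar pi 2 10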
Cron Job:
You should probably setup a cron job to start the cluster when you reboot.
crontab -e
@reboot /usr/local/hadoop/sbin/start-dfs.sh > /home/hduser/dfs-start.log 2>&1
@reboot /usr/local/hadoop/sbin/start-yarn.sh > /home/hduser/yarn-start.log 2>&1
@reboot /usr/local/hadoop/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver > /home/hduser/history-start.log 2>&1
Verification:
To check that everything is working as it should run “jps” on the NAMENODE. It should return something like the following where the pid will be different:
jps
You could also run "netstat -plten | grep java" or "lsof -i :50070" and "lsof -i :8088".
Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
2596 SecondaryNameNode
3693 Jps
1293 JobHistoryServer
1317 ResourceManager
1840 NameNode
1743 NodeManager
2351 DataNode
You can check the DATA NODES by ssh into each one and running “jps”. It should return something like the following where the pid will be different:
Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
3218 Jps
2215 NodeManager
2411 DataNode
If for any reason one of the services is not running, you need to review the logs, which can be found under /usr/local/hadoop/logs/. For example, if the ResourceManager is not running, look at the file that has "yarn" and "resourcemanager" in its name.
WARNING:
Never reboot the system without first stopping the cluster. Once the cluster has been shut down, it is safe to reboot. Also, if you configured an @reboot cron job, make sure the DataNodes are up and running before the NameNode starts, so that the NameNode's start scripts can bring the DataNode services up for you.
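For reference, a clean shutdown is simply the reverse of the startup sequence, run on the NameNode using the standard scripts in /usr/local/hadoop/sbin/.

# Stop the cluster before any reboot (run on the NameNode).
mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR stop historyserver
stop-yarn.sh
stop-dfs.sh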
Web Ports:
NameNode
- 50070: HDFS Namenode
- 50075: HDFS Datanode
- 50090: HDFS Secondary Namenode
- 8088: Resource Manager
- 19888: Job History
DataNode
- 50075: HDFS Datanode
NetStat
To check which IP address each Hadoop port is bound to, run the following.
sudo netstat -ltnp
Port Check
If for some reason you are having issues connecting to a Hadoop port, run the following command (adjust the interface name to match your system) while you try to connect to that port.
sudo tcpdump -n -tttt -i eth1 port 50070
References
I used a lot of different resources and reference material for this, including various blog posts about memory utilization, but unfortunately I did not save all of the relevant links.