Spark Installation on Hadoop

In this tutorial I will show you how to use Kerberos/SSL with Spark integrated with YARN. I will use self-signed certificates for this example. Before you begin, ensure you have installed a Kerberos server and Hadoop.

This assumes your hostname is “hadoop”.

Create Kerberos Principals

cd /etc/security/keytabs/

sudo kadmin.local

#You can list principals

#Create the following principals
addprinc -randkey spark/hadoop@REALM.CA

#Create the keytab files.
#You will need these for Hadoop to be able to log in
xst -k spark.service.keytab spark/hadoop@REALM.CA

Set Keytab Permissions/Ownership

sudo chown root:hadoopuser /etc/security/keytabs/*
sudo chmod 750 /etc/security/keytabs/*
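A keytab holds a long-term key, so it is worth confirming that nothing in the keytab directory is accessible to "other" users once the ownership and mode are set. The following is a minimal sketch, not part of the original setup; the function name `world_accessible_files` and the demo file names are made up for illustration, and the demo runs against a throwaway directory so it does not need a Kerberos installation.

```shell
# List regular files under a directory that grant any permission
# (read, write or execute) to "other" users.
# For a keytab directory the list should be empty.
world_accessible_files() {
    find "$1" -type f \( -perm -004 -o -perm -002 -o -perm -001 \)
}

# Demo on a temporary directory with hypothetical file names.
demo_dir="$(mktemp -d)"
touch "$demo_dir/spark.service.keytab" "$demo_dir/loose.keytab"
chmod 750 "$demo_dir/spark.service.keytab"   # the mode used in this tutorial
chmod 644 "$demo_dir/loose.keytab"           # deliberately too permissive
world_accessible_files "$demo_dir"           # prints only the loose.keytab path
rm -rf "$demo_dir"
```

On a real host the call would be `world_accessible_files /etc/security/keytabs`, run as a user that can read that directory.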


Go to Apache Spark Download and get the link for Spark.

tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
mv spark-2.4.4-bin-hadoop2.7 /usr/local/spark/

Update .bashrc

nano ~/.bashrc

#Ensure we have the following in the Hadoop section
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

#Add the following

export SPARK_HOME=/usr/local/spark

source ~/.bashrc

Setup Configuration

cd /usr/local/spark/conf
mv spark-defaults.conf.template spark-defaults.conf
nano spark-defaults.conf

#Add to the end
spark.master                            yarn
spark.yarn.historyServer.address        ${hadoopconf-yarn.resourcemanager.hostname}:18080
spark.yarn.keytab                       /etc/security/keytabs/spark.service.keytab
spark.yarn.principal                    spark/hadoop@REALM.CA
spark.yarn.access.hadoopFileSystems     hdfs://NAMENODE:54310
spark.authenticate                      true
spark.authenticate.enableSaslEncryption true
spark.eventLog.enabled                  true
spark.eventLog.dir                      hdfs://NAMENODE:54310/user/spark/applicationHistory
spark.history.fs.logDirectory           hdfs://NAMENODE:54310/user/spark/applicationHistory
spark.history.fs.update.interval        10s
spark.history.ui.port                   18080

spark.ssl.enabled                       true
spark.ssl.keyPassword                   PASSWORD
spark.ssl.keyStore                      /etc/security/serverKeys/keystore.jks
spark.ssl.keyStorePassword              PASSWORD
spark.ssl.keyStoreType                  JKS
spark.ssl.trustStore                    /etc/security/serverKeys/truststore.jks
spark.ssl.trustStorePassword            PASSWORD
spark.ssl.trustStoreType                JKS
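The event log settings only work together if the history server reads from the same directory the applications write to, that is, if spark.eventLog.dir and spark.history.fs.logDirectory agree. The sketch below shows one way to check that; the helper name `conf_value` is an assumption for illustration, and it only handles the simple "key, whitespace, value" layout used in this file.

```shell
# Print the value of one key from a whitespace-separated properties file
# such as spark-defaults.conf.
conf_value() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Demo on a temporary file holding two settings from this section.
demo_conf="$(mktemp)"
cat > "$demo_conf" <<'EOF'
spark.eventLog.dir                      hdfs://NAMENODE:54310/user/spark/applicationHistory
spark.history.fs.logDirectory           hdfs://NAMENODE:54310/user/spark/applicationHistory
EOF

written="$(conf_value "$demo_conf" spark.eventLog.dir)"
read_back="$(conf_value "$demo_conf" spark.history.fs.logDirectory)"
if [ "$written" = "$read_back" ]; then
    echo "history server reads the directory the applications write to"
else
    echo "mismatch: $written vs $read_back"
fi
rm -f "$demo_conf"
```

Against the real file the first argument would be /usr/local/spark/conf/spark-defaults.conf.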


kinit -kt /etc/security/keytabs/spark.service.keytab spark/hadoop@REALM.CA
hdfs dfs -mkdir /user/spark/
hdfs dfs -mkdir /user/spark/applicationHistory
hdfs dfs -ls /user/spark

Start The Service


Stop The Service


Spark History Server Web UI


Hadoop 3.2.0: Installation

I would like to share what I have learned and applied in the hope that it will help someone else configure their system. The deployment described here has one NameNode and one or more DataNodes on Ubuntu 16.04, assuming 5 CPU cores and 13GB of RAM. I will include every command used in this tutorial, right down to the very basics, for those who are new to Ubuntu.

NOTE: Sometimes you may have to use “sudo” in front of the command. I also use nano in this article for beginners, but you can use any editor you prefer (e.g., vi). This article also does not cover SSL, Kerberos, etc. For all purposes here, Hadoop will be open without requiring a login.

Additional Setup/Configurations to Consider:

ZooKeeper: It is also a good idea to use ZooKeeper to synchronize your configuration.

Secondary NameNode: This should be run on a separate server. Its function is to take checkpoints of the NameNode's file system.

Rack Awareness: Fault tolerance to ensure blocks are placed as evenly as possible on different racks, if they are available.

Apply the following to all NameNode and DataNodes unless otherwise directed:

Hadoop User:
For this example we will just use hduser as our group and user for simplicity's sake.
The “-a” on usermod appends the user to the groups listed with “-G” instead of replacing the user's existing groups.

addgroup hduser
sudo gpasswd -a $USER sudo
usermod -a -G sudo hduser

Install JDK:

apt-get update
apt-get upgrade
apt-get install default-jdk

Install SSH:

apt-get install ssh
which ssh
which sshd

These two commands check that ssh installed correctly. They should return “/usr/bin/ssh” and “/usr/sbin/sshd”.
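The same kind of check can be done for several tools at once. This is a small sketch rather than anything the installation requires; `missing_commands` is a made-up helper name, and it only looks at the PATH of the current user (sshd lives in an sbin directory that may not be on a normal user's PATH).

```shell
# Print each command name from the argument list that is not on the PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# For this tutorial the call would be: missing_commands ssh sshd java
# Demo with one command that always exists and one made-up name.
missing_commands sh no-such-command-for-demo
```

An empty result means every listed command was found.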

java -version

Use this to verify that Java installed correctly. It will return something like the following.

openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

System Configuration

nano ~/.bashrc

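The .bashrc is a script that is executed when a terminal session is started.
Add the following line to the end and save. Hadoop uses IPv4; note that the flag below enables compressed object pointers in the JVM, while the flag that tells the JVM to prefer IPv4 is -Djava.net.preferIPv4Stack=true.

export _JAVA_OPTIONS='-XX:+UseCompressedOops'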

source ~/.bashrc


Disable IPv6, as it causes issues in getting your server up and running.

nano /etc/sysctl.conf

Add the following to the end and save

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
#Change eth0 to what ifconfig has
net.ipv6.conf.eth0.disable_ipv6 = 1

Save and close sysctl.conf, then apply the changes:

sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If all the above disabling IPv6 configuration was successful you should get “1” returned.
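The check above only looks at the "all" entry. As a rough sketch, the loop below reports every interface whose disable_ipv6 flag is not 1; the function name `ipv6_enabled_interfaces` is made up for illustration, and the demo uses a temporary directory laid out like /proc/sys/net/ipv6/conf so it can run anywhere.

```shell
# Print the name of every interface whose disable_ipv6 flag is not "1".
# $1: directory laid out like /proc/sys/net/ipv6/conf (the default).
ipv6_enabled_interfaces() {
    local conf_dir flag
    conf_dir="${1:-/proc/sys/net/ipv6/conf}"
    for flag in "$conf_dir"/*/disable_ipv6; do
        [ -r "$flag" ] || continue
        if [ "$(cat "$flag")" != "1" ]; then
            basename "$(dirname "$flag")"
        fi
    done
}

# Demo on a fake tree: "all" is disabled, "eth0" was forgotten.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/all" "$demo_dir/eth0"
echo 1 > "$demo_dir/all/disable_ipv6"
echo 0 > "$demo_dir/eth0/disable_ipv6"
ipv6_enabled_interfaces "$demo_dir"    # prints: eth0
rm -rf "$demo_dir"
```

Called without arguments on the server, an empty result means IPv6 is disabled on every interface.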
Sometimes you can reach the open file descriptor limit and the open file limit. If you encounter this issue, you might have to raise the ulimit and the descriptor limit. For this example I have set some values, but you will have to figure out the best numbers for your specific case.

If you get “cannot stat /proc/sys/-p: No such file or directory”, then you need to add /sbin/ to PATH.

nano ~/.bashrc
export PATH=$PATH:/sbin/
nano /etc/sysctl.conf

fs.file-max = 500000

sysctl -p


nano /etc/security/limits.conf

* soft nofile 60000
* hard nofile 60000


Test Limits

You can now test the limits you applied to make sure they took effect.

ulimit -a
more /proc/sys/fs/file-max
more /proc/sys/fs/file-nr
lsof | wc -l

file-max: Maximum number of file descriptors the system can have open
file-nr: How many file descriptors are currently being used
lsof | wc -l: How many files are currently open
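The file-nr file holds three numbers: allocated file handles, allocated but unused handles, and the maximum (the same value as file-max). The sketch below turns that into a usage percentage; `fd_usage_percent` is a made-up helper name and the demo numbers are invented so the arithmetic is easy to follow.

```shell
# Print the percentage of the system-wide file handle limit in use.
# $1: a file in the /proc/sys/fs/file-nr format (the default):
#     allocated  unused  maximum
fd_usage_percent() {
    local file_nr
    file_nr="${1:-/proc/sys/fs/file-nr}"
    awk '{ printf "%d\n", ($1 - $2) * 100 / $3 }' "$file_nr"
}

# Demo: 125000 handles in use out of the 500000 configured above.
demo_file="$(mktemp)"
printf '125000\t0\t500000\n' > "$demo_file"
fd_usage_percent "$demo_file"    # prints: 25
rm -f "$demo_file"
```

Run without arguments on the server to read the live value.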

You might be wondering why we installed ssh at the beginning. That is because Hadoop uses ssh to access its nodes. We need to eliminate the password requirement by setting up SSH keys. If asked for a filename, just leave it blank and confirm with Enter.

su hduser

Run the above if you are not already logged in as the user we created in the Hadoop User section.

ssh-keygen -t rsa -P ""

You will get the below example as well as the fingerprint and randomart image.

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory ‘/home/hduser/.ssh’.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/

cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

You may get “No such file or directory”. It is most likely just the filename. Look in the .ssh directory for the actual name of the public key file.

This will add the newly created key to the list of authorized keys so that Hadoop can use SSH without prompting for a password.
Now we check that it worked by running “ssh localhost”. When asked whether you want to continue connecting, type “yes” and press Enter. localhost will be permanently added to the list of known hosts.
Once we have done this on the NameNode and all DataNodes, run the following command from the NameNode for each DataNode.

ssh-copy-id -i ~/.ssh/ hduser@DATANODEHOSTNAME
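Appending a key with ">>" adds a duplicate line every time the step is repeated. As a sketch of an alternative, the helper below appends a public key only if it is not already present; `authorize_key`, the demo key text and the demo file names are all placeholders, not values from this setup.

```shell
# Append a public key file to an authorized_keys file unless the exact
# line is already there. Creates the target with mode 600 if needed.
authorize_key() {
    local pub_key auth_file
    pub_key="$1"
    auth_file="$2"
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -x whole line, -F fixed string, -f read the pattern from the key file
    grep -qxF -f "$pub_key" "$auth_file" || cat "$pub_key" >> "$auth_file"
}

# Demo in a temporary directory with a placeholder key.
demo_dir="$(mktemp -d)"
echo "ssh-rsa AAAAEXAMPLEKEY hduser@namenode" > "$demo_dir/demo_key.pub"
authorize_key "$demo_dir/demo_key.pub" "$demo_dir/authorized_keys"
authorize_key "$demo_dir/demo_key.pub" "$demo_dir/authorized_keys"   # no second copy
grep -c . "$demo_dir/authorized_keys"                                # prints: 1
rm -rf "$demo_dir"
```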

/etc/hosts Update

We need to update the hosts file.

sudo nano /etc/hosts

#Comment out line " localhost" HOSTNAME localhost

Now we are getting to the part we have been waiting for.

Hadoop Installation:

NAMENODE: You will see this in the config files below. It can be the hostname, the static IP, or it could be so that all TCP ports will be bound to all IPs of the server. You should also note that the masters and slaves files later on in this tutorial can still use the hostname.

Note: You could run rsync after setting up the Name Node Initial configuration to each Data Node if you want. This would save initial hadoop setup time. You do that by running the following command:

rsync -a /usr/local/hadoop/ hduser@DATANODEHOSTNAME:/usr/local/hadoop/

Download & Extract:

tar xvzf hadoop-3.2.0.tar.gz
sudo mv hadoop-3.2.0/ /usr/local/hadoop
chown -R hduser:hduser /usr/local/hadoop
update-alternatives --config java

Basically, the above downloads, extracts, and moves the extracted hadoop directory to /usr/local, switches ownership to hduser in case hduser doesn’t already own the newly created directory, and tells us the path where Java has been installed so that we can set the JAVA_HOME environment variable. It should return something like the following:

There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

nano ~/.bashrc

Add the following to the end of the file. Make sure to do this on Name Node and all Data Nodes:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

export HDFS_NAMENODE_USER=hduser
export HDFS_DATANODE_USER=hduser


source ~/.bashrc
javac -version
which javac
readlink -f /usr/bin/javac

This basically validates that bashrc update worked!
javac should return “javac 1.8.0_171” or something similar
which javac should return “/usr/bin/javac”
readlink should return “/usr/lib/jvm/java-8-openjdk-amd64/bin/javac”
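The JAVA_HOME value set earlier is simply the readlink result with the trailing /bin/javac removed. A minimal sketch of that derivation follows; `java_home_from_javac` is a made-up helper name, and the demo path is the one shown in the expected output above.

```shell
# Derive a JAVA_HOME candidate from the resolved path of javac by
# stripping the trailing "/bin/javac".
java_home_from_javac() {
    dirname "$(dirname "$1")"
}

# Demo with the path this tutorial expects from readlink.
java_home_from_javac /usr/lib/jvm/java-8-openjdk-amd64/bin/javac
# prints: /usr/lib/jvm/java-8-openjdk-amd64

# On a live system the argument would come from:
#   readlink -f "$(command -v javac)"
```

If the printed value differs from the JAVA_HOME exported in .bashrc, the export is the one to correct.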

Memory Tools

There is an application from Hortonworks you can download which can help get you started on how you should set up memory utilization for YARN. I found it’s a great starting point, but you need to tweak it to work for your specific case.

tar zxvf hdp_manual_install_rpm_helper_files-
cd hdp_manual_install_rpm_helper_files-
sudo apt-get install python2.7
python2.7 scripts/ -c 5 -m 13 -d 1 -k False

-c is for how many cores you have
-m is for how much memory you have
-d is for how many disks you have
-k is for whether you are running HBase: False if you are not, True if you are.

After the script is run it will give you guidelines on YARN/MapReduce settings. Remember they are guidelines. Tweak as needed.
Now the real fun begins!!! Remember that these settings are what worked for me and you may need to adjust them.

nano /usr/local/hadoop/etc/hadoop/

You will see JAVA_HOME near the beginning of the file. You will need to change it to where Java is installed on your system.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

mkdir -p /app/hadoop/tmp

This is the temp directory Hadoop uses.

chown hduser:hduser /app/hadoop/tmp


nano /usr/local/hadoop/etc/hadoop/core-site.xml

This file contains configuration properties that Hadoop uses when starting up. By default it contains an empty <configuration> element. This will need to be changed.

            <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>


nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
            <name>yarn.resourcemanager.scheduler.class</name> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
            <description>Where to aggregate logs to.</description>

By default it contains an empty <configuration> element. This will need to be changed.


By default, the /usr/local/hadoop/etc/hadoop/ folder contains a /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file, which has to be copied to a file named mapred-site.xml. By default it contains an empty <configuration> element. This will need to be changed.

cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
      <!-- Memory and concurrency tuning -->
            <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
            <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>

nano /usr/local/hadoop/etc/hadoop/

Change or uncomment or add the following:

HADOOP_OPTS="$HADOOP_OPTS -server -Dhadoop.log.dir=$YARN_LOG_DIR"


Add the namenode hostname.

nano /usr/local/hadoop/etc/hadoop/masters



Add the NameNode hostname and the hostnames of all DataNodes. Note that Hadoop 3.x reads this list from a file named workers rather than slaves.

nano /usr/local/hadoop/etc/hadoop/slaves


By default it contains an empty <configuration> element. This will need to be changed. The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. Before editing this file, we need to create the namenode directory.

mkdir -p /usr/local/hadoop_store/data/namenode
chown -R hduser:hduser /usr/local/hadoop_store
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
            <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
      <!-- URL -->
            <description>Your NameNode hostname for http access.</description>
            <description>Your Secondary NameNode hostname for http access.</description>



Add only that DataNode's hostname.

nano /usr/local/hadoop/etc/hadoop/slaves


The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. Before editing this file, we need to create the datanode directory.
By default it contains an empty <configuration> element. This will need to be changed.

mkdir -p /usr/local/hadoop_store/data/datanode
chown -R hduser:hduser /usr/local/hadoop_store
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
            <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
            <description>Your NameNode hostname for http access.</description>
            <description>Your Secondary NameNode hostname for http access.</description>

If you have the Ubuntu firewall on, you need to allow traffic through on all necessary ports.

sudo ufw allow 50070
sudo ufw allow 8088

Format Cluster:
Only do this if NO data is present. All data will be destroyed when the following is done.
This is to be done on NAMENODE ONLY!

hdfs namenode -format

Start The Cluster:
You can now start the cluster.
You do this from the NAMENODE ONLY.
mapred --config $HADOOP_CONF_DIR --daemon start historyserver

If the above three commands didn’t work, something went wrong, as they should have found the scripts located in the /usr/local/hadoop/sbin/ directory.

Cron Job:
You should probably set up a cron job to start the cluster when you reboot.

crontab -e

@reboot /usr/local/hadoop/sbin/ > /home/hduser/dfs-start.log 2>&1
@reboot /usr/local/hadoop/sbin/ > /home/hduser/yarn-start.log 2>&1
@reboot /usr/local/hadoop/bin/mapred --config $HADOOP_CONF_DIR --daemon start historyserver > /home/hduser/history-start.log 2>&1

To check that everything is working as it should, run “jps” on the NAMENODE. You could also run “netstat -plten | grep java” or “lsof -i :50070” and “lsof -i :8088”. jps should return something like the following, where the pids will be different:

Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g
12007 SecondaryNameNode
13090 Jps
12796 JobHistoryServer
12261 ResourceManager
11653 NameNode
12397 NodeManager
11792 DataNode

You can check the DATA NODES by ssh into each one and running “jps”. It should return something like the following where the pid will be different:

Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g
3218 Jps
2215 NodeManager
2411 DataNode

If for any reason one of the services is not running, you need to review the logs. They can be found at /usr/local/hadoop/logs/. If it’s the ResourceManager that isn’t running, then look at the file that has “yarn” and “resourcemanager” in its name.
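Comparing the jps listing against the expected daemons by eye gets tedious across several nodes. The following is a rough sketch of automating it; `missing_daemons` is a made-up helper name, and the demo feeds it canned output rather than calling jps so that it runs on any machine.

```shell
# Read `jps` output on stdin and print each expected daemon that is absent.
# Arguments: the daemon names that must be present.
missing_daemons() {
    local jps_output name
    jps_output="$(cat)"
    for name in "$@"; do
        # Compare the second column exactly, so NameNode does not match
        # SecondaryNameNode.
        printf '%s\n' "$jps_output" |
            awk -v d="$name" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$name"
    done
}

# Demo with canned output in which two daemons are not running.
printf '%s\n' '12007 SecondaryNameNode' '13090 Jps' '11653 NameNode' '11792 DataNode' |
    missing_daemons NameNode DataNode ResourceManager NodeManager
# prints:
# ResourceManager
# NodeManager
```

On a live node the input would come from the real command, for example `jps | missing_daemons NodeManager DataNode` on a DataNode.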

Never reboot the system without first stopping the cluster. Once the cluster has shut down it is safe to reboot. Also, if you configured an @reboot cron job, make sure the DATANODES are up and running before starting the NAMENODE; that way it automatically starts the DATANODES for you.

Web Ports:


  • 50070: HDFS Namenode
  • 50075: HDFS Datanode
  • 50090: HDFS Secondary Namenode
  • 8088: Resource Manager
  • 19888: Job History


  • 50075: HDFS Datanode


To check which IP each Hadoop port is listening on, run the following.

sudo netstat -ltnp
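The netstat listing can also be filtered for the specific web ports listed above. This is a sketch under the assumption that the input has the usual netstat -ltn column layout (local address in column 4, state in column 6); `ports_not_listening` is a made-up helper name and the demo lines are canned, not captured from a real cluster.

```shell
# Read `netstat -ltn` style output on stdin and print each port from the
# argument list that has no LISTEN entry.
ports_not_listening() {
    local listing port
    listing="$(cat)"
    for port in "$@"; do
        # Column 4 is the local address; compare its ":port" suffix so that
        # port 88 does not match 8088.
        printf '%s\n' "$listing" |
            awk -v p=":$port" '
                $6 == "LISTEN" && substr($4, length($4) - length(p) + 1) == p { found = 1 }
                END { exit !found }' ||
            echo "$port"
    done
}

# Demo with canned output: 19888 (Job History) is not listening.
cat <<'EOF' | ports_not_listening 50070 8088 19888
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      11653/java
tcp        0      0 0.0.0.0:8088            0.0.0.0:*               LISTEN      12261/java
EOF
# prints: 19888
```

On the server the input would be piped in, for example `sudo netstat -ltnp | ports_not_listening 50070 8088 19888`.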

Port Check

If for some reason you are having issues connecting to a Hadoop port, run the following command while you try to connect via the port.

sudo tcpdump -n -tttt -i eth1 port 50070


HDFS/Yarn/MapRed: Kerberize/SSL

In this tutorial I will show you how to use Kerberos/SSL with HDFS/Yarn/MapRed. I will use self-signed certificates for this example. Before you begin, ensure you have installed a Kerberos server and Hadoop.

This assumes your hostname is “hadoop”.

Create Kerberos Principals

cd /etc/security/keytabs/

sudo kadmin.local

#You can list principals

#Create the following principals
addprinc -randkey nn/hadoop@REALM.CA
addprinc -randkey jn/hadoop@REALM.CA
addprinc -randkey dn/hadoop@REALM.CA
addprinc -randkey sn/hadoop@REALM.CA
addprinc -randkey nm/hadoop@REALM.CA
addprinc -randkey rm/hadoop@REALM.CA
addprinc -randkey jhs/hadoop@REALM.CA
addprinc -randkey HTTP/hadoop@REALM.CA

#We are going to create a user to access with later
addprinc -pw hadoop myuser/hadoop@REALM.CA
xst -k myuser.keytab myuser/hadoop@REALM.CA

#Create the keytab files.
#You will need these for Hadoop to be able to log in
xst -k nn.service.keytab nn/hadoop@REALM.CA
xst -k jn.service.keytab jn/hadoop@REALM.CA
xst -k dn.service.keytab dn/hadoop@REALM.CA
xst -k sn.service.keytab sn/hadoop@REALM.CA
xst -k nm.service.keytab nm/hadoop@REALM.CA
xst -k rm.service.keytab rm/hadoop@REALM.CA
xst -k jhs.service.keytab jhs/hadoop@REALM.CA
xst -k spnego.service.keytab HTTP/hadoop@REALM.CA

Set Keytab Permissions/Ownership

sudo chown root:hadoopuser /etc/security/keytabs/*
sudo chmod 750 /etc/security/keytabs/*

Stop the Cluster

--config $HADOOP_CONF_DIR stop historyserver

Hosts Update

sudo nano /etc/hosts

#Remove line

#Change to the following
#Notice how is there its because we need to tell where that host resides hadoop localhost

We don’t set HADOOP_SECURE_DN_USER because we are going to use Kerberos.

sudo nano /usr/local/hadoop/etc/hadoop/

#and change to



nano /usr/local/hadoop/etc/hadoop/core-site.xml

		<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming
		the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
		<value>kerberos</value> <!-- A value of "simple" would disable security. -->


Change ssl-server.xml.example to ssl-server.xml

cp /usr/local/hadoop/etc/hadoop/ssl-server.xml.example /usr/local/hadoop/etc/hadoop/ssl-server.xml

nano /usr/local/hadoop/etc/hadoop/ssl-server.xml

Update properties

		<description>Truststore to be used by NN and DN. Must be specified.</description>
		<description>Optional. Default value is "".</description>
		<description>Optional. The keystore file format, default value is "jks".</description>
		<description>Truststore reload check interval, in milliseconds. Default value is 10000 (10 seconds).</description>
		<description>Keystore to be used by NN and DN. Must be specified.</description>
		<description>Must be specified.</description>
		<description>Must be specified.</description>
		<description>Optional. The keystore file format, default value is "jks".</description>
		<description>Optional. The weak security cipher suites that you want excluded from SSL communication.</description>


Change ssl-client.xml.example to ssl-client.xml

cp /usr/local/hadoop/etc/hadoop/ssl-client.xml.example /usr/local/hadoop/etc/hadoop/ssl-client.xml

nano /usr/local/hadoop/etc/hadoop/ssl-client.xml

Update properties

		<description>Truststore to be used by clients like distcp. Must be specified.</description>
		<description>Optional. Default value is "".</description>
		<description>Optional. The keystore file format, default value is "jks".</description>
		<description>Truststore reload check interval, in milliseconds. Default value is 10000 (10 seconds).</description>
		<description>Keystore to be used by clients like distcp. Must be specified.</description>
		<description>Optional. Default value is "".</description>
		<description>Optional. Default value is "".</description>
		<description>Optional. The keystore file format, default value is "jks".</description>


Just add the following to the config to let it know the Kerberos keytabs to use.

nano /usr/local/hadoop/etc/hadoop/mapred-site.xml



Add the following properties

nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

	<description>Your NameNode hostname for http access.</description>
	<description>Your Secondary NameNode hostname for http access.</description>
	<description>If "true", access tokens are used as capabilities for accessing datanodes. If "false", no access tokens are checked on accessing datanodes.</description>
	<description> Kerberos principal name for the NameNode</description>
	<description>Kerberos principal name for the secondary NameNode.</description>
	<description>The Kerberos keytab file with the credentials for the HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.</description>
	<description>Combined keytab file containing the namenode service and host principals.</description>
	<description>The filename of the keytab file for the DataNode.</description>
	<description>The Kerberos principal that the DataNode runs as. "_HOST" is replaced by the real host name.</description>
	<description>The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.</description>          

Remove the following properties



Add the following properties

nano /usr/local/hadoop/etc/hadoop/yarn-site.xml


Remove the following properties



Setup SSL Directories

sudo mkdir -p /etc/security/serverKeys
sudo chown -R root:hadoopuser /etc/security/serverKeys/
sudo chmod 755 /etc/security/serverKeys/

cd /etc/security/serverKeys

Setup Keystore

sudo keytool -genkey -alias NAMENODE -keyalg RSA -keysize 2048 -dname "CN=NAMENODE,OU=ORGANIZATION_UNIT,C=CA" -keypass PASSWORD -keystore /etc/security/serverKeys/keystore.jks -storepass PASSWORD
sudo keytool -export -alias NAMENODE -keystore /etc/security/serverKeys/keystore.jks -rfc -file /etc/security/serverKeys/NAMENODE.csr -storepass PASSWORD

Setup Truststore

sudo keytool -import -noprompt -alias NAMENODE -file /etc/security/serverKeys/NAMENODE.csr -keystore /etc/security/serverKeys/truststore.jks -storepass PASSWORD

Generate Self Signed Certificate

sudo openssl genrsa -out /etc/security/serverKeys/NAMENODE.key 2048

sudo openssl req -x509 -new -key /etc/security/serverKeys/NAMENODE.key -days 300 -out /etc/security/serverKeys/NAMENODE.pem

sudo keytool -keystore /etc/security/serverKeys/keystore.jks -alias NAMENODE -certreq -file /etc/security/serverKeys/NAMENODE.cert -storepass PASSWORD -keypass PASSWORD

sudo openssl x509 -req -CA /etc/security/serverKeys/NAMENODE.pem -CAkey /etc/security/serverKeys/NAMENODE.key -in /etc/security/serverKeys/NAMENODE.cert -out /etc/security/serverKeys/NAMENODE.signed -days 300 -CAcreateserial

Setup File Permissions

sudo chmod 440 /etc/security/serverKeys/*
sudo chown root:hadoopuser /etc/security/serverKeys/*

Start the Cluster

--config $HADOOP_CONF_DIR start historyserver

Create User Directory

kinit -kt /etc/security/keytabs/myuser.keytab myuser/hadoop@REALM.CA
#ensure the login worked

#Create hdfs directory now
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/myuser

#remove kerberos ticket




HortonWorks: Install YARN/MR

This tutorial guides you through installing YARN/MapReduce on Hortonworks using a multi node cluster setup with Ubuntu OS.

Step 1: Go to “Stack and Version”. Then click “Add Service” on YARN. You will notice that “MapReduce2” comes with it.

Step 2: Assign Masters. I usually put the ResourceManager, History Server and App Timeline Server all on the secondary namenode. But it is totally up to you how you set up your environment.

Step 3: Assign Slaves and Clients. I put NodeManagers on all the datanodes and Clients on all servers. Up to you though. This is what worked for me and my requirements.

Step 4: During Customize Services you may get the warning that Ambari Metrics “hbase_master_heapsize” needs to be increased. I recommend making this change, but it’s up to you and what makes sense in your environment.

Step 5: Follow the remaining steps and installation should complete with no issues. Should an issue arise, review the error. If it was just a connection error while a service was turning on, then you may not have a real problem and only need to stop and start all services again. Please note that Ambari Metrics may report errors, but they should clear in around 15 minutes.


Hadoop: Commands

Below is a list of all the commands I have had to use while working with Hadoop. If you have any others that are not listed here, or updates to the ones below, please feel free to add them.

Move Files:

 hadoop fs -mv /OLD_DIR/* /NEW_DIR/

Sort Files By Size. Note this is for viewing information on the terminal only. It has no effect on the files or the way they are displayed via the web UI:

 hdfs fsck /logs/ -files | grep "/FILE_DIR/" | grep -v "<dir>" | gawk '{print $2, $1;}' | sort -n

Display system information:

 hdfs fsck /FILE_dir/ -files

Remove folder with all files in it:

 hadoop fs -rm -R hdfs:///DIR_TO_REMOVE

Make folder:

 hadoop fs -mkdir hdfs:///NEW_DIR

Remove one file:

 hadoop fs -rm hdfs:///DIR/FILENAME.EXTENSION

Copy all files from a directory outside of HDFS to HDFS:

 hadoop fs -copyFromLocal LOCAL_DIR hdfs:///DIR

Copy files from HDFS to local directory:

 hadoop fs -copyToLocal hdfs:///DIR/REGPATTERN LOCAL_DIR

Kill a running MR job:

 hadoop job -kill job_1461090210469_0003

You could also do that via the web UI on port 8088.

Kill yarn application:

 yarn application -kill application_1461778722971_0001

Check the status of the DATANODES. Look at the “Under Replicated blocks” field. If you have any, you should probably rebalance:

 hdfs dfsadmin -report

Number of files in HDFS directory:

 hadoop fs -count -q hdfs:///DIR


Rename directory:

 hadoop fs -mv hdfs:///OLD_NAME hdfs:///NEW_NAME

Change replication factor on files:

 hadoop fs -setrep -R 3 hdfs:///DIR

3 is the replication number.
You can also target a single file instead of a directory.

Get yarn log. You can also view via web ui 8088:

 yarn logs -applicationId application_1462141864581_0016

Refresh Nodes:

 hdfs dfsadmin -refreshNodes

Report of blocks and their locations:

 hdfs fsck / -files -blocks -locations

Find out where a particular file is located with blocks:

 hdfs fsck /DIR/FILENAME -files -locations -blocks

Fix under replicated blocks. The first command collects the files that have under replicated blocks. The second sets replication to 2 for those files. You might have to restart the dfs to see a change in dfsadmin -report:

 hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' > /tmp/under_replicated_files

while IFS= read -r hdfsfile; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 2 "$hdfsfile" < /dev/null; done < /tmp/under_replicated_files
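A file with several under replicated blocks shows up once per block in the fsck output, so the list can contain repeats. The sketch below wraps the same grep and awk steps in a function and removes duplicates; `under_replicated_paths` is a made-up name, and the demo lines are invented to resemble the "path: Under replicated ..." lines that the grep above relies on, so the exact wording of real fsck output may differ. Paths that themselves contain a colon would be cut short by this approach.

```shell
# Read `hdfs fsck` output on stdin and print each path that has at least
# one under replicated block, once, in sorted order.
under_replicated_paths() {
    grep 'Under replicated' | awk -F':' '{ print $1 }' | sort -u
}

# Demo with made-up fsck lines: a.csv appears twice, c.csv is healthy.
cat <<'EOF' | under_replicated_paths
/data/b.csv:  Under replicated BP-1:blk_3_1. Target Replicas is 3 but found 1 live replica(s).
/data/a.csv:  Under replicated BP-1:blk_1_1. Target Replicas is 3 but found 1 live replica(s).
/data/a.csv:  Under replicated BP-1:blk_2_1. Target Replicas is 3 but found 1 live replica(s).
/data/c.csv 120 bytes, 1 block(s):  OK
EOF
# prints:
# /data/a.csv
# /data/b.csv
```

On a cluster the input would be `hdfs fsck / | under_replicated_paths`, and the result can be fed to the setrep loop.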

Show all the classpaths associated to hadoop:

 hadoop classpath

