If you want to set up rack awareness, it is really straightforward.
Step 1: Login to Ambari Server
Step 2: Click “Hosts”
Step 3: Go to each server and modify “rack” setting.
Step 4: Shut down cluster and restart.
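If you would rather script it than click through the UI, Ambari also exposes the host “rack_info” field through its REST API. The example below is only a sketch; the credentials, cluster name, host FQDN, and rack path are placeholders, and you should double check the API against your Ambari version:
- curl -u admin:admin -H "X-Requested-By: ambari" -X PUT -d '{"Hosts":{"rack_info":"/rack1"}}' http://AMBARI_HOST:8080/api/v1/clusters/CLUSTER_NAME/hosts/HOST_FQDN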
Happy racking!
This tutorial guides you through installing Hortonworks Hadoop on a multi-node cluster running Ubuntu.
Ensure every server has the FQDN of all the servers to be in the cluster.
- sudo nano /etc/hosts
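For example, the hosts file for a small three-node cluster might contain entries like the following; the IPs and FQDNs here are placeholders for your own:
- 192.168.1.10 namenode.domain.com namenode
- 192.168.1.11 datanode1.domain.com datanode1
- 192.168.1.12 datanode2.domain.com datanode2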
You will do the following to all the servers in the Hadoop cluster.
- ssh-keygen -t rsa -P ""
- cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- ssh-copy-id -i ~/.ssh/id_rsa.pub ##USER##@##FQDN##
- ssh ##USER##@##FQDN##
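If you have more than a couple of servers, a small loop saves some typing when copying the key around. This is just a sketch; the host list and ##USER## are placeholders for your own values:
- for host in server1.domain.com server2.domain.com server3.domain.com; do
-     ssh-copy-id -i ~/.ssh/id_rsa.pub ##USER##@$host
- done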
Java 8:
- sudo apt-get install openjdk-8-jdk
Chrony:
- sudo apt-get install chrony
Disable HugePage:
- sudo su
- echo never > /sys/kernel/mm/transparent_hugepage/enabled
- exit
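Note that the echo above only lasts until the next reboot. One common way to make it persistent (this is an assumption on my part, there are other approaches such as a systemd unit) is to add the same line to /etc/rc.local above the exit 0 line:
- sudo nano /etc/rc.local
- #Add the following above "exit 0"
- echo never > /sys/kernel/mm/transparent_hugepage/enabled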
You will need to log in to Ambari Server and click “Launch Install Wizard”. For the most part you will just follow the prompts. The major hurdle is that in the “Install Options” section you must enter the FQDN of each host (e.g., host.domain.com). You will also need the SSH private key from the Ambari Server that you created during the prerequisites, located at /home/##USER##/.ssh/id_rsa, and make sure you set the SSH user account to the one you used during SSH key creation. If for any reason it fails, you can click the status to find out what failed and rectify the problem. As long as you did the prerequisites you should be fine.
ZooKeeper / Ambari Metrics
As you install HDFS you will notice that Ambari Metrics and ZooKeeper get installed automatically. This is a good thing and you want it. ZooKeeper keeps all configs in sync and Ambari Metrics lets you easily monitor the system.
Assign Masters
You will need to set up how you want your masters laid out. I usually have three ZooKeepers, and your Secondary NameNode should go on a separate server, but it is totally up to you how you design your cluster. Have fun!
Assign Slaves / Clients
For slaves (aka DataNodes), I don’t put any on my NameNode, Secondary NameNode, or ZooKeeper servers; I leave the DataNodes to perform that role alone. I also install clients on the NameNode, Secondary NameNode, and all DataNodes. It is up to you how you configure it; just have fun while doing it!
Key Config Optional Changes
Once you get to the “Customize Services” section, you can for the most part leave things as is and just fill in the password fields, but I do recommend reviewing the key configuration values and updating them as needed.
Deploy
Deploy should work with no issues. If there are issues, sometimes you don’t need to worry about them, such as a transient connection issue: as long as a component installed, even if it didn’t start right away because of the connection issue, it may still start once the deploy completes. You should also note that Ambari Metrics shows errors directly after starting; that is expected and it will clear itself.
If you want to use SSL with Ambari Server (note this is not with Hadoop yet) then follow the below steps. Please note this does not cover the creation of an SSL cert, as there are many tutorials available on how to create self-signed certs, etc.
Step 1: Stop the Ambari Server
- sudo ambari-server stop
Step 2: Run Ambari Server Security Setup Command
- sudo ambari-server setup-security
Select option 1 during the prompts and note that you cannot use port 443 for HTTPS, as that is reserved in Ambari; the default is 8443 and that is what is recommended. Enter the path to your cert, e.g. the /etc/ssl/certs/hostname.cer file, and the path to your encrypted key, e.g. the /etc/ssl/private/hostname.key file, then follow the rest of the prompts.
Step 3: Start Ambari Server
- sudo ambari-server start
Step 4: Login to Ambari Server now available at https://hostname:8443
If you want to use LDAP with your Ambari Server then follow the below steps.
Step 1: Stop the Ambari Server to setup LDAP integration
- sudo ambari-server stop
Step 2: Run Ambari Server LDAP Setup Command. This will require a bunch of settings to be set. Consult your IT department for your specific settings.
- sudo ambari-server setup-ldap
Step 3: Create the groups and users text files. Add the users you want (comma separated) to the users file and the groups (comma separated) to the groups file.
- nano ~/groups.txt
- nano ~/users.txt
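As a purely hypothetical example, the two files could look like this; the user and group names are placeholders for your own LDAP entries:
- #users.txt
- jsmith,jdoe,asinger
- #groups.txt
- hadoop-admins,hadoop-users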
Step 4 (Optional): You may need to adjust the “SERVER_API_HOST” value to your ambari server hostname. Default is 127.0.0.1 which is technically your host but sometimes it complains and you need to make this modification.
- sudo nano /usr/lib/python2.6/site-packages/ambari_server/serverUtils.py
Step 5: Import Groups/Users from the text files created in step 3. You will need to start the ambari server first.
- sudo ambari-server start
- #Import groups
- sudo ambari-server sync-ldap --groups groups.txt
- #Import users
- sudo ambari-server sync-ldap --users users.txt
Step 6: Log in to Ambari, go to “Manage Ambari”, and you will see your new users and groups.
I have been playing around with the Hortonworks sandbox and thought it was about time I attempted installation on a multi-node cluster. Feel free to reach out to me for further support or information. I will be documenting more in the coming weeks.
- wget -O /etc/apt/sources.list.d/ambari.list http://public-repo-1.hortonworks.com/ambari/ubuntu16/2.x/updates/2.6.0.0/ambari.list
- sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD
- sudo apt-get update
- sudo apt-get install openjdk-8-jdk
- sudo apt-get install ambari-server
I recommend installing as non-root. In fact, do not accept any of the defaults for users and passwords, but it is totally up to you. I will document more as I learn more.
- sudo ambari-server setup
- sudo ambari-server restart
- sudo ambari-server start
- sudo ambari-server stop
I suggest changing the ambari log directory as well.
- sudo vi /etc/ambari-server/conf/log4j.properties
- #Look for the property "ambari.log.dir" and change it.
- #Don't forget to create the folder you point to and ensure that chown to the user that is running ambari.
As well change the ambari agent log directory.
- sudo vi /etc/ambari-agent/conf/ambari-agent.ini
- #Look for "logdir" and change to a directory that exists and has permissions.
Running as a non-root user you will have to change the run directory, because otherwise during a system reboot the folders may not be able to be created; /var/run/ gets deleted on each restart and gets rebuilt. Do the following:
- sudo mkdir -p /home/##NONROOTUSER##/run/ambari-server
- sudo chown ##NONROOTUSER##:root /home/##NONROOTUSER##/run/ambari-server
Then you have to edit the ambari.properties file.
- sudo vi /etc/ambari-server/conf/ambari.properties
- #Change the following to the folder you created above.
- bootstrap.dir
- pid.dir
- recommendations.dir
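For reference, with the run directory created above those three properties might end up looking something like this; these values are just an example, so adjust ##NONROOTUSER## and the sub-folders to whatever you actually created:
- bootstrap.dir=/home/##NONROOTUSER##/run/ambari-server/bootstrap
- pid.dir=/home/##NONROOTUSER##/run/ambari-server
- recommendations.dir=/home/##NONROOTUSER##/run/ambari-server/stack-recommendations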
By default a secondary namenode runs on the main namenode server. This is not ideal. A secondary namenode should be on its own server.
First bring up a new server that has the exact same configuration as the primary namenode.
- nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Remove the properties “dfs.namenode.secondary.http-address” and “dfs.namenode.name.dir”, as they are not needed on this server.
Then add the following property, making sure to change the value to the path you will store your checkpoints in.
- <property>
- <name>dfs.namenode.checkpoint.dir</name>
- <value>file:/usr/local/hadoop_store/data/checkpoint</value>
- </property>
- nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Then add the following property, making sure to change ##SECONDARYNAMENODE## to the hostname of the new secondary server.
- <property>
- <name>dfs.namenode.secondary.http-address</name>
- <value>##SECONDARYNAMENODE##:50090</value>
- <description>Your Secondary NameNode hostname for http access.</description>
- </property>
Now when you stop and start the cluster you will see the Secondary NameNode start on the secondary server and not on the primary NameNode server. This is what you want.
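To restart, you can run the standard scripts from the primary NameNode (this assumes Hadoop's sbin directory is on your PATH, as set up in the main install guide):
- stop-dfs.sh
- start-dfs.sh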
If you want your multi node cluster to be rack aware you need to do a few things. The following is to be done on the master (NameNode) only.
- nano /home/myuser/rack.sh
With the following contents
- #!/bin/bash
- # Adjust/Add the property "net.topology.script.file.name"
- # to core-site.xml with the "absolute" path to this
- # file. ENSURE the file is "executable".
- # Supply appropriate rack prefix
- RACK_PREFIX=myrackprefix
- # To test, supply a hostname as script input:
- if [ $# -gt 0 ]; then
- CTL_FILE=${CTL_FILE:-"rack.data"}
- HADOOP_CONF=${HADOOP_CONF:-"/home/myuser"}
- if [ ! -f ${HADOOP_CONF}/${CTL_FILE} ]; then
- echo -n "/$RACK_PREFIX/rack "
- exit 0
- fi
- while [ $# -gt 0 ] ; do
- nodeArg=$1
- exec< ${HADOOP_CONF}/${CTL_FILE}
- result=""
- while read line ; do
- ar=( $line )
- if [ "${ar[0]}" = "$nodeArg" ] ; then
- result="${ar[1]}"
- fi
- done
- shift
- if [ -z "$result" ] ; then
- echo -n "/$RACK_PREFIX/rack "
- else
- echo -n "/$RACK_PREFIX/rack_$result "
- fi
- done
- else
- echo -n "/$RACK_PREFIX/rack "
- fi
Set execute permissions
- sudo chmod 755 rack.sh
Create the data file (rack.data, in /home/myuser as referenced by the script above) with your rack information. Be careful to separate the host and the rack with a single space only.
- namenode_ip 1
- secondarynode_ip 2
- datanode1_ip 1
- datanode2_ip 2
The last step is to update core-site.xml file located in your hadoop directory.
- nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following property, pointing to where your rack.sh file is located.
- <property>
- <name>net.topology.script.file.name</name>
- <value>/home/myuser/rack.sh</value>
- </property>
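You can sanity check the script from a shell before restarting the cluster. Assuming the rack.data entries above, a lookup for datanode1_ip should print the rack path built from your prefix:
- bash /home/myuser/rack.sh datanode1_ip
- #Expected output: /myrackprefix/rack_1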
Apache PIG analyzes large data sets. There are a variety of ways of processing data using it. As I learn more about it I will put use cases below.
JSON:
- REGISTER 'hdfs:///elephant-bird-core-4.15.jar';
- REGISTER 'hdfs:///elephant-bird-hadoop-compat-4.15.jar';
- REGISTER 'hdfs:///elephant-bird-pig-4.15.jar';
- REGISTER 'hdfs:///json-simple-1.1.1.jar';
- loadedJson = LOAD '/hdfs_dir/MyFile.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
- rec = FOREACH loadedJson GENERATE json#'my_key' as (m:chararray);
- DESCRIBE rec;
- DUMP rec;
- --Store the results in a hdfs dir. You can have HIVE query that directory
- STORE rec INTO '/hdfs_dir' USING PigStorage();
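To try the above you can save the statements to a .pig file and run it against the cluster; myscript.pig is just a placeholder name:
- pig -x mapreduce myscript.pig
- #Or work interactively in the grunt shell
- pig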
If you use Hadoop and you want to run a MapReduce-style job using Python you can use mrjob.
Installation:
- pip install mrjob
Here is an example that runs just a mapper step and loads a JSON file; yield writes the data out.
- from mrjob.job import MRJob
- from mrjob.step import MRStep
- import json
- class MRTest(MRJob):
-     def steps(self):
-         return [
-             MRStep(mapper=self.mapper_test)
-         ]
-     def mapper_test(self, _, line):
-         result = {}
-         doc = json.loads(line)
-         #Use whatever value makes sense as the key; "my_key" is just an example field
-         key = doc.get('my_key')
-         yield key, result
- if __name__ == '__main__':
-     MRTest.run()
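To run the job you just invoke the script. mrjob runs locally by default, and the -r hadoop runner submits it to your cluster; the file and input names here are placeholders:
- #Quick local test
- python mr_test.py MyFile.json
- #Run against the Hadoop cluster
- python mr_test.py -r hadoop hdfs:///hdfs_dir/MyFile.json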
We can connect to Hadoop from Python using the pywebhdfs package. For the purposes of this post we will use version 0.4.1; you can see all of its APIs in the package documentation.
To build a connection to Hadoop you first need to import it.
- from pywebhdfs.webhdfs import PyWebHdfsClient
Then you build the connection like this.
- HDFS_CONNECTION = PyWebHdfsClient(host=##HOST##, port='50070', user_name=##USER##)
To list the contents of a directory you do this.
- HDFS_CONNECTION.list_dir(##HADOOP_DIR##)
Pulling a single file down from Hadoop is straightforward. Notice how we import “FileNotFound”; that is important when pulling a file in. You don’t strictly need it, but “read_file” will raise that exception if the file is not found, so by default we should always include this.
- from pywebhdfs.errors import FileNotFound
- try:
- file_data = HDFS_CONNECTION.read_file(##FILENAME##)
- except FileNotFound as e:
- print(e)
- except Exception as e:
- print(e)
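Under the hood pywebhdfs just calls the WebHDFS REST API, so if a call misbehaves you can hit the same endpoint with curl to rule out the Python side. Same host and port assumptions as above; you may also need to append &user.name=##USER## depending on your permissions setup:
- #List a directory
- curl -i "http://##HOST##:50070/webhdfs/v1/##HADOOP_DIR##?op=LISTSTATUS"
- #Read a file (follow the redirect to the datanode)
- curl -i -L "http://##HOST##:50070/webhdfs/v1/##FILENAME##?op=OPEN"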
We are going to install Hive over Hadoop and perform a basic query. Ensure you install Kerberos and Hadoop with Kerberos.
This assumes your hostname is “hadoop”
- wget http://apache.forsale.plus/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz
- tar -xzf apache-hive-2.3.3-bin.tar.gz
- sudo mv apache-hive-2.3.3-bin /usr/local/hive
- sudo chown -R root:hadoopuser /usr/local/hive/
- sudo nano ~/.bashrc
Add the following to the end of the file.
#HIVE VARIABLES START
export HIVE_HOME=/usr/local/hive
export HIVE_CONF_DIR=/usr/local/hive/conf
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:/usr/local/hive/lib/*
#HIVE VARIABLES STOP
- source ~/.bashrc
- kinit -kt /etc/security/keytabs/myuser.keytab myuser/hadoop@REALM.CA
- hdfs dfs -mkdir -p /user/hive/warehouse
- hdfs dfs -mkdir /tmp
- hdfs dfs -chmod g+w /tmp
- hdfs dfs -chmod g+w /user/hive/warehouse
- cd /etc/security/keytabs
- sudo kadmin.local
- addprinc -randkey hive/hadoop@REALM.CA
- addprinc -randkey hivemetastore/hadoop@REALM.CA
- addprinc -randkey hive-spnego/hadoop@REALM.CA
- xst -kt hive.service.keytab hive/hadoop@REALM.CA
- xst -kt hivemetastore.service.keytab hivemetastore/hadoop@REALM.CA
- xst -kt hive-spnego.service.keytab hive-spnego/hadoop@REALM.CA
- q
- sudo chown root:hadoopuser /etc/security/keytabs/*
- sudo chmod 750 /etc/security/keytabs/*
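You can quickly verify the keytabs were exported correctly before wiring them into Hive:
- klist -kt /etc/security/keytabs/hive.service.keytab
- kinit -kt /etc/security/keytabs/hive.service.keytab hive/hadoop@REALM.CA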
- cd $HIVE_HOME/conf
- sudo cp hive-env.sh.template hive-env.sh
- sudo nano /usr/local/hive/conf/hive-env.sh
- #locate "HADOOP_HOME" and change to be this
- export HADOOP_HOME=/usr/local/hadoop
- #locate "HIVE_CONF_DIR" and change to be this
- export HIVE_CONF_DIR=/usr/local/hive/conf
Check out the Hive configuration properties documentation for details on each of these settings.
- sudo cp /usr/local/hive/conf/hive-default.xml.template /usr/local/hive/conf/hive-site.xml
- sudo nano /usr/local/hive/conf/hive-site.xml
- #Modify the following properties
- <property>
- <name>system:user.name</name>
- <value>${user.name}</value>
- </property>
- <property>
- <name>javax.jdo.option.ConnectionURL</name>
- <value>jdbc:postgresql://myhost:5432/metastore</value>
- </property>
- <property>
- <name>javax.jdo.option.ConnectionDriverName</name>
- <value>org.postgresql.Driver</value>
- </property>
- <property>
- <name>hive.metastore.warehouse.dir</name>
- <value>/user/hive/warehouse</value>
- </property>
- <property>
- <name>javax.jdo.option.ConnectionUserName</name>
- <value>hiveuser</value>
- </property>
- <property>
- <name>javax.jdo.option.ConnectionPassword</name>
- <value>PASSWORD</value>
- </property>
- <property>
- <name>hive.exec.local.scratchdir</name>
- <value>/tmp/${system:user.name}</value>
- <description>Local scratch space for Hive jobs</description>
- </property>
- <property>
- <name>hive.querylog.location</name>
- <value>/tmp/${system:user.name}</value>
- <description>Location of Hive run time structured log file</description>
- </property>
- <property>
- <name>hive.downloaded.resources.dir</name>
- <value>/tmp/${hive.session.id}_resources</value>
- <description>Temporary local directory for added resources in the remote file system.</description>
- </property>
- <property>
- <name>hive.server2.logging.operation.log.location</name>
- <value>/tmp/${system:user.name}/operation_logs</value>
- <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
- </property>
- <property>
- <name>hive.metastore.uris</name>
- <value>thrift://0.0.0.0:9083</value>
- <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
- </property>
- <property>
- <name>hive.server2.webui.host</name>
- <value>0.0.0.0</value>
- </property>
- <property>
- <name>hive.server2.webui.port</name>
- <value>10002</value>
- </property>
- <property>
- <name>hive.metastore.port</name>
- <value>9083</value>
- </property>
- <property>
- <name>hive.server2.transport.mode</name>
- <value>binary</value>
- </property>
- <property>
- <name>hive.server2.thrift.sasl.qop</name>
- <value>auth-int</value>
- </property>
- <property>
- <name>hive.server2.authentication</name>
- <value>KERBEROS</value>
- <description>authenticationtype</description>
- </property>
- <property>
- <name>hive.server2.authentication.kerberos.principal</name>
- <value>hive/_HOST@REALM.CA</value>
- <description>HiveServer2 principal. If _HOST is used as the FQDN portion, it will be replaced with the actual hostname of the running instance.</description>
- </property>
- <property>
- <name>hive.server2.authentication.kerberos.keytab</name>
- <value>/etc/security/keytabs/hive.service.keytab</value>
- <description>Keytab file for HiveServer2 principal</description>
- </property>
- <property>
- <name>hive.metastore.sasl.enabled</name>
- <value>true</value>
- <description>If true, the metastore thrift interface will be secured with SASL. Clients
- must authenticate with Kerberos.</description>
- </property>
- <property>
- <name>hive.metastore.kerberos.keytab.file</name>
- <value>/etc/security/keytabs/hivemetastore.service.keytab</value>
- <description>The path to the Kerberos Keytab file containing the metastore thrift
- server's service principal.</description>
- </property>
- <property>
- <name>hive.metastore.kerberos.principal</name>
- <value>hivemetastore/_HOST@REALM.CA</value>
- <description>The service principal for the metastore thrift server. The special string _HOST will be replaced automatically with the correct host name.</description>
- </property>
- <property>
- <name>hive.security.authorization.enabled</name>
- <value>true</value>
- <description>enable or disable the hive client authorization</description>
- </property>
- <property>
- <name>hive.metastore.pre.event.listeners</name>
- <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
- <description>List of comma separated listeners for metastore events.</description>
- </property>
- <property>
- <name>hive.security.metastore.authorization.manager</name>
- <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
- <description>
- Names of authorization manager classes (comma separated) to be used in the metastore
- for authorization. The user defined authorization class should implement interface
- org.apache.hadoop.hive.ql.security.authorization.HiveMetastoreAuthorizationProvider.
- All authorization manager classes have to successfully authorize the metastore API
- call for the command execution to be allowed.
- </description>
- </property>
- <property>
- <name>hive.security.metastore.authenticator.manager</name>
- <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
- <description>
- authenticator manager class name to be used in the metastore for authentication.
- The user defined authenticator should implement interface org.apache.hadoop.hive.ql.security.HiveAuthenticationProvider.
- </description>
- </property>
- <property>
- <name>hive.security.metastore.authorization.auth.reads</name>
- <value>true</value>
- <description>If this is true, metastore authorizer authorizes read actions on database, table</description>
- </property>
- <property>
- <name>datanucleus.autoCreateSchema</name>
- <value>false</value>
- </property>
Notice here how it is the “hive” user that appears in the property names (hadoop.proxyuser.hive.*) used with the storage-based authorization setup.
- sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
- <property>
- <name>hadoop.proxyuser.hive.hosts</name>
- <value>*</value>
- </property>
- <property>
- <name>hadoop.proxyuser.hive.groups</name>
- <value>*</value>
- </property>
Follow a standard installation guide for installing PostgreSQL 9.6, then create the metastore database and user:
- sudo su - postgres
- psql
- CREATE USER hiveuser WITH PASSWORD 'PASSWORD';
- CREATE DATABASE metastore;
- GRANT ALL PRIVILEGES ON DATABASE metastore TO hiveuser;
- \q
- exit
- schematool -dbType postgres -initSchema
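You can confirm the metastore schema was created against PostgreSQL before starting the services; schematool's -info option prints the schema version it finds:
- schematool -dbType postgres -info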
Create the Hive log directory first, then start the metastore and HiveServer2:
- sudo mkdir /var/log/hive/
- sudo chown root:hduser /var/log/hive
- sudo chmod 777 /var/log/hive
- nohup /usr/local/hive/bin/hive --service metastore --hiveconf hive.log.file=hivemetastore.log >/var/log/hive/hivemetastore.out 2>/var/log/hive/hivemetastoreerr.log &
- nohup /usr/local/hive/bin/hiveserver2 --hiveconf hive.metastore.uris=" " --hiveconf hive.log.file=hiveserver2.log >/var/log/hive/hiveserver2.out 2> /var/log/hive/hiveserver2err.log &
- crontab -e
- #Add the following
- @reboot nohup /usr/local/hive/bin/hive --service metastore --hiveconf hive.log.file=hivemetastore.log >/var/log/hive/hivemetastore.out 2>/var/log/hive/hivemetastoreerr.log &
- @reboot nohup /usr/local/hive/bin/hiveserver2 --hiveconf hive.metastore.uris=" " --hiveconf hive.log.file=hiveserver2.log >/var/log/hive/hiveserver2.out 2> /var/log/hive/hiveserver2err.log &
Now you can check the hive version
- hive --version
- #We first need to have a ticket to access beeline using the hive kerberos user we setup earlier.
- kinit -kt /etc/security/keytabs/hive.service.keytab hive/hadoop@REALM.CA
- #Now we can get into beeline using that principal
- beeline -u "jdbc:hive2://0.0.0.0:10000/default;principal=hive/hadoop@REALM.CA;"
- #You can also just get into beeline then connect from there
- beeline
- beeline>!connect jdbc:hive2://0.0.0.0:10000/default;principal=hive/hadoop@REALM.CA
- #Disconnect from beeline
- !q
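As a basic smoke test once connected, any simple HiveQL will do; for example (the table name here is arbitrary):
- beeline -u "jdbc:hive2://0.0.0.0:10000/default;principal=hive/hadoop@REALM.CA;" -e "SHOW DATABASES;"
- beeline -u "jdbc:hive2://0.0.0.0:10000/default;principal=hive/hadoop@REALM.CA;" -e "CREATE TABLE IF NOT EXISTS test (id INT, name STRING); SELECT COUNT(*) FROM test;"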
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Hive_Install_On_Ubuntu_16_04.php
https://maprdocs.mapr.com/home/Hive/Config-RemotePostgreSQLForHiveMetastore.html
https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool#HiveSchemaTool-TheHiveSchemaTool
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration
I will attempt to explain how to set up a Mapper, Reducer, Combiner, PathFilter, Partitioner and OutputFormat using Java in Eclipse with Maven. If you need to know how to install Eclipse there are plenty of guides for that. Remember that these are not complete code, just snippets to get you going.
A starting point I used was an existing tutorial; however, it was built using older Hadoop code.
Mapper: Maps input key/value pairs to a set of intermediate key/value pairs.
Reducer: Reduces a set of intermediate values which share a key to a smaller set of values.
Partitioner: http://www.tutorialspoint.com/map_reduce/map_reduce_partitioner.htm
Combiner: http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm
First you will need to create a maven project. You can follow any tutorial on how to do that if you don’t know how.
- <properties>
- <hadoop.version>2.7.2</hadoop.version>
- </properties>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-hdfs</artifactId>
- <version>${hadoop.version}</version>
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-common</artifactId>
- <version>${hadoop.version}</version>
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-client</artifactId>
- <version>${hadoop.version}</version>
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-mapreduce-client-core</artifactId>
- <version>${hadoop.version}</version>
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-yarn-api</artifactId>
- <version>${hadoop.version}</version>
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-yarn-common</artifactId>
- <version>${hadoop.version}</version>
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-auth</artifactId>
- <version>${hadoop.version}</version>
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-yarn-server-nodemanager</artifactId>
- <version>${hadoop.version}</version>
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
- <version>${hadoop.version}</version>
- </dependency>
- public class JobDriver extends Configured implements Tool {
- private Configuration conf;
- private static String hdfsURI = "hdfs://localhost:54310";
- public static void main(String[] args) throws Exception {
- int res = ToolRunner.run(new Configuration(), new JobDriver(), args);
- System.exit(res);
- }
- @Override
- public int run(String[] args) throws Exception {
- BasicConfigurator.configure();
- conf = this.getConf();
- //The paths for the configuration
- final String HADOOP_HOME = System.getenv("HADOOP_HOME");
- conf.addResource(new Path(HADOOP_HOME, "etc/hadoop/core-site.xml"));
- conf.addResource(new Path(HADOOP_HOME, "etc/hadoop/hdfs-site.xml"));
- conf.addResource(new Path(HADOOP_HOME, "etc/hadoop/yarn-site.xml"));
- hdfsURI = conf.get("fs.defaultFS");
- Job job = Job.getInstance(conf, YOURJOBNAME);
- //You can setup additional configuration information by doing the below.
- job.getConfiguration().set("NAME", "VALUE");
- job.setJarByClass(JobDriver.class);
- //If you are going to use a mapper class
- job.setMapperClass(MAPPERCLASS.class);
- //If you are going to use a combiner class
- job.setCombinerClass(COMBINERCLASS.class);
- //If you plan on splitting the output
- job.setPartitionerClass(PARTITIONERCLASS.class);
- job.setNumReduceTasks(NUMOFREDUCERS);
- //if you plan on use a reducer
- job.setReducerClass(REDUCERCLASS.class);
- //You need to set the output key and value types. We will just use Text for this example
- job.setOutputKeyClass(Text.class);
- job.setOutputValueClass(Text.class);
- //If you want to use an input filter class
- FileInputFormat.setInputPathFilter(job, INPUTPATHFILTER.class);
- //You must setup what the input path is for the files you want to parse. It takes either string or Path
- FileInputFormat.setInputPaths(job, inputPaths);
- //Once you parse the data you must put it somewhere.
- job.setOutputFormatClass(OUTPUTFORMATCLASS.class);
- FileOutputFormat.setOutputPath(job, new Path(OUTPUTPATH));
- return job.waitForCompletion(true) ? 0 : 1;
- }
- }
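Once packaged (for example with mvn package) you would submit the driver with hadoop jar. The jar and class names below are placeholders for whatever your project actually produces, and any extra arguments depend on how you read them inside run():
- mvn package
- hadoop jar target/myjob-1.0.jar com.example.JobDriver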
- public class InputPathFilter extends Configured implements PathFilter {
- Configuration conf;
- FileSystem fs;
- Pattern includePattern = null;
- Pattern excludePattern = null;
- @Override
- public void setConf(Configuration conf) {
- this.conf = conf;
- if (conf != null) {
- try {
- fs = FileSystem.get(conf);
- //If you want you can always pass in regex patterns from the job driver class and filter that way. Up to you!
- if (conf.get("file.includePattern") != null)
- includePattern = conf.getPattern("file.includePattern", null);
- if (conf.get("file.excludePattern") != null)
- excludePattern = conf.getPattern("file.excludePattern", null);
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- }
- @Override
- public boolean accept(Path path) {
- //Here you could filter based on your include or exclude regex or file size.
- //Remember if you have sub directories you have to return true for those
- try {
- if (fs.isDirectory(path)) {
- return true;
- }
- //You can also do this to get the file size in case you want to do anything when files are a certain size, etc
- FileStatus file = fs.getFileStatus(path);
- String size = FileUtils.byteCountToDisplaySize(file.getLen());
- //You can also move files in this section
- boolean move_success = fs.rename(path, new Path(NEWPATH + path.getName()));
- //Make sure you return a boolean: true to accept the file, false to skip it
- return true;
- } catch (IOException e) {
- e.printStackTrace();
- return false;
- }
- }
- }
- //Remember at the beginning I said we will use key and value as Text. That is the second part of the extends mapper
- public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
- //Do whatever setup you would like. Remember in the job drive you could set things to configuration well you can access them here now
- @Override
- protected void setup(Context context) throws IOException, InterruptedException {
- super.setup(context);
- Configuration conf = context.getConfiguration();
- }
- //This is the main map method.
- @Override
- public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
- //This will get the file name you are currently processing if you want. However not necessary.
- String filename = ((FileSplit) context.getInputSplit()).getPath().toString();
- //Do whatever you want in the mapper. The context is what you print out to.
- //If you want to embed JavaScript see http://www.gaudreault.ca/java-embed-javascript/
- //If you want to embed Python see http://www.gaudreault.ca/java-embed-python/
- }
- }
If you decide to embed Python or JavaScript you will need these scripts as an example: map_python and map.
- public class MyCombiner extends Reducer<Text, Text, Text, Text> {
- //Do whatever setup you would like. Remember in the job drive you could set things to configuration well you can access them here now
- @Override
- protected void setup(Context context) throws IOException, InterruptedException {
- super.setup(context);
- Configuration conf = context.getConfiguration();
- }
- @Override
- protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
- //Do whatever you want in the mapper. The context is what you print out to.
- //If you want to embed JavaScript see http://www.gaudreault.ca/java-embed-javascript/
- //If you want to embed Python see http://www.gaudreault.ca/java-embed-python/
- }
- }
If you decide to embed Python or JavaScript you will need these scripts as an example: combiner_python and combiner_js.
- public class MyReducer extends Reducer<Text, Text, Text, Text> {
- //Do whatever setup you would like. Remember in the job drive you could set things to configuration well you can access them here now
- @Override
- protected void setup(Context context) throws IOException, InterruptedException {
- super.setup(context);
- Configuration conf = context.getConfiguration();
- }
- @Override
- protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
- //Do whatever you want in the mapper. The context is what you print out to.
- //If you want to embed JavaScript see http://www.gaudreault.ca/java-embed-javascript/
- //If you want to embed Python see http://www.gaudreault.ca/java-embed-python/
- }
- }
If you decide to embed Python or JavaScript you will need these scripts as an example: reduce_python and reduce_js.
- public class MyPartitioner extends Partitioner<Text, Text> implements Configurable
- {
- private Configuration conf;
- @Override
- public Configuration getConf() {
- return conf;
- }
- //Do whatever setup you would like. Remember in the job drive you could set things to configuration well you can access them here now
- @Override
- public void setConf(Configuration conf) {
- this.conf = conf;
- }
- @Override
- public int getPartition(Text key, Text value, int numReduceTasks)
- {
- Integer partitionNum = 0;
- //Do whatever logic you would like to figure out the way you want to partition.
- //If you want to embed JavaScript see http://www.gaudreault.ca/java-embed-javascript/
- //If you want to embed Python see http://www.gaudreault.ca/java-embed-python/
- return partitionNum;
- }
- }
If you decide to embed Python or JavaScript you will need these scripts as an example: partitioner_python and partitioner_js.
- public class MyOutputFormat<K, V> extends FileOutputFormat<K, V> {
- protected static int outputCount = 0;
- protected static class JsonRecordWriter<K, V> extends RecordWriter<K, V> {
- protected DataOutputStream out;
- public JsonRecordWriter(DataOutputStream out) throws IOException {
- this.out = out;
- }
- @Override
- public void close(TaskAttemptContext arg0) throws IOException, InterruptedException {
- out.writeBytes(WRITE_WHATEVER_YOU_WANT);
- out.close();
- }
- @Override
- public void write(K key, V value) throws IOException, InterruptedException {
- //write the value
- //You could also send to a database here if you wanted. Up to you how you want to deal with it.
- }
- }
- @Override
- public RecordWriter<K, V> getRecordWriter(TaskAttemptContext tac) throws IOException, InterruptedException {
- Configuration conf = tac.getConfiguration();
- Integer numReducers = conf.getInt("mapred.reduce.tasks", 0);
- //you can set output filename in the config from the job driver if you want
- String outputFileName = conf.get("outputFileName");
- outputCount++;
- //If you used a partitioner you need to split out the content so you should break the output filename into parts
- if (numReducers > 1) {
- //Do whatever logic you need to in order to get unique filenames per split
- }
- Path file = FileOutputFormat.getOutputPath(tac);
- Path fullPath = new Path(file, outputFileName);
- FileSystem fs = file.getFileSystem(conf);
- FSDataOutputStream fileout = fs.create(fullPath);
- return new JsonRecordWriter<K, V>(fileout);
- }
- }
Below is a list of all the commands I have had to use while working with Hadoop. If you have any other ones that are not listed here please feel free to add them in or if you have updates to ones below.
Move Files:
- hadoop fs -mv /OLD_DIR/* /NEW_DIR/
Sort files by size. Note this is for viewing information only on the terminal. It has no effect on the files or the way they are displayed via the web UI:
- hdfs fsck /logs/ -files | grep "/FILE_DIR/" | grep -v "<dir>" | gawk '{print $2, $1;}' | sort -n
Display system information:
- hdfs fsck /FILE_dir/ -files
Remove folder with all files in it:
- hadoop fs -rm -R hdfs:///DIR_TO_REMOVE
Make folder:
- hadoop fs -mkdir hdfs:///NEW_DIR
Remove one file:
- hadoop fs -rm hdfs:///DIR/FILENAME.EXTENSION
Copy all files from a directory outside of HDFS to HDFS:
- hadoop fs -copyFromLocal LOCAL_DIR hdfs:///DIR
Copy files from HDFS to local directory:
- hadoop dfs -copyToLocal hdfs:///DIR/REGPATTERN LOCAL_DIR
Kill a running MR job:
- hadoop job -kill job_1461090210469_0003
You could also do that via the 8088 web ui interface
Kill yarn application:
- yarn application -kill application_1461778722971_0001
Check status of DATANODES. Check “Under Replicated blocks” field. If you have any you should probably rebalance:
- hadoop dfsadmin -report
Number of files in HDFS directory:
- hadoop fs -count -q hdfs:///DIR
-q is optional; it gives the columns QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME
Rename directory:
- hadoop fs -mv hdfs:///OLD_NAME hdfs:///NEW_NAME
Change replication factor on files:
- hadoop fs -setrep -R 3 hdfs:///DIR
3 is the replication number.
You can choose a file if you want
Get yarn log. You can also view via web ui 8088:
- yarn logs -applicationId application_1462141864581_0016
Refresh Nodes:
- hadoop dfsadmin -refreshNodes
Report of blocks and their locations:
- hadoop fsck / -files -blocks -locations
Find out where a particular file is located with blocks:
- hadoop fsck /DIR/FILENAME -files -locations -blocks
Fix under replicated blocks. First command gets the blocks that are under replicated. The second sets replication to 2 for those files. You might have to restart the dfs to see a change from dfsadmin -report:
- hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
- for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 2 $hdfsfile; done
Show all the classpaths associated to hadoop:
- hadoop classpath
DataNode:
Use rsync from one of the other datanodes you previously set up. Ensure you change any datanode-specific settings you configured during installation.
- hadoop-daemon.sh start datanode
- start-yarn.sh
NameNode:
- nano /usr/local/hadoop/etc/hadoop/slaves
Add the new slave hostname
- hadoop dfsadmin -refreshNodes
Refreshes all the nodes you have without doing a full restart
When you add a new datanode no data will exist so you can rebalance the cluster to what makes sense in your environment.
- hdfs balancer -threshold 1 -include ALL_DATA_NODES_HOSTNAMES_SEPARATED_BY_COMMA
I have been working with Hadoop 2.9.1 for over a year and have learned much on the installation of Hadoop in a multi node cluster environment. I would like to share what I have learned and applied in the hopes that it will help someone else configure their system. The deployment I have done is to have a Name Node and 1-* DataNodes on Ubuntu 16.04 assuming 5 cpu and 13GB RAM. I will put all commands used in this tutorial right down to the very basics for those that are new to Ubuntu.
NOTE: Sometimes you may have to use “sudo” in front of the command. I also use nano for this article for beginners but you can use any editor you prefer (ie: vi). Also this article does not take into consideration any SSL, kerberos, etc. For all purposes here Hadoop will be open without having to login, etc.
Additional Setup/Configurations to Consider:
Zookeeper: It is also a good idea to use ZooKeeper to synchronize your configuration
Secondary NameNode: This should be done on a separate server, and its function is to take checkpoints of the NameNode's file system.
Rack Awareness: Fault tolerance to ensure blocks are placed as evenly as possible on different racks if they are available.
Apply the following to all NameNode and DataNodes unless otherwise directed:
Hadoop User:
For this example we will just use hduser as our group and user for simplicity's sake.
The “-a” on usermod appends the user to the group(s) given with -G.
- addgroup hduser
- sudo adduser --ingroup hduser hduser
- sudo gpasswd -a $USER sudo
- usermod -a -G sudo hduser
Install JDK:
- apt-get update
- apt-get upgrade
- apt-get install default-jdk
Install SSH:
- apt-get install ssh
- which ssh
- which sshd
These two commands will check that ssh installed correctly and will return “/usr/bin/ssh” and “/usr/bin/sshd”
- java -version
You use this to verify that java installed correctly and will return something like the following.
openjdk version “1.8.0_171”
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
System Configuration
- nano ~/.bashrc
The .bashrc is a script that is executed when a terminal session is started.
Add the following line to the end and save; the preferIPv4Stack option is there because Hadoop uses IPv4.
export _JAVA_OPTIONS='-XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true'
- source ~/.bashrc
sysctl.conf
Disable ipv6 as it causes issues in getting your server up and running.
- nano /etc/sysctl.conf
Add the following to the end and save
- net.ipv6.conf.all.disable_ipv6 = 1
- net.ipv6.conf.default.disable_ipv6 = 1
- net.ipv6.conf.lo.disable_ipv6 = 1
- #Change eth0 to what ifconfig has
- net.ipv6.conf.eth0.disable_ipv6 = 1
Save and close the file, then reload sysctl:
- sysctl -p
- cat /proc/sys/net/ipv6/conf/all/disable_ipv6
- reboot
If all of the above IPv6-disabling configuration was successful, the cat command should return “1”.
Sometimes you can hit the open file descriptor and open file limits. If you do encounter this issue you might have to raise the ulimit and descriptor limit. For this example I have set some values, but you will have to figure out the best numbers for your specific case.
If you get “cannot stat /proc/sys/-p: No such file or directory”. Then you need to add /sbin/ to PATH.
- sudo nano ~/.bashrc
- export PATH=$PATH:/sbin/
- nano /etc/sysctl.conf
fs.file-max = 500000
- sysctl -p
limits.conf
- nano /etc/security/limits.conf
* soft nofile 60000
* hard nofile 60000
- reboot
Test Limits
You can now test the limits you applied to make sure they took.
- ulimit -a
- more /proc/sys/fs/file-max
- more /proc/sys/fs/file-nr
- lsof | wc -l
file-max: Current open file descriptor limit
file-nr: How many file descriptors are currently being used
lsof wc: How many files are currently open
You might be wondering why we installed ssh at the beginning. That is because Hadoop uses ssh to access its nodes. We need to eliminate the password requirement by setting up ssh certificates. If asked for a filename just leave it blank and confirm with enter.
- su hduser
If not already logged in as the user we created in the Hadoop user section.
- ssh-keygen -t rsa -P ""
You will get the below example as well as the fingerprint and randomart image.
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory ‘/home/hduser/.ssh’.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
- cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
If you get “No such file or directory”, check the .ssh directory for the exact public key filename and use that.
This will add the newly created key to the list of authorized keys so that Hadoop can use SSH without prompting for a password.
Now we check that it worked by running “ssh localhost”. When asked if you want to continue connecting, type “yes” and press Enter; localhost will then be permanently added to the list of known hosts.
Once we have done this on the Name Node and all Data Nodes, run the following commands from the Name Node against each Data Node.
- ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@DATANODEHOSTNAME
- ssh DATANODEHOSTNAME
We need to update the hosts file.
- sudo nano /etc/hosts
- #Comment out line "127.0.0.1 localhost"
- 127.0.0.1 HOSTNAME localhost
Hadoop Installation:
NAMENODE: You will see this in the config files below and it can be the hostname, the static ip or it could be 0.0.0.0 so that all TCP ports will be bound to all IP’s of the server. You should also note that the masters and slaves file later on in this tutorial can still be the hostname.
Note: You could run rsync after setting up the Name Node Initial configuration to each Data Node if you want. This would save initial hadoop setup time. You do that by running the following command:
- rsync -a /usr/local/hadoop/ hduser@DATANODEHOSTNAME:/usr/local/hadoop/
Download & Extract:
- wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.9.1/hadoop-2.9.1.tar.gz
- tar xvzf hadoop-2.9.1.tar.gz
- sudo mv hadoop-2.9.1/ /usr/local/hadoop
- chown -R hduser:hduser /usr/local/hadoop
- update-alternatives --config java
Basically the above downloads and extracts Hadoop, moves the extracted directory to /usr/local/hadoop, switches ownership to hduser if it doesn’t already own the newly created directory, and tells us the path where Java has been installed so we can set the JAVA_HOME environment variable. The last command should return something like the following:
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
- nano ~/.bashrc
Add the following to the end of the file. Make sure to do this on Name Node and all Data Nodes:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_HOME=$HADOOP_INSTALL
#HADOOP VARIABLES END
- source ~/.bashrc
- javac -version
- which javac
- readlink -f /usr/bin/javac
This basically validates that the bashrc update worked!
javac -version should return “javac 1.8.0_171” or something similar
which javac should return “/usr/bin/javac”
readlink should return “/usr/lib/jvm/java-8-openjdk-amd64/bin/javac”
Memory Tools
There is an application from HortonWorks you can download which can help get you started on how you should setup memory utilization for yarn. I found it’s a great starting point but you need to tweak it to work for what you need on your specific case.
- wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
- tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
- cd hdp_manual_install_rpm_helper_files-2.6.0.3.8/
- sudo apt-get install python2.7
- python2.7 scripts/yarn-utils.py -c 5 -m 13 -d 1 -k False
-c is for how many cores you have
-m is for how much memory you have
-d is for how many disks you have
-k is whether you are running HBASE: True if you are, False if you are not.
After the script is run it will give you guidelines on the yarn/mapreduce settings used below. Remember they are guidelines; tweak as needed.
Now the real fun begins!!! Remember that these settings are what worked for me and you may need to adjust them.
- nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
You will see JAVA_HOME near the beginning of the file; you will need to change that to where Java is installed on your system.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,DRFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,RFAAUDIT} $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS=$HADOOP_NAMENODE_OPTS
export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"
- mkdir -p /app/hadoop/tmp
This is the temp directory hadoop uses
- chown hduser:hduser /app/hadoop/tmp
See the official documentation for details on the core-site.xml properties.
- nano /usr/local/hadoop/etc/hadoop/core-site.xml
This file contains configuration properties that Hadoop uses when starting up. The defaults will need to be changed to the following.
- <configuration>
- <property>
- <name>fs.defaultFS</name>
- <value>hdfs://NAMENODE:54310</value>
- <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming
- the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
- </property>
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/app/hadoop/tmp</value>
- </property>
- <property>
- <name>hadoop.proxyuser.hduser.hosts</name>
- <value>*</value>
- </property>
- <property>
- <name>hadoop.proxyuser.hduser.groups</name>
- <value>*</value>
- </property>
- </configuration>
See the official documentation for details on the yarn-site.xml properties.
- nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
- <configuration>
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- </property>
- <property>
- <name>yarn.resourcemanager.scheduler.class</name>
- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
- <value>org.apache.hadoop.mapred.ShuffleHandler</value>
- </property>
- <property>
- <name>yarn.nodemanager.resource.memory-mb</name>
- <value>12288</value>
- <final>true</final>
- </property>
- <property>
- <name>yarn.scheduler.minimum-allocation-mb</name>
- <value>4096</value>
- <final>true</final>
- </property>
- <property>
- <name>yarn.scheduler.maximum-allocation-mb</name>
- <value>12288</value>
- <final>true</final>
- </property>
- <property>
- <name>yarn.app.mapreduce.am.resource.mb</name>
- <value>4096</value>
- </property>
- <property>
- <name>yarn.app.mapreduce.am.command-opts</name>
- <value>-Xmx3276m</value>
- </property>
- <property>
- <name>yarn.nodemanager.local-dirs</name>
- <value>/app/hadoop/tmp/nm-local-dir</value>
- </property>
- <!--LOG-->
- <property>
- <name>yarn.log-aggregation-enable</name>
- <value>true</value>
- </property>
- <property>
- <description>Where to aggregate logs to.</description>
- <name>yarn.nodemanager.remote-app-log-dir</name>
- <value>/tmp/yarn/logs</value>
- </property>
- <property>
- <name>yarn.log-aggregation.retain-seconds</name>
- <value>604800</value>
- </property>
- <property>
- <name>yarn.log-aggregation.retain-check-interval-seconds</name>
- <value>86400</value>
- </property>
- <property>
- <name>yarn.log.server.url</name>
- <value>http://NAMENODE:19888/jobhistory/logs/</value>
- </property>
- <!--URLs-->
- <property>
- <name>yarn.resourcemanager.resource-tracker.address</name>
- <value>NAMENODE:8025</value>
- </property>
- <property>
- <name>yarn.resourcemanager.scheduler.address</name>
- <value>NAMENODE:8030</value>
- </property>
- <property>
- <name>yarn.resourcemanager.address</name>
- <value>NAMENODE:8050</value>
- </property>
- <property>
- <name>yarn.resourcemanager.admin.address</name>
- <value>NAMENODE:8033</value>
- </property>
- <property>
- <name>yarn.resourcemanager.webapp.address</name>
- <value>NAMENODE:8088</value>
- </property>
- </configuration>
By default, the /usr/local/hadoop/etc/hadoop/ folder contains a mapred-site.xml.template file, which has to be copied/renamed to mapred-site.xml. The defaults will need to be changed to the following; see the official documentation for details on these properties.
- cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
- nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
- <configuration>
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- </property>
- <property>
- <name>mapreduce.jobhistory.address</name>
- <value>NAMENODE:10020</value>
- </property>
- <property>
- <name>mapreduce.jobhistory.webapp.address</name>
- <value>NAMENODE:19888</value>
- </property>
- <property>
- <name>mapreduce.jobtracker.address</name>
- <value>NAMENODE:54311</value>
- </property>
- <!-- Memory and concurrency tuning -->
- <property>
- <name>mapreduce.map.memory.mb</name>
- <value>4096</value>
- </property>
- <property>
- <name>mapreduce.map.java.opts</name>
- <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
- </property>
- <property>
- <name>mapreduce.reduce.memory.mb</name>
- <value>4096</value>
- </property>
- <property>
- <name>mapreduce.reduce.java.opts</name>
- <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
- </property>
- <property>
- <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
- <value>0.5</value>
- </property>
- <property>
- <name>mapreduce.task.io.sort.mb</name>
- <value>600</value>
- </property>
- <property>
- <name>mapreduce.task.io.sort.factor</name>
- <value>1638</value>
- </property>
- <property>
- <name>mapreduce.map.sort.spill.percent</name>
- <value>0.50</value>
- </property>
- <property>
- <name>mapreduce.map.speculative</name>
- <value>false</value>
- </property>
- <property>
- <name>mapreduce.reduce.speculative</name>
- <value>false</value>
- </property>
- <property>
- <name>mapreduce.task.timeout</name>
- <value>1800000</value>
- </property>
- </configuration>
- nano /usr/local/hadoop/etc/hadoop/yarn-env.sh
Change, uncomment, or add the following:
JAVA_HEAP_MAX=-Xmx2000m
YARN_OPTS="$YARN_OPTS -server -Dhadoop.log.dir=$YARN_LOG_DIR"
YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"
Add the namenode hostname.
- nano /usr/local/hadoop/etc/hadoop/masters
APPLY THE FOLLOWING TO THE NAMENODE ONLY
Add namenode hostname and all datanodes hostname.
- nano /usr/local/hadoop/etc/hadoop/slaves
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured on each host in the cluster, and the defaults will need to be changed to the following; see the official documentation for details on these properties. Before editing this file, we need to create the namenode directory.
- mkdir -p /usr/local/hadoop_store/data/namenode
- chown -R hduser:hduser /usr/local/hadoop_store
- nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
- <configuration>
- <property>
- <name>dfs.replication</name>
- <value>3</value>
- <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
- </property>
- <property>
- <name>dfs.permissions</name>
- <value>false</value>
- </property>
- <property>
- <name>dfs.namenode.name.dir</name>
- <value>file:/usr/local/hadoop_store/data/namenode</value>
- </property>
- <property>
- <name>dfs.datanode.use.datanode.hostname</name>
- <value>false</value>
- </property>
- <property>
- <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
- <value>false</value>
- </property>
- <property>
- <name>dfs.namenode.http-address</name>
- <value>NAMENODE:50070</value>
- <description>Your NameNode hostname for http access.</description>
- </property>
- <property>
- <name>dfs.namenode.secondary.http-address</name>
- <value>SECONDARYNAMENODE:50090</value>
- <description>Your Secondary NameNode hostname for http access.</description>
- </property>
- <property>
- <name>dfs.blocksize</name>
- <value>128m</value>
- </property>
- <property>
- <name>dfs.namenode.http-bind-host</name>
- <value>0.0.0.0</value>
- </property>
- <property>
- <name>dfs.namenode.rpc-bind-host</name>
- <value>0.0.0.0</value>
- </property>
- <property>
- <name>dfs.namenode.servicerpc-bind-host</name>
- <value>0.0.0.0</value>
- </property>
- </configuration>
APPLY THE FOLLOWING TO THE DATANODE(s) ONLY
Add only that datanodes hostname.
- nano /usr/local/hadoop/etc/hadoop/slaves
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured on each host in the cluster, and the defaults will need to be changed to the following. Before editing this file, we need to create the datanode directory.
- mkdir -p /usr/local/hadoop_store/data/datanode
- chown -R hduser:hduser /usr/local/hadoop_store
- nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
- <configuration>
- <property>
- <name>dfs.replication</name>
- <value>3</value>
- <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
- </property>
- <property>
- <name>dfs.permissions</name>
- <value>false</value>
- </property>
- <property>
- <name>dfs.datanode.data.dir</name>
- <value>file:/usr/local/hadoop_store/data/datanode</value>
- </property>
- <property>
- <name>dfs.datanode.use.datanode.hostname</name>
- <value>false</value>
- </property>
- <property>
- <name>dfs.namenode.http-address</name>
- <value>NAMENODE:50070</value>
- <description>Your NameNode hostname for http access.</description>
- </property>
- <property>
- <name>dfs.namenode.secondary.http-address</name>
- <value>SECONDARYNAMENODE:50090</value>
- <description>Your Secondary NameNode hostname for http access.</description>
- </property>
- <property>
- <name>dfs.datanode.http.address</name>
- <value>DATANODE:50075</value>
- </property>
- <property>
- <name>dfs.blocksize</name>
- <value>128m</value>
- </property>
- </configuration>
If you have the Ubuntu firewall on, you need to allow the pass-through for all necessary ports.
- sudo ufw allow 50070
- sudo ufw allow 8088
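Depending on which of the ports configured earlier you need to reach from outside the cluster, you may also need to open the rest of them. Based on the configuration in this tutorial that would look something like the following; adjust to the ports you actually use:
- sudo ufw allow 54310
- sudo ufw allow 50075
- sudo ufw allow 50090
- sudo ufw allow 8025
- sudo ufw allow 8030
- sudo ufw allow 8050
- sudo ufw allow 8033
- sudo ufw allow 10020
- sudo ufw allow 19888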
Format Cluster:
Only do this if NO data is present. All data will be destroyed when the following is done.
This is to be done on NAMENODE ONLY!
- hdfs namenode -format
Start The Cluster:
You can now start the cluster.
You do this from the NAMENODE ONLY.
- start-dfs.sh
- start-yarn.sh
- mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
If the above three commands didn’t work, something went wrong; they should have been found through the /usr/local/hadoop/sbin/ directory that was added to the PATH earlier.
Cron Job:
You should probably setup a cron job to start the cluster when you reboot.
- crontab -e
@reboot /usr/local/hadoop/sbin/start-dfs.sh > /home/hduser/dfs-start.log 2>&1
@reboot /usr/local/hadoop/sbin/start-yarn.sh > /home/hduser/yarn-start.log 2>&1
@reboot /usr/local/hadoop/sbin/mr-jobhistory-daemon.sh --config /usr/local/hadoop/etc/hadoop start historyserver > /home/hduser/history-start.log 2>&1
Note that cron does not load your .bashrc, so the full configuration path is used instead of $HADOOP_CONF_DIR.
Verification:
To check that everything is working as it should run “jps” on the NAMENODE. It should return something like the following where the pid will be different:
- jps
You could also run “netstat -plten | grep java” or “lsof -i :50070” and “lsof -i :8088”.
Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
2596 SecondaryNameNode
3693 Jps
1293 JobHistoryServer
1317 ResourceManager
1840 NameNode
1743 NodeManager
2351 DataNode
You can check the DATA NODES by ssh into each one and running “jps”. It should return something like the following where the pid will be different:
Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
3218 Jps
2215 NodeManager
2411 DataNode
If for any reason one of the services is not running, you need to review the logs. They can be found at /usr/local/hadoop/logs/. If it’s the ResourceManager that isn’t running, look at the file that has “yarn” and “resourcemanager” in its name.
WARNING:
Never reboot the system without first stopping the cluster; once the cluster has shut down it is safe to reboot. Also, if you configured an @reboot cron job, make sure the DATANODES are up and running before starting the NAMENODE so that it automatically starts the DATANODE services for you.
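For reference, stopping the cluster is the reverse of starting it and is run from the NAMENODE:
- mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR stop historyserver
- stop-yarn.sh
- stop-dfs.sh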
To check that all the Hadoop ports are available on which IP run the following.
- sudo netstat -ltnp
If for some reason you are having issues connecting to a Hadoop port then run the following command as you try and connect via the port.
- sudo tcpdump -n -tttt -i eth1 port 50070
I used a lot of different resources and reference material for this; however, I did not save all the relevant links. There were various blog posts about memory utilization and related topics.