PySpark: Save a DataFrame To ADLS

This post is how to save a DataFrame to ADLS.

First we need a Spark session. See PySpark: Create a Spark Session for details on that.

Then we need to create a DataFrame. See PySpark: Create a DataFrame.

Then we do the following:

Note that you don’t need all of the options below; this is just an example. A minimal version is shown after the listing.

  1. path = 'abfss://my_container@my_storage_account.dfs.core.windows.net/my_folder/'
  2. mode = 'overwrite'
  3. format = 'delta'
  4. partitions = []
  5.  
  6. df.write.mode(mode).format(format).option('mergeSchema', False).partitionBy(*partitions).save(path)
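If you are happy with the defaults, a minimal version (a sketch that assumes the Delta Lake libraries are available on your cluster) is just:

  # Minimal write: overwrite the target ADLS folder with the DataFrame in Delta format
  df.write.format('delta').mode('overwrite').save(path)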

PySpark: Create a Spark Session

This post is how to create a Spark Session

Imports

  1. from pyspark.sql import SparkSession

Create the Spark Session

  1. spark = SparkSession.builder.appName('pyspark_app_name').getOrCreate()

You can add any configs you wish during creation by chaining ".config(...)" calls before the ".getOrCreate()". You can see the full list in the Spark configuration documentation. A few useful ones are below; an example session follows the list.

  • .config("spark.sql.jsonGenerator.ignoreNullFields", "false")
    • When writing JSON, NULL fields are kept instead of being dropped
  • .config("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
    • Fixes issues with legacy timestamps in write operations
  • .config("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
    • Fixes issues with legacy timestamps in read operations
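For example, a session with a couple of these options chained in (a minimal sketch; the app name is just a placeholder):

  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName('pyspark_app_name')
      .config('spark.sql.jsonGenerator.ignoreNullFields', 'false')
      .config('spark.sql.parquet.int96RebaseModeInWrite', 'CORRECTED')
      .config('spark.sql.parquet.int96RebaseModeInRead', 'CORRECTED')
      .getOrCreate()
  )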

 

PySpark: Create a DataFrame

This post is how to create a DataFrame in pyspark.

First we need a Spark session. See PySpark: Create a Spark Session for details on that.

Next we need the following imports:

  1. from pyspark.sql import Row
  2. from pyspark.sql.types import StringType, DecimalType, TimestampType, FloatType, IntegerType, LongType, StructField, StructType

Then you create the schema

  1. schema = StructType([
  2. StructField('id', IntegerType()),
  3. .....
  4. ])
  5.  
  6. data = [Row(id=1)]

Create the DataFrame

  1. df = spark.createDataFrame(data, schema=schema)

If you want to use a JSON definition to build your schema, do the following:

  import json
  from pyspark.sql import Row
  from pyspark.sql.types import StructType

  schema_json = '''
  {
    "fields": [
      {
        "metadata": {},
        "name": "column_a",
        "nullable": false,
        "type": "string"
      }
    ],
    "type": "struct"
  }
  '''

  table_schema = StructType.fromJson(json.loads(schema_json))

  data = [Row(column_a='a')]
  df = spark.createDataFrame(data, schema=table_schema)

 

Python: Unit Testing

This post focuses on common hurdles when trying to do unit testing.

Testing Values During Run

Add the following lines anywhere you want to pause the unit test and check values.

  1. import pdb
  2. pdb.set_trace()
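Once the trace hits, you can use the standard pdb commands at the prompt, for example:

  # p some_variable    print the value of a variable
  # n                  step to the next line
  # c                  continue execution
  # q                  quit the debugger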

How to Patch a Function

  from unittest.mock import patch

  @patch('src.path.to.file.my_function')
  @patch('src.path.to.file.my_function_add')
  def test_some_function(mock_my_function_add, mock_my_function):
      mock_my_function_add.return_value = <something>
      .......

How to Patch a Function With No Return Value

  from unittest.mock import patch

  def test_some_function():
      with patch('src.path.to.file.my_function'):
          ...

How to Patch a Function With 1 Return Value

  from unittest.mock import patch, MagicMock

  def test_some_function():
      with patch('src.path.to.file.my_function', MagicMock(return_value=[<MY_VALUES>])):
          ...

How to Patch a Function With Multiple Return Values

  from unittest.mock import patch, MagicMock

  def test_some_function():
      with patch('src.path.to.file.my_function', MagicMock(side_effect=[[<MY_VALUES>], [<OTHER_VALUES>]])):
          ...

How to Create a Test Module

  from unittest import TestCase

  class MyModule(TestCase):
      def setUp(self):
          some_class.my_variable = <something>
          ... DO OTHER STUFF

      def test_my_function(self):
          ... DO Function Test Stuff

How to Patch a Method

  patch_methods = [
      "pyodbc.connect"
  ]

  for method in patch_methods:
      patch(method).start()

How to create a PySpark Session

Once you define this fixture, you can just take spark as an argument in your tests and it will be created for you.

  import pytest
  from pyspark.sql import SparkSession

  @pytest.fixture(scope='module')
  def spark():
      return SparkSession.builder.appName('pyspark_test').getOrCreate()

How to Create a Spark SQL Example

  import pytest
  from pyspark.sql import SparkSession, Row
  from pyspark.sql.types import StructType, StructField, StringType

  @pytest.fixture(scope='module')
  def spark():
      return SparkSession.builder.appName('pyspark_test').getOrCreate()

  def test_function(spark):
      query = 'SELECT * FROM <table_name>'
      schema = StructType([
          StructField('column_a', StringType()),
          StructField('column_b', StringType()),
          StructField('column_c', StringType()),
      ])

      data = [Row(column_a='a', column_b='b', column_c='c')]
      table = spark.createDataFrame(data, schema=schema)
      table.createOrReplaceTempView('<table_name>')
      df = spark.sql(query).toPandas()

      assert not df.empty
      assert df.shape[0] == 1
      assert df.shape[1] == 3

      spark.catalog.dropTempView('<table_name>')

How to Mock a Database Call

First let’s assume you have an execute_sql function

  def execute_sql(cursor, sql, params):
      result = cursor.execute(sql, params).fetchone()
      # pyodbc cursors expose their connection, so commit through it
      cursor.connection.commit()
      return result

Next in your unit tests you want to test that function

  def test_execute_sql():
      val = <YOUR_RETURN_VALUE>
      with patch('path.to.code.execute_sql', MagicMock(return_value=val)) as mock_execute:
          return_val = some_other_function_that_calls_execute_sql(....)
          assert return_val == val

If you need to close a cursor or DB connection

  def test_execute_sql():
      val = <YOUR_RETURN_VALUE>
      mock_cursor = MagicMock()
      mock_cursor.configure_mock(
          **{
              "close": MagicMock()
          }
      )
      mock_connection = MagicMock()
      mock_connection.configure_mock(
          **{
              "close": MagicMock()
          }
      )

      with patch('path.to.code.cursor', MagicMock(return_value=mock_cursor)) as mock_cursor_close:
          with patch('path.to.code.connection', MagicMock(return_value=mock_connection)) as mock_connection_close:
              return_val = some_other_function_that_calls_execute_sql(....)
              assert return_val == val

How to Mock Open a File Example 1

  from unittest.mock import patch, mock_open

  @patch('builtins.open', new_callable=mock_open, read_data='my_data')
  def test_file_open(mock_file):
      assert open("my/file/path/filename.extension").read() == 'my_data'
      mock_file.assert_called_with("my/file/path/filename.extension")

      val = function_to_test(....)
      assert 'my_data' == val

How to Mock Open a File Example 2

  from unittest.mock import patch, mock_open

  def test_file_open():
      fake_file_path = 'file/path/to/mock'
      file_content_mock = 'test'
      with patch('path.to.code.open', new=mock_open(read_data=file_content_mock)) as mock_file:
          with patch('os.utime') as mock_utime:
              actual = function_to_test(fake_file_path)
              mock_file.assert_called_once_with(fake_file_path)
              assert actual is not None

Compare DataFrames

  def as_dicts(df):
      df = [row.asDict() for row in df.collect()]
      return sorted(df, key=lambda row: str(row))

  assert as_dicts(df1) == as_dicts(df2)

Python: Create a WHL File

This post will just be a how-to on creating a whl file.

You need the following files:

MANIFEST.in:

  1. recursive-include <directory> *
  2. recursive-exclude tests *.py

requirements.txt:

This file just holds your packages and the versions you need.
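For example (the package names and versions below are just placeholders, not a recommendation):

  pyspark==3.3.0
  pytest==7.1.2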

setup.py

You remove pytest and coverage from your whl file because you don’t want those packages being required when you deploy your code.

  from setuptools import find_packages
  from distutils.core import setup
  import os

  if os.path.exists('requirements.txt'):
      req = [line.strip('\n') for line in open('requirements.txt') if 'pytest' not in line and 'coverage' not in line]

  setup(
      include_package_data=True,
      name=<app_name>,
      version=<app_version>,
      description=<app_desc>,
      install_requires=req,
      packages=find_packages(exclude=["*tests.*", "*tests"]),
      classifiers=[
          "Programming Language :: Python :: <python_version>",
          "License :: OSI Approved :: MIT License",
          "Operating System :: OS Independent",
      ],
      python_requires='>=<python_version>',
      package_dir={<directory>: <directory>},
  )
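To actually build the wheel, you would then typically run the following from the folder that contains setup.py (this assumes the wheel package is installed):

  python setup.py bdist_wheel

The built .whl file will be placed in the dist folder.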

To Check Your Whl File

Install package

  1. pip install check-wheel-contents

Check WHL

  1. check-wheel-contents <PATH_TO_WHL>\<filename>.whl

Install WHL

This will deploy to <PATH_TO_PYTHON>\Lib\site-packages\<directory>

  1. <PATH_TO_PYTHON>\Scripts\pip3.7.exe install <PATH_TO_WHL>\<filename>.whl

Azure: EventHub

In this tutorial I will show you how to connect to Event Hub from Python. Ensure you have first installed an IDE (Eclipse) and Python 3.7.

Python Package Installation

  1. pip3 install azure-eventhub

Create a Producer

This will publish events to the event hub. The important part here is the "Endpoint". You need to login to the Azure Portal and get the endpoint from the "Shared Access Policies" of the event hub namespace.

  from azure.eventhub import EventHubProducerClient, EventData, EventHubConsumerClient

  connection_str = 'Endpoint=sb://testeventhubnamespace.servicebus.windows.net/;SharedAccessKeyName=<<THE_ACCESS_KEY_NAME>>;SharedAccessKey=<<THE_ACCESS_KEY>>'
  eventhub_name = '<<THE_EVENT_HUB_NAME>>'
  producer = EventHubProducerClient.from_connection_string(connection_str, eventhub_name=eventhub_name)

  event_data_batch = producer.create_batch()

  event_data_batch.add(EventData('My Test Data'))

  with producer:
      producer.send_batch(event_data_batch)

Create a Consumer

This will monitor the event hub for new messages.

  from azure.eventhub import EventHubProducerClient, EventData, EventHubConsumerClient

  connection_str = 'Endpoint=sb://testeventhubnamespace.servicebus.windows.net/;SharedAccessKeyName=<<THE_ACCESS_KEY_NAME>>;SharedAccessKey=<<THE_ACCESS_KEY>>'
  eventhub_name = '<<THE_EVENT_HUB_NAME>>'
  consumer_group = '<<THE_EVENT_HUB_CONSUMER_GROUP>>'
  client = EventHubConsumerClient.from_connection_string(connection_str, consumer_group, eventhub_name=eventhub_name)

  def on_event(partition_context, event):
      print("Received event from partition {} - {}".format(partition_context.partition_id, event))
      partition_context.update_checkpoint(event)

  with client:
      #client.receive(
      #    on_event=on_event,
      #    starting_position="-1", # "-1" is from the beginning of the partition.
      #)
      client.receive(
          on_event=on_event
      )

 

Jupyter Installation

In this tutorial I will show you how to install Jupyter. I will use self-signed certs for this example.

This assumes your hostname is “hadoop”

Prerequisites

Python3.5 Installation

  1. sudo apt install python3-pip

Update .bashrc

  1. sudo nano ~/.bashrc
  2.  
  3. #Add the following
  4. alias python=python3.5
  5.  
  6. source ~/.bashrc

Install

  1. pip3 install jupyter
  2.  
  3. jupyter notebook --generate-config
  4.  
  5. jupyter notebook password
  6. #ENTER PASSWORD
  7.  
  8. cat ~/.jupyter/jupyter_notebook_config.json
  9. #Get the SHA1 value

Setup Configuration

  1. nano ~/.jupyter/jupyter_notebook_config.py
  2.  
  3. #Find and change the values for the following
  4. c.NotebookApp.ip = '0.0.0.0'
  5. c.NotebookApp.port = 8888
  6. c.NotebookApp.password = u'sha1:1234567fbbd5:dfgy8e0a3l12fehh46ea89f23jjjkae54a2kk54g'
  7. c.NotebookApp.open_browser = False
  8. c.NotebookApp.certfile = '/etc/security/serverKeys/hadoop.pem'
  9. c.NotebookApp.keyfile = '/etc/security/serverKeys/hadoop.key'

Run Jupyter

  1. jupyter notebook

https://NAMENODE:8888

References

https://jupyter.readthedocs.io/en/latest/index.html

Spark Installation on Hadoop

In this tutorial I will show you how to use Kerberos/SSL with Spark integrated with Yarn. I will use self-signed certs for this example. Before you begin ensure you have installed the Kerberos Server and Hadoop.

This assumes your hostname is “hadoop”

Create Kerberos Principals

  1. cd /etc/security/keytabs/
  2.  
  3. sudo kadmin.local
  4.  
  5. #You can list principals
  6. listprincs
  7.  
  8. #Create the following principals
  9. addprinc -randkey spark/hadoop@REALM.CA
  10.  
  11. #Create the keytab files.
  12. #You will need these for Hadoop to be able to login
  13. xst -k spark.service.keytab spark/hadoop@REALM.CA

Set Keytab Permissions/Ownership

  1. sudo chown root:hadoopuser /etc/security/keytabs/*
  2. sudo chmod 750 /etc/security/keytabs/*

Download

Go to Apache Spark Download and get the link for Spark.

  1. wget http://apache.forsale.plus/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
  2. tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
  3. mv spark-2.4.4-bin-hadoop2.7 /usr/local/spark/

Update .bashrc

  1. sudo nano ~/.bashrc
  2.  
  3. #Ensure we have the following in the Hadoop section
  4. export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
  5.  
  6. #Add the following
  7.  
  8. #SPARK VARIABLES START
  9. export SPARK_HOME=/usr/local/spark
  10. export PATH=$PATH:$SPARK_HOME/bin
  11. export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
  12. #SPARK VARIABLES STOP
  13.  
  14. source ~/.bashrc

Setup Configuration

  1. cd /usr/local/spark/conf
  2. mv spark-defaults.conf.template spark-defaults.conf
  3. nano spark-defaults.conf
  4.  
  5. #Add to the end
  6. spark.master yarn
  7. spark.yarn.historyServer.address ${hadoopconf-yarn.resourcemanager.hostname}:18080
  8. spark.yarn.keytab /etc/security/keytabs/spark.service.keytab
  9. spark.yarn.principal spark/hadoop@REALM.CA
  10. spark.yarn.access.hadoopFileSystems hdfs://NAMENODE:54310
  11. spark.authenticate true
  12. spark.driver.bindAddress 0.0.0.0
  13. spark.authenticate.enableSaslEncryption true
  14. spark.eventLog.enabled true
  15. spark.eventLog.dir hdfs://NAMENODE:54310/user/spark/applicationHistory
  16. spark.history.fs.logDirectory hdfs://NAMENODE:54310/user/spark/applicationHistory
  17. spark.history.fs.update.interval 10s
  18. spark.history.ui.port 18080
  19.  
  20. #SSL
  21. spark.ssl.enabled true
  22. spark.ssl.keyPassword PASSWORD
  23. spark.ssl.keyStore /etc/security/serverKeys/keystore.jks
  24. spark.ssl.keyStorePassword PASSWORD
  25. spark.ssl.keyStoreType JKS
  26. spark.ssl.trustStore /etc/security/serverKeys/truststore.jks
  27. spark.ssl.trustStorePassword PASSWORD
  28. spark.ssl.trustStoreType JKS

Kinit

  1. kinit -kt /etc/security/keytabs/spark.service.keytab spark/hadoop@REALM.CA
  2. klist
  3. hdfs dfs -mkdir /user/spark/
  4. hdfs dfs -mkdir /user/spark/applicationHistory
  5. hdfs dfs -ls /user/spark

Start The Service

  1. $SPARK_HOME/sbin/start-history-server.sh

Stop The Service

  1. $SPARK_HOME/sbin/stop-history-server.sh

Spark History Server Web UI

https://NAMENODE:18080

References

I used a lot of different resources and reference material on this. Below are just a few I used.

https://spark.apache.org/docs/latest/running-on-yarn.html#configuration

https://spark.apache.org/docs/latest/security.html

https://www.linode.com/docs/databases/hadoop/install-configure-run-spark-on-top-of-hadoop-yarn-cluster/

Python: xlrd (Read Excel File)

In this tutorial I will show you how to read an excel file in Python.

Installation

  1. pip install xlrd

Open The Workbook

  1. import xlrd
  2.  
  3. my_excel = (r'C:\path\to\file')
  4. wb = xlrd.open_workbook(my_excel)

Select Sheet

  # Select the first sheet. If you want to select the third just change to (2)
  sheet = wb.sheet_by_index(0)

Get Data In Column

  #This loops through all the rows in that sheet
  for i in range(sheet.nrows):
      # if the value isn't empty then print it out.
      if sheet.cell_value(i, 0) != '':
          print(sheet.cell_value(i, 0))

Get all the Column Headers

  #This loops through all the columns in that sheet
  for i in range(sheet.ncols):
      # if the value isn't empty then print it out.
      if sheet.cell_value(0, i) != '':
          print(sheet.cell_value(0, i))

 

Sqoop2: Kerberize Installation

In this tutorial I will show you how to kerberize Sqoop installation. Before you begin ensure you have installed Sqoop.

This assumes your hostname is “hadoop”

Create Kerberos Principals

  1. cd /etc/security/keytabs
  2. sudo kadmin.local
  3. addprinc -randkey sqoop/hadoop@REALM.CA
  4. xst -k sqoop.service.keytab sqoop/hadoop@REALM.CA
  5. addprinc -randkey sqoopHTTP/hadoop@REALM.CA
  6. xst -k sqoopHTTP.service.keytab sqoopHTTP/hadoop@REALM.CA
  7. q

Set Keytab Permissions/Ownership

  1. sudo chown root:hadoopuser /etc/security/keytabs/*
  2. sudo chmod 750 /etc/security/keytabs/*

Configuration

Configure Kerberos with Sqoop

  1. cd /usr/local/sqoop/conf/
  2. nano sqoop.properties
  3.  
  4. #uncomment the following
  5. org.apache.sqoop.security.authentication.type=KERBEROS
  6. org.apache.sqoop.security.authentication.handler=org.apache.sqoop.security.authentication.KerberosAuthenticationHandler
  7.  
  8. #update to the following
  9. org.apache.sqoop.security.authentication.kerberos.principal=sqoop/hadoop@REALM.CA
  10. org.apache.sqoop.security.authentication.kerberos.keytab=/etc/security/keytabs/sqoop.service.keytab

Sqoop2: Installation

We are going to install Sqoop. Ensure you have Hadoop installed already.

This assumes your hostname is “hadoop”

Install Java JDK

  1. apt-get update
  2. apt-get upgrade
  3. apt-get install default-jdk

Download Sqoop:

  1. wget https://archive.apache.org/dist/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz
  2. tar -zxvf sqoop-1.99.7-bin-hadoop200.tar.gz
  3. sudo mv sqoop-1.99.7-bin-hadoop200 /usr/local/sqoop/
  4. sudo chown -R root:hadoopuser /usr/local/sqoop/

Setup .bashrc:

  1. sudo nano ~/.bashrc

Add the following to the end of the file.

#SQOOP VARIABLES START
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
export SQOOP_CONF_DIR=$SQOOP_HOME/conf
export SQOOP_CLASS_PATH=$SQOOP_CONF_DIR
#SQOOP VARIABLES STOP

  1. source ~/.bashrc

Initialise Repository

  1. ./bin/sqoop2-tool upgrade

Modify sqoop2-server

If you are running Hadoop on the same server as Sqoop Server you will need to modify this file. The reason is because Sqoop needs you to point to the lib directory for common, hdfs, mapreduce and yarn.

  1. nano /usr/local/sqoop/bin/sqoop.sh
  2.  
  3. #Modify these lines
  4. HADOOP_COMMON_HOME=${HADOOP_COMMON_HOME:-${HADOOP_HOME}/share/hadoop/common}
  5. HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-${HADOOP_HOME}/share/hadoop/hdfs}
  6. HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-${HADOOP_HOME}/share/hadoop/mapreduce}
  7. HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-${HADOOP_HOME}/share/hadoop/yarn}
  8.  
  9. #TO
  10.  
  11. HADOOP_COMMON_HOME=${HADOOP_HOME}/share/hadoop/common
  12. HADOOP_HDFS_HOME=${HADOOP_HOME}/share/hadoop/hdfs
  13. HADOOP_MAPRED_HOME=${HADOOP_HOME}/share/hadoop/mapreduce
  14. HADOOP_YARN_HOME=${HADOOP_HOME}/share/hadoop/yarn

Configuration

  1. nano /usr/local/sqoop/conf/sqoop.properties
  2. #Update the following line
  3. org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/usr/local/hadoop/etc/hadoop/

Start Sqoop Server

  1. ./bin/sqoop2-server start

References

https://linoxide.com/tools/install-apache-sqoop-ubuntu-16-04/

PySpark: Eclipse Integration

This tutorial will guide you through configuring PySpark on Eclipse.

First you need to install Eclipse.

You need to add “pyspark.zip” and “py4j-0.10.7-src.zip” to “Libraries” for the Python Interpreter.

Next you need to configure the Environment variables for PySpark.
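Exactly which variables you need depends on your setup, but as a rough sketch you can also point PySpark at your Spark install from code before creating a session; the paths below are placeholders:

  import os
  import sys

  # Placeholder paths - point these at your local Spark installation
  os.environ['SPARK_HOME'] = '/path/to/spark'
  sys.path.append('/path/to/spark/python')
  sys.path.append('/path/to/spark/python/lib/py4j-0.10.7-src.zip')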

Test that it works!

  from pyspark import SparkConf, SparkContext
  from pyspark.sql import SparkSession

  def init_spark():
      spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
      sc = spark.sparkContext
      return spark, sc

  if __name__ == '__main__':
      spark, sc = init_spark()
      nums = sc.parallelize([1, 2, 3, 4])
      print(nums.map(lambda x: x*x).collect())

Shell: Functions

In this tutorial I will show you how to work with functions.

Define Function

You return 0 for success and 1 for failure

  1. function test()
  2. {
  3. return 0
  4. }

Call function with no arguments

  1. test

Call function with arguments

  1. val1='test'
  2. val2='test2'
  3. test $val1 $val2

Call function with arguments that have spaces in the value

  1. val1='test'
  2. val2='test2 test23'
  3. test "${val1}" "${val2}"

PySpark: StandAlone Installation on Windows

This tutorial will guide you through installing PySpark StandAlone on Windows for development.

Install Python 3

You need to have python 3 installed. You can get it here.

Install Spark

Go to Apache Spark and download the latest version and package “Pre-built for Apache Hadoop 2.7 and later”. Download spark-2.4.1-bin-hadoop2.7.tgz.

Install 7 Zip

You will need 7 Zip to open spark-2.4.1-bin-hadoop2.7.tgz.

Install Java

You need to ensure you have Java 8 installed.

Extract Spark

Once you have installed 7 Zip you can extract spark into C:\spark\ directory. The directory structure will look like this c:\spark\spark-2.4.1-bin-hadoop2.7\

Download WinUtils.exe

Download the winutils.exe and put to C:\spark\spark-2.4.1-bin-hadoop2.7\bin\

Environment Variables

Use the following commands to set your Spark-specific ENV variables. The "-m" option means all users; if you omit it, the variables are added for the current user only.

  1. setx -m SPARK_HOME C:\spark\spark-2.4.1-bin-hadoop2.7
  2. setx -m HADOOP_HOME C:\spark\spark-2.4.1-bin-hadoop2.7

You also want to add the following to the “Path” env variable “;C:\spark\spark-2.4.1-bin-hadoop2.7\bin”

Run PySpark

Open a command prompt and type the following. The "--master" parameter is used for setting the master node address; "local[2]" tells Spark to run locally using 2 cores.

  1. pyspark --master local[2]

Test

You can then test that it is working by running the following code.

  1. words = sc.parallelize (
  2. ["scala",
  3. "java",
  4. "hadoop",
  5. "spark",
  6. "hbase",
  7. "spark vs hadoop",
  8. "pyspark",
  9. "pyspark and spark"]
  10. )
  11. counts = words.count()
  12. print("Number of elements in RDD -> %i" % (counts))

References

I used this as a guide. https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c

Hadoop 3.2.0: Installation

I would like to share what I have learned and applied in the hopes that it will help someone else configure their system. The deployment I have done is to have a Name Node and 1-* DataNodes on Ubuntu 16.04 assuming 5 cpu and 13GB RAM. I will put all commands used in this tutorial right down to the very basics for those that are new to Ubuntu.

NOTE: Sometimes you may have to use “sudo” in front of the command. I also use nano for this article for beginners but you can use any editor you prefer (ie: vi). Also this article does not take into consideration any SSL, kerberos, etc. For all purposes here Hadoop will be open without having to login, etc.

Additional Setup/Configurations to Consider:

Zookeeper: It is also a good idea to use ZooKeeper to synchronize your configuration.

Secondary NameNode: This should be done on a separate server and its function is to take checkpoints of the NameNode's file system.

Rack Awareness: Fault tolerance to ensure blocks are placed as evenly as possible on different racks if they are available.

Apply the following to all NameNode and DataNodes unless otherwise directed:

Hadoop User:
For this example we will just use hduser as our group and user for simplicity's sake.
The "-a" flag on usermod appends the user to the group given with "-G".

  addgroup hduser
  adduser --ingroup hduser hduser
  sudo gpasswd -a $USER sudo
  usermod -aG sudo hduser

Install JDK:

  1. apt-get update
  2. apt-get upgrade
  3. apt-get install default-jdk

Install SSH:

  1. apt-get install ssh
  2. which ssh
  3. which sshd

These two commands will check that ssh installed correctly and will return “/usr/bin/ssh” and “/usr/bin/sshd”

  1. java -version

You use this to verify that java installed correctly and will return something like the following.

openjdk version “1.8.0_171”
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

System Configuration

  1. nano ~/.bashrc

The .bashrc is a script that is executed when a terminal session is started.
Add the following line to the end and save because Hadoop uses IPv4.

export _JAVA_OPTIONS='-XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true'

  1. source ~/.bashrc

sysctl.conf

Disable ipv6 as it causes issues in getting your server up and running.

  1. nano /etc/sysctl.conf

Add the following to the end and save

  1. net.ipv6.conf.all.disable_ipv6 = 1
  2. net.ipv6.conf.default.disable_ipv6 = 1
  3. net.ipv6.conf.lo.disable_ipv6 = 1
  4. #Change eth0 to what ifconfig has
  5. net.ipv6.conf.eth0.disable_ipv6 = 1

Reload sysctl

  1. sysctl -p
  2. cat /proc/sys/net/ipv6/conf/all/disable_ipv6
  3. reboot

If all the above disabling IPv6 configuration was successful you should get “1” returned.
Sometimes you can reach open file descriptor limit and open file limit. If you do encounter this issue you might have to set the ulimit and descriptor limit. For this example I have set some values but you will have to figure out the best numbers for your specific case.

If you get "cannot stat /proc/sys/-p: No such file or directory", then you need to add /sbin/ to your PATH.

  1. sudo nano ~/.bashrc
  2. export PATH=$PATH:/sbin/
  1. nano /etc/sysctl.conf

fs.file-max = 500000

  1. sysctl -p

limits.conf

  1. nano /etc/security/limits.conf

* soft nofile 60000
* hard nofile 60000

  1. reboot

Test Limits

You can now test the limits you applied to make sure they took.

  1. ulimit -a
  2. more /proc/sys/fs/file-max
  3. more /proc/sys/fs/file-nr
  4. lsof | wc -l

file-max: Current open file descriptor limit
file-nr: How many file descriptors are currently being used
lsof | wc -l: How many files are currently open

You might be wondering why we installed ssh at the beginning. That is because Hadoop uses ssh to access its nodes. We need to eliminate the password requirement by setting up ssh certificates. If asked for a filename just leave it blank and confirm with enter.

  1. su hduser

If not already logged in as the user we created in the Hadoop user section.

  1. ssh-keygen -t rsa -P ""

You will get the below example as well as the fingerprint and randomart image.

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory ‘/home/hduser/.ssh’.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.

  1. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

If you get "No such file or directory", check the .ssh directory for the actual filename of the public key; it will most likely be "id_rsa.pub".

This will add the newly created key to the list of authorized keys so that Hadoop can use SSH without prompting for a password.
Now we check that it worked by running "ssh localhost". When prompted whether you want to continue connecting, type "yes" and press enter; localhost will be permanently added to the list of known hosts.
Once we have done this on all Name Nodes and Data Nodes you should run the following commands from the Name Node to each Data Node.

  1. ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@DATANODEHOSTNAME
  2. ssh DATANODEHOSTNAME

/etc/hosts Update

We need to update the hosts file.

  1. sudo nano /etc/hosts
  2.  
  3. #Comment out line "127.0.0.1 localhost"
  4.  
  5. 127.0.0.1 HOSTNAME localhost

Now we are getting to the part we have been waiting for.

Hadoop Installation:

NAMENODE: You will see this in the config files below and it can be the hostname, the static ip or it could be 0.0.0.0 so that all TCP ports will be bound to all IP’s of the server. You should also note that the masters and slaves file later on in this tutorial can still be the hostname.

Note: You could run rsync after setting up the Name Node Initial configuration to each Data Node if you want. This would save initial hadoop setup time. You do that by running the following command:

  1. rsync -a /usr/local/hadoop/ hduser@DATANODEHOSTNAME:/usr/local/hadoop/

Download & Extract:

  1. wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
  2. tar xvzf hadoop-3.2.0.tar.gz
  3. sudo mv hadoop-3.2.0/ /usr/local/hadoop
  4. chown -R hduser:hduser /usr/local/hadoop
  5. update-alternatives --config java

Basically the above downloads and extracts Hadoop, moves the extracted hadoop directory to the /usr/local directory, switches ownership to hduser if it doesn't already own the newly created directory,
and tells us the path where java has been installed so we can set the JAVA_HOME environment variable. It should return something like the following:

There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

  1. nano ~/.bashrc

Add the following to the end of the file. Make sure to do this on Name Node and all Data Nodes:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_HOME=$HADOOP_INSTALL

export HDFS_NAMENODE_USER=hduser
export HDFS_DATANODE_USER=hduser
export HDFS_SECONDARYNAMENODE_USER=hduser

#HADOOP VARIABLES END

  1. source ~/.bashrc
  2. javac -version
  3. which javac
  4. readlink /usr/bin/javac

This basically validates that bashrc update worked!
javac should return “javac 1.8.0_171” or something similar
which javac should return “/usr/bin/javac”
readlink should return “/usr/lib/jvm/java-8-openjdk-amd64/bin/javac”

Memory Tools

There is an application from HortonWorks you can download which can help get you started on how you should setup memory utilization for yarn. I found it’s a great starting point but you need to tweak it to work for what you need on your specific case.

  1. wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
  2. tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
  3. cd hdp_manual_install_rpm_helper_files-2.6.0.3.8/
  4. sudo apt-get install python2.7
  5. python2.7 scripts/yarn-utils.py -c 5 -m 13 -d 1 -k False

-c is for how many cores you have
-m is for how much memory you have (in GB)
-d is for how many disks you have
-k is False if you are not running HBASE, True if you are.

After the script has run it will give you guidelines on yarn/mapreduce settings. Remember they are guidelines; tweak as needed.
Now the real fun begins!!! Remember that these settings are what worked for me and you may need to adjust them.

 

hadoop-env.sh

  1. nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

You will see JAVA_HOME near the beginning of the file you will need to change that to where java is installed on your system.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,DRFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,RFAAUDIT} $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS=$HADOOP_NAMENODE_OPTS
export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"

  1. mkdir -p /app/hadoop/tmp

This is the temp directory hadoop uses

  1. chown hduser:hduser /app/hadoop/tmp

core-site.xml

Click here to view the docs.

  1. nano /usr/local/hadoop/etc/hadoop/core-site.xml

This file contains configuration properties that Hadoop uses when starting up. By default it will just contain an empty <configuration></configuration> element. This will need to be changed.

  1. <configuration>
  2.       <property>
  3.             <name>fs.defaultFS</name>
  4.             <value>hdfs://NAMENODE:54310</value>
  5.             <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming
  6. the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  7.       </property>
  8.       <property>
  9.             <name>hadoop.tmp.dir</name>
  10.             <value>/app/hadoop/tmp</value>
  11.       </property>
  12.       <property>
  13.             <name>hadoop.proxyuser.hduser.hosts</name>
  14.             <value>*</value>
  15.       </property>
  16.       <property>
  17.             <name>hadoop.proxyuser.hduser.groups</name>
  18.             <value>*</value>
  19.       </property>
  20. </configuration>

yarn-site.xml

Click here to view the docs.

  1. nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
  1. <configuration>
  2.       <property>
  3.             <name>yarn.nodemanager.aux-services</name>
  4.             <value>mapreduce_shuffle</value>
  5.       </property>
  6.       <property>
  7.             <name>yarn.resourcemanager.scheduler.class</name> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  8.       </property>
  9.       <property>
  10.             <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  11.             <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  12.       </property>
  13.       <property>
  14.             <name>yarn.nodemanager.resource.memory-mb</name>
  15.             <value>12288</value>
  16.             <final>true</final>
  17.       </property>
  18.       <property>
  19.             <name>yarn.scheduler.minimum-allocation-mb</name>
  20.             <value>4096</value>
  21.             <final>true</final>
  22.       </property>
  23.       <property>
  24.             <name>yarn.scheduler.maximum-allocation-mb</name>
  25.             <value>12288</value>
  26.             <final>true</final>
  27.       </property>
  28.       <property>
  29.             <name>yarn.app.mapreduce.am.resource.mb</name>
  30.             <value>4096</value>
  31.       </property>
  32.       <property>
  33.             <name>yarn.app.mapreduce.am.command-opts</name>
  34.             <value>-Xmx3276m</value>
  35.       </property>
  36.       <property>
  37.             <name>yarn.nodemanager.local-dirs</name>
  38.             <value>/app/hadoop/tmp/nm-local-dir</value>
  39.       </property>
  40.       <!--LOG-->
  41.       <property>
  42.             <name>yarn.log-aggregation-enable</name>
  43.             <value>true</value>
  44.       </property>
  45.       <property>
  46.             <description>Where to aggregate logs to.</description>
  47.             <name>yarn.nodemanager.remote-app-log-dir</name>
  48.             <value>/tmp/yarn/logs</value>
  49.       </property>
  50.       <property>
  51.             <name>yarn.log-aggregation.retain-seconds</name>
  52.             <value>604800</value>
  53.       </property>
  54.       <property>
  55.             <name>yarn.log-aggregation.retain-check-interval-seconds</name>
  56.             <value>86400</value>
  57.       </property>
  58.       <property>
  59.             <name>yarn.log.server.url</name>
  60.             <value>http://NAMENODE:19888/jobhistory/logs/</value>
  61.       </property>
  62.       
  63.       <!--URLs-->
  64. <property>
  65. <name>yarn.resourcemanager.resource-tracker.address</name>
  66. <value>${yarn.resourcemanager.hostname}:8025</value>
  67. </property>
  68. <property>
  69. <name>yarn.resourcemanager.scheduler.address</name>
  70. <value>${yarn.resourcemanager.hostname}:8030</value>
  71. </property>
  72. <property>
  73. <name>yarn.resourcemanager.address</name>
  74. <value>${yarn.resourcemanager.hostname}:8050</value>
  75. </property>
  76. <property>
  77. <name>yarn.resourcemanager.admin.address</name>
  78. <value>${yarn.resourcemanager.hostname}:8033</value>
  79. </property>
  80. <property>
  81. <name>yarn.resourcemanager.webapp.address</name>
  82. <value>${yarn.nodemanager.hostname}:8088</value>
  83. </property>
  84. <property>
  85. <name>yarn.nodemanager.hostname</name>
  86. <value>0.0.0.0</value>
  87. </property>
  88. <property>
  89. <name>yarn.nodemanager.address</name>
  90. <value>${yarn.nodemanager.hostname}:0</value>
  91. </property>
  92. <property>
  93. <name>yarn.nodemanager.webapp.address</name>
  94. <value>${yarn.nodemanager.hostname}:8042</value>
  95. </property>
  96. </configuration>

By default it will just contain an empty <configuration></configuration> element. This will need to be changed.

mapred-site.xml

Click here to view the docs. By default, the /usr/local/hadoop/etc/hadoop/ folder contains a /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file which has to be renamed/copied to mapred-site.xml. By default it will just contain an empty <configuration></configuration> element. This will need to be changed.

  1. cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
  2.  
  3. nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
  1. <configuration>
  2.       <property>
  3.             <name>mapreduce.framework.name</name>
  4.             <value>yarn</value>
  5.       </property>
  6.       <property>
  7.             <name>mapreduce.jobhistory.address</name>
  8.             <value>0.0.0.0:10020</value>
  9.       </property>
  10.       <property>
  11.             <name>mapreduce.jobhistory.webapp.address</name>
  12.             <value>0.0.0.0:19888</value>
  13.       </property>
  14.       <property>
  15.             <name>mapreduce.jobtracker.address</name>
  16.             <value>0.0.0.0:54311</value>
  17.       </property>
  18.       <property>
  19.             <name>mapreduce.jobhistory.admin.address</name>
  20.             <value>0.0.0.0:10033</value>
  21.       </property>
  22.       <!-- Memory and concurrency tuning -->
  23.       <property>
  24.             <name>mapreduce.map.memory.mb</name>
  25.             <value>4096</value>
  26.       </property>
  27.       <property>
  28.             <name>mapreduce.map.java.opts</name>
  29.             <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
  30.       </property>
  31.       <property>
  32.             <name>mapreduce.reduce.memory.mb</name>
  33.             <value>4096</value>
  34.       </property>
  35.       <property>
  36.             <name>mapreduce.reduce.java.opts</name>
  37.             <value>-server -Xmx3276m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
  38.       </property>
  39.       <property>
  40.             <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  41.             <value>0.5</value>
  42.       </property>
  43.       <property>
  44.             <name>mapreduce.task.io.sort.mb</name>
  45.             <value>600</value>
  46.       </property>
  47.       <property>
  48.             <name>mapreduce.task.io.sort.factor</name>
  49.             <value>1638</value>
  50.       </property>
  51.       <property>
  52.             <name>mapreduce.map.sort.spill.percent</name>
  53.             <value>0.50</value>
  54.       </property>
  55.       <property>
  56.             <name>mapreduce.map.speculative</name>
  57.             <value>false</value>
  58.       </property>
  59.       <property>
  60.             <name>mapreduce.reduce.speculative</name>
  61.             <value>false</value>
  62.       </property>
  63.       <property>
  64.             <name>mapreduce.task.timeout</name>
  65.             <value>1800000</value>
  66.       </property>
  67. </configuration>

yarn-env.sh

  1. nano /usr/local/hadoop/etc/hadoop/yarn-env.sh

Change or uncomment or add the following:

JAVA_HEAP_MAX=-Xmx2000m
HADOOP_OPTS="$HADOOP_OPTS -server -Dhadoop.log.dir=$YARN_LOG_DIR"
HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

Master

Add the namenode hostname.

  1. nano /usr/local/hadoop/etc/hadoop/masters

APPLY THE FOLLOWING TO THE NAMENODE ONLY

Slaves

Add namenode hostname and all datanodes hostname.

  1. nano /usr/local/hadoop/etc/hadoop/slaves

hdfs-site.xml

Click here to view the docs. By default it will just contain an empty <configuration></configuration> element. This will need to be changed. The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. Before editing this file, we need to create the namenode directory.

  1. mkdir -p /usr/local/hadoop_store/data/namenode
  2. chown -R hduser:hduser /usr/local/hadoop_store
  3. nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
  1. <configuration>
  2.       <property>
  3.             <name>dfs.replication</name>
  4.             <value>3</value>
  5.             <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  6.       </property>
  7.       <property>
  8.             <name>dfs.permissions</name>
  9.             <value>false</value>
  10.       </property>
  11.       <property>
  12.             <name>dfs.namenode.name.dir</name>
  13.             <value>file:/usr/local/hadoop_store/data/namenode</value>
  14.       </property>
  15.       <property>
  16.             <name>dfs.datanode.use.datanode.hostname</name>
  17.             <value>false</value>
  18.       </property>
  19.       <property>
  20.             <name>dfs.blocksize</name>
  21.             <value>128m</value>
  22.       </property>
  23.       <property>
  24.             <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
  25.             <value>false</value>
  26.       </property>
  27.       
  28. <!-- URL -->
  29. <property>
  30. <name>dfs.namenode.http-address</name>
  31. <value>${dfs.namenode.http-bind-host}:50070</value>
  32. <description>Your NameNode hostname for http access.</description>
  33. </property>
  34. <property>
  35. <name>dfs.namenode.secondary.http-address</name>
  36. <value>${dfs.namenode.http-bind-host}:50090</value>
  37. <description>Your Secondary NameNode hostname for http access.</description>
  38. </property>
  39. <property>
  40. <name>dfs.datanode.http.address</name>
  41. <value>${dfs.namenode.http-bind-host}:50075</value>
  42. </property>
  43. <property>
  44. <name>dfs.datanode.address</name>
  45. <value>${dfs.namenode.http-bind-host}:50076</value>
  46. </property>
  47. <property>
  48. <name>dfs.namenode.http-bind-host</name>
  49. <value>0.0.0.0</value>
  50. </property>
  51. <property>
  52. <name>dfs.namenode.rpc-bind-host</name>
  53. <value>0.0.0.0</value>
  54. </property>
  55. <property>
  56. <name>dfs.namenode.servicerpc-bind-host</name>
  57. <value>0.0.0.0</value>
  58. </property>
  59. </configuration>

APPLY THE FOLLOWING TO THE DATANODE(s) ONLY

Slaves

Add only that datanodes hostname.

  1. nano /usr/local/hadoop/etc/hadoop/slaves

hdfs-site.xml

The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. Before editing this file, we need to create the datanode directory.
By default it will just contain an empty <configuration></configuration> element. This will need to be changed.

  1. mkdir -p /usr/local/hadoop_store/data/datanode
  2. chown -R hduser:hduser /usr/local/hadoop_store
  3. nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
  1. <configuration>
  2.       <property>
  3.             <name>dfs.replication</name>
  4.             <value>3</value>
  5.             <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  6.       </property>
  7.       <property>
  8.             <name>dfs.permissions</name>
  9.             <value>false</value>
  10.       </property>
  11.       <property>
  12.             <name>dfs.blocksize</name>
  13.             <value>128m</value>
  14.       </property>
  15.       <property>
  16.             <name>dfs.datanode.data.dir</name>
  17.             <value>file:/usr/local/hadoop_store/data/datanode</value>
  18.       </property>
  19.       <property>
  20.             <name>dfs.datanode.use.datanode.hostname</name>
  21.             <value>false</value>
  22.       </property>
  23.       <property>
  24.             <name>dfs.namenode.http-address</name>
  25.             <value>${dfs.namenode.http-bind-host}:50070</value>
  26.             <description>Your NameNode hostname for http access.</description>
  27.       </property>
  28.       <property>
  29.             <name>dfs.namenode.secondary.http-address</name>
  30.             <value>${dfs.namenode.http-bind-host}:50090</value>
  31.             <description>Your Secondary NameNode hostname for http access.</description>
  32.       </property>
  33.       <property>
  34.             <name>dfs.datanode.http.address</name>
  35.             <value>${dfs.namenode.http-bind-host}:50075</value>
  36.       </property>
  37.       <property>
  38.             <name>dfs.datanode.address</name>
  39.             <value>${dfs.namenode.http-bind-host}:50076</value>
  40.       </property>
  41. <property>
  42. <name>dfs.namenode.http-bind-host</name>
  43. <value>0.0.0.0</value>
  44. </property>
  45. <property>
  46. <name>dfs.namenode.rpc-bind-host</name>
  47. <value>0.0.0.0</value>
  48. </property>
  49. <property>
  50. <name>dfs.namenode.servicerpc-bind-host</name>
  51. <value>0.0.0.0</value>
  52. </property>
  53. </configuration>

If you have the Ubuntu firewall on, you need to allow all of the necessary ports through.

  1. sudo ufw allow 50070
  2. sudo ufw allow 8088

Format Cluster:
Only do this if NO data is present. All data will be destroyed when the following is done.
This is to be done on NAMENODE ONLY!

  1. hdfs namenode -format

Start The Cluster:
You can now start the cluster.
You do this from the NAMENODE ONLY.

  1. start-dfs.sh
  2. start-yarn.sh
  3. mapred --config $HADOOP_CONF_DIR --daemon start historyserver

If the above three commands didn't work, something went wrong, as the scripts should have been found in the /usr/local/hadoop/sbin/ directory (which was added to the PATH earlier).

Cron Job:
You should probably setup a cron job to start the cluster when you reboot.

  1. crontab -e

@reboot /usr/local/hadoop/sbin/start-dfs.sh > /home/hduser/dfs-start.log 2>&1
@reboot /usr/local/hadoop/sbin/start-yarn.sh > /home/hduser/yarn-start.log 2>&1
@reboot /usr/local/hadoop/bin/mapred --config $HADOOP_CONF_DIR --daemon start historyserver > /home/hduser/history-start.log 2>&1

Verification:
To check that everything is working as it should run “jps” on the NAMENODE. It should return something like the following where the pid will be different:

  1. jps

You could also run "netstat -plten | grep java" or "lsof -i :50070" and "lsof -i :8088".

Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
12007 SecondaryNameNode
13090 Jps
12796 JobHistoryServer
12261 ResourceManager
11653 NameNode
12397 NodeManager
11792 DataNode

You can check the DATA NODES by ssh into each one and running “jps”. It should return something like the following where the pid will be different:

Picked up _JAVA_OPTIONS: -Xms3g -Xmx10g -Djava.net.preferIPv4Stack=true
3218 Jps
2215 NodeManager
2411 DataNode

If for any reason one of the services is not running you need to review the logs. They can be found at /usr/local/hadoop/logs/. If it's the ResourceManager that isn't running, look at the file that has "yarn" and "resourcemanager" in its name.

WARNING:
Never reboot the system without first stopping the cluster. When the cluster is shut down it is safe to reboot. Also, if you configured a cron job with @reboot, make sure the DATANODE machines are up and running before the NAMENODE starts, because the NAMENODE's start scripts are what start the services on the DATANODES for you.

Web Ports:

NameNode

  • 50070: HDFS Namenode
  • 50075: HDFS Datanode
  • 50090: HDFS Secondary Namenode
  • 8088: Resource Manager
  • 19888: Job History

DataNode

  • 50075: HDFS Datanode

NetStat

To check that all the Hadoop ports are available on which IP run the following.

  1. sudo netstat -ltnp

Port Check

If for some reason you are having issues connecting to a Hadoop port then run the following command as you try and connect via the port.

  1. sudo tcpdump -n -tttt -i eth1 port 50070

References

I used a lot of different resources and reference material on this, including various blog posts about memory utilization, but I did not save all of the relevant links I used.

Shell: Misc Commands

In this tutorial I will show a few useful commands when working with linux shell.

Check Directory Exists:

  1. if [ -d /opt/test/ ]; then
  2. echo 'Directory Exists'
  3. fi

Check Directory Not Exists:

  1. if [ ! -d /opt/test/ ]; then
  2. echo 'Directory Does Not Exist'
  3. fi

Check File Exists:

  1. if [ -f /opt/test/test.log ]; then
  2. echo 'File Exists'
  3. fi

Check File Not Exists:

  1. if [ ! -f /opt/test/test.log ]; then
  2. echo 'File Does Not Exist'
  3. fi

Lowercase Variable:

  1. val='TEXT'
  2. echo "${val,,}"

Echo Variable:
This will print the value of “test”. Notice we use double quotes.

  1. test='My Test Val'
  2. echo "$test"

Echo String:

  1. echo 'My test string'

Split:
This will split on the comma into an array list and then loop through it.

  1. test='test 1,test 2'
  2. split_test=(${test//,/ })
  3.  
  4. for val in "${split_test[@]}"
  5. do
  6. echo $val
  7. done

Date:
This will print the date in the format YYYY-MM-dd

  1. my_date="$(date +%Y-%m-%d)"
  2. echo "$my_date"

Remove Space From Variable:

  1. VAL='test string'
  2. echo "${VAL//\ /}"

Increment Variable:

  1. index=0
  2. index=$((index+1))

Substring

  1. VAL='test string'
  2. echo "${VAL:4:4}"

If value is equal to

  1. VAL='hello'
  2. if [ "$VAL" == 'hello' ] ; then
  3. echo 'Is Equal'
  4. fi

If with OR

  1. VAL='hello'
  2. if [ "$VAL" == 'hello' ] || [ "$VAL" != 'hi' ] ; then
  3. echo 'Is Hello'
  4. fi

If Variable is Empty

  1. VAL=''
  2. if [ -z "$VAL" ] ; then
  3. echo 'Is Empty'
  4. fi

Append to File

  1. echo 'Hi' >> file_to_log_to.log

Write to File

  1. echo 'Hi' > file_to_log_to.log

While Loop: Increment to 10

This will loop till the value is 9 then exit.

  i=0
  while [ $i -lt 10 ];
  do
      echo "$i"
      i=$((i+1))
  done

whoami

  1. USER=$(whoami)

If Variable Contains Text

  1. VAL='my Test String'
  2. if [[ "${VAL,,}" == *"test"* ]] ; then
  3. echo "Found test"
  4. fi

Color Coding

  NoColor=$'\033[0m'
  RED=$'\033[0;31m'
  GREEN=$'\033[0;32m'
  YELLOW=$'\033[1;33;40m'

  printf "%s Variable Not Set %s\n" "${RED}" "${NoColor}"

Get Log to a logfile and console

  1. SOME_COMMAND 2>&1 | tee -a "${LOG_FILE_PATH}"

Read a JSON config

  1. JSON=$(cat "/path/to/json/file.json")
  2. export MY_VAR=$(echo "${JSON}" | python -c 'import json,sys;obj=json.load(sys.stdin);print(obj["MyKey"])')

Extract tar to Folder

  1. sudo tar -xvf /the/location/file.tar -C /to/location/ --force-local --no-same-owner

Update Certificates

This will update the certificates after you have put a certificate in /usr/local/share/ca-certificates/.

  1. update-ca-certificates

PipeStatus

  1. somecommand
  2. RETURN_CODE=${PIPESTATUS[0]}

Hive: Struct

This tutorial will show you how to use struct. If you have not installed Hive yet please follow this tutorial.

Create Table with Struct:

  1. CREATE TABLE test_struct (
  2. columnA STRING,
  3. columnB VARCHAR(15),
  4. columnC INT,
  5. columnD TIMESTAMP,
  6. columnE DATE,
  7. columnF STRUCT<key:STRING, value:INT>
  8. )
  9. STORED AS ORC;

Insert Data:

  1. INSERT INTO test_struct
  2. SELECT '1', '2', 1, '2019-02-07 20:58:27', '2019-02-07', NAMED_STRUCT('key', 'val', 'value', 1);

Select Data:
This will give back the value of “key” and “value” in columnF.

  1. SELECT columnF.key, columnF.value
  2. FROM test_struct;

Hive: Map

This tutorial will show you how to use map. If you have not installed Hive yet please follow this tutorial.

Create Table with Map:

  1. CREATE TABLE test_map (
  2. columnA STRING,
  3. columnB VARCHAR(15),
  4. columnC INT,
  5. columnD TIMESTAMP,
  6. columnE DATE,
  7. columnF MAP<STRING, INT>
  8. )
  9. STORED AS ORC;

Insert Data:

  1. INSERT INTO test_map
  2. SELECT '1', '2', 1, '2019-02-07 20:58:27', '2019-02-07', MAP('Val', 1);

Select Data:
This will give back the value of “Val” in columnF.

  1. SELECT columnF['Val']
  2. FROM test_map;