Python: Connect To Hadoop


We can connect to Hadoop from Python using the PyWebhdfs package, which talks to HDFS over the WebHDFS REST interface. For the purposes of this post we will use version 0.4.1. You can see all of the available APIs in the package documentation.

To build a connection to Hadoop you first need to import the client class.

from pywebhdfs.webhdfs import PyWebHdfsClient

Then you build the connection like this, replacing ##HOST## and ##USER## with your NameNode host and Hadoop user name.

HDFS_CONNECTION = PyWebHdfsClient(host=##HOST##, port='50070', user_name=##USER##)
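Hard-coding the host and user is fine for a quick test, but you may prefer to read them from the environment. The sketch below is one way to do that. The variable names HDFS_HOST, HDFS_PORT and HDFS_USER and the helper connection_settings are made up for this example and are not part of PyWebhdfs. The default of 50070 is the NameNode web port on Hadoop 2.x (Hadoop 3.x moved it to 9870).

```python
import os


def connection_settings(env=None):
    """Build keyword arguments for PyWebHdfsClient from environment variables.

    HDFS_HOST, HDFS_PORT and HDFS_USER are hypothetical names chosen for
    this example; use whatever naming your deployment already has.
    """
    env = os.environ if env is None else env
    host = env.get("HDFS_HOST")
    user = env.get("HDFS_USER")
    if not host or not user:
        raise ValueError("HDFS_HOST and HDFS_USER must both be set")
    return {
        "host": host,
        # Kept as a string, with the Hadoop 2.x NameNode port as the default.
        "port": env.get("HDFS_PORT", "50070"),
        "user_name": user,
    }
```

You would then build the client with HDFS_CONNECTION = PyWebHdfsClient(**connection_settings()).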

To list the contents of a directory, call list_dir with the HDFS path.

HDFS_CONNECTION.list_dir(##HADOOP_DIR##)
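list_dir returns the parsed WebHDFS LISTSTATUS response rather than a plain list of names. As a rough sketch, assuming the usual response shape of {"FileStatuses": {"FileStatus": [...]}} with a pathSuffix and type on each entry, a small helper can pull out just the names. The helper name list_names is made up for this example.

```python
def list_names(client, hdfs_dir, only_files=False):
    """Return the entry names found in hdfs_dir.

    `client` is anything with a list_dir(path) method, such as a
    PyWebHdfsClient instance. The response layout is assumed to follow
    the WebHDFS LISTSTATUS format.
    """
    response = client.list_dir(hdfs_dir)
    # Missing keys fall back to an empty listing instead of raising.
    entries = response.get("FileStatuses", {}).get("FileStatus", [])
    return [
        entry["pathSuffix"]
        for entry in entries
        if not only_files or entry.get("type") == "FILE"
    ]
```

Calling list_names(HDFS_CONNECTION, ##HADOOP_DIR##, only_files=True) would skip subdirectories and return only file names.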

Pulling a single file down from Hadoop is straightforward. Notice that we import “FileNotFound”. The import is not strictly required, but “read_file” raises that exception when the file does not exist, so it is good practice to catch it explicitly rather than rely on a generic handler.

from pywebhdfs.errors import FileNotFound

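try:
    # read_file returns the contents of the file
    file_data = HDFS_CONNECTION.read_file(##FILENAME##)
except FileNotFound as e:
    # Raised when the path does not exist in HDFS
    print(e)
except Exception as e:
    # Any other failure, for example a connection problem
    print(e)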
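Once read_file succeeds you will usually want to do something with the data. The following is a minimal sketch of saving it to a local file. save_hdfs_file is a hypothetical helper, and it assumes read_file hands back the whole file in memory as bytes (a str result is encoded as UTF-8). Any FileNotFound raised by read_file is left to the caller, so the call can sit inside a try/except like the one above.

```python
def save_hdfs_file(client, hdfs_path, local_path):
    """Read hdfs_path through client.read_file and write it to local_path.

    Returns the number of bytes written. The whole file is held in
    memory, so this is only suitable for files that fit comfortably.
    """
    data = client.read_file(hdfs_path)
    if isinstance(data, str):
        # Assumption: text results are UTF-8; adjust if your files differ.
        data = data.encode("utf-8")
    with open(local_path, "wb") as handle:
        handle.write(data)
    return len(data)
```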