We can connect to Hadoop from Python using the PyWebHdfs package, which talks to HDFS over the WebHDFS REST API. For the purposes of this post we will use version 0.4.1. You can see all of the APIs here.
To build a connection to Hadoop you first need to import the client class.
from pywebhdfs.webhdfs import PyWebHdfsClient
Then you build the connection like this.
HDFS_CONNECTION = PyWebHdfsClient(host=##HOST##, port='50070', user_name=##USER##)
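If you would rather not hard-code the host and user, one option is to read them from environment variables. The sketch below is only an illustration: the variable names HDFS_HOST, HDFS_PORT and HDFS_USER and both helper functions are made up for this post and are not part of PyWebHdfs.

```python
import os


def connection_settings(env=os.environ):
    """Collect keyword arguments for PyWebHdfsClient from a mapping of settings."""
    return {
        "host": env.get("HDFS_HOST", "localhost"),  # hypothetical variable name
        "port": env.get("HDFS_PORT", "50070"),      # same port as the example above
        "user_name": env.get("HDFS_USER"),          # None when the variable is unset
    }


def make_client():
    """Build the client from the settings above."""
    # Imported here so connection_settings() works without pywebhdfs installed.
    from pywebhdfs.webhdfs import PyWebHdfsClient
    return PyWebHdfsClient(**connection_settings())
```

With the variables exported in your shell, HDFS_CONNECTION = make_client() takes the place of the hard-coded call.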
To list the contents of a directory you do this.
HDFS_CONNECTION.list_dir(##HADOOP_DIR##)
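The listing comes back as a nested dictionary rather than a flat list of names. Assuming it follows the JSON layout that WebHDFS documents for its LISTSTATUS operation, a small helper can pull out the name and type of each entry. The helper name and the sample data are invented for illustration and are not real cluster output.

```python
def entries_from_listing(listing):
    """Return (name, type) pairs from a LISTSTATUS-style dictionary."""
    statuses = listing.get("FileStatuses", {}).get("FileStatus", [])
    return [(status["pathSuffix"], status["type"]) for status in statuses]


# Hand-written sample in the LISTSTATUS shape, trimmed to two fields per entry.
sample_listing = {
    "FileStatuses": {
        "FileStatus": [
            {"pathSuffix": "logs", "type": "DIRECTORY"},
            {"pathSuffix": "data.csv", "type": "FILE"},
        ]
    }
}

for name, kind in entries_from_listing(sample_listing):
    print(kind, name)
```

In practice you would pass the result of HDFS_CONNECTION.list_dir(...) in place of sample_listing.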
Pulling a single file down from Hadoop is straightforward. Notice that we import FileNotFound. The import is not strictly required, but read_file raises that exception when the file does not exist, so catching it lets you handle a missing file separately from other failures. It is a good habit to always include it.
from pywebhdfs.errors import FileNotFound

try:
    file_data = HDFS_CONNECTION.read_file(##FILENAME##)
except FileNotFound as e:
    print(e)
except Exception as e:
    print(e)
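If you read files in more than one place, the same pattern can be wrapped in a helper that returns None for a missing file. This is a sketch under two assumptions: that read_file returns the raw file body as bytes, and that UTF-8 is the right encoding for your data. The function name read_text_or_none is made up here, and the fallback class exists only so the snippet can be imported on a machine without pywebhdfs.

```python
try:
    from pywebhdfs.errors import FileNotFound
except ImportError:
    class FileNotFound(Exception):
        """Stand-in used only when pywebhdfs is not installed."""


def read_text_or_none(client, path, encoding="utf-8"):
    """Return the file contents as text, or None if the file does not exist."""
    try:
        data = client.read_file(path)
    except FileNotFound:
        return None
    # Decode only when we actually got bytes back.
    return data.decode(encoding) if isinstance(data, bytes) else data
```

Any other exception is left to propagate, so connection or permission problems are not silently mistaken for a missing file.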