This tutorial will guide you through installing PySpark standalone on Windows for development.
Install Python 3
You need to have Python 3 installed. You can download it from python.org.
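To confirm that Python 3 is installed and on your Path, open a command prompt and run the following (the exact version shown will vary):
python --version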
Install Spark
Go to the Apache Spark download page, choose the latest version with the package type “Pre-built for Apache Hadoop 2.7 and later”, and download spark-2.4.1-bin-hadoop2.7.tgz.
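If you prefer to download from the command line, something like the following should work, assuming curl is available on your system (it ships with recent Windows 10 builds) and that the release is still hosted at this path in the Apache archive:
rem the archive URL below is an assumption; check the Apache archive if the release has moved
curl -L -o spark-2.4.1-bin-hadoop2.7.tgz https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz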
Install 7-Zip
You will need 7-Zip to extract spark-2.4.1-bin-hadoop2.7.tgz.
Install Java
You need to ensure you have Java 8 installed.
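You can check the installed Java version from a command prompt; for Java 8 the output should report version 1.8.x:
java -version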
Extract Spark
Once you have installed 7-Zip, you can extract Spark into the C:\spark\ directory. The directory structure will look like this: C:\spark\spark-2.4.1-bin-hadoop2.7\
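If you would rather extract from the command line, 7-Zip can do it in two steps. This is a sketch that assumes 7z.exe is on your Path and that the .tgz file is in the current directory; the first step unpacks the gzip layer into a .tar, and the second unpacks the tar into C:\spark\:
rem assumes 7z.exe is on the Path and the download is in the current directory
7z x spark-2.4.1-bin-hadoop2.7.tgz
7z x spark-2.4.1-bin-hadoop2.7.tar -oC:\spark\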
Download WinUtils.exe
Download winutils.exe and place it in C:\spark\spark-2.4.1-bin-hadoop2.7\bin\
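For example, if winutils.exe ended up in your Downloads folder, you could copy it into place from a command prompt (the source path here is only an example; adjust it to wherever you saved the file):
rem adjust the source path to wherever you saved winutils.exe
copy %USERPROFILE%\Downloads\winutils.exe C:\spark\spark-2.4.1-bin-hadoop2.7\bin\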
Environment Variables
Use the following commands to set your Spark-specific environment variables. The “-m” option sets the variables for all users and requires an administrator command prompt; if you omit it, the variables are set for the current user only.
setx -m SPARK_HOME C:\spark\spark-2.4.1-bin-hadoop2.7
setx -m HADOOP_HOME C:\spark\spark-2.4.1-bin-hadoop2.7
You also want to append the following to the “Path” environment variable: “;C:\spark\spark-2.4.1-bin-hadoop2.7\bin”
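Note that setx does not update the current command prompt session. After setting the variables, open a new command prompt and verify them:
echo %SPARK_HOME%
echo %HADOOP_HOME%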
Run PySpark
Open a command prompt and type the following. The --master parameter sets the master URL; local[2] tells Spark to run locally using 2 cores.
pyspark --master local[2]
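If you would rather let Spark use all available cores instead of a fixed number, you can pass local[*] as the master URL:
pyspark --master local[*]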
Test
You can then test that it is working by running the following code.
words = sc.parallelize(["scala", "java", "hadoop", "spark", "hbase", "spark vs hadoop", "pyspark", "pyspark and spark"])
counts = words.count()
print("Number of elements in RDD -> %i" % counts)
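If you want to exercise a transformation as well as an action, here is a small follow-up sketch that reuses the words RDD and the sc provided by the PySpark shell:
# filter is a transformation, collect is an action; both run on the words RDD defined above
spark_words = words.filter(lambda w: "spark" in w).collect()
print("Elements containing 'spark' -> %s" % spark_words)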
References
I used this guide as a reference: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c