This post shows how to create a DataFrame in PySpark.
First we need a Spark session. See PySpark: Create a Spark Session for the details on that.
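As a quick reference, a minimal session looks something like this (the app name is just an example):

```python
from pyspark.sql import SparkSession

# Build a new session, or reuse the existing one if already running
spark = SparkSession.builder \
    .appName("create-dataframe-example") \
    .getOrCreate()
```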
Next we need the imports:

```python
from pyspark.sql import Row
from pyspark.sql.types import StringType, DecimalType, TimestampType, FloatType, IntegerType, LongType, StructField, StructType
```
Then create the schema and some row data:

```python
schema = StructType([
    StructField('id', IntegerType()),
    # ...one StructField per column...
])

data = [Row(id=1)]
```
Create the DataFrame:

```python
df = spark.createDataFrame(data, schema=schema)
```
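You can verify the result with `df.show()` and `df.printSchema()`. As a fuller sketch using more of the imported types (the column names and values here are made up for illustration):

```python
from datetime import datetime
from decimal import Decimal

# Illustrative schema exercising several of the imported types
schema = StructType([
    StructField('id', IntegerType()),
    StructField('name', StringType()),
    StructField('score', FloatType()),
    StructField('balance', DecimalType(10, 2)),
    StructField('created_at', TimestampType()),
])

# Row fields are listed in the same order as the schema
data = [Row(id=1, name='alice', score=9.5, balance=Decimal('100.00'),
            created_at=datetime(2020, 1, 1, 12, 0))]

df = spark.createDataFrame(data, schema=schema)
df.show()
df.printSchema()
```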
If you want to use a JSON definition (for example, one kept in a file) to build your schema, do the following. Note that `json.loads` expects a string, so the schema is written here as raw JSON rather than a Python dict:

```python
import json
from pyspark.sql.types import StructType

schema_json = """
{
  "fields": [
    {
      "metadata": {},
      "name": "column_a",
      "nullable": false,
      "type": "string"
    }
  ],
  "type": "struct"
}
"""

# json.loads already returns a dict, which fromJson accepts directly
table_schema = StructType.fromJson(json.loads(schema_json))
df = spark.createDataFrame([Row(column_a='some value')], schema=table_schema)
```
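If the definition actually lives on disk, read it in first; `schema.json` below is a hypothetical path:

```python
import json
from pyspark.sql.types import StructType

# 'schema.json' is a hypothetical file holding the JSON definition above
with open('schema.json') as f:
    table_schema = StructType.fromJson(json.load(f))
```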