This post shows how to create a DataFrame in PySpark.
First we need a Spark session. See PySpark: Create a Spark Session for details on that.
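For quick reference, a minimal session looks like the sketch below (the app name is just a placeholder):

from pyspark.sql import SparkSession

# Build or reuse a Spark session; 'dataframe-example' is an arbitrary app name
spark = SparkSession.builder.appName('dataframe-example').getOrCreate()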
Next we need the following imports:
from pyspark.sql import Row
from pyspark.sql.types import StringType, DecimalType, TimestampType, FloatType, IntegerType, LongType, StructField, StructType
Then you create the schema and some matching sample data. The 'name' field here is just an illustrative second column; add whatever fields your data needs:
schema = StructType([
    StructField('id', IntegerType()),
    StructField('name', StringType()),
])
data = [Row(id=1, name='Alice')]
Create the DataFrame
df = spark.createDataFrame(data, schema=schema)
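To confirm the schema and data line up, inspect the result with the standard DataFrame methods:

# Print the inferred structure and the rows themselves
df.printSchema()
df.show()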
If you want to build your schema from JSON, do the following. Note that json.loads expects a string, so the schema is written as a JSON string here:
import json
from pyspark.sql.types import StructType

# JSON representation of the schema
schema_json = """
{
    "type": "struct",
    "fields": [
        {
            "metadata": {},
            "name": "column_a",
            "nullable": false,
            "type": "string"
        }
    ]
}
"""

# json.loads already returns a dict, which fromJson accepts directly
table_schema = StructType.fromJson(json.loads(schema_json))

# createDataFrame needs row data, not the schema dict; one sample row here
df = spark.createDataFrame([('value one',)], schema=table_schema)
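If the schema lives in an actual JSON file, the same approach works with json.load; the file name 'schema.json' below is just an assumption for illustration:

import json
from pyspark.sql.types import StructType

# 'schema.json' is a hypothetical path; point it at your own schema file
with open('schema.json') as f:
    table_schema = StructType.fromJson(json.load(f))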