Convert a spark dataframe to a hive partitioned create table in pyspark using last two columns as partitions
I have a dataframe in Pyspark (2.3) from which I need to generate a partitioned CREATE TABLE statement to run through spark.sql() so that it is Hive-compatible.
Sample Dataframe:
final.printSchema()
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- value: long (nullable = true)
|-- date: string (nullable = true)
|-- subid: string (nullable = true)
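For reference, a dataframe with this schema could be constructed as follows (a minimal sketch; the row values are made up):

final = spark.createDataFrame(
    [("alice", "30", 100, "2019-01-01", "s1")],  # hypothetical sample row
    "name string, age string, value long, date string, subid string"
)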
The script should read the dataframe and create the table below, treating the last two columns as partition columns:
create table schema.final (name string, age string, value long)
partitioned by (date string, subid string) stored as parquet;
Any help with a pyspark solution for the above would be great.
Here is one way to do it, by iterating over the schema and generating the Hive SQL:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField('name', StringType()),
    StructField('age', StringType()),
    StructField('value', LongType()),
    StructField('date', StringType()),
    StructField('subid', StringType())
])

hiveCols = ""
hivePartitionCols = ""

for idx, c in enumerate(schema):
    # populate hive schema (all but the last two columns)
    if idx < len(schema[:-2]):
        hiveCols += "{0} {1}".format(c.name, c.dataType.simpleString())
        if idx < len(schema[:-2]) - 1:
            hiveCols += ","

    # populate hive partition columns (the last two)
    if idx >= len(schema) - 2:
        hivePartitionCols += "{0} {1}".format(c.name, c.dataType.simpleString())
        if idx < len(schema) - 1:
            hivePartitionCols += ","

hiveCreateSql = "create table schema.final({0}) partitioned by ({1}) stored as parquet".format(hiveCols, hivePartitionCols)
# create table schema.final(name string,age string,value bigint) partitioned by (date string,subid string) stored as parquet
spark.sql(hiveCreateSql)
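Note that simpleString() renders Spark's LongType as bigint, which is the Hive-compatible spelling of the long column requested in the question, so the generated DDL differs from the asked-for statement only in that keyword. Also, since the question wants the script to read the dataframe itself, the hand-built StructType above can be replaced by the dataframe's own schema. A minimal sketch of that variant, assuming the dataframe is bound to the name final as in the question:

# Sketch: build the same DDL straight from the dataframe's schema.
fields = final.schema.fields
dataCols = fields[:-2]   # all but the last two columns
partCols = fields[-2:]   # the last two columns become partitions

hiveCols = ",".join("{0} {1}".format(f.name, f.dataType.simpleString()) for f in dataCols)
hivePartitionCols = ",".join("{0} {1}".format(f.name, f.dataType.simpleString()) for f in partCols)

hiveCreateSql = "create table schema.final({0}) partitioned by ({1}) stored as parquet".format(hiveCols, hivePartitionCols)
spark.sql(hiveCreateSql)

Building each column list with ",".join() avoids the index bookkeeping needed to place commas correctly.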