Spark:从平面数据框创建嵌套数据框
Spark: Create nested dataframe from a flat one
来自以下数据框:
import spark.implicits._
val data = Seq(
(1, "value11", "value12"),
(2, "value21", "value22"),
(3, "value31", "value32")
)
val df = data.toDF("id", "v1", "v2")
是否可以将 df 转换为嵌套数据框,其架构为:
val schema = StructType(Array(
StructField("id", IntegerType),
StructField("nested", StructType(Array(
StructField("value1", StringType),
StructField("value2", StringType)
)))
))
我知道有一个RDD解决方案:
spark.createDataFrame(df.rdd.map(row => Row(row.get(0), Row(row.get(1), row.get(2))), schema)
但是我想将它动态应用到很多列,这会导致很多样板代码。
有没有更简单的方法?
谢谢
一种方法是使用 struct
如果需要,您也可以将列重命名为
val newColumns = List("value1", "value2")
columns.zip(newColumns).foldLeft(df){(acc, name) =>
acc.withColumnRenamed(name._1, name._2)
}
//list the columns names you want to nested
val columns = df.columns.tail
//use struct to create new fields and drop all columns
val finalDF = df.withColumn("nested", struct(columns.map(col(_)):_*))..drop(columns:_*)
最终架构:
finalDF.printSchema()
root
|-- id: integer (nullable = false)
|-- nested: struct (nullable = false)
| |-- v1: string (nullable = true)
| |-- v2: string (nullable = true)
来自以下数据框:
import spark.implicits._
val data = Seq(
(1, "value11", "value12"),
(2, "value21", "value22"),
(3, "value31", "value32")
)
val df = data.toDF("id", "v1", "v2")
是否可以将 df 转换为嵌套数据框,其架构为:
val schema = StructType(Array(
StructField("id", IntegerType),
StructField("nested", StructType(Array(
StructField("value1", StringType),
StructField("value2", StringType)
)))
))
我知道有一个RDD解决方案:
spark.createDataFrame(df.rdd.map(row => Row(row.get(0), Row(row.get(1), row.get(2))), schema)
但是我想将它动态应用到很多列,这会导致很多样板代码。
有没有更简单的方法? 谢谢
一种方法是使用 struct
如果需要,您也可以将列重命名为
val newColumns = List("value1", "value2")
columns.zip(newColumns).foldLeft(df){(acc, name) =>
acc.withColumnRenamed(name._1, name._2)
}
//list the columns names you want to nested
val columns = df.columns.tail
//use struct to create new fields and drop all columns
val finalDF = df.withColumn("nested", struct(columns.map(col(_)):_*))..drop(columns:_*)
最终架构:
finalDF.printSchema()
root
|-- id: integer (nullable = false)
|-- nested: struct (nullable = false)
| |-- v1: string (nullable = true)
| |-- v2: string (nullable = true)