Is there a way to flatten an array of arrays of structs without using the explode function?

Question

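I am trying to flatten a complex schema in PySpark. The data is too large to use the explode function (I have read that explode is a very expensive operation). This is what my schema looks like: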

 |-- A: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- B: string (nullable = true)
 |    |    |    |-- C: string (nullable = true)

I want to flatten it to

|-- A: array (nullable = true)
|    |-- B: string (nullable = true)
|    |-- C: string (nullable = true)

I tried df.select("A.*") but got an exception:

: org.apache.spark.sql.AnalysisException: Can only star expand struct data types. Attribute: `ArrayBuffer(A)`;

Thanks in advance!

Answer 1

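Check the code below. The built-in flatten function removes one level of array nesting and does not change the number of rows, so explode is not needed.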

scala> df.printSchema
root
 |-- A: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- A: string (nullable = true)
 |    |    |    |-- B: string (nullable = true)

scala> df.withColumn("A",expr("flatten(transform(A,x -> array(x.A,x.B)))")).printSchema
root
 |-- A: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

scala> df.withColumn("A",flatten($"A")).printSchema
root
 |-- A: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- A: string (nullable = true)
 |    |    |-- B: string (nullable = true)

apache-spark

apache-spark-sql

pyspark

pyspark-dataframes