Convert Array with nested struct to string column along with other columns from the PySpark DataFrame
This is similar to an existing question, but the accepted answer there does not work for my case, so I am asking here. The schema:
 |-- Col1: string (nullable = true)
 |-- Col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Col2Sub: string (nullable = true)
Sample JSON:
{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}
This gives the result in a single column:
import pyspark.sql.functions as F
df.selectExpr("EXPLODE(Col2) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+-------------+
|Col2_concated|
+-------------+
|      foo,bar|
+-------------+
But how do I get a result (DataFrame) like this:
+------+-------------+
|  Col1|Col2_concated|
+------+-------------+
|abc123|      foo,bar|
+------+-------------+
EDIT:
This solution gives the wrong result, because EXPLODE produces one output row per array element, so Col1 is duplicated (one way to group the rows back together is sketched after the table):
df.selectExpr("Col1","EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+------+-------------+
|  Col1|Col2_concated|
+------+-------------+
|abc123|          foo|
|abc123|          bar|
+------+-------------+
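A hedged sketch of that workaround, assuming Col1 uniquely identifies a record (note that collect_list does not guarantee element order):
import pyspark.sql.functions as F
# Keep the EXPLODE, then collapse the duplicated rows by grouping on Col1.
exploded = df.selectExpr("Col1", "EXPLODE(Col2) AS structCol")
regrouped = exploded.groupBy("Col1").agg(
    F.concat_ws(",", F.collect_list("structCol.Col2Sub")).alias("Col2_concated")
)
regrouped.show()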
You are almost there; just avoid the explode. All you need is the concat_ws function, which joins multiple strings (or the elements of an array of strings) with the given separator. See the example below:
from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))
#printSchema tells us the column names we can use with concat_ws
df.printSchema()
Output:
root
 |-- Col1: string (nullable = true)
 |-- Col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Col2Sub: string (nullable = true)
The column Col2 is an array of structs with a single field Col2Sub; selecting the nested field df.Col2.Col2Sub yields an array of strings, which concat_ws can join to give the desired result:
bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))
bla.show()
+------+-------+
|  Col1|   Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+
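If the struct inside Col2 had more than one field (an assumption; the sample only has Col2Sub), a hedged sketch for Spark 2.4+ could first build one string per array element with the transform higher-order function and then join the resulting array:
from pyspark.sql import functions as F
# Hedged sketch with a made-up second field "Col2Extra" (not in the original sample).
# transform() maps each struct to a string, concat_ws then joins the array (Spark 2.4+).
j2 = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo","Col2Extra":"1"},{"Col2Sub":"bar","Col2Extra":"2"}]}'
df2 = spark.read.json(sc.parallelize([j2]))
df2 = df2.withColumn(
    "Col2",
    F.expr("concat_ws(',', transform(Col2, x -> concat_ws(':', x.Col2Sub, x.Col2Extra)))")
)
df2.show()  # Col2 becomes something like foo:1,bar:2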