Combine columns into list of key, value pairs (no UDF)
I want to create a new column that is a JSON representation of some of the other columns: a list of key, value pairs.
Source:

origin | destination | count |
---|---|---|
toronto | ottawa | 5 |
montreal | vancouver | 10 |
What I want:

origin | destination | count | json |
---|---|---|---|
toronto | ottawa | 5 | [{"origin":"toronto"},{"destination":"ottawa"},{"count":"5"}] |
montreal | vancouver | 10 | [{"origin":"montreal"},{"destination":"vancouver"},{"count":"10"}] |
(Everything can be a string; that doesn't matter.)
I've tried something like:
from pyspark.sql.functions import to_json, struct, col
df.withColumn('json', to_json(struct(col('origin'), col('destination'), col('count'))))
but that creates a column with all of the key:value pairs in a single object:
{"origin":"United States","destination":"Romania"}
Is this possible without a UDF?
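For reference, a minimal sketch that builds the sample DataFrame used below (assumes an active SparkSession; the setup is illustrative, not from the original post, and everything is kept as strings per the note above):

from pyspark.sql import SparkSession

# Illustrative setup: two rows, all values as strings.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('toronto', 'ottawa', '5'), ('montreal', 'vancouver', '10')],
    ['origin', 'destination', 'count']
)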
Here's one way to solve it:
import pyspark.sql.functions as F

# Wrap each column in its own single-field struct, serialize each struct to
# a JSON object, collect the objects into an array, and cast it to a string.
df2 = df.withColumn(
    'json',
    F.array(
        F.to_json(F.struct('origin')),
        F.to_json(F.struct('destination')),
        F.to_json(F.struct('count'))
    ).cast('string')
)
df2.show(truncate=False)
+--------+-----------+-----+--------------------------------------------------------------------+
|origin |destination|count|json |
+--------+-----------+-----+--------------------------------------------------------------------+
|toronto |ottawa |5 |[{"origin":"toronto"}, {"destination":"ottawa"}, {"count":"5"}] |
|montreal|vancouver |10 |[{"origin":"montreal"}, {"destination":"vancouver"}, {"count":"10"}]|
+--------+-----------+-----+--------------------------------------------------------------------+
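Note that .cast('string') relies on Spark's default array-to-string rendering, which inserts a space after each comma (visible in the output above). If the exact compact form matters, one sketch is to join the JSON objects with concat_ws and add the brackets yourself (df3 is an illustrative name, not from the answer):

# Join the per-column JSON objects with ',' and wrap in literal brackets,
# avoiding the ', ' separator the array-to-string cast produces.
df3 = df.withColumn(
    'json',
    F.concat(
        F.lit('['),
        F.concat_ws(
            ',',
            F.to_json(F.struct('origin')),
            F.to_json(F.struct('destination')),
            F.to_json(F.struct('count'))
        ),
        F.lit(']')
    )
)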
Another way is to create an array of map columns before calling to_json:
from pyspark.sql import functions as F

# Build a one-entry map (column name -> value) for every column, gather the
# maps into an array, and let to_json serialize the whole array at once.
df1 = df.withColumn(
    'json',
    F.to_json(F.array(*[F.create_map(F.lit(c), F.col(c)) for c in df.columns]))
)
df1.show(truncate=False)
#+--------+-----------+-----+------------------------------------------------------------------+
#|origin |destination|count|json |
#+--------+-----------+-----+------------------------------------------------------------------+
#|toronto |ottawa |5 |[{"origin":"toronto"},{"destination":"ottawa"},{"count":"5"}] |
#|montreal|vancouver |10 |[{"origin":"montreal"},{"destination":"vancouver"},{"count":"10"}]|
#+--------+-----------+-----+------------------------------------------------------------------+
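To sanity-check either result, the string can be parsed back with from_json and an array<map<string,string>> schema; a null result would indicate malformed JSON (a quick sketch, with an illustrative column alias):

# Round-trip the JSON column: parse it back into an array of one-entry maps.
df1.select(
    F.from_json('json', 'array<map<string,string>>').alias('parsed')
).show(truncate=False)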