Combine columns into list of key, value pairs (no UDF)
I want to create a new column that is a JSON representation of some of the other columns: a list of key, value pairs.
Source:

origin | destination | count |
---|---|---|
toronto | ottawa | 5 |
montreal | vancouver | 10 |
What I want:

origin | destination | count | json |
---|---|---|---|
toronto | ottawa | 5 | [{"origin":"toronto"},{"destination":"ottawa"},{"count":"5"}] |
montreal | vancouver | 10 | [{"origin":"montreal"},{"destination":"vancouver"},{"count":"10"}] |
(Everything can be a string; that doesn't matter.)
I've tried something like:
from pyspark.sql.functions import to_json, struct, col
df.withColumn('json', to_json(struct(col('origin'), col('destination'), col('count'))))
but that creates a column with all of the key:value pairs in a single object:
{"origin":"United States","destination":"Romania"}
Is this possible without a UDF?
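For reference, a minimal sketch that builds the sample DataFrame used below (assumes an active SparkSession; the setup is illustrative, not from the original post, and everything is kept as strings per the note above):

from pyspark.sql import SparkSession

# Illustrative setup: two rows, all values as strings.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('toronto', 'ottawa', '5'), ('montreal', 'vancouver', '10')],
    ['origin', 'destination', 'count']
)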
Here's one way to solve it:
import pyspark.sql.functions as F

# Wrap each column in its own single-field struct, serialize each struct to
# a JSON object, collect the objects into an array, and cast it to a string.
df2 = df.withColumn(
    'json',
    F.array(
        F.to_json(F.struct('origin')),
        F.to_json(F.struct('destination')),
        F.to_json(F.struct('count'))
    ).cast('string')
)
df2.show(truncate=False)
+--------+-----------+-----+--------------------------------------------------------------------+
|origin |destination|count|json |
+--------+-----------+-----+--------------------------------------------------------------------+
|toronto |ottawa |5 |[{"origin":"toronto"}, {"destination":"ottawa"}, {"count":"5"}] |
|montreal|vancouver |10 |[{"origin":"montreal"}, {"destination":"vancouver"}, {"count":"10"}]|
+--------+-----------+-----+--------------------------------------------------------------------+
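Note that .cast('string') relies on Spark's default array-to-string rendering, which inserts a space after each comma (visible in the output above). If the exact compact form matters, one sketch is to join the JSON objects with concat_ws and add the brackets yourself (df3 is an illustrative name, not from the answer):

# Join the per-column JSON objects with ',' and wrap in literal brackets,
# avoiding the ', ' separator the array-to-string cast produces.
df3 = df.withColumn(
    'json',
    F.concat(
        F.lit('['),
        F.concat_ws(
            ',',
            F.to_json(F.struct('origin')),
            F.to_json(F.struct('destination')),
            F.to_json(F.struct('count'))
        ),
        F.lit(']')
    )
)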
Another way is to create an array of map columns before calling to_json:
from pyspark.sql import functions as F

# Build a one-entry map (column name -> value) for every column, gather the
# maps into an array, and let to_json serialize the whole array at once.
df1 = df.withColumn(
    'json',
    F.to_json(F.array(*[F.create_map(F.lit(c), F.col(c)) for c in df.columns]))
)
df1.show(truncate=False)
#+--------+-----------+-----+------------------------------------------------------------------+
#|origin |destination|count|json |
#+--------+-----------+-----+------------------------------------------------------------------+
#|toronto |ottawa |5 |[{"origin":"toronto"},{"destination":"ottawa"},{"count":"5"}] |
#|montreal|vancouver |10 |[{"origin":"montreal"},{"destination":"vancouver"},{"count":"10"}]|
#+--------+-----------+-----+------------------------------------------------------------------+
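To sanity-check either result, the string can be parsed back with from_json and an array<map<string,string>> schema; a null result would indicate malformed JSON (a quick sketch, with an illustrative column alias):

# Round-trip the JSON column: parse it back into an array of one-entry maps.
df1.select(
    F.from_json('json', 'array<map<string,string>>').alias('parsed')
).show(truncate=False)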