How would you generate a new array column over a window?
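I'm trying to generate a new column that is an array over a window, but the array function doesn't seem to work over a window, and I'm struggling to find an alternative.
Code snippet: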
df = df.withColumn('array_output', F.array(df.things_to_agg_in_array).over(Window.partitionBy("aggregate_over_this")))
Ideally, what I'd like is an output that looks like the table below:
+---------------------+------------------------+--------------+
| Aggregate Over This | Things to Agg in Array | Array Output |
+---------------------+------------------------+--------------+
| 1 | C | [C,F,K,L] |
+---------------------+------------------------+--------------+
| 1 | F | [C,F,K,L] |
+---------------------+------------------------+--------------+
| 1 | K | [C,F,K,L] |
+---------------------+------------------------+--------------+
| 1 | L | [C,F,K,L] |
+---------------------+------------------------+--------------+
| 2 | A | [A,B,C] |
+---------------------+------------------------+--------------+
| 2 | B | [A,B,C] |
+---------------------+------------------------+--------------+
| 2 | C | [A,B,C] |
+---------------------+------------------------+--------------+
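For further context, this is part of an explode, which will then be rejoined onto another table based on 'aggregate over this', so that only one instance of array_output is returned.
Thanks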
This solution uses collect_list(); not sure if it fulfills your requirement.
myValues = [(1,'C'),(1,'F'),(1,'K'),(1,'L'),(2,'A'),(2,'B'),(2,'C')]
df = sqlContext.createDataFrame(myValues,['Aggregate_Over_This','Things_to_Agg_in_Array'])
df.show()
+-------------------+----------------------+
|Aggregate_Over_This|Things_to_Agg_in_Array|
+-------------------+----------------------+
| 1| C|
| 1| F|
| 1| K|
| 1| L|
| 2| A|
| 2| B|
| 2| C|
+-------------------+----------------------+
df.registerTempTable('table_view')
df1=sqlContext.sql(
'select Aggregate_Over_This, Things_to_Agg_in_Array, collect_list(Things_to_Agg_in_Array) over (partition by Aggregate_Over_This) as array_output from table_view'
)
df1.show()
+-------------------+----------------------+------------+
|Aggregate_Over_This|Things_to_Agg_in_Array|array_output|
+-------------------+----------------------+------------+
| 1| C|[C, F, K, L]|
| 1| F|[C, F, K, L]|
| 1| K|[C, F, K, L]|
| 1| L|[C, F, K, L]|
| 2| A| [A, B, C]|
| 2| B| [A, B, C]|
| 2| C| [A, B, C]|
+-------------------+----------------------+------------+