如何对pyspark中每个组内的变量进行排序?
How to sort on a variable within each group in pyspark?
我正在尝试使用另一列 ts
为每个 id
.
对值 val
进行排序
# imports
from pyspark.sql import functions as F
from pyspark.sql import SparkSession as ss
import pandas as pd
# create dummy data
pdf = pd.DataFrame( [['2',2,'cat'],['1',1,'dog'],['1',2,'cat'],['2',3,'cat'],['2',4,'dog']] ,columns=['id','ts','val'])
sdf = ss.createDataFrame( pdf )
sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 2| 2|cat|
| 1| 1|dog|
| 1| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+
您可以按 id
聚合并按 ts
排序:
sorted_sdf = ( sdf.groupBy('id')
.agg( F.sort_array( F.collect_list( F.struct( F.col('ts'), F.col('val') ) ), asc = True)
.alias('sorted_col') )
)
sorted_sdf.show()
+---+--------------------+
| id| sorted_col|
+---+--------------------+
| 1| [[1,dog], [2,cat]]|
| 2|[[2,cat], [3,cat]...|
+---+--------------------+
然后,我们可以展开这个列表:
explode_sdf = sorted_sdf.select( 'id' , F.explode( F.col('sorted_col') ).alias('sorted_explode') )
explode_sdf.show()
+---+--------------+
| id|sorted_explode|
+---+--------------+
| 1| [1,dog]|
| 1| [2,cat]|
| 2| [2,cat]|
| 2| [3,cat]|
| 2| [4,dog]|
+---+--------------+
将sorted_explode
的元组一分为二:
detupled_sdf = explode_sdf.select( 'id', 'sorted_explode.*' )
detupled_sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 1| 1|dog|
| 1| 2|cat|
| 2| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+
现在我们的原始数据帧按 ts
排序,每个 id
!
我正在尝试使用另一列 ts
为每个 id
.
val
进行排序
# imports
from pyspark.sql import functions as F
from pyspark.sql import SparkSession as ss
import pandas as pd
# create dummy data
pdf = pd.DataFrame( [['2',2,'cat'],['1',1,'dog'],['1',2,'cat'],['2',3,'cat'],['2',4,'dog']] ,columns=['id','ts','val'])
sdf = ss.createDataFrame( pdf )
sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 2| 2|cat|
| 1| 1|dog|
| 1| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+
您可以按 id
聚合并按 ts
排序:
sorted_sdf = ( sdf.groupBy('id')
.agg( F.sort_array( F.collect_list( F.struct( F.col('ts'), F.col('val') ) ), asc = True)
.alias('sorted_col') )
)
sorted_sdf.show()
+---+--------------------+
| id| sorted_col|
+---+--------------------+
| 1| [[1,dog], [2,cat]]|
| 2|[[2,cat], [3,cat]...|
+---+--------------------+
然后,我们可以展开这个列表:
explode_sdf = sorted_sdf.select( 'id' , F.explode( F.col('sorted_col') ).alias('sorted_explode') )
explode_sdf.show()
+---+--------------+
| id|sorted_explode|
+---+--------------+
| 1| [1,dog]|
| 1| [2,cat]|
| 2| [2,cat]|
| 2| [3,cat]|
| 2| [4,dog]|
+---+--------------+
将sorted_explode
的元组一分为二:
detupled_sdf = explode_sdf.select( 'id', 'sorted_explode.*' )
detupled_sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 1| 1|dog|
| 1| 2|cat|
| 2| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+
现在我们的原始数据帧按 ts
排序,每个 id
!