Pyspark: GroupBy multiple columns and calculate group number
I have a dataframe like this:
id Name Rank Course
1 S1 21 Physics
2 S2 22 Chemistry
3 S3 24 Math
4 S2 22 English
5 S2 22 Social
6 S1 21 Geography
I want to group this dataset over Name and Rank and calculate the group number. In pandas I can easily do:
df['ngrp'] = df.groupby(['Name', 'Rank']).ngroup()
After the above computation, I get the following output:
id Name Rank Course ngrp
1 S1 21 Physics 0
6 S1 21 Geography 0
2 S2 22 Chemistry 1
4 S2 22 English 1
5 S2 22 Social 1
3 S3 24 Math 2
Is there a way to achieve the same output in Pyspark? I tried the following, but it doesn't seem to work:
from pyspark.sql import Window
w = Window.partitionBy(['Name', 'Rank'])
df.select(['Name', 'Rank'], ['Course'], f.count(['Name', 'Rank']).over(w).alias('ngroup')).show()
You can opt for DENSE_RANK -
Data Preparation
import pandas as pd
from io import StringIO
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

sql = SparkSession.builder.getOrCreate()

df = pd.read_csv(StringIO("""
id,Name,Rank,Course
1,S1,21,Physics
2,S2,22,Chemistry
3,S3,24,Math
4,S2,22,English
5,S2,22,Social
6,S1,21,Geography
"""),delimiter=',')
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+---+----+----+---------+
| id|Name|Rank| Course|
+---+----+----+---------+
| 1| S1| 21| Physics|
| 2| S2| 22|Chemistry|
| 3| S3| 24| Math|
| 4| S2| 22| English|
| 5| S2| 22| Social|
| 6| S1| 21|Geography|
+---+----+----+---------+
Dense Rank
window = Window.orderBy(['Name','Rank'])
sparkDF = sparkDF.withColumn('ngroup',F.dense_rank().over(window) - 1)
sparkDF.orderBy(['Name','ngroup']).show()
+---+----+----+---------+------+
| id|Name|Rank| Course|ngroup|
+---+----+----+---------+------+
| 6| S1| 21|Geography| 0|
| 1| S1| 21| Physics| 0|
| 4| S2| 22| English| 1|
| 2| S2| 22|Chemistry| 1|
| 5| S2| 22| Social| 1|
| 3| S3| 24| Math| 2|
+---+----+----+---------+------+
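Note that a window with orderBy but no partitionBy moves all rows into a single partition, so Spark will usually log a WARN and this can be slow on large data. A minimal alternative sketch, assuming the same sparkDF and column names as above (the group_ids name is illustrative), is to rank only the distinct (Name, Rank) pairs and join the group id back:

# The distinct (Name, Rank) pairs are usually far smaller than the full table,
# so the global ordering only runs over that small DataFrame.
group_ids = (
    sparkDF.select('Name', 'Rank').distinct()
           .withColumn('ngroup', F.dense_rank().over(Window.orderBy('Name', 'Rank')) - 1)
)

# Drop the ngroup added earlier, then join the fresh group id back on the keys.
sparkDF.drop('ngroup') \
       .join(group_ids, on=['Name', 'Rank'], how='left') \
       .orderBy('ngroup', 'id') \
       .show()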
Dense Rank - SparkSQL
sparkDF.createOrReplaceTempView("TB1")

sql.sql("""
SELECT
ID,
NAME,
RANK,
COURSE,
DENSE_RANK() OVER(ORDER BY NAME,RANK) - 1 as NGROUP
FROM TB1
""").show()
+---+----+----+---------+------+
| ID|NAME|RANK| COURSE|NGROUP|
+---+----+----+---------+------+
| 1| S1| 21| Physics| 0|
| 6| S1| 21|Geography| 0|
| 2| S2| 22|Chemistry| 1|
| 4| S2| 22| English| 1|
| 5| S2| 22| Social| 1|
| 3| S3| 24| Math| 2|
+---+----+----+---------+------+
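As a quick sanity check, a sketch assuming the df and sparkDF objects from the data preparation step above (the pdf and spark_pdf names are illustrative) pulls the Spark result back to pandas and compares it with ngroup directly:

# Reference result computed with pandas ngroup.
pdf = df.copy()
pdf['ngrp'] = pdf.groupby(['Name', 'Rank']).ngroup()

# Collect the Spark result locally and line both frames up by id for comparison.
spark_pdf = sparkDF.orderBy('id').toPandas()
print(pdf.sort_values('id').reset_index(drop=True))
print(spark_pdf)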