Pyspark: GroupBy multiple columns and calculate group number
I have a dataframe like this:
id Name Rank Course
1 S1 21 Physics
2 S2 22 Chemistry
3 S3 24 Math
4 S2 22 English
5 S2 22 Social
6 S1 21 Geography
I want to group this dataset over Name and Rank and calculate the group number. In pandas I can easily do:
df['ngrp'] = df.groupby(['Name', 'Rank']).ngroup()
After the above computation, I get the following output:
id Name Rank Course ngrp
1 S1 21 Physics 0
6 S1 21 Geography 0
2 S2 22 Chemistry 1
4 S2 22 English 1
5 S2 22 Social 1
3 S3 24 Math 2
Is there a way to achieve the same output in Pyspark? I tried the following, but it doesn't seem to work:
from pyspark.sql import Window
w = Window.partitionBy(['Name', 'Rank'])
df.select(['Name', 'Rank'], ['Course'], f.count(['Name', 'Rank']).over(w).alias('ngroup')).show()
You can opt for DENSE_RANK -
Data Preparation
import pandas as pd
from io import StringIO
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

sql = SparkSession.builder.getOrCreate()

df = pd.read_csv(StringIO("""
id,Name,Rank,Course
1,S1,21,Physics
2,S2,22,Chemistry
3,S3,24,Math
4,S2,22,English
5,S2,22,Social
6,S1,21,Geography
"""),delimiter=',')
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+---+----+----+---------+
| id|Name|Rank| Course|
+---+----+----+---------+
| 1| S1| 21| Physics|
| 2| S2| 22|Chemistry|
| 3| S3| 24| Math|
| 4| S2| 22| English|
| 5| S2| 22| Social|
| 6| S1| 21|Geography|
+---+----+----+---------+
Dense Rank
window = Window.orderBy(['Name','Rank'])
sparkDF = sparkDF.withColumn('ngroup',F.dense_rank().over(window) - 1)
sparkDF.orderBy(['Name','ngroup']).show()
+---+----+----+---------+------+
| id|Name|Rank| Course|ngroup|
+---+----+----+---------+------+
| 6| S1| 21|Geography| 0|
| 1| S1| 21| Physics| 0|
| 4| S2| 22| English| 1|
| 2| S2| 22|Chemistry| 1|
| 5| S2| 22| Social| 1|
| 3| S3| 24| Math| 2|
+---+----+----+---------+------+
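Note that a window with orderBy but no partitionBy moves all rows into a single partition, so Spark will usually log a WARN and this can be slow on large data. A minimal alternative sketch, assuming the same sparkDF and column names as above (the group_ids name is illustrative), is to rank only the distinct (Name, Rank) pairs and join the group id back:

# The distinct (Name, Rank) pairs are usually far smaller than the full table,
# so the global ordering only runs over that small DataFrame.
group_ids = (
    sparkDF.select('Name', 'Rank').distinct()
           .withColumn('ngroup', F.dense_rank().over(Window.orderBy('Name', 'Rank')) - 1)
)

# Drop the ngroup added earlier, then join the fresh group id back on the keys.
sparkDF.drop('ngroup') \
       .join(group_ids, on=['Name', 'Rank'], how='left') \
       .orderBy('ngroup', 'id') \
       .show()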
Dense Rank - SparkSQL
sparkDF.createOrReplaceTempView("TB1")

sql.sql("""
SELECT
ID,
NAME,
RANK,
COURSE,
DENSE_RANK() OVER(ORDER BY NAME,RANK) - 1 as NGROUP
FROM TB1
""").show()
+---+----+----+---------+------+
| ID|NAME|RANK| COURSE|NGROUP|
+---+----+----+---------+------+
| 1| S1| 21| Physics| 0|
| 6| S1| 21|Geography| 0|
| 2| S2| 22|Chemistry| 1|
| 4| S2| 22| English| 1|
| 5| S2| 22| Social| 1|
| 3| S3| 24| Math| 2|
+---+----+----+---------+------+
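As a quick sanity check, a sketch assuming the df and sparkDF objects from the data preparation step above (the pdf and spark_pdf names are illustrative) pulls the Spark result back to pandas and compares it with ngroup directly:

# Reference result computed with pandas ngroup.
pdf = df.copy()
pdf['ngrp'] = pdf.groupby(['Name', 'Rank']).ngroup()

# Collect the Spark result locally and line both frames up by id for comparison.
spark_pdf = sparkDF.orderBy('id').toPandas()
print(pdf.sort_values('id').reset_index(drop=True))
print(spark_pdf)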