PySpark partition by most count
I have a question. I have a dataset like this:
id / color
1 / red
2 / green
2 / green
2 / blue
3 / blue
4 / yellow
4 / pink
5 / red
and I want to group by id and keep the most frequent color, to get something like this (a random tie-break is fine, or any better solution):
id / most_color
1 / red
2 / green
3 / blue
4 / yellow
5 / red
I tried something like this:
display(dataset.select("id","color").
dropDuplicates().
withColumn("most_color",count("color").over(w)))
or like this:
dataset2= (dataset.select("id","color").
withColumn("most_color", dataset["color"]).
groupBy("id").
agg(count('color').
alias('count').
filter(column('count') == max(count))))
display(dataset2)
Thanks everyone.
You can use the Window function row_number() to achieve this:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
# assuming df_final is the input DataFrame of (id, color) rows
_w = W.partitionBy('id').orderBy(F.col('id').desc())
df_final = df_final.withColumn('rn_no', F.row_number().over(_w))
df_final = df_final.filter(F.col('rn_no') == 1)
df_final.show()
Output
id / most_color
1 / red
2 / green
3 / blue
4 / yellow
5 / red
Modified version: this gives you the most used/appeared value within each group --
Input
df_a = spark.createDataFrame([(1,'red'),(2,'green'),(2,'green'),(2,'blue'),(3,'blue'),(4,'yellow'),(4,'pink'),(5,'red')],[ "id","color"])
+---+------+
| id| color|
+---+------+
| 1| red|
| 2| green|
| 2| green|
| 2| blue|
| 3| blue|
| 4|yellow|
| 4| pink|
| 5| red|
+---+------+
# First Group the values to get the max appeared color in a group
df = df_a.groupBy('id','color').agg(F.count('color').alias('count')).orderBy(F.col('id'))
# Now partition by id, sort each window in descending order of count, and take the first row
_w = W.partitionBy('id').orderBy(F.col('count').desc())
df_a = df.withColumn('rn_no', F.row_number().over(_w))
df_a = df_a.filter(F.col('rn_no') == 1)
Output
df_a.show()
+---+-----+-----+-----+
| id|color|count|rn_no|
+---+-----+-----+-----+
| 1| red| 1| 1|
| 2|green| 2| 1|
| 3| blue| 1| 1|
| 4| pink| 1| 1|
| 5| red| 1| 1|
+---+-----+-----+-----+
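If you want exactly the id / most_color shape asked for in the question, a minimal follow-up sketch (building on the df_a above) is to drop the helper columns and rename color:

# keep only the grouping key and the winning color, renamed to match the question
df_a.select('id', F.col('color').alias('most_color')).show()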