How to select the corresponding value of another column when percentile_approx returns a single value of a particular column based on groupby?
I am new to PySpark and need some clarification.
I have a PySpark table like this:
+---+-------+-----+-------+
| id| ranges|score| uom|
+---+-------+-----+-------+
| 1| low| 20|percent|
| 1|verylow| 10|percent|
| 1| high| 70| bytes|
| 1| medium| 40|percent|
| 1| high| 60|percent|
| 1|verylow| 10|percent|
| 1| high| 70|percent|
+---+-------+-----+-------+
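In case it helps, here is a minimal snippet I use to reproduce this table (the Spark session setup and the 'subset' view name are just my local setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("percentile-example").getOrCreate()

# sample rows matching the table above
rows = [
    (1, "low", 20, "percent"),
    (1, "verylow", 10, "percent"),
    (1, "high", 70, "bytes"),
    (1, "medium", 40, "percent"),
    (1, "high", 60, "percent"),
    (1, "verylow", 10, "percent"),
    (1, "high", 70, "percent"),
]
df = spark.createDataFrame(rows, ["id", "ranges", "score", "uom"])
df.registerTempTable("subset")  # queried as 'subset' in the SQL below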
I want to compute the 0.95 percentile of the score column, and I also want the query to return the corresponding ranges value. I tried running this query:
results = spark.sql('select percentile_approx(score,0.95) as score, first(ranges) from subset GROUP BY id')
The result I get looks like this:
+-----+--------------------+
|score|first(ranges, false)|
+-----+--------------------+
| 70| low|
+-----+--------------------+
It returns the first ranges value, which is incorrect; it should be 'high'.
If I select ranges directly instead of wrapping it in first(), it gives me this error:
> pyspark.sql.utils.AnalysisException: u"expression 'subset.`ranges`' is
> neither present in the group by, nor is it an aggregate function. Add
> to group by or wrap in first() (or first_value) if you don't care
> which value you get.;;\nAggregate [id#0L],
> [percentile_approx(score#2L, cast(0.95 as double), 10000, 0, 0) AS
> score#353L, ranges#1]\n+- SubqueryAlias subset\n +- LogicalRDD
> [id#0L, ranges#1, score#2L, uom#3], false\n
This is because you are only grouping by id, so by using first() you are effectively picking an arbitrary value from the ranges column.
One solution is to create a second DataFrame that holds the score-to-ranges mapping, and then join it back to the results DataFrame at the end.
>>> df.registerTempTable("df") # Register first before selecting from 'df'
>>> map = spark.sql('select ranges, score from df')
>>> results = spark.sql('select percentile_approx(score,0.95) as score from subset GROUP BY id')
>>> results .registerTempTable("results ")
>>> final_result = spark.sql('select r.score, m.ranges from results as r join map as m on r.score = m.score')
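One caveat: joining only on score ignores id and can return several rows if the same score maps to more than one ranges value, so it may help to de-duplicate the mapping and join on id as well. A rough sketch of the same idea with the DataFrame API (untested, column names as in your table):

>>> from pyspark.sql import functions as F
>>> pct = df.groupBy("id").agg(F.expr("percentile_approx(score, 0.95)").alias("score"))
>>> mapping = df.select("id", "score", "ranges").distinct()   # drop duplicate score/ranges pairs
>>> final_result = pct.join(mapping, on=["id", "score"], how="left")
>>> final_result.show()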