Filtering a pandas DataFrame AFTER grouping by two or more conditions in Python
Does anyone have experience filtering rows AFTER grouping a pandas DataFrame? Let me explain.
Here is the reproducible data:
   name            users_rated average bayes_average maxplayers
   <chr>                 <dbl>   <dbl>         <dbl>      <dbl>
 1 Pandemic             108975    7.59          7.49          4
 2 Carcassonne          108738    7.42          7.31          5
 3 Catan                108024    7.14          6.97          4
 4 7 Wonders             89982    7.74          7.63          7
 5 Dominion              81561    7.61          7.50          4
 6 Ticket to Ride        76171    7.41          7.30          5
 7 Codenames             74419     7.6          7.51          8
 8 Terraforming M…       74216    8.42          8.27          5
 9 7 Wonders Duel        69472    8.11          7.98          2
10 Agricola              66093    7.93          7.81          5
For the problem I'm trying to solve, in R I can do this:
df |>
  select(
    name,
    users_rated,
    average,
    bayes_average,
    maxplayers
  ) |>
  group_by(maxplayers) |>
  filter(average == min(average) | average == max(average)) |> # What I'm struggling with right now in Python
  filter(maxplayers <= 4 & maxplayers != 0) |>
  arrange(maxplayers, average) |>
  ungroup()
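What this code does is group by maxplayers and then filter within each group (for example, if there are three games with maxplayers = 99, the filter is evaluated against those three games only).
Here is a sample output (note: it does not come from the reproducible data above, but it illustrates what I'm trying to do):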
  name             users_rated average bayes_average maxplayers
  <chr>                  <dbl>   <dbl>         <dbl>      <dbl>
1 Solitaire               1014     4.4          5.07          1
2 Five Parsecs Fr…          65    8.89          5.59          1
3 W.W.B                     40    1.44          5.41          2
4 System Gateway …          47     9.4           5.6          2
5 Exploration: Wa…          47    3.55          5.44          3
6 Old School Tact…         151    8.54          5.69          3
7 Oneupmanship: M…          75    1.04          5.32          4
8 TerroriXico               70    9.43          5.50          4
(Check the last column: each maxplayers value has its own minimum and maximum average.)
In fact, I managed to do it in SQL using PARTITION BY, but in Python I don't know what I'm doing wrong.
Here it is in SQL:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals()) # Quicker to write query
q = """
WITH cte_data AS
(
    SELECT
        name
        , users_rated
        , average
        , bayes_average
        , maxplayers
        , MIN(average) OVER(PARTITION BY maxplayers) AS avg_min
        , MAX(average) OVER(PARTITION BY maxplayers) AS avg_max
    FROM df
)
SELECT
    *
FROM cte_data
WHERE 1=1
    AND (average = avg_min OR average = avg_max)
    AND (maxplayers <= 4 AND maxplayers != 0)
ORDER BY maxplayers, average
;
"""
pysqldf(q)
This is what I have in pandas after stumbling upon a similar question on Stack Overflow: filtering grouped pandas dataframe by all records being the same
(
    df[["name", "users_rated", "average", "bayes_average", "maxplayers"]]
    .groupby(["maxplayers"])
    .filter(lambda x: ((x["average"] == x["average"].min()).all()) or ((x["average"] == x["average"].max()).all()))
)
Another approach, but no luck:
dta = (
    df[["name", "users_rated", "average", "bayes_average", "maxplayers"]]
    .assign(avg_min = lambda x: x.groupby(["maxplayers"])["average"].min())
    .assign(avg_max = lambda x: x.groupby(["maxplayers"])["average"].max())
)
dta.query("average == avg_min or average == avg_max")
Any ideas? Any help you can provide is greatly appreciated.
You could do this:
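import pandas as pd
(
    df[(df.maxplayers <= 4) & (df.maxplayers != 0)]
    .groupby("maxplayers")
    # attach each group's min and max average to every row of that group
    .apply(lambda gp: gp.assign(avg_min=gp["average"].min(), avg_max=gp["average"].max()))
    .reset_index(drop=True)
    .query("average == avg_min or average == avg_max")
    .sort_values(["maxplayers", "average"])
    # drop the helper columns so only the original five columns remain
    .drop(columns=["avg_min", "avg_max"])
)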
Output:
             name  users_rated  average  bayes_average  maxplayers
2  7 Wonders Duel        69472     8.11           7.98           2
0           Catan       108024     7.14           6.97           4
1        Dominion        81561     7.61           7.50           4
R output:
# A tibble: 3 x 5
  name           users_rated average bayes_average maxplayers
  <chr>                <int>   <dbl>         <dbl>      <int>
1 7 Wonders Duel       69472    8.11          7.98          2
2 Catan               108024    7.14          6.97          4
3 Dominion             81561    7.61          7.5           4
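A side note on why the second attempt in the question returns the wrong rows while the per-group assign above works: an aggregation such as groupby(...)["average"].min() returns one value per group, labeled by the maxplayers values, and assign then aligns those labels against the row labels. The sketch below shows this on a hypothetical three-row frame (toy, group_min, misaligned and aligned are illustrative names, not from the thread), and uses Series.map as one possible way to look up each row's group value.

```python
import pandas as pd

# Hypothetical toy data (not from the question), only to show index alignment.
toy = pd.DataFrame(
    {"name": ["a", "b", "c"], "average": [7.0, 8.0, 9.0], "maxplayers": [2, 2, 4]}
)

# Aggregating gives one value per group, indexed by the maxplayers values (2 and 4).
group_min = toy.groupby("maxplayers")["average"].min()

# Assigning it directly aligns labels 2 and 4 against the row labels 0, 1, 2:
# only the row labeled 2 gets a value, and it is the minimum of group 2,
# even though that row belongs to group 4. The other rows become NaN.
misaligned = toy.assign(avg_min=group_min)

# Looking up each row's maxplayers in the aggregated Series gives one value per row.
aligned = toy.assign(avg_min=toy["maxplayers"].map(group_min))

print(misaligned)
print(aligned)
```

The same reasoning applies to the groupby(...).filter(...) attempt: its function returns a single True/False per group, so it keeps or drops whole groups rather than individual rows.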
Use transform on the groupby to build the conditions; it is roughly equivalent to SQL's PARTITION BY:
grouped = df.groupby('maxplayers').average
cond1 = df.average.eq(grouped.transform('min')) | df.average.eq(grouped.transform('max'))
cond2 = df.maxplayers.between(0,4) # a simpler interpretation; note that this also keeps maxplayers == 0
df.loc[cond1 & cond2].sort_values(['maxplayers', 'average'])
             name  users_rated  average  bayes_average  maxplayers
8  7 Wonders Duel        69472     8.11           7.98           2
2           Catan       108024     7.14           6.97           4
4        Dominion        81561     7.61           7.50           4
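To try the transform approach end to end, here is a self-contained sketch that rebuilds the ten rows from the question and spells out the maxplayers != 0 condition instead of using between. The variable names (by_players, is_extreme, in_range, result) are my own, and the truncated game name is kept exactly as it was printed.

```python
import pandas as pd

# The ten rows from the question (the truncated name is kept as printed).
df = pd.DataFrame({
    "name": ["Pandemic", "Carcassonne", "Catan", "7 Wonders", "Dominion",
             "Ticket to Ride", "Codenames", "Terraforming M…",
             "7 Wonders Duel", "Agricola"],
    "users_rated": [108975, 108738, 108024, 89982, 81561,
                    76171, 74419, 74216, 69472, 66093],
    "average": [7.59, 7.42, 7.14, 7.74, 7.61, 7.41, 7.6, 8.42, 8.11, 7.93],
    "bayes_average": [7.49, 7.31, 6.97, 7.63, 7.50, 7.30, 7.51, 8.27, 7.98, 7.81],
    "maxplayers": [4, 5, 4, 7, 4, 5, 8, 5, 2, 5],
})

# transform broadcasts each group's min/max back onto the original rows,
# so the comparison stays aligned with df row by row.
by_players = df.groupby("maxplayers")["average"]
is_extreme = (
    df["average"].eq(by_players.transform("min"))
    | df["average"].eq(by_players.transform("max"))
)

# Same range condition as the R and SQL versions: at most 4 players, and not 0.
in_range = df["maxplayers"].le(4) & df["maxplayers"].ne(0)

result = df.loc[is_extreme & in_range].sort_values(["maxplayers", "average"])
print(result)
```

Because the masks are computed on the full frame, the original row labels (8, 2, 4) are preserved, which matches the output shown above.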