Select users based on the column values - pandas dataframe
I am having trouble selecting the IDs that meet certain conditions in a dataframe.
Here is the issue:
My dataframe looks like this:
index ID score_1 score_2 ...
0 22 0 0
1 22 0 0
2 22 0 0
3 23 1 0
4 23 1 0
5 23 1 0
6 24 0 0
7 24 0 0
8 24 0 1
10 25 0 0
11 25 0 0
12 26 0 1
13 26 0 1
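What I want to do is get the number of IDs that have:
1. score_1 == 0 and score_2 == 0 for all instances - e.g. ID == 22 and ID == 25 fulfil this requirement.
2. score_1 == 0, but at least one row for the given ID has score_2 == 1 - e.g. ID == 24 fulfils this requirement.
3. score_1 == 0, and all rows for the given ID have score_2 == 1 - e.g. ID == 26 fulfils this requirement.
Each ID can appear in only one of the groups.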
I tried conditional filtering and groupby, but then I got repeated IDs, because it selects single rows without 'having in mind' the user.
Some of the code I tried:
# Create a df with only IDs that have score_1 == 0, group by `ID`
zero_IDs = df[df['score_1'] == 0].groupby(by = 'ID').nunique()
# 'Count' the number of IDs that have only one type of `score_2`
# But this does not differentiate between `0` and `1` values in the score_2 column
zero_IDs[(zero_IDs['score_2'] == 1)].shape[0]
# 'Count' the number of IDs that have at least one `score_2 == 1`
zero_IDs[(zero_IDs['score_2'] > 1)].shape[0]
Could you help me with this problem?
How about something like this? The result is [22 25] [24] [26].
dfsum = df.groupby('ID').sum()
n_rows = df.groupby('ID').count().score_2  # number of rows per ID, computed once
# score_1 sums to 0 and score_2 sums to 0
case1 = dfsum[(dfsum.score_1==0) & (dfsum.score_2==0)].index
# score_1 sums to 0 and score_2 is 1 on some, but not all, rows
case2 = dfsum[(dfsum.score_1==0) & (dfsum.score_2>0) & (dfsum.score_2<n_rows)].index
# score_1 sums to 0 and score_2 is 1 on every row
case3 = dfsum[(dfsum.score_1==0) & (dfsum.score_2>0) & (dfsum.score_2==n_rows)].index
print(case1.values)
print(case2.values)
print(case3.values)
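The sum-versus-count comparison above can also be written as a per-ID mean. The following is only a sketch under the assumption that both score columns contain nothing but 0 and 1, so the mean is the fraction of rows equal to 1; the names `frac`, `all_zero`, `mixed` and `all_one` are illustrative and do not come from the answer.

```python
import pandas as pd

# Sample data copied from the question (the `index` column is omitted).
df = pd.DataFrame({
    "ID":      [22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 26, 26],
    "score_1": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "score_2": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Assumption: both score columns hold only 0 or 1, so the per-ID mean is
# the fraction of rows equal to 1.
frac = df.groupby("ID")[["score_1", "score_2"]].mean()

no_score_1 = frac["score_1"] == 0
all_zero = frac.index[no_score_1 & (frac["score_2"] == 0)].tolist()
mixed = frac.index[no_score_1 & (frac["score_2"] > 0) & (frac["score_2"] < 1)].tolist()
all_one = frac.index[no_score_1 & (frac["score_2"] == 1)].tolist()

print(all_zero, mixed, all_one)
# The question asks for the number of IDs in each group.
print(len(all_zero), len(mixed), len(all_one))
```

Because the three conditions on the mean cannot overlap, each ID lands in at most one list.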
Newbie here. I did my best...
df['dummy'] = list(range(0,len(df))) #added a column for looping
grp = df.groupby('ID').agg({'score_1' : 'sum' , 'score_2' : 'sum', 'dummy' : 'count'}).reset_index(level=[0])
instance_1 = []
instance_2 = []
instance_3 = []
i = 0
while i < len(grp):
    if(grp.score_1[i] == 0 and grp.score_2[i] == 0):
        instance_1.append(grp.ID[i])
    elif(grp.score_1[i] == 0 and grp.score_2[i] >= 1 and grp.score_2[i] < grp.dummy[i]):
        instance_2.append(grp.ID[i])
    elif(grp.score_1[i] == 0 and grp.score_2[i] == grp.dummy[i]):
        instance_3.append(grp.ID[i])
    i += 1
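The same three branches can probably be evaluated without an explicit loop. This is a rough sketch using `numpy.select`, which picks the label of the first condition that holds for each row; the label strings and the `rows` and `group` column names are made up for illustration, and `groupby(...).size()` stands in for the helper `dummy` column.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (the `index` column is omitted).
df = pd.DataFrame({
    "ID":      [22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 26, 26],
    "score_1": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "score_2": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

by_id = df.groupby("ID")
grp = by_id[["score_1", "score_2"]].sum()
grp["rows"] = by_id.size()  # rows per ID, replaces the `dummy` column

no_score_1 = grp["score_1"] == 0
conditions = [
    no_score_1 & (grp["score_2"] == 0),
    no_score_1 & (grp["score_2"] >= 1) & (grp["score_2"] < grp["rows"]),
    no_score_1 & (grp["score_2"] == grp["rows"]),
]
# Hypothetical labels; IDs matching none of the conditions get "excluded".
labels = ["all_zero", "some_one", "all_one"]
grp["group"] = np.select(conditions, labels, default="excluded")

print(grp["group"])
print(grp["group"].value_counts())
```

Here `value_counts()` gives the number of IDs per group directly, which is what the question asks for.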
For this problem, the data should first be filtered by the flag, then grouped by ID, and the conditions checked on each group:
import pandas as pd
import numpy as np
from io import StringIO
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# With this setting there is no need to call print on every line
csv=StringIO("""
index ID score_1 score_2
0 22 0 0
1 22 0 0
2 22 0 0
3 23 1 1
4 23 1 0
5 23 1 0
6 24 0 0
7 24 0 0
8 24 0 1
10 25 0 0
11 25 0 0
12 26 0 1
13 26 0 1
""")
Load the data; here is the code:
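df = pd.read_csv(csv, sep=r'\s+', index_col=0)
flag10 = df.score_1 == 0  # rows where score_1 is 0
group = df[flag10].groupby('ID')['score_2']
# Case 1: score_2 is 0 on every row
case1 = group.sum() == 0
case1[case1]
# Case 2: score_2 is 1 on at least one row (this also matches IDs where every row is 1, such as 26)
case2 = group.sum() > 0
case2[case2]
# Case 3: score_2 is 1 on every row
case3 = group.sum() == group.count()
case3[case3]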
The output is:
ID
22 True
25 True
Name: score_2, dtype: bool
ID
24 True
26 True
Name: score_2, dtype: bool
ID
26 True
Name: score_2, dtype: bool
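In the output above ID 26 appears under both the second and the third condition, while the question asks for each ID to fall into exactly one group. One possible way to keep the groups disjoint is to combine `any()` and `all()` per ID, as in the sketch below. It also drops whole IDs that have any non-zero `score_1` rather than dropping single rows, which is only one reading of the question; the names `sub`, `is_one` and `counts` are illustrative.

```python
import pandas as pd

# Sample data copied from the question (the `index` column is omitted).
df = pd.DataFrame({
    "ID":      [22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 26, 26],
    "score_1": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "score_2": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Assumption: an ID qualifies only if score_1 is 0 on every one of its rows.
sub = df[df.groupby("ID")["score_1"].transform("max") == 0]

# Per-ID boolean summaries of "score_2 equals 1".
is_one = sub["score_2"].eq(1).groupby(sub["ID"])
any_one = is_one.any()
all_one = is_one.all()

case1 = ~any_one            # score_2 is never 1
case2 = any_one & ~all_one  # score_2 is 1 on some rows, but not all
case3 = all_one             # score_2 is 1 on every row

counts = {
    "case1": int(case1.sum()),
    "case2": int(case2.sum()),
    "case3": int(case3.sum()),
}
print(case1[case1].index.tolist(), case2[case2].index.tolist(), case3[case3].index.tolist())
print(counts)
```

Since `case2` explicitly excludes IDs for which `all_one` is true, the three masks cannot overlap.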
Hope this helps.