Select users based on column values - pandas dataframe

Question

I am having trouble selecting the IDs that meet certain conditions in a dataframe. Here is the problem. My dataframe looks like this:

index    ID    score_1   score_2   ...
   0     22      0          0
   1     22      0          0
   2     22      0          0
   3     23      1          0
   4     23      1          0 
   5     23      1          0
   6     24      0          0
   7     24      0          0
   8     24      0          1
   10    25      0          0
   11    25      0          0
   12    26      0          1
   13    26      0          1

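What I want is to count the IDs that fall into each of the following groups:

score_1 == 0 and score_2 == 0 for all instances - e.g. ID == 22 and ID == 25 meet this requirement.
score_1 == 0, but at least one row for the given ID has score_2 == 1 - e.g. ID == 24 meets this requirement.
score_1 == 0 and all rows for the given ID have score_2 == 1 - e.g. ID == 26 meets this requirement.

Each ID may appear in only one of the groups.

I tried conditional filtering and groupby, but then I got duplicated IDs, because it only selects individual rows without 'having in mind' the user. Some of the code I tried: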

# Create a df with only IDs that have score_1 == 0, group by `ID`
zero_IDs = df[df['score_1'] == 0].groupby(by = 'ID').nunique()
# 'Count' the number of IDs that have only one type of `score_2`
# But this does not differentiate between `0` or `1` values for score_2 column
zero_IDs[(zero_IDs['score_2'] == 1)].shape[0] 
# 'Count' the number of IDs that have at least one `score_2 == 1`
zero_IDs[(zero_IDs['score_2'] > 1)].shape[0]

Can you help me solve this?

Answer 1

How about something like this? The result is [22 25] [24] [26].

dfsum = df.groupby('ID').sum()
case1 = dfsum[(dfsum.score_1==0) & (dfsum.score_2==0)].index
case2 = dfsum[(dfsum.score_1==0) & (dfsum.score_2>0) &  (dfsum.score_2<df.groupby('ID').count().score_2)].index  
case3 = dfsum[(dfsum.score_1==0) & (dfsum.score_2>0) &  (dfsum.score_2==df.groupby('ID').count().score_2)].index
print(case1.values)
print(case2.values)
print(case3.values)
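
Since the question asks for the number of IDs in each group, the sum-versus-count idea above could also be sketched with a single aggregation and `np.select`. This is only a sketch: the labels `none`, `some`, `all` and `excluded` are made-up names, the dataframe is rebuilt from the question's sample, and, like the code above, it assumes both score columns only ever hold 0 or 1.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":      [22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 26, 26],
    "score_1": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "score_2": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# One pass over the groups: sum of each score plus the row count per ID.
stats = df.groupby("ID").agg(
    s1=("score_1", "sum"),
    s2=("score_2", "sum"),
    n=("score_2", "count"),
)

s1_zero = stats["s1"] == 0
conditions = [
    s1_zero & (stats["s2"] == 0),                              # no score_2 == 1
    s1_zero & (stats["s2"] > 0) & (stats["s2"] < stats["n"]),  # some, not all
    s1_zero & (stats["s2"] == stats["n"]),                     # all score_2 == 1
]
labels = ["none", "some", "all"]  # hypothetical label names

# Each ID gets exactly one label; IDs with any score_1 != 0 are "excluded".
stats["case"] = np.select(conditions, labels, default="excluded")
counts = stats["case"].value_counts()
print(stats["case"])
print(counts)
```

Because the three conditions are mutually exclusive, every ID lands in at most one group, and `counts` holds the number of IDs per group directly.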

Answer 2

Newbie here. I tried my best...

df['dummy'] = list(range(0,len(df))) #added a column for looping

grp = df.groupby('ID').agg({'score_1' : 'sum' , 'score_2' : 'sum', 'dummy' : 'count'}).reset_index(level=[0])
instance_1 = []
instance_2 = []
instance_3 = []

i = 0
while i < len(grp):
    if(grp.score_1[i] == 0 and grp.score_2[i] == 0):
        instance_1.append(grp.ID[i])
    elif(grp.score_1[i] == 0 and grp.score_2[i] >= 1 and grp.score_2[i] < grp.dummy[i]):
        instance_2.append(grp.ID[i])
    elif(grp.score_1[i] == 0 and grp.score_2[i] == grp.dummy[i]):
        instance_3.append(grp.ID[i])
    i += 1
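
For what it's worth, the while loop and the helper column could probably be avoided by letting pandas call a small function once per ID. The sketch below assumes the same three rules as the loop; `classify` and the `case_1`/`case_2`/`case_3` labels are hypothetical names, and the dataframe is rebuilt from the question's sample.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":      [22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 26, 26],
    "score_1": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "score_2": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})


def classify(group: pd.DataFrame) -> str:
    """Return a (hypothetical) case label for all rows of one ID."""
    if (group["score_1"] != 0).any():
        return "excluded"  # at least one row has score_1 != 0
    ones = int((group["score_2"] == 1).sum())
    if ones == 0:
        return "case_1"    # score_2 is never 1
    if ones < len(group):
        return "case_2"    # score_2 is 1 in some rows, but not all
    return "case_3"        # score_2 is 1 in every row


# Selecting the score columns keeps the grouping key out of each group frame.
labels = df.groupby("ID")[["score_1", "score_2"]].apply(classify)

instance_1 = labels[labels == "case_1"].index.tolist()
instance_2 = labels[labels == "case_2"].index.tolist()
instance_3 = labels[labels == "case_3"].index.tolist()
print(instance_1, instance_2, instance_3)
```

The early returns make the three cases mutually exclusive, so an ID can only end up in one list.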

Answer 3

For this problem, the data should first be filtered by the flag, then grouped by ID, with the conditions checked on each group:

import pandas as pd
import numpy as np
from io import StringIO
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

With this setting there is no need to call print on every line.

csv=StringIO("""
index    ID    score_1   score_2
   0     22      0          0
   1     22      0          0
   2     22      0          0
   3     23      1          1
   4     23      1          0 
   5     23      1          0
   6     24      0          0
   7     24      0          0
   8     24      0          1
   10    25      0          0
   11    25      0          0
   12    26      0          1
   13    26      0          1
""")

Load the data; here is the code:

df=pd.read_csv(csv,sep=r'\s+',index_col=0)

flag10=df.score_1==0
group=df[flag10].groupby('ID')['score_2']
# case 1: no row has score_2 == 1
case1=group.sum()==0
case1[case1]
# case 2: some, but not all, rows have score_2 == 1
case2=(group.sum()>0)&(group.sum()<group.count())
case2[case2]
# case 3: every row has score_2 == 1
case3=group.sum()==group.count()
case3[case3]

The output is:

ID
22    True
25    True
Name: score_2, dtype: bool
ID
24    True
Name: score_2, dtype: bool
ID
26    True
Name: score_2, dtype: bool

Hope this helps.
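
One caveat with filtering rows on score_1 == 0 first: an ID that mixes score_1 values would still be classified from its zero rows alone. If the intent is that score_1 == 0 must hold for every row of the ID (an assumption, the question does not say so explicitly), a variant based on per-ID `all`/`any` flags might look like the sketch below. The column names `s1_zero`, `s2_one` and the aggregated names are invented for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":      [22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 26, 26],
    "score_1": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "score_2": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Row-level boolean flags (hypothetical column names).
flags = pd.DataFrame({
    "ID": df["ID"],
    "s1_zero": df["score_1"].eq(0),
    "s2_one": df["score_2"].eq(1),
})

# Collapse the flags to one row per ID.
per_id = flags.groupby("ID").agg(
    s1_all_zero=("s1_zero", "all"),  # every row has score_1 == 0
    s2_any_one=("s2_one", "any"),    # at least one row has score_2 == 1
    s2_all_one=("s2_one", "all"),    # every row has score_2 == 1
)

eligible = per_id["s1_all_zero"]
case1 = per_id[eligible & ~per_id["s2_any_one"]].index.tolist()
case2 = per_id[eligible & per_id["s2_any_one"] & ~per_id["s2_all_one"]].index.tolist()
case3 = per_id[eligible & per_id["s2_all_one"]].index.tolist()
print(len(case1), len(case2), len(case3))
```

For the sample data this gives 2, 1 and 1 IDs respectively, and it does not depend on the scores being exactly 0 or 1 the way the sum-based checks do.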

python

conditional

numpy

dataframe

pandas