Python/Pandas - remove not duplicated rows

I have a DataFrame like this:

        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
247451     2948262  2017-11-14     -2.000
226868     2948262  2017-11-11     -1.000  <- not duplicated
240571     2948262  2017-11-13     -2.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
244543     2948269  2017-11-11      2.500
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250
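
For reference, a minimal sketch that rebuilds this sample frame (keeping dt as plain strings is an assumption; the original column may be datetime):

import pandas as pd

df = pd.DataFrame(
    {'product_id': [2948259, 2948259, 2948260, 2948260, 2948260, 2948260,
                    2948262, 2948262, 2948262, 2948264, 2948264, 2948269,
                    2948276, 2948276],
     'dt': ['2017-11-11', '2017-11-12', '2017-11-13', '2017-11-14',
            '2017-11-12', '2017-11-11', '2017-11-14', '2017-11-11',
            '2017-11-13', '2017-11-13', '2017-11-12', '2017-11-11',
            '2017-11-14', '2017-11-11'],
     'stock_qty': [17.0, 17.0, 5.0, 5.0, 5.0, 5.0, -2.0, -1.0, -2.0,
                   5.488, 5.488, 2.5, 3.25, 3.25]},
    index=[226870, 233645, 240572, 247452, 233644, 226869, 247451,
           226868, 240571, 240570, 233643, 244543, 247450, 226867])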

I have to remove the rows where the product_id value is the same but the stock_qty differs. So I should get a DataFrame like this:

        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
244543     2948269  2017-11-11      2.500
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250

Thanks for your help!

You need drop_duplicates to get the product_id values of rows whose stock_qty is not duplicated, build a membership mask with isin, and chain it with another condition (duplicated product_id) using xor (^):

m1 = df['product_id'].isin(df.drop_duplicates('stock_qty', keep=False)['product_id'])
m2 = df.duplicated('product_id', keep=False)

df = df[m1 ^ m2]
print (df)
        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
244543     2948269  2017-11-11      2.500
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250

Details:

print (m1)
226870    False
233645    False
240572    False
247452    False
233644    False
226869    False
247451     True
226868     True
240571     True
240570    False
233643    False
244543     True
247450    False
226867    False
Name: product_id, dtype: bool

print (m2)
226870     True
233645     True
240572     True
247452     True
233644     True
226869     True
247451     True
226868     True
240571     True
240570     True
233643     True
244543    False
247450     True
226867     True
dtype: bool
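
To make the xor step concrete: a row survives m1 ^ m2 only when exactly one of the two masks is True. The single-occurrence product 2948269 (m1 True, m2 False) is therefore kept, while the whole 2948262 group (both masks True) is dropped. A small check, reusing the masks computed on the original frame:

final = m1 ^ m2                        # True where exactly one of m1, m2 is True
print(final.index[~final].tolist())    # index labels that get removed
# [247451, 226868, 240571]             # the entire 2948262 group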

@jezrael's solution is optimal, but another approach is to use groupby with filter:

df.groupby(['product_id','stock_qty']).filter(lambda x: len(x)>1)

Output:

        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
247451     2948262  2017-11-14     -2.000
240571     2948262  2017-11-13     -2.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250
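
Note one difference from the requested output: product 2948269 has only a single row, so its group length is 1 and filter drops it, while the expected result keeps it. If the intended rule is "keep a product only when all of its stock_qty values agree", a groupby/transform variant (an assumption about the desired semantics, not part of the original answer) reproduces the output the question asks for in reasonably recent pandas:

# keep a product only if its stock_qty is identical in every one of its rows
df[df.groupby('product_id')['stock_qty'].transform('nunique') == 1]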

By using drop_duplicates:

df.drop(df.drop_duplicates(['stock_qty', 'product_id'], keep=False).index)
Out[797]: 
        product_id          dt  stock_qty
226870     2948259  2017-11-11     17.000
233645     2948259  2017-11-12     17.000
240572     2948260  2017-11-13      5.000
247452     2948260  2017-11-14      5.000
233644     2948260  2017-11-12      5.000
226869     2948260  2017-11-11      5.000
247451     2948262  2017-11-14     -2.000
240571     2948262  2017-11-13     -2.000
240570     2948264  2017-11-13      5.488
233643     2948264  2017-11-12      5.488
247450     2948276  2017-11-14      3.250
226867     2948276  2017-11-11      3.250
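
As with the groupby/filter approach, this also drops the single-row product 2948269, because its (stock_qty, product_id) pair occurs only once and is therefore removed along with the 2948262 rows.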

Using loc[] you can filter only the duplicated rows and assign the result back to your original dataframe.

df = df.loc[df.duplicated(subset=['product_id','stock_qty'], keep=False)]

Also, the keep=False argument marks every occurrence of a duplicated row as True; if you want the first or last occurrence left unmarked instead, use keep='first' or keep='last'.
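
A small sketch of how the keep argument changes the duplicated mask on one group from the sample frame (index labels taken from the data above; this assumes the two 2948259 rows are still present in df):

g = df.loc[[226870, 233645]]                                     # the 2948259 group
print(g.duplicated(['product_id', 'stock_qty'], keep=False))     # 226870 True,  233645 True
print(g.duplicated(['product_id', 'stock_qty'], keep='first'))   # 226870 False, 233645 True
print(g.duplicated(['product_id', 'stock_qty'], keep='last'))    # 226870 True,  233645 False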