Python/Pandas - remove non-duplicated rows
I have a DataFrame like this:
product_id dt stock_qty
226870 2948259 2017-11-11 17.000
233645 2948259 2017-11-12 17.000
240572 2948260 2017-11-13 5.000
247452 2948260 2017-11-14 5.000
233644 2948260 2017-11-12 5.000
226869 2948260 2017-11-11 5.000
247451 2948262 2017-11-14 -2.000
226868 2948262 2017-11-11 -1.000 <- not duplicated
240571 2948262 2017-11-13 -2.000
240570 2948264 2017-11-13 5.488
233643 2948264 2017-11-12 5.488
244543 2948269 2017-11-11 2.500
247450 2948276 2017-11-14 3.250
226867 2948276 2017-11-11 3.250
I have to remove the rows where the stock_qty values differ but the product_id is the same. So I should get a DataFrame like this:
product_id dt stock_qty
226870 2948259 2017-11-11 17.000
233645 2948259 2017-11-12 17.000
240572 2948260 2017-11-13 5.000
247452 2948260 2017-11-14 5.000
233644 2948260 2017-11-12 5.000
226869 2948260 2017-11-11 5.000
240570 2948264 2017-11-13 5.488
233643 2948264 2017-11-12 5.488
244543 2948269 2017-11-11 2.500
247450 2948276 2017-11-14 3.250
226867 2948276 2017-11-11 3.250
Thanks for your help!
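For anyone reproducing the question, the sample frame can be rebuilt like this (a sketch; the index values and dtypes are copied from the listing above):

```python
import pandas as pd

# Rebuild the sample DataFrame from the question (index values copied verbatim)
df = pd.DataFrame(
    {
        "product_id": [2948259, 2948259, 2948260, 2948260, 2948260, 2948260,
                       2948262, 2948262, 2948262, 2948264, 2948264, 2948269,
                       2948276, 2948276],
        "dt": ["2017-11-11", "2017-11-12", "2017-11-13", "2017-11-14",
               "2017-11-12", "2017-11-11", "2017-11-14", "2017-11-11",
               "2017-11-13", "2017-11-13", "2017-11-12", "2017-11-11",
               "2017-11-14", "2017-11-11"],
        "stock_qty": [17.0, 17.0, 5.0, 5.0, 5.0, 5.0,
                      -2.0, -1.0, -2.0, 5.488, 5.488, 2.5, 3.25, 3.25],
    },
    index=[226870, 233645, 240572, 247452, 233644, 226869,
           247451, 226868, 240571, 240570, 233643, 244543,
           247450, 226867],
)
```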
You need drop_duplicates to get all the product_id values, then exclude them with isin, combined with another condition chained by xor (^):
m1 = df['product_id'].isin(df.drop_duplicates('stock_qty', keep=False)['product_id'])
m2 = df.duplicated('product_id', keep=False)
df = df[m1 ^ m2]
print (df)
product_id dt stock_qty
226870 2948259 2017-11-11 17.000
233645 2948259 2017-11-12 17.000
240572 2948260 2017-11-13 5.000
247452 2948260 2017-11-14 5.000
233644 2948260 2017-11-12 5.000
226869 2948260 2017-11-11 5.000
240570 2948264 2017-11-13 5.488
233643 2948264 2017-11-12 5.488
244543 2948269 2017-11-11 2.500
247450 2948276 2017-11-14 3.250
226867 2948276 2017-11-11 3.250
Details:
print (m1)
226870 False
233645 False
240572 False
247452 False
233644 False
226869 False
247451 True
226868 True
240571 True
240570 False
233643 False
244543 True
247450 False
226867 False
Name: product_id, dtype: bool
print (m2)
226870 True
233645 True
240572 True
247452 True
233644 True
226869 True
247451 True
226868 True
240571 True
240570 True
233643 True
244543 False
247450 True
226867 True
dtype: bool
@jezrael's solution is optimal, but another way is to use groupby and filter:
df.groupby(['product_id','stock_qty']).filter(lambda x: len(x)>1)
Output:
product_id dt stock_qty
226870 2948259 2017-11-11 17.000
233645 2948259 2017-11-12 17.000
240572 2948260 2017-11-13 5.000
247452 2948260 2017-11-14 5.000
233644 2948260 2017-11-12 5.000
226869 2948260 2017-11-11 5.000
247451 2948262 2017-11-14 -2.000
240571 2948262 2017-11-13 -2.000
240570 2948264 2017-11-13 5.488
233643 2948264 2017-11-12 5.488
247450 2948276 2017-11-14 3.250
226867 2948276 2017-11-11 3.250
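As an aside (a sketch, not from the original answer): `filter` with a Python `lambda` calls the function once per group, so on larger frames the same result can usually be obtained faster with a vectorised group size via `transform`:

```python
import pandas as pd

# Small sample with one product (2948262) whose quantities disagree
df = pd.DataFrame(
    {
        "product_id": [2948259, 2948259, 2948262, 2948262, 2948269],
        "stock_qty": [17.0, 17.0, -2.0, -1.0, 2.5],
    }
)

# Count rows per (product_id, stock_qty) pair; keep pairs seen more than once
sizes = df.groupby(["product_id", "stock_qty"])["stock_qty"].transform("size")
out = df[sizes > 1]
```

Like the groupby/filter version above, this keeps only rows whose (product_id, stock_qty) pair occurs more than once.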
By using drop_duplicates:
df.drop(df.drop_duplicates(['stock_qty', 'product_id'], keep=False).index)
Out[797]:
product_id dt stock_qty
226870 2948259 2017-11-11 17.000
233645 2948259 2017-11-12 17.000
240572 2948260 2017-11-13 5.000
247452 2948260 2017-11-14 5.000
233644 2948260 2017-11-12 5.000
226869 2948260 2017-11-11 5.000
247451 2948262 2017-11-14 -2.000
240571 2948262 2017-11-13 -2.000
240570 2948264 2017-11-13 5.488
233643 2948264 2017-11-12 5.488
247450 2948276 2017-11-14 3.250
226867 2948276 2017-11-11 3.250
Using loc[] you can filter only the duplicated rows and assign back to your original DataFrame.
df = df.loc[df.duplicated(subset=['product_id','stock_qty'], keep=False)]
Also, the keep=False argument marks all duplicated rows as True; if you want only the first or last occurrence, use keep='first' or keep='last'.
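A small sketch of the keep variants on toy data (not from the question):

```python
import pandas as pd

s = pd.DataFrame({"product_id": [1, 1, 2], "stock_qty": [5.0, 5.0, 3.0]})

# keep=False marks every member of a duplicated pair as True
all_dups = s.duplicated(subset=["product_id", "stock_qty"], keep=False)

# keep='first' marks only the later occurrences as duplicates
later_dups = s.duplicated(subset=["product_id", "stock_qty"], keep="first")
```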