Python/Pandas - Delete duplicate rows by column value
I have a DataFrame like this:
sale_id dt receipts_qty
31 196.0 2017-02-19 95.0
32 203.0 2017-02-20 101.0
33 196.0 2017-02-21 105.0
34 196.0 2017-02-22 112.0
35 196.0 2017-02-23 118.0
36 196.0 2017-02-24 135.0
37 196.0 2017-02-25 135.0
38 196.0 2017-02-26 124.0
40 203.0 2017-02-27 290.0
39 196.0 2017-02-27 84.0
42 203.0 2017-02-28 330.0
41 196.0 2017-02-28 124.0
43 196.0 2017-03-01 100.0
44 203.0 2017-03-01 361.0
I need to drop rows that are duplicated by dt, keeping the row where sale_id == 196. I have only found drop_duplicates('dt', keep='last') and drop_duplicates('dt', keep='first'), but neither is what I need.
The DataFrame I want to get:
sale_id dt receipts_qty
31 196.0 2017-02-19 95.0
32 203.0 2017-02-20 101.0
33 196.0 2017-02-21 105.0
34 196.0 2017-02-22 112.0
35 196.0 2017-02-23 118.0
36 196.0 2017-02-24 135.0
37 196.0 2017-02-25 135.0
38 196.0 2017-02-26 124.0
39 196.0 2017-02-27 84.0
41 196.0 2017-02-28 124.0
43 196.0 2017-03-01 100.0
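To illustrate why keep='first' and keep='last' fall short here, the following is a minimal sketch that uses only the rows with repeated dates from the sample above. The variable names first and last are only for illustration. Both options pick a row by position, so which sale_id survives depends on the row order within each date:

```python
import pandas as pd

# Only the rows whose dt appears twice in the sample data.
df = pd.DataFrame(
    {
        "sale_id": [203.0, 196.0, 203.0, 196.0, 196.0, 203.0],
        "dt": ["2017-02-27", "2017-02-27", "2017-02-28",
               "2017-02-28", "2017-03-01", "2017-03-01"],
        "receipts_qty": [290.0, 84.0, 330.0, 124.0, 100.0, 361.0],
    },
    index=[40, 39, 42, 41, 43, 44],
)

# keep='first' / keep='last' choose by position within each dt group,
# not by the value of sale_id.
first = df.drop_duplicates("dt", keep="first")
last = df.drop_duplicates("dt", keep="last")

print(first["sale_id"].tolist())  # [203.0, 203.0, 196.0]
print(last["sale_id"].tolist())   # [196.0, 196.0, 203.0]
```

Neither result contains 196 for all three dates, which is why some way of ordering the rows by the condition is needed before dropping duplicates.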
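First create a helper column that flags the rows matching the condition, then use sort_values so the flagged rows come first within each dt, and drop_duplicates to keep only that first row.
Final cleanup: drop the helper column a and call sort_index to restore the original order: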
print (df)
sale_id dt receipts_qty
31 196.0 2017-02-19 95.0
32 203.0 2017-02-20 101.0
33 196.0 2017-02-21 105.0
34 196.0 2017-02-22 112.0
35 196.0 2017-02-23 118.0
36 196.0 2017-02-24 135.0
37 196.0 2017-02-25 135.0
38 196.0 2017-02-26 124.0
40 203.0 2017-02-27 290.0
39 196.0 2017-02-27 84.0
42 103.0 2017-02-28 330.0 <-changed data, value < 196
41 196.0 2017-02-28 124.0
43 196.0 2017-03-01 100.0
44 203.0 2017-03-01 361.0
# helper column: 1 where sale_id == 196, otherwise 0
df['a'] = (df.sale_id == 196).astype(int)
# sort so the sale_id == 196 rows come first, drop duplicates by dt,
# remove the helper column and restore the original row order
df = (df.sort_values(['a','dt'], ascending=[False, True])
        .drop_duplicates('dt')
        .drop('a', axis=1)
        .sort_index())
print (df)
sale_id dt receipts_qty
31 196.0 2017-02-19 95.0
32 203.0 2017-02-20 101.0
33 196.0 2017-02-21 105.0
34 196.0 2017-02-22 112.0
35 196.0 2017-02-23 118.0
36 196.0 2017-02-24 135.0
37 196.0 2017-02-25 135.0
38 196.0 2017-02-26 124.0
39 196.0 2017-02-27 84.0
41 196.0 2017-02-28 124.0
43 196.0 2017-03-01 100.0
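For anyone who wants to run this end to end, here is a self-contained sketch of the same helper-column approach. The function name drop_dt_duplicates and its preferred_id parameter are my own naming for illustration, not anything from pandas, and the data is the modified sample from this answer (the 103.0 row included). It uses assign so the caller's DataFrame is not modified:

```python
import pandas as pd


def drop_dt_duplicates(df, preferred_id=196):
    """Keep one row per dt, preferring rows whose sale_id equals preferred_id."""
    return (
        # temporary helper column: 1 for the preferred sale_id, else 0
        df.assign(a=(df["sale_id"] == preferred_id).astype(int))
        # preferred rows first, then by date
        .sort_values(["a", "dt"], ascending=[False, True])
        # the first row seen for each dt is the one that is kept
        .drop_duplicates("dt")
        .drop(columns="a")
        # back to the original row order
        .sort_index()
    )


df = pd.DataFrame(
    {
        "sale_id": [196.0, 203.0, 196.0, 196.0, 196.0, 196.0, 196.0,
                    196.0, 203.0, 196.0, 103.0, 196.0, 196.0, 203.0],
        "dt": ["2017-02-19", "2017-02-20", "2017-02-21", "2017-02-22",
               "2017-02-23", "2017-02-24", "2017-02-25", "2017-02-26",
               "2017-02-27", "2017-02-27", "2017-02-28", "2017-02-28",
               "2017-03-01", "2017-03-01"],
        "receipts_qty": [95.0, 101.0, 105.0, 112.0, 118.0, 135.0, 135.0,
                         124.0, 290.0, 84.0, 330.0, 124.0, 100.0, 361.0],
    },
    index=[31, 32, 33, 34, 35, 36, 37, 38, 40, 39, 42, 41, 43, 44],
)

result = drop_dt_duplicates(df)
print(result)
```

Note that a date with no sale_id == 196 row (2017-02-20 here) still keeps one of its rows, which matches the desired output in the question.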