How to remove subsequent duplicate values in Python?
I have a df that looks like this:
event_name   |user_id|time_event             |time_install
ProfileScreen|1111   |2021-05-01 11:31:00.679|2021-05-01 11:31:00.679
ProfileScreen|1111   |2021-05-01 11:35:22.273|2021-05-01 11:31:00.679 <--- Delete
WalletScreen |1111   |2021-05-01 11:37:00.329|2021-05-01 11:31:00.679
ProfileScreen|1111   |2021-05-01 11:38:24.456|2021-05-01 11:31:00.679
HomeScreen   |1111   |2021-05-01 11:38:00.679|2021-05-01 11:38:00.679
ProfileScreen|1111   |2021-05-01 11:39:22.273|2021-05-01 11:38:00.679
WalletScreen |1111   |2021-05-01 11:40:00.329|2021-05-01 11:38:00.679
WalletScreen |1111   |2021-05-01 11:41:24.456|2021-05-01 11:38:00.679 <--- Delete
ProfileScreen|2222   |2021-05-03 11:31:00.679|2021-05-03 11:31:00.679
WalletScreen |2222   |2021-05-03 11:35:22.273|2021-05-03 11:31:00.679
HomeScreen   |2222   |2021-05-03 11:37:00.329|2021-05-03 11:31:00.679
ProfileScreen|2222   |2021-05-03 11:37:30.456|2021-05-03 11:31:00.679
ProfileScreen|2222   |2021-05-03 11:38:00.679|2021-05-03 11:38:00.679
ProfileScreen|2222   |2021-05-03 11:39:22.273|2021-05-03 11:38:00.679 <--- Delete
ProfileScreen|2222   |2021-05-03 11:39:42.543|2021-05-03 11:38:00.679 <--- Delete
WalletScreen |2222   |2021-05-03 11:40:00.329|2021-05-03 11:38:00.679
ProfileScreen|2222   |2021-05-03 11:41:24.456|2021-05-03 11:38:00.679
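With the rows sorted by time_event in ascending order, I want to delete the subsequent rows whose screen (event_name), user_id, and time_install are the same.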
To keep the earliest time_event, first sort the df by time_event, then pass keep='first' to drop_duplicates().
To sort, you can use .sort_values(...)
To drop the duplicates while keeping the earliest row, you can use
.drop_duplicates(subset=['event_name', 'user_id', 'time_install'], inplace=True, keep='first')
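As a rough sketch of the two steps above, the snippet below rebuilds the user 1111 rows from the question and applies sort_values followed by drop_duplicates. Be aware that drop_duplicates removes every later repeat of a key combination, so it also drops the 11:38:24.456 ProfileScreen row, which the question does not mark for deletion. If only back-to-back repeats should go, one possible alternative (an assumption about the intent, not part of the suggestion above) is to compare each row with the previous one via shift(). The names keys, df_sorted, deduped, and consecutive_only are made up for illustration.

```python
import pandas as pd

# Rows for user 1111, copied from the question's table.
df = pd.DataFrame(
    {
        "event_name": [
            "ProfileScreen", "ProfileScreen", "WalletScreen", "ProfileScreen",
            "HomeScreen", "ProfileScreen", "WalletScreen", "WalletScreen",
        ],
        "user_id": [1111] * 8,
        "time_event": pd.to_datetime([
            "2021-05-01 11:31:00.679", "2021-05-01 11:35:22.273",
            "2021-05-01 11:37:00.329", "2021-05-01 11:38:24.456",
            "2021-05-01 11:38:00.679", "2021-05-01 11:39:22.273",
            "2021-05-01 11:40:00.329", "2021-05-01 11:41:24.456",
        ]),
        "time_install": pd.to_datetime(
            ["2021-05-01 11:31:00.679"] * 4 + ["2021-05-01 11:38:00.679"] * 4
        ),
    }
)

# Columns that define a "duplicate" (hypothetical helper name).
keys = ["event_name", "user_id", "time_install"]

# Step 1: sort so the earliest time_event comes first.
df_sorted = df.sort_values("time_event").reset_index(drop=True)

# Step 2 (as suggested above): keep the first row of every key combination.
# This removes non-adjacent repeats as well.
deduped = df_sorted.drop_duplicates(subset=keys, keep="first")

# Alternative (assumed intent): keep a row only if at least one key differs
# from the row directly above it, so only consecutive repeats are dropped.
differs_from_previous = df_sorted[keys].ne(df_sorted[keys].shift()).any(axis=1)
consecutive_only = df_sorted[differs_from_previous]

print(len(deduped), len(consecutive_only))
```

On this sample the shift()-based version drops only the two rows flagged with "<--- Delete" for user 1111, while drop_duplicates drops one more.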