Drop almost duplicate rows based on timestamp
I'm trying to drop some almost-duplicate data. I'm looking for a way to detect the user's most recent (`edited_at`) trips without losing information.
My idea is to compute the differences between consecutive timestamps and drop the group with the minimum difference (zero in this example: step 1).
I'm open to other suggestions.
Note: don't use the drop_duplicates() function.
The DataFrame:
data = [[111, 121, "2019-10-22 05:00:00", 0],
[111, 121, "2019-10-22 05:00:00", 1],
[111, 123, "2019-10-22 07:10:00", 0],
[111, 123, "2019-10-22 07:10:00", 1],
[111, 123, "2019-10-22 07:10:00", 2],
[111, 124, "2019-10-22 07:20:00", 0],
[111, 124, "2019-10-22 07:20:00", 1],
[111, 124, "2019-10-22 07:20:00", 2],
[111, 124, "2019-10-22 07:20:00", 3],
[111, 125, "2019-10-22 19:20:00", 0],
[111, 125, "2019-10-22 19:20:00", 1],
[222, 223, "2019-11-24 06:00:00", 0],
[222, 223, "2019-11-24 06:00:00", 1],
[222, 244, "2019-11-24 06:15:00", 0],
[222, 244, "2019-11-24 06:15:00", 1],
[222, 255, "2019-11-24 18:15:10", 0],
[222, 255, "2019-11-24 18:15:10", 1]]
df = pd.DataFrame(data, columns = ["user_id", "prompt_uuid", "edited_at", "prompt_num"])
df['edited_at'] = pd.to_datetime(df['edited_at'])
Step 1:
111, 121, "2019-10-22 05:00:00", 0, something,
111, 121, "2019-10-22 05:00:00", 1, something,
111, 123, "2019-10-22 07:10:00", 0, 140,
111, 123, "2019-10-22 07:10:00", 1, 140,
111, 123, "2019-10-22 07:10:00", 2, 140,
111, 124, "2019-10-22 07:20:00", 0, 10,
111, 124, "2019-10-22 07:20:00", 1, 10,
111, 124, "2019-10-22 07:20:00", 2, 10,
111, 124, "2019-10-22 07:20:00", 3, 10,
111, 125, "2019-10-22 19:20:00", 0, 720,
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 223, "2019-11-24 06:00:00", 0, 0,
222, 223, "2019-11-24 06:00:00", 1, 0,
222, 244, "2019-11-24 06:15:00", 0, 15,
222, 244, "2019-11-24 06:15:00", 1, 15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720
Step 2:
111, 121, "2019-10-22 05:00:00", 0, something,
111, 121, "2019-10-22 05:00:00", 1, something,
111, 124, "2019-10-22 07:20:00", 0, 10,
111, 124, "2019-10-22 07:20:00", 1, 10,
111, 124, "2019-10-22 07:20:00", 2, 10,
111, 124, "2019-10-22 07:20:00", 3, 10,
111, 125, "2019-10-22 19:20:00", 0, 720,
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 244, "2019-11-24 06:15:00", 0, 15,
222, 244, "2019-11-24 06:15:00", 1, 15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720
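A minimal sketch of the minimum-gap idea above that avoids calling drop_duplicates(): take one representative row per (user_id, prompt_uuid) group with groupby().first(), compute the gap to each user's next trip, and flag the per-user minimum. Column names follow the question's DataFrame; the per-group rows are abbreviated to one each:

```python
import pandas as pd

data = [[111, 121, "2019-10-22 05:00:00", 0],
        [111, 123, "2019-10-22 07:10:00", 0],
        [111, 124, "2019-10-22 07:20:00", 0],
        [111, 125, "2019-10-22 19:20:00", 0],
        [222, 223, "2019-11-24 06:00:00", 0],
        [222, 244, "2019-11-24 06:15:00", 0],
        [222, 255, "2019-11-24 18:15:10", 0]]
df = pd.DataFrame(data, columns=["user_id", "prompt_uuid", "edited_at", "prompt_num"])
df["edited_at"] = pd.to_datetime(df["edited_at"])

# One representative timestamp per (user_id, prompt_uuid) group.
grp = df.groupby(["user_id", "prompt_uuid"], as_index=False)["edited_at"].first()

# Gap to the next trip of the same user (the user's last trip gets NaT).
grp["gap"] = grp.groupby("user_id")["edited_at"].diff(-1).abs()

# Groups whose gap is the per-user minimum are the "almost duplicates" to drop.
min_gap = grp.groupby("user_id")["gap"].transform("min")
to_drop = grp.loc[grp["gap"] == min_gap, ["user_id", "prompt_uuid"]]
print(to_drop)
```

On this data the flagged groups are (111, 123) and (222, 223), matching the groups removed between Step 1 and Step 2.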
I may not have understood all the requirements, but I've inferred them from the sample output I'd expect to see. Split the states out of the 'resp' column. Use groupby().first()
to take the first row for each split state. Then fix the column names and column order.
# Split the 'resp' string into separate columns.
df1 = pd.concat([df, df['resp'].str.split(',', expand=True)], axis=1).drop('resp', axis=1)
# Keep the first row per split state, sorted by edit time.
df1 = df1.groupby(1, as_index=False).first().sort_values('edited_at', ascending=True)
df1.drop(0, axis=1, inplace=True)
# Restore the column names and order.
df1.columns = ['resp', 'prompt_uuid', 'displayed_at', 'edited_at', 'latitude', 'longitude', 'prompt_num', 'uuid']
df1 = df1.iloc[:, [1, 0, 2, 3, 4, 5, 6, 7]]
df1
prompt_uuid resp displayed_at edited_at latitude longitude prompt_num uuid
1 ab123-9600-3ee130b2c1ff foot 2019-10-22 22:39:57 2019-10-22 23:15:07 44.618787 -72.616841 0 4248-b313-ef2206755488
2 ab123-9600-3ee130b2c1ff metro 2019-10-22 22:50:35 2019-10-22 23:15:07 44.617968 -72.615851 1 4248-b313-ef2206755488
4 ab123-9600-3ee130b2c1ff work 2019-10-22 22:59:20 2019-10-22 23:15:07 44.616902 -72.614793 2 4248-b313-ef2206755488
3 zw999-1555-8ee140b2w1aa shopping 2019-11-23 08:01:35 2019-10-23 08:38:07 44.617968 -72.615851 1 4248-b313-ef2206755488
0 zw999-1555-8ee140b2w1bb bike 2019-11-23 07:39:57 2019-10-23 08:45:24 44.618787 -72.616841 0 4248-b313-ef2206755488
Because your DataFrame has duplicates with respect to ['user_id', 'prompt_uuid'], a plain diff
won't give you the time difference between consecutive groups. drop_duplicates
first, then compute the time difference within each 'user_id'.
You can then filter that down to the minimum time difference for each user:
s = df.drop_duplicates(['user_id', 'prompt_uuid']).copy()
s['time_diff'] = s.groupby('user_id')['edited_at'].diff(-1).abs()
s = s[s['time_diff'] == s.groupby('user_id')['time_diff'].transform('min')]
# user_id prompt_uuid edited_at prompt_num time_diff
#2 111 123 2019-10-22 07:10:00 0 00:10:00
#11 222 223 2019-11-24 06:00:00 0 00:15:00
Now, if you want to further subset this to rows whose time difference is within some small threshold (i.e. you may want to keep a group whose minimum time difference is 4 hours...):
# Doesn't alter `s` in this example as both min_diffs are < 1 hour.
min_time = '1 hour'
s = s[s['time_diff'].le(min_time)]
Now s
represents the unique ['user_id', 'prompt_uuid']
groups you want to remove from your DataFrame. We do this with an outer
exclusion merge, using the indicator
:
keys = ['user_id', 'prompt_uuid']
df = (df.merge(s[keys], on=keys, how='outer', indicator=True)
.query('_merge == "left_only"')
.drop(columns='_merge'))
user_id prompt_uuid edited_at prompt_num
0 111 121 2019-10-22 05:00:00 0
1 111 121 2019-10-22 05:00:00 1
5 111 124 2019-10-22 07:20:00 0
6 111 124 2019-10-22 07:20:00 1
7 111 124 2019-10-22 07:20:00 2
8 111 124 2019-10-22 07:20:00 3
9 111 125 2019-10-22 19:20:00 0
10 111 125 2019-10-22 19:20:00 1
13 222 244 2019-11-24 06:15:00 0
14 222 244 2019-11-24 06:15:00 1
15 222 255 2019-11-24 18:15:10 0
16 222 255 2019-11-24 18:15:10 1
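For what it's worth, the exclusion merge can also be written as an anti-join using a MultiIndex membership test. A sketch with the question's column names, where `s` stands for the flagged minimum-gap groups found above (data abbreviated):

```python
import pandas as pd

df = pd.DataFrame({"user_id":     [111, 111, 111, 222, 222],
                   "prompt_uuid": [121, 123, 124, 223, 244],
                   "prompt_num":  [0, 0, 0, 0, 0]})
# Groups flagged for removal (the minimum-gap groups per user).
s = pd.DataFrame({"user_id": [111, 222], "prompt_uuid": [123, 223]})

keys = ["user_id", "prompt_uuid"]
# True for rows whose (user_id, prompt_uuid) pair appears in s.
flagged = df.set_index(keys).index.isin(
    list(s[keys].itertuples(index=False, name=None)))
df = df[~flagged]
print(df)
```

Unlike the merge, this keeps the original index and row order untouched, at the cost of building a temporary MultiIndex.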