How to find duplicates based on multiple columns in a rolling window in pandas?
Sample data
{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 90, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 90, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}}
.
.
I have some code like this
    import json
    import sys

    import pandas as pd

    df = pd.DataFrame()
    for line in sys.stdin:
        data = json.loads(line)
        # df1 = pd.DataFrame(data["transaction"], index=[len(df.index)])
        df1 = pd.DataFrame(data["transaction"], index=[data['transaction']['time']])
        df1['time'] = pd.to_datetime(df1['time'])
        df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
        # df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()
    print(df)
    print(len(df[df.merchant.eq(data['transaction']['merchant']) & df.amount.eq(data['transaction']['amount'])].index))
Current output
2019-02-13T10:00:00.000Z merchantA 20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z merchantB 90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z merchantC 90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z merchantD 90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z merchantE 90 2019-02-13 11:01:30
2019-02-13T11:02:30.000Z merchantE 90 2019-02-13 11:02:30
2
Expected output
2019-02-13T10:00:00.000Z merchantA 20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z merchantB 90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z merchantC 90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z merchantD 90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z merchantE 90 2019-02-13 11:01:30
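The data is streaming in. I want to check whether a duplicate record (same merchant and same amount) arrives within two minutes; if it does, I discard it and do no further processing on it, so that only one copy is printed.
Do I have to zip the index or use groupby? But then how do I compare on multiple columns?
Or is there some rolling condition over two columns? I could not find a way to do that.
What am I missing here?
Thanks
Edit
    # dup = df[df.duplicated(subset=['merchant', 'amount'], keep=False)]
    res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
    # res['timediff'] = pd.to_timedelta((data['transaction']['time'] - res['time']), unit='T')
    res['timediff'] = (data['transaction']['time'] - res['time'])
    if len(res.index) > 1:
        print(res)
So I am trying something like this: if the difference is less than 120 seconds, I can handle the record as a duplicate.
But at the moment the resulting df looks like this: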
merchant amount time concat timediff
2019-02-13 11:03:00 merchantF 10 2019-02-13 11:03:00 merchantF10 -1 days +23:59:20
2019-02-13 11:02:20 merchantF 10 2019-02-13 11:02:20 merchantF10 00:00:00
2019-02-13 11:01:30 merchantE 10 2019-02-13 11:01:30 merchantE10 00:01:00
2019-02-13 11:02:00 merchantE 10 2019-02-13 11:02:00 merchantE10 00:00:30
2019-02-13 11:02:30 merchantE 10 2019-02-13 11:02:30 merchantE10 00:00:00
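The first value is shown as -1 days +23:59:20. Can I take its absolute value?
How do I convert the time difference into something that can be compared with 120 seconds? pd.to_timedelta() does not work for me, or I am using it wrong.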
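First, you can split the data into rolling 120-second chunks.
Then, on each chunk, you can apply one of the following.
Use duplicated and keep the flagged rows:
    df = df[df.duplicated(subset=['val1','val2','val3'], keep=False)]
Or groupby:
    df.groupby(['val1','val2','val3']).count()
Or even SQL DISTINCT:
https://www.w3schools.com/sql/sql_distinct.asp
Please post what you have tried. The methods above work for string, float, datetime, and integer dtypes.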
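As a rough illustration of the idea for data that is already loaded as a batch, the sketch below swaps the chunking step for a per-group time difference: it sorts by time and flags a row when the previous row with the same merchant and amount is at most 120 seconds older. The toy rows are taken from the question's sample data; the variable names `gap`, `is_dup`, and `deduped` are assumptions made for this example.

```python
import pandas as pd

# Toy batch built from the question's sample rows (naive timestamps for brevity).
df = pd.DataFrame({
    "merchant": ["merchantA", "merchantB", "merchantE", "merchantE"],
    "amount": [20, 90, 90, 90],
    "time": pd.to_datetime([
        "2019-02-13 10:00:00", "2019-02-13 11:00:01",
        "2019-02-13 11:01:30", "2019-02-13 11:02:30",
    ]),
})

df = df.sort_values("time")
# Time elapsed since the previous row with the same (merchant, amount);
# the first row of each group gets NaT.
gap = df.groupby(["merchant", "amount"])["time"].diff()
# NaT <= Timedelta evaluates to False, so first occurrences are never flagged.
is_dup = gap <= pd.Timedelta(seconds=120)
deduped = df[~is_dup]
print(deduped)
```

Note that this compares each row with the immediately preceding match, not with the last row that was kept, so a long run of matches spaced less than 120 seconds apart would all be dropped. Whether that is acceptable depends on the use case.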
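So I got it working, but not with rolling windows, because they do not support string columns. That feature has also been reported and requested on the pandas repo.
A snippet of my solution (it runs inside the for loop, after df1 has been built):
    ts = pd.to_datetime(data['transaction']['time'])
    if len(df.index) > 0:
        res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
        # True for each stored match that lies within 120 seconds of the new record
        within = (ts - res['time']).dt.total_seconds().abs() <= 120
        if within.any():
            continue  # duplicate inside the window: drop it
    df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
    print(df)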
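The comparison in the snippet relies on turning a timedelta into plain seconds. A small worked example may make that step clearer; the two stored times are made up for illustration, with the first chosen so that the difference is the -1 days +23:59:20 value from the question.

```python
import pandas as pd

new_time = pd.Timestamp("2019-02-13 11:02:20")
stored = pd.Series(pd.to_datetime(["2019-02-13 11:03:00", "2019-02-13 11:00:00"]))

diff = new_time - stored           # timedelta Series; the first entry is negative
seconds = diff.dt.total_seconds()  # plain floats: -40.0 and 140.0
within = seconds.abs() <= 120      # boolean Series: True, False
print(within.tolist())
```

A negative timedelta is displayed as a negative day count plus a positive remainder, so -1 days +23:59:20 is simply -40 seconds. Taking the absolute value after `total_seconds()` makes the check symmetric for records that arrive out of order.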
Sample data:
{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}
Output:
merchant amount time
2019-02-13 10:00:00 merchantA 20 2019-02-13 10:00:00
2019-02-13 11:00:01 merchantB 90 2019-02-13 11:00:01
2019-02-13 11:00:10 merchantC 10 2019-02-13 11:00:10
2019-02-13 11:00:20 merchantD 10 2019-02-13 11:00:20
2019-02-13 11:01:30 merchantE 10 2019-02-13 11:01:30
2019-02-13 11:03:00 merchantF 10 2019-02-13 11:03:00
2019-02-13 11:05:20 merchantF 10 2019-02-13 11:05:20
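For reference, the same logic can be written as a self-contained function. This is only a sketch under a few assumptions: the name `dedupe_stream` and the `window_seconds` parameter are hypothetical, and accepted records are collected in a plain list with the DataFrame built once at the end, instead of growing a DataFrame on every line. The input rows are the sample data above.

```python
import json

import pandas as pd


def dedupe_stream(lines, window_seconds=120):
    """Keep a record unless an already kept record with the same merchant and
    amount lies within window_seconds of it, in either direction."""
    window = pd.Timedelta(seconds=window_seconds)
    kept = []
    for line in lines:
        tx = json.loads(line)["transaction"]
        ts = pd.to_datetime(tx["time"])
        is_dup = any(
            k["merchant"] == tx["merchant"]
            and k["amount"] == tx["amount"]
            and abs(ts - k["time"]) <= window
            for k in kept
        )
        if not is_dup:
            kept.append({"merchant": tx["merchant"], "amount": tx["amount"], "time": ts})
    return pd.DataFrame(kept, columns=["merchant", "amount", "time"])


# The sample data above, rebuilt as JSON lines in arrival order.
rows = [
    ("merchantA", 20, "10:00:00"), ("merchantB", 90, "11:00:01"),
    ("merchantC", 10, "11:00:10"), ("merchantD", 10, "11:00:20"),
    ("merchantE", 10, "11:01:30"), ("merchantF", 10, "11:03:00"),
    ("merchantE", 10, "11:02:00"), ("merchantF", 10, "11:02:20"),
    ("merchantE", 10, "11:02:30"), ("merchantF", 10, "11:05:20"),
    ("merchantE", 10, "11:00:30"),
]
sample = [
    json.dumps({"transaction": {"merchant": m, "amount": a, "time": f"2019-02-13T{t}.000Z"}})
    for m, a, t in rows
]

result = dedupe_stream(sample)
print(result)
```

To read from standard input as in the question, the call would be `dedupe_stream(sys.stdin)`. The scan over `kept` is linear in the number of accepted records, which is fine for small streams; a larger stream would probably want a lookup keyed by merchant and amount.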