Cannot reindex from a duplicate axis while adding missing dates and counting them
I have a problem. I want to do some date calculations, but unfortunately I get the error ValueError: cannot reindex from a duplicate axis. I looked at What does `ValueError: cannot reindex from a duplicate axis` mean?, but it didn't help me. How can I solve this?
I tried print(True in df.index.duplicated()) [OUT] False
# Did not work for me
#df[df.index.duplicated()]
#df = df.loc[:,~df.columns.duplicated()]
#df = df.reset_index()
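That check misses the real problem: `df.index.duplicated()` inspects the row index, which here is still the default (unique) RangeIndex, while the error comes from duplicated `fromDate` values within a `customerId` group once `fromDate` becomes the index. A minimal sketch of a check that does catch it, on a cut-down version of the data:

```python
import pandas as pd

d = {'customerId': [1, 2, 3, 3],
     'fromDate': ['2021-02-22', '2021-05-17', '2021-02-22', '2021-02-22']}
df = pd.DataFrame(d)

# The default RangeIndex is unique, so this check finds nothing:
print(df.index.duplicated().any())              # False

# But the (customerId, fromDate) pairs are not unique -- customer 3 has
# 2021-02-22 twice, which breaks asfreq() after set_index('fromDate'):
dupes = df.duplicated(['customerId', 'fromDate']).any()
print(dupes)                                    # True
```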
DataFrame
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
8 3 2021-05-17
9 3 2021-07-17
10 3 2021-02-22
11 3 2021-02-22
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18',
                  '2021-05-17', '2021-05-17', '2021-07-17',
                  '2021-02-22', '2021-02-22']}
df = pd.DataFrame(data=d)
#display(df)
#convert to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#sort ascending by both columns so missing dates are added correctly
df = df.sort_values(['customerId','fromDate'])
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#add missing dates per customerId; rows with NaT in fromDate are dropped first
df = (df.dropna(subset=['fromDate'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())
[OUT]
ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-3f715dc564ee> in <module>()
3 .set_index('fromDate')
4 .groupby('customerId')['lastInteractivity']
----> 5 .apply(lambda x: x.asfreq('d'))
6 .reset_index())
In fact, I reached the same conclusion @ALollz stated in his comment: by using drop_duplicates you get the expected result:
#add missing dates per customerId; duplicate (fromDate, customerId) pairs
#and rows with NaT in fromDate are dropped first
df = (df.dropna(subset=['fromDate'])
        .drop_duplicates(['fromDate', 'customerId'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())
Output:
customerId fromDate lastInteractivity
0 1 2021-02-10 468 days
1 1 2021-02-11 NaT
2 1 2021-02-12 NaT
3 1 2021-02-13 NaT
4 1 2021-02-14 NaT
...
485 3 2021-07-13 NaT
486 3 2021-07-14 NaT
487 3 2021-07-15 NaT
488 3 2021-07-16 NaT
489 3 2021-07-17 311 days
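Putting it together, here is a self-contained sketch of the whole pipeline with the `drop_duplicates` fix applied. The reference date is pinned to 2022-05-24 (an assumption that happens to reproduce the 468- and 311-day figures above) instead of `pd.to_datetime('today')`, purely so the result is deterministic:

```python
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22', '2021-02-10',
                  '2021-09-07', None, '2022-01-18', '2021-05-17',
                  '2021-05-17', '2021-07-17', '2021-02-22', '2021-02-22']}
df = pd.DataFrame(data=d)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df = df.sort_values(['customerId', 'fromDate'])

# Pinned reference date (assumed) rather than pd.to_datetime('today'):
today = pd.Timestamp('2022-05-24')
df['lastInteractivity'] = today - df['fromDate']

out = (df.dropna(subset=['fromDate'])
         .drop_duplicates(['fromDate', 'customerId'])  # the fix
         .set_index('fromDate')
         .groupby('customerId')['lastInteractivity']
         .apply(lambda x: x.asfreq('d'))               # fill daily gaps with NaT
         .reset_index())

# One row per day of each customer's date range:
# 343 (customer 1) + 1 (customer 2) + 146 (customer 3) = 490 rows
print(len(out))    # 490
```

Without `drop_duplicates`, customer 3's two rows for 2021-02-22 would make the group's DatetimeIndex non-unique, and `asfreq` raises the "cannot reindex from a duplicate axis" error.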