Cannot reindex from a duplicate axis while adding missing dates and counting them
I have a problem. I want to do some date calculations, but unfortunately I get the error ValueError: cannot reindex from a duplicate axis. I looked at What does `ValueError: cannot reindex from a duplicate axis` mean?, but it didn't help me. How can I solve this?
I tried print(True in df.index.duplicated()) [OUT] False
# Did not work for me
#df[df.index.duplicated()]
#df = df.loc[:,~df.columns.duplicated()]
#df = df.reset_index()
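That check misses the real problem: `df.index.duplicated()` inspects the row index, which here is still the default (unique) RangeIndex, while the error comes from duplicated `fromDate` values within a `customerId` group once `fromDate` becomes the index. A minimal sketch of a check that does catch it, on a cut-down version of the data:

```python
import pandas as pd

d = {'customerId': [1, 2, 3, 3],
     'fromDate': ['2021-02-22', '2021-05-17', '2021-02-22', '2021-02-22']}
df = pd.DataFrame(d)

# The default RangeIndex is unique, so this check finds nothing:
print(df.index.duplicated().any())              # False

# But the (customerId, fromDate) pairs are not unique -- customer 3 has
# 2021-02-22 twice, which breaks asfreq() after set_index('fromDate'):
dupes = df.duplicated(['customerId', 'fromDate']).any()
print(dupes)                                    # True
```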
DataFrame
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
8 3 2021-05-17
9 3 2021-07-17
10 3 2021-02-22
11 3 2021-02-22
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18',
                  '2021-05-17', '2021-05-17', '2021-07-17',
                  '2021-02-22', '2021-02-22']}
df = pd.DataFrame(data=d)
#display(df)
#convert to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#sort ascending by both columns so missing dates are added correctly
df = df.sort_values(['customerId','fromDate'])
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#add missing dates per customerId; rows with NaT in fromDate are dropped first
df = (df.dropna(subset=['fromDate'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())
[OUT]
ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-3f715dc564ee> in <module>()
3 .set_index('fromDate')
4 .groupby('customerId')['lastInteractivity']
----> 5 .apply(lambda x: x.asfreq('d'))
6 .reset_index())
In fact, I reached the same conclusion @ALollz stated in his comment: by using drop_duplicates you get the expected result:
#add missing dates per customerId; duplicate (fromDate, customerId) pairs
#and rows with NaT in fromDate are dropped first
df = (df.dropna(subset=['fromDate'])
        .drop_duplicates(['fromDate', 'customerId'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())
Output:
customerId fromDate lastInteractivity
0 1 2021-02-10 468 days
1 1 2021-02-11 NaT
2 1 2021-02-12 NaT
3 1 2021-02-13 NaT
4 1 2021-02-14 NaT
...
485 3 2021-07-13 NaT
486 3 2021-07-14 NaT
487 3 2021-07-15 NaT
488 3 2021-07-16 NaT
489 3 2021-07-17 311 days
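Putting it together, here is a self-contained sketch of the whole pipeline with the `drop_duplicates` fix applied. The reference date is pinned to 2022-05-24 (an assumption that happens to reproduce the 468- and 311-day figures above) instead of `pd.to_datetime('today')`, purely so the result is deterministic:

```python
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22', '2021-02-10',
                  '2021-09-07', None, '2022-01-18', '2021-05-17',
                  '2021-05-17', '2021-07-17', '2021-02-22', '2021-02-22']}
df = pd.DataFrame(data=d)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df = df.sort_values(['customerId', 'fromDate'])

# Pinned reference date (assumed) rather than pd.to_datetime('today'):
today = pd.Timestamp('2022-05-24')
df['lastInteractivity'] = today - df['fromDate']

out = (df.dropna(subset=['fromDate'])
         .drop_duplicates(['fromDate', 'customerId'])  # the fix
         .set_index('fromDate')
         .groupby('customerId')['lastInteractivity']
         .apply(lambda x: x.asfreq('d'))               # fill daily gaps with NaT
         .reset_index())

# One row per day of each customer's date range:
# 343 (customer 1) + 1 (customer 2) + 146 (customer 3) = 490 rows
print(len(out))    # 490
```

Without `drop_duplicates`, customer 3's two rows for 2021-02-22 would make the group's DatetimeIndex non-unique, and `asfreq` raises the "cannot reindex from a duplicate axis" error.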