Reindexing from a duplicate axis

Question

I have the following code:

import datetime

import fxcmpy
import pandas as pd

TOKEN = "d0d2a3295349c625be6c0cbe23f9136221eb45ef"
con = fxcmpy.fxcmpy(access_token=TOKEN, log_level='error')
symbols = con.get_instruments()

start = datetime.datetime(2015, 1, 1)
end = datetime.datetime.today()
data = con.get_candles('NGAS', period='D1', start=start, end=end)
data.index = pd.to_datetime(data.index, format='%Y-%m-%d')
# Drop the time of day so every label is midnight of its calendar date
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)

The last line, data = data.reindex(full_dates), raises the following error:

ValueError: cannot reindex from a duplicate axis

What I want to do is fill in the missing dates and reindex the column.

As @jezrael pointed out, "problem is duplicated values in DatetimeIndex, so reindex cannot be used here".

I have used the same code before and it worked fine. I am curious why it does not work in this case:

import datetime

import pandas as pd
from pandas_datareader import data as web

stock = 'F'
start = datetime.date(2008,1,1)
end = datetime.date.today()
data = web.DataReader(stock, 'yahoo',start, end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')

full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)

Apart from the data provider, the code is the same, yet this version works and the one above does not. Why?

Answer 1

The problem is duplicated values in the DatetimeIndex, so reindex cannot be used here.
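To see why the duplicates matter, here is a minimal sketch on made-up candles (the timestamps and prices are hypothetical, not taken from the FXCM feed). It assumes, as in the question, that two rows land on the same calendar day once the index is normalized, and it shows one way to list the offending rows before calling reindex:

```python
import pandas as pd

# Hypothetical candles: two of the three rows fall on 2019-12-04.
idx = pd.to_datetime(["2019-12-03 22:00", "2019-12-04 06:00", "2019-12-04 22:00"])
data = pd.DataFrame({"bidclose": [2.441, 2.392, 2.406]}, index=idx)

# normalize() drops the time of day, so two labels collapse into one date.
data = data.set_index(data.index.normalize())
print(data.index.is_unique)

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = data[data.index.duplicated(keep=False)]
print(dupes)

full_dates = pd.date_range("2019-12-01", "2019-12-06")
try:
    data.reindex(full_dates)
    error_message = None
except ValueError as exc:
    # The exact wording of the message differs between pandas versions.
    error_message = str(exc)
print(error_message)
```

The Yahoo data in the second snippet presumably has exactly one row per trading day, which would explain why the same reindex call succeeds there.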

One possible solution is to use DataFrame.join with a helper DataFrame that contains all the dates:

data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
df = pd.DataFrame({'date':full_dates}).join(data, on='date')
print (df)
           date  bidopen  bidclose  bidhigh  bidlow  askopen  askclose  \
0    2015-01-01      NaN       NaN      NaN     NaN      NaN       NaN   
1    2015-01-02   2.9350     2.947   3.0910   2.860   2.9450     2.957   
2    2015-01-03      NaN       NaN      NaN     NaN      NaN       NaN   
3    2015-01-04      NaN       NaN      NaN     NaN      NaN       NaN   
4    2015-01-05   2.9470     2.912   3.1710   2.871   2.9570     2.922   
        ...      ...       ...      ...     ...      ...       ...   
1797 2019-12-03   2.3890     2.441   2.5115   2.371   2.3970     2.449   
1798 2019-12-04   2.3455     2.392   2.3970   2.341   2.3535     2.400   
1798 2019-12-04   2.4410     2.406   2.4645   2.370   2.4490     2.414   
1799 2019-12-05   2.4060     2.421   2.4650   2.399   2.4140     2.429   
1800 2019-12-06      NaN       NaN      NaN     NaN      NaN       NaN   

      askhigh  asklow  tickqty  
0         NaN     NaN      NaN  
1       3.101  2.8700  12688.0  
2         NaN     NaN      NaN  
3         NaN     NaN      NaN  
4       3.181  2.8810  21849.0  
      ...     ...      ...  
1797    2.519  2.3785  36679.0  
1798    2.406  2.3505   5333.0  
1798    2.473  2.3780  74881.0  
1799    2.473  2.4070  29238.0  
1800      NaN     NaN      NaN  

[1802 rows x 10 columns]
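Note that the join fills in the missing dates but keeps both rows for a duplicated date (2019-12-04 appears twice in the output above). A small sketch with hypothetical values illustrates this; the column name and numbers are made up for the example:

```python
import pandas as pd

# Hypothetical data with a duplicated date label.
idx = pd.to_datetime(["2019-12-03", "2019-12-04", "2019-12-04"])
data = pd.DataFrame({"bidclose": [2.441, 2.392, 2.406]}, index=idx)

# Helper frame with every calendar date in the range of interest.
full_dates = pd.date_range("2019-12-02", "2019-12-05")
joined = pd.DataFrame({"date": full_dates}).join(data, on="date")
print(joined)

# Four calendar dates go in, five rows come out: the duplicate survives.
n_duplicated_dates = int(joined["date"].duplicated().sum())
print(len(joined), n_duplicated_dates)
```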

But I think further processing would be problematic (because of the duplicated index), so instead use DataFrame.resample by day with the aggregation functions given in a dictionary:

df = data.resample('D').agg({'bidopen': 'first', 
                             'bidclose': 'last',
                             'bidhigh': 'max', 
                             'bidlow': 'min', 
                             'askopen': 'first', 
                             'askclose': 'last',
                             'askhigh': 'max', 
                             'asklow': 'min', 
                             'tickqty':'sum'})

print (df)
            bidopen  bidclose  bidhigh  bidlow  askopen  askclose  askhigh  \
date                                                                         
2015-01-02   2.9350    2.9470   3.0910   2.860   2.9450    2.9570    3.101   
2015-01-03      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
2015-01-04      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
2015-01-05   2.9470    2.9120   3.1710   2.871   2.9570    2.9220    3.181   
2015-01-06   2.9120    2.9400   2.9510   2.807   2.9220    2.9500    2.961   
            ...       ...      ...     ...      ...       ...      ...   
2019-12-01      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
2019-12-02   2.3505    2.3455   2.3670   2.292   2.3590    2.3535    2.375   
2019-12-03   2.3890    2.4410   2.5115   2.371   2.3970    2.4490    2.519   
2019-12-04   2.3455    2.4060   2.4645   2.341   2.3535    2.4140    2.473   
2019-12-05   2.4060    2.4210   2.4650   2.399   2.4140    2.4290    2.473   

            asklow  tickqty  
date                         
2015-01-02  2.8700    12688  
2015-01-03     NaN        0  
2015-01-04     NaN        0  
2015-01-05  2.8810    21849  
2015-01-06  2.8170    17955  
           ...      ...  
2019-12-01     NaN        0  
2019-12-02  2.3000    31173  
2019-12-03  2.3785    36679  
2019-12-04  2.3505    80214  
2019-12-05  2.4070    29238  

[1799 rows x 9 columns]
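resample only spans the first to the last observed date (2015-01-02 to 2019-12-05 in the output above), so it does not by itself cover the whole start-to-end range from the question. Because the resampled index is unique, a plain reindex should work afterwards. A minimal sketch under that assumption, with hypothetical rows and only three of the columns:

```python
import pandas as pd

# Hypothetical candles; 2019-12-04 appears twice, 2019-12-05 is missing.
idx = pd.to_datetime(["2019-12-03", "2019-12-04", "2019-12-04", "2019-12-06"])
data = pd.DataFrame(
    {
        "bidopen": [2.3890, 2.3455, 2.4410, 2.4060],
        "bidhigh": [2.5115, 2.3970, 2.4645, 2.4650],
        "tickqty": [36679, 5333, 74881, 29238],
    },
    index=idx,
)

# Collapse each calendar day into a single row.
daily = data.resample("D").agg(
    {"bidopen": "first", "bidhigh": "max", "tickqty": "sum"}
)

# The index is unique now, so reindexing to a wider range no longer raises.
full_dates = pd.date_range("2019-12-01", "2019-12-08")
filled = daily.reindex(full_dates)
print(filled)
```

Dates added by the reindex get NaN in every column, including tickqty, whereas empty days inside the resampled range get a tickqty of 0 from the sum.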
