Reindex from a duplicate axis
I have the following code:
import datetime
import pandas as pd
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import fxcmpy  # needed for fxcmpy.fxcmpy(...) below
TOKEN = "d0d2a3295349c625be6c0cbe23f9136221eb45ef"
con = fxcmpy.fxcmpy(access_token=TOKEN, log_level='error')
symbols = con.get_instruments()
start = datetime.datetime(2015,1,1)
end = datetime.datetime.today()
data = con.get_candles('NGAS', period='D1', start = start, end = end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)
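The last line, data = data.reindex(full_dates), raises the following error:
ValueError: cannot reindex from a duplicate axis
What I want to do is fill in the missing dates and reindex the column.
As @jezrael noted, "problem is duplicated values in DatetimeIndex, so reindex cannot be used here".
I have used the same code before and it worked fine. I am curious why it does not work in this case: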
import pandas as pd
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime
import numpy as np
stock = 'F'
start = datetime.date(2008,1,1)
end = datetime.date.today()
data = web.DataReader(stock, 'yahoo',start, end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)
Apart from the data provider, the code is identical, yet this version works and the one above does not. Why?
The problem is duplicated values in the DatetimeIndex, so reindex cannot be used here.
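To see why one snippet fails and the other does not, it can help to check the index for duplicates directly. The sketch below uses made-up toy data (the timestamps and the name `toy` are assumptions, not the real FXCM candles): two rows land on the same calendar day after `normalize()`, which is enough to make `reindex` raise. The Yahoo data presumably has exactly one row per date, which would explain why the same code works there.

```python
import pandas as pd

# Hypothetical toy data: two timestamps fall on the same calendar day.
idx = pd.to_datetime(["2019-12-03 22:00", "2019-12-04 03:00", "2019-12-04 22:00"])
toy = pd.DataFrame({"bidclose": [2.441, 2.392, 2.406]}, index=idx)

# Same step as in the question: drop the time part of each timestamp.
toy = toy.set_index(toy.index.normalize())

print(toy.index.is_unique)  # False

# Show every row whose date occurs more than once.
dupes = toy[toy.index.duplicated(keep=False)]
print(dupes)

full_dates = pd.date_range("2019-12-01", "2019-12-06")
try:
    toy.reindex(full_dates)
    reindex_failed = False
except ValueError as exc:
    # pandas refuses to reindex when the existing index has duplicates.
    reindex_failed = True
    print(exc)
```

Running the same `is_unique` and `duplicated` checks on the real `data` should show which dates are repeated.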
One possible solution is to use DataFrame.join with a helper DataFrame that contains all the dates:
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
df = pd.DataFrame({'date':full_dates}).join(data, on='date')
print (df)
date bidopen bidclose bidhigh bidlow askopen askclose \
0 2015-01-01 NaN NaN NaN NaN NaN NaN
1 2015-01-02 2.9350 2.947 3.0910 2.860 2.9450 2.957
2 2015-01-03 NaN NaN NaN NaN NaN NaN
3 2015-01-04 NaN NaN NaN NaN NaN NaN
4 2015-01-05 2.9470 2.912 3.1710 2.871 2.9570 2.922
... ... ... ... ... ... ...
1797 2019-12-03 2.3890 2.441 2.5115 2.371 2.3970 2.449
1798 2019-12-04 2.3455 2.392 2.3970 2.341 2.3535 2.400
1798 2019-12-04 2.4410 2.406 2.4645 2.370 2.4490 2.414
1799 2019-12-05 2.4060 2.421 2.4650 2.399 2.4140 2.429
1800 2019-12-06 NaN NaN NaN NaN NaN NaN
askhigh asklow tickqty
0 NaN NaN NaN
1 3.101 2.8700 12688.0
2 NaN NaN NaN
3 NaN NaN NaN
4 3.181 2.8810 21849.0
... ... ...
1797 2.519 2.3785 36679.0
1798 2.406 2.3505 5333.0
1798 2.473 2.3780 74881.0
1799 2.473 2.4070 29238.0
1800 NaN NaN NaN
[1802 rows x 10 columns]
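The row count above is larger than the number of calendar days because the join keeps every duplicated date. A minimal sketch on invented toy data (the values and the names `toy` and `joined` are assumptions for illustration) shows the same effect on a small scale:

```python
import pandas as pd

# Hypothetical toy data with one duplicated date (2019-12-04).
idx = pd.to_datetime(["2019-12-03", "2019-12-04", "2019-12-04"])
toy = pd.DataFrame({"bidclose": [2.441, 2.392, 2.406]}, index=idx)

# Helper frame with every calendar day in the range.
full_dates = pd.date_range("2019-12-02", "2019-12-05")
joined = pd.DataFrame({"date": full_dates}).join(toy, on="date")

print(joined)
# 4 calendar days, but 5 rows: the duplicated date appears twice.
print(len(full_dates), len(joined))
```

Missing days are filled with NaN, while the duplicated day still occupies two rows, so any later step that expects one row per date has to deal with it.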
However, I think further processing would be problematic because of the duplicated index, so use DataFrame.resample by days with aggregation functions in a dictionary:
df = data.resample('D').agg({'bidopen': 'first',
                             'bidclose': 'last',
                             'bidhigh': 'max',
                             'bidlow': 'min',
                             'askopen': 'first',
                             'askclose': 'last',
                             'askhigh': 'max',
                             'asklow': 'min',
                             'tickqty': 'sum'})
print (df)
bidopen bidclose bidhigh bidlow askopen askclose askhigh \
date
2015-01-02 2.9350 2.9470 3.0910 2.860 2.9450 2.9570 3.101
2015-01-03 NaN NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN NaN
2015-01-05 2.9470 2.9120 3.1710 2.871 2.9570 2.9220 3.181
2015-01-06 2.9120 2.9400 2.9510 2.807 2.9220 2.9500 2.961
... ... ... ... ... ... ...
2019-12-01 NaN NaN NaN NaN NaN NaN NaN
2019-12-02 2.3505 2.3455 2.3670 2.292 2.3590 2.3535 2.375
2019-12-03 2.3890 2.4410 2.5115 2.371 2.3970 2.4490 2.519
2019-12-04 2.3455 2.4060 2.4645 2.341 2.3535 2.4140 2.473
2019-12-05 2.4060 2.4210 2.4650 2.399 2.4140 2.4290 2.473
asklow tickqty
date
2015-01-02 2.8700 12688
2015-01-03 NaN 0
2015-01-04 NaN 0
2015-01-05 2.8810 21849
2015-01-06 2.8170 17955
... ...
2019-12-01 NaN 0
2019-12-02 2.3000 31173
2019-12-03 2.3785 36679
2019-12-04 2.3505 80214
2019-12-05 2.4070 29238
[1799 rows x 9 columns]
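Note that the resampled frame starts at the first date that has data, not at the requested start date. Once the duplicates have been aggregated away the index is unique, so reindex should work again if the full range is needed. A minimal sketch on toy data (only a few of the columns; the rows are copied from the output above, and the names `toy`, `daily` and `filled` are assumptions):

```python
import pandas as pd

# Toy data: 2019-12-04 appears twice, 2019-12-03 is missing.
idx = pd.to_datetime(["2019-12-02", "2019-12-04", "2019-12-04"])
toy = pd.DataFrame(
    {
        "bidopen": [2.3505, 2.3455, 2.4410],
        "bidclose": [2.3455, 2.3920, 2.4060],
        "bidhigh": [2.3670, 2.3970, 2.4645],
        "bidlow": [2.2920, 2.3410, 2.3700],
        "tickqty": [31173, 5333, 74881],
    },
    index=idx,
)

# One row per day: first open, last close, highest high, lowest low, summed volume.
daily = toy.resample("D").agg(
    {
        "bidopen": "first",
        "bidclose": "last",
        "bidhigh": "max",
        "bidlow": "min",
        "tickqty": "sum",
    }
)
print(daily)

# The index is now unique, so reindexing to a wider date range no longer raises.
full_dates = pd.date_range("2019-12-01", "2019-12-05")
filled = daily.reindex(full_dates)
print(filled)
```

The two 2019-12-04 rows collapse into a single candle, and days without data get NaN prices and a tickqty of 0 after the resample step.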