Printed DataFrame differs after df.to_pickle: values become NaN
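I have a dictionary of dataframes, stocks.
I pull up a stock's dataframe by plugging in its ticker like this:
stocks['OPK']
which brings up the stock 'OPK'.
The output is: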
stocks['OPK']
Open High Low Close Volume Adj Close
Date
2010-01-04 1.80 1.97 1.76 1.95 234500.0 1.95
2010-01-05 1.64 1.95 1.64 1.93 135800.0 1.93
2010-01-06 1.90 1.92 1.77 1.79 546600.0 1.79
2010-01-07 1.79 1.94 1.76 1.92 138700.0 1.92
Edit: I've added the code to build the same Panel I'm working with, so anyone trying to solve my problem can test their ideas without trouble.
Here is the code to get the Panel (for reproducibility):
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dt
import re
startDate = '2010-01-01'
endDate = '2016-09-07'
stocks_query = ['AAPL','OPK']
stocks = web.DataReader(stocks_query, data_source='yahoo',
                        start=startDate, end=endDate)
stocks = stocks.swapaxes('items', 'minor_axis')
This results in the output:
Dimensions: 2 (items) x 1682 (major_axis) x 6 (minor_axis)
Items axis: AAPL to OPK
Major_axis axis: 2010-01-04 00:00:00 to 2016-09-07 00:00:00
Minor_axis axis: Open to Adj Close
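I'm adding custom columns through a function and then saving the result to a pickle. After adding the columns, the dataframe prints without any problems. However, when I save it to a pickle and load it back, two of the six newly created columns end up with missing values. I'd like to be able to save it to a pickle so I don't have to keep recreating the columns, but I also want to do it through a function because I want the columns to be created automatically.
Here is my code (I've removed some parts for brevity):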
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dt
import re
startDate = '2010-01-01'
endDate = dt.date.today()
stocks_query = ('AAPL','OPK')
source = 'yahoo'
columns =['Open', 'High', 'Low'......'p_changed']
def load_data(stocks_query, data_source, start, end):
    file_extension = '_'.join(stocks_query)
    stocks = pd.read_pickle(
        r'C:\Users\Moondra\MachineLearning\Stock_Market_Predictor-master\{}.pkl'.format(file_extension))
    try:
        stocks[stocks_query[0]]['log return']  # this checks if the customized columns have been added
    except KeyError:
        print('There was an error, so we are adding the columns')
        stocks = new_columns(stocks, columns)  # calls the function to add the columns
        stocks.to_pickle(
            r'C:\Users\Moondra\MachineLearning\Stock_Market_Predictor-master\{}.pkl'.format(file_extension))  # saves to a pickle file
    return stocks

def new_columns(stocks, columns):  # this is the function that adds new columns
    stocks = stocks.reindex_axis([columns], 'minor_axis')
    for i in stocks:
        stocks[i]['log_return'] = np.log(stocks[i]['Close'] / stocks[i]['Close'].shift(1))
        stocks[i]['close_open'] = stocks[i].Open - stocks[i].Close.shift(1)
        stocks[i]['30_Avg_Vol'] = stocks[i]['Volume'].rolling(min_periods=15, window=30).mean()
        stocks[i]['changed'] = stocks[i]['close_open'] * stocks[i]['close_open'].shift(-1) < 0
        stocks[i]['p_changed'] = (stocks[i]['close_open'] + stocks[i]['close_open'].shift(-1) < stocks[i]['close_open'].shift(-1)) \
            & (stocks[i]['close_open'] * stocks[i]['close_open'].shift(-1) < 0)
    return stocks
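The problem I'm having is with the last two columns.
After running the code and typing stocks['OPK'], I have no problems.
I see all the columns with their values added.
The last two columns are a bit different because they return boolean values, but nothing looks abnormal.
Here is my output (no errors):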
Date changed p_changed
2010-01-04 False False
2010-01-05 False False
2010-01-06 False False
2010-01-07 False False
2010-01-08 False False
2010-01-11 True False
2010-01-12 False False
2010-01-13 False False
However, when I load the pickle (note that in the load_data function I save to a pickle right after adding the columns) and type stocks['OPK'], the last two columns show only NaN values.
changed p_changed
Date
2010-01-04 NaN NaN
2010-01-05 NaN NaN
2010-01-06 NaN NaN
2010-01-07 NaN NaN
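I'm not sure why this happens. The other columns I added (log_return, etc.) show no errors. Only the last two columns are boolean, so I suspect it has something to do with that.
Edit: I also tried saving to a pickle outside the function, but the strange NaN output stays the same.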
Man, here's a workaround. Pandas Panel is more trouble than it's worth.
Use this code to convert your stocks data into an ordinary multi-indexed pandas DataFrame and watch it work.
#use this to convert your Panel into multi-indexed pd.DataFrame
stocks_df = pd.concat([stocks[item] for item in stocks.items],keys = stocks.items)
#a new_columns function (note that it's different from yours)
def new_columns(df):  # this is the function that adds new columns
    df.loc[:, 'log_return'] = np.log(df['Close'] / df['Close'].shift(1))
    df.loc[:, 'close_open'] = df.Open - df.Close.shift(1)
    df.loc[:, '30_Avg_Vol'] = df.loc[:, 'Volume'].rolling(min_periods=15, window=30).mean()
    df.loc[:, 'changed'] = df['close_open'] * df['close_open'].shift(-1) < 0
    df.loc[:, 'p_changed'] = (df['close_open'] + df['close_open'].shift(-1) < df['close_open'].shift(-1)) & (df['close_open'] * df['close_open'].shift(-1) < 0)
    return df
#here's how you would run it:
stocks_df = stocks_df.groupby(level=0).apply(new_columns)
#now I pickle it:
stocks_df.to_pickle("pickled_df.pkl")
#here I retrieve it.
stocks_read = pd.read_pickle("pickled_df.pkl")
In [41]: stocks_read.head()
Out[41]:
Open High Low Close Volume \
Date
AAPL 2010-01-04 213.429998 214.499996 212.380001 214.009998 123432400.0
2010-01-05 214.599998 215.589994 213.249994 214.379993 150476200.0
2010-01-06 214.379993 215.230000 210.750004 210.969995 138040000.0
2010-01-07 211.750000 212.000006 209.050005 210.580000 119282800.0
2010-01-08 210.299994 212.000006 209.060005 211.980005 111902700.0
Adj Close log_return close_open 30_Avg_Vol changed \
Date
AAPL 2010-01-04 27.727039 NaN NaN NaN False
2010-01-05 27.774976 0.001727 0.590000 NaN False
2010-01-06 27.333178 -0.016034 0.000000 NaN False
2010-01-07 27.282650 -0.001850 0.780005 NaN True
2010-01-08 27.464034 0.006626 -0.280006 NaN True
p_changed
Date
AAPL 2010-01-04 False
2010-01-05 False
2010-01-06 False
2010-01-07 False
2010-01-08 True
See, if you don't use Panel, everything goes smoothly.
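For anyone who wants to try the idea without the Yahoo download (and on a pandas release where Panel no longer exists), here is a minimal sketch of the same multi-index approach combined with the "load the pickle, or build and save it" behaviour the question was after. The prices are made up, `add_columns` and `load_or_build` are hypothetical helper names, and the per-ticker shifts use `groupby(level=0)` instead of `apply`, which is a substitution of mine rather than the code above.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def add_columns(df):
    """Add derived columns per ticker (level 0 of the row index)."""
    df = df.copy()
    prev_close = df.groupby(level=0)["Close"].shift(1)
    df["log_return"] = np.log(df["Close"] / prev_close)
    df["close_open"] = df["Open"] - prev_close
    next_gap = df.groupby(level=0)["close_open"].shift(-1)
    # True where the gap changes sign from one day to the next.
    df["changed"] = df["close_open"] * next_gap < 0
    # Same two conditions as p_changed in the answer's new_columns.
    df["p_changed"] = (df["close_open"] + next_gap < next_gap) & df["changed"]
    return df


def load_or_build(path, raw_df):
    """Hypothetical cache helper: reuse the pickle only if it has the columns."""
    if os.path.exists(path):
        cached = pd.read_pickle(path)
        if "log_return" in cached.columns:
            return cached
    built = add_columns(raw_df)
    built.to_pickle(path)
    return built


# Synthetic prices stand in for the downloaded data (values are invented).
dates = pd.date_range("2010-01-04", periods=6, freq="B", name="Date")
raw = pd.concat(
    [
        pd.DataFrame({"Open": [10.0, 10.5, 10.2, 10.8, 10.1, 10.4],
                      "Close": [10.4, 10.3, 10.6, 10.3, 10.2, 10.6]}, index=dates),
        pd.DataFrame({"Open": [1.80, 1.64, 1.90, 1.79, 1.95, 1.88],
                      "Close": [1.95, 1.93, 1.79, 1.92, 1.90, 1.91]}, index=dates),
    ],
    keys=["AAPL", "OPK"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "AAPL_OPK.pkl")
    first = load_or_build(path, raw)   # computes the columns and writes the pickle
    second = load_or_build(path, raw)  # reads the pickle back instead of recomputing

print(second.loc["OPK"][["close_open", "changed", "p_changed"]])
```

The cache check tests for a column name with `in cached.columns` rather than catching a KeyError, so a misspelled key cannot silently force a rebuild on every call.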