NaN values when adding lags for an LSTM in Python
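I'm trying to analyse and forecast sales from a dataset. I've prepared my data, but when I create the lags, the monthly sales lag columns contain NaN. What does this NaN mean? The tutorial I'm following doesn't show these NaN values, or at least when the author drops them he still gets some output, but in my case dropping the NaN values leaves me with no output at all...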
from __future__ import division
from datetime import datetime, timedelta, date
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import plotly.plotly as py  # plotly < 4 only; in plotly 4+ this module moved to the chart_studio package
import plotly.offline as pyoff
import plotly.graph_objs as go
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from keras.utils import np_utils
from keras.layers import LSTM
from sklearn.model_selection import KFold, cross_val_score, train_test_split
#initiate plotly
pyoff.init_notebook_mode()
#read data
df = pd.read_csv(r"C:\Users\User\Desktop\UOW\Yr3\FYP\Sample.csv", encoding='latin-1')
df['Order Date'] = pd.to_datetime(df['Order Date'])
df.head(10)
# Drop rows in which every cell is empty (dropna is not in-place by default, so assign the result)
df = df.dropna(axis=0, how='all')
df.shape
# Drop unwanted columns
# Order ID, Ship Date, Ship Mode, Segment, Country, City, State, Postal Code, Region, Product ID,
# Category, Sub-Category, Product Name,
# Discount
df_sales = df.drop(['Order ID', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name','Discount'], axis = 1)
df_sales.head(10)
# represent month in date field as its first day ("%Y-%m-01" pins the day to 01; "%Y-%m-%d" would keep the original day)
df_sales['Order Date'] = pd.to_datetime(df_sales['Order Date']).dt.strftime("%Y-%m-01")
df_sales = df_sales.groupby('Order Date').Sales.sum().reset_index()
df_sales
#plot monthly sales
plot_data = [
    go.Scatter(
        x=df_sales['Order Date'],
        y=df_sales['Sales'],
    )
]
plot_layout = go.Layout(
    title='Monthly Sales'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
# Create a new dataframe to model the difference
df_diff = df_sales.copy()
# Add previous sales to the next row
df_diff['Prev_Sales'] = df_diff['Sales'].shift(1)
# Drop the null values and calculate the difference
df_diff = df_diff.dropna()
df_diff['diff'] = (df_diff['Sales'] - df_diff['Prev_Sales'])
df_diff.head(10)
#plot sales diff
plot_data = [
    go.Scatter(
        x=df_diff['Order Date'],
        y=df_diff['diff'],
    )
]
plot_layout = go.Layout(
    title='Monthly Sales Difference'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
#create dataframe for transformation from time series to supervised
df_supervised = df_diff.drop(['Prev_Sales'],axis=1)
#adding lags
for inc in range(1,13):
    field_name = 'lag_' + str(inc)
    df_supervised[field_name] = df_supervised['diff'].shift(inc)
#drop null values
#df_supervised = df_supervised.dropna().reset_index(drop=True)
df_supervised
Then the output I get is:
   Order Date | Sales | diff | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | lag_6 | lag_7 | lag_8 | lag_9 | lag_10 | lag_11 | lag_12
1  2019-02-01 | 333904.9556 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
2  2019-03-01 | 361431.8218 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3  2019-04-01 | 359930.1225 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
4  2019-05-01 | 348999.4696 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
5  2019-06-01 | 372904.5441 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
6  2019-07-01 | 372936.2013 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN
7  2019-08-01 | 328648.3505 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN
8  2019-09-01 | 371825.2898 | 43176.9393 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN
9  2019-10-01 | 363781.0459 | -8044.2439 | 43176.9393 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN
10 2019-11-01 | 336836.8240 | -26944.2219 | -8044.2439 | 43176.9393 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN
11 2019-12-01 | 374106.0722 | 37269.2482 | -26944.2219 | -8044.2439 | 43176.9393 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN
If I uncomment this line: df_supervised = df_supervised.dropna().reset_index(drop=True)
the output shows only the header:
   Order Date | Sales | diff | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | lag_6 | lag_7 | lag_8 | lag_9 | lag_10 | lag_11 | lag_12
Can anyone help me solve this? Thank you very much!
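NaN stands for "Not a Number".
NaN values commonly appear when you create lags: shift(n) leaves the first n rows of the new column without a value.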
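To see where the NaN values come from, here is a minimal sketch using made-up numbers (the `demo` frame and its values are illustrative assumptions, not the asker's data). Each `shift(n)` moves the column down by n rows, so the first n rows of that lag column have nothing to copy from.

```python
import pandas as pd

# Made-up differences standing in for the 'diff' column (illustrative only).
demo = pd.DataFrame({"diff": [10.0, -5.0, 7.0, 2.0]})

# shift(n) moves values down n rows; the first n rows have no
# earlier value to copy, so pandas fills them with NaN.
for n in range(1, 4):
    demo[f"lag_{n}"] = demo["diff"].shift(n)

# Count the NaN values per column: lag_n contains exactly n of them.
nan_counts = demo.isna().sum()
print(demo)
print(nan_counts)
```

With four rows and three lags, only the last row is complete, so `demo.dropna()` would keep a single row.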
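If you want to keep the data, try filling the NaN values instead of dropping them. Your frame has only 11 rows but 12 lag columns, so every row contains at least one NaN and dropna() removes all of them.
For example: df.fillna(0)
You can start by looking here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html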
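The difference between dropping and filling can be sketched as follows. This is a rough illustration under assumptions: the `frame` variable and its 11 placeholder values are hypothetical and only mirror the shape of the data in the question (11 rows of differences, 12 lags).

```python
import numpy as np
import pandas as pd

# 11 placeholder values, mirroring the 11 rows of differences in the question.
frame = pd.DataFrame({"diff": np.arange(11, dtype=float)})

# Same lag construction as in the question.
for inc in range(1, 13):
    frame["lag_" + str(inc)] = frame["diff"].shift(inc)

dropped = frame.dropna()   # every row has at least one NaN, so nothing survives
filled = frame.fillna(0)   # NaN replaced with 0, all rows kept

print(len(dropped))               # 0
print(len(filled))                # 11
print(filled.isna().sum().sum())  # 0
```

Note that a lag filled with 0 looks to the model like a month with no change in sales, so whether 0 is a sensible fill value depends on the data.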