Can't make predictions with an OLS model
I am building an OLS model, but I cannot make any predictions with it.
Can you explain what I am doing wrong?
Building the model:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'],
'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
'Total':[100,100,200,300,10,20,40,50,60,100,500]}
d = pd.DataFrame(data=d).set_index('Client Number')
df = pd.get_dummies(d,prefix='', prefix_sep='')
X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']
X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()
reg.summary()
Prediction:
d1 = {'City': ['Tokyo','Tokyo','Lisbon'],
'Card': ['Visa','Visa','Visa'],
'Colateral':['Yes','Yes','No'],
'Client Number':[11,12,13],
'Total':[0,0,0]}
df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
y_new = df1[['Lisbon','Tokyo','Visa','No','Yes']]
x_new = df1['Total']
mod = sm.OLS(y_new, x_new)
mod.predict(reg.params)
It then shows: ValueError: shapes (3,1) and (11,) not aligned: 1 (dim 1) != 11 (dim 0)
What am I doing wrong?
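The biggest problem is that you are not applying the same dummy transformation to both datasets: some category values that appear in the training data do not appear in df1, so their dummy columns are never created. You can add the missing values/columns with the following code (from here):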
d1 = {'City': ['Tokyo','Tokyo','Lisbon'],
'Card': ['Visa','Visa','Visa'],
'Colateral':['Yes','Yes','No'],
'Client Number':[11,12,13],
'Total':[0,0,0]}
df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
print(df1.shape) # Shape is 3x6, but it has to be 3x11
# Find the columns that exist in the training set but not in the new data
missing_cols = set( df.columns ) - set( df1.columns )
# Add each missing column to the new data with a default value of 0
for c in missing_cols:
    df1[c] = 0
# Put the columns of the new data in the same order as in the training set
df1 = df1[df.columns]
print(df1.shape) # Shape is 3x11
In addition, you mixed up x_new and y_new. So it should be:
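x_new = df1.drop(['Total'], axis=1).values
y_new = df1['Total'].values
# reg was fit on X1, which has a leading constant column, so x_new needs one too;
# has_constant='add' forces it even though the 'Visa' column is all ones here
x_new = sm.add_constant(x_new, has_constant='add')
mod = sm.OLS(y_new, x_new)
mod.predict(reg.params)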
Note that I use x_new = df1.drop(['Total'], axis=1).values instead of df1[['Lisbon','Tokyo','Visa','No','Yes']] because it is more convenient: (1) it is less prone to typing errors and (2) it takes less code.
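As an alternative to adding the missing columns after the fact, the category levels can be fixed up front so that get_dummies emits the same columns for any dataset. The following is only a sketch under that assumption: the helper name encode and the levels dictionary are hypothetical, and the training frame is shortened to one row per city for brevity.

```python
import pandas as pd

# Shortened training features (same category values as in the question)
train = pd.DataFrame({
    'City': ['Tokyo', 'Lisbon', 'Madrid', 'New York', 'London'],
    'Card': ['Visa', 'Master Card', 'Bitcoin', 'Visa', 'Visa'],
    'Colateral': ['Yes', 'No', 'Yes', 'No', 'Yes'],
})
new = pd.DataFrame({
    'City': ['Tokyo', 'Tokyo', 'Lisbon'],
    'Card': ['Visa', 'Visa', 'Visa'],
    'Colateral': ['Yes', 'Yes', 'No'],
})

# Remember the category levels seen in the training data
levels = {col: sorted(train[col].unique()) for col in train.columns}

def encode(frame, levels):
    """One-hot encode `frame` with fixed category levels (hypothetical helper)."""
    fixed = frame.copy()
    for col, cats in levels.items():
        # Categories that do not occur in `frame` still get a dummy column
        fixed[col] = pd.Categorical(fixed[col], categories=cats)
    return pd.get_dummies(fixed, prefix='', prefix_sep='', dtype=float)

train_dummies = encode(train, levels)
new_dummies = encode(new, levels)

print(new_dummies.shape)
print(list(new_dummies.columns) == list(train_dummies.columns))
```

With this approach the new data gets all-zero columns for London, Madrid, New York, Bitcoin and Master Card without any explicit loop.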
First, you need to either string-index all the words or one-hot encode the values. ML models do not accept text, only numbers. Next, you want X and y to be:
X = d.iloc[:,:-1]
y = d.iloc[:,-1]
That gives X the shape [11,3] and y the shape [11,], which are the shapes you need.
Here is the fixed prediction part of the code, with my comments:
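A quick way to check those shapes, and what one-hot encoding does to them, is sketched below using the data from the question. The name X_encoded and the dtype=float argument are my own additions for illustration.

```python
import pandas as pd

d = pd.DataFrame({
    'City': ['Tokyo', 'Tokyo', 'Lisbon', 'Tokyo', 'Madrid', 'New York',
             'Madrid', 'London', 'Tokyo', 'London', 'Tokyo'],
    'Card': ['Visa', 'Visa', 'Visa', 'Master Card', 'Bitcoin', 'Master Card',
             'Bitcoin', 'Visa', 'Master Card', 'Visa', 'Bitcoin'],
    'Colateral': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Client Number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Total': [100, 100, 200, 300, 10, 20, 40, 50, 60, 100, 500],
}).set_index('Client Number')

# Features are every column except the last; the target is the last column
X = d.iloc[:, :-1]
y = d.iloc[:, -1]
print(X.shape, y.shape)   # (11, 3) (11,)

# One-hot encode the three text columns into numeric dummy columns
X_encoded = pd.get_dummies(X, prefix='', prefix_sep='', dtype=float)
print(X_encoded.shape)    # (11, 10): 5 cities + 3 cards + 2 collateral values
```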
d1 = {'City': ['Tokyo','Tokyo','Lisbon'],
'Card': ['Visa','Visa','Visa'],
'Colateral':['Yes','Yes','No'],
'Client Number':[11,12,13],
'Total':[0,0,0]}
df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')
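The main problem is that the training dataset X1 and x_new have different numbers of dummy columns.
Below I add the missing dummy columns and fill them with zeros: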
x_new = x_new.reindex(columns = X1.columns, fill_value=0)
Now x_new has the same columns as the training dataset X1:
               const  Lisbon  London  Madrid  ...  Master Card  Visa  No  Yes
Client Number                                 ...
11                 0       0       0       0  ...            0     1   0    1
12                 0       0       0       0  ...            0     1   0    1
13                 0       1       0       0  ...            0     1   1    0
[3 rows x 11 columns]
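Note that reindex also filled the const column with 0 (see the table above), so the intercept does not enter the predictions below unless you first set x_new['const'] = 1. Finally, use the previously trained model reg to predict on the new dataset x_new: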
reg.predict(x_new)
Result:
Client Number
11 35.956284
12 35.956284
13 135.956284
dtype: float64
Appendix
As requested, below is fully reproducible code for testing the training and prediction tasks:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'],
'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
'Total':[100,100,200,300,10,20,40,50,60,100,500]}
d = pd.DataFrame(data=d).set_index('Client Number')
df = pd.get_dummies(d,prefix='', prefix_sep='')
X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']
X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()
reg.summary()
###
d1 = {'City': ['Tokyo','Tokyo','Lisbon'],
'Card': ['Visa','Visa','Visa'],
'Colateral':['Yes','Yes','No'],
'Client Number':[11,12,13],
'Total':[0,0,0]}
df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')
x_new = x_new.reindex(columns = X1.columns, fill_value=0)
reg.predict(x_new)