Different Results using Simple Linear Regression Packages in Python: statsmodel.api vs sklearn

Question

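I would like to understand why two linear regression models give me different predictions. I am using the same dataset and asking for the same predicted value. I have pasted some sample code below; it is also available in an open Google Colab notebook.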

import pandas as pd
from sklearn import linear_model, metrics
import statsmodels.api as sm

temp = [73,65,81,90,75,77,82,93,86,79]
gallons = [110,95,135,160,97,105,120,175,140,121]
merged = list(zip(temp, gallons))
df = pd.DataFrame(merged, columns = ['temp', 'gallons'])

X = df[['temp']]
Y = df['gallons']

regr = linear_model.LinearRegression().fit(X,Y)
print("Using sklearn package, 80 temp predicts rent of:", regr.predict([[80]]))

model = sm.OLS(Y,X).fit()
print("Using statsmodel.api package, 80 temp predicts rent of:", model.predict([80]))

With the code above, I get the following output:
Using sklearn package, 80 temp predicts rent of: [125.5013734]
Using statsmodel.api package, 80 temp predicts rent of: [126.72501891]

Can someone explain why the results are not the same? My understanding is that both are linear regression models.

Thanks!

Answer 1

statsmodels does not include an intercept by default, whereas sklearn does. You have to add the intercept manually in statsmodels.
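To see where the two numbers come from, here is a minimal sketch that computes both fits by hand with NumPy, using the closed-form least-squares formulas. The variable names are my own, and it assumes only the data from the question. The first fit is forced through the origin (no intercept), the second includes an intercept; they should reproduce the two predictions in the question.

```python
import numpy as np

temp = np.array([73, 65, 81, 90, 75, 77, 82, 93, 86, 79], dtype=float)
gallons = np.array([110, 95, 135, 160, 97, 105, 120, 175, 140, 121], dtype=float)

# No intercept: gallons = b * temp, so b = sum(x*y) / sum(x*x).
# This is the model sm.OLS(Y, X) fits when X has no constant column.
slope_origin = (temp @ gallons) / (temp @ temp)
pred_no_intercept = slope_origin * 80

# With intercept: gallons = a + b * temp (sklearn's default).
x_centered = temp - temp.mean()
y_centered = gallons - gallons.mean()
slope = (x_centered @ y_centered) / (x_centered @ x_centered)
intercept = gallons.mean() - slope * temp.mean()
pred_with_intercept = intercept + slope * 80

print("no intercept:  ", pred_no_intercept)
print("with intercept:", pred_with_intercept)
```

The slopes differ a lot (roughly 1.58 versus 2.99), because a line through the origin has to compensate for the missing intercept.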

From the statsmodels OLS documentation:

Notes

No constant is added by the model unless you are using formulas.
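The "unless you are using formulas" part of that note can be illustrated with the formula interface. The following is a sketch, assuming the same data as in the question and that the formula dependencies of statsmodels are installed; with a formula such as "gallons ~ temp", an intercept term is included automatically.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "temp": [73, 65, 81, 90, 75, 77, 82, 93, 86, 79],
    "gallons": [110, 95, 135, 160, 97, 105, 120, 175, 140, 121],
})

# The formula interface adds the intercept term for us.
formula_model = smf.ols("gallons ~ temp", data=df).fit()

# New data is passed with the same column name used in the formula.
formula_pred = formula_model.predict(pd.DataFrame({"temp": [80]}))
print(formula_model.params)
print(float(formula_pred.iloc[0]))
```

The fitted parameters should contain both an Intercept and a temp coefficient, and the prediction should agree with the sklearn result from the question.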

Sklearn

fit_intercept : bool, default=True. Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
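Going the other way, you can make sklearn reproduce the statsmodels number from the question by switching the intercept off. This is only a sketch to confirm the diagnosis, assuming the same data as in the question; for this dataset you almost certainly want the intercept.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "temp": [73, 65, 81, 90, 75, 77, 82, 93, 86, 79],
    "gallons": [110, 95, 135, 160, 97, 105, 120, 175, 140, 121],
})
X = df[["temp"]]
Y = df["gallons"]
new_X = pd.DataFrame({"temp": [80]})  # same column name as the training data

# Default: intercept is estimated.
with_intercept = LinearRegression().fit(X, Y)
# Intercept disabled: regression line is forced through the origin.
through_origin = LinearRegression(fit_intercept=False).fit(X, Y)

pred_default = with_intercept.predict(new_X)[0]
pred_origin = through_origin.predict(new_X)[0]
print("fit_intercept=True: ", pred_default)
print("fit_intercept=False:", pred_origin)
```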

Use the add_constant function to add an intercept column to X; both libraries will then give the same result.

X = sm.add_constant(X)
model = sm.OLS(Y,X).fit()
print("Using statsmodel.api package, 80 temp predicts rent of:", model.predict([1,80]))

python

linear-regression

scikit-learn

statsmodels