sklearn Standardscaler Ridge 管道
sklean Standard Scaler Ridge Pipeline
我正在尝试标准化特征,然后 运行 岭回归。
如题,两个答案不同
当我设置ridge=0时,答案是一样的。当我删除 StandardScaler 和 Dn 时,答案也是一样的。
我不知道如何协调这两个版本(原始版本和使用 sklearn 的版本)。
感谢您的帮助
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(0)
x = np.random.randn(100, 3)
y = np.random.randn(100, 2)
xx = x.T @ x
xy = x.T @ y
Dn = np.diag(1 / np.sqrt(np.diag(xx)))
ridge = 1
xx = Dn @ xx @ Dn
xy = Dn @ xy
beta_raw = Dn @ np.linalg.solve(xx + np.eye(len(xx)) * ridge, xy)
f_raw = x @ beta_raw
model = Pipeline([("scaler", StandardScaler(with_mean=False)), ("regression", Ridge(ridge, fit_intercept=False))])
trained_model = model.fit(x, y)
f_ml = trained_model.predict(x)
print(f_ml[:3] / f_raw[:3])
您正在按不同的值缩放,检查:
np.diag(Dn)
array([0.09699826, 0.10123938, 0.1016412 ])
model.steps[0][1].scale_
array([1.02603414, 0.98202661, 0.97598415])
您的标准差是协方差矩阵对角线的平方根。即使您没有将矩阵居中,您仍然需要减去均值以获得协方差。参见
所以如果我们做对了:
x_m = x.mean(axis=0)
x_cov = np.dot((x - x_m).T, x - x_m) / (x.shape[0])
Dn = np.diag(1 / np.sqrt(np.diag(x_cov)))
xx = x.T @ x
xy = x.T @ y
ridge = 1
xx = Dn @ xx @ Dn
xy = Dn @ xy
beta_raw = Dn @ np.linalg.solve(xx + np.eye(len(xx)) * ridge, xy)
f_raw = x @ beta_raw
model = Pipeline([("scaler", StandardScaler(with_mean=False)), ("regression", Ridge(ridge, fit_intercept=False))])
trained_model = model.fit(x, y)
f_ml = trained_model.predict(x)
print(f_ml[:3] / f_raw[:3])
[[1. 1.]
[1. 1.]
[1. 1.]]
我正在尝试标准化特征,然后 运行 岭回归。
如题,两个答案不同
当我设置ridge=0时,答案是一样的。当我删除 StandardScaler 和 Dn 时,答案也是一样的。
我不知道如何协调这两个版本(原始版本和使用 sklearn 的版本)。
感谢您的帮助
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(0)
x = np.random.randn(100, 3)
y = np.random.randn(100, 2)
xx = x.T @ x
xy = x.T @ y
Dn = np.diag(1 / np.sqrt(np.diag(xx)))
ridge = 1
xx = Dn @ xx @ Dn
xy = Dn @ xy
beta_raw = Dn @ np.linalg.solve(xx + np.eye(len(xx)) * ridge, xy)
f_raw = x @ beta_raw
model = Pipeline([("scaler", StandardScaler(with_mean=False)), ("regression", Ridge(ridge, fit_intercept=False))])
trained_model = model.fit(x, y)
f_ml = trained_model.predict(x)
print(f_ml[:3] / f_raw[:3])
您正在按不同的值缩放,检查:
np.diag(Dn)
array([0.09699826, 0.10123938, 0.1016412 ])
model.steps[0][1].scale_
array([1.02603414, 0.98202661, 0.97598415])
您的标准差是协方差矩阵对角线的平方根。即使您没有将矩阵居中,您仍然需要减去均值以获得协方差。参见
所以如果我们做对了:
x_m = x.mean(axis=0)
x_cov = np.dot((x - x_m).T, x - x_m) / (x.shape[0])
Dn = np.diag(1 / np.sqrt(np.diag(x_cov)))
xx = x.T @ x
xy = x.T @ y
ridge = 1
xx = Dn @ xx @ Dn
xy = Dn @ xy
beta_raw = Dn @ np.linalg.solve(xx + np.eye(len(xx)) * ridge, xy)
f_raw = x @ beta_raw
model = Pipeline([("scaler", StandardScaler(with_mean=False)), ("regression", Ridge(ridge, fit_intercept=False))])
trained_model = model.fit(x, y)
f_ml = trained_model.predict(x)
print(f_ml[:3] / f_raw[:3])
[[1. 1.]
[1. 1.]
[1. 1.]]