Sklearn StandardScaler showing incorrect values

Question

I am using StandardScaler() to standardize a pandas DataFrame, but when I compute the result by hand I get different values.

Here is my DataFrame, named blood_df:

   dbp    sbp  weight  height
0  82.6  132.1      71     172
1  79.1  129.9      79     180
2  81.7  131.2      78     172
3  80.7  132.1      66     166
4  74.9  125.0      70     173
5  79.1  129.1      64     162
6  83.8  133.1      60     164
7  78.4  127.0      67     165
8  82.3  131.6      64     164
9  79.4  129.2      77     179

I scale it with:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(blood_df)
blood_scaled = scaler.transform(blood_df)

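This gives blood_scaled. Since transform returns a NumPy array, it has to be wrapped in a DataFrame before describe() can be called; pd.DataFrame(blood_scaled)[0].describe() (column 0 is dbp) gives: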

count    1.000000e+01
mean     4.618528e-15
std      1.054093e+00
min     -2.163355e+00
25%     -4.489983e-01
50%     -6.122704e-02
75%      7.959515e-01
max      1.469449e+00
Name: 0, dtype: float64

However, the dbp column of the scaled data differs from what I get when I compute it by hand with z = (x - u) / s:

((blood_df['dbp'] - blood_df['dbp'].mean()) / blood_df['dbp'].std()).describe()

which gives:

count    1.000000e+01
mean     4.418688e-15
std      1.000000e+00
min     -2.052339e+00
25%     -4.259572e-01
50%     -5.808507e-02
75%      7.551059e-01
max      1.394042e+00
Name: dbp, dtype: float64

Why are the standard deviations not equal?

Answer 1

From the StandardScaler documentation:

Notes

...

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

And from the pandas.DataFrame.std documentation:

ddof : int, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Here ddof enters the standard deviation formula by replacing the denominator N with N - ddof. The division happens inside the square root:

std = (sum((x - x.mean())**2) / (N - ddof)) ** 0.5

So by default StandardScaler uses ddof = 0, while pandas.DataFrame.std uses ddof = 1.
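To make the effect of the divisor concrete, here is a minimal NumPy sketch using the dbp values from the question. The helper std_with_ddof is a hypothetical name introduced only for illustration; it is not part of NumPy, pandas, or scikit-learn.

```python
import numpy as np

# dbp column copied from the question's blood_df
dbp = np.array([82.6, 79.1, 81.7, 80.7, 74.9, 79.1, 83.8, 78.4, 82.3, 79.4])


def std_with_ddof(x, ddof):
    """Hypothetical helper: standard deviation with divisor N - ddof."""
    return np.sqrt(np.sum((x - x.mean()) ** 2) / (x.size - ddof))


population_std = std_with_ddof(dbp, ddof=0)  # divisor N, as in StandardScaler
sample_std = std_with_ddof(dbp, ddof=1)      # divisor N - 1, the pandas default

# The two estimates differ by a constant factor sqrt(N / (N - 1))
ratio = sample_std / population_std
print(population_std, sample_std, ratio)
```

With N = 10 the ratio is sqrt(10 / 9), roughly 1.054, which is the std that describe() reports for the StandardScaler output in the question.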

If you specify ddof in your manual formula, you can see that this accounts for the difference:

((blood_df['dbp'] - blood_df['dbp'].mean()) / blood_df['dbp'].std(ddof = 0)).describe()

which gives the same result as StandardScaler.
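As an end-to-end check, the following sketch rebuilds blood_df from the table in the question and compares the scaler output with the manual formula for all four columns. It assumes pandas and scikit-learn are installed; the variable name manual is introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Data copied from the question
blood_df = pd.DataFrame({
    "dbp": [82.6, 79.1, 81.7, 80.7, 74.9, 79.1, 83.8, 78.4, 82.3, 79.4],
    "sbp": [132.1, 129.9, 131.2, 132.1, 125.0, 129.1, 133.1, 127.0, 131.6, 129.2],
    "weight": [71, 79, 78, 66, 70, 64, 60, 67, 64, 77],
    "height": [172, 180, 172, 166, 173, 162, 164, 165, 164, 179],
})

# fit_transform returns a NumPy array, so wrap it to keep the column names
blood_scaled = pd.DataFrame(
    StandardScaler().fit_transform(blood_df),
    columns=blood_df.columns,
    index=blood_df.index,
)

# Manual z-score with the population standard deviation (ddof=0)
manual = (blood_df - blood_df.mean()) / blood_df.std(ddof=0)

print(np.allclose(blood_scaled, manual))
# pandas' default std (ddof=1) of the scaled column is sqrt(N / (N - 1)), not 1
print(blood_scaled["dbp"].std())
```

Wrapping the array in a DataFrame with the original column names also makes blood_scaled["dbp"] a valid lookup, which it is not on the raw array returned by transform.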
