Effect of scaling features when NaNs are set to -1

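I have a dataset in which some features contain a large share of NaNs (up to 80%). Dropping them would distort my overall distribution, so my options are to set all NaNs to -1/-99 or to bin my continuous variables so that they become categorical features.

Since I already have many categorical features, I would rather not turn the few continuous ones into categorical features as well. However, if I set the NaNs to -1/-99, will that significantly affect the results when I scale those features?

Or, from a different angle: is there a way to scale the features so that the -1 does not affect the scaling?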

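I know you got an answer from the comments above, but to show new scikit-learn users how you might approach a problem like this, I put together a very rudimentary solution that demonstrates how to build a custom transformer to handle it: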

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
import numpy as np

class NanImputeScaler(BaseEstimator, TransformerMixin):
    """Scale an array with missing values, then impute them
    with a dummy value. This prevents the imputed value from impacting
    the mean/standard deviation computation during scaling.

    Parameters
    ----------
    with_mean : bool, optional (default=True)
        Whether to center the variables.

    with_std : bool, optional (default=True)
        Whether to divide by the standard deviation.

    nan_level : int or float, optional (default=-99.)
        The value to impute over NaN values after scaling the other features.
    """
    def __init__(self, with_mean=True, with_std=True, nan_level=-99.):
        self.with_mean = with_mean
        self.with_std = with_std
        self.nan_level = nan_level

    def fit(self, X, y=None):
        # Check the input array, but don't force everything to be finite.
        # This also ensures the array is 2D. (scikit-learn 1.6+ renames
        # force_all_finite to ensure_all_finite.)
        X = check_array(X, force_all_finite=False, ensure_2d=True)

        # compute the statistics on the data irrespective of NaN values
        self.means_ = np.nanmean(X, axis=0)
        self.std_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        # Check that we have already fit this transformer
        check_is_fitted(self, "means_")

        # work on a float copy of X so the caller's array is not modified
        # in place (check_array does not copy by default)
        X = check_array(X, dtype=np.float64, copy=True,
                        force_all_finite=False, ensure_2d=True)

        # center if needed
        if self.with_mean:
            X -= self.means_
        # scale if needed
        if self.with_std:
            X /= self.std_

        # now fill in the missing values
        X[np.isnan(X)] = self.nan_level
        return X

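The way this works is by computing the nanmean and nanstd in the fit section, so that NaN values are ignored when the statistics are computed. Then, in the transform section, after the variables are scaled and centered, the remaining NaN values are imputed with the value you designate (you mentioned -99, so that is my default). You could always break that component of the transformer out into a separate transformer, but I included it here for demonstration purposes.

Example in action:

Here we'll set up some data with NaNs present: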

nan = np.nan
data = np.array([
    [ 1., nan,  3.],
    [ 2.,  3., nan],
    [nan,  4.,  5.],
    [ 4.,  5.,  6.]
])

When we fit the scaler and check the means and standard deviations, you can see that they do not account for the NaN values:

>>> imputer = NanImputeScaler().fit(data)
>>> imputer.means_
array([ 2.33333333,  4.        ,  4.66666667])
>>> imputer.std_
array([ 1.24721913,  0.81649658,  1.24721913])
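
For contrast, here is a minimal sketch of what happens if the sentinel is filled in before the statistics are computed, which is the situation the question worries about. It reuses the same example array; the names `filled`, `naive_means`, `naive_stds`, `nan_aware_means`, and `nan_aware_stds` are illustrative only and not part of the transformer above.

```python
import numpy as np

nan = np.nan
data = np.array([
    [ 1., nan,  3.],
    [ 2.,  3., nan],
    [nan,  4.,  5.],
    [ 4.,  5.,  6.]
])

# Fill the sentinel first, then compute statistics:
# -99 is now treated as if it were a real observation.
filled = np.where(np.isnan(data), -99.0, data)
naive_means = filled.mean(axis=0)
naive_stds = filled.std(axis=0)

# NaN-aware statistics use only the observed values.
nan_aware_means = np.nanmean(data, axis=0)
nan_aware_stds = np.nanstd(data, axis=0)

print(naive_means)      # dominated by the sentinel: column 0 is -23.0
print(nan_aware_means)  # column 0 is about 2.33
print(naive_stds)
print(nan_aware_stds)
```

With a single -99 per column the means are pulled far below every observed value and the standard deviations are inflated, so the real values end up squeezed together after scaling.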

Finally, when we transform the data, it is scaled and the NaN values are handled:

>>> imputer.transform(data)
array([[ -1.06904497, -99.        ,  -1.33630621],
       [ -0.26726124,  -1.22474487, -99.        ],
       [-99.        ,   0.        ,   0.26726124],
       [  1.33630621,   1.22474487,   1.06904497]])
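
As a possible alternative, the scaling and the filling can be split into two built-in steps. This is a sketch that assumes scikit-learn 0.20 or later, where `StandardScaler` disregards NaNs in `fit` and keeps them in `transform`; `SimpleImputer` with `strategy="constant"` then writes the sentinel. The name `scale_then_fill` is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

nan = np.nan
data = np.array([
    [ 1., nan,  3.],
    [ 2.,  3., nan],
    [nan,  4.,  5.],
    [ 4.,  5.,  6.]
])

# Step 1 scales using only the observed values and leaves NaNs in place.
# Step 2 replaces the remaining NaNs with the sentinel value.
scale_then_fill = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="constant", fill_value=-99.0),
)

result = scale_then_fill.fit_transform(data)
print(result)
```

On this data the result should match the custom transformer's output, since both divide by the population standard deviation of the observed values.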

Pipelining

You can even use this pattern inside a scikit-learn pipeline (and even persist it to disk):

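from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# illustrative labels, one per row of `data` (not part of the original example)
y = np.array([0, 1, 0, 1])

pipe = Pipeline([
        ("scale", NanImputeScaler()),
        ("clf", LogisticRegression())
    ]).fit(data, y)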
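
Persisting a fitted pipeline to disk could look roughly like the sketch below, which uses the standard-library `pickle` module. To keep the block self-contained it uses the built-in `StandardScaler` and `SimpleImputer` as stand-ins for the custom scaler, and the labels `y` and the file name `pipeline.pkl` are made up for illustration. A custom class such as `NanImputeScaler` pickles the same way, provided it is defined in a module that can be imported when the file is loaded.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

nan = np.nan
data = np.array([
    [ 1., nan,  3.],
    [ 2.,  3., nan],
    [nan,  4.,  5.],
    [ 4.,  5.,  6.]
])
y = np.array([0, 1, 0, 1])  # hypothetical labels, for illustration only

# Built-in stand-ins for the scale-then-fill step, followed by a classifier.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("fill", SimpleImputer(strategy="constant", fill_value=-99.0)),
    ("clf", LogisticRegression()),
]).fit(data, y)

# Write the fitted pipeline to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline carries the fitted statistics and coefficients.
same_predictions = np.array_equal(pipe.predict(data), restored.predict(data))
print(same_predictions)
```

Only load pickle files from sources you trust, and load them with the same scikit-learn version that wrote them where possible.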