How to transform only some columns with SimpleImputer or an equivalent

Question

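I'm taking my first steps with the scikit-learn library and found myself needing to fill in missing values in only some columns of my data frame.

I have read the documentation carefully, but I still cannot figure out how to do it.

To be more specific, suppose I have: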

A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]

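I want to fill the second column with the mean, but leave the third column alone. How can I do this with SimpleImputer (or another helper class)?

A natural follow-up question: how do I fill the second column with the mean and the last column with a constant (obviously, only in the cells that have no value to begin with)?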

Answer 1

There is no need to use SimpleImputer.
DataFrame.fillna() can do the job as well.

For the second column, use

column.fillna(column.mean(), inplace=True)

For the third column, use

column.fillna(constant, inplace=True)

Of course, you need to replace column with the DataFrame column you want to change, and constant with the constant you want.

Edit
Since the use of inplace is discouraged and is expected to be deprecated, the syntax should instead be

column = column.fillna(column.mean())
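
As a minimal sketch of the steps above applied to the data from the question: the column names a, b, c and the constant 0 are assumptions chosen for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd

# The matrix from the question; the column names are hypothetical.
A = [[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]]
df = pd.DataFrame(A, columns=["a", "b", "c"])

# Fill the second column with its mean (NaN is skipped when computing the mean).
df["b"] = df["b"].fillna(df["b"].mean())

# Fill the third column with a constant; 0 is an arbitrary choice here.
constant = 0
df["c"] = df["c"].fillna(constant)

print(df)
```

Assigning the result back to df["b"] and df["c"] is what writes the filled values into the DataFrame. Passing a dict to DataFrame.fillna, for example df.fillna({"b": df["b"].mean(), "c": constant}), should give the same result in one call.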

Answer 2

Following Dan's suggestion, here is an example of imputing columns with ColumnTransformer and SimpleImputer:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]

column_trans = ColumnTransformer(
    [('imp_col1', SimpleImputer(strategy='mean'), [1]),
     ('imp_col2', SimpleImputer(strategy='constant', fill_value=29), [2])],
    remainder='passthrough')

print(column_trans.fit_transform(A)[:, [2,0,1]])
# [[7 2.0 3]
#  [4 3.5 6]
#  [10 5.0 29]]

This approach helps you build pipelines, which are better suited to larger applications.
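
To illustrate the point about pipelines, here is a minimal sketch that places the same kind of ColumnTransformer in front of an estimator. The target values y, the new sample X_new, and the choice of LinearRegression are assumptions made only for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

X = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]])
y = np.array([1.0, 2.0, 3.0])  # hypothetical target, for illustration only

preprocess = ColumnTransformer(
    [("imp_col1", SimpleImputer(strategy="mean"), [1]),
     ("imp_col2", SimpleImputer(strategy="constant", fill_value=29), [2])],
    remainder="passthrough",
)

# The imputation statistics are learned in fit() and reused at prediction time.
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(X, y)

# A new row with gaps in both imputed columns.
X_new = np.array([[1, np.nan, np.nan]])
filled = model.named_steps["preprocess"].transform(X_new)
print(filled)
print(model.predict(X_new))
```

Note that ColumnTransformer emits the transformed columns first and the passthrough columns last, so filled holds the original columns in the order 1, 2, 0. The mean used for X_new is the one learned from X, not one recomputed from the new row.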

Answer 3

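This is the method I use; you can replace low_cardinality_cols with the columns you want to encode. It also works if you set the limit on the number of unique values to max(df.columns.nunique()).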

# Check the cardinality of the columns to encode
low_cardinality_cols = [cname for cname in df.columns if df[cname].nunique() < 16 and 
                        df[cname].dtype == "object"]

The reason for this filter is that it is recommended to encode only columns whose cardinality is close to 10.

# Replace NaN first; otherwise you will get stuck at the encoding step
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')  # feel free to use another strategy
df[low_cardinality_cols] = imp.fit_transform(df[low_cardinality_cols])

# Apply the label encoder to each selected column
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in low_cardinality_cols:
    df[col] = label_encoder.fit_transform(df[col])
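
For readers who want to run the steps of this answer end to end, here is a small self-contained sketch on a made-up DataFrame. The column names and values are assumptions for illustration; the threshold of 16 is taken from the answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: two categorical columns with gaps and one numeric column.
df = pd.DataFrame({
    "color": pd.Series(["red", np.nan, "red", "blue"], dtype=object),
    "size": pd.Series(["S", "M", np.nan, "M"], dtype=object),
    "price": [1.0, 2.0, 3.0, 4.0],
})

# Select object columns with fewer than 16 distinct values.
low_cardinality_cols = [
    c for c in df.columns if df[c].dtype == "object" and df[c].nunique() < 16
]

# Impute only the selected columns with their most frequent value.
imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
df[low_cardinality_cols] = imp.fit_transform(df[low_cardinality_cols])

# Encode each selected column as integers; "price" is left untouched.
for col in low_cardinality_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

The scikit-learn documentation describes LabelEncoder as intended for target labels; OrdinalEncoder is the counterpart meant for feature columns, so it may be the better fit if this step later moves into a pipeline.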

Answer 4

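I assume your data is a pandas DataFrame.

In that case, all you need to do to use SimpleImputer from scikit-learn is select the specific column whose NaN values you want to impute with the 'most_frequent' value, convert it to a NumPy array, and reshape it into a column vector.

An example of this: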

# Impute the missing values using the 'most_frequent' strategy.
# This example uses the California housing dataset.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.read_csv('housing.csv')
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# SimpleImputer expects 2-D input, so convert the pandas Series to a column vector;
# ravel() flattens the (n, 1) result back to 1-D before assigning it to the column.
housing['total_bedrooms'] = imp.fit_transform(
    housing['total_bedrooms'].to_numpy().reshape(-1, 1)
).ravel()

Similarly, you can select any column in the dataset, convert it to a NumPy array, reshape it, and apply SimpleImputer.
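
As a variation on the example in this answer, selecting the column with double brackets yields a one-column DataFrame that is already two-dimensional, so the reshape step is not needed. The sketch below uses a few made-up values rather than the real housing.csv file, and the households column is a hypothetical extra column that is deliberately left alone.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the housing data.
housing = pd.DataFrame({
    "total_bedrooms": [3.0, np.nan, 3.0, 5.0],
    "households": [2.0, 4.0, np.nan, 1.0],
})

imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# housing[["total_bedrooms"]] has shape (n_rows, 1), which SimpleImputer accepts.
housing[["total_bedrooms"]] = imp.fit_transform(housing[["total_bedrooms"]])

print(housing)
```

Only total_bedrooms is imputed here; the NaN in households stays in place, which is the "some columns only" behavior the question asks for.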
