Label encoding several columns in a DataFrame, but only those that need it

Question

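I have a pandas DataFrame that contains floats, dates, integers, and classes (categorical values). Given the huge number of columns, what would be the most automated way to select the columns that need it (mainly the class columns) and then label encode only those?

FYI: dates must not be label encoded.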

Answer 1

You can use select_dtypes to select columns by data type, or filter to select columns by name.
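
To illustrate the two options, here is a minimal sketch on a toy frame. The frame and its column names are made up for the example, and the text column is forced to object dtype because newer pandas releases may infer a dedicated string dtype instead.

```python
import pandas as pd

# Hypothetical toy frame; the column names are invented for illustration.
df = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],
        "qty": [1, 2, 3],
        "sold_on": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
        "color": pd.Series(["red", "blue", "red"], dtype="object"),
        "cat_size": pd.Categorical(["S", "M", "S"]),
    }
)

# Select by dtype: only object and categorical columns, so dates are left out.
by_dtype = df.select_dtypes(include=["object", "category"]).columns.tolist()

# Select by name: columns whose label contains the substring "cat_".
by_name = df.filter(like="cat_").columns.tolist()

print(by_dtype)
print(by_name)
```

Selecting by dtype scales better when there are many columns, while filter is handy if the class columns happen to share a naming pattern.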

Answer 2

Try this:

# Select numerical and categorical columns.
# include="number" (rather than exclude="object") keeps datetime and
# category columns out of num_cols.
num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = X_train.select_dtypes(include="object").columns.tolist()

# You can also pass a list of dtypes:
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
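
If you want integer codes rather than one-hot columns, one possible sketch is below. It swaps in scikit-learn's OrdinalEncoder, since LabelEncoder is intended for target labels rather than feature columns. The toy X_train and its column names are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training frame; the column names are invented for illustration.
X_train = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],
        "sold_on": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
        "color": pd.Series(["red", "blue", "red"], dtype="object"),
        "shirt_size": pd.Categorical(["S", "M", "S"]),
    }
)

# Only object/category columns are picked up; floats and dates are skipped.
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()

# Encode each selected column to integer codes (categories sorted per column).
encoder = OrdinalEncoder()
codes = pd.DataFrame(
    encoder.fit_transform(X_train[cat_cols]),
    columns=cat_cols,
    index=X_train.index,
).astype(int)

# Put the encoded columns back next to the untouched ones.
encoded = X_train.drop(columns=cat_cols).join(codes)
print(encoded.dtypes)
```

The fitted encoder keeps the category lists in encoder.categories_, so the same mapping can be reused on test data with encoder.transform.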

After that you can build pipelines like this:

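from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical data preprocessing pipeline
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# categorical data preprocessing pipeline
# (sparse_output needs scikit-learn >= 1.2; older releases spell it sparse=False)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)

# full pipeline: each sub-pipeline is applied only to its own columns
full_pipe = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)]
)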
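
For reference, here is a self-contained sketch of how such a pipeline might be fitted on a small frame. The data and column names are hypothetical. To sidestep the sparse/sparse_output naming difference between scikit-learn versions, this sketch leaves OneHotEncoder at its default and asks ColumnTransformer for a dense result via sparse_threshold=0, which is a choice made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training frame with missing values in both column groups.
X_train = pd.DataFrame(
    {
        "price": [1.0, np.nan, 3.0, 2.0],
        "qty": [1, 2, 3, 4],
        "sold_on": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
        ),
        "color": pd.Series(["red", np.nan, "blue", "red"], dtype="object"),
    }
)

num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore"),
)

full_pipe = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)],
    remainder="drop",     # columns in neither list (here: sold_on) are dropped
    sparse_threshold=0,   # always return a dense NumPy array
)

# Two scaled numeric columns plus one indicator column per color category.
X_out = full_pipe.fit_transform(X_train)
print(X_out.shape)
```

Because the date column appears in neither num_cols nor cat_cols, it never reaches an encoder. Use remainder="passthrough" if you would rather keep such columns in the output.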

Tags: python, pandas, scikit-learn, sklearn-pandas