Label encoding several columns in a DataFrame, but only those that need it

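I have a pandas DataFrame that contains floats, dates, integers, and classes. Because there are so many columns, what would be the most automated way to select the columns that need it (mainly the ones containing classes) and then label encode those?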

FYI: dates must not be label encoded.

You can use select_dtypes to select columns by data type, or filter to select columns by name.

Try this:

# To select numerical and categorical columns
# include="number" keeps datetime columns out of num_cols (exclude="object" would let them in)
num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = X_train.select_dtypes(include="object").columns.tolist()

# you can also pass a list like - 
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
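
To see how the selection behaves when dates are present, here is a minimal sketch on a made-up frame. The DataFrame `df` and its column names are assumptions for illustration only and are not from the question. The text column is created with an explicit object dtype so the result does not depend on the pandas version's default string dtype.

```python
import pandas as pd

# Hypothetical toy data covering the dtypes mentioned in the question.
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],                                   # float
    "qty": [1, 2, 3],                                            # integer
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "color": pd.Series(["red", "blue", "red"], dtype=object),    # text / classes
    "size": pd.Categorical(["S", "M", "S"]),                     # pandas category
})

# Numeric columns only (ints and floats); dates are not matched by "number".
num_cols = df.select_dtypes(include="number").columns.tolist()

# Columns that hold classes: plain object columns and pandas categoricals.
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

# Date columns, listed separately so they can be left alone.
date_cols = df.select_dtypes(include="datetime").columns.tolist()

print(num_cols, cat_cols, date_cols)
```

The date column ends up in neither num_cols nor cat_cols, so it is never passed to an encoder or scaler.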

After that, you can build pipelines like this:

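from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical data preprocessing pipeline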
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# categorical data preprocessing pipeline
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
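    # sparse_output replaced the sparse argument in scikit-learn 1.2; use sparse=False on older versions
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),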
)

# full pipeline
full_pipe = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)]
)
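
The pipeline above one-hot encodes the class columns. If integer codes are wanted instead, as the question asks, one possible sketch is to run scikit-learn's OrdinalEncoder over just the selected columns. It is the feature-side counterpart of LabelEncoder, which is intended for target labels. The frame `df` and its column names below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one float, one date, and two text columns.
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "color": pd.Series(["red", "blue", "red"], dtype=object),
    "city": pd.Series(["Oslo", "Bergen", "Oslo"], dtype=object),
})

# Only object/category columns are picked up; floats and dates are skipped.
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

encoder = OrdinalEncoder()
# fit_transform returns a 2-D float array with one column per input column.
codes = encoder.fit_transform(df[cat_cols])

# Write the integer codes back column by column; other columns stay as they were.
encoded = df.copy()
for i, col in enumerate(cat_cols):
    encoded[col] = codes[:, i].astype(int)

print(encoded.dtypes)
```

Codes are assigned in sorted order of each column's categories, and encoder.categories_ keeps the mapping, so encoder.inverse_transform can recover the original labels.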