Label encoding several columns in a DataFrame, but only those that need it
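I have a pandas DataFrame that contains floats, dates, integers, and classes. Given the large number of columns, what would be the most automated way to select the columns that need it (mainly the ones holding classes) and then label encode those?
FYI: dates must not be label encoded.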
You can use select_dtypes to select columns by data type, or filter to select columns by name.
Try this:
# Select numerical and categorical columns by dtype.
# include="number" keeps datetime columns out of num_cols
# (exclude="object" would let them through).
num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = X_train.select_dtypes(include="object").columns.tolist()
# You can also pass a list of dtypes:
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
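To see how the dtype selection treats a date column, here is a minimal sketch on a made-up frame. The column names and values are hypothetical and only stand in for your X_train; the point is that a datetime column lands in neither the numeric list nor the categorical list, so it can be handled separately or left alone.

```python
import pandas as pd

# Hypothetical toy data standing in for X_train (names are illustrative only).
X_train = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],                                   # float
        "qty": [1, 2, 3],                                            # integer
        "color": pd.Series(["red", "blue", "red"], dtype="object"),  # class stored as object
        "size": pd.Series(["S", "M", "L"], dtype="category"),        # class stored as category
        "sold_on": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    }
)

# Numeric columns only: ints and floats, no datetimes.
num_cols = X_train.select_dtypes(include="number").columns.tolist()
# Columns that hold classes.
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
# Date columns, kept apart so they are never encoded.
date_cols = X_train.select_dtypes(include="datetime").columns.tolist()

print(num_cols, cat_cols, date_cols)
```

If your dates are timezone-aware, select_dtypes needs "datetimetz" rather than "datetime" to pick them up.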
After that, you can build pipelines like this:
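from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical data preprocessing pipeline
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# categorical data preprocessing pipeline
# (sparse_output replaces the older sparse argument in scikit-learn >= 1.2)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)
# full pipeline; columns in neither list are dropped by default (remainder="drop")
full_pipe = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)]
)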
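The pipeline above one-hot encodes the class columns. If you want plain integer codes instead, which is closer to what the question asks for, one option is scikit-learn's OrdinalEncoder applied only to the selected columns. LabelEncoder itself is meant for the target y rather than for feature columns. This is a rough sketch, not the only way to do it; the frame and its column names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data (names are illustrative only).
df = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],
        "color": pd.Series(["red", "blue", "red"], dtype="object"),
        "size": pd.Series(["S", "M", "L"], dtype="category"),
        "sold_on": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    }
)

# Pick only the columns that hold classes; floats, ints and dates are skipped.
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

# Fit one encoder over all selected columns; each column gets its own
# code table, with categories sorted before codes are assigned.
encoder = OrdinalEncoder()
codes = encoder.fit_transform(df[cat_cols])

# Work on a copy so the original frame stays untouched.
encoded = df.copy()
encoded[cat_cols] = codes

print(encoded.dtypes)
```

Keeping the fitted encoder around lets you call encoder.transform on new data and encoder.inverse_transform to get the original labels back.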