Get feature names after sklearn pipeline
I want to match the output np array with the features in order to build a new pandas DataFrame.
Here is my pipeline:
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer

# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ]
)

# Continuous pipeline
continuous_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('Scaling', StandardScaler())
    ]
)

# Creating preprocessing pipeline (continuous_cols and categorical_cols are my column lists)
preprocessing = make_column_transformer(
    (continuous_preprocessing, continuous_cols),
    (categorical_preprocessing, categorical_cols),
)

# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)
This is how I call it:
X_train = pipeline.fit_transform(X_train)
X_val = pipeline.transform(X_val)
X_test = pipeline.transform(X_test)
And here is what I get when trying to retrieve the feature names:
pipeline['Preprocessing'].transformers_[1][1]['Ordinal encoding'].get_feature_names()
Output:
AttributeError: 'OrdinalEncoder' object has no attribute 'get_feature_names'
Here is a similar SO question:
The point is that, as of today, some transformers do expose a method .get_feature_names_out() while others do not, which creates some problems - for instance - whenever you want to build a well-formatted DataFrame from the np.array output by a Pipeline or ColumnTransformer instance. (Conversely, afaik, .get_feature_names() is being deprecated in the latest versions in favor of .get_feature_names_out().)
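As a quick way to see which of the fitted transformers do expose the method, you might inspect the fitted ColumnTransformer directly; a minimal sketch, assuming the pipeline from the question has already been fitted:

# Minimal sketch (assumes `pipeline` has already been fitted):
# list which inner transformers expose .get_feature_names_out()
ct = pipeline['Preprocessing']          # the fitted ColumnTransformer
for name, trans, cols in ct.transformers_:
    if name == 'remainder':             # the remainder entry is just the string 'drop'
        continue
    for step_name, step in trans.named_steps.items():
        print(name, step_name, hasattr(step, 'get_feature_names_out'))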
Regarding the transformers you are using, StandardScaler belongs to the first category of transformers exposing the method, while both SimpleImputer and OrdinalEncoder belong to the second. The docs show the exposed methods within the Methods paragraph of each estimator. As said, this creates problems when doing something like pd.DataFrame(pipeline.fit_transform(X_train), columns=pipeline.get_feature_names_out()) on your pipeline, but it also causes problems both on your categorical_preprocessing and continuous_preprocessing pipelines (in both cases at least one transformer lacks the method) and on the preprocessing ColumnTransformer instance.
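One possible workaround, just a rough sketch relying on the fact that imputation, ordinal encoding and scaling all map columns one-to-one (they neither add nor drop columns), is to rebuild the output column names from the column lists stored on the fitted ColumnTransformer itself:

# Rough sketch (assumes the pipeline has been fitted and that every inner
# transformer keeps a one-to-one column mapping)
ct = pipeline['Preprocessing']
out_cols = []
for name, trans, cols in ct.transformers_:
    if name != 'remainder':             # skip the dropped remainder columns
        out_cols.extend(cols)
X_train_df = pd.DataFrame(X_train, columns=out_cols)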
sklearn is working on enriching all estimators with the .get_feature_names_out() method. This effort is tracked in github issue #21308 which, as you might see, branches into many PRs (each one dealing with a specific module): for instance, issue #21079 for the preprocessing module, which will enrich the OrdinalEncoder among others, and issue #21078 for the impute module, which will enrich the SimpleImputer. I guess they will become available in a new release as soon as all the referenced PRs are merged.
In the meantime, imo, you should go with a custom solution that fits your needs. Here is a simple example, which does not necessarily resemble your exact case, but which is intended to show a (possible) way of proceeding:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector

X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw', ''],
                  'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath', 'The Jungle'],
                  'expert_rating': [5, 3, 4, 5, np.nan],
                  'user_rating': [4, 5, 4, np.nan, 3]})
X

num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = X.select_dtypes(exclude=np.number).columns.tolist()

# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values='', strategy='most_frequent')),
        ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ]
)

# Continuous pipeline
continuous_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('Scaling', StandardScaler())
    ]
)

# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (continuous_preprocessing, num_cols),
    (categorical_preprocessing, cat_cols),
)

# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)

X_trans = pipeline.fit_transform(X)
pd.DataFrame(X_trans, columns=num_cols + cat_cols)
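As a side note, once the referenced PRs are merged (roughly from scikit-learn 1.1 onwards, if I'm not mistaken, where SimpleImputer and OrdinalEncoder also expose the method), the same result should be obtainable directly from the pipeline, something along these lines (note that, by default, the ColumnTransformer prefixes the names with the transformer names):

# Hedged sketch for a recent scikit-learn release where all involved
# transformers expose .get_feature_names_out()
X_trans = pipeline.fit_transform(X)
pd.DataFrame(X_trans, columns=pipeline.get_feature_names_out())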