Is it possible to specify handle_unknown='ignore' for certain columns and 'error' for others inside OneHotEncoder?
I have a dataframe in which all columns are categorical, and I am encoding it with OneHotEncoder from sklearn.preprocessing. My code is as follows:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression  # needed for the 'LReg' step
from sklearn.pipeline import Pipeline
steps = [('OneHotEncoder', OneHotEncoder(handle_unknown='ignore')), ('LReg', LinearRegression())]
pipeline = Pipeline(steps)
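As seen in the OneHotEncoder documentation, the handle_unknown parameter takes either 'error' or 'ignore'. I want to know whether there is a way to selectively ignore unknown categories for certain columns while raising an error for the others.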
import pandas as pd
df = pd.DataFrame({'Country':['USA','USA','IND','UK','UK','UK'],
'Fruits':['Apple','Strawberry','Mango','Berries','Banana','Grape'],
'Flower': ['Rose','Lily','Orchid','Petunia','Lotus','Dandelion'],
'Result':[1,2,3,4,5,6,]})
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
steps = [('OneHotEncoder', OneHotEncoder(handle_unknown ='ignore')) ,('LReg', LinearRegression())]
pipeline = Pipeline(steps)
from sklearn.model_selection import train_test_split
X = df[["Country","Flower","Fruits"]]
Y = df["Result"]
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3, random_state=30, shuffle =True)
print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape)
print("y_test.shape:", y_test.shape)
pipeline.fit(X_train,y_train)
y_pred = pipeline.predict(X_test)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
#Mean Squared Error:
MSE = mean_squared_error(y_test,y_pred)
print("MSE", MSE)
#Root Mean Squared Error:
from math import sqrt
RMSE = sqrt(MSE)
print("RMSE", RMSE)
#R-squared score:
R2_score = r2_score(y_test,y_pred)
print("R2_score", R2_score)
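In this case, for all three columns (Country, Fruits, and Flower), the model is still able to predict an output when a new value comes in.
Is there a way to ignore unknown categories for Fruits and Flower, but raise an error for an unknown value in the Country column?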
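From v0.20 onward, you can use the ColumnTransformer API. For older versions, however, you can easily roll your own preprocessor implementation that handles columns selectively.
Here's a simple prototype I've implemented which extends OneHotEncoder. You will need to pass the list of columns that should raise an error via the raise_error_cols parameter. Any column not listed in that parameter is implicitly treated as "ignored".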
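The prototype's source is not reproduced in this thread, so the following is only a minimal sketch of the idea, not the answerer's implementation. It wraps (rather than extends) OneHotEncoder, assumes the input is a pandas DataFrame, and uses a hypothetical class name, SelectiveUnknownEncoder; only the raise_error_cols parameter name comes from the description above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class SelectiveUnknownEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: one-hot encode a DataFrame, raising on unknown
    categories only for the columns listed in raise_error_cols."""

    def __init__(self, raise_error_cols=None):
        self.raise_error_cols = raise_error_cols

    def fit(self, X, y=None):
        # Every column is encoded with handle_unknown='ignore' ...
        self.encoder_ = OneHotEncoder(handle_unknown="ignore")
        self.encoder_.fit(X)
        self.columns_ = list(X.columns)
        return self

    def transform(self, X):
        # ... and the strict columns are checked by hand before encoding.
        for col in (self.raise_error_cols or []):
            known = self.encoder_.categories_[self.columns_.index(col)]
            unknown = set(X[col]) - set(known)
            if unknown:
                raise ValueError(
                    f"Found unknown categories {sorted(unknown, key=str)} "
                    f"in column {col}"
                )
        return self.encoder_.transform(X)


X_train = pd.DataFrame({
    "Country": ["IND", "USA", "UK", "UK"],
    "Flower": ["Orchid", "Rose", "Lotus", "Dandelion"],
    "Fruits": ["Mango", "Apple", "Banana", "Grape"],
})
enc = SelectiveUnknownEncoder(raise_error_cols=["Country"]).fit(X_train)

# Unknown Flower/Fruits values are tolerated and encoded as all-zero blocks.
ok = pd.DataFrame({"Country": ["UK"], "Flower": ["Petunia"], "Fruits": ["Berries"]})
encoded = enc.transform(ok).toarray()

# An unknown Country value raises instead.
bad = pd.DataFrame({"Country": ["SA"], "Flower": ["Rose"], "Fruits": ["Tomato"]})
try:
    enc.transform(bad)
    raised = False
except ValueError:
    raised = True
```

Composition keeps the sketch short; a subclass of OneHotEncoder, as the answer describes, would have to hook into the encoder's internals, which differ between scikit-learn versions.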
Sample run
# Setup data
X_train
  Country     Flower  Fruits
2     IND     Orchid   Mango
0     USA       Rose   Apple
4      UK      Lotus  Banana
5      UK  Dandelion   Grape
X_test
  Country   Flower      Fruits
3      UK  Petunia     Berries
1     USA     Lily  Strawberry
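# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame
X_test2 = pd.concat(
    [X_test, pd.DataFrame([{'Country': 'SA', 'Flower': 'Rose', 'Fruits': 'Tomato'}])],
    ignore_index=True)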
X_test2
  Country   Flower      Fruits
0      UK  Petunia     Berries
1     USA     Lily  Strawberry
2      SA     Rose      Tomato
from selective_handler_ohe import SelectiveHandlerOHE
she = SelectiveHandlerOHE(raise_error_cols=['Country'])
she.fit(X_train)
she.transform(X_test).toarray()
# array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
# [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])
she.transform(X_test2)
# ---------------------------------------------------------------------------
# ValueError: Found unknown categories SA in column Country during fit
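I think ColumnTransformer() can help you solve this. You can specify the list of columns to which OneHotEncoder should be applied with handle_unknown='ignore', and likewise the list of columns for handle_unknown='error'.
Using ColumnTransformer, convert your pipeline to the following: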
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("ohe_ignore", OneHotEncoder(handle_unknown ='ignore'),
["Flower", "Fruits"]),
("ohe_raise_error", OneHotEncoder(handle_unknown ='error'),
["Country"])])
steps = [('OneHotEncoder', ct),
('LReg', LinearRegression())]
pipeline = Pipeline(steps)
Now, when we want to predict:
>>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))
array([2.83333333])
>>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['chk'], 'Flower': ['Rose']}))
array([3.66666667])
>>> pipeline.predict(pd.DataFrame({'Country': ['chk'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))
ValueError: Found unknown categories ['chk'] in column 0 during transform
Note: ColumnTransformer is available from version 0.20 onward.
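To see what the two encoders do before the regression step, here is a small self-contained sketch that fits only the ColumnTransformer on a toy frame and inspects its output. The training rows are the X_train rows from the sample run above. The sparse_threshold=0 argument is an addition made here so that the output is always a dense array, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    "Country": ["IND", "USA", "UK", "UK"],
    "Flower": ["Orchid", "Rose", "Lotus", "Dandelion"],
    "Fruits": ["Mango", "Apple", "Banana", "Grape"],
})

ct = ColumnTransformer(
    [
        ("ohe_ignore", OneHotEncoder(handle_unknown="ignore"), ["Flower", "Fruits"]),
        ("ohe_raise_error", OneHotEncoder(handle_unknown="error"), ["Country"]),
    ],
    sparse_threshold=0,  # always return a dense array (addition for readability)
)
ct.fit(X_train)

# Output layout: 4 Flower columns, then 4 Fruits columns, then 3 Country columns.
unknown_fruit = pd.DataFrame({"Country": ["UK"], "Flower": ["Rose"], "Fruits": ["chk"]})
row = ct.transform(unknown_fruit)  # the Fruits block is all zeros

unknown_country = pd.DataFrame({"Country": ["chk"], "Flower": ["Rose"], "Fruits": ["Apple"]})
try:
    ct.transform(unknown_country)
    raised = False
except ValueError:
    # The 'error' encoder rejects the unseen Country value.
    raised = True
```

An unknown fruit therefore contributes nothing to the linear model's input for that row, while an unknown country stops the prediction outright.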