How can I one hot encode a subset of columns?
I have a dataset with some categorical columns. Here is a small sample:
Temp precip dow tod
-20.44 snow 4 14.5
-22.69 snow 4 15.216666666666667
-21.52 snow 4 17.316666666666666
-21.52 snow 4 17.733333333333334
-20.51 snow 4 18.15
Here, dow and precip are categorical, while the others are continuous. Is there a way I can create a OneHotEncoder for just those columns? I don't want to use pd.get_dummies, because that won't put the data in the proper format unless every value of dow and precip is present in the new data.
The short answer is yes, but with some caveats.
First, you won't be able to use OneHotEncoder directly on the precip feature; you will need to encode those labels into integers with LabelEncoder.
Second, if you only want to encode those features, you can pass the appropriate values to the n_values and categorical_features parameters.
Example:
I'll assume dow is day of week, which has seven values, and that precip has (rain, sleet, snow, and mix) as its values.
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df2 = df.copy()

# Encode the precip labels as integers first; OneHotEncoder (pre-0.20) can't
# operate on string labels directly.
le = LabelEncoder()
le.fit(['rain', 'sleet', 'snow', 'mix'])
df2.precip = le.transform(df2.precip)
df2
Temp precip dow tod
0 -20.44 3 4 14.500000
1 -22.69 3 4 15.216667
2 -21.52 3 4 17.316667
3 -21.52 3 4 17.733333
4 -20.51 3 4 18.150000
# Initialize OneHotEncoder with 4 values for precip and 7 for dow.
ohe = OneHotEncoder(n_values=np.array([4,7]), categorical_features=[1,2])
X = ohe.fit_transform(df2)
X.toarray()
array([[  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -20.44,  14.5       ],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -22.69,  15.21666667],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -21.52,  17.31666667],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -21.52,  17.73333333],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -20.51,  18.15      ]])
OK, but this requires you to either mutate your data in place or make a copy, and things can get a little messy. A more organized way to do this is to use a Pipeline.
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

def get_precip(X):
    # Label-encode the precip column and return it as a 2-D column array.
    le = LabelEncoder()
    le.fit(['rain', 'sleet', 'snow', 'mix'])
    return le.transform(X.precip).reshape(-1, 1)

def get_dow(X):
    # Pull out the dow column as a 2-D column array.
    return X.dow.values.reshape(-1, 1)

def get_rest(X):
    # Everything that is not categorical.
    return X.drop(['precip', 'dow'], axis=1)

precip_trans = FunctionTransformer(get_precip, validate=False)
dow_trans = FunctionTransformer(get_dow, validate=False)
rest_trans = FunctionTransformer(get_rest, validate=False)

union = FeatureUnion([('precip', precip_trans), ('dow', dow_trans), ('rest', rest_trans)])
ohe = OneHotEncoder(n_values=[4, 7], categorical_features=[0, 1])
pipe = Pipeline([('union', union), ('one_hot', ohe)])

X = pipe.fit_transform(df)
X.toarray()
array([[  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -20.44,  14.5       ],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -22.69,  15.21666667],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -21.52,  17.31666667],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -21.52,  17.73333333],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -20.51,  18.15      ]])
I would like to point out that in the upcoming sklearn v0.20 there will be a CategoricalEncoder that should make this kind of thing a lot easier.
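For reference, a rough sketch of how this could look with that newer API, assuming a sklearn version where ColumnTransformer is available (in the 0.20 release that eventually shipped, the CategoricalEncoder functionality was folded into OneHotEncoder and ColumnTransformer); the dow category list 0-6 matches the n_values=7 assumption above:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode only precip and dow, with the full category lists fixed up
# front so rows seen later still produce the same set of columns.
ct = ColumnTransformer(
    [('onehot',
      OneHotEncoder(categories=[['mix', 'rain', 'sleet', 'snow'], list(range(7))]),
      ['precip', 'dow'])],
    remainder='passthrough')  # Temp and tod pass through unchanged

X_new = ct.fit_transform(df)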
I don't want to use pd.get_dummies because that won't put the data in the proper format unless every value of dow and precip is in the new data.

Assuming you want to both encode and keep those two columns, are you sure this wouldn't work for you?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'temp': np.random.random(5) + 20.,
    'precip': pd.Categorical(['snow', 'snow', 'rain', 'none', 'rain']),
    'dow': pd.Categorical([4, 4, 4, 3, 1]),
    'tod': np.random.random(5) + 10.
})

# Keep the original columns alongside their dummies.
pd.concat((df[['dow', 'precip']],
           pd.get_dummies(df, columns=['dow', 'precip'], drop_first=True)),
          axis=1)
dow precip temp tod dow_3 dow_4 precip_rain precip_snow
0 4 snow 20.7019 10.4610 0 1 0 1
1 4 snow 20.0917 10.0174 0 1 0 1
2 4 rain 20.3978 10.5766 0 1 1 0
3 3 none 20.9804 10.0770 1 0 0 0
4 1 rain 20.3121 10.3584 0 0 1 0
If you're going to be interacting with new data that contains categories your df hasn't "seen," you can use
df['col'] = df['col'].cat.add_categories(...)
where you pass a list of the set difference. This adds to the list of "recognized" categories for the resulting pd.Categorical object.
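For illustration, a small sketch of that idea; the category list and the new-data frame below are made up for the example:
import pandas as pd

# Hypothetical: categories observed in the original df vs. a new frame that
# only contains a subset of them.
train_categories = ['none', 'rain', 'snow']
new = pd.DataFrame({'precip': pd.Categorical(['rain', 'rain'])})

# Add the categories the new data hasn't seen, so get_dummies emits the same
# dummy columns it did for the training frame.
missing = set(train_categories) - set(new['precip'].cat.categories)
new['precip'] = new['precip'].cat.add_categories(list(missing))
pd.get_dummies(new, columns=['precip'])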
Two things you could check out: sklearn-pandas, and, as mentioned by @Grr, pipelines (with this good intro).
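For the sklearn-pandas route, a rough sketch could look like this (assumes the package is installed; my understanding is that default=None tells DataFrameMapper to pass the remaining columns through untransformed):
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper

# Binarize the two categorical columns; leave Temp and tod as-is.
mapper = DataFrameMapper(
    [('precip', LabelBinarizer()),
     ('dow', LabelBinarizer())],
    default=None)

X = mapper.fit_transform(df)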
I prefer pipelines because they are a tidy way to work, they are easy to use with things like grid search, they avoid leakage between folds in cross-validation, and so on. So I usually end up with a pipeline like this (this assumes you have precip LabelEncoded first); a usage sketch follows after the code:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression

class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of columns from a DataFrame."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

class Normalize(BaseEstimator, TransformerMixin):
    """Apply an arbitrary normalization function (identity by default)."""
    def __init__(self, func=None, func_param={}):
        self.func = func
        self.func_param = func_param

    def transform(self, X):
        if self.func is not None:
            return self.func(X, **self.func_param)
        else:
            return X

    def fit(self, X, y=None, **fit_params):
        return self

cat_cols = ['precip', 'dow']
num_cols = ['Temp', 'tod']

pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=num_cols), Normalize())),
        ('categorical', make_pipeline(Columns(names=cat_cols), OneHotEncoder(sparse=False)))
    ])),
    ('model', LinearRegression())
])
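With some target vector in hand (the column name 'target' below is made up for the example), the whole pipeline can then be fit or grid-searched in one shot, so the encoding is refit inside each CV fold and nothing leaks between folds; as noted above, this assumes precip has already been label encoded:
from sklearn.model_selection import GridSearchCV

# Hypothetical target column; everything else feeds the pipeline defined above.
X = df.drop('target', axis=1)
y = df['target']

grid = GridSearchCV(pipe,
                    param_grid={'model__fit_intercept': [True, False]},
                    cv=5)
grid.fit(X, y)
grid.best_params_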