How can I one hot encode a subset of columns?
I have a dataset with some categorical columns. Here is a small sample:
Temp precip dow tod
-20.44 snow 4 14.5
-22.69 snow 4 15.216666666666667
-21.52 snow 4 17.316666666666666
-21.52 snow 4 17.733333333333334
-20.51 snow 4 18.15
Here, dow and precip are categorical, while the others are continuous. Is there a way I can create a OneHotEncoder for just those columns? I don't want to use pd.get_dummies, because that won't put the data in the proper format unless every value of dow and precip is present in the new data.
The short answer is yes, but with some caveats.
First, you won't be able to use OneHotEncoder directly on the precip feature; you will need to encode those labels into integers with LabelEncoder.
Second, if you only want to encode those features, you can pass the appropriate values to the n_values and categorical_features parameters.
Example:
I'll assume dow is day of week, which has seven values, and that precip has (rain, sleet, snow, and mix) as its values.
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df2 = df.copy()

# Encode the precip labels as integers first; OneHotEncoder (pre-0.20) can't
# operate on string labels directly.
le = LabelEncoder()
le.fit(['rain', 'sleet', 'snow', 'mix'])
df2.precip = le.transform(df2.precip)
df2
Temp precip dow tod
0 -20.44 3 4 14.500000
1 -22.69 3 4 15.216667
2 -21.52 3 4 17.316667
3 -21.52 3 4 17.733333
4 -20.51 3 4 18.150000
# Initialize OneHotEncoder with 4 values for precip and 7 for dow.
ohe = OneHotEncoder(n_values=np.array([4,7]), categorical_features=[1,2])
X = ohe.fit_transform(df2)
X.toarray()
array([[  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -20.44,  14.5       ],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -22.69,  15.21666667],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -21.52,  17.31666667],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -21.52,  17.73333333],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -20.51,  18.15      ]])
OK, but this requires you to either mutate your data in place or make a copy, and things can get a little messy. A more organized way to do this is to use a Pipeline.
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

def get_precip(X):
    # Label-encode the precip column and return it as a 2-D column array.
    le = LabelEncoder()
    le.fit(['rain', 'sleet', 'snow', 'mix'])
    return le.transform(X.precip).reshape(-1, 1)

def get_dow(X):
    # Pull out the dow column as a 2-D column array.
    return X.dow.values.reshape(-1, 1)

def get_rest(X):
    # Everything that is not categorical.
    return X.drop(['precip', 'dow'], axis=1)

precip_trans = FunctionTransformer(get_precip, validate=False)
dow_trans = FunctionTransformer(get_dow, validate=False)
rest_trans = FunctionTransformer(get_rest, validate=False)

union = FeatureUnion([('precip', precip_trans), ('dow', dow_trans), ('rest', rest_trans)])
ohe = OneHotEncoder(n_values=[4, 7], categorical_features=[0, 1])
pipe = Pipeline([('union', union), ('one_hot', ohe)])

X = pipe.fit_transform(df)
X.toarray()
array([[  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -20.44,  14.5       ],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -22.69,  15.21666667],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -21.52,  17.31666667],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -21.52,  17.73333333],
       [  0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1.,   0.,   0., -20.51,  18.15      ]])
I would like to point out that in the upcoming sklearn v0.20 there will be a CategoricalEncoder that should make this kind of thing a lot easier.
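For reference, a rough sketch of how this could look with that newer API, assuming a sklearn version where ColumnTransformer is available (in the 0.20 release that eventually shipped, the CategoricalEncoder functionality was folded into OneHotEncoder and ColumnTransformer); the dow category list 0-6 matches the n_values=7 assumption above:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode only precip and dow, with the full category lists fixed up
# front so rows seen later still produce the same set of columns.
ct = ColumnTransformer(
    [('onehot',
      OneHotEncoder(categories=[['mix', 'rain', 'sleet', 'snow'], list(range(7))]),
      ['precip', 'dow'])],
    remainder='passthrough')  # Temp and tod pass through unchanged

X_new = ct.fit_transform(df)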
I don't want to use pd.get_dummies because that won't put the data in the proper format unless every value of dow and precip is in the new data.

Assuming you want to both encode and keep those two columns, are you sure this wouldn't work for you?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'temp': np.random.random(5) + 20.,
    'precip': pd.Categorical(['snow', 'snow', 'rain', 'none', 'rain']),
    'dow': pd.Categorical([4, 4, 4, 3, 1]),
    'tod': np.random.random(5) + 10.
})

# Keep the original columns alongside their dummies.
pd.concat((df[['dow', 'precip']],
           pd.get_dummies(df, columns=['dow', 'precip'], drop_first=True)),
          axis=1)
dow precip temp tod dow_3 dow_4 precip_rain precip_snow
0 4 snow 20.7019 10.4610 0 1 0 1
1 4 snow 20.0917 10.0174 0 1 0 1
2 4 rain 20.3978 10.5766 0 1 1 0
3 3 none 20.9804 10.0770 1 0 0 0
4 1 rain 20.3121 10.3584 0 0 1 0
If you're going to be interacting with new data that contains categories your df hasn't "seen," you can use
df['col'] = df['col'].cat.add_categories(...)
where you pass a list of the set difference. This adds to the list of "recognized" categories for the resulting pd.Categorical object.
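For illustration, a small sketch of that idea; the category list and the new-data frame below are made up for the example:
import pandas as pd

# Hypothetical: categories observed in the original df vs. a new frame that
# only contains a subset of them.
train_categories = ['none', 'rain', 'snow']
new = pd.DataFrame({'precip': pd.Categorical(['rain', 'rain'])})

# Add the categories the new data hasn't seen, so get_dummies emits the same
# dummy columns it did for the training frame.
missing = set(train_categories) - set(new['precip'].cat.categories)
new['precip'] = new['precip'].cat.add_categories(list(missing))
pd.get_dummies(new, columns=['precip'])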
Two things you could check out: sklearn-pandas, and, as mentioned by @Grr, pipelines (with this good intro).
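For the sklearn-pandas route, a rough sketch could look like this (assumes the package is installed; my understanding is that default=None tells DataFrameMapper to pass the remaining columns through untransformed):
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper

# Binarize the two categorical columns; leave Temp and tod as-is.
mapper = DataFrameMapper(
    [('precip', LabelBinarizer()),
     ('dow', LabelBinarizer())],
    default=None)

X = mapper.fit_transform(df)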
I prefer pipelines because they are a tidy way to work, they are easy to use with things like grid search, they avoid leakage between folds in cross-validation, and so on. So I usually end up with a pipeline like this (this assumes you have precip LabelEncoded first); a usage sketch follows after the code:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression

class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of columns from a DataFrame."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

class Normalize(BaseEstimator, TransformerMixin):
    """Apply an arbitrary normalization function (identity by default)."""
    def __init__(self, func=None, func_param={}):
        self.func = func
        self.func_param = func_param

    def transform(self, X):
        if self.func is not None:
            return self.func(X, **self.func_param)
        else:
            return X

    def fit(self, X, y=None, **fit_params):
        return self

cat_cols = ['precip', 'dow']
num_cols = ['Temp', 'tod']

pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=num_cols), Normalize())),
        ('categorical', make_pipeline(Columns(names=cat_cols), OneHotEncoder(sparse=False)))
    ])),
    ('model', LinearRegression())
])
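With some target vector in hand (the column name 'target' below is made up for the example), the whole pipeline can then be fit or grid-searched in one shot, so the encoding is refit inside each CV fold and nothing leaks between folds; as noted above, this assumes precip has already been label encoded:
from sklearn.model_selection import GridSearchCV

# Hypothetical target column; everything else feeds the pipeline defined above.
X = df.drop('target', axis=1)
y = df['target']

grid = GridSearchCV(pipe,
                    param_grid={'model__fit_intercept': [True, False]},
                    cv=5)
grid.fit(X, y)
grid.best_params_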