Python sklearn - Determine the encoding order of LabelEncoder

Question

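I want to control the labels that sklearn's LabelEncoder assigns (i.e. 0, 1, 2, 3, ...) so that they follow a specific order of the categorical variable's possible values (say ['b', 'a', 'c', 'd']). LabelEncoder appears to assign labels in lexicographic order, as this example shows: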

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['b', 'a', 'c', 'd' ])
le.classes_
# array(['a', 'b', 'c', 'd'], dtype='<U1')
le.transform(['a', 'b'])
# array([0, 1])

How can I force the encoder to keep the order in which the values are first encountered in the .fit method (i.e. encode 'b' as 0, 'a' as 1, 'c' as 2 and 'd' as 3)?

Answer 1

You cannot do that with the stock LabelEncoder.

LabelEncoder.fit() uses numpy.unique, which always returns the data sorted, as the source shows:

def fit(...):
    y = column_or_1d(y, warn=True)
    self.classes_ = np.unique(y)
    return self
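
A quick check of that sorting behaviour, using only numpy (the variable names here are just for illustration):

```python
import numpy as np

labels = ['b', 'a', 'c', 'd']

# np.unique returns the distinct values in sorted order,
# regardless of the order in which they appear in the input.
sorted_classes = np.unique(labels)
print(sorted_classes)  # ['a' 'b' 'c' 'd']
```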

So if you want this behaviour, you need to override the fit() method. Something like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

class MyLabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        return self

Then you can do this:

le = MyLabelEncoder()
le.fit(['b', 'a', 'c', 'd' ])
le.classes_
#Output:  array(['b', 'a', 'c', 'd'], dtype=object)

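Here I use pandas.Series.unique() to get the unique classes, because it keeps them in order of first appearance. If you cannot use pandas for any reason, see this question on doing the same with numpy: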

numpy unique without sort
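
One way to do this with numpy alone is sketched below. It relies on the return_index option of np.unique; the helper name unique_in_order is made up for this example and is not part of numpy or sklearn.

```python
import numpy as np

def unique_in_order(values):
    """Return the distinct values in order of first appearance."""
    arr = np.asarray(values)
    # return_index gives, for each (sorted) unique value,
    # the index of its first occurrence in arr.
    _, first_idx = np.unique(arr, return_index=True)
    # Sorting those indices restores the order of first appearance.
    return arr[np.sort(first_idx)]

classes = unique_in_order(['b', 'a', 'c', 'd', 'a', 'b'])
print(classes)  # ['b' 'a' 'c' 'd']
```

The result could be assigned to self.classes_ in the overridden fit() in place of pd.Series(y).unique().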

Answer 2

Vivek Kumar's solution worked for me, but I had to do it like this:

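import numpy as np

class LabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        # ndarray.sort() sorts in place and returns None, so use np.sort() to get the sorted array back
        self.classes_ = np.sort(pd.Series(y).unique())
        return self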

Answer 3

Note that there may now be a better way to do this; see http://contrib.scikit-learn.org/categorical-encoding/ordinal.html. In particular, see the mapping parameter:

a mapping of class to label to use for the encoding, optional. the dict contains the keys ‘col’ and ‘mapping’. the value of ‘col’ should be the feature name. the value of ‘mapping’ should be a dictionary of ‘original_label’ to ‘encoded_label’. example mapping: [{‘col’: ‘col1’, ‘mapping’: {None: 0, ‘a’: 1, ‘b’: 2}}]
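
The idea of an explicit label-to-code mapping can be sketched in plain Python. This is not the category_encoders API, only an illustration of the concept; the names mapping, encode and decode are hypothetical.

```python
# Hypothetical explicit mapping from original label to encoded label.
mapping = {'b': 0, 'a': 1, 'c': 2, 'd': 3}

def encode(values, mapping):
    """Encode each value with the given mapping; fail on unseen labels."""
    try:
        return [mapping[v] for v in values]
    except KeyError as exc:
        raise ValueError(f"unseen label: {exc}") from None

def decode(codes, mapping):
    """Invert the mapping to recover the original labels."""
    inverse = {code: label for label, code in mapping.items()}
    return [inverse[c] for c in codes]

encoded = encode(['a', 'b', 'd'], mapping)
decoded = decode(encoded, mapping)
print(encoded)  # [1, 0, 3]
print(decoded)  # ['a', 'b', 'd']
```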

Answer 4

Note: this is not a standard approach but a hack. I use the classes_ attribute to customize my mapping:

import numpy as np
from sklearn import preprocessing

# df_1 is a pandas DataFrame with a 'Temp' column of strings
le_temp = preprocessing.LabelEncoder()
le_temp = le_temp.fit(df_1['Temp'])
print(df_1['Temp'])
# Overwrite the fitted (sorted) classes with the desired order
le_temp.classes_ = np.array(['Cool', 'Mild','Hot'])
print("New classes sequence::",le_temp.classes_)
df_1['Temp'] = le_temp.transform(df_1['Temp'])
print(df_1['Temp'])

My output looks like this:

1      Hot
2      Hot
3      Hot
4     Mild
5     Cool
6     Cool

Name: Temp, dtype: object
New classes sequence:: ['Cool' 'Mild' 'Hot']

1     2
2     2
3     2
4     1
5     0
6     0

Name: Temp, dtype: int32
