Expanding Series with a OneHotEncoder

Question

I have a pandas DataFrame that looks like this:

            tag_id
object_id   
    1           77
    2           77
    3           91
    4           91
    5           91
    6           91
    7           77
    8           91
    9           85
    10          88
    10          211
    11          100
    12          81
    12          91
    13          65
    14          73
    15          91
    16          174
    17          91
    18          62
    19          62
    20          91
    ...         ...
    1527        105
    1527        108
    1528        87
    1529        91

    1907 rows × 1 columns

As you can see, some index values are repeated with different "tag_id" values. I would like to use OneHotEncoder to reshape this DataFrame into a sparse matrix of binary values, like this:

            1    2    3    ...    77    ...    85    ...    88    ...    91    ...    211
object_id
    1       0    0    0    ...    1     ...    0     ...     0    ...    0     ...     0
    2       0    0    0    ...    1     ...    0     ...     0    ...    0     ...     0
    3       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    4       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    5       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    6       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    7       0    0    0    ...    1     ...    0     ...     0    ...    0     ...     0
    8       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    9       0    0    0    ...    0     ...    1     ...     0    ...    0     ...     0
    10      0    0    0    ...    0     ...    0     ...     1    ...    0     ...     1

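and so on.

Using pd.get_dummies(df['tag_id']) gets me part of the way there, but it does not collapse rows that share an index value, so I still end up with 1907 rows rather than 1907 minus the number of duplicates.

Is there a way to solve this?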
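
For reference, the behaviour described above can be reproduced on a small frame. This is only a sketch: the toy DataFrame below is an assumption built from a few rows of the table (object_id 10 appears twice), not the full 1907-row data.

```python
import pandas as pd

# Toy frame built from a few rows of the table above; object_id 10 appears twice.
df = pd.DataFrame(
    {"tag_id": [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name="object_id"),
)

dummies = pd.get_dummies(df["tag_id"])

# get_dummies emits one row per input row, so the two rows
# for object_id 10 are not merged into a single row.
print(dummies.shape)
print(dummies.index.tolist())
```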

Answer 1

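Just sum over the index level and turn nonzero counts into 1. The sum(level=0) form was removed in pandas 2.0, so group by the index level instead:

pd.get_dummies(df['tag_id']).groupby(level=0).sum().ne(0).astype(int)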

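Or drop the duplicates, keeping only the first tag_id for each object_id (note that this discards the other tags of that object):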

pd.get_dummies(df['tag_id'].groupby(level=0).first())
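
A minimal sketch of the collapse step on a toy frame may help. The sample data is an assumption taken from a few rows of the question, and max() is swapped in here as an equivalent way to say "1 if any row of this object had the tag"; it is not the exact expression from the answer.

```python
import pandas as pd

# Toy frame (assumed sample): object_id 10 carries two tags, 88 and 211.
df = pd.DataFrame(
    {"tag_id": [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name="object_id"),
)

# Build one indicator row per input row, then collapse rows that share
# an object_id. max() keeps a 1 wherever any of the rows had that tag.
encoded = pd.get_dummies(df["tag_id"]).groupby(level=0).max().astype(int)
print(encoded)
```

Each object_id now appears once, and the row for object_id 10 has a 1 in both the 88 and 211 columns.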

Answer 2

In addition to the great answer above, I found another approach:

# Definition of categories (df_str is a master list of all possible 'tag_id' values)
cat = [int(x) for x in sorted(df_str['id'].unique())]

# Definition of data
data = df.groupby(df.index).agg(list)
data = data['tag_id'].apply(lambda row: [int(el) for el in row])

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes = cat).fit(data)
encoded_data = mlb.transform(data)

df_tags_encoded = pd.DataFrame(data = encoded_data, index = data.index, columns = ["tag_id_" + str(name) for name in cat])
df_tags_encoded.head(10)

        57  58  59  60  61  62  63  64  65  66  ...     203 204 205 206 207 208 209 210 211 212
object_id                                                                                   
    1   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    2   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    3   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    4   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    5   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    6   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    7   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    8   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    9   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    10  0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   1   0

10 rows × 156 columns
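
Since the question asks for a sparse matrix, the same idea can be sketched with sparse output. This is an illustration under assumptions: the toy frame and the names tags_per_object and sparse_encoded are made up for the example, and the classes are inferred from the data instead of coming from a master list such as df_str.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame (assumed sample): object_id 10 carries two tags, 88 and 211.
df = pd.DataFrame(
    {"tag_id": [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name="object_id"),
)

# Collect every tag of an object into one list per object_id.
tags_per_object = df.groupby(level=0)["tag_id"].agg(list)

# sparse_output=True returns a SciPy sparse matrix instead of a dense array.
mlb = MultiLabelBinarizer(sparse_output=True)
sparse_encoded = mlb.fit_transform(tags_per_object)

# Wrap the matrix in a sparse-backed DataFrame to keep the object_id labels.
encoded = pd.DataFrame.sparse.from_spmatrix(
    sparse_encoded, index=tags_per_object.index, columns=mlb.classes_
)
print(encoded)
```

Passing classes=cat, as in the answer, would additionally fix the column set and order regardless of which tags occur in the data.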


Tags: python, pandas, one-hot-encoding