Python: one-hot encoding for comma-separated values

My dataset looks like this:

Type   Date         Issues
M      1 Jan 2019   A12,B56,C78
K      2 May 2019   B56, D90
M      5 Feb 2019   A12,K31 
K      3 Jan 2019   A12,B56,K31,F66
.
.
.

I want to one-hot encode the Issues column so that my dataset looks like this:

Type   Date         A12 B56 C78 D90 E88 K31 F66
M      1 Jan 2019   1   1   1   0   0   0   0
K      2 May 2019   0   1   0   1   0   0   0
M      5 Feb 2019   1   0   0   0   0   1   0
K      3 Jan 2019   1   1   0   0   0   1   1
.
.
.

How can I do this in Python?

Assuming the issues in each row are joined into a single comma-separated string, you can do this:

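# Get a sorted list of the distinct issues, stripping stray whitespace
# (the sample data contains values such as "B56, D90" and "K31 ")
issues = sorted({issue.strip() for issue in ",".join(df.Issues).split(",")})

# Add one column per issue, filled with 0's and 1's.
# regex=False treats the issue code as literal text; this is still a
# substring test, so it assumes no code is contained in another code.
for issue in issues:
    df[issue] = df.Issues.str.contains(issue, regex=False).astype(int)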
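Note that str.contains is a substring test, so a code that happens to be part of a longer code would also match. A minimal sketch of a whole-token alternative follows, assuming the Issues column holds comma-separated strings as in the question. The DataFrame is rebuilt here from the four sample rows only so the snippet runs on its own, and the names tokens, onehot and result are illustrative.

```python
import pandas as pd

# Sample rows copied from the question (whitespace included on purpose)
df = pd.DataFrame({
    "Type": ["M", "K", "M", "K"],
    "Date": ["1 Jan 2019", "2 May 2019", "5 Feb 2019", "3 Jan 2019"],
    "Issues": ["A12,B56,C78", "B56, D90", "A12,K31 ", "A12,B56,K31,F66"],
})

# One entry per (row, issue) pair; the original row index is repeated
tokens = df["Issues"].str.split(",").explode().str.strip()

# Indicator columns per token, then collapse back to one row per original index
onehot = pd.get_dummies(tokens, dtype=int).groupby(level=0).max()

# Attach the indicator columns to the remaining columns
result = df.drop(columns="Issues").join(onehot)
print(result)
```

Because each token is compared as a whole value after stripping, "B56, D90" and "K31 " end up in the same columns as "D90" and "K31".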

Use pandas.Series.str.get_dummies:

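import pandas as pd

# Split Issues on "," into indicator columns and append them to the other columns.
# axis is passed by keyword: pandas 2.0 no longer accepts it positionally.
new_df = pd.concat([df.drop(columns='Issues'), df['Issues'].str.get_dummies(sep=",")], axis=1)
print(new_df)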

Output:

  Type        Date   D90  A12  B56  C78  F66  K31  K31 
0    M  1 Jan 2019     0    1    1    1    0    0     0
1    K  2 May 2019     1    0    1    0    0    0     0
2    M  5 Feb 2019     0    1    0    0    0    0     1
3    K  3 Jan 2019     0    1    1    0    1    1     0
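In this output the first D90 column is really " D90" (leading space) and the last column is "K31 " (trailing space), because get_dummies splits on the separator without stripping whitespace. A minimal sketch of normalizing the strings first, assuming the stray spaces in the sample data are unintended; the DataFrame is rebuilt from the question's rows so the snippet is self-contained, and cleaned and dummies are illustrative names.

```python
import pandas as pd

# Sample rows copied from the question (whitespace included on purpose)
df = pd.DataFrame({
    "Type": ["M", "K", "M", "K"],
    "Date": ["1 Jan 2019", "2 May 2019", "5 Feb 2019", "3 Jan 2019"],
    "Issues": ["A12,B56,C78", "B56, D90", "A12,K31 ", "A12,B56,K31,F66"],
})

# Trim both ends of each string and remove whitespace around every comma
cleaned = df["Issues"].str.strip().str.replace(r"\s*,\s*", ",", regex=True)

# Split on "," into 0/1 indicator columns and append them to the other columns
dummies = cleaned.str.get_dummies(sep=",")
new_df = pd.concat([df.drop(columns="Issues"), dummies], axis=1)
print(new_df)
```

With the whitespace removed there is a single column per issue code.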