Python: one-hot encoding for comma-separated values
My dataset looks like this:
Type  Date        Issues
M     1 Jan 2019  A12,B56,C78
K     2 May 2019  B56, D90
M     5 Feb 2019  A12,K31
K     3 Jan 2019  A12,B56,K31,F66
...
I want to one-hot encode the Issues column so that my dataset looks like this:
Type  Date        A12  B56  C78  D90  E88  K31  F66
M     1 Jan 2019    1    1    1    0    0    0    0
K     2 May 2019    0    1    0    1    0    0    0
M     5 Feb 2019    1    0    0    0    0    1    0
K     3 Jan 2019    1    1    0    0    0    1    1
...
How can I do this in Python?
Assuming the issues in each row are stored as a single comma-joined string, you can do this:
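# Get a sorted list of the distinct issues, stripping stray whitespace
issues = sorted({i.strip() for i in ",".join(df.Issues).split(",")})
# Fill one column per issue with 0's and 1's
# (str.contains is a substring test; regex=False treats the code literally)
for issue in issues:
    df[issue] = df.Issues.str.contains(issue, regex=False).astype(int)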
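One caveat with str.contains is that it matches substrings, so a shorter code such as A1 would also match A12. The following is a minimal sketch of the same loop that uses exact token matching instead. It assumes the sample data from the question. The DataFrame construction and the names tokens and encoded are illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed layout).
df = pd.DataFrame({
    "Type": ["M", "K", "M", "K"],
    "Date": ["1 Jan 2019", "2 May 2019", "5 Feb 2019", "3 Jan 2019"],
    "Issues": ["A12,B56,C78", "B56, D90", "A12,K31", "A12,B56,K31,F66"],
})

# Split each row into a list of whitespace-stripped tokens.
tokens = df["Issues"].apply(lambda s: [t.strip() for t in s.split(",")])

# Distinct issue codes across all rows, in sorted order.
issues = sorted({t for row in tokens for t in row})

# One 0/1 column per issue, using exact membership rather than substring search.
for issue in issues:
    df[issue] = tokens.apply(lambda row, issue=issue: int(issue in row))

encoded = df.drop(columns="Issues")
print(encoded)
```

Because membership is tested against the split tokens, a code only sets a column when it appears as a whole entry in the row.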
Use pandas.Series.str.get_dummies:
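import pandas as pd
# Pass columns= and axis= by keyword; the positional axis argument to drop() was removed in pandas 2.0
new_df = pd.concat([df.drop(columns='Issues'), df['Issues'].str.get_dummies(sep=",")], axis=1)
print(new_df)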
Output:
  Type        Date  D90  A12  B56  C78  F66  K31  K31
0    M  1 Jan 2019    0    1    1    1    0    0    0
1    K  2 May 2019    1    0    1    0    0    0    0
2    M  5 Feb 2019    0    1    0    0    0    0    1
3    K  3 Jan 2019    0    1    1    0    1    1    0
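The sample data has a space after one comma ("B56, D90"), and str.get_dummies treats labels that differ only in whitespace as separate columns. The following is a hedged sketch, again assuming the question's sample data. It normalizes whitespace before encoding, then reindexes to a fixed list of codes so that a code absent from the data (such as E88 in the desired output) still gets an all-zero column. The list all_codes is a hypothetical name, with values copied from the desired output in the question.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed layout).
df = pd.DataFrame({
    "Type": ["M", "K", "M", "K"],
    "Date": ["1 Jan 2019", "2 May 2019", "5 Feb 2019", "3 Jan 2019"],
    "Issues": ["A12,B56,C78", "B56, D90", "A12,K31", "A12,B56,K31,F66"],
})

# Remove whitespace around the separators so " D90" and "D90" become one label.
clean = df["Issues"].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

# One 0/1 column per distinct code found in the data.
dummies = clean.str.get_dummies(sep=",")

# Hypothetical full list of codes, in the column order of the desired output.
all_codes = ["A12", "B56", "C78", "D90", "E88", "K31", "F66"]
dummies = dummies.reindex(columns=all_codes, fill_value=0)

new_df = pd.concat([df.drop(columns="Issues"), dummies], axis=1)
print(new_df)
```

Reindexing also fixes the column order, which helps when several files need to be encoded into the same layout.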