Expanding Series with a OneHotEncoder
I have a pandas DataFrame that looks like this:
tag_id
object_id
1 77
2 77
3 91
4 91
5 91
6 91
7 77
8 91
9 85
10 88
10 211
11 100
12 81
12 91
13 65
14 73
15 91
16 174
17 91
18 62
19 62
20 91
... ...
1527 105
1527 108
1528 87
1529 91
1907 rows × 1 columns
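As you can see, some index values are repeated with different "tag_id" values. I would like to use a OneHotEncoder to reorganize this DataFrame into a sparse matrix of binary values, like this: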
1 2 3 ... 77 ... 85 ... 88 ... 91 ... 211
object_id
1 0 0 0 ... 1 ... 0 ... 0 ... 0 ... 0
2 0 0 0 ... 1 ... 0 ... 0 ... 0 ... 0
3 0 0 0 ... 0 ... 0 ... 0 ... 1 ... 0
4 0 0 0 ... 0 ... 0 ... 0 ... 1 ... 0
5 0 0 0 ... 0 ... 0 ... 0 ... 1 ... 0
6 0 0 0 ... 0 ... 0 ... 0 ... 1 ... 0
7 0 0 0 ... 1 ... 0 ... 0 ... 0 ... 0
8 0 0 0 ... 0 ... 0 ... 0 ... 1 ... 0
9 0 0 0 ... 0 ... 1 ... 0 ... 0 ... 0
10 0 0 0 ... 0 ... 0 ... 1 ... 0 ... 1
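and so on.
Using pd.get_dummies(df['tag_id']) gets me part of the way there, but it does not merge rows that share an index value, so I still end up with 1907 rows instead of 1907 minus the number of duplicates.
Is there any way to solve this?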
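Just sum over the index level (older pandas versions also accepted .sum(level=0), which was removed in pandas 2.0):
pd.get_dummies(df['tag_id']).groupby(level=0).sum().ne(0).astype(int)
Or keep only the first tag_id for each index value (note that this drops any additional tags an object has):
pd.get_dummies(df['tag_id'].groupby(level=0).first())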
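To make the difference between the two options concrete, here is a minimal sketch on a five-row sample. The sample DataFrame and the names multi_hot and first_only are made up for illustration; only the two expressions come from the answer above.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of the question's data;
# object_id 10 appears twice, with tags 88 and 211.
df = pd.DataFrame(
    {"tag_id": [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name="object_id"),
)

# Option 1: one-hot encode every row, then merge rows that share an index value.
# A cell becomes 1 if any row of that object carried the tag.
multi_hot = (
    pd.get_dummies(df["tag_id"])
    .groupby(level=0)
    .sum()
    .ne(0)
    .astype(int)
)

# Option 2: keep only the first tag per object before encoding.
# Tag 211 of object 10 is discarded, so no column is created for it.
first_only = pd.get_dummies(df["tag_id"].groupby(level=0).first()).astype(int)

print(multi_hot)
print(first_only)
```

With the first option, object 10 keeps both of its tags; with the second, each object ends up with exactly one tag, so it only fits if the extra tags can safely be ignored.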
In addition to the great answer above, I also found another approach:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Categories: df_str is a master list of all possible 'tag_id' values
cat = [int(x) for x in sorted(df_str['id'].unique())]

# Data: collect the tag_ids of each object_id into one list per index value
data = df.groupby(df.index).agg(list)
data = data['tag_id'].apply(lambda row: [int(el) for el in row])

# Fit on the fixed category list, then encode each list as one multi-hot row
mlb = MultiLabelBinarizer(classes=cat).fit(data)
encoded_data = mlb.transform(data)
df_tags_encoded = pd.DataFrame(data=encoded_data, index=data.index, columns=["tag_id_" + str(name) for name in cat])
df_tags_encoded.head(10)
57 58 59 60 61 62 63 64 65 66 ... 203 204 205 206 207 208 209 210 211 212
object_id
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
10 rows × 156 columns
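For anyone who wants to try the MultiLabelBinarizer approach without a master list like df_str, here is a small self-contained sketch. The sample data is made up, and deriving the classes from the data itself is an assumption of this sketch, not what the code above does.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample; object_id 10 carries two tags.
df = pd.DataFrame(
    {"tag_id": [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name="object_id"),
)

# One list of tags per object_id.
tags_per_object = df.groupby(level=0)["tag_id"].agg(list)

# Assumption: no master list is available, so the classes are the tags seen in df.
classes = sorted(df["tag_id"].unique().tolist())

# Each list of tags becomes one row with a 1 for every tag it contains.
mlb = MultiLabelBinarizer(classes=classes)
encoded = mlb.fit_transform(tags_per_object)

encoded_df = pd.DataFrame(
    encoded,
    index=tags_per_object.index,
    columns=[f"tag_id_{c}" for c in classes],
)
print(encoded_df)
```

Passing an explicit classes list fixes the column order and, when a master list is used as in the code above, also produces all-zero columns for tags that never occur in the data.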