python pandas: split a comma-separated column into new columns, one per value
I have a DataFrame like this:
import numpy as np
import pandas as pd

data = np.array([["userA", "event2, event3"],
                 ["userB", "event3, event4"],
                 ["userC", "event2"]])
data = pd.DataFrame(data)
       0                 1
0  userA  "event2, event3"
1  userB  "event3, event4"
2  userC          "event2"
Now I would like a DataFrame like this:
       0 event2 event3 event4
0  userA      1      1
1  userB             1      1
2  userC      1
Can anyone help?
It seems you need str.get_dummies, followed by replace to turn each 0 into an empty string:
df = data[[0]].join(data[1].str.get_dummies(', ').replace(0, ''))
print(df)
       0 event2 event3 event4
0  userA      1      1
1  userB             1      1
2  userC      1
Details:
print(data[1].str.get_dummies(', '))
   event2  event3  event4
0       1       1       0
1       0       1       1
2       1       0       0
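The separator passed to str.get_dummies is matched literally, so ', ' only works when every row uses exactly a comma followed by one space. As a minimal sketch, assuming the real data may have inconsistent spacing (the messy sample values below are made up for illustration), the separator can be normalised first:

```python
import pandas as pd

# Same users and events as in the question, but with inconsistent spacing
# around the commas (an assumption added for illustration).
data = pd.DataFrame({
    0: ["userA", "userB", "userC"],
    1: ["event2,event3", "event3 ,  event4", "event2"],
})

# Collapse any whitespace around commas so every row splits on a bare ",".
events = data[1].str.strip().str.replace(r"\s*,\s*", ",", regex=True)

# One 0/1 indicator column per distinct event.
dummies = events.str.get_dummies(sep=",")

result = data[[0]].join(dummies)
print(result)
```

The 0 values are left in place here so the indicator columns stay numeric; the .replace(0, '') step is only needed for display.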
If you have many features (words), it makes sense to use a sparse matrix so that memory is used more efficiently:
In [120]: from sklearn.feature_extraction.text import CountVectorizer
In [121]: cvect = CountVectorizer()
In [122]: data = data.join(pd.SparseDataFrame(cvect.fit_transform(data.pop(1)),
                                              data.index,
                                              cvect.get_feature_names(),
                                              default_fill_value=0))
In [123]: data
Out[123]:
       0  event2  event3  event4
0  userA       1       1       0
1  userB       0       1       1
2  userC       1       0       0
In [124]: data.memory_usage()
Out[124]:
Index     80
0         24
event2    16
event3    16
event4     8
dtype: int64