How do I do word tokenization in a pandas DataFrame?
Here is my data:
No Text
1 You are smart
2 You are beautiful
My expected output:
No Text You are smart beautiful
1 You are smart 1 1 1 0
2 You are beautiful 1 1 0 1
For an nltk solution, you need word_tokenize to get a list of words per row, then MultiLabelBinarizer, and finally join back to the original DataFrame:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from nltk import word_tokenize

mlb = MultiLabelBinarizer()
# Tokenize each row's text into a list of words
s = df.apply(lambda row: word_tokenize(row['Text']), axis=1)
# Binarize the token lists into 0/1 indicator columns and join back
df = df.join(pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index))
print(df)
No Text You are beautiful smart
0 1 You are smart 1 1 0 1
1 2 You are beautiful 1 1 1 0
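To see what the MultiLabelBinarizer step does on its own, here is a minimal standalone illustration with the token lists written out by hand (rather than produced by word_tokenize); it yields one 0/1 column per word over the sorted vocabulary:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each inner list is one document's tokens
matrix = mlb.fit_transform([['You', 'are', 'smart'],
                            ['You', 'are', 'beautiful']])
print(mlb.classes_)   # sorted vocabulary: uppercase sorts before lowercase
print(matrix)         # one indicator row per document
```

The class order ('You', 'are', 'beautiful', 'smart') explains the column order in the output above.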
For a pure pandas solution, use get_dummies + join:
df = df.join(df['Text'].str.get_dummies(sep=' '))
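Put together as a runnable sketch (with the DataFrame built from the question's sample data, column names assumed from the post):

```python
import pandas as pd

# Sample frame matching the question's data
df = pd.DataFrame({'No': [1, 2],
                   'Text': ['You are smart', 'You are beautiful']})

# Split on whitespace and create one 0/1 indicator column per word
out = df.join(df['Text'].str.get_dummies(sep=' '))
print(out)
```

Note that str.get_dummies splits on the literal separator only, so unlike word_tokenize it will not separate punctuation from words.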