How do I do word tokenization in a pandas DataFrame?
Here is my data:
No Text
1 You are smart
2 You are beautiful
My expected output:
No Text You are smart beautiful
1 You are smart 1 1 1 0
2 You are beautiful 1 1 0 1
For an nltk solution, you need word_tokenize to get a list of words per row, then MultiLabelBinarizer, and finally join back to the original DataFrame:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from nltk import word_tokenize

mlb = MultiLabelBinarizer()
# Tokenize each row's text into a list of words
s = df.apply(lambda row: word_tokenize(row['Text']), axis=1)
# Binarize the token lists into 0/1 indicator columns and join back
df = df.join(pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index))
print(df)
No Text You are beautiful smart
0 1 You are smart 1 1 0 1
1 2 You are beautiful 1 1 1 0
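To see what the MultiLabelBinarizer step does on its own, here is a minimal standalone illustration with the token lists written out by hand (rather than produced by word_tokenize); it yields one 0/1 column per word over the sorted vocabulary:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each inner list is one document's tokens
matrix = mlb.fit_transform([['You', 'are', 'smart'],
                            ['You', 'are', 'beautiful']])
print(mlb.classes_)   # sorted vocabulary: uppercase sorts before lowercase
print(matrix)         # one indicator row per document
```

The class order ('You', 'are', 'beautiful', 'smart') explains the column order in the output above.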
For a pure pandas solution, use get_dummies + join:
df = df.join(df['Text'].str.get_dummies(sep=' '))
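Put together as a runnable sketch (with the DataFrame built from the question's sample data, column names assumed from the post):

```python
import pandas as pd

# Sample frame matching the question's data
df = pd.DataFrame({'No': [1, 2],
                   'Text': ['You are smart', 'You are beautiful']})

# Split on whitespace and create one 0/1 indicator column per word
out = df.join(df['Text'].str.get_dummies(sep=' '))
print(out)
```

Note that str.get_dummies splits on the literal separator only, so unlike word_tokenize it will not separate punctuation from words.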