Tokenizing words in pandas series
I'm having trouble tokenizing the words in a pandas Series.
My Series is named df:
text
0 This monitor is a great deal for the price.
1 I would recommend it.
2 poor packaging.
dtype: object
I tried df_tokenized=nltk.word_tokenize(df), but the result was TypeError: expected string or bytes-like object.
I also tried three variants of .apply(lambda row:):
df_tokenized=df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)
> TypeError: <lambda>() got an unexpected keyword argument 'axis'
df_tokenized=df.apply(lambda row: nltk.word_tokenize(row['text']))
> TypeError: string indices must be integers
df_tokenized=df.apply(lambda row: nltk.word_tokenize(row[1]))
> TypeError: 'float' object is not subscriptable
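Is there another way to tokenize the words in the Series?
I believe you can use either one (this is the first one you referenced), as long as df is a DataFrame with a text column: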
import nltk
import pandas as pd
df = pd.DataFrame({'text': [' This monitor is a great deal for the price.',
'I would recommend it.',
'poor packaging.']})
df.info()  # info() prints its summary itself and returns None
print(df)
df_tokenized = df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)
print(df_tokenized)
And the output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text 3 non-null object
dtypes: object(1)
memory usage: 152.0+ bytes
text
0 This monitor is a great deal for the price.
1 I would recommend it.
2 poor packaging.
0 [This, monitor, is, a, great, deal, for, the, ...
1 [I, would, recommend, it, .]
2 [poor, packaging, .]
dtype: object
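If df really is a Series, as the question states, the errors have a simple explanation: Series.apply hands each element (a string) to the function rather than a row, so row['text'] fails, and any extra keyword such as axis=1 is forwarded to the lambda itself. The 'float' object is not subscriptable error suggests the real data also contains a non-string value such as NaN. Below is a minimal sketch of tokenizing a Series directly under those assumptions. The helper names simple_tokenize and tokenize_series are hypothetical, and the regex tokenizer is only a stand-in so the example runs without NLTK's tokenizer data.

```python
import re

import pandas as pd


def simple_tokenize(text):
    # Stand-in tokenizer (an assumption, not NLTK): runs of word characters,
    # plus each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_series(series, tokenizer=simple_tokenize):
    # Series.apply passes each element, not a row, so the value is used
    # directly. Non-string values such as NaN become an empty token list.
    return series.apply(
        lambda value: tokenizer(value) if isinstance(value, str) else []
    )


s = pd.Series(
    [
        "This monitor is a great deal for the price.",
        "I would recommend it.",
        "poor packaging.",
        float("nan"),  # assumed missing value, not shown in the question
    ],
    name="text",
)

tokens = tokenize_series(s)
print(tokens)
```

With NLTK installed and its tokenizer data downloaded, passing tokenizer=nltk.word_tokenize should give NLTK's own tokens instead of the stand-in's.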