Python Pandas NLTK Tokenize Column in Pandas Dataframe: expected string or bytes-like object
I have the following sample dataframe with a 'problem_definition' column:
ID problem_definition
1 cat, dog fish
2 turtle; cat; fish fish
3 hello book fish
4 dog hello fish cat
I want to tokenize the 'problem_definition' column.
Below is my code:
from nltk.tokenize import sent_tokenize, word_tokenize
import pandas as pd
df = pd.read_csv('log_page_nlp_subset.csv')
df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)
The above code gives me the following error:
TypeError: expected string or bytes-like object
Use a lambda inside apply:
df = pd.DataFrame({'TEXT':['cat, dog fish', 'turtle; cat; fish fish', 'hello book fish', 'dog hello fish cat']})
df
TEXT
0 cat, dog fish
1 turtle; cat; fish fish
2 hello book fish
3 dog hello fish cat
df.TEXT.apply(lambda x: word_tokenize(x))
0 [cat, ,, dog, fish]
1 [turtle, ;, cat, ;, fish, fish]
2 [hello, book, fish]
3 [dog, hello, fish, cat]
Name: TEXT, dtype: object
If you also need to get rid of the punctuation, use:
from nltk.tokenize import RegexpTokenizer
df.TEXT.apply(lambda x: RegexpTokenizer(r'\w+').tokenize(x))
0 [cat, dog, fish]
1 [turtle, cat, fish, fish]
2 [hello, book, fish]
3 [dog, hello, fish, cat]
Name: TEXT, dtype: object
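Note that the lambda above will still raise the asker's TypeError if the column contains non-string values such as NaN. One way to guard against that inside the lambda is sketched below; the empty-list fallback and the 'TEXT_tokens' column name are just illustrative choices, not part of the original answer.
from nltk.tokenize import word_tokenize

# Tokenize real strings, and fall back to an empty list for anything
# that is not a string (e.g. NaN), so apply never hits the TypeError.
df['TEXT_tokens'] = df.TEXT.apply(lambda x: word_tokenize(x) if isinstance(x, str) else [])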
There is probably a non-string-like object (such as NaN) in your actual df['TEXT'] that is not shown in the data you posted.
You can find the problematic values with:
mask = df['TEXT'].apply(lambda x: isinstance(x, (str, bytes)))
print(df.loc[~mask])
If you want to remove those rows, you can use
df = df.loc[mask]
Or, you could coerce the whole column to str dtype with
df['TEXT'] = df['TEXT'].astype(str)
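Keep in mind that astype(str) does not drop missing values: a NaN becomes the literal string 'nan', which word_tokenize will then happily tokenize. A small sketch illustrating this caveat:
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize

s = pd.Series(['cat, dog fish', np.nan])
s = s.astype(str)            # the NaN is converted to the string 'nan'
print(s.apply(word_tokenize))
# 0    [cat, ,, dog, fish]
# 1                  [nan]
# dtype: object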
Here is a full example of the masking approach when df['TEXT'] contains a NaN value:
import numpy as np
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'TEXT': ['cat, dog fish',
                            'turtle; cat; fish fish',
                            'hello book fish',
                            np.nan]})
# ID TEXT
# 0 1 cat, dog fish
# 1 2 turtle; cat; fish fish
# 2 3 hello book fish
# 3 4 NaN
# df['TEXT'].apply(word_tokenize)
# TypeError: expected string or buffer
mask = df['TEXT'].apply(lambda x: isinstance(x, (str, bytes)))
df = df.loc[mask]
# ID TEXT
# 0 1 cat, dog fish
# 1 2 turtle; cat; fish fish
# 2 3 hello book fish
Now applying word_tokenize works:
In [108]: df['TEXT'].apply(word_tokenize)
Out[108]:
0 [cat, ,, dog, fish]
1 [turtle, ;, cat, ;, fish, fish]
2 [hello, book, fish]
Name: TEXT, dtype: object
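Putting the pieces together for the original 'problem_definition' column, a minimal sketch (assuming the CSV file and column names from the question) could look like this:
import pandas as pd
from nltk.tokenize import word_tokenize

df = pd.read_csv('log_page_nlp_subset.csv')

# Keep only rows whose 'problem_definition' is an actual string/bytes value
# (NaN rows are the usual cause of the TypeError), then tokenize.
mask = df['problem_definition'].apply(lambda x: isinstance(x, (str, bytes)))
df = df.loc[mask]
df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)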