word_tokenize TypeError: expected string or buffer

Question

Calling word_tokenize raises the following error:

File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1322,
    in _slices_from_text for match in
    self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

I have a large text file (1500.txt) from which I want to remove stop words. My code is as follows:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r'E:\Book\1500.txt', "r", encoding='ISO-8859-1') as File_1500:
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(File_1500)
    filtered_sentence = [w for w in words if not w in stop_words]
    print(filtered_sentence)

Answer 1

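word_tokenize expects a single string, typically one sentence such as 'this is sentence 1.'. To tokenize a whole document you need a list of sentence strings, e.g. ['this is sentence 1.', "that's sentence 2!"], and then call word_tokenize on each one.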

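File_1500 is a file object, not a string, which is why the call fails: as the traceback shows, punkt passes its argument to a regex finditer, and that raises the TypeError.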
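
The failure can be reproduced without NLTK. The following is a minimal sketch, not NLTK's actual code: the pattern is a hypothetical stand-in for the regex punkt uses, and io.StringIO stands in for the opened file. It only illustrates that finditer rejects a file object and accepts the string returned by read().

```python
import io
import re

# Hypothetical stand-in for the regex that punkt applies to its input.
pattern = re.compile(r"\w+")

# io.StringIO behaves like an open text file, so no real file is needed.
fake_file = io.StringIO("This is sentence 1. That's sentence 2!")

try:
    # Passing the file object itself, as in the question's code.
    list(pattern.finditer(fake_file))
    raised = False
except TypeError:
    raised = True

# Reading the file first yields a plain str, which finditer accepts.
text = fake_file.read()
matches = [m.group() for m in pattern.finditer(text)]

print(raised)
print(matches[:3])
```

The exact wording of the TypeError message differs between Python versions, so the sketch checks only the exception type.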

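To get a list of sentence strings, first read the file into a single string with fin.read(), then split it into sentences with sent_tokenize (I am assuming your input file is raw text that has not already been sentence-tokenized).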

Also, it is better and more idiomatic to tokenize a file with NLTK this way:

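from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Build the stop word set once, outside the loop.
stop_words = set(stopwords.words("english"))

# Raw string so the backslashes in the Windows path are not treated as escapes.
with open(r'E:\Book\1500.txt', "r", encoding='ISO-8859-1') as fin:
    # fin.read() returns the whole file as one string.
    for sent in sent_tokenize(fin.read()):
        words = word_tokenize(sent)
        filtered_sentence = [w for w in words if w not in stop_words]
        print(filtered_sentence)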
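
One caveat with the filter above: NLTK's English stop word list is lowercase, so a capitalised token such as 'The' passes the membership test unchanged. A possible refinement is to lowercase each token before the lookup. The sketch below is illustrative only: remove_stop_words is a hypothetical helper, the three-word set stands in for stopwords.words("english"), and str.split stands in for word_tokenize so that it runs without NLTK data.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [w for w in tokens if w.lower() not in stop_words]

# Tiny hand-written set, standing in for stopwords.words("english").
stop_words = {"the", "is", "a"}

# str.split stands in for word_tokenize in this sketch.
tokens = "The cat is on a mat".split()

filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

With the real NLTK list, the same helper could be called on the output of word_tokenize(sent) inside the loop.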
