Python with NLTK shows error at sent_tokenize and word_tokenize
I am working in Google Colab on a script that I am learning from a video. Unfortunately, even though I followed the video's instructions, I am getting an error.
sentences_tokens=nltk.sent_tokenize(allParagraphContent_cleanedData)
words_tokens=nltk.word_tokenize(allParagraphContent_cleanedData)
are the two lines causing the problem. I have tried each of them separately in Python 3 (which is what I mainly use).
These are the imported libraries:
from urllib import request
from bs4 import BeautifulSoup as bs
import re
import nltk
import heapq
The error I get is:
---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
<ipython-input-13-2467ae276de5> in <module>()
     26 allParagraphContent_cleanedData=re.sub(r'\s+',' ',allParagraphContent_cleanedData)
27
---> 28 sentences_tokens=nltk.sent_tokenize(allParagraphContent_cleanedData)
29 words_tokens=nltk.word_tokenize(allParagraphContent_cleanedData)
30
Honestly, I do not understand the error.
What am I missing?
--
Here is the full code:
from urllib import request
from bs4 import BeautifulSoup as bs
import re
import nltk
import heapq
url="https://en.wikipedia.org/wiki/Machine_learning"
allParagraphContent = ""
htmlDoc=request.urlopen(url)
soupObject=bs(htmlDoc,'html.parser')
paragraphContents=soupObject.find_all('p')  # gather every <p> element of the article
for paragraphContent in paragraphContents:
    allParagraphContent += paragraphContent.text
allParagraphContent_cleanerData=re.sub(r'\[[0-9]*\]','',allParagraphContent)  # strip citation markers such as [1]
allParagraphContent_cleanedData=re.sub(r'\s+',' ',allParagraphContent_cleanerData)  # collapse runs of whitespace into single spaces
allParagraphContent_cleanedData=re.sub(r'[^a-zA-Z]',' ',allParagraphContent_cleanedData)  # replace every non-letter with a space
allParagraphContent_cleanedData=re.sub(r'\s+',' ',allParagraphContent_cleanedData)
sentences_tokens=nltk.sent_tokenize(allParagraphContent_cleanedData)
words_tokens=nltk.word_tokenize(allParagraphContent_cleanedData)
Solution:
Add nltk.download("popular") right after import nltk.
This error usually appears when an NLTK data resource is missing. It can be fixed by calling the download() method with the name of the missing resource. Alternatively, you can pass 'all' and download everything:
nltk.download('all')
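Downloading 'all' fetches every NLTK dataset, which is large. As a lighter alternative, here is a minimal sketch of the same fix that fetches only the Punkt tokenizer models, which sent_tokenize and word_tokenize rely on (very recent NLTK releases may additionally ask for 'punkt_tab'):
import nltk

nltk.download('punkt')  # the models behind sent_tokenize and word_tokenize

text = "Machine learning is a field of study. It lets computers learn from data."
print(nltk.sent_tokenize(text))  # two sentences
print(nltk.word_tokenize(text))  # individual word and punctuation tokens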