Word tokenizing gives different results at home than on Colaboratory

Question

Locally:

$ python
Python 3.8.0 (default, Nov  6 2019, 15:27:39) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.download('stopwords')
>>> stop_words = set(nltk.corpus.stopwords.words('english'))
>>> text = """Former Kansas Territorial Governor James W. Denver visited his namesake city in 1875 and in 1882."""
>>> def preprocess(document):
...     sentence_list = list()
...     for sentence in nltk.sent_tokenize(document):
...         word_tokens = nltk.word_tokenize(sentence)
...         sentence_list.append([w for w in word_tokens if w not in stop_words and len(w) > 1])
...     sentences = [nltk.pos_tag(sent) for sent in sentence_list]
...     return sentences
>>> grammar = r'Chunk: {(<A.*>*|<N.*>*|<VB[DGNP]?>*)+}'
>>> chunk_parser = nltk.RegexpParser(grammar)
>>> tagged = preprocess(text)
>>> result = collections.Counter()
>>> for sentence in tagged:
...     my_tree =  chunk_parser.parse(sentence)
...     for subtree in my_tree.subtrees():
...         if subtree.label() == 'Chunk':
...             leaves = [x[0] for x in subtree.leaves()]
...             phrase = " ".join(leaves)
...             result[phrase] += 1

The output at home is:

>>> print(result.most_common(10))
[('Former Kansas Territorial Governor James W. Denver', 1), ('visited', 1), ('city', 1)]

The same code on Colaboratory gives:

>>> print(result.most_common(10))
[]

I run non-NLTK code in both places and get the same output there. Could the local NLTK library be different? A different version of NLTK?

Answer 1

Locally I was running Python 3.8.0. I changed it to 3.6.9, and now I get the same result as on Colaboratory.
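
Since the difference turned out to come from the interpreter version, it may help to print the versions in both environments before comparing outputs. The following is only a minimal sketch: `environment_report` is a hypothetical helper name (not part of NLTK), and it assumes the package exposes a `__version__` attribute, which NLTK does.

```python
import platform


def environment_report(packages=("nltk",)):
    """Return a dict with the Python version and the versions of the given packages.

    `environment_report` is a hypothetical helper for illustration only.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            module = __import__(name)
            # Fall back to "unknown" if the package has no __version__ attribute.
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Run this both locally and in a Colaboratory cell, then compare the lines.
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the two reports differ in the `python` line but not in the `nltk` line, the interpreter rather than the library is the more likely source of the discrepancy, as it was here.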

nltk

google-colaboratory