Word tokenizing gives different results at home than on Colaboratory
Local:
$ python
Python 3.8.0 (default, Nov 6 2019, 15:27:39)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.download('stopwords')
>>> stop_words = set(nltk.corpus.stopwords.words('english'))
>>> text = """Former Kansas Territorial Governor James W. Denver visited his namesake city in 1875 and in 1882."""
>>> def preprocess(document):
...     sentence_list = list()
...     for sentence in nltk.sent_tokenize(document):
...         word_tokens = nltk.word_tokenize(sentence)
...         sentence_list.append([w for w in word_tokens if w not in stop_words and len(w) > 1])
...     sentences = [nltk.pos_tag(sent) for sent in sentence_list]
...     return sentences
...
>>> grammar = r'Chunk: {(<A.*>*|<N.*>*|<VB[DGNP]?>*)+}'
>>> chunk_parser = nltk.RegexpParser(grammar)
>>> tagged = preprocess(text)
>>> result = collections.Counter()
>>> for sentence in tagged:
...     my_tree = chunk_parser.parse(sentence)
...     for subtree in my_tree.subtrees():
...         if subtree.label() == 'Chunk':
...             leaves = [x[0] for x in subtree.leaves()]
...             phrase = " ".join(leaves)
...             result[phrase] += 1
...
The output at home is:
>>> print(result.most_common(10))
[('Former Kansas Territorial Governor James W. Denver', 1), ('visited', 1), ('city', 1)]
The same code on Colaboratory gives:
>>> print(result.most_common(10))
[]
I have run non-NLTK code in both places and got the same output. Could the local NLTK library be different? A different version of NLTK?
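One way to narrow down where two environments diverge is to record the result of every step rather than only the final counter. The helper below is a rough sketch, not part of the original session: `trace_stages` and `demo_stages` are hypothetical names, and the stages shown are plain-Python stand-ins so the snippet runs without NLTK. In the session above the stages would presumably be `nltk.word_tokenize`, the stop-word filter, and `nltk.pos_tag`.

```python
def trace_stages(value, stages):
    """Apply each (name, function) stage in order and keep every intermediate result."""
    trace = []
    for name, function in stages:
        value = function(value)
        trace.append((name, value))
    return trace


# Stand-in stages (assumptions for illustration only): whitespace splitting
# instead of nltk.word_tokenize, and only the length part of the filter.
demo_stages = [
    ("tokens", str.split),
    ("filtered", lambda words: [w for w in words if len(w) > 1]),
]

for name, value in trace_stages("Denver visited a city", demo_stages):
    print(name, value)
```

Running the same trace in both places and comparing the printed lines would show the first step whose output differs.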
I was running Python 3.8.0 locally. I changed it to 3.6.9 and now get the same results as on Colaboratory.
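Since the interpreter version turned out to matter here, it can help to print the versions in each environment before comparing outputs. The following is a minimal sketch; `environment_report` is a hypothetical helper name, and it relies only on the standard `platform` and `importlib` modules plus the `__version__` attribute that NLTK exposes.

```python
import importlib
import platform


def environment_report(package_names=("nltk",)):
    """Collect the Python version and the versions of the named packages."""
    report = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
    }
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is not installed in this environment.
            report[name] = None
        else:
            # Fall back to "unknown" if the package has no __version__ attribute.
            report[name] = getattr(module, "__version__", "unknown")
    return report


print(environment_report())
```

Pasting the same snippet into a local shell and a Colaboratory cell gives two dictionaries that can be compared side by side.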