TypeError while extracting bigrams with Gensim (Python)

Question

I want to extract and print bigrams using Gensim. To do that, I ran the following code in Google Colab:

import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.corpora import WikiCorpus, Dictionary
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from collections import Counter

data = api.load("text8") # wikipedia corpus
bigram = Phrases(data, min_count=3, threshold=10)


cntr = Counter()
for key in bigram.vocab.keys():
  if len(key.split('_')) > 1:
    cntr[key] += bigram.vocab[key]

for key, counts in cntr.most_common(50):
  print(key, " - ", counts)

But I get an error:

Then I tried this:

cntr = Counter()
for key in bigram.vocab.keys():
  if len(key.split(b'_')) > 1:
    cntr[key] += bigram.vocab[key]

for key, counts in cntr.most_common(50):
  print(key, " - ", counts)

And then I get:

What is wrong?

Answer 1

bigram_token = list(bigram.vocab.keys())
type(bigram_token[0])

# output
bytes
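
The type mismatch can be reproduced in plain Python, without Gensim. The sketch below uses made-up keys standing in for entries of bigram.vocab, and a hypothetical helper try_split. It only illustrates that the separator passed to split must have the same type as the object it is called on.

```python
# Hypothetical keys, standing in for entries of bigram.vocab.
bytes_key = b"new_york"
str_key = "new_york"

def try_split(key, sep):
    """Return the split parts, or the name of the exception raised."""
    try:
        return key.split(sep)
    except TypeError as exc:
        return type(exc).__name__

bytes_with_str = try_split(bytes_key, "_")     # bytes key, str separator
str_with_bytes = try_split(str_key, b"_")      # str key, bytes separator
bytes_with_bytes = try_split(bytes_key, b"_")  # matching types
str_with_str = try_split(str_key, "_")         # matching types
print(bytes_with_str, str_with_bytes, bytes_with_bytes, str_with_str)
```

Mixing the two types raises a TypeError in either direction; only the matching pairs return a list of parts.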

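The keys are bytes, not strings. Decode each key to a string before splitting and the problem goes away. In your code, only the split needs to change: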

cntr = Counter()
for key in bigram.vocab.keys():
    # decode the bytes key to str, then split on a str separator
    if len(key.decode('utf-8').split('_')) > 1:
        cntr[key] += bigram.vocab[key]
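
The key type may differ between Gensim versions (an assumption here: bytes in the version used above, possibly str in others), so a more defensive variant normalizes each key before splitting. This is only a sketch: the helper name count_bigrams and the sample counts are hypothetical, and a small hand-made dictionary stands in for bigram.vocab.

```python
from collections import Counter

def count_bigrams(vocab, delimiter="_"):
    """Count entries of a Phrases-style vocab whose key contains the delimiter.

    Keys may be bytes or str; bytes keys are decoded as UTF-8 first.
    """
    counts = Counter()
    for key, freq in vocab.items():
        text = key.decode("utf-8") if isinstance(key, bytes) else key
        if len(text.split(delimiter)) > 1:
            counts[text] += freq
    return counts

# Hand-made stand-in for bigram.vocab (values are invented).
sample_vocab = {b"new": 12, b"york": 9, b"new_york": 7, "san_francisco": 4}
top = count_bigrams(sample_vocab).most_common(2)
for key, freq in top:
    print(key, " - ", freq)
```

Assuming bigram.vocab behaves like a dict, count_bigrams(bigram.vocab).most_common(50) could replace the loop in the question. The resulting keys are always str, so they print without the b'...' prefix.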
