TypeError during extracting bigrams with Gensim (Python)
I want to extract and print bigrams using Gensim. To do that, I used this code in Google Colab:
import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.corpora import WikiCorpus, Dictionary
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from collections import Counter

data = api.load("text8")  # wikipedia corpus
bigram = Phrases(data, min_count=3, threshold=10)

cntr = Counter()
for key in bigram.vocab.keys():
    if len(key.split('_')) > 1:
        cntr[key] += bigram.vocab[key]

for key, counts in cntr.most_common(50):
    print(key, " - ", counts)
But I got an error:
Then I tried this:
cntr = Counter()
for key in bigram.vocab.keys():
    if len(key.split(b'_')) > 1:
        cntr[key] += bigram.vocab[key]

for key, counts in cntr.most_common(50):
    print(key, " - ", counts)
And then:
What is wrong?
The keys of bigram.vocab are bytes, not str:
bigram_token = list(bigram.vocab.keys())
type(bigram_token[0])
# output:
# bytes
Convert each key to a string and that will solve the problem. In your code, just do it at the split:
cntr = Counter()
for key in bigram.vocab.keys():
    # added .decode('utf-8'); the delimiter must then be a str, not bytes
    if len(key.decode('utf-8').split('_')) > 1:
        cntr[key] += bigram.vocab[key]
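To try the fix without downloading text8, here is a minimal sketch that runs on a small hand-made dictionary. The helper name count_bigrams and the sample_vocab contents (including the counts) are made up for illustration and stand in for bigram.vocab. The isinstance check is an assumption on my part: it lets the same loop work whether the keys come back as bytes, as in the question, or as str.

```python
from collections import Counter


def count_bigrams(vocab, delimiter="_"):
    """Collect the vocabulary entries that contain the delimiter (joined bigrams)."""
    counts = Counter()
    for key, freq in vocab.items():
        # Decode bytes keys so that the key and the delimiter are both str.
        text = key.decode("utf-8") if isinstance(key, bytes) else key
        if len(text.split(delimiter)) > 1:
            counts[text] += freq
    return counts


# Hypothetical stand-in for bigram.vocab: bytes keys with made-up counts.
sample_vocab = {b"new": 12, b"york": 9, b"new_york": 7, b"of_the": 30}

bigram_counts = count_bigrams(sample_vocab)
for key, freq in bigram_counts.most_common(50):
    print(key, " - ", freq)
```

With the real model you would call count_bigrams(bigram.vocab) instead. Storing the decoded string as the Counter key also means the printed keys show up as plain text rather than as b'...' literals.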