Text Processing - Word2Vec training after phrase detection (bigram model)
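I want to build a word2vec model that includes more n-grams than usual. As I found, the Phrases class in gensim.models.phrases can find the phrases I want, and it is possible to apply Phrases to a corpus and pass the result to the word2vec training function.
So first I did something like the following, just like the example code in the gensim documentation: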
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)

sentences = MySentences('sentences_directory')
bigram = gensim.models.Phrases(sentences)
model = gensim.models.Word2Vec(bigram['sentences'], size=300, window=5, workers=8)
The model is created, but it does not give good evaluation results, and I get this warning:
WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable)
I searched a bit, found https://groups.google.com/forum/#!topic/gensim/XWQ8fPMFSi0 and changed my code:
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)

class PhraseItertor(object):
    def __init__(self, my_phraser, data):
        self.my_phraser, self.data = my_phraser, data

    def __iter__(self):
        yield self.my_phraser[self.data]

sentences = MySentences('sentences_directory')
bigram_transformer = gensim.models.Phrases(sentences)
bigram = gensim.models.phrases.Phraser(bigram_transformer)
corpus = PhraseItertor(bigram, sentences)
model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8)
I get this error:
Traceback (most recent call last):
  File "/home/fatemeh/Desktop/Thesis/bigramModeler.py", line 36, in <module>
    model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8)
  File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 478, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 553, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 575, in scan_vocab
    vocab[word] += 1
TypeError: unhashable type: 'list'
Now I would like to know what is wrong with my code.
I asked my question on the Gensim Google Group, and Mr. Gordon Mohr answered:
You typically wouldn't want an __iter__() method to do a single yield. It should return an iterator object (ready to return multiple objects via next() or a StopIteration exception). One way to effect an iterator is to use yield to have the method treated as a 'generator' – but that would typically require the yield to be inside a loop.
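A minimal standard-library sketch of that point (an illustration, not part of the reply). The class names `SingleYield` and `LoopYield` and the helper `count_words` are hypothetical; `count_words` only imitates a vocabulary scan that uses each word as a dictionary key, and is not gensim's actual code. With a single `yield`, the whole corpus arrives as one "sentence", so each "word" is a token list, which cannot be hashed.

```python
class SingleYield:
    """Hypothetical wrapper: yields the whole corpus once, as a single item."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        yield self.data  # one item only: the entire corpus


class LoopYield:
    """Hypothetical wrapper: yields one tokenized sentence per iteration."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        for sentence in self.data:
            yield sentence


def count_words(corpus):
    """Imitation of a vocabulary scan: counts words in a dict keyed by word."""
    vocab = {}
    for sentence in corpus:
        for word in sentence:
            vocab[word] = vocab.get(word, 0) + 1
    return vocab


texts = [["new", "york", "city"], ["new", "york", "times"]]

# Looping yield: each item is a list of string tokens, so counting works.
loop_counts = count_words(LoopYield(texts))

# Single yield: each "word" is itself a list, which is unhashable.
try:
    count_words(SingleYield(texts))
    single_error = None
except TypeError as exc:
    single_error = str(exc)

print(loop_counts)
print(single_error)
```

This appears to be the same shape of failure as the `vocab[word] += 1` line in the traceback from the question.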
But I now see that my example code in the thread you reference does the wrong thing with its __iter__() return line: it should not be returning the raw phrasifier, but one that has already been started-as-an-iterator, by use of the iter() built-in method. That is, the example there should have read:
class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts

    def __iter__(self):
        return iter(self.phrasifier[self.texts])
Making a similar change in your variation may resolve the "TypeError: iter() returned non-iterator of type 'TransformedCorpus'" error.
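The corrected pattern can be tried without gensim using a small stand-in. In the sketch below, `ToyPhraser` and `_TransformedView` are hypothetical classes that only mimic the behaviour relied on here: indexing with a corpus returns an object that can be iterated but is not itself an iterator. The bigram list is made up for illustration. Because `__iter__` calls `iter()` on a fresh view each time, the wrapper can be iterated repeatedly, which is what the "restartable iteration" in the warning refers to.

```python
class _TransformedView:
    """Hypothetical lazy view: iterable, but not itself an iterator."""

    def __init__(self, phraser, texts):
        self.phraser, self.texts = phraser, texts

    def __iter__(self):
        for tokens in self.texts:
            yield self.phraser.join_bigrams(tokens)


class ToyPhraser:
    """Hypothetical stand-in for a trained phrase model."""

    def __init__(self, bigrams):
        self.bigrams = set(bigrams)

    def join_bigrams(self, tokens):
        # Merge each known adjacent pair into a single 'a_b' token.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in self.bigrams:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def __getitem__(self, texts):
        return _TransformedView(self, texts)


class PhrasingIterable:
    """Wrapper from the answer: starts a new iterator on every pass."""

    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts

    def __iter__(self):
        return iter(self.phrasifier[self.texts])


texts = [["i", "love", "new", "york"], ["new", "york", "is", "big"]]
corpus = PhrasingIterable(ToyPhraser([("new", "york")]), texts)

first_pass = list(corpus)   # e.g. the vocabulary scan
second_pass = list(corpus)  # e.g. a training epoch; must see the same data

print(first_pass)
print(second_pass == first_pass)
```

With the objects from the question, the analogous call would presumably be `PhrasingIterable(bigram, sentences)`, passed to `Word2Vec` in place of `corpus`.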