Does word2vec realization from gensim go beyond sentence level when examining context?

I found this question, which provides evidence that sentence order can matter (though the effect could also be the result of different random initializations).

I want to process Reddit comment dumps for my project, but the strings extracted from the JSON will be unordered and belong to very different subreddits and topics, so I don't want to mess up the context:

{
  "gilded": 0,
  "author_flair_text": "Male",
  "author_flair_css_class": "male",
  "retrieved_on": 1425124228,
  "ups": 3,
  "subreddit_id": "t5_2s30g",
  "edited": false,
  "controversiality": 0,
  "parent_id": "t1_cnapn0k",
  "subreddit": "AskMen",
  "body": "I can't agree with passing the blame, but I'm glad to hear it's at least helping you with the anxiety. I went the other direction and started taking responsibility for everything. I had to realize that people make mistakes including myself and it's gonna be alright. I don't have to be shackled to my mistakes and I don't have to be afraid of making them. ",
  "created_utc": "1420070668",
  "downs": 0,
  "score": 3,
  "author": "TheDukeofEtown",
  "archived": false,
  "distinguished": null,
  "id": "cnasd6x",
  "score_hidden": false,
  "name": "t1_cnasd6x",
  "link_id": "t3_2qyhmp"
}
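For context, a minimal sketch of the extraction step being described: each line of a Reddit dump file is one JSON object, and only the `body` field matters for training. The field names match the sample comment above; the whitespace tokenization and the `[deleted]` filter are assumptions, not part of the original post.

```python
import json

def comments_to_sentences(lines):
    """Yield one token list per comment body (naive whitespace tokenization)."""
    for line in lines:
        comment = json.loads(line)
        body = comment.get("body", "")
        if body and body != "[deleted]":  # skip removed comments (assumption)
            yield body.lower().split()

# Tiny inline sample standing in for a real dump file:
sample = ['{"subreddit": "AskMen", "body": "I went the other direction"}']
print(list(comments_to_sentences(sample)))
# -> [['i', 'went', 'the', 'other', 'direction']]
```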

So do neighboring sentences matter to gensim word2vec? Should I restore the whole comment-tree structure, or can I simply extract a "bag of sentences" and train the model on that?

The corpus gensim Word2Vec expects is an iterable of lists-of-tokens. (For example, a list of token lists works, but for larger corpora you'll usually want to supply a restartable iterable that streams the text examples from persistent storage, to avoid holding the whole corpus in memory.)
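A minimal sketch of such a restartable iterable, assuming one whitespace-tokenizable sentence per line in a plain-text file (the class name is illustrative):

```python
class LineSentenceCorpus:
    """Streams token lists from a text file, one sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on every __iter__ call is what makes this
        # restartable: Word2Vec can scan it once to build the vocabulary
        # and then again for each training epoch.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()
```

gensim also ships a ready-made version of this pattern, `gensim.models.word2vec.LineSentence`, which you can use instead of rolling your own.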

Word-vector training only considers contexts within a single text example, i.e. within one list of tokens. So if two consecutive examples are...

['I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham']
['Everybody', 'needs', 'a', 'thneed']

...there is no influence between 'ham' and 'Everybody' across those examples. (Context exists only within each example.)
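A hedged illustration of that boundary (this is not gensim's internal code, just the same skip-gram pair logic written out): context pairs are generated per example, so 'ham' and 'Everybody' never pair up even though the two examples are adjacent in the corpus.

```python
def context_pairs(sentence, window=2):
    """Enumerate (target, context) pairs within one example only."""
    pairs = []
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

corpus = [
    ['I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham'],
    ['Everybody', 'needs', 'a', 'thneed'],
]
all_pairs = [p for sent in corpus for p in context_pairs(sent)]
assert ('green', 'eggs') in all_pairs        # within-example context
assert ('ham', 'Everybody') not in all_pairs  # never crosses examples
```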

Still, the ordering of examples can have a subtle effect on quality if it clumps all words of a certain kind, or all examples of a topic, together. For example, you wouldn't want all examples of word X at the beginning of the corpus and all examples of word Y much later; that prevents the kind of interleaved example diversity that achieves the best results.

So if your corpus arrives in any sorted order, clumped together by topic, author, size, or language, it's usually beneficial to perform one initial shuffle to remove that clumping. (Re-shuffling again, for example between training epochs, usually offers negligible extra benefit.)
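A minimal sketch of that one-time shuffle, assuming the corpus fits in memory as a list of token lists (for disk-sized corpora you'd shuffle the source file once instead):

```python
import random

# Corpus clumped by topic: all Seuss-like lines together, etc.
sentences = [
    ['I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham'],
    ['Everybody', 'needs', 'a', 'thneed'],
    ['the', 'cat', 'in', 'the', 'hat'],
]

random.seed(42)           # only for reproducibility of this sketch
random.shuffle(sentences)  # in-place, one initial shuffle before training

# The shuffled list can then be passed straight to Word2Vec, e.g.:
# model = gensim.models.Word2Vec(sentences, min_count=1)
```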