Issues in getting trigrams using Gensim

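I want to get bigrams and trigrams from the example sentences below.

My code works fine for bigrams. However, it does not capture the trigrams in the data (e.g., "human computer interaction", which appears in 5 of my sentences).

Approach 1 - Below is my code using Phrases in Gensim.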

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Approach 2 - I even tried using both Phraser and Phrases, but it did not work.

from gensim.models import Phrases
from gensim.models.phrases import Phraser
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
trigram = Phrases(bigram_phraser[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram_phraser[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

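Please help me fix this issue of getting trigrams.

I am following the example documentation of Gensim.

I was able to get bigrams and trigrams with a few modifications to your code: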

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

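# Learn bigrams; threshold is left at its default value.
bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
# Learn trigrams on the bigram-transformed corpus; min_count has to be set again.
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

for sent in sentence_stream:
    # Keep only tokens made of exactly two words (one space).
    bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
    # Keep only tokens made of exactly three words (two spaces).
    trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]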

    print(bigrams_)
    print(trigrams_)

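I removed the threshold = 1 parameter from the bigram Phrases because otherwise it seems to form weird bigrams, which in turn allow weird trigrams to be built (note that bigram is used to build the trigram Phrases); this parameter will probably come in handy when you have more data. For the trigrams, the min_count parameter also needs to be specified, because it defaults to 5 if not provided.

To retrieve the bigrams and trigrams of each document, you can use this list comprehension trick to filter out the elements that are not made of exactly two or three words, respectively.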
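The filtering trick can be tried on its own, without Gensim. The following is only a minimal sketch: the helper name ngrams_of_size and the token list are made up for illustration (the list is hand-written, not real Phrases output), and it assumes phrases are joined with a single space as in the code above.

```python
def ngrams_of_size(tokens, n, delimiter=" "):
    """Keep only the tokens made of exactly n words joined by the delimiter."""
    return [tok for tok in tokens if tok.count(delimiter) == n - 1]

# Hypothetical phraser output for one sentence (hand-written for illustration).
tokens = ["human computer interaction", "is", "a", "pretty interesting", "subject"]

print(ngrams_of_size(tokens, 2))  # ['pretty interesting']
print(ngrams_of_size(tokens, 3))  # ['human computer interaction']
```

Counting delimiters is enough here because each original word contains no spaces; with a different delimiter, pass it explicitly.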


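Edit - a few details about the threshold parameter:

This parameter is used by the estimator to decide whether two words a and b form a phrase, which happens only if: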

(count(a followed by b) - min_count) * N/(count(a) * count(b)) > threshold

where N is the total vocabulary size. By default the parameter value is 10 (see docs). So the higher the threshold, the harder it is for words to form a phrase.
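To get a feeling for the formula above, here is a rough pure-Python sketch of it. It is not Gensim's actual implementation: the function names are made up, only three of the sentences are used, and N is assumed to be the number of distinct words, so the absolute scores may differ from what Gensim computes internally.

```python
from collections import Counter

def phrase_score(count_ab, count_a, count_b, vocab_size, min_count):
    """(count(a followed by b) - min_count) * N / (count(a) * count(b))"""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

def count_ngrams(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

sentences = [
    "human computer interaction is interesting".split(),
    "human computer interaction is a great and new subject".split(),
    "machine learning can be useful sometimes".split(),
]
unigrams, bigrams = count_ngrams(sentences)
vocab_size = len(unigrams)  # assumption: N = number of distinct words

def is_phrase(a, b, min_count=1, threshold=1.0):
    """True if the pair (a, b) scores above the threshold."""
    score = phrase_score(bigrams[(a, b)], unigrams[a], unigrams[b],
                         vocab_size, min_count)
    return score > threshold

print(is_phrase("human", "computer"))                  # relaxed threshold
print(is_phrase("interaction", "is"))                  # also passes when relaxed
print(is_phrase("human", "computer", threshold=10.0))  # stricter threshold
```

On such a tiny corpus a relaxed threshold lets pairs like "interaction is" through just as easily as "human computer", which is the effect described in this answer.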

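For example, in your first approach you were using threshold = 1, so you would get ['human computer','interaction is'] as the bigrams of 3 of your 5 sentences that begin with "human computer interaction"; that weird second bigram is a result of the more relaxed threshold.

Then, when you try to get trigrams with the default threshold = 10, you only get ['human computer interaction is'] for those 3 sentences, and nothing for the remaining two (filtered out by the threshold); and because that is a 4-gram rather than a trigram, it would also be filtered out by if t.count(' ') == 2. If you lower the trigram threshold to 1, for example, you can get ['human computer interaction'] as the trigram of the two remaining sentences. It does not seem easy to find a good combination of parameters.

I am not an expert, so take this conclusion with a grain of salt: I think it is better to first get good bigram results (not ones like 'interaction is') before going on, as weird bigrams can add confusion to the subsequent trigrams, 4-grams...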