WordNet Python word similarity

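I am trying to find a reliable way to measure the semantic similarity of two terms. A first metric could be the path distance on the hyponym/hypernym graph (eventually a linear combination of 2-3 metrics might work better).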

from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(dog.path_similarity(cat))

1. I still don't understand what n.01 means and why it is necessary.

As explained here and in the NLTK source, a synset name has the form "WORD.PART-OF-SPEECH.SENSE-NUMBER".

Quoting the source:

Create a Lemma from a "<word>.<pos>.<number>.<lemma>" string where:
<word> is the morphological stem identifying the synset
<pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
<number> is the sense number, counting from 0.
<lemma> is the morphological form of interest

n stands for noun. I also suggest reading about the WordNet dataset.
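As a small illustration of the naming scheme, the sketch below splits a synset name into its three parts. The `SynsetName` tuple and `parse_synset_name` helper are hypothetical names made up for this example; they are not part of NLTK.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    """Illustrative container for the three parts of a synset name."""
    word: str
    pos: str
    sense: int


def parse_synset_name(name: str) -> SynsetName:
    """Split a name such as 'dog.n.01' into word, part of speech and sense number."""
    # Split from the right so a word that itself contains dots stays intact.
    word, pos, sense = name.rsplit(".", 2)
    return SynsetName(word, pos, int(sense))


print(parse_synset_name("dog.n.01"))
print(parse_synset_name("chase.v.01"))
```

Read this way, 'dog.n.01' is the first noun sense of "dog", which is why the sense number is needed to say which meaning you want.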

2. Is there a way to visually show the computed path between two terms?

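Take a look at the similarity section of the NLTK WordNet docs. There are several path-based algorithms to choose from (you can try combining a few of them).

A few examples from the NLTK docs: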

from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print(dog.path_similarity(cat))
print(dog.lch_similarity(cat))
print(dog.wup_similarity(cat))
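To make the first of these scores concrete: path_similarity is based on the shortest path between two senses in the hypernym/hyponym taxonomy, computed as 1 / (1 + path length). The sketch below reproduces that idea on a tiny hand-written taxonomy and also returns the path itself so it can be displayed. The `HYPERNYMS` graph and both helper functions are assumptions made for illustration; this is not the real WordNet graph and not NLTK's implementation.

```python
from collections import deque

# Hand-written toy taxonomy (child -> parents); NOT the real WordNet graph.
HYPERNYMS = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search over hypernym links, treated as undirected edges."""
    neighbours = {node: set(parents) for node, parents in graph.items()}
    for node, parents in graph.items():
        for parent in parents:
            neighbours.setdefault(parent, set()).add(node)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two nodes are not connected


def toy_path_similarity(graph, a, b):
    """Return 1 / (1 + number of edges on the shortest path), or None if unconnected."""
    path = shortest_path(graph, a, b)
    # len(path) counts nodes, which equals the number of edges plus one.
    return None if path is None else 1.0 / len(path)


print(" -> ".join(shortest_path(HYPERNYMS, "dog", "cat")))
print(toy_path_similarity(HYPERNYMS, "dog", "cat"))
```

For real synsets, NLTK exposes `hypernym_paths()` and `lowest_common_hypernyms()`, which can be used to inspect the chain of hypernyms that connects two senses.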

For visualization, you can build a distance matrix M[i,j] where:

M[i,j] = word_similarity(i, j)

and use the following Stack Overflow answer to draw the visualization.
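A minimal sketch of building such a matrix is shown here. `similarity_matrix` accepts any pairwise scoring function; `toy_similarity` is a made-up stand-in (character overlap) used only so the example runs without the WordNet corpus, and it has nothing to do with semantics.

```python
def similarity_matrix(items, similarity):
    """Return M as a list of lists with M[i][j] = similarity(items[i], items[j])."""
    return [[similarity(a, b) for b in items] for a in items]


def toy_similarity(a, b):
    """Illustration only: Jaccard overlap of the characters in two words."""
    chars_a, chars_b = set(a), set(b)
    return len(chars_a & chars_b) / len(chars_a | chars_b)


words = ["dog", "cat", "cow"]
M = similarity_matrix(words, toy_similarity)
for word, row in zip(words, M):
    print(word, ["%.2f" % value for value in row])
```

With NLTK, `items` would be a list of synsets and the scoring function could be something like `lambda a, b: a.path_similarity(b)`. Note that these entries are similarities; if the plotting code expects distances, a transformation such as 1 - M[i][j] is one possible choice for scores in the range 0 to 1.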

3. Which other NLTK semantic metrics can I use?

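As mentioned above, there are several ways to compute word similarity. I also suggest looking into gensim. I have used its word2vec implementation for word similarity and it worked well for me.

If you need any help choosing an algorithm, please provide more information about the problem you are facing.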
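For context, word2vec-style models represent each word as a vector, and words are typically compared by the cosine of the angle between their vectors. The sketch below shows that computation in plain Python. The three-dimensional `VECTORS` are invented numbers for illustration, not output from any trained model, and the code does not use gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy "embeddings"; a real model would have hundreds of dimensions.
VECTORS = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print("dog vs cat:", round(cosine_similarity(VECTORS["dog"], VECTORS["cat"]), 3))
print("dog vs car:", round(cosine_similarity(VECTORS["dog"], VECTORS["car"]), 3))
```

Unlike the WordNet measures, this kind of score comes from how words are used in a corpus rather than from a hand-built taxonomy, so the two approaches can complement each other.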

Update:

More information about the meaning of the word sense number can be found here:

Senses in WordNet are generally ordered from most to least frequently used, with the most common sense numbered 1...

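The problem is that "dog" is ambiguous, and you have to pick the right sense for it.

You might pick the first sense as a naive approach, or devise your own algorithm for choosing the right sense, depending on your application or research.

To get all the available definitions of a word from WordNet (called synsets in the WordNet docs), simply call wn.synsets(word).

I encourage you to dig into the metadata these synsets contain for each definition.

The code below shows a simple example that retrieves this metadata and prints it nicely.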
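One possible alternative to always taking the first sense is to score every pair of senses and keep the best-scoring pair. This is a common heuristic, not something prescribed by WordNet or NLTK. In the sketch below, `best_sense_pair` is a hypothetical helper and `TOY_SCORES` contains invented numbers so the example runs without the corpus.

```python
def best_sense_pair(senses_a, senses_b, similarity):
    """Return (score, sense_a, sense_b) for the highest-scoring pair, or None."""
    best = None
    for a in senses_a:
        for b in senses_b:
            score = similarity(a, b)
            if score is None:  # some measures return None for incomparable senses
                continue
            if best is None or score > best[0]:
                best = (score, a, b)
    return best


# Invented scores keyed by sense-name pairs; illustration only.
TOY_SCORES = {
    ("dog.n.01", "cat.n.01"): 0.2,
    ("frank.n.02", "cat.n.01"): 0.05,
    ("pawl.n.01", "cat.n.01"): 0.06,
}


def toy_similarity(a, b):
    """Look up a made-up score; None when the pair has no entry."""
    return TOY_SCORES.get((a, b))


dog_senses = ["dog.n.01", "frank.n.02", "pawl.n.01", "chase.v.01"]
cat_senses = ["cat.n.01"]
print(best_sense_pair(dog_senses, cat_senses, toy_similarity))
```

With NLTK, the sense lists would come from `wn.synsets('dog')` and `wn.synsets('cat')`, and the scoring function could be `lambda a, b: a.path_similarity(b)`. Whether the best-scoring pair is the right reading still depends on the context in which the words appear.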

from nltk.corpus import wordnet as wn

dog_synsets = wn.synsets('dog')

for i, syn in enumerate(dog_synsets):
    print('%d. %s' % (i, syn.name()))
    print('alternative names (lemmas): "%s"' % '", "'.join(syn.lemma_names()))
    print('definition: "%s"' % syn.definition())
    if syn.examples():
        print('example usage: "%s"' % '", "'.join(syn.examples()))
    print('\n')

Code output:

0. dog.n.01
alternative names (lemmas): "dog", "domestic_dog", "Canis_familiaris"
definition: "a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds"
example usage: "the dog barked all night"


1. frump.n.01
alternative names (lemmas): "frump", "dog"
definition: "a dull unattractive unpleasant girl or woman"
example usage: "she got a reputation as a frump", "she's a real dog"


2. dog.n.03
alternative names (lemmas): "dog"
definition: "informal term for a man"
example usage: "you lucky dog"


3. cad.n.01
alternative names (lemmas): "cad", "bounder", "blackguard", "dog", "hound", "heel"
definition: "someone who is morally reprehensible"
example usage: "you dirty dog"


4. frank.n.02
alternative names (lemmas): "frank", "frankfurter", "hotdog", "hot_dog", "dog", "wiener", "wienerwurst", "weenie"
definition: "a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll"


5. pawl.n.01
alternative names (lemmas): "pawl", "detent", "click", "dog"
definition: "a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward"


6. andiron.n.01
alternative names (lemmas): "andiron", "firedog", "dog", "dog-iron"
definition: "metal supports for logs in a fireplace"
example usage: "the andirons were too hot to touch"


7. chase.v.01
alternative names (lemmas): "chase", "chase_after", "trail", "tail", "tag", "give_chase", "dog", "go_after", "track"
definition: "go after with the intent to catch"
example usage: "The policeman chased the mugger down the alley", "the dog chased the rabbit"