WordNet Python words similarity
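I'm trying to find a reliable way to measure the semantic similarity of 2 terms.
The first metric could be the path distance on a hyponym/hypernym graph (eventually a linear combination of 2-3 metrics could be better).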
from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(dog.path_similarity(cat))
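- I still don't understand what n.01 means and why it's needed.
- Is there a way to visually show the computed path between the 2 terms?
- Which other NLTK semantic metrics could I use?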
1. I still don't understand what n.01 means and why it's necessary.
From here and the NLTK source, the name has the form "WORD.PART-OF-SPEECH.SENSE-NUMBER".
Quoting the source:
Create a Lemma from a "<word>.<pos>.<number>.<lemma>" string where:
<word> is the morphological stem identifying the synset
<pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
<number> is the sense number, counting from 0.
<lemma> is the morphological form of interest
n means noun. I would also suggest reading about the WordNet dataset.
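To make the naming scheme concrete, here is a minimal sketch that splits a synset name into its three parts. The helper `parse_synset_name` and the `POS_LABELS` table are illustrative names of my own, not part of NLTK; the one-letter codes are the values of NLTK's NOUN, VERB, ADJ, ADJ_SAT and ADV constants.

```python
def parse_synset_name(name):
    """Split a name such as 'dog.n.01' into (word, pos, sense_number)."""
    # Split from the right so a word that itself contains dots stays whole.
    word, pos, number = name.rsplit(".", 2)
    return word, pos, int(number)


# One-letter part-of-speech codes used in synset names.
POS_LABELS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}

word, pos, sense = parse_synset_name("dog.n.01")
print(word, POS_LABELS[pos], sense)  # prints: dog noun 1
```

So 'dog.n.01' reads as: the word "dog", as a noun, first listed sense.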
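2. Is there a way to visually show the computed path between the 2 terms?
Take a look at the Similarity section of the NLTK WordNet docs. There are several path-based algorithms to choose from there (you can try mixing a few).
A few examples from the NLTK docs: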
from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(dog.path_similarity(cat))
print(dog.lch_similarity(cat))
print(dog.wup_similarity(cat))
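To see what kind of path such a score measures, here is a rough sketch on a tiny hand-written hypernym tree. The `HYPERNYM` table and all function names are assumptions made for illustration; the real WordNet graph is far larger and a synset can have several hypernyms. The score follows the definition NLTK uses for path_similarity, 1 / (shortest path length in edges + 1). On real synsets, methods such as hypernym_paths() and lowest_common_hypernyms() may be a better starting point.

```python
# Toy child -> parent links; hand-written, not loaded from WordNet.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}


def chain_to_root(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def shortest_path(a, b):
    """Nodes from a up to the lowest shared ancestor, then down to b."""
    up_a, up_b = chain_to_root(a), chain_to_root(b)
    for i, node in enumerate(up_a):
        if node in up_b:
            j = up_b.index(node)
            return up_a[:i + 1] + up_b[:j][::-1]
    return None  # the two nodes share no ancestor


def toy_path_similarity(a, b):
    """1 / (edges on the shortest path + 1), or None if there is no path."""
    path = shortest_path(a, b)
    if path is None:
        return None
    return 1.0 / len(path)  # number of nodes equals edges + 1


print(" -> ".join(shortest_path("dog", "cat")))
print(toy_path_similarity("dog", "cat"))
```

In this toy tree the path is dog -> canine -> carnivore -> feline -> cat, which is 4 edges, so the score is 1/5.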
For visualization, you can build a distance matrix M[i,j], where:
M[i,j] = word_similarity(i, j)
and use the following Stack Overflow answer to plot the visualization.
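A minimal sketch of building that matrix, assuming any two-argument similarity function. `similarity_matrix` is a hypothetical helper, and `char_jaccard` is only a placeholder metric so the example runs without WordNet; it says nothing about meaning.

```python
def similarity_matrix(items, word_similarity):
    """Return M as a list of rows, with M[i][j] = word_similarity(items[i], items[j])."""
    return [[word_similarity(a, b) for b in items] for a in items]


def char_jaccard(a, b):
    """Placeholder metric: overlap of the character sets of two strings."""
    chars_a, chars_b = set(a), set(b)
    return len(chars_a & chars_b) / len(chars_a | chars_b)


words = ["dog", "cat", "cot"]
M = similarity_matrix(words, char_jaccard)
for word, row in zip(words, M):
    print(word, ["%.2f" % value for value in row])
```

With NLTK, `items` would be a list of synsets and the function could be something like `lambda a, b: a.path_similarity(b)`. If the plotting code expects distances, 1 - similarity is one simple conversion for metrics bounded by 1.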
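3. Which other NLTK semantic metrics could I use?
As shown above, there are several ways to compute word similarity. I would also suggest looking into gensim. I used its word2vec implementation for word similarity and it worked well for me.
If you need any help choosing an algorithm, please provide more information about the problem you are facing.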
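Since the question mentions a linear combination of 2-3 metrics, here is one possible sketch. `combine_scores` is a hypothetical helper, and the numbers are placeholders rather than real WordNet scores. Note that lch_similarity is not confined to [0, 1], so it would need rescaling before being mixed with path_similarity or wup_similarity.

```python
def combine_scores(scores, weights):
    """Weighted average of the metric scores named in `weights`."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Placeholder values; in practice these would come from calls such as
# dog.path_similarity(cat) and dog.wup_similarity(cat).
scores = {"path": 0.25, "wup": 0.75}
weights = {"path": 1.0, "wup": 3.0}

print(combine_scores(scores, weights))  # (0.25*1 + 0.75*3) / 4
```

The weights are a tuning choice; they could be picked by hand or fitted against a set of word pairs with known similarity judgments.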
Update:
More information about the meaning of the word sense number can be found here:
Senses in WordNet are generally ordered from most to least frequently used, with the most common sense numbered 1...
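The problem is that "dog" is ambiguous, and you have to choose the right sense for it.
You might take the first sense as a naive approach, or come up with your own algorithm for choosing the right sense, based on your application or research.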
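The two strategies can be compared in a small sketch. Everything here is a toy assumption: each sense is represented by a hand-written, simplified hypernym chain rather than a real synset, and `chain_similarity` mimics a path-based score, 1 / (edges + 1). Taking the best-scoring pair of senses is just one common heuristic, not the only option.

```python
from itertools import product

# Toy senses: each sense is its hypernym chain from the root down.
# Hand-written and simplified; not the real WordNet hierarchy.
SENSES = {
    "dog": [
        ("entity", "animal", "carnivore", "canine", "dog"),
        ("entity", "food", "sausage", "frank"),
    ],
    "cat": [
        ("entity", "animal", "carnivore", "feline", "cat"),
    ],
}


def chain_similarity(a, b):
    """1 / (edges between the two chain ends + 1), or None if unrelated."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    if shared == 0:
        return None  # no common ancestor at all
    edges = (len(a) - shared) + (len(b) - shared)
    return 1.0 / (edges + 1)


def first_sense_similarity(w1, w2):
    """Naive approach: compare only the first listed sense of each word."""
    return chain_similarity(SENSES[w1][0], SENSES[w2][0])


def best_sense_pair(w1, w2):
    """Try every pair of senses and keep the highest-scoring one."""
    scored = [(chain_similarity(a, b), a, b)
              for a, b in product(SENSES[w1], SENSES[w2])]
    scored = [item for item in scored if item[0] is not None]
    return max(scored, key=lambda item: item[0]) if scored else None


print(first_sense_similarity("dog", "cat"))
print(best_sense_pair("dog", "cat"))
```

With real synsets the same loop would run over wn.synsets(word) for each word; the None check matters there too, because path_similarity can return None when no path connects two synsets.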
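To get all the available definitions of a word from WordNet (called synsets in the WordNet docs), you can simply call wn.synsets(word).
I encourage you to dig into the metadata these synsets contain for each definition.
The code below shows a simple example that fetches this metadata and prints it nicely.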
from nltk.corpus import wordnet as wn

dog_synsets = wn.synsets('dog')

# Print the name, lemmas, definition and examples of every sense of "dog".
for i, syn in enumerate(dog_synsets):
    print('%d. %s' % (i, syn.name()))
    print('alternative names (lemmas): "%s"' % '", "'.join(syn.lemma_names()))
    print('definition: "%s"' % syn.definition())
    if syn.examples():
        print('example usage: "%s"' % '", "'.join(syn.examples()))
    print('\n')
Code output:
0. dog.n.01
alternative names (lemmas): "dog", "domestic_dog", "Canis_familiaris"
definition: "a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds"
example usage: "the dog barked all night"
1. frump.n.01
alternative names (lemmas): "frump", "dog"
definition: "a dull unattractive unpleasant girl or woman"
example usage: "she got a reputation as a frump", "she's a real dog"
2. dog.n.03
alternative names (lemmas): "dog"
definition: "informal term for a man"
example usage: "you lucky dog"
3. cad.n.01
alternative names (lemmas): "cad", "bounder", "blackguard", "dog", "hound", "heel"
definition: "someone who is morally reprehensible"
example usage: "you dirty dog"
4. frank.n.02
alternative names (lemmas): "frank", "frankfurter", "hotdog", "hot_dog", "dog", "wiener", "wienerwurst", "weenie"
definition: "a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll"
5. pawl.n.01
alternative names (lemmas): "pawl", "detent", "click", "dog"
definition: "a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward"
6. andiron.n.01
alternative names (lemmas): "andiron", "firedog", "dog", "dog-iron"
definition: "metal supports for logs in a fireplace"
example usage: "the andirons were too hot to touch"
7. chase.v.01
alternative names (lemmas): "chase", "chase_after", "trail", "tail", "tag", "give_chase", "dog", "go_after", "track"
definition: "go after with the intent to catch"
example usage: "The policeman chased the mugger down the alley", "the dog chased the rabbit"