Extracting noun phrases with NLTK's RegexpParser

Question

I want to extract noun phrases from text, and I am using Python with NLTK. I found a pattern on the internet that uses RegexpParser, as shown below:

    grammar = r"""
        NBAR:
            {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
        NP:
            {<NBAR>}
            {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    """
    cp = nltk.RegexpParser(grammar)

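I want to modify the grammar variable to add the case 'Noun of Noun' or 'Noun in Noun' (for example, "cup of coffee" or "water in cup"). My test string is 'postal code is new method of delivery', and I want to receive this list of phrases: ['postal code', 'new method', 'new method of delivery'].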

Answer 1

My answer is:

    import nltk

    def ExtractNP(text):
        """Return the multi-word noun phrases found in text."""
        nounphrases = []
        words = nltk.word_tokenize(text)
        tagged = nltk.pos_tag(words)
        # Rules within a stage are applied in the order they are declared.
        grammar = r"""
            NP:
                {<JJ*><NN+><IN><NN>}
                {<NN.*|JJ>*<NN.*>}
        """
        chunkParser = nltk.RegexpParser(grammar)
        tree = chunkParser.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
            # Each leaf is a (word, tag) pair; join the words into one phrase.
            myPhrase = ' '.join(word for word, tag in subtree.leaves())
            nounphrases.append(myPhrase)
        # Keep only phrases made of two or more words.
        nounphrases = [p for p in nounphrases if len(p.split()) > 1]
        return nounphrases

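Actually, this is nothing new, but I found that the grammar rules chunk the sentence in the order in which they are declared. That means that for the input sentence ('postal code is new approach of delivery'), the part that matches

    {<JJ*><NN+><IN><NN>}

('new approach of delivery') is cut out first, and then the remainder ('postal code is') is compared against

    {<NN.*|JJ>*<NN.*>}

for the next match, which returns 'postal code'. Therefore, we cannot get 'new approach' in the returned result, because those tokens have already been consumed by the longer chunk.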
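One possible way around the limitation described above is to go back to the two-stage grammar from the question, try the `<NBAR><IN><NBAR>` rule before the plain `<NBAR>` rule, and then collect both the inner NBAR subtrees and the outer NP subtrees. The following is only a minimal sketch under stated assumptions: the POS tags are assigned by hand rather than produced by `nltk.pos_tag` (which may tag the words differently), it uses the question's sentence rather than the 'new approach' variant, and the helper name `extract_phrases` is hypothetical.

```python
import nltk

# Hand-assigned POS tags (an assumption; nltk.pos_tag may tag differently).
tagged = [
    ("postal", "JJ"), ("code", "NN"), ("is", "VBZ"),
    ("new", "JJ"), ("method", "NN"), ("of", "IN"), ("delivery", "NN"),
]

# Two-stage grammar: NBAR chunks are built first, then grouped into NP.
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}    # nouns and adjectives, ending in a noun
    NP:
        {<NBAR><IN><NBAR>}    # NBAR + preposition + NBAR, tried first
        {<NBAR>}              # any remaining NBAR on its own
"""


def extract_phrases(tagged_tokens):
    """Collect multi-word phrases from both NBAR and NP subtrees (hypothetical helper)."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    phrases = []
    for subtree in tree.subtrees(lambda t: t.label() in ("NBAR", "NP")):
        phrase = " ".join(word for word, tag in subtree.leaves())
        # Skip single words and phrases already seen at another nesting level.
        if len(phrase.split()) > 1 and phrase not in phrases:
            phrases.append(phrase)
    return phrases


print(extract_phrases(tagged))
```

With these assumed tags, the result contains 'postal code', 'new method of delivery', and 'new method'. The inner phrase survives because the NBAR chunk stays nested inside the larger NP instead of being replaced by it.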
