Parse a part-of-speech tagged tree corpus with Python without NLTK

I have a tree corpus like the following:

(TOP END_OF_TEXT_UNIT)

(TOP (S (NP (DT The)
            (NNP Fulton)
            (NNP County)
            (NNP Grand)
            (NNP Jury))
        (VP (VBD said)
            (NP (NNP Friday))
            (SBAR (-NONE- 0)
                  (S (NP (DT an)
                         (NN investigation)
                         (PP (IN of)
                             (NP (NP (NNP Atlanta))
                                 (POS 's)
                                 (JJ recent)
                                 (JJ primary)
                                 (NN election))))
                     (VP (VBD produced)
                         (NP (`` ``)
                             (DT no)
                             (NN evidence)
                             ('' '')
                             (SBAR (IN that)
                                   (S (NP (DT any)
                                          (NNS irregularities))
                                      (VP (VBD took)
                                          (NP (NN place)))))))))))
     (. .))

I need to parse this tree and convert it into sentence form, like this:

DT The NNP Fulton NNP County NNP Grand NNP Jury VBD said NNP Friday DT
an NN investigation ...

Is there an algorithm that can parse the above, or do I need to use regular expressions? I do not want to use the NLTK package for this.

Pyparsing makes quick work of parsing nested expressions.

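import pyparsing as pp

# Parentheses delimit nodes but are dropped from the results
LPAR, RPAR = map(pp.Suppress, "()")
# Forward declaration, because a node can contain further nodes
expr = pp.Forward()
# Node label: uppercase letters and '-', or one of the punctuation tags
label = pp.Word(pp.alphas.upper()+'-') | "''" | "``" | "."
# Leaf token: a punctuation token, or any run of printable characters other than parentheses
word = pp.Literal(".") | "''" | "``" | pp.Word(pp.printables, excludeChars="()")

# A node is a label followed by either a single word or one or more child nodes
expr <<= LPAR + label + (word | pp.OneOrMore(expr)) + RPAR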

sample = """
(TOP (S (NP (DT The)
            (NNP Fulton)
            (NNP County)
            (NNP Grand)
            (NNP Jury))
        (VP (VBD said)
            (NP (NNP Friday))
            (SBAR (-NONE- 0)
                  (S (NP (DT an)
                         (NN investigation)
                         (PP (IN of)
                             (NP (NP (NNP Atlanta))
                                 (POS 's)
                                 (JJ recent)
                                 (JJ primary)
                                 (NN election))))
                     (VP (VBD produced)
                         (NP (`` ``)
                             (DT no)
                             (NN evidence)
                             ('' '')
                             (SBAR (IN that)
                                   (S (NP (DT any)
                                          (NNS irregularities))
                                      (VP (VBD took)
                                          (NP (NN place)))))))))))
     (. .))
"""

result = pp.OneOrMore(expr).parseString(sample)
print(' '.join(result))

Prints:

TOP S NP DT The NNP Fulton NNP County NNP Grand NNP Jury VP VBD said NP NNP Friday SBAR -NONE- 0 S NP DT an NN investigation PP IN of NP NP NNP Atlanta POS 's JJ recent JJ primary NN election VP VBD produced NP `` `` DT no NN evidence '' '' SBAR IN that S NP DT any NNS irregularities VP VBD took NP NN place . .
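Note that this output also contains the phrase labels (TOP, S, NP, VP, ...), while the sentence form in the question lists only tag/word pairs and skips the `-NONE- 0` trace. If only those pairs are wanted, one possible approach is a regular expression that matches innermost parentheses. This is a minimal sketch and not part of the pyparsing solution; the names `LEAF`, `flatten_pairs`, and `skip_tags` are made up for illustration, and it assumes every leaf has the form `(TAG word)` with no parentheses inside the token.

```python
import re

# An innermost node: "(" TAG whitespace WORD ")", with no nested parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def flatten_pairs(text, skip_tags=("-NONE-",)):
    """Return 'TAG word TAG word ...' for every leaf node in the text.

    skip_tags is a hypothetical convenience for dropping empty elements
    such as (-NONE- 0); pass an empty tuple to keep them.
    """
    pairs = LEAF.findall(text)
    return " ".join(f"{tag} {word}" for tag, word in pairs if tag not in skip_tags)

sample_tree = "(TOP (S (NP (DT The) (NNP Jury)) (VP (VBD said) (SBAR (-NONE- 0))) (. .)))"
print(flatten_pairs(sample_tree))
```

A single-node line such as `(TOP END_OF_TEXT_UNIT)` also has the `(TAG word)` shape, so it would be reported as a pair unless `TOP` is added to `skip_tags`.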

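Typically, a parser like this would use pp.Group(expr) to preserve the grouping of nested elements. In your case, since you ultimately want a flat list, we simply leave it out: pyparsing's default behavior is to return a flat list of the matched strings.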
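For comparison, the nested structure that grouping would preserve can also be built with the standard library alone, using a token list and a stack. The following is only a sketch under the assumption that tokens never contain parentheses or whitespace; `parse_trees` and `tagged_words` are hypothetical helper names, not pyparsing functions.

```python
import re

def parse_trees(text):
    """Parse bracketed trees into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    stack = [[]]  # stack[0] collects the top-level trees
    for tok in tokens:
        if tok == "(":
            stack.append([])          # open a new node
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')'")
            node = stack.pop()        # close the node and attach it to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(tok)     # label or word
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]

def tagged_words(node):
    """Yield (tag, word) for every leaf node of the form [tag, word]."""
    if len(node) == 2 and isinstance(node[1], str):
        yield node[0], node[1]
    else:
        for child in node[1:]:
            if isinstance(child, list):
                yield from tagged_words(child)

trees = parse_trees("(S (NP (DT The) (NN jury)) (VP (VBD said)))")
print(trees)
print(list(tagged_words(trees[0])))
```

Keeping the nesting costs a little more code, but the same parsed tree can then serve both the flat tag/word listing and any later processing that needs the phrase structure.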