Why does the Stanford POS tagger modify the input sentence?

I took this sentence from the Wall Street Journal and ran it through the Stanford POS tagger. Strangely, the tagger changed "theatre" to "theater".

Command:

java -classpath stanford-postagger-2015-12-09/stanford-postagger-3.6.0.jar:stanford-postagger-2015-12-09/lib/slf4j-simple.jar:stanford-postagger-2015-12-09/lib/slf4j-api.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props stanford-postagger-2015-12-09/penn-treebank.props -model /home/minhle/redep/output/dep/penntree.jackknife/jackknife-04.model -testFile format=TREES,test.tree

Properties file:

## adopted english-bidirectional-distsim.tagger.props
## tagger training invoked at Tue Feb 25 01:33:39 PST 2014 with arguments:
                    arch = bidirectional5words,naacl2003unknowns,allwordshapes(-1,1),distsim(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1),distsimconjunction(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1)
            wordFunction = edu.stanford.nlp.process.AmericanizeFunction
         closedClassTags =
 closedClassTagThreshold = 40
 curWordMinFeatureThresh = 2
                   debug = false
             debugPrefix =
            tagSeparator = _
                encoding = UTF-8
              iterations = 100
                    lang = english
    learnClosedClassTags = false
        minFeatureThresh = 2
           openClassTags =
rareWordMinFeatureThresh = 5
          rareWordThresh = 5
                  search = owlqn2
                    sgml = false
            sigmaSquared = 0.5
                   regL1 = 0.75
               tagInside =
                tokenize = true
        tokenizerFactory =
        tokenizerOptions =
                 verbose = false
          verboseResults = true
    veryCommonWordThresh = 250
                xmlInput =
              outputFile =
            outputFormat = slashTags
     outputFormatOptions =
                nthreads = 4

Input sentence:

( (SINV (`` ``) (S-TPC-2 (PP (IN Without) (NP (DT some) (JJ unexpected) (`` ``) (FW coup) (FW de) (FW theatre) ('' ''))) (, ,) (NP-SBJ (PRP I)) (VP (VBP do) (RB n't) (VP (VB see) (SBAR (WHNP-1 (WP what)) (S (NP-SBJ-1 (-NONE- T)) (VP (MD will) (VP (VB block) (NP (DT the) (NNP Paribas) (NN bid))))))))) (, ,) ('' '') (VP (VBD said) (S-2 (-NONE- T))) (NP-SBJ (NP (NNP Philippe) (NNP de) (NNP Cholet)) (, ,) (NP (NP (NN analyst)) (PP-LOC (IN at) (NP (NP (DT the) (NN brokerage)) (NP (NNP Cholet) (HYPH -) (NNP Dupont) (CC &) (NNP Cie)))))) (. .)) )

Output:

``_`` Without_IN some_DT unexpected_JJ ``_`` coup_NN de_IN theater_NN ''_'' ,_, I_PRP do_VBP n't_RB see_VB what_WP will_MD block_VB the_DT Paribas_NNP bid_NN ,_, ''_'' said_VBD Philippe_NNP de_IN Cholet_NNP ,_, analyst_NN at_IN the_DT brokerage_NN Cholet_NNP -_HYPH Dupont_NNP &_CC Cie_NNP ._.

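As far as I understand, the Stanford POS tagger is trained on American English data. At runtime, we "Americanize" the input so that the tagger recognizes it correctly, which is why "theatre" comes out as "theater". See this line in your configuration file: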

wordFunction = edu.stanford.nlp.process.AmericanizeFunction
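
To make the effect of a word function concrete, here is a toy sketch in plain Java. It is not Stanford's AmericanizeFunction: the class name AmericanizeSketch, the Token type, and the two-entry spelling table are all hypothetical and exist only for illustration. It shows a word being normalized before tagging while the text as it appeared in the input is kept alongside it.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy illustration only: this is NOT Stanford's AmericanizeFunction.
 * The class name, the Token type, and the spelling table are hypothetical.
 */
final class AmericanizeSketch {

    /** A token that keeps the input text next to the normalized word. */
    static final class Token {
        final String originalText; // the word as it appeared in the input
        final String word;         // the word a tagger would be given

        Token(String originalText, String word) {
            this.originalText = originalText;
            this.word = word;
        }
    }

    // Hypothetical British-to-American table; a real word function covers far more.
    private static final Map<String, String> SPELLINGS = new HashMap<>();
    static {
        SPELLINGS.put("theatre", "theater");
        SPELLINGS.put("colour", "color");
    }

    /** Returns the American spelling if the table has one, otherwise the word unchanged. */
    static String americanize(String word) {
        return SPELLINGS.getOrDefault(word, word);
    }

    /** Normalizes one input word while keeping the original text for later output. */
    static Token toToken(String input) {
        return new Token(input, americanize(input));
    }

    public static void main(String[] args) {
        // Prints: coup -> coup, de -> de, theatre -> theater (one pair per line)
        for (String w : new String[] {"coup", "de", "theatre"}) {
            Token t = toToken(w);
            System.out.println(t.originalText + " -> " + t.word);
        }
    }
}
```

Keeping both fields on the token is what lets the output report the spelling from the input even though the tagger saw the normalized one.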

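If you access CoreNLP programmatically, you can retrieve the pre-Americanization form via CoreLabel.originalText. You can also simply disable AmericanizeFunction, but you may see some incorrect output as a result.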
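
In a properties file like the one in the question, disabling the function might look like the fragment below. This is a sketch under an assumption: that a blank wordFunction value means no word function is applied, in the same way the other blank keys in that file fall back to their defaults. A model that was trained on Americanized input may then treat spellings such as "theatre" as unknown words, so retraining with the same setting is probably the safer route.

```properties
# Assumption: an empty value means no word normalization is applied.
wordFunction =
```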