Why does the Stanford POS tagger modify the input sentence?
I took this sentence from the Wall Street Journal and ran it through the Stanford POS tagger. Strangely, the tagger changed "theatre" to "theater".
Command:
java -classpath stanford-postagger-2015-12-09/stanford-postagger-3.6.0.jar:stanford-postagger-2015-12-09/lib/slf4j-simple.jar:stanford-postagger-2015-12-09/lib/slf4j-api.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props stanford-postagger-2015-12-09/penn-treebank.props -model /home/minhle/redep/output/dep/penntree.jackknife/jackknife-04.model -testFile format=TREES,test.tree
Properties file:
## adopted english-bidirectional-distsim.tagger.props
## tagger training invoked at Tue Feb 25 01:33:39 PST 2014 with arguments:
arch = bidirectional5words,naacl2003unknowns,allwordshapes(-1,1),distsim(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1),distsimconjunction(stanford-postagger-2015-12-09/egw4-reut.512.clusters.txt,-1,1)
wordFunction = edu.stanford.nlp.process.AmericanizeFunction
closedClassTags =
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
debug = false
debugPrefix =
tagSeparator = _
encoding = UTF-8
iterations = 100
lang = english
learnClosedClassTags = false
minFeatureThresh = 2
openClassTags =
rareWordMinFeatureThresh = 5
rareWordThresh = 5
search = owlqn2
sgml = false
sigmaSquared = 0.5
regL1 = 0.75
tagInside =
tokenize = true
tokenizerFactory =
tokenizerOptions =
verbose = false
verboseResults = true
veryCommonWordThresh = 250
xmlInput =
outputFile =
outputFormat = slashTags
outputFormatOptions =
nthreads = 4
Input sentence:
( (SINV (`` ``) (S-TPC-2 (PP (IN Without) (NP (DT some) (JJ
unexpected) (`` ``) (FW coup) (FW de) (FW theatre) ('' ''))) (, ,)
(NP-SBJ (PRP I)) (VP (VBP do) (RB n't) (VP (VB see) (SBAR (WHNP-1 (WP
what)) (S (NP-SBJ-1 (-NONE- T)) (VP (MD will) (VP (VB block) (NP (DT
the) (NNP Paribas) (NN bid))))))))) (, ,) ('' '') (VP (VBD said) (S-2
(-NONE- T))) (NP-SBJ (NP (NNP Philippe) (NNP de) (NNP Cholet)) (, ,)
(NP (NP (NN analyst)) (PP-LOC (IN at) (NP (NP (DT the) (NN brokerage))
(NP (NNP Cholet) (HYPH -) (NNP Dupont) (CC &) (NNP Cie)))))) (. .)) )
Output:
``_`` Without_IN some_DT unexpected_JJ ``_`` coup_NN de_IN
theater_NN ''_'' ,_, I_PRP do_VBP n't_RB see_VB what_WP will_MD block_VB the_DT Paribas_NNP bid_NN ,_, ''_'' said_VBD Philippe_NNP
de_IN Cholet_NNP ,_, analyst_NN at_IN the_DT brokerage_NN Cholet_NNP
-_HYPH Dupont_NNP &_CC Cie_NNP ._.
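As far as I understand, the Stanford POS tagger was trained on American English data. At runtime, we "Americanize" the input so that the tagger recognizes it properly. See this line in your properties file: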
wordFunction = edu.stanford.nlp.process.AmericanizeFunction
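If you are accessing CoreNLP programmatically, you can retrieve the pre-Americanization form via CoreLabel.originalText(). You could also simply disable AmericanizeFunction, but you may see some incorrect output as a result.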
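To make the mechanism concrete, here is a minimal sketch of how a word function can rewrite the form the tagger sees while the original surface form is kept alongside it. It deliberately does not use the CoreNLP classes: `WordFunctionSketch`, `Token`, `TOY_SPELLINGS`, and `TOY_AMERICANIZE` are hypothetical names for illustration only, and the two-entry spelling table is a toy assumption, not the mapping the real `AmericanizeFunction` applies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class WordFunctionSketch {

    /** Hypothetical token type: keeps the surface form next to the normalized form. */
    static final class Token {
        final String originalText; // what appeared in the input
        final String word;         // what a tagger would see after the word function

        Token(String originalText, String word) {
            this.originalText = originalText;
            this.word = word;
        }
    }

    // Toy British-to-American table (assumption for illustration only).
    static final Map<String, String> TOY_SPELLINGS = new HashMap<>();
    static {
        TOY_SPELLINGS.put("theatre", "theater");
        TOY_SPELLINGS.put("colour", "color");
    }

    // Hypothetical stand-in for a word function: maps known words, passes others through.
    static final Function<String, String> TOY_AMERICANIZE =
            w -> TOY_SPELLINGS.getOrDefault(w, w);

    /** Applies the word function to every word, remembering the original text. */
    static List<Token> normalize(List<String> words, Function<String, String> wordFunction) {
        List<Token> out = new ArrayList<>();
        for (String w : words) {
            out.add(new Token(w, wordFunction.apply(w)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("coup");
        words.add("de");
        words.add("theatre");
        // Prints each original form followed by the form passed on for tagging.
        for (Token t : normalize(words, TOY_AMERICANIZE)) {
            System.out.println(t.originalText + " -> " + t.word);
        }
    }
}
```

Passing `Function.identity()` instead of `TOY_AMERICANIZE` corresponds, in this toy model, to running without a word function: "theatre" would reach the tagger unchanged, which is where the risk of worse tags on non-American spellings comes from.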