Stanford Parser: How to include punctuation?
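I have used the Stanford Parser to parse some of my tweets, which were already tokenized and POS-tagged (by the Stanford POS tagger with the Gate Twitter model). But the resulting conll 2007 formatted output does not contain any punctuation. Why is that?
The command I used: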
java -mx16g -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -tokenized -tagSeparator § -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -escaper edu.stanford.nlp.process.PTBEscapingProcessor -outputFormat conll2007 edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz ..test.tagged > ../test.conll
For example,
Original tweet:
bbc sp says they don't understand why the tories aren't 8% ahead in the polls given the current economics stats ; bbc bias ? surely not ?
POS-tagged tweet, used as input to the Stanford Parser:
bbc§NN sp§NN says§VBZ they§PRP don't§VBP understand§VB why§WRB the§DT tories§NNS aren't§VBZ 8%§CD ahead§RB in§IN the§DT polls§NNS given§VBN the§DT current§JJ economics§NNS stats§NNS ;§: bbc§NN bias§NN ?§. surely§RB not§RB ?§.
Resulting parse in conll 2007 format:
1 bbc _ NN NN _ 2 compound _ _
2 sp _ NN NN _ 3 nsubj _ _
3 says _ VBZ VBZ _ 0 root _ _
4 they _ PRP PRP _ 5 nsubj _ _
5 don't _ VBP VBP _ 3 ccomp _ _
6 understand _ VB VB _ 5 xcomp _ _
7 why _ WRB WRB _ 10 advmod _ _
8 the _ DT DT _ 9 det _ _
9 tories _ NNS NNS _ 10 nsubj _ _
10 aren't _ VBZ VBZ _ 6 ccomp _ _
11 8% _ CD CD _ 12 nmod:npmod _ _
12 ahead _ RB RB _ 15 advmod _ _
13 in _ IN IN _ 15 case _ _
14 the _ DT DT _ 15 det _ _
15 polls _ NNS NNS _ 10 nmod _ _
16 given _ VBN VBN _ 15 acl _ _
17 the _ DT DT _ 19 det _ _
18 current _ JJ JJ _ 19 amod _ _
19 economics _ NNS NNS _ 16 dobj _ _
20 stats _ NNS NNS _ 19 dep _ _
22 bbc _ NN NN _ 23 compound _ _
23 bias _ NN NN _ 20 dep _ _
25 surely _ RB RB _ 26 advmod _ _
26 not _ RB RB _ 16 neg _ _
As you can see, the punctuation tokens (IDs 21, 24 and 27 in the input) are missing from the parse. Why?
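A quick way to confirm which tokens were dropped is to look for gaps in the ID column of the output. The following is only a minimal sketch: the function name `missing_token_ids` is made up for illustration, and it assumes whitespace-separated columns with the token ID first and a blank line between sentences. It only sees gaps between two printed rows, so a dropped token at the very start or end of a sentence (such as the final `?` here) goes unnoticed.

```python
def missing_token_ids(conll_text):
    """Return token IDs that are skipped between consecutive rows of a sentence."""
    missing = []
    previous = None  # no row seen yet in the current sentence
    for line in conll_text.splitlines():
        fields = line.split()
        if not fields:
            previous = None  # a blank line starts a new sentence
            continue
        current = int(fields[0])
        if previous is not None:
            # every ID strictly between two printed rows was dropped
            missing.extend(range(previous + 1, current))
        previous = current
    return missing


# Excerpt of the parse shown above (rows 19 to 26).
sample = """\
19 economics _ NNS NNS _ 16 dobj _ _
20 stats _ NNS NNS _ 19 dep _ _
22 bbc _ NN NN _ 23 compound _ _
23 bias _ NN NN _ 20 dep _ _
25 surely _ RB RB _ 26 advmod _ _
26 not _ RB RB _ 16 neg _ _
"""

print(missing_token_ids(sample))  # [21, 24]
```

IDs 21 and 24 correspond to the `;` and the first `?` in the tagged input.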
I think adding "-parse.keepPunct" to your command may fix this. Please let me know if it doesn't work.
I finally found the answer: use
-outputFormatOptions includePunctuationDependencies
I contacted Stanford Parser and CoreNLP support a long time ago and got no response at all.
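To show where the option might go, here is a rough sketch that splices it into the argument list from the question. The helper `add_punctuation_option` is hypothetical, and placing the option directly after `-outputFormat conll2007` (before the model path) is an assumption based on the other options in the original command all preceding the model path. The paths are copied verbatim from the question, and the output redirection is left to the shell.

```python
import shlex


def add_punctuation_option(args):
    """Return a copy of args with the punctuation option placed after '-outputFormat <value>'."""
    idx = args.index("-outputFormat")
    extra = ["-outputFormatOptions", "includePunctuationDependencies"]
    # idx + 2 skips the flag itself and its value (conll2007)
    return args[: idx + 2] + extra + args[idx + 2 :]


# Arguments of the command from the question, without the shell redirection.
parser_args = [
    "java", "-mx16g", "-cp", "*",
    "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
    "-sentences", "newline", "-tokenized", "-tagSeparator", "§",
    "-tokenizerFactory", "edu.stanford.nlp.process.WhitespaceTokenizer",
    "-tokenizerMethod", "newCoreLabelTokenizerFactory",
    "-escaper", "edu.stanford.nlp.process.PTBEscapingProcessor",
    "-outputFormat", "conll2007",
    "edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz",
    "..test.tagged",
]

updated_args = add_punctuation_option(parser_args)
# Print a shell-quoted command line that can be pasted into a terminal.
print(shlex.join(updated_args))
```

After rerunning the parser with the updated command, the ID column of the output should no longer have gaps if the option took effect.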