Stanford coreNLP sentiment without splitting sentences

Question

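I have files that I want to feed to coreNLP's sentiment tagger. I have already broken the files up into individual sentences, so I want one label returned per file. How do I write the java command so that it returns a single label?

The command looks like this: java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -stdin, and it produces the following output: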

Annotation pipeline timing information:
TokenizerAnnotator: 0.0 sec.
WordsToSentencesAnnotator: 0.0 sec.
TOTAL: 0.0 sec. for 8 tokens at 296.3 tokens/sec.
Pipeline setup: 0.0 sec.
Total time for StanfordCoreNLP pipeline: 8.7 sec.

C:\stanford-corenlp-full-2015-04-20>java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -stdin
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
Adding annotator sentiment
Reading in text from stdin.
Please enter one sentence per line.
Processing will end when EOF is reached.

Computer is fun. Not too fun.
  Positive
  Neutral

How can I get a single label for the line, like the output below, which I produced by removing the punctuation myself:

Computer is fun Not too fun.
  Positive
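
The manual punctuation removal shown above could be automated with a small filter. The following is only a rough sketch of that stopgap, not a CoreNLP feature: the class name SentenceJoiner and the regular expression are assumptions made here for illustration. It strips ".", "!" and "?" only when more text follows on the same line, so the final punctuation mark is kept. Because it alters the text that reaches the sentiment model, the predicted label may differ from what the untouched sentence would get.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper (not part of CoreNLP): removes sentence-ending
// punctuation inside a line so the splitter sees one sentence per line.
class SentenceJoiner {
    // Matches ".", "!" or "?" only when whitespace and further text follow,
    // i.e. punctuation in the middle of a line rather than at its end.
    private static final Pattern INNER_END = Pattern.compile("[.!?]+(?=\\s+\\S)");

    static String joinSentences(String line) {
        return INNER_END.matcher(line).replaceAll("");
    }

    public static void main(String[] args) throws IOException {
        // Read lines from stdin and print the cleaned version of each line.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(joinSentences(line));
        }
    }
}
```

Given the line "Computer is fun. Not too fun.", joinSentences returns "Computer is fun Not too fun.", which matches the hand-edited input above. Note that the pattern also removes the period from abbreviations such as "Mr." when they occur mid-line.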

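It seems I should be able to do this easily, since there is -ssplit.isOneSentence and, as far as I know, the sentiment tagger uses ssplit, but I can't figure out how to modify my command to incorporate it (I have read the command line documentation).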

Answer 1

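There appears to be a bug in SentimentPipeline: when you use the -stdin option, it should not split sentences within a line. I have fixed this now, but unless you compile your own version, that won't help you until we release the next version of CoreNLP.

However, there is an alternative (and probably better) way to get sentiment labels for sentences using the CoreNLP pipeline.

The following command runs the same code as your command, but it also lets you specify more options for the individual annotators, including the -ssplit.eolonly option, which makes the sentence splitter break only at line ends so that each input line is treated as a single sentence.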

java -cp "*" -mx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators "tokenize,ssplit,parse,sentiment" -ssplit.eolonly
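
For repeated runs, the same two options can be kept in a properties file instead of on the command line. The sketch below is an illustration rather than part of the original answer: the class name SentimentProps and the file name sentiment.properties are made up here, and it assumes CoreNLP's usual convention that a bare flag such as -ssplit.eolonly corresponds to the property value "true".

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper: mirrors the command-line options as Java properties.
class SentimentProps {
    static Properties build() {
        Properties props = new Properties();
        // Same annotator chain as in the command above.
        props.setProperty("annotators", "tokenize,ssplit,parse,sentiment");
        // Assumed equivalent of the bare -ssplit.eolonly flag:
        // split sentences only at line ends.
        props.setProperty("ssplit.eolonly", "true");
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Output file name is an assumption; pass another path as the first argument.
        Path out = Paths.get(args.length > 0 ? args[0] : "sentiment.properties");
        try (Writer w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            build().store(w, "One sentence per line for the sentiment annotator");
        }
    }
}
```

The resulting file can then be passed with the -props option, for example java -cp "*" -mx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -props sentiment.properties. Alternatively, the Properties object can be handed directly to the StanfordCoreNLP constructor when calling the pipeline from Java.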

java

stanford-nlp