Sentence parsing is running extremely slowly

I'm trying to build a sentence analyzer that reads a document and predicts the right points at which to split it into sentences, without breaking on insignificant periods like the ones in "Dr." or ".NET". I've been trying to use CoreNLP for this.

After realizing that the PCFG parser was too slow (and essentially the bottleneck of my whole job), I tried switching to shift-reduce parsing, which according to the CoreNLP website is faster.

However, the SRParser runs very slowly and I have no idea why (the PCFG pipeline handles 1,000 sentences per second, while the SRParser manages 100).

Here is the code for both. One thing worth noting is that each "document" is only about 10-20 sentences, so they are very small:

The PCFG parser:

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import scala.collection.JavaConversions._

class StanfordPCFGParser {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  var i = 0
  val time = java.lang.System.currentTimeMillis()

  def parseSentence(doc: String): List[String] = {
    val tokens = new Annotation(doc)
    pipeline.annotate(tokens)
    val sentences = tokens.get(classOf[SentencesAnnotation]).toList
    sentences.foreach { s =>
      if (i % 1000 == 0) println("parsed " + i + " in " + (java.lang.System.currentTimeMillis() - time) / 1000 + " seconds")
      i += 1
    }
    sentences.map(_.toString)
  }
}

The shift-reduce parser:

// Same imports as the PCFG parser above.
class StanfordShiftReduceParser {
  val p = new Properties()
  p.put("annotators", "tokenize, ssplit, pos, parse, lemma")
  p.put("parse.model", "englishSR.ser.gz")
  val corenlp = new StanfordCoreNLP(p)
  var i = 0
  val time = java.lang.System.currentTimeMillis()

  def parseSentences(text: String): List[String] = {
    val annotation = new Annotation(text)
    corenlp.annotate(annotation)
    val sentences = annotation.get(classOf[SentencesAnnotation]).toList
    sentences.foreach { s =>
      if (i % 1000 == 0) println("parsed " + i + " in " + (java.lang.System.currentTimeMillis() - time) / 1000 + " seconds")
      i += 1
    }
    sentences.map(_.toString)
  }
}

Here's the code I use for timing:

val originalParser = new StanfordPCFGParser
println("starting PCFG")
var time = getTime
sentences.foreach(originalParser.parseSentence)
time = getTime - time
println("PCFG parser took " + time.toDouble / 1000 + " seconds for 1000 documents to " + originalParser.i + " sentences")

val srParser = new StanfordShiftReduceParser
println("starting SRParse")
time = getTime
sentences.foreach(srParser.parseSentences)
time = getTime - time
println("SR parser took " + time.toDouble / 1000 + " seconds for 1000 documents to " + srParser.i + " sentences")

This gives me the following output (I've stripped out the "Untokenizable" warnings caused by a messy data source):

Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... starting PCFG
done [0.6 sec].
Adding annotator lemma
parsed 0 in 0 seconds
parsed 1000 in 1 seconds
parsed 2000 in 2 seconds
parsed 3000 in 3 seconds
parsed 4000 in 5 seconds
parsed 5000 in 5 seconds
parsed 6000 in 6 seconds
parsed 7000 in 7 seconds
parsed 8000 in 8 seconds
parsed 9000 in 9 seconds
PCFG parser took 10.158 seconds for 1000 documents to 9558 sentences
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Adding annotator parse
Loading parser from serialized file englishSR.ser.gz ... done [8.3 sec].
starting SRParse
Adding annotator lemma
parsed 0 in 0 seconds
parsed 1000 in 17 seconds
parsed 2000 in 30 seconds
parsed 3000 in 43 seconds
parsed 4000 in 56 seconds
parsed 5000 in 66 seconds
parsed 6000 in 77 seconds
parsed 7000 in 90 seconds
parsed 8000 in 101 seconds
parsed 9000 in 113 seconds
SR parser took 120.506 seconds for 1000 documents to 9558 sentences

Any help would be greatly appreciated!

If all you need to do is split a chunk of text into sentences, the only annotators you need are tokenize and ssplit. The parser is completely redundant. So:

props.put("annotators", "tokenize, ssplit")