CoreNLP significantly slowing down Spark job

Question

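I am trying to build a classifier by splitting documents into sentences and then lemmatizing each word in those sentences as input for logistic regression. However, I have found that Stanford's annotation class is causing a serious bottleneck in my Spark job (it takes 20 minutes to process only 500k documents).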

Here is the code I am currently using for sentence parsing and classification.

Sentence parsing:

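import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.util.CoreMap

// pipeHolder is defined elsewhere and holds the StanfordCoreNLP pipeline
def prepSentences(text: String): List[CoreMap] = {
    val mod = text.replace("Sr.", "Sr") // deals with an edge case
    val doc = new Annotation(mod)
    pipeHolder.get.annotate(doc)
    // convert the java.util.List returned by CoreNLP into a Scala List
    doc.get(classOf[SentencesAnnotation]).asScala.toList
}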

I then take each CoreMap and extract the lemmas as follows:

def coreMapToLemmas(map: CoreMap): Seq[String] = {
  map.get(classOf[TokensAnnotation]).par.foldLeft(Seq[String]())(
    (a, b) => {
      val lemma = b.get(classOf[LemmaAnnotation])
      if (!(stopWords.contains(b.lemma().toLowerCase) || puncWords.contains(b.originalText())))
        a :+ lemma.toLowerCase
      else a
    }
  )
}

Is there perhaps a class that performs only part of the processing?

Answer 1

Try using CoreNLP's Shift Reduce parser implementation.

A basic example (typed without a compiler):

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val props = new Properties()
props.put("annotators", "tokenize ssplit pos parse lemma sentiment")
// use Shift-Reduce Parser with beam search
// http://nlp.stanford.edu/software/srparser.shtml
props.put("parse.model", "edu/stanford/nlp/models/srparser/englishSR.beam.ser.gz")
val corenlp = new StanfordCoreNLP(props)

val text = "text to annotate"
val annotation = new Annotation(text)
corenlp.annotate(annotation) // annotate the Annotation object, not the raw string

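I work on a production system that uses CoreNLP in a Spark processing pipeline. Using the Shift Reduce parser with beam search made parsing in my pipeline 16 times faster and reduced the amount of working memory needed for parsing. The Shift Reduce parser has linear runtime complexity, which is better than the standard lexicalized PCFG parser.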
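In a Spark pipeline it usually also matters that the annotator is built once per partition rather than once per document, since constructing a CoreNLP pipeline loads its models. The question's pipeHolder appears to serve this purpose. The sketch below illustrates only the reuse pattern in plain Scala: StubPipeline and PipelineStats are hypothetical stand-ins, not CoreNLP classes, and processPartition mirrors the kind of function one would pass to Spark's rdd.mapPartitions.

```scala
// Hypothetical counter, used only to show how often the stub is constructed.
object PipelineStats {
  var constructions: Int = 0
}

// Hypothetical stand-in for an expensive-to-build annotator.
// This is NOT the CoreNLP API.
class StubPipeline {
  PipelineStats.constructions += 1

  // Placeholder "annotation": lowercase the text and split on whitespace.
  def annotate(text: String): List[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toList
}

object PartitionReuse {
  // Shaped like a function passed to rdd.mapPartitions:
  // build the pipeline once, then reuse it for every document in the partition.
  def processPartition(docs: Iterator[String]): Iterator[List[String]] = {
    val pipeline = new StubPipeline() // one construction per partition
    docs.map(pipeline.annotate)
  }
}
```

With a real job, the stub would be replaced by the actual pipeline and the function applied per partition. The point of the design is that the construction cost is paid once per partition instead of once per record.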

To use the Shift Reduce parser, you need the Shift Reduce models jar on your classpath (you can download it separately from CoreNLP's website).
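Separately from the parser choice, the lemma-collection step in the question may be worth a look. In Scala, foldLeft on a parallel collection is evaluated sequentially, so .par gives no speedup there, and :+ on a List-backed Seq copies the list on every append. This is likely a small cost next to annotation itself, but a single-pass version is easy to write. The sketch below is an assumption-laden illustration: Token is a hypothetical stand-in for a CoreNLP token, and the stopWords and puncWords values are invented because the question does not show them.

```scala
// Hypothetical stand-in for a CoreNLP token; only the two fields the question uses.
final case class Token(lemma: String, originalText: String)

object LemmaFilter {
  // Illustrative values only; the question's actual sets are not shown.
  val stopWords: Set[String] = Set("the", "be", "a")
  val puncWords: Set[String] = Set(".", ",", "!")

  // Single pass over the tokens: drop punctuation, lowercase each lemma once,
  // drop stop words, and build the result without repeated appends.
  def lemmas(tokens: Seq[Token]): Vector[String] =
    tokens.iterator
      .filterNot(t => puncWords.contains(t.originalText))
      .map(_.lemma.toLowerCase)
      .filterNot(stopWords.contains)
      .toVector
}
```

For example, LemmaFilter.lemmas applied to tokens for "The dogs were running." would keep only the lemmas that are neither stop words nor punctuation under the illustrative sets above.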

scala

machine-learning

stanford-nlp

apache-spark