Stanford LexicalizedParser throws NPE when used in Spark
I am trying to use Stanford's LexicalizedParser inside a Spark RDD map function.
The algorithm is roughly as follows:
val parser = LexicalizedParser.loadModel("englishPCFG.ser.gz")
val parserBroadcast = sparkContext.broadcast(parser) // using the Kryo serializer here
someSparkRdd.map { case sentence: List[HasWord] =>
  parserBroadcast.value.parse(sentence) // NPE is thrown here; see the stack trace below
}
The reason I want to instantiate the parser once (outside the map) and then broadcast it is that the map iterates over close to a million sentences; creating a parser per record makes the Java garbage collector produce too much overhead and slows the whole processing down considerably.
When the map statement is executed, the following NullPointerException is thrown:
java.lang.NullPointerException
at edu.stanford.nlp.parser.lexparser.BaseLexicon.isKnown(BaseLexicon.java:152)
at edu.stanford.nlp.parser.lexparser.BaseLexicon.ruleIteratorByWord(BaseLexicon.java:208)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.initializeChart(ExhaustivePCFGParser.java:1343)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:457)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseInternal(LexicalizedParserQuery.java:258)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parse(LexicalizedParserQuery.java:536)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:301)
at my.class.NounPhraseExtractionWithStanford$$anonfun$extractNounPhrases.apply(NounPhraseExtractionWithStanford.scala:28)
at my.class.NounPhraseExtractionWithStanford$$anonfun$extractNounPhrases.apply(NounPhraseExtractionWithStanford.scala:27)
at scala.collection.TraversableLike$$anonfun$flatMap.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap.apply(TraversableLike.scala:251)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at my.class.NounPhraseExtractionWithStanford$.extractNounPhrases(NounPhraseExtractionWithStanford.scala:27)
at my.class.HBaseDocumentProducerWithStanford$$anonfun$produceDocumentTokens.apply(HBaseDocumentProducerWithStanford.scala:104)
at my.class.HBaseDocumentProducerWithStanford$$anonfun$produceDocumentTokens.apply(HBaseDocumentProducerWithStanford.scala:104)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$$anonfun$apply.apply(PairRDDFunctions.scala:674)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$$anonfun$apply.apply(PairRDDFunctions.scala:674)
at scala.collection.Iterator$$anon.next(Iterator.scala:328)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Looking at the source code, it appears that because edu.stanford.nlp.parser.lexparser.BaseLexicon has many transient class variables, the SerDe performed during broadcasting (with the Kryo serializer) leaves BaseLexicon half-initialized.
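One avenue that follows from this diagnosis, though I have not verified it against this Stanford version: the model ships as a Java-serialized .ser.gz file, so the parser classes should round-trip through plain Java serialization, and Kryo can be told to fall back to its JavaSerializer for them. A minimal sketch (the registrator class and the my.pkg package name are mine; nested classes such as BaseLexicon may need registering as well):

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import edu.stanford.nlp.parser.lexparser.LexicalizedParser
import org.apache.spark.serializer.KryoRegistrator

// Route the parser through Java serialization instead of Kryo's default
// field-by-field serializer, which (unlike Java serialization) does not run
// any custom readObject logic that would rebuild the transient state.
class StanfordRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.register(classOf[LexicalizedParser], new JavaSerializer())
}

// Enabled with: sparkConf.set("spark.kryo.registrator", "my.pkg.StanfordRegistrator")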
I realize that the developers of LexParser did not design it with Spark in mind, but I would still greatly appreciate any hints on how to use it in my scenario (i.e., in Spark).

A possible workaround, though I am not 100% sure it will work:
class ParseSentence extends (List[HasWord] => Tree) with Serializable {
  // LexicalizedParser.parse returns edu.stanford.nlp.trees.Tree
  def apply(sentence: List[HasWord]): Tree = ParseSentence.parser.parse(sentence)
}

object ParseSentence {
  // Loaded once per JVM (i.e., once per executor) when the object is first
  // accessed, so the parser itself never has to cross the wire.
  val parser = LexicalizedParser.loadModel("englishPCFG.ser.gz")
}
someSparkRdd.map(new ParseSentence)
This way the parser does not need to be serialized/deserialized, because it is not captured as a field of the function object.
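Another sketch in the same spirit, assuming it is acceptable to load the model once per partition rather than once per JVM: create the parser inside mapPartitions, so it lives entirely on the executor and never passes through any serializer.

someSparkRdd.mapPartitions { sentences =>
  // The parser is created here, on the executor, once per partition;
  // nothing Stanford-related crosses the driver/executor boundary.
  val parser = LexicalizedParser.loadModel("englishPCFG.ser.gz")
  sentences.map(sentence => parser.parse(sentence))
}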