Apache storm using Kafka Spout gives error: IllegalStateException

Apache storm using Kafka Spout gives error: IllegalStateException

Version Info: 
   "org.apache.storm" % "storm-core" % "1.2.1" 
   "org.apache.storm" % "storm-kafka-client" % "1.2.1" 

我有一个如下所示的风暴拓扑:

boltA -> boltB -> boltC -> boltD

boltA 只是对请求进行一些格式化并发出另一个元组。 boltB 进行一些处理并为每个被接受的元组发出大约 100 个元组。 boltCboltD 处理这些元组。所有螺栓实现 BaseBasicBolt.

我注意到的是,每当 boltD 将一些 tuple 标记为失败并通过抛出 FailedException 标记为重试时,在我的拓扑超时时间不到几分钟后,我得到以下错误:

2018-11-30T20:01:05.261+05:30 util [ERROR] Async loop died!
java.lang.IllegalStateException: Attempting to emit a message that has already been committed. This should never occur when using the at-least-once processing guarantee.
        at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple(KafkaSpout.java:471) ~[stormjar.jar:?]
        at org.apache.storm.kafka.spout.KafkaSpout.emitIfWaitingNotEmitted(KafkaSpout.java:440) ~[stormjar.jar:?]
        at org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:308) ~[stormjar.jar:?]
        at org.apache.storm.daemon.executor$fn__4975$fn__4990$fn__5021.invoke(executor.clj:654) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:484) [storm-core-1.2.1.jar:1.2.1]
        at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
2018-11-30T20:01:05.262+05:30 executor [ERROR]
java.lang.IllegalStateException: Attempting to emit a message that has already been committed. This should never occur when using the at-least-once processing guarantee.
        at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple(KafkaSpout.java:471) ~[stormjar.jar:?]
        at org.apache.storm.kafka.spout.KafkaSpout.emitIfWaitingNotEmitted(KafkaSpout.java:440) ~[stormjar.jar:?]
        at org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:308) ~[stormjar.jar:?]
        at org.apache.storm.daemon.executor$fn__4975$fn__4990$fn__5021.invoke(executor.clj:654) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:484) [storm-core-1.2.1.jar:1.2.1]
        at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]

boltB 发出 1 个元组中的 100 个并且 boltD 使这 100 个元组中的一个元组失败时,就会发生这种情况,我收到此错误。无法理解如何解决这个问题,理想情况下,当所有 100 个元组都是 acked 时,它应该 ack 一个原始元组,但在所有这 100 个元组都是 [=] 之前,原始元组可能是 acked 26=],导致此错误。

编辑:

我可以用两个螺栓用下面的拓扑重现这个,它在集群模式下大约 5 分钟后重现 运行:

螺栓A

case class Abc(index: Int, rand: Boolean)

class BoltA  extends BaseBasicBolt {

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val inp = input.getBinaryByField("value").getObj[someObj]
    val randomGenerator = new Random()

    var i = 0
    val rand = randomGenerator.nextBoolean()
    1 to 100 foreach {
      collector.emit(new Values(Abc(i, rand).getJsonBytes))
      i += 1
    }
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
    declarer.declare(new Fields("boltAout"))
  }

}

螺栓B

class BoltB  extends BaseBasicBolt {

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val abc = input.getBinaryByField("boltAout").getObj[Abc]
    println(s"Received ${abc.index}th tuple in BoltB")
    if(abc.index >= 97 && abc.rand){
      println(s"throwing FailedException for ${abc.index}th tuple for")
      throw new FailedException()
    }
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
  }
}

KafkaSpout:

private def getKafkaSpoutConfig(source: Config) = KafkaSpoutConfig.builder("connections.kafka.producerConnProps.metadata.broker.list", "queueName")
    .setProp(ConsumerConfig.GROUP_ID_CONFIG, "grp")
    .setProp(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    .setOffsetCommitPeriodMs(100)
    .setRetry(new KafkaSpoutRetryExponentialBackoff(
      KafkaSpoutRetryExponentialBackoff.TimeInterval.milliSeconds(100),
      KafkaSpoutRetryExponentialBackoff.TimeInterval.milliSeconds(100),
      10,
      KafkaSpoutRetryExponentialBackoff.TimeInterval.milliSeconds(3000)
    ))
    .setFirstPollOffsetStrategy(offsetStrategyMapping(ConnektConfig.getOrElse("connections.kafka.consumerConnProps.offset.strategy", "UNCOMMITTED_EARLIEST")))
    .setMaxUncommittedOffsets(ConnektConfig.getOrElse("connections.kafka.consumerConnProps.max.uncommited.offset", 10000))
    .build()

其他配置:

消息超时秒数:300

此修复由@Stig Rohde Døssing here. The exact cause of the issue has been described here提供,如下所示:

In the fix for STORM-2666 and followups, we added logic to handle cases where the spout received the ack for an offset after the following offsets were already acked. The issue was that the spout might commit all the acked offsets, but not adjust the consumer position forward, or clear out waitingToEmit properly. If the acked offset was sufficiently far behind the log end offset, the spout might end up polling for offsets it had already committed.

The fix is slightly wrong. When the consumer position drops behind the committed offset, we make sure to adjust the position forward, and clear out any waitingToEmit messages that are behind the committed offset. We don't clear out waitingToEmit unless we adjust the consumer position, which turns out to be a problem.

For example, say offset 1 has failed, offsets 2-10 have been acked and maxPollRecords is 10. Say there are 11 records (1-11) in Kafka. If the spout seeks back to offset 1 to replay it, it will get offsets 1-10 back from the consumer in the poll. The consumer position is now 11. The spout emits offset 1. Say it gets acked immediately. On the next poll, the spout will commit offset 1-10 and check if it should adjust the consumer position and waitingToEmit. Since the position (11) is ahead of the committed offset (10), it doesn't clear out waitingToEmit. Since waitingToEmit still contains offsets 2-10 from the previous poll, the spout will end up emitting these tuples again.

可以看到修复 here