Killing node with __consumer_offsets leads to no message consumption at consumers

I have a 3-node (node0, node1, node2) Kafka cluster (broker0, broker1, broker2) with replication factor 2, and ZooKeeper (using the zookeeper packaged with the Kafka tar) running on a different node (node 4).

After starting zookeeper, I started broker 0 and then the remaining nodes. In the broker 0 logs I can see that it is reading __consumer_offsets, and those partitions appear to be stored on broker 0. Below are sample logs:

Kafka version: kafka_2.10-0.10.2.0

    [2017-06-30 10:50:47,381] INFO [GroupCoordinator 0]: Loading group metadata for console-consumer-85124 with generation 2 (kafka.coordinator.GroupCoordinator)
    [2017-06-30 10:50:47,382] INFO [Group Metadata Manager on Broker 0]: Finished loading offsets from __consumer_offsets-41 in 23 milliseconds. (kafka.coordinator.GroupMetadataManager)
    [2017-06-30 10:50:47,382] INFO [Group Metadata Manager on Broker 0]: Loading offsets and group metadata from __consumer_offsets-44 (kafka.coordinator.GroupMetadataManager)
    [2017-06-30 10:50:47,387] INFO [Group Metadata Manager on Broker 0]: Finished loading offsets from __consumer_offsets-44 in 5 milliseconds. (kafka.coordinator.GroupMetadataManager)
    [2017-06-30 10:50:47,387] INFO [Group Metadata Manager on Broker 0]: Loading offsets and group metadata from __consumer_offsets-47 (kafka.coordinator.GroupMetadataManager)
    [2017-06-30 10:50:47,398] INFO [Group Metadata Manager on Broker 0]: Finished loading offsets from __consumer_offsets-47 in 11 milliseconds. (kafka.coordinator.GroupMetadataManager)
    [2017-06-30 10:50:47,398] INFO [Group Metadata Manager on Broker 0]: Loading offsets and group metadata from __consumer_offsets-1 (kafka.coordinator.GroupMetadataManager)

In addition, I can see GroupCoordinator messages in the same broker 0 logs.

    [2017-06-30 14:35:22,874] INFO [GroupCoordinator 0]: Preparing to restabilize group console-consumer-34472 with old generation 1 (kafka.coordinator.GroupCoordinator)
    [2017-06-30 14:35:22,877] INFO [GroupCoordinator 0]: Group console-consumer-34472 with generation 2 is now empty (kafka.coordinator.GroupCoordinator)
    [2017-06-30 14:35:25,946] INFO [GroupCoordinator 0]: Preparing to restabilize group console-consumer-6612 with old generation 1 (kafka.coordinator.GroupCoordinator)
    [2017-06-30 14:35:25,946] INFO [GroupCoordinator 0]: Group console-consumer-6612 with generation 2 is now empty (kafka.coordinator.GroupCoordinator)
    [2017-06-30 14:35:38,326] INFO [GroupCoordinator 0]: Preparing to restabilize group console-consumer-30165 with old generation 1 (kafka.coordinator.GroupCoordinator)
    [2017-06-30 14:35:38,326] INFO [GroupCoordinator 0]: Group console-consumer-30165 with generation 2 is now empty (kafka.coordinator.GroupCoordinator)
    [2017-06-30 14:43:15,656] INFO [Group Metadata Manager on Broker 0]: Removed 0 expired offsets in 3 milliseconds. (kafka.coordinator.GroupMetadataManager)
    [2017-06-30 14:53:15,653] INFO [Group Metadata Manager on Broker 0]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)

While testing fault tolerance for the cluster with kafka-console-consumer.sh and kafka-console-producer.sh (commands sketched below), I saw that when broker 1 or broker 2 is killed, the consumers still receive new messages from the producer. Rebalancing happens correctly.
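For reference, a sketch of the console commands used for this kind of test. The host:port values are placeholders, and the consumer is shown with the new-consumer --bootstrap-server option, which matches the ConsumerCoordinator warnings quoted later:

    # Hypothetical broker addresses; produce test messages to the topic.
    bin/kafka-console-producer.sh --broker-list node0:9092,node1:9092,node2:9092 \
      --topic test-topic

    # Consume with the new consumer, reading from the beginning of the topic.
    bin/kafka-console-consumer.sh --bootstrap-server node0:9092,node1:9092,node2:9092 \
      --topic test-topic --from-beginning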

However, killing broker 0 results in no consumer, no matter how many are running, consuming either new or old messages. Below is the state of the topic before and after broker 0 was killed.

Before

    Topic:test-topic    PartitionCount:3    ReplicationFactor:2 Configs:
        Topic: test-topic   Partition: 0    Leader: 2   Replicas: 2,0   Isr: 0,2
        Topic: test-topic   Partition: 1    Leader: 0   Replicas: 0,1   Isr: 0,1
        Topic: test-topic   Partition: 2    Leader: 1   Replicas: 1,2   Isr: 1,2

After

    Topic:test-topic    PartitionCount:3    ReplicationFactor:2 Configs:
        Topic: test-topic   Partition: 0    Leader: 2   Replicas: 2,0   Isr: 2
        Topic: test-topic   Partition: 1    Leader: 1   Replicas: 0,1   Isr: 1
        Topic: test-topic   Partition: 2    Leader: 1   Replicas: 1,2   Isr: 1,2

Below are the WARN messages seen in the consumer logs after broker 0 was killed:

    [2017-06-30 14:19:17,155] WARN Auto-commit of offsets {test-topic-2=OffsetAndMetadata{offset=4, metadata=''}, test-topic-0=OffsetAndMetadata{offset=5, metadata=''}, test-topic-1=OffsetAndMetadata{offset=4, metadata=''}} failed for group console-consumer-34472: Offset commit failed with a retriable exception. You should retry committing offsets. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
    [2017-06-30 14:19:10,542] WARN Auto-commit of offsets {test-topic-2=OffsetAndMetadata{offset=4, metadata=''}, test-topic-0=OffsetAndMetadata{offset=5, metadata=''}, test-topic-1=OffsetAndMetadata{offset=4, metadata=''}} failed for group console-consumer-30165: Offset commit failed with a retriable exception. You should retry committing offsets. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)

Broker properties. The remaining properties are unchanged defaults.

    broker.id=0
    delete.topic.enable=true

    auto.create.topics.enable=false
    listeners=PLAINTEXT://XXX:9092
    advertised.listeners=PLAINTEXT://XXX:9092
    log.dirs=/tmp/kafka-logs-test1
    num.partitions=3
    zookeeper.connect=XXX:2181

Producer properties. The remaining properties are unchanged defaults.

    bootstrap.servers=XXX,XXX,XXX
    compression.type=snappy

Consumer properties. The remaining properties are unchanged defaults.

    zookeeper.connect=XXX:2181
    zookeeper.connection.timeout.ms=6000
    group.id=test-consumer-group

From what I understand, if the node holding/acting as the GroupCoordinator and hosting __consumer_offsets dies, then consumers cannot resume normal operation even though new leaders have been elected for the partitions.
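The coordinator for a consumer group is the leader of the __consumer_offsets partition the group hashes to, so a quick way to check this is to look at where those partitions live (XXX:2181 is the placeholder ZooKeeper address from the configuration above):

    # If every __consumer_offsets partition lists only broker 0 under
    # Replicas/Isr, then all group coordinators go down together with broker 0.
    bin/kafka-topics.sh --zookeeper XXX:2181 --describe --topic __consumer_offsets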

I saw something similar in another post. That post suggests restarting the dead broker node. However, in a production environment, even with more nodes available, message consumption would be delayed until broker 0 is restored.

Q1: How can the above situation be mitigated?

Q2: Is there a way to move the GroupCoordinator / __consumer_offsets to another node?

Any suggestions/help is much appreciated.

Check the replication factor of the __consumer_offsets topic. If it is not 3, that is your problem.

Run the following command: kafka-topics --zookeeper localhost:2181 --describe --topic __consumer_offsets and check whether the first line of the output says "ReplicationFactor:1" or "ReplicationFactor:3".

A replication factor of 1 is a common problem when you bring up a single node first and this topic gets created at that point. When you later scale out to 3 nodes, you forget to change the topic-level setting on this existing topic, so even though the topics you produce to and consume from are fault tolerant, the offsets topic is still stuck on broker 0 only.
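If that is the case, one way to fix the existing topic is to raise its replication factor with kafka-reassign-partitions. The sketch below assumes the default 50 partitions for __consumer_offsets, uses the broker ids 0, 1, 2 and the placeholder ZooKeeper address from the question, and expects all three brokers to be up while the reassignment runs:

    # Generate a reassignment that spreads every __consumer_offsets partition
    # across brokers 0, 1 and 2 (three replicas each, leaders rotated).
    {
      echo '{"version":1,"partitions":['
      for p in $(seq 0 49); do
        sep=$([ "$p" -lt 49 ] && echo ',' || echo '')
        echo "  {\"topic\":\"__consumer_offsets\",\"partition\":$p,\"replicas\":[$((p % 3)),$(((p + 1) % 3)),$(((p + 2) % 3))]}$sep"
      done
      echo ']}'
    } > increase-offsets-rf.json

    bin/kafka-reassign-partitions.sh --zookeeper XXX:2181 \
      --reassignment-json-file increase-offsets-rf.json --execute

    # Verify the new replica lists once the reassignment completes.
    bin/kafka-topics.sh --zookeeper XXX:2181 --describe --topic __consumer_offsets

Once the offsets partitions have replicas on more than one broker, the group coordinator can fail over when broker 0 dies. For a new cluster it also helps to have all brokers running before the first consumer connects (or to set offsets.topic.replication.factor explicitly), because in this Kafka version the offsets topic is auto-created with at most as many replicas as there are brokers alive at that moment.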