Data loss in Kafka producer when a Kafka broker goes down and comes back
I am facing some data loss whenever a Kafka broker goes down and rejoins the cluster. I assume a rebalance is triggered whenever the broker rejoins, and at that point I observe some errors in my Kafka producer.
The producer writes to a Kafka topic with 40 partitions; below is the log sequence I see whenever a rebalance is triggered.
[WARN ] 2019-06-05 20:39:08 WARN Sender:521 - [Producer clientId=producer-1] Got error produce response with correlation id 133054 on topic-partition test_ve-17, retrying (2 attempts left). Error: NOT_LEADER_FOR_PARTITION
...
...
[WARN ] 2019-06-05 20:39:31 WARN Sender:521 - [Producer clientId=producer-1] Got error produce response with correlation id 133082 on topic-partition test_ve-12, retrying (1 attempts left). Error: NOT_ENOUGH_REPLICAS
...
...
[ERROR] 2019-06-05 20:39:43 ERROR GlobalsKafkaProducer:297 - org.apache.kafka.common.errors.NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.
...
...
[WARN ] 2019-06-05 20:39:48 WARN Sender:521 - [Producer clientId=producer-1] Got error produce response with correlation id 133094 on topic-partition test_ve-22, retrying (1 attempts left). Error: NOT_ENOUGH_REPLICAS
[ERROR] 2019-06-05 20:39:53 ERROR Sender:604 - [Producer clientId=producer-1] The broker returned org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received an out of order sequence number for topic-partition test_ve-37 at offset -1. This indicates data loss on the broker, and should be investigated.
[INFO ] 2019-06-05 20:39:53 INFO TransactionManager:372 - [Producer clientId=producer-1] ProducerId set to -1 with epoch -1
[ERROR] 2019-06-05 20:39:53 ERROR GlobalsKafkaProducer:297 - org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received an out of order sequence number
...
...
[ERROR] 2019-06-05 20:39:53 ERROR GlobalsKafkaProducer:297 - org.apache.kafka.common.errors.OutOfOrderSequenceException: Attempted to retry sending a batch but the producer id changed from 417002 to 418001 in the mean time. This batch will be dropped.
Some of our Kafka configurations are:
acks = all
min.insync.replicas=2
unclean.leader.election.enable=false
linger.ms=250
retries = 3
I call flush() after every 3000 records produced. Is there anything I am doing wrong? Please advise.
Let me assume a few things: you have 3 Kafka broker nodes, the replication factor of all topics is also 3, and you do not create topics on the fly.
As you stated:
acks = all
min.insync.replicas=2
unclean.leader.election.enable=false
In that case, if both in-sync follower replicas go down, you will certainly lose data: the last remaining replica is not eligible to be elected leader because unclean.leader.election.enable=false, and with no leader there is nothing to accept produce requests. Since you set linger.ms=250, if one of the out-of-sync replicas recovers within that short window and is elected topic leader again, you may avoid the data loss. Note, however, that linger.ms works together with batch.size: if batch.size is set very low and the number of messages to send fills a batch, the producer may not wait for the linger.ms interval at all.
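To make that interplay concrete, here is an illustrative producer config fragment (the values are examples, not recommendations):

```properties
# A partition batch is sent as soon as EITHER limit is reached:
linger.ms=250       # wait up to 250 ms to accumulate more records into a batch
batch.size=16384    # ...but send immediately once the batch holds 16 KB of records

# If batch.size were very small (say 1024), a modest burst of records would
# fill each batch almost instantly, and the 250 ms linger would never apply.
```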
So one definite change I would recommend is to increase retries. Also check the configuration of request.timeout.ms, and find out the average time a broker takes to come back after going down. Your retries, taken together, should cover the time the broker needs to become active again; provided all the other trade-offs are in place to reduce the chance of data loss, this will certainly help you avoid losing data.
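As a rough sketch of that recommendation (the numbers are illustrative and should be sized from your measured broker recovery time):

```properties
retries=10                 # up from 3: more chances to outlast the broker outage
request.timeout.ms=30000   # how long each produce attempt waits before failing
retry.backoff.ms=100       # pause between successive retry attempts

# Rough upper bound on how long one record keeps being retried:
#   retries x (request.timeout.ms + retry.backoff.ms) ≈ 10 x 30.1 s ≈ 5 minutes
# Make sure this window exceeds the time a broker typically takes to come back.
```

In Kafka clients 2.1 and later, delivery.timeout.ms bounds this retry window directly, which is usually easier to reason about than multiplying out retries.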