Clearing Kafka offsets in Spark Structured Streaming

Question

While testing, my code looks like this:

    // Assumes an existing SparkSession named `spark`.
    val df = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "...")
        .option("subscribe", "...")
        .option("startingOffsets", "earliest")
    //    .option("startingOffsets", "latest")
        .load()

But when I set .option("startingOffsets", "latest"), the query still resumes from where it left off. How can I make .option("startingOffsets", "latest") take effect?

P.S. I tried deleting the checkpoint files, but that did not help.

Answer 1

Please refer to the documentation:

https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html

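The start point when a query is started: either "earliest", which starts from the earliest offsets; "latest", which starts from the latest offsets; or a JSON string specifying a starting offset for each TopicPartition. In the JSON, -2 as an offset can be used to refer to earliest and -1 to latest. Note: for batch queries, latest (either implicitly or by using -1 in the JSON) is not allowed. For streaming queries, this only applies when a new query is started; resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.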
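
To make the JSON form of startingOffsets concrete, here is a minimal sketch that builds such a string from a map of topics to per-partition offsets. The object name `StartingOffsetsJson`, its `build` helper, and the topic names are hypothetical and not part of Spark; the sketch also assumes topic names need no JSON escaping.

```scala
object StartingOffsetsJson {
  // Special offset values described in the Kafka integration guide.
  val Earliest: Long = -2L
  val Latest: Long = -1L

  // Builds a string shaped like {"topic":{"partition":offset,...},...}.
  // Topics and partitions are sorted so the output is deterministic.
  def build(offsets: Map[String, Map[Int, Long]]): String =
    offsets.toSeq.sortBy(_._1).map { case (topic, partitions) =>
      val entries = partitions.toSeq.sortBy(_._1)
        .map { case (partition, offset) => "\"" + partition + "\":" + offset }
        .mkString(",")
      "\"" + topic + "\":{" + entries + "}"
    }.mkString("{", ",", "}")
}
```

The result could then be passed as .option("startingOffsets", StartingOffsetsJson.build(...)). As with "earliest" and "latest", it would only be consulted when a new streaming query is started.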

For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
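
A streaming query's progress is stored in its checkpoint location, so one way to have startingOffsets apply again is to start the query against a checkpoint directory that has never been used. The following is a minimal sketch, not a definitive fix for the setup in the question: the console sink, the broker address, the topic name, and the checkpoint path are all placeholders assumed for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LatestOffsetsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("latest-offsets-sketch")
      .getOrCreate()

    // Placeholders (assumptions): replace with the real broker list and topic.
    val brokers = "localhost:9092"
    val topic = "my-topic"

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", topic)
      .option("startingOffsets", "latest")
      .load()

    // A directory that no earlier run has written to (assumed path).
    // With no saved offsets there, the query counts as new and
    // startingOffsets is honored.
    val checkpointDir =
      s"/tmp/checkpoints/latest-offsets-sketch-${System.currentTimeMillis()}"

    val query = df
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .option("checkpointLocation", checkpointDir)
      .start()

    query.awaitTermination()
  }
}
```

If deleting checkpoint files appears to have no effect, it may be worth confirming which checkpointLocation the running query actually uses, since the offsets are read from that directory when the query restarts.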

apache-kafka

apache-spark

spark-structured-streaming