用于 spark 结构化流的检查点目录下的子目录

Question

spark structured streaming的checkpoint目录创建了四个子目录。它们各自的用途是什么？

/warehouse/test_topic/checkpointdir1/commits
/warehouse/test_topic/checkpointdir1/metadata
/warehouse/test_topic/checkpointdir1/offsets
/warehouse/test_topic/checkpointdir1/sources

Answer 1

来自 StreamExecution class 文档：

/**
   * A write-ahead-log that records the offsets that are present in each batch. In order to ensure
   * that a given batch will always consist of the same data, we write to this log *before* any
   * processing is done.  Thus, the Nth record in this log indicated data that is currently being
   * processed and the N-1th entry indicates which offsets have been durably committed to the sink.
   */
  val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))

  /**
   * A log that records the batch ids that have completed. This is used to check if a batch was
   * fully processed, and its output was committed to the sink, hence no need to process it again.
   * This is used (for instance) during restart, to help identify which batch to run next.
   */
  val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))

元数据日志用于与查询相关的信息。例如，在 KafkaSource 中，它用于写入查询的起始偏移量（每个分区的偏移量）

Answer 2

源文件夹包含每个分区的初始kafka偏移值。就像你的 kafka 有 3 个分区 1,2,3 并且每个分区的起始值为 0 那么它将包含像 {1:0,2:0,3:0}

这样的值

用于 spark 结构化流的检查点目录下的子目录

Sub directories under checkpoint directory for spark structured streaming

apache-spark

spark-structured-streaming