StreamingQuery.awaitTermination 的目的是什么？

Question

我有一个 Spark Structured Streaming 作业，它从 Kafka 主题读取偏移量并将其写入 aerospike 数据库。目前我正在准备这项工作并实施 SparkListener.

在浏览文档时我偶然发现了这个例子：

StreamingQuery query = wordCounts.writeStream()
    .outputMode("complete")
    .format("console")
    .start();
query.awaitTermination();
After this code is executed, the streaming computation will have
started in the background. The query object is a handle to that active
streaming query, and we have decided to wait for the termination of
the query using awaitTermination() to prevent the process from exiting
while the query is active.

我知道它在终止进程之前等待查询完成。

具体是什么意思？它有助于避免查询写入的数据丢失。

当查询每天写入数百万条记录时，它有什么帮助？

虽然我的代码看起来很简单：

dataset.writeStream()
  .option("startingOffsets", "earliest")
  .outputMode(OutputMode.Append())
  .format("console")
  .foreach(sink)
  .trigger(Trigger.ProcessingTime(triggerInterval))
  .option("checkpointLocation", checkpointLocation)
  .start();

Answer 1

I understand that it waits for query to complete before terminating the process. What does it mean exactly

仅此而已。由于查询是在后台启动的，如果没有明确的阻塞指令，您的代码将简单地到达 main 函数的末尾并立即退出。

How is it helpful when query is writing millions of records every day?

真的没有。相反，它确保查询完全执行。

Answer 2

这里有很多问题，但只需回答下面的一个问题就可以解决所有问题。

I understand that it waits for query to complete before terminating the process. What does it mean exactly?

流式查询在单独的守护线程中运行。在 Java 中，守护线程用于允许并行处理，直到 Spark 应用程序的主线程完成（死亡）。在最后一个非守护线程结束后，JVM 立即关闭，整个 Spark 应用程序结束。

这就是为什么您需要让主要的非守护线程等待其他守护线程，以便它们可以完成工作。

阅读 What is a daemon thread in Java?

中的守护线程

StreamingQuery.awaitTermination 的目的是什么？

What is the purpose of StreamingQuery.awaitTermination?

apache-spark

spark-structured-streaming