How to change the location of the _spark_metadata directory?

Question

I am running a Spark Structured Streaming query that writes parquet files to S3 with the following code:

ds.writeStream()
    .format("parquet")
    .outputMode(OutputMode.Append())
    .option("queryName", "myStreamingQuery")
    // Checkpoint data (including Kafka offsets) goes to its own bucket.
    .option("checkpointLocation", "s3a://my-kafka-offset-bucket-name/")
    // The parquet output is written under this path.
    .option("path", "s3a://my-data-output-bucket-name/")
    .partitionBy("createdat")
    .start();

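I get the desired output in the S3 bucket my-data-output-bucket-name, but along with the output I also get a _spark_metadata folder in it. How can I get rid of it? If I cannot get rid of it, how can I change its location to a different S3 bucket?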

Answer 1

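My understanding is that this is not possible as of Spark 2.3.

The name of the metadata directory is always _spark_metadata.
The _spark_metadata directory is always located where the path option points.

I think the only way to "fix" it is to report an issue in Apache Spark's JIRA and hope that someone picks it up.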

Internals

The flow is that DataSource is requested to create the sink of a streaming query and takes the path option. With that, it goes on to create a FileStreamSink. The path option simply becomes the basePath, which is where both the results and the metadata are written.
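
As a rough illustration of the flow described above, here is a minimal sketch in plain Java (no Spark dependency) of how the metadata location follows from the path option. The class name MetadataLocationSketch and the method metadataLogPath are hypothetical and are not Spark APIs; the only things taken from the answer are the fixed directory name _spark_metadata and the fact that it is resolved under the base path.

```java
// Hypothetical illustration only; this is not Spark's actual code.
final class MetadataLocationSketch {
    // Per the answer, the directory name is fixed and not configurable.
    static final String METADATA_DIR = "_spark_metadata";

    // Resolves the metadata directory under the sink's base path,
    // i.e. the value passed as the "path" option.
    static String metadataLogPath(String basePath) {
        String trimmed = basePath.endsWith("/")
                ? basePath.substring(0, basePath.length() - 1)
                : basePath;
        return trimmed + "/" + METADATA_DIR;
    }

    public static void main(String[] args) {
        // prints s3a://my-data-output-bucket-name/_spark_metadata
        System.out.println(metadataLogPath("s3a://my-data-output-bucket-name/"));
    }
}
```

Because the only input is the base path, pointing checkpointLocation at another bucket has no effect on where _spark_metadata ends up.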

You may find the initial commit quite useful for understanding the purpose of the metadata directory.

In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.
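
The read-side check mentioned in the commit message can be pictured with a small sketch. It is written in plain Java against the local filesystem only, and the class name MetadataLogCheck and the method hasMetadataLog are hypothetical; Spark's own check goes through the Hadoop FileSystem API and may differ in detail.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; local filesystem, not S3 or HDFS.
final class MetadataLogCheck {
    static final String METADATA_DIR = "_spark_metadata";

    // True when the output directory contains a metadata log directory,
    // in which case a reader would use the log instead of listing files.
    static boolean hasMetadataLog(Path outputDir) {
        return Files.isDirectory(outputDir.resolve(METADATA_DIR));
    }

    public static void main(String[] args) {
        // Assumed usage: pass an output directory as the first argument.
        Path outputDir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(hasMetadataLog(outputDir)
                ? "metadata log found: read the file list from the log"
                : "no metadata log: fall back to file listing");
    }
}
```

This is why the directory matters to readers of the output: if it is deleted or moved, a reader of this kind falls back to listing whatever files are present.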

amazon-s3

apache-spark

parquet

spark-structured-streaming
