spark writeStream not working with custom S3 endpoint

Question

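I am new to Spark. Using Spark Structured Streaming (v2.4.3), I am trying to write my streaming DataFrame to a custom S3 endpoint. I have confirmed that I can log in through the UI and upload data to the S3 bucket manually, and I have set up an ACCESS_KEY and SECRET_KEY for it.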

import org.apache.spark.sql.streaming.Trigger

val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-region1.myObjectStore.com:443")
sc.hadoopConfiguration.set("fs.s3a.access.key", "00cce9eb2c589b1b1b5b")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "flmheKX9Gb1tTlImO6xR++9kvnUByfRKZfI7LJT8")
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true") // bucket name appended as url/bucket and not bucket.url

val writeToS3Query = stream.writeStream
      .format("csv")
      .option("sep", ",")
      .option("header", true)
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .option("path", "s3a://bucket0/")
      .option("checkpointLocation", "/Users/home/checkpoints/s3-checkpointing")
      .start()

However, the error I get is

Unable to execute HTTP request: bucket0.s3-region1.myObjectStore.com: nodename nor servname provided, or not known

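My /etc/hosts file maps the URL to its IP, and the bucket is reachable from other sources. Is there another way to get this working? I am not sure why the bucket name is being prepended to the URL when Spark executes the query.

Could it be that I set the Hadoop configuration on the Spark context after the session was created, so the settings have no effect? But then how is it able to resolve the actual URL when I only give s3a://bucket0 in path?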

Answer 1

These settings are probably easier to set in spark-defaults.conf.

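Try an all-lowercase hostname.
Drop the :443 from the endpoint value; https is the default, and there is a switch to disable it explicitly.
The secret key property is "fs.s3a.secret.key".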
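The suggestions above could be sketched as follows. This is a minimal sketch, not a confirmed fix: it assumes the settings are supplied as spark.hadoop.-prefixed properties when the session is built (Spark copies such properties into the Hadoop configuration), and the same keys could go into spark-defaults.conf instead. The application name is hypothetical, and reading the credentials from environment variables named ACCESS_KEY and SECRET_KEY is an assumption.

```scala
import org.apache.spark.sql.SparkSession

// Supply the S3A settings before the session exists, so every
// filesystem instance created afterwards sees them.
val spark = SparkSession.builder()
  .appName("s3a-custom-endpoint") // hypothetical name
  // All-lowercase hostname, no :443 suffix (https is the default).
  .config("spark.hadoop.fs.s3a.endpoint", "s3-region1.myobjectstore.com")
  // Assumption: credentials are exported as environment variables;
  // sys.env(...) throws if a variable is missing.
  .config("spark.hadoop.fs.s3a.access.key", sys.env("ACCESS_KEY"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("SECRET_KEY"))
  // Ask for endpoint/bucket URLs instead of bucket.endpoint.
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()
```

Keeping the keys out of the source file also avoids pasting credentials into code that may be shared.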

Answer 2

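I solved this by setting the hadoop-aws jar version to 2.8.0 in build.sbt. The separate flag fs.s3a.path.style.access appears to have been introduced in Hadoop 2.8.0; I found the JIRA ticket for it, HADOOP-12963. With that version, the write worked.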
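For reference, the build.sbt change described above might look like the line below. This is a sketch only: the answer gives the version but not the exact dependency line, so the use of the standard org.apache.hadoop group id is an assumption about this project's build.

```scala
// build.sbt (sketch): pin hadoop-aws to the release that understands
// fs.s3a.path.style.access, per HADOOP-12963.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.8.0"
```

The hadoop-aws version generally needs to match the other Hadoop jars on the classpath, so those may need to move to 2.8.0 as well.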

