Spark path style access with fs.s3a.path.style.access property is not working
I am trying to write to an on-premises S3 bucket using s3a, so my Spark writeStream() API uses the path s3a://test-bucket/. To make sure Spark understands this, I added hadoop-aws-2.7.4.jar and aws-java-sdk-1.7.4.jar to my build.sbt and configured Hadoop in the code as follows:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
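For comparison, the same settings can also be supplied when building the SparkSession via the "spark.hadoop." prefix, which Spark forwards into the Hadoop configuration. A minimal sketch, where ENDPOINT, ACCESS_KEY and SECRET_KEY are the same placeholder constants as above:

import org.apache.spark.sql.SparkSession

// Options prefixed with "spark.hadoop." are copied into hadoopConfiguration,
// so this is equivalent to the set(...) calls above.
val spark = SparkSession.builder()
  .appName("s3a-path-style")
  .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
  .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
  .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()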
Now I try to write data to my custom S3 endpoint as follows:
val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
    dayofmonth(current_date()) as "day",
    month(current_date()) as "month",
    year(current_date()) as "year",
    column("time"),
    column("quality"),
    column("PM25"))
  .writeStream
  .partitionBy("year", "month", "day")
  .format("csv")
  .outputMode("append")
  .option("path", "s3a://test-bucket/")

val streamingQuery: StreamingQuery = dataStreamWriter.start()
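As an aside, a Structured Streaming file sink also needs a checkpoint location before start() will succeed; a minimal sketch, using a hypothetical checkpoint path:

val queryWithCheckpoint: StreamingQuery = dataStreamWriter
  .option("checkpointLocation", "s3a://test-bucket/checkpoints/") // hypothetical path
  .start()
queryWithCheckpoint.awaitTermination() // block until the stream terminates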
But it seems that enabling path-style access is not working: it still prepends the bucket name to the host in the URL, as seen here:
20/05/01 15:39:02 INFO AmazonHttpClient: Unable to execute HTTP request: test-bucket.s3-region0.cloudian.com
java.net.UnknownHostException: test-bucket.s3-region0.cloudian.com
Could someone let me know if I am missing something?
I found the issue, thanks to mazaneicha for the comment. It is fixed by setting the hadoop-aws jar version to 2.8.0 in my build.sbt. It seems a separate fs.s3a.path.style.access flag was introduced in Hadoop 2.8.0, as I found the JIRA ticket HADOOP-12963 which resolves this issue. It worked.
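For reference, a minimal build.sbt sketch of that version bump (the matching AWS SDK comes in transitively via the hadoop-aws POM):

// Hadoop 2.8.0 is the first release with the fs.s3a.path.style.access flag (HADOOP-12963).
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.8.0"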
I had the same problem; it works with hadoop-aws version 3.2.0 and the following dependencies:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.11.375</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.2.0</version>
</dependency>
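Since the question uses sbt, the same coordinates in build.sbt form (a sketch mirroring the Maven dependencies above):

// build.sbt equivalent of the Maven dependencies above.
libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk" % "1.11.375",
  "org.apache.hadoop" % "hadoop-aws" % "3.2.0"
)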