AWS Glue job write to s3 in parquet format error with Not Found
I have been creating PySpark jobs, and I keep getting an intermittent (seemingly random) error like this:
An error occurred while calling o129.parquet. Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found;
Request ID: D2FA355F92AF8F05; S3 Extended Request ID: 1/fWdf1DurwPDP40HDGARlMRO/7lKzFDJ4g7DbUnM04wUvG89CG9w5T+u4UxapkWp20MfQfdjsE=)
I'm not even reading from S3 at that point; what I'm actually doing is:
df.coalesce(100).write.partitionBy("mth").mode("overwrite").parquet("s3://"+bucket+"/"+path+"/out")
So I changed the number of coalesce partitions, but I don't know what else I should do to mitigate this error and make my job more stable.
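One mitigation I've been experimenting with, assuming the 404s come from S3's eventually consistent listings racing with the "overwrite": write each run to a fresh output prefix instead of overwriting in place. This is only a sketch; the uuid-based suffix is my own naming choice, not part of the original job:

import uuid

# Hypothetical mitigation: give each run its own output prefix so the
# overwrite never races against stale S3 listings of a previous run.
run_id = uuid.uuid4().hex
out_path = "s3://" + bucket + "/" + path + "/out_" + run_id

df.coalesce(100).write.partitionBy("mth").mode("overwrite").parquet(out_path)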
Reading files from S3 with Glue:
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://path"]},  # "paths" takes a list of s3:// URIs
    format="json",
    transformation_ctx="datasource0",
)
Writing files to S3 with Glue:
output = glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": "s3://path"},
    format="parquet",
    transformation_ctx="output",
)
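One thing worth noting here: write_dynamic_frame.from_options expects a DynamicFrame, and df in the earlier snippet is a Spark DataFrame. A minimal sketch of the conversion, assuming df is that same DataFrame (the name "df_dyf" is mine, not from the original job):

from awsglue.dynamicframe import DynamicFrame

# write_dynamic_frame expects a DynamicFrame, not a Spark DataFrame,
# so convert first with DynamicFrame.fromDF.
df_dyf = DynamicFrame.fromDF(df, glueContext, "df_dyf")

output = glueContext.write_dynamic_frame.from_options(
    frame=df_dyf,
    connection_type="s3",
    connection_options={"path": "s3://path"},
    format="parquet",
    transformation_ctx="output",
)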