Glue job is deleting columns when creating multiple partitions

Question

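My Glue job reads a table (backed by a CSV file on S3), repartitions it, and writes 10 JSON files to S3.

I noticed that in some rows of the output files, some columns are missing!

Here is the code: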

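# Read the catalog table that points at the CSV file on S3.
etalab_named_postgre_csv = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="tab",
    transformation_ctx="datasource0",
)

# Map each source column and type to a target column and type.
applymapping_etalab_named_postgre_csv = ApplyMapping.apply(
    frame=etalab_named_postgre_csv,
    mappings=[
        ("compldistrib", "string", "compldistrib", "string"),
        ("numvoie", "long", "numvoie", "long"),
        # .... (remaining mappings elided in the original post)
    ],
    transformation_ctx="applymapping1",
)

path_s3 = "s3://Bucket"

# Convert to a Spark DataFrame and write 10 JSON files.
etalab_named_postgre_csv = applymapping_etalab_named_postgre_csv.toDF()
(
    etalab_named_postgre_csv.repartition(10)
    .write.format("json")
    # "sep" and "header" are CSV options; the JSON writer does not use them.
    .option("sep", ",")
    .option("header", "true")
    # The save mode is set with .mode(), not with .option("mode", ...).
    .mode("overwrite")
    .save(path_s3)
)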

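In the output files, some columns have disappeared!

I loaded the same input table with Spark on EMR to check whether the missing columns are present.

Is this normal behavior for Glue? How can I prevent it?

Edit:

I have now identified the problem.

The Glue mapping seems to be the root cause. When I do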

applymapping_etalab_named_postgre_csv = ApplyMapping.apply(
    frame=etalab_named_postgre_csv,
    mappings=[
        ("compldistrib", "string", "compldistrib", "string"),
        ("numvoie", "long", "numvoie", "long"),
        # .... (remaining mappings elided in the original post)
    ],
    transformation_ctx="applymapping1",
)

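I declare compldistrib as a string, and I expect it to be written out as a string. But if a row contains a numeric value in compldistrib, the mapping drops that value!

Is this a bug?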
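
To narrow down which columns are affected, it may help to scan a sample of the input CSV for columns that hold both numeric-looking and non-numeric values. The following is a minimal sketch using only the Python standard library; the helper names `looks_numeric` and `mixed_type_columns` and the sample rows are made up for illustration and are not part of Glue or of the original job.

```python
import csv
import io
from collections import defaultdict
from typing import List


def looks_numeric(value: str) -> bool:
    """Return True if the text parses as a number (int or float)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def mixed_type_columns(csv_text: str, delimiter: str = ",") -> List[str]:
    """Return the columns that hold both numeric and non-numeric values."""
    kinds = defaultdict(set)
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        for column, value in row.items():
            # Skip empty cells and malformed rows (missing or extra fields);
            # they say nothing about the column's type.
            if not isinstance(value, str) or not value.strip():
                continue
            kinds[column].add("numeric" if looks_numeric(value) else "text")
    return sorted(col for col, seen in kinds.items() if len(seen) > 1)


# Hypothetical sample: compldistrib is text in one row and a number in another.
sample = """compldistrib,numvoie
BATIMENT A,12
42,7
,3
"""

print(mixed_type_columns(sample))  # ['compldistrib']
```

Columns reported by this check are the ones where a per-row type guess could disagree with the type declared in the mapping.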

Answer 1

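After several hours of searching, I did not find a solution. The workaround I found was to replace the Glue job with a Spark job on EMR. It is also much faster.

I hope this helps someone.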
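
For readers who need to stay on Glue, one thing that may be worth trying is to cast the ambiguous column explicitly before the mapping step. This assumes the dropped values come from Glue treating compldistrib as a column with more than one possible type (a "choice" type), which the original post does not confirm. The sketch below relies on the `resolveChoice` method of a Glue DynamicFrame with a `cast:string` spec; the helper names `build_cast_specs` and `cast_before_mapping` are hypothetical, and the code has not been run against the dataset from the question.

```python
from typing import Iterable, List, Tuple


def build_cast_specs(
    columns: Iterable[str], target_type: str = "string"
) -> List[Tuple[str, str]]:
    """Build (column, "cast:<type>") pairs in the form resolveChoice expects."""
    return [(column, "cast:" + target_type) for column in columns]


def cast_before_mapping(dynamic_frame, columns: Iterable[str]):
    """Cast the given columns to string on a Glue DynamicFrame.

    `dynamic_frame` is assumed to be an awsglue DynamicFrame, for example
    the result of create_dynamic_frame.from_catalog(...).
    """
    return dynamic_frame.resolveChoice(specs=build_cast_specs(columns))


print(build_cast_specs(["compldistrib"]))  # [('compldistrib', 'cast:string')]

# Inside the Glue job (not runnable outside a Glue environment):
# resolved = cast_before_mapping(etalab_named_postgre_csv, ["compldistrib"])
# ...then pass `resolved` as the `frame` argument of ApplyMapping.apply.
```

Calling `printSchema()` on the DynamicFrame before and after the cast should show whether the column really had more than one candidate type.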

amazon-web-services

apache-spark

aws-glue