AWS Glue Job - CSV to Parquet. How to ignore header?

Question

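I need to convert a batch of (23) CSV files (source: S3) to Parquet format. Every input CSV file contains a header row. When I use the code generated by Glue, the output contains 22 header rows as separate data rows, which means only the first header was ignored. I need help ignoring all of the headers during this conversion.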

Since I am using the from_catalog function for the input, I don't have any format_options available to ignore the header rows.

Also, can I set an option on the Glue table to indicate that the files contain a header? Would my job then ignore the headers automatically when it runs?

Part of my current approach is below. I am new to Glue. This code was actually auto-generated by Glue.

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")

datasink1 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://my-bucket-name/full/s3/path-parquet"}, format = "parquet", transformation_ctx = "datasink1")

Answer 1

I ran into this exact problem while working on an ETL job that uses AWS Glue.

The documentation for from_catalog says:

additional_options – A collection of optional name-value pairs. The possible options include those listed in Connection Types and Options for ETL in AWS Glue except for endpointUrl, streamName, bootstrap.servers, security.protocol, topicName, classification, and delimiter.

I tried the snippet below, and a few permutations of it, with from_catalog. Nothing worked for me.

additional_options = {"format": "csv", "format_options": '{"withHeader": "True"}'},

One way to work around this is to use from_options instead of from_catalog and point directly at the S3 bucket or folder. It should look something like this:

datasource0 = glueContext.create_dynamic_frame.from_options(
  connection_type="s3",
  connection_options={
      'paths': ['s3://bucket_name/folder_name'],
      "recurse": True,
      'groupFiles': 'inPartition'
  }, 
  format="csv", 
  format_options={
      "withHeader": True
  }, 
  transformation_ctx = "datasource0"
)

However, if for any reason you can't do that and want to stick with from_catalog, use a filter, which worked for me.

Assuming one of your headers is called name, the snippet could look like this:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["name"] != "name")

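I'm not sure how Spark's DataFrames or Glue's DynamicFrames handle CSV headers, or why data read from the catalog ends up with the headers in both the rows and the schema, but this seems to have solved my problem by removing the header values from the rows.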
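One caveat with a single-column check such as x["name"] != "name" is that it would also drop a genuine data row whose name field happens to contain the text "name". A slightly stricter predicate, sketched below in plain Python, treats a row as a header only when every column holds its own column name. The helper is_header_row, the sample rows, and the city column are hypothetical and used only for illustration; plain dicts stand in for Glue records here.

```python
def is_header_row(row, columns):
    """Return True only when every listed column holds its own column name."""
    return all(str(row[col]) == col for col in columns)


# Plain dicts standing in for records read from several CSV files (made-up data).
rows = [
    {"name": "name", "city": "city"},     # repeated header line from a later file
    {"name": "Alice", "city": "Lisbon"},
    {"name": "name", "city": "Porto"},    # real data that happens to contain "name"
    {"name": "Bob", "city": "Oslo"},
]
columns = ["name", "city"]

# Keep everything that is not a repeated header line.
data_rows = [r for r in rows if not is_header_row(r, columns)]
print(data_rows)
```

Assuming Glue records support the same x["column"] lookup used in the filter above, a predicate like this could probably be passed to Filter.apply as f = lambda x: not is_header_row(x, columns), though I have not verified that in a Glue job.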
