GCP Bulk Decompress maintaining file structure

Question

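We have a large number of compressed files stored in a GCS bucket. I am attempting to bulk decompress them using the provided utility. The data is in a timestamped directory hierarchy: YEAR/MONTH/DAY/HOUR/files.txt.gz. Dataflow accepts wildcard input patterns: inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz. However, the directory structure is flattened on output: all the files are decompressed into a single directory. Is it possible to maintain the directory hierarchy using the bulk decompressor? Is there another possible solution?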

gcloud dataflow jobs run gregstest \
    --gcs-location gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
    --service-account-email greg@gmeow.com \
    --project shopify-data-kernel \
    --parameters \
inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz,\
outputDirectory=gs://uncompressed-data/uncompressed,\
outputFailureFile=gs://uncompressed-data/failed

Answer 1

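I have looked up the Java code of the bulk decompressor, and the PipelineResult method performs the following steps:

Find all files matching the input pattern
Decompress the files found and output them to the output directory
Write any errors to the failure output file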
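
The three steps above can be mimicked on a local filesystem to show why the hierarchy disappears. This is a minimal sketch, not the template's actual code: the function name bulk_decompress_flat and the "path,error" failure format are hypothetical, and it assumes the output name is derived from the file's base name only, which matches the flattening described in the question.

```python
import glob
import gzip
import os
import shutil


def bulk_decompress_flat(input_pattern, output_dir, failure_file):
    """Local-filesystem imitation of the three steps (hypothetical helper)."""
    os.makedirs(output_dir, exist_ok=True)
    failures = []
    # Step 1: find all files matching the input pattern.
    for src in sorted(glob.glob(input_pattern)):
        # Assumption: only the base name is kept, so parent directories are lost.
        name = os.path.basename(src)
        if name.endswith(".gz"):
            name = name[:-3]
        dst = os.path.join(output_dir, name)
        try:
            # Step 2: decompress into the single output directory.
            with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
                shutil.copyfileobj(fin, fout)
        except (OSError, EOFError) as exc:
            # Drop the partial output and remember the failure.
            if os.path.exists(dst):
                os.remove(dst)
            failures.append(f"{src},{exc}")
    # Step 3: write any errors to the failure output file.
    with open(failure_file, "w") as f:
        f.write("\n".join(failures))
    return failures
```

Note that under this assumption two files with the same name in different HOUR directories would map to the same output path, so one would overwrite the other.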

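It looks like the API decompresses only files, not directories containing files. I would recommend checking this thread on Stack Overflow for possible solutions for decompressing files in GCS.

I hope the above information is useful to you.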
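
One possible workaround outside the Dataflow template is a small script that keeps each object's path relative to the input root. The sketch below assumes the google-cloud-storage Python client is installed and that the objects are plain .gz files stored without a gzip Content-Encoding; the helpers output_name and decompress_prefix are hypothetical names, and each object is held fully in memory, so this suits moderately sized files only.

```python
import gzip
import posixpath


def output_name(object_name, input_root, output_root):
    """Map <input_root>/2019/01/01/00/a.txt.gz to <output_root>/2019/01/01/00/a.txt."""
    prefix = input_root.rstrip("/") + "/"
    if not object_name.startswith(prefix):
        raise ValueError(f"{object_name} is not under {input_root}")
    # Keep everything below the input root so the hierarchy survives.
    relative = object_name[len(prefix):]
    if relative.endswith(".gz"):
        relative = relative[:-3]
    return posixpath.join(output_root, relative)


def decompress_prefix(src_bucket, input_root, dst_bucket, output_root):
    """Decompress every .gz object under input_root, preserving sub-paths."""
    # Imported here so output_name works without the client library installed.
    from google.cloud import storage

    client = storage.Client()
    target = client.bucket(dst_bucket)
    prefix = input_root.rstrip("/") + "/"
    for blob in client.list_blobs(src_bucket, prefix=prefix):
        if not blob.name.endswith(".gz"):
            continue
        # Whole object in memory: fine for a sketch, not for very large files.
        data = gzip.decompress(blob.download_as_bytes())
        new_name = output_name(blob.name, input_root, output_root)
        target.blob(new_name).upload_from_string(data)
```

With the buckets from the question, a call such as decompress_prefix("source-data", "raw/nginx", "uncompressed-data", "uncompressed") would write raw/nginx/2019/01/01/00/files.txt.gz to uncompressed/2019/01/01/00/files.txt.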


google-cloud-storage

google-cloud-platform

google-cloud-dataflow