hadoop fs -du / gsutil du is running slow on GCP

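I am trying to get the size of a directory in a Google Cloud Storage bucket, but the command takes a very long time to run.

I tried it on 8 TB of data with about 24k subdirectories and files, and it takes roughly 20 to 25 minutes. By contrast, the same data on HDFS returns its size in under a minute.

The commands I use to get the size: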

hadoop fs -du gs://mybucket
gsutil du gs://mybucket

Please suggest how I can do this faster.

Commands 1 (hadoop fs -du) and 2 (gsutil du) are nearly the same, because 1 goes through the GCS connector.

GCS computes usage by issuing list requests, which can take a long time if you have a large number of objects.
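To make the listing-based approach concrete, here is a minimal sketch of how per-directory usage can be summed from a single flat listing. The function name `du_by_prefix` and the sample data are hypothetical. In practice the (name, size) pairs could come from something like `list_blobs` in the google-cloud-storage Python client, which is an assumption about your tooling rather than what gsutil does internally. I have not measured whether this is faster than gsutil du on your data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def du_by_prefix(objects: Iterable[Tuple[str, int]], depth: int = 1) -> Dict[str, int]:
    """Sum object sizes per directory-like prefix from one flat listing.

    `objects` yields (object_name, size_in_bytes) pairs. `depth` is the
    number of leading path components that make up the reported prefix.
    """
    totals: Dict[str, int] = defaultdict(int)
    for name, size in objects:
        parts = name.split("/")
        # Objects with no more than `depth` components sit above that
        # level and are grouped under the empty prefix.
        prefix = "/".join(parts[:depth]) + "/" if len(parts) > depth else ""
        totals[prefix] += size
    return dict(totals)


# Hypothetical listing; real pairs would come from a bucket listing.
listing = [
    ("logs/2020/a.txt", 100),
    ("logs/2021/b.txt", 250),
    ("data/c.bin", 4096),
    ("readme.md", 10),
]
usage = du_by_prefix(listing)
print(usage)
```

The point of the sketch is that every byte count comes from a listing result, so the cost scales with the number of objects rather than with the amount of data.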

This article suggests setting up Access Logs as an alternative to gsutil du: https://cloud.google.com/storage/docs/working-with-big-data#data
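As a rough illustration of the log-based route, the sketch below converts a daily storage log into an average bucket size. It assumes the log is a CSV with `bucket` and `storage_byte_hours` columns, which is how I recall the GCS storage logs being laid out, so check that against the current documentation. The function name and the sample row are made up, and the result is bucket-wide, not per directory.

```python
import csv
import io
from typing import Dict


def average_bytes(storage_log_csv: str) -> Dict[str, float]:
    """Convert byte-hours over one day into an average size in bytes.

    Assumes one row per bucket with columns `bucket` and
    `storage_byte_hours` (an assumption about the log format).
    """
    reader = csv.DictReader(io.StringIO(storage_log_csv))
    # A daily log covers 24 hours, so byte-hours / 24 = average bytes.
    return {row["bucket"]: int(row["storage_byte_hours"]) / 24 for row in reader}


# Hypothetical log content for a bucket holding about 8 TB all day.
sample = '"bucket","storage_byte_hours"\n"mybucket","192000000000000"\n'
sizes = average_bytes(sample)
print(sizes)
```

Reading one small log file avoids listing the objects at all, at the price of the figure being up to a day old.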

However, if you intend to run any analysis on the data, you will probably still incur the same 20-25 minute cost. From the GCS Best Practices guide:

Forward slashes in objects have no special meaning to Cloud Storage, as there is no native directory support. Because of this, deeply nested directory-like structures using slash delimiters are possible, but won't have the performance of a native filesystem listing deeply nested sub-directories.

Assuming you intend to analyze this data, you may want to consider using time hadoop distcp to benchmark fetch performance for different file sizes and glob expressions.
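If you want to script that comparison instead of running time by hand, a small harness along these lines may help. The helper `time_command` is hypothetical, and so are the bucket paths in the comments. The runnable part times a trivial Python subprocess so that the sketch works without Hadoop installed.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def time_command(argv: List[str]) -> Tuple[float, int]:
    """Run a command and return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, completed.returncode


# Stand-in command so the sketch runs anywhere.
elapsed, code = time_command([sys.executable, "-c", "pass"])
print(f"exit={code} elapsed={elapsed:.3f}s")

# On a cluster you would time one copy per glob instead, for example
# (hypothetical source and destination paths):
#   time_command(["hadoop", "distcp", "gs://mybucket/small-files/*", "hdfs:///tmp/bench-small"])
#   time_command(["hadoop", "distcp", "gs://mybucket/large-files/*", "hdfs:///tmp/bench-large"])
```

Comparing the elapsed times across globs and file sizes should show whether listing or data transfer dominates for your layout.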

google-cloud-storage

google-cloud-platform

google-cloud-dataproc
