Avoid Google Dataproc logging

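I am using Google Dataproc to run millions of operations, and I have one problem: the size of the log data. I don't call show or print anything else, but 7 INFO lines multiplied by millions of operations add up to a very large volume of logs.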

Is there any way to keep Google Dataproc from logging these lines?

I have already tried this in Dataproc, without success:

https://cloud.google.com/dataproc/docs/guides/driver-output#configuring_logging

These are the 7 lines I want to get rid of:

18/07/30 13:11:54 INFO org.spark_project.jetty.util.log: Logging initialized @...

18/07/30 13:11:55 INFO org.spark_project.jetty.server.Server: ....z-SNAPSHOT

18/07/30 13:11:55 INFO org.spark_project.jetty.server.Server: Started @...

18/07/30 13:11:55 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@...

18/07/30 13:11:56 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: ...

18/07/30 13:11:57 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ...

18/07/30 13:12:01 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_...

What you are looking for is an exclusion filter: from the console, browse to Stackdriver Logging > Logs ingestion > Exclusions and click "Create exclusion". As explained there:

To create a logs exclusion, edit the filter on the left to only match logs that you do not want to be included in Stackdriver Logging. After an exclusion has been created, matched logs will no longer be accessible in Stackdriver Logging.

In your case, the filter should look something like this:

resource.type="cloud_dataproc_cluster"
textPayload:"INFO org.spark_project.jetty.util.log: Logging initialized"
...
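Since the filter has to cover all seven lines, it may be easier to generate the text than to type it by hand. The following is a minimal sketch, not an official tool: the helper names `build_exclusion_filter` and `would_match` are hypothetical, and the message fragments are assumptions copied from the log lines in the question, so adjust them to whatever your own logs contain. It only builds the filter string to paste into the "Create exclusion" dialog; it does not call any Google Cloud API. The fragments here contain no double quotes or backslashes, so no escaping is attempted.

```python
# Message fragments taken from the seven INFO lines in the question.
# Which fragments to use is an assumption; adapt them to your own logs.
# "Server:" is meant to cover both jetty Server lines.
FRAGMENTS = [
    "INFO org.spark_project.jetty.util.log: Logging initialized",
    "INFO org.spark_project.jetty.server.Server:",
    "INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector",
    "INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version",
    "INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager",
    "INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application",
]


def build_exclusion_filter(fragments, resource_type="cloud_dataproc_cluster"):
    """Return filter text: the resource type AND any one of the fragments."""
    clauses = " OR ".join('textPayload:"{}"'.format(f) for f in fragments)
    # A line break between two expressions is treated as AND,
    # as in the two-line filter shown in the answer.
    return 'resource.type="{}"\n({})'.format(resource_type, clauses)


def would_match(line, fragments):
    """Rough local approximation of the ':' substring match, for sanity checks."""
    return any(f.lower() in line.lower() for f in fragments)


EXCLUSION_FILTER = build_exclusion_filter(FRAGMENTS)
print(EXCLUSION_FILTER)
```

Before saving the exclusion, it is worth running the same filter in the Logs Viewer to confirm it matches only the lines you intend to drop, because excluded entries are no longer accessible in Stackdriver Logging.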