Deadlock in Google Storage API

I'm running a Spark job on Dataproc that reads a large number of files from a bucket and combines them into one big file. I use google-api-services-storage 1.29.0 by shading it. Until now it worked fine, combining around 20-30K files. Today I tried it with about 5x as many files, and suddenly I got a deadlock (or at least I think I did, since all my executors seemed to be waiting indefinitely).
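
For context, each per-file read boils down to something like the following (a simplified sketch; in the real code the classes are shaded under com.shaded.*, and readFetchedLocation is the frame that appears in the dump below):

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class StorageAPI {
    // One client instance; the library documents Storage as thread-safe.
    private final Storage storage = StorageOptions.getDefaultInstance().getService();

    // Fetches the object's metadata, then downloads its full contents.
    public byte[] readFetchedLocation(String bucket, String objectName) {
        Blob blob = storage.get(bucket, objectName);
        return blob.getContent(); // this is the Blob.getContent frame in the dump
    }
}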

This is the thread dump:

org.conscrypt.NativeCrypto.SSL_read(Native Method)
org.conscrypt.NativeSsl.read(NativeSsl.java:416)
org.conscrypt.ConscryptFileDescriptorSocket$SSLInputStream.read(ConscryptFileDescriptorSocket.java:547) => holding Monitor(java.lang.Object@1638155334})
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
java.io.BufferedInputStream.read(BufferedInputStream.java:345) => holding Monitor(java.io.BufferedInputStream@1513035694})
sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) => holding Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection@995846771})
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492) => holding Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection@995846771})
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
com.shaded.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
com.shaded.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:105)
com.shaded.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
com.shaded.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
com.shaded.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
com.shaded.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:380)
com.shaded.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6189)
com.shaded.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:584)
com.shaded.google.cloud.storage.StorageImpl.call(StorageImpl.java:464)
com.shaded.google.cloud.storage.StorageImpl.call(StorageImpl.java:461)
com.shaded.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:89)
com.shaded.google.cloud.RetryHelper.run(RetryHelper.java:74)
com.shaded.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:51)
com.shaded.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:461)
com.shaded.google.cloud.storage.Blob.getContent(Blob.java:455)
my.package.with.my.StorageAPI.readFetchedLocation(StorageAPI.java:71)
...

Eventually I had to kill the job because nothing was happening. Any idea what's causing this? I tried using both a ThreadLocal&lt;Storage&gt; and a single Storage instance in my code; it doesn't seem to make a difference.
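
For reference, the two variants I tried look like this (a sketch; the field names are mine):

// Variant 1: a single shared client for all tasks in the JVM.
private static final Storage SHARED_STORAGE =
    StorageOptions.getDefaultInstance().getService();

// Variant 2: one client per executor thread.
private static final ThreadLocal<Storage> THREAD_STORAGE =
    ThreadLocal.withInitial(() -> StorageOptions.getDefaultInstance().getService());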

It turns out the job wasn't actually deadlocked; for some reason the Spark UI just doesn't show any task progress until the stage completes. I thought nothing was happening, but when I took thread dumps repeatedly I could see that it was doing something.

As tix suggested in a comment, it would probably be wise to implement exponential backoff when using the Storage library directly, and retry if I get a StorageException where isRetryable() is true.
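
A minimal sketch of that retry loop (the attempt cap and backoff values here are arbitrary choices of mine, not library defaults):

import java.util.concurrent.Callable;
import com.google.cloud.storage.StorageException;

public final class Backoff {
    // Runs the given call, retrying with exponential backoff on retryable storage errors.
    public static <T> T withBackoff(Callable<T> call) throws Exception {
        long delayMs = 100;        // initial delay, illustrative
        final int maxAttempts = 6; // illustrative cap
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (StorageException e) {
                if (!e.isRetryable() || attempt >= maxAttempts) {
                    throw e; // not retryable, or out of attempts
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // double the wait between attempts
            }
        }
    }
}

Used as, e.g., byte[] bytes = Backoff.withBackoff(blob::getContent);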