Union of DF: OutOfMemoryError: Requested array size exceeds VM limit

Question

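I have a 10 GB CSV file in my storage account.

I'm trying to call HTTP GET and fetch the content in byte ranges: in the first loop, 0 to 500 MB, then 501 MB to 1000 MB, and so on.

If I comment out the union of the DF, the code below works fine. How can I work around this error in a different way?

It fails exactly on the 5th loop. My guess is that after processing 2 GB (500 MB x 4 loops) some heap space limit is being crossed.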

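import java.text.SimpleDateFormat
import java.util.Calendar

for (i <- 1 to chunkNum) {
  println(i)
  // Hiding unnecessary code to get data in ranges
  // "yyyy" is the calendar year; "YYYY" is the week-based year and can be wrong around New Year
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
  val currentDate = dateFormat.format(Calendar.getInstance.getTime)
  println("BeforeResponse")
  val response = GetHttpResponse(headers, "https://mystorage.blob.core.windows.net/test/traindata.csv")
  println("AfterResponse")
  // Each iteration appends one row holding the whole chunk; commenting out this union avoids the error
  dfRestAPI = dfRestAPI.union(Seq((response, currentDate)).toDF("Chunk", "InsertedDate"))
}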

Answer 1

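Instead of using the REST API to fetch the data, let Spark do its job automatically: if the CSV file isn't compressed, Spark should split it into chunks on its own (there are more details here) and process them in parallel on multiple workers.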
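A minimal sketch of what that could look like, not a tested solution. The `wasbs://` path is an assumption built from the container (`test`) and account (`mystorage`) in the question's URL; it also assumes the storage credentials are already configured on the cluster and that the file has a header row. The `InsertedDate` column mirrors the one in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

// On Databricks a `spark` session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().getOrCreate()

// Assumed path: container "test" on account "mystorage", taken from the URL in the question.
// Access credentials for the storage account are assumed to be configured already.
val csvPath = "wasbs://test@mystorage.blob.core.windows.net/traindata.csv"

val df = spark.read
  .option("header", "true") // assumption: the file has a header row
  .csv(csvPath)
  .withColumn("InsertedDate", current_timestamp())

// Shows how many partitions Spark split the file into for parallel reading
println(df.rdd.getNumPartitions)
```

With this approach no single 500 MB string ever has to live on the driver, so there is nothing to union by hand.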

In your case, you're filling the JVM's memory with big chunks of garbage that most probably haven't been garbage collected yet.
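The arithmetic behind the "fails on the 5th loop" observation can be sketched as follows. This assumes each chunk is exactly 500 MiB and that the accumulated chunks end up in a single JVM array at some point; that second part is the asker's guess, not something confirmed here. What is certain is that JVM arrays are indexed by `Int`, so a single array can never hold more than `Int.MaxValue` elements.

```scala
// Assumption: every chunk is 500 MiB, as described in the question
val chunkBytes: Long = 500L * 1024 * 1024

// JVM arrays are indexed by Int, so no array can hold more than Int.MaxValue elements
val maxArrayLength: Long = Int.MaxValue.toLong

// Total bytes held after `loops` iterations if nothing is released
def accumulatedBytes(loops: Int): Long = chunkBytes * loops

// First iteration at which the accumulated bytes no longer fit in one byte array
val firstOverflow: Int =
  Iterator.from(1).find(n => accumulatedBytes(n) > maxArrayLength).get

println(s"after 4 loops: ${accumulatedBytes(4)} bytes")
println(s"after 5 loops: ${accumulatedBytes(5)} bytes")
println(s"first loop over the limit: $firstOverflow")
```

Under these assumptions four chunks still fit below the limit and the fifth does not, which matches the loop where the question reports the failure.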

scala

apache-spark

azure-databricks