How to find Spark RDD/DataFrame size?

I know how to find a file's size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // note: java.io.File can only measure local files, not HDFS paths;
  // length returns the file size in bytes
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length) // does not compile: an RDD has no length method

But when I process it I do not get the file size. How do I find the RDD size?

If you simply want to count the number of rows in the rdd, do:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the number of bytes, you can use the SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
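One caveat worth hedging: calling SizeEstimator.estimate on the RDD reference from the driver measures the driver-side object, not the distributed rows. A minimal sketch that estimates row by row instead (assuming the sample path from the question and an existing SparkContext sc):

import org.apache.spark.util.SizeEstimator

// Estimate every row on the executors and sum the per-row estimates;
// this approximates the in-memory size of the data itself.
val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
val estimatedBytes = distFile.map(row => SizeEstimator.estimate(row)).reduce(_ + _)
println(s"Estimated in-memory size: $estimatedBytes bytes")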

Yes, finally I got the solution. Include these libraries:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

How to find the RDD size:

def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .fold(0L)(_ + _) // add the sizes together; fold also handles an empty RDD
}
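
A quick usage sketch (the HDFS path is just the sample from the question):

val lines = sc.textFile("hdfs://localhost:9000/samplefile.txt")
println(s"RDD size: ${calcRDDSize(lines)} bytes")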

Function to find the DataFrame size: (this function internally just converts the DataFrame to an RDD)

// toDF() on an RDD[String] needs the SQL implicits in scope,
// e.g. import spark.implicits._ (Spark 2.x) or sqlContext.implicits._ (1.x)
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

val rddOfDataframe = dataFrame.rdd.map(_.toString())

val size = calcRDDSize(rddOfDataframe)
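
Note that Row.toString() is a formatted rendering, so the result measures that string form rather than the underlying storage. As an alternative sketch on Spark 2.2+, you can ask Catalyst for its own size estimate of the plan (the stats API differs slightly across 2.x versions):

// The optimizer's estimate of the plan's output size in bytes;
// a statistic used for planning, not an exact measurement.
val catalystBytes = dataFrame.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Catalyst size estimate: $catalystBytes bytes")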

Below is one way apart from SizeEstimator, which I use frequently.

To know from code whether an RDD is cached and, more precisely, how many of its partitions are cached in memory and how many on disk, to get the storage level, and also to know the current actual caching status, i.e. the memory consumption.

The Spark Context has the developer API method getRDDStorageInfo(), which you can use occasionally for this.

Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.

For example:

scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] =
  Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None" (3)
  StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1;
  TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
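
A small programmatic sketch of the same API (the RDD must be persisted and materialized first, and getRDDStorageInfo is a DeveloperApi, so treat the RDDInfo fields below as version-dependent):

val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
distFile.cache().count() // materialize the cache; uncached RDDs are not reported

sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.name} (id=${info.id}): " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memory=${info.memSize} B, disk=${info.diskSize} B")
}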

It seems that the Spark UI also uses this code:

Description
With SPARK-13992, Spark supports persisting data into off-heap memory, but the usage of off-heap is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:

  1. Spark UI's executor page will display both on-heap and off-heap memory usage.
  2. REST request returns both on-heap and off-heap memory.
  3. Also these two memory usage can be obtained programmatically from SparkListener.
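
As a rough sketch of point 3, assuming Spark 2.3+ where SPARK-17019 added the on-heap/off-heap fields to SparkListenerBlockManagerAdded (the field names below come from that change and are version-dependent):

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded}

sc.addSparkListener(new SparkListener {
  // Fired when an executor's block manager registers; the two Options
  // carry the maximum on-heap and off-heap storage memory in bytes.
  override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit = {
    println(s"BlockManager ${event.blockManagerId}: " +
      s"onHeap=${event.maxOnHeapMem.getOrElse(-1L)} B, " +
      s"offHeap=${event.maxOffHeapMem.getOrElse(-1L)} B")
  }
})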