How to find Spark RDD/DataFrame size?

I know how to find a file's size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // note: java.io.File can only measure local files, not HDFS paths;
  // length returns the file size in bytes
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length) // does not compile: an RDD has no length method

But when I process it I do not get the file size. How do I find the RDD size?

If you simply want to count the number of rows in the rdd, do:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the number of bytes, you can use the SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
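One caveat worth hedging: calling SizeEstimator.estimate on the RDD reference from the driver measures the driver-side object, not the distributed rows. A minimal sketch that estimates row by row instead (assuming the sample path from the question and an existing SparkContext sc):

import org.apache.spark.util.SizeEstimator

// Estimate every row on the executors and sum the per-row estimates;
// this approximates the in-memory size of the data itself.
val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
val estimatedBytes = distFile.map(row => SizeEstimator.estimate(row)).reduce(_ + _)
println(s"Estimated in-memory size: $estimatedBytes bytes")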

Yes, finally I got the solution. Include these libraries:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

How to find the RDD size:

def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .fold(0L)(_ + _) // add the sizes together; fold also handles an empty RDD
}
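
A quick usage sketch (the HDFS path is just the sample from the question):

val lines = sc.textFile("hdfs://localhost:9000/samplefile.txt")
println(s"RDD size: ${calcRDDSize(lines)} bytes")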

Function to find the DataFrame size: (this function internally just converts the DataFrame to an RDD)

// toDF() on an RDD[String] needs the SQL implicits in scope,
// e.g. import spark.implicits._ (Spark 2.x) or sqlContext.implicits._ (1.x)
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

val rddOfDataframe = dataFrame.rdd.map(_.toString())

val size = calcRDDSize(rddOfDataframe)
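
Note that Row.toString() is a formatted rendering, so the result measures that string form rather than the underlying storage. As an alternative sketch on Spark 2.2+, you can ask Catalyst for its own size estimate of the plan (the stats API differs slightly across 2.x versions):

// The optimizer's estimate of the plan's output size in bytes;
// a statistic used for planning, not an exact measurement.
val catalystBytes = dataFrame.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Catalyst size estimate: $catalystBytes bytes")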

Below is one way apart from SizeEstimator, which I use frequently.

To know from code whether an RDD is cached and, more precisely, how many of its partitions are cached in memory and how many on disk, to get the storage level, and also to know the current actual caching status, i.e. the memory consumption.

The Spark Context has the developer API method getRDDStorageInfo(), which you can use occasionally for this.

Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.

For example:

scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] =
  Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None" (3)
  StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1;
  TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
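
A small programmatic sketch of the same API (the RDD must be persisted and materialized first, and getRDDStorageInfo is a DeveloperApi, so treat the RDDInfo fields below as version-dependent):

val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
distFile.cache().count() // materialize the cache; uncached RDDs are not reported

sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.name} (id=${info.id}): " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memory=${info.memSize} B, disk=${info.diskSize} B")
}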

It seems that the Spark UI also uses this code:

Description
With SPARK-13992, Spark supports persisting data into off-heap memory, but the usage of off-heap is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:

  1. Spark UI's executor page will display both on-heap and off-heap memory usage.
  2. REST request returns both on-heap and off-heap memory.
  3. Also these two memory usage can be obtained programmatically from SparkListener.
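
As a rough sketch of point 3, assuming Spark 2.3+ where SPARK-17019 added the on-heap/off-heap fields to SparkListenerBlockManagerAdded (the field names below come from that change and are version-dependent):

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded}

sc.addSparkListener(new SparkListener {
  // Fired when an executor's block manager registers; the two Options
  // carry the maximum on-heap and off-heap storage memory in bytes.
  override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit = {
    println(s"BlockManager ${event.blockManagerId}: " +
      s"onHeap=${event.maxOnHeapMem.getOrElse(-1L)} B, " +
      s"offHeap=${event.maxOffHeapMem.getOrElse(-1L)} B")
  }
})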