How to find Spark RDD/DataFrame size?
I know how to find a file's size in Scala, but how do I find the size of an RDD/DataFrame in Spark?
Scala:
object Main extends App {
  val file = "hdfs://localhost:9000/samplefile.txt"
  println(new java.io.File(file).length)  // size in bytes (only works for local paths)
}
Spark:
val distFile = sc.textFile(file)
println(distFile.length)
But when I process it, I don't get the file size. How do I find the size of the RDD?
If you just want to count the number of rows in the rdd, do:
val distFile = sc.textFile(file)
println(distFile.count)
If you are interested in the size in bytes, you can use the SizeEstimator:
import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
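Note that SizeEstimator estimates the footprint of a JVM object graph in the current process, so calling it on distFile on the driver only measures the RDD handle, not the distributed records. As a rough sketch (assuming an RDD[String] such as distFile above), you can instead apply the estimator to each record on the executors and sum the results:
import org.apache.spark.util.SizeEstimator
val estimatedBytes = distFile.map(line => SizeEstimator.estimate(line)).fold(0L)(_ + _) // per-record object sizes, summed; fold handles an empty RDD
println(estimatedBytes)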
Yes, I finally found the solution. Include these libraries:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd
How to find the RDD size:
def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .fold(0L)(_ + _) // sum the per-line byte counts; fold returns 0 for an empty RDD
}
Function to find the size of a DataFrame (it just converts the DataFrame to an RDD of strings internally):
import spark.implicits._ // needed for .toDF() (use sqlContext.implicits._ on Spark 1.x)
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path
val rddOfDataframe = dataFrame.rdd.map(_.toString())
val size = calcRDDSize(rddOfDataframe)
Below is one way apart from SizeEstimator, which I use frequently: to know from code whether an RDD is cached, and more precisely how many of its partitions are cached in memory and how many on disk, to get the storage level, and to check the current actual caching status and memory consumption.
The SparkContext has a developer API method, getRDDStorageInfo(), which you can use for this:
Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.
For example:
scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] =
Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None " (3) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1; TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
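If you want the same information programmatically rather than as REPL output, a minimal sketch over the returned org.apache.spark.storage.RDDInfo objects (getRDDStorageInfo is a DeveloperApi, so its shape may change between versions) could be:
sc.getRDDStorageInfo.foreach { info =>
  // one line per tracked RDD: cached vs. total partitions, storage level, memory/disk footprint
  println(s"RDD '${info.name}' (id=${info.id}): " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"level=${info.storageLevel}, mem=${info.memSize} B, disk=${info.diskSize} B")
}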
It seems the Spark UI uses this code as well; see the source issue SPARK-17019, whose description says:
Description
With SPARK-13992, Spark supports persisting data into off-heap memory, but the usage of off-heap is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:
- Spark UI's executor page will display both on-heap and off-heap memory usage.
- REST request returns both on-heap and off-heap memory.
- Also these two memory usage can be obtained programmatically from SparkListener.
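As one hedged illustration of getting at storage information from a listener (this shows the general SparkListener mechanism, not necessarily the exact on-heap/off-heap metrics added by SPARK-17019), you could register a listener that logs block updates as partitions are cached or dropped:
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

class CacheLoggingListener extends SparkListener {
  // invoked whenever a block (e.g. a cached RDD partition) is added, updated or removed
  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    val info = event.blockUpdatedInfo
    println(s"block=${info.blockId} level=${info.storageLevel} " +
      s"mem=${info.memSize} B disk=${info.diskSize} B")
  }
}

sc.addSparkListener(new CacheLoggingListener)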