Run a read-only test in Spark

Question

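I want to compare the read performance of different storage systems, such as HDFS and S3N, when they are accessed through Spark. I wrote a small Scala program for this: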

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // Lazily define the input; nothing is read from S3 yet
    val file = sc.textFile("s3n://test/wordtest")
    // Identity map: passes every line through unchanged
    val splits = file.map(word => word)
    // Action: triggers the read and writes the result back to S3
    splits.saveAsTextFile("s3n://test/myoutput")
    sc.stop()
  }
}

My question is: is it possible to run a read-only test with Spark? In the program above, doesn't saveAsTextFile() also cause some writes?

Answer 1

Yes. saveAsTextFile writes the contents of the RDD as text files under the given path.

Answer 2

I am not sure this is possible as written. A transformation only runs when an action is subsequently called on it.

From the official Spark documentation:

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
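The laziness described in the quote can be observed directly. The following is a minimal sketch, not part of the original answer: it assumes Spark 2.0 or later (for longAccumulator) and a local master, and the object name LazyDemo and the accumulator name mapCalls are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: local[*] is used only so the sketch runs on one machine
    val conf = new SparkConf().setAppName("LazyDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical accumulator that counts how often the map function runs
      val mapCalls = sc.longAccumulator("mapCalls")

      val numbers = sc.parallelize(1 to 1000)
      // Transformation: only recorded in the lineage, not executed
      val doubled = numbers.map { n => mapCalls.add(1); n * 2 }
      println(s"map calls before any action: ${mapCalls.value}")

      // Action: this is where the job actually runs
      val total = doubled.count()
      // Accumulator updates inside transformations can be repeated if a task is retried
      println(s"count = $total, map calls after count: ${mapCalls.value}")
    } finally {
      sc.stop()
    }
  }
}
```

The first println should report zero calls, because no action has been requested at that point; the map function only runs once count() is called.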

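With that in mind, saveAsTextFile is probably not the lightest of the many available actions. Lighter alternatives exist, such as count or first. An action like count still forces every partition of the input to be read and transformed, but returns only a single number to the driver, so it lets you measure read performance without writing anything. Note that first usually evaluates only as many partitions as it needs to return one element, so it is not a good proxy for reading the whole dataset.

You may want to look through the available actions and pick the one that best fits your requirements.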
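As an illustration of the count-based approach, here is a minimal sketch rather than a definitive benchmark. The input path is the one from the question; the object name ReadOnlyBenchmark and the optional args(0) override are assumptions added for the example, and the wall-clock timing includes job scheduling overhead as well as the read itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReadOnlyBenchmark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReadOnlyBenchmark")
    val sc = new SparkContext(conf)
    try {
      // Path from the question; pass another one (e.g. an hdfs:// URI) as the first argument
      val path = if (args.nonEmpty) args(0) else "s3n://test/wordtest"

      val start = System.nanoTime()
      // count() is an action: every partition is read, but only a Long returns to the driver
      val lines = sc.textFile(path).count()
      val elapsedMs = (System.nanoTime() - start) / 1e6

      println(f"Read $lines%d lines from $path%s in $elapsedMs%.1f ms")
    } finally {
      sc.stop()
    }
  }
}
```

Running the same job once per storage system, with the same data and cluster configuration, gives comparable numbers. Nothing is written to the storage system by this job, apart from whatever logging the cluster itself is configured to do.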

scala

apache-spark