What is the purpose of caching an RDD in Apache Spark?

Question

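I am new to Apache Spark and have a couple of basic questions that I could not resolve while reading the Spark material. Every source explains things in its own style. I am practicing with a PySpark Jupyter notebook on Ubuntu.

As I understand it, when I run the command below, the data in testfile.csv is partitioned and stored in memory on the respective nodes. (I know that evaluation is actually lazy and nothing is processed until an action is encountered, but the concept still holds.)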

rdd1 = sc.textFile("testfile.csv")

My question is: when I run the transformation and action below, where is the data of rdd2 stored?

1. Is it stored in memory?

rdd2 = rdd1.map( lambda x: x.split(",") )

rdd2.count()

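I know the data in rdd2 stays available until I close the Jupyter notebook. So what is the need for cache(), given that rdd2 is available for further transformations anyway? I have heard that the data in memory is cleared once the transformations are done. What is that about?

Is there any difference between keeping an RDD in memory and cache()?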

rdd2.cache()

Answer 1

Is it stored in memory?

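When you run a Spark transformation via an action (count, print, foreach), then, and only then, is your graph materialized; in your case, that is when the file is consumed. The purpose of RDD.cache is to make sure that the result of sc.textFile("testfile.csv") stays available in memory and does not need to be read again.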
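To make the point about recomputation concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's implementation: `LazyDataset`, `read_file`, and the `reads` counter are hypothetical names made up for illustration, and no Spark is needed to run it. It mimics how a transformation only records work, how each action re-runs the whole lineage, and how marking a dataset as cached keeps the first computed result.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores how to compute data, not the data."""

    def __init__(self, compute):
        self._compute = compute          # zero-argument function returning a list
        self._cache_requested = False    # set by cache(), nothing is stored yet
        self._cached = None              # filled by the first action after cache()

    def map(self, fn):
        # Transformation: builds a new lazy dataset, nothing is executed here.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Only marks the dataset; the data is kept when it is next computed.
        self._cache_requested = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._cache_requested:
            self._cached = data
        return data

    def count(self):
        # Action: forces the lineage to run (or reuses the cached result).
        return len(self._materialize())


reads = {"n": 0}  # counts how often the "file" is read


def read_file():
    reads["n"] += 1
    return ["a,b", "c,d", "e,f"]


rdd1 = LazyDataset(read_file)
rdd2 = rdd1.map(lambda line: line.split(","))
reads_after_transform = reads["n"]   # no action yet, so the file is untouched

rdd2.count()
rdd2.count()
reads_without_cache = reads["n"]     # every action re-read the file

rdd2.cache()
rdd2.count()                         # computes once more and keeps the result
rdd2.count()                         # served from the kept result
reads_with_cache = reads["n"] - reads_without_cache
```

In this model the variable `rdd2` exists for the whole session, but it only names a recipe. Whether the computed data is still around on the second action depends on the cache flag, which is the distinction the answer draws.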

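Don't confuse the variable with the actual operations being done behind the scenes. Caching lets you iterate over the RDD again with its data held in memory, provided there is enough memory to store it in its entirety and you have set an appropriate storage level (cache() uses the default, StorageLevel.MEMORY_ONLY). From the documentation (thanks @RockieYang):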

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
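The fault-tolerance statement in the quoted documentation can be pictured with another small pure-Python toy. Again, this is only a sketch under stated assumptions: `CachedPartitions` and its attributes are hypothetical names, and the real mechanism in Spark is considerably more involved. The idea shown is that a cached partition which goes missing is rebuilt from its source partition and the recorded transformation, while the other partitions are reused.

```python
class CachedPartitions:
    """Toy model: per-partition cache backed by a recorded lineage."""

    def __init__(self, source_partitions, fn):
        self.source = source_partitions  # lineage input, one list per partition
        self.fn = fn                     # lineage transformation applied per element
        self.store = {}                  # partition index -> cached result
        self.recomputed = []             # indices that had to be computed

    def get(self, i):
        if i not in self.store:
            # Cache miss: rebuild this partition from the lineage.
            self.recomputed.append(i)
            self.store[i] = [self.fn(x) for x in self.source[i]]
        return self.store[i]

    def collect(self):
        out = []
        for i in range(len(self.source)):
            out.extend(self.get(i))
        return out


parts = [["a,b", "c,d"], ["e,f"], ["g,h"]]
cached = CachedPartitions(parts, lambda line: line.split(","))

first = cached.collect()      # first action computes and stores all partitions
cached.recomputed.clear()

del cached.store[1]           # simulate losing one cached partition
second = cached.collect()     # only the lost partition is rebuilt
```

After the simulated loss, `second` equals `first` and only partition 1 appears in `cached.recomputed`.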

Is there any difference between keeping an RDD in memory and cache()?

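As explained above, cache() is how you keep it in memory, as long as you have provided the right storage level. Otherwise, it will not necessarily still be in memory when you want to reuse it.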

caching

apache-spark

rdd

pyspark