如果我在 Spark 中两次缓存相同的 RDD 会发生什么

Question

我正在构建一个通用函数，它接收一个 RDD 并对其进行一些计算。因为我运行对输入 RDD 进行了多次计算，所以我想缓存它。例如：

public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD t1 = r... //Some calculations
    JavaRDD t2 = r... //Other calculations
    return t1.union(t2);
}

我的问题是，因为 r 是给我的，它可能已经缓存也可能还没有缓存。如果它被缓存并且我再次对其调用缓存，将创建一个新的缓存层，这意味着在计算 t1 和 t2 时我将在缓存中有两个 r 实例？还是 spark 会意识到 r 已被缓存并会忽略它？

Answer 1

没有。如果你在缓存的 RDD 上调用 cache，没有任何反应，RDD 将被缓存（一次）。与许多其他转换一样，缓存是惰性的：

当你调用cache时，RDD的storageLevel被设置为MEMORY_ONLY
当您再次调用 cache 时，它被设置为相同的值（没有变化）
根据评估，当底层RDD具体化时，Spark将检查RDD的storageLevel，如果它需要缓存，它会缓存它。

所以你很安全。

Answer 2

在我的集群上测试一下，Zohar 是对的，没有任何反应，它只会缓存 RDD 一次。我认为，原因是每个 RDD 内部都有一个 id，spark 将使用 id 来标记 RDD 是否已被缓存。所以多次缓存一个RDD将无济于事。

下面是我的代码和截图：

已更新[根据需要添加代码]

### cache and count, then will show the storage info on WEB UI

raw_file = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)\
                 .setName("raw_file")\
                 .cache()
raw_file.count()

### try to cache and count again, then take a look at the WEB UI, nothing changes

raw_file.cache()
raw_file.count()

### try to change rdd's name and cache and count again, to see will it cache a new rdd as the new name again, still 
### nothing changes, so I think maybe it is using the RDD id as a mark, for more we need to take a detailed read on 
### the document even then source code

raw_file.setName("raw_file_2")
raw_file.cache().count()

如果我在 Spark 中两次缓存相同的 RDD 会发生什么

What happens if I cache the same RDD twice in Spark

java

caching

apache-spark

rdd