Spark RDD vs DataFrame - Data storage

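I am new to Spark and am learning about DataFrames, their operations, and their architecture. While reading comparisons of RDDs and DataFrames, I got confused about the data structures behind the two. Below are my observations; please help clarify or correct them.

1) An RDD is stored in the computers' RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is in a cluster (e.g. HDFS).

If the data source is just a single CSV file, the data will be distributed into multiple blocks in the RAM of the machine running Spark (e.g. a laptop). Am I right?

2) Is there any relationship between a block and a partition? Which one is the superset?

3) DataFrame: Is a DataFrame stored in the same way as an RDD? If I store my source data in a DataFrame alone, will an RDD be created in the background?

Thanks in advance :)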

RDD is stored in the computer RAM in a distributed manner (blocks) across the nodes in a cluster, if the source data is in a cluster (e.g. HDFS).

If caching or checkpointing is enabled, the data may also be stored in memory or on disk. In addition, a shuffle always involves writing to disk.
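As a rough illustration of the caching point, here is a minimal sketch that asks Spark to keep an RDD's partitions in memory and spill them to disk when they do not fit. It assumes a Spark dependency on the classpath and a local master (`local[2]`); the object name `CachingSketch` and the sample data are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run inside a single JVM with two worker threads, no cluster.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("caching-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Defining the RDD computes nothing yet; Spark only records how to build it.
    val squares = sc.parallelize(1 to 1000, numSlices = 4).map(x => x.toLong * x)
    println(s"before persist: ${squares.getStorageLevel.description}")

    // Request storage in memory, falling back to disk for partitions that do not fit.
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action computes the partitions and stores them; later actions reuse them.
    println(s"count = ${squares.count()}")
    println(s"after persist: ${squares.getStorageLevel.description}")

    squares.unpersist()
    spark.stop()
  }
}
```

Without the `persist` call, each action would recompute the partitions from the source instead of reading stored copies.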

If the data source is just a single CSV file, the data will be distributed to multiple blocks in the RAM of the running server (e.g. a laptop). Am I right?

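The CSV file will be split into multiple partitions, and each task reads only one chunk of the data (defined by a start and an end offset).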
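The start/end offsets can be pictured with a small sketch that does not need Spark at all. It is a deliberate simplification: the real splitting is done by Hadoop's input format, which also deals with block locations, compression, and lines that cross a range boundary. `FileRange` and `splitOffsets` are hypothetical names introduced only for this illustration.

```scala
// Hypothetical helper type for illustration; not part of Spark or Hadoop.
final case class FileRange(start: Long, end: Long) {
  def length: Long = end - start
}

object SplitSketch {
  /** Cut a file of `fileSize` bytes into consecutive [start, end) ranges
    * of at most `splitSize` bytes each. */
  def splitOffsets(fileSize: Long, splitSize: Long): Vector[FileRange] = {
    require(fileSize >= 0 && splitSize > 0, "fileSize must be >= 0 and splitSize > 0")
    (0L until fileSize by splitSize)
      .map(start => FileRange(start, math.min(start + splitSize, fileSize)))
      .toVector
  }

  def main(args: Array[String]): Unit = {
    // A 10-byte "file" read in 4-byte ranges: one task per range.
    splitOffsets(fileSize = 10L, splitSize = 4L).foreach(println)
  }
}
```

For the 10-byte example the ranges are [0, 4), [4, 8) and [8, 10), so the last task simply reads a shorter chunk.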

Is there any relationship between block and partition? Which one is the superset?

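The terminology is a little confusing. Take a look at this, which states that a split is a logical division of the input data, while a block is a physical division of the data. Spark uses its own terminology: a partition in Spark means roughly the same thing as a split in Hadoop.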

When a file is read from HDFS through a HadoopRDD, each split becomes one partition.
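To see the split-to-partition mapping on a small scale, the following sketch reads a throwaway local CSV with `textFile`, which is backed by a HadoopRDD for local files as well as HDFS. It assumes a local master; the object name `PartitionSketch` and the generated file contents are made up, and the number of partitions you get depends on the file size and the `minPartitions` hint, so no output is shown.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the "cluster" is this one JVM.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partition-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Write a small temporary CSV so the example does not rely on an existing file.
    val path = Files.createTempFile("people", ".csv")
    val lines = (1 to 1000).map(i => s"$i,name$i")
    Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))

    // minPartitions is only a hint for how many input splits to create.
    val rdd = sc.textFile(path.toUri.toString, minPartitions = 4)
    println(s"partitions: ${rdd.getNumPartitions}")

    // Each partition is processed by one task; count the lines each one reads.
    rdd.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx -> $n lines") }

    spark.stop()
    Files.deleteIfExists(path)
  }
}
```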

DataFrame: Is the DataFrame also stored in the same way as an RDD? Will an RDD be created in the background if I store my source data in a DataFrame alone?

Under the hood, a DataFrame is nothing more than an RDD[InternalRow].
Take a look at SparkPlan.
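One way to look at this from user code is sketched below. It assumes Spark 3.x in local mode and relies on `queryExecution`, which is a developer-facing, unstable API, so treat it as an exploration aid rather than something to build on. The object name `DataFrameInternalsSketch` and the sample rows are made up.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.SparkPlan

object DataFrameInternalsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("df-internals-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    // Public view: the same data exposed as an RDD of external Row objects.
    val rows: RDD[Row] = df.rdd

    // Internal view: the physical plan and the RDD[InternalRow] it executes as.
    val plan: SparkPlan = df.queryExecution.executedPlan
    val internal: RDD[InternalRow] = df.queryExecution.toRdd

    println(s"physical plan node: ${plan.getClass.getSimpleName}")
    println(s"${rows.count()} rows in ${internal.getNumPartitions} partition(s)")

    spark.stop()
  }
}
```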

scala

hdfs

apache-spark