How does the DataFrame API depend on RDDs in Spark?

Question

Some sources, such as this Keynote: Spark 2.0 talk by Matei Zaharia, mention that Spark DataFrames are built on top of RDDs. I have found some mentions of RDDs in the DataFrame class (in Spark 2.0 I have to look at Dataset instead), but my understanding of how the two APIs are tied together behind the scenes is still very limited.

Can someone explain how DataFrames extend RDDs?

Answer 1

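According to the Databricks article Deep Dive into Spark SQL's Catalyst Optimizer (see the section Using Catalyst in Spark SQL), RDDs are elements of the physical plan built by Catalyst. So we describe queries in terms of DataFrames, but in the end Spark operates on RDDs.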

You can also view the physical plan of a query by calling explain() on the DataFrame.

// Print the physical plan to the console for debugging purposes
auction.select("auctionid").distinct.explain()

// == Physical Plan ==
// Distinct false
// Exchange (HashPartitioning [auctionid#0], 200)
//  Distinct true
//   Project [auctionid#0]
//    PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
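
To make the link between the two APIs more concrete, here is a minimal sketch of how one might reach the RDD behind a DataFrame. It assumes Spark 2.x or later (hence SparkSession), and the tiny `auction` dataset, the object name `DataFrameToRdd`, and the helper `distinctIdsViaRdd` are hypothetical stand-ins, not part of the original answer. Note that `queryExecution` is a developer-facing accessor, so its details may differ between Spark versions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

object DataFrameToRdd {

  // Hypothetical helper: runs a DataFrame query, then reads the result
  // through the RDD that backs it.
  def distinctIdsViaRdd(spark: SparkSession): Array[String] = {
    import spark.implicits._

    // Made-up sample data standing in for the `auction` DataFrame above.
    val auction = Seq(("a1", 10.0), ("a1", 12.5), ("a2", 7.0))
      .toDF("auctionid", "bid")

    val distinctIds = auction.select("auctionid").distinct()

    // The physical plan Catalyst selected, as a tree of operators.
    println(distinctIds.queryExecution.executedPlan)

    // Executing that plan yields an RDD in Spark's internal row format.
    val internal: RDD[InternalRow] = distinctIds.queryExecution.toRdd
    println(s"internal RDD partitions: ${internal.getNumPartitions}")

    // The public accessor exposes the same result as an RDD of Row objects.
    val rows: RDD[Row] = distinctIds.rdd

    // The lineage lists the chain of RDDs behind the DataFrame.
    println(rows.toDebugString)

    rows.map(_.getString(0)).collect().sorted
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-to-rdd")
      .getOrCreate()
    try println(distinctIdsViaRdd(spark).mkString(", "))
    finally spark.stop()
  }
}
```

The point of the sketch is that a DataFrame does not subclass RDD. It holds a logical plan, and an RDD only appears once Catalyst has turned that plan into a physical one and it is executed.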

