What to use to store intermediate data in Spark?

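What is the difference between storing intermediate tables in DataFrames and storing them in a TempView? Is there any difference in memory usage?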

You can think of a TempView as a temporary Hive table that exists for as long as the underlying Spark session is not closed.

So if you have a DataFrame df and run df.createOrReplaceTempView("something"), you can retrieve df anywhere in your project (within the same Spark session) via val df = spark.table("something"), as long as createOrReplaceTempView has been called first.
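
A minimal sketch of that round trip, written as top-level statements the way you would paste them into spark-shell or a notebook. The local master, the app name, and the sample data are assumptions for illustration only; spark.newSession() is used just to show that the view is scoped to the session that registered it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for illustration only; a real job normally has one already.
val spark = SparkSession.builder()
  .appName("temp-view-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical intermediate result with ids 1, 2, 3.
val df: DataFrame = spark.range(1, 4).toDF("id")

// Register the DataFrame under a name in this session's catalog.
df.createOrReplaceTempView("something")

// Elsewhere in the project, with the same SparkSession in scope:
val restored: DataFrame = spark.table("something")

// A temp view belongs to the session that created it.
val otherSession = spark.newSession()
val visibleHere = spark.catalog.tableExists("something")
val visibleThere = otherSession.catalog.tableExists("something")
```

Here restored refers to the same logical plan as df, so nothing is copied. The name lookup only works from the session that registered the view.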

DataFrames are themselves intermediate 'tables'. That is, they can be cached to memory and/or disk. I am leaving aside how the code is written via Catalyst.
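
As a rough sketch of caching an intermediate DataFrame to memory and/or disk, assuming a local session and made-up data (the names intermediate and bucket are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Local session for illustration only.
val spark = SparkSession.builder()
  .appName("persist-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical intermediate result that several later steps reuse.
val intermediate = spark.range(0, 1000).withColumn("bucket", col("id") % 10)

// Keep it in memory and spill to disk when it does not fit.
// persist is lazy: nothing is stored until an action runs.
intermediate.persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes the cache; later ones can read from it.
val total = intermediate.count()
val buckets = intermediate.select("bucket").distinct().count()

// Record the level in effect, then release the cached blocks.
val level = intermediate.storageLevel
intermediate.unpersist()
```

Whether caching pays off depends on how often the intermediate result is reused and how expensive it is to recompute; for a result read only once it mostly adds overhead.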

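From the manual on temp views:

Running SQL Queries Programmatically

The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
To do this, you register the DataFrame as a SQL temporary view. The view is a "lazy" artefact, and the DataFrame/Dataset must already exist. Registration is only needed to enable the SQL interface.
- Caching the underlying DataFrame helps with repeated access.
- A temp view is an in-memory reference to the DataFrame, typically with no overhead.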
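
The points above can be sketched as follows. The view name events, the column grp, and the sample data are assumptions made up for this example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only.
val spark = SparkSession.builder()
  .appName("sql-view-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical data: ids 0 to 9 with a derived grp column (id modulo 2).
val events = spark.range(0, 10).withColumn("grp", col("id") % 2)

// Cache the underlying DataFrame if it is queried repeatedly;
// the view itself is only a named reference and stores nothing.
events.cache()
events.createOrReplaceTempView("events")

// spark.sql returns a DataFrame; nothing runs until an action such as collect().
val perGroup = spark.sql(
  "SELECT grp, COUNT(*) AS n FROM events GROUP BY grp ORDER BY grp")

val rows = perGroup.collect().map(r => (r.getLong(0), r.getLong(1)))
```

Registering the view before or after calling cache() makes no difference here, since both refer to the same underlying DataFrame.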

So in summary, within a Spark app the DataFrame is a temporary data store or intermediate table, although you could argue about the latter. If you need complex SQL against a DataFrame that the DataFrame API cannot handle, use a temp view. That is different from running Spark SQL against a real Hive table or a table read over JDBC, but the interface is the same.

By the way, this is a good reference: https://medium.com/@kar9475/data-sharing-between-multiple-spark-jobs-in-databricks-308687c99897

scala

view

dataframe

apache-spark