PySpark: how is a TempView table joined to a Hive table?

Question

I have a DataFrame registered as a tempView and a Hive table that I want to join it to:

    df1.createOrReplaceTempView("mydata")

    df2 = spark.sql("""
        SELECT md.column1, md.column2, mht.column1
        FROM mydata md
        INNER JOIN myHivetable mht ON mht.key1 = md.key1
        WHERE mht.transdate BETWEEN '2017-08-01' AND '2017-08-10'
    """)

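How does this join actually happen? If the Hive table holds a very large amount of data, will Spark try to read the Hive table into memory, or will it decide to write the tempView table to Hive?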

Adding the following after the first answer, to give more detail:

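Suppose we have:

100 rows in a Spark temporary view named TABLE_A.

1 billion rows in a Hive table, TABLE_B.

In the next step we need to join TABLE_A and TABLE_B.

There is a date range condition on TABLE_B.

Because TABLE_B is very large, will Spark read the whole of TABLE_B into memory, or will it decide to write TABLE_A to temporary space in Hadoop so that Hive can perform the join? Or how does it work out the best way to do the join for performance?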

Answer 1

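The Hive context stores information about registered temp tables/views in the Metastore. This allows SQL-like query operations to be performed on the data, and we still get the same performance as we would otherwise.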
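One way to check how Spark plans this particular join is to print the query plan rather than guess. The sketch below is only an illustration built on the query from the question: the helper names `build_join_query` and `explain_join` are made up here, and it assumes a Hive-enabled `SparkSession` in which `df1` has already been registered as `mydata`. It uses Spark SQL's `BROADCAST` hint to ask Spark to ship the small view to the executors, and `DataFrame.explain(True)` to print the logical and physical plans.

```python
def build_join_query(start_date, end_date, broadcast_view=True):
    """Return the join from the question as a SQL string.

    Hypothetical helper: when broadcast_view is True, a BROADCAST hint
    is added for the alias of the small temp view (md).
    """
    hint = "/*+ BROADCAST(md) */ " if broadcast_view else ""
    return (
        f"SELECT {hint}md.column1, md.column2, mht.column1 "
        "FROM mydata md "
        "INNER JOIN myHivetable mht ON mht.key1 = md.key1 "
        f"WHERE mht.transdate BETWEEN '{start_date}' AND '{end_date}'"
    )


def explain_join(spark, start_date, end_date, broadcast_view=True):
    """Run the query through spark.sql and print its plans.

    `spark` is assumed to be a SparkSession created with Hive support.
    Nothing is executed on the cluster until an action is called on the
    returned DataFrame; explain() only shows what Spark intends to do.
    """
    df = spark.sql(build_join_query(start_date, end_date, broadcast_view))
    df.explain(True)  # extended output: logical plans plus the physical plan
    return df


# Usage, assuming the session and the temp view already exist:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   df1.createOrReplaceTempView("mydata")
#   df2 = explain_join(spark, "2017-08-01", "2017-08-10")
```

In the printed physical plan, a `BroadcastHashJoin` node indicates that the small side is copied to every executor, while a `SortMergeJoin` node indicates that both sides are shuffled by the join key. In both cases the join is executed by Spark itself, which scans the Hive table in distributed partitions; the temp view is not written into Hive.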

You can read more about this here and here.
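For the 100-row TABLE_A versus 1-billion-row TABLE_B case in the question, the idea behind a broadcast hash join can be pictured in plain Python. This is a conceptual sketch only, not Spark's implementation: the function name `broadcast_hash_join` and the sample rows are invented for illustration. Spark chooses its join strategy from size estimates, and a side it estimates to be under `spark.sql.autoBroadcastJoinThreshold` (10 MB by default) is a candidate for broadcasting. The point of the sketch is that only the small side has to be held in memory as a lookup table, while the large side is filtered and streamed past it.

```python
def broadcast_hash_join(small_rows, large_rows, key, keep_large):
    """Join a small in-memory table to a large, streamed one on `key`.

    small_rows: list of dicts (stands in for the 100-row temp view)
    large_rows: any iterable of dicts (stands in for the big Hive table)
    keep_large: predicate applied to each large row before the lookup
    """
    # Build phase: index the small side by join key. In Spark this is
    # the part that would be copied ("broadcast") to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the large side one row at a time, so it is
    # never held in memory as a whole. The filter runs before the lookup.
    for big in large_rows:
        if not keep_large(big):
            continue
        for small in lookup.get(big[key], []):
            yield small, big


# Tiny made-up stand-ins for TABLE_A and TABLE_B.
table_a = [{"key1": 1, "column1": "a"}, {"key1": 2, "column1": "b"}]


def table_b():
    # A generator, to mimic rows arriving from a scan of the large table.
    yield {"key1": 1, "transdate": "2017-08-05", "column1": "x"}
    yield {"key1": 1, "transdate": "2017-09-01", "column1": "y"}
    yield {"key1": 3, "transdate": "2017-08-02", "column1": "z"}


def in_range(row):
    # ISO-formatted date strings compare correctly as plain strings.
    return "2017-08-01" <= row["transdate"] <= "2017-08-10"


result = list(broadcast_hash_join(table_a, table_b(), "key1", in_range))
```

Here `result` holds a single pair: the TABLE_A row with `key1` 1 matched to the TABLE_B row dated 2017-08-05. The other two TABLE_B rows are dropped, one by the date filter and one because its key has no match.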


hadoop

pyspark

pyspark-sql