Read records from joining Hive tables with Spark

Question

We can easily read records from a Hive table in Spark with the following command:

Row[] results = sqlContext.sql("FROM my_table SELECT col1, col2").collect();

But when I join two tables, for example:

select t1.col1, t1.col2 from table1 t1 join table2 t2 on t1.id = t2.id

How do I retrieve the records from the join query above?

Answer 1

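The SQLContext.sql method always returns a DataFrame, so there is no practical difference between a JOIN and any other type of query. You pass the join statement to sqlContext.sql and read the result exactly as you do for the single-table query.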

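That said, you shouldn't use the collect method unless you really need to fetch the data to the driver. It is expensive, and it will crash if the data does not fit in driver memory.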
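As a rough illustration of both points, here is a minimal sketch that runs the join from the question and then avoids a full collect. It assumes Spark 2.x or later, where the Java API exposes a DataFrame as Dataset<Row>, and a Hive-enabled SQLContext in which table1 and table2 resolve. The class name, method names, and the target table name passed in are hypothetical.

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class JoinQueryExample {

    // The join query from the question, unchanged.
    static final String JOIN_QUERY =
        "select t1.col1, t1.col2 from table1 t1 join table2 t2 on t1.id = t2.id";

    // Assumption: sqlContext is Hive-enabled, so table1 and table2 resolve.
    // Returns at most maxRows rows instead of collecting the whole result.
    static List<Row> previewJoin(SQLContext sqlContext, int maxRows) {
        // A join goes through sql() like any other query.
        Dataset<Row> joined = sqlContext.sql(JOIN_QUERY);
        // takeAsList fetches only the first maxRows rows to the driver.
        return joined.takeAsList(maxRows);
    }

    // Keeps the full result on the cluster by writing it to a table.
    // targetTable is a hypothetical name supplied by the caller.
    static void saveJoin(SQLContext sqlContext, String targetTable) {
        sqlContext.sql(JOIN_QUERY).write().saveAsTable(targetTable);
    }
}
```

Calling previewJoin(sqlContext, 20) is enough to inspect a sample of the joined rows, while saveJoin leaves the full result distributed. If every row really is needed on the driver, collect() works on the join result the same way as in the question.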
