Why does using a UDF in a SQL query lead to a Cartesian product?

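Why does using a UDF lead to a Cartesian product instead of a full outer join? A Cartesian product obviously produces many more rows than a full outer join (Joins is an example), which is a potential performance hit.
Is there any way to force an outer join over the Cartesian product in the example given in Databricks-Question?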

I have a Spark Streaming application that uses SQLContext to execute SQL statements on streaming data. When I register a custom UDF in Scala, the performance of the streaming application degrades significantly. Details below:

Statement 1:

Select col1, col2 from table1 as t1 join table2 as t2 on t1.foo = t2.bar

Statement 2:

Select col1, col2 from table1 as t1 join table2 as t2 on equals(t1.foo,t2.bar)

I register a custom UDF using SQLContext as follows:

sqlc.udf.register("equals", (s1: String, s2:String) => s1 == s2)

On the same input and Spark configuration, Statement 2 performs significantly worse (close to 100x) than Statement 1.

Why does using UDFs lead to a Cartesian product instead of a full outer join?

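The reason a UDF requires a Cartesian product is quite simple. Because you pass an arbitrary function, possibly with an infinite domain and non-deterministic behavior, the only way to determine its value is to pass the arguments and evaluate it. That means every possible pair of rows has to be checked.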

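Simple equality, on the other hand, has predictable behavior. With the condition t1.foo = t2.bar, you can simply shuffle the rows of t1 and t2 by foo and bar respectively to get the expected result.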
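The difference can be illustrated without Spark at all. The following is only a minimal sketch on plain Scala collections, not Spark's actual implementation; the names JoinSketch, nestedLoopJoin and hashJoin are made up for this illustration. With an opaque predicate the only option is to call it on every pair, while a known equality lets one side be grouped by key and probed.

```scala
object JoinSketch {
  // Illustrative row type: (join key, payload). Not a Spark type.
  type Row = (String, Int)

  // Opaque predicate: nothing is known about it, so it is evaluated
  // for every (left, right) pair. Returns the matches and the call count.
  def nestedLoopJoin(left: Seq[Row], right: Seq[Row])(
      pred: (String, String) => Boolean): (Seq[(Row, Row)], Int) = {
    var calls = 0
    val out = Seq.newBuilder[(Row, Row)]
    for (l <- left; r <- right) {
      calls += 1
      if (pred(l._1, r._1)) out += ((l, r))
    }
    (out.result(), calls)
  }

  // Known equality on the key: group the right side by key once,
  // then look up each left key instead of scanning all right rows.
  def hashJoin(left: Seq[Row], right: Seq[Row]): Seq[(Row, Row)] = {
    val byKey: Map[String, Seq[Row]] = right.groupBy(_._1)
    for {
      l <- left
      r <- byKey.getOrElse(l._1, Seq.empty)
    } yield (l, r)
  }
}
```

For 3 left rows and 4 right rows, nestedLoopJoin calls the predicate 12 times regardless of how many rows match, whereas hashJoin touches each right row once while grouping and then does one lookup per left row. Spark's shuffle by foo and bar plays roughly the role of the grouping step here.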

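And to be precise, in relational algebra an outer join is actually expressed in terms of a natural join. Anything beyond that is just an optimization.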

Is there any way to force an outer join over the Cartesian product?

Not really, unless you want to modify the Spark SQL engine.
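That said, when the two-argument predicate is really an equality on some derived value, the logic can sometimes be restated so that the equality is visible. The sketch below is a plain Scala illustration under that assumption; KeyedJoinSketch, normalize, looselyEqual and joinOnNormalizedKey are hypothetical names, not anything from Spark or from the original question. The SQL analogue would be a one-argument UDF used on both sides, as in ON normalize(t1.foo) = normalize(t2.bar), which Spark SQL can, as far as I know, treat as an ordinary equi-join as long as the function is deterministic.

```scala
import java.util.Locale

object KeyedJoinSketch {
  // Hypothetical one-argument key function (assumption for this sketch).
  def normalize(s: String): String = s.trim.toLowerCase(Locale.ROOT)

  // Two-argument form: to a planner this is a black box,
  // so it would have to be evaluated on every pair of rows.
  def looselyEqual(a: String, b: String): Boolean =
    normalize(a) == normalize(b)

  // Same logic exposed as key equality: group the right side by
  // normalize(key) and probe once per left row.
  def joinOnNormalizedKey[A, B](
      left: Seq[(String, A)],
      right: Seq[(String, B)]): Seq[((String, A), (String, B))] = {
    val byKey = right.groupBy { case (k, _) => normalize(k) }
    for {
      l <- left
      r <- byKey.getOrElse(normalize(l._1), Seq.empty)
    } yield (l, r)
  }
}
```

This only works when the predicate factors into "same key on both sides". A genuinely arbitrary predicate, such as a distance threshold, has no such key, and the pairwise check described above remains the only general option.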

sql

apache-spark

apache-spark-sql