PySpark Join based on case statement
I want to join two DataFrames based on a SQL CASE statement, as shown below. What is the best way to handle this situation?
FROM df1
LEFT JOIN df2 d
  ON d."Date1" <= CASE WHEN df1."DATE2" >= df1."DATE3" THEN df1."col1" ELSE df1."col2" END
Personally, I would put this into a UDF that returns a boolean. That way the business logic ends up in Python code and the SQL stays clean:
>>> from pyspark.sql.types import BooleanType
>>> def join_based_on_dates(left_date, date0, date1, col0, col1):
...     # Mirror the CASE expression: pick col0 or col1 as the right-hand date
...     if date0 >= date1:
...         right_date = col0
...     else:
...         right_date = col1
...     return left_date <= right_date
...
>>> sqlContext.registerFunction("join_based_on_dates", join_based_on_dates, BooleanType())
>>> join_based_on_dates("2016-01-01", "2017-01-01", "2018-01-01", "res1", "res2")
True
>>> sqlContext.sql("SELECT join_based_on_dates('2016-01-01', '2017-01-01', '2018-01-01', 'res1', 'res2')").collect()
[Row(_c0=True)]
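Note that for the SQL form of the join to work, both DataFrames have to be visible to the SQL engine. A minimal sketch, assuming df1 and df2 already exist, using the same Spark 1.x-era sqlContext API as above:

>>> df1.registerTempTable("df1")  # makes df1 addressable by name in SQL
>>> df2.registerTempTable("df2")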
Your query would then end up looking like this:
FROM df1
LEFT JOIN df2 d ON join_based_on_dates(d.Date1, df1.DATE2, df1.DATE3, df1.col1, df1.col2)
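As a side note, the same CASE logic can also be expressed with the built-in when()/otherwise() column functions instead of a UDF, which avoids Python serialization overhead and lets Catalyst optimize the join. A minimal sketch, assuming df1 and df2 carry the columns from the question:

from pyspark.sql import functions as F

# Pick col1 or col2 as the right-hand date, exactly like the CASE expression
join_cond = df2["Date1"] <= F.when(
    df1["DATE2"] >= df1["DATE3"], df1["col1"]
).otherwise(df1["col2"])

result = df1.join(df2, join_cond, "left")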
Hope this helps, and have fun!