Pyspark join with mixed conditions

I have two dataframes, left_df and right_df, with common columns to join on: ['col_1', 'col_2']. I also want to join on another condition: right_df.col_3.between(left_df.col_4, left_df.col_5)

Code:

from pyspark.sql import functions as F

join_condition = ['col_1', 
                  'col_2', 
                  right_df.col_3.between(left_df.col_4, left_df.col_5)]
df = left_df.join(right_df, on=join_condition, how='left')

df.write.parquet('/tmp/my_df')

But I get the following error:

TypeError: Column is not iterable

Why can't I combine these 3 conditions in one join?

You can't mix strings with Columns. The on argument must be either a list of column-name strings or a list of Column expressions, not a mixture of both. Convert the first two items into explicit equality expressions, e.g.

join_condition = [left_df.col_1 == right_df.col_1, 
                  left_df.col_2 == right_df.col_2, 
                  right_df.col_3.between(left_df.col_4, left_df.col_5)]

df = left_df.join(right_df, on=join_condition, how='left')