How to set column values from a different table in PySpark?

Question

In PySpark, how do I set the value of column listed_1 in Table A to the value of list_date from Table B, under the condition (B.list_expire_value) > 5 || (B.list_date) < 6? The B. prefix indicates that these are columns of Table B.

Currently I am doing:

  spark_df = table_1.join("table_2", on ="uuid").when((table_2['list_expire_value'] > 5) | (table_2['list_date'] < 6)).withColumn("listed_1", table_2['list_date'])

But I get an error. How can I do this?

Sample tables:

Table A
uuid   listed_1
001    abc
002    def
003    ghi

Table B
uuid    list_date    list_expire_value     col4
001     12           7                     dckvfd
002     14           3                     dfdfgi
003     3            8                     sdfgds

Expected Output
uuid    listed1      list_expire_value     col4
001     12           7                     dckvfd
002     def          3                     dfdfgi
003     3            8                     sdfgds

The listed1 value for uuid 002 is not replaced, because that row does not fulfil the when condition.

Answer 1

Hope this helps!

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

A = sc.parallelize([('001','abc'),('002','def'),('003','ghi')]).toDF(['uuid','listed_1'])
B = sc.parallelize([('001',12,7,'dckvfd'),('002',14,3,'dfdfgi'),('003',3,8,'sdfgds')]).\
    toDF(['uuid','list_date','list_expire_value','col4'])

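# x = list_expire_value, y = list_date, z = listed_1.
# Return list_date (as a string, to match StringType) when the condition holds;
# otherwise keep the original listed_1 value.
def cond_fn(x, y, z):
    if x > 5 or y < 6:
        return str(y)
    else:
        return z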

final_df = A.join(B, on="uuid")
udf_val = udf(cond_fn, StringType())
final_df = final_df.withColumn("listed1",udf_val(final_df.list_expire_value,final_df.list_date, final_df.listed_1))
final_df.select(["uuid","listed1","list_expire_value","col4"]).show()
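
Since the UDF body is ordinary Python, the rule can be sanity-checked against the sample rows without starting Spark. This is only a sketch: the argument names and the `rows` list are illustrative, and the `str()` conversion is an assumption made here so that every returned value is a string, matching the `StringType` return type declared for the UDF.

```python
# Plain-Python check of the rule used inside the UDF (no Spark needed).
def cond_fn(list_expire_value, list_date, listed_1):
    """Return list_date as a string when the condition holds, else listed_1."""
    if list_expire_value > 5 or list_date < 6:
        return str(list_date)
    return listed_1

# (uuid, listed_1, list_date, list_expire_value), copied from the sample tables
rows = [
    ("001", "abc", 12, 7),
    ("002", "def", 14, 3),
    ("003", "ghi", 3, 8),
]

# Apply the rule row by row, as the UDF would after the join on uuid.
result = [(uuid, cond_fn(expire, date, listed)) for uuid, listed, date, expire in rows]
print(result)
```

Only uuid 002 keeps its original value, which agrees with the expected output in the question.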

Don't forget to let us know if it solved your problem :)

Answer 2

The correct form of the PySpark SQL query is:

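from pyspark.sql import functions as F
# Where the condition holds, overwrite listed_1 with list_date (cast to string so both branches share a type); otherwise keep listed_1.
spark_df = table_1.join(table_2, 'uuid', 'inner').withColumn('listed_1', F.when((table_2.list_expire_value > 5) | (table_2.list_date < 6), table_2.list_date.cast('string')).otherwise(table_1.listed_1)).drop(table_2.list_date)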

