How to join a Spark DataFrame twice with different Id types

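I have a spark.DataFrame called events, and I want to join it to another spark.DataFrame called users. A user can be identified in the events DataFrame by either of two different types of Id. The schemas of the DataFrames are as follows:

Events: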

Id IdType Name Date EventType
324 UserId Daniel 2022-01-15 purchase
350 UserId Jack 2022-01-16 purchase
3247623322 UserCel Michelle 2022-01-10 claim

Users:

Id Name Cel
324 Daniel 5511737379
350 Jack 3247623817
380 Michelle 3247623322

What I want to do is left join the events DataFrame twice, so that I extract all the events regardless of the IdType used in the events DataFrame.

The final DataFrame I want should look like this:

Id Name Cel Date EventType
324 Daniel 5511737379 2022-01-15 Purchase
350 Jack 3247623817 2022-01-16 Purchase
380 Michelle 3247623322 2022-01-10 Claim

I guess the Python (PySpark) code for this join would be something close to:

(users.join(events, on = [users.Id == events.Id], how = 'left')
      .join(events, on = [users.Cel == events.Id], how = 'left'))

You can do this with the following code:

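# Events recorded against a user's Id: match them directly on users.Id
with_id = (users.join(events, on=users["Id"]==events["Id"], how='inner')
                .select(events["Id"], events["Name"],"Cel","Date","EventType"))

# Users with no event under their Id (left anti join), matched on Cel instead
incorrect_id = (users.join(events, on=users["Id"]==events["Id"], how='leftanti')
                        .join(events, on=users["Cel"]==events["Id"])
                        .select(users["Id"], events["Name"],"Cel","Date","EventType"))

# Stack the two sets of rows; union matches columns by position
result = with_id.union(incorrect_id)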

Result:

result.show()
+---+--------+----------+----------+---------+
| Id|    Name|       Cel|      Date|EventType|
+---+--------+----------+----------+---------+
|324|  Daniel|5511737379|2022-01-15| purchase|
|350|    Jack|3247623817|2022-01-16| purchase|
|380|Michelle|3247623322|2022-01-10|    claim|
+---+--------+----------+----------+---------+