How to join a spark dataframe twice with different id type
I have a spark.DataFrame called events that I want to join with another spark.DataFrame called users. A user can therefore be identified on the events dataframe by two different types of Id.
The schemas of the dataframes look like this:
Events:
Id | IdType | Name | Date | EventType |
---|---|---|---|---|
324 | UserId | Daniel | 2022-01-15 | purchase |
350 | UserId | Jack | 2022-01-16 | purchase |
3247623322 | UserCel | Michelle | 2022-01-10 | claim |
Users:
Id | Name | Cel |
---|---|---|
324 | Daniel | 5511737379 |
350 | Jack | 3247623817 |
380 | Michelle | 3247623322 |
What I want to do is left join the events dataframe twice, so that all events are extracted regardless of the IdType used on the events dataframe.
The final dataframe I want must look like this:
Id | Name | Cel | Date | EventType |
---|---|---|---|---|
324 | Daniel | 5511737379 | 2022-01-15 | Purchase |
350 | Jack | 3247623817 | 2022-01-16 | Purchase |
380 | Michelle | 3247623322 | 2022-01-10 | Claim |
I guess the python (PySpark) code for this join could be something close to:
(users.join(events, on = [users.Id == events.Id], how = 'left')
.join(events, on = [users.Cel == events.Id], how = 'left'))
You can do that with the following code:
# events keyed by UserId: users whose Id matches the event Id directly
with_id = (users.join(events, on=users["Id"] == events["Id"], how='inner')
           .select(events["Id"], events["Name"], "Cel", "Date", "EventType"))
# events keyed by UserCel: users with no event matching their Id are joined on their Cel instead
incorrect_id = (users.join(events, on=users["Id"] == events["Id"], how='leftanti')
                .join(events, on=users["Cel"] == events["Id"])
                .select(users["Id"], events["Name"], "Cel", "Date", "EventType"))
# stack both parts to get one row per event with the user's Id, Name and Cel
result = with_id.unionAll(incorrect_id)
Result:
result.show()
+---+--------+----------+----------+---------+
| Id| Name| Cel| Date|EventType|
+---+--------+----------+----------+---------+
|324| Daniel|5511737379|2022-01-15| purchase|
|350| Jack|3247623817|2022-01-16| purchase|
|380|Michelle|3247623322|2022-01-10| claim|
+---+--------+----------+----------+---------+
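If you want to reproduce this locally, the sample dataframes can be built as in the sketch below. The SparkSession setup, variable names, and string column types are assumptions made for illustration; they are not part of the original question or answer.
from pyspark.sql import SparkSession

# build a local SparkSession (assumption: not running inside an existing Spark shell)
spark = SparkSession.builder.getOrCreate()

# sample "events" dataframe taken from the tables in the question (all columns as strings)
events = spark.createDataFrame(
    [
        ("324", "UserId", "Daniel", "2022-01-15", "purchase"),
        ("350", "UserId", "Jack", "2022-01-16", "purchase"),
        ("3247623322", "UserCel", "Michelle", "2022-01-10", "claim"),
    ],
    ["Id", "IdType", "Name", "Date", "EventType"],
)

# sample "users" dataframe taken from the tables in the question
users = spark.createDataFrame(
    [
        ("324", "Daniel", "5511737379"),
        ("350", "Jack", "3247623817"),
        ("380", "Michelle", "3247623322"),
    ],
    ["Id", "Name", "Cel"],
)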