Are left join and right_outer join the same if the tables are positioned differently in PySpark?

Question

I have two DataFrames in PySpark:

df1 = spark.createDataFrame([
    ("s1", "artist1"),
    ("s2", "artist2"),
    ("s3", "artist3"),
    ],
    ['song_id', 'artist'])


df1.show()

df2 = spark.createDataFrame([
    ("s1", "2"),
    ("s1", "3"),
    ("s4", "4"),
    ("s4", "5")
    ],
    ['song_id', 'duration'])

df2.show()

Output:

+-------+-------+
|song_id| artist|
+-------+-------+
|     s1|artist1|
|     s2|artist2|
|     s3|artist3|
+-------+-------+



+-------+--------+
|song_id|duration|
+-------+--------+
|     s1|       2|
|     s1|       3|
|     s4|       4|
|     s4|       5|
+-------+--------+

I apply a right_outer join and a left join to these two DataFrames, and both seem to give me the same result:

df1.join(df2, on="song_id", how="right_outer").show()
df2.join(df1, on="song_id", how="left").show()

Output:

+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s4|   null|       4|
|     s4|   null|       5|
+-------+-------+--------+

+-------+--------+-------+
|song_id|duration| artist|
+-------+--------+-------+
|     s1|       2|artist1|
|     s1|       3|artist1|
|     s4|       4|   null|
|     s4|       5|   null|
+-------+--------+-------+

I am not sure how to use these two joins effectively. What is the difference between them?

Answer 1

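The result of a left or right join depends on the order of the tables around the join call: the DataFrame that join is called on is the left table, and the DataFrame passed as the argument is the right table.

left, leftouter, and left_outer are all the same join type. It returns every row of the left table together with the matching records from the right table, with null where there is no match.

right, rightouter, and right_outer are all the same join type. It returns every row of the right table together with the matching records from the left table, with null where there is no match.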
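To make the "preserved side" idea concrete, here is a minimal pure-Python sketch of the two join types on lists of dict rows. The helper name `outer_join` is hypothetical and only models the semantics described above; it is not how Spark implements joins, and unlike SQL it would treat two `None` keys as equal.

```python
def outer_join(left, right, key, how):
    """Join two lists of dict rows on `key`; `how` is "left" or "right".

    Hypothetical helper for illustration only, not a Spark API.
    """
    if how not in ("left", "right"):
        raise ValueError("how must be 'left' or 'right'")
    # The preserved side keeps all of its rows; the other side only
    # contributes rows whose key matches.
    preserved, other = (left, right) if how == "left" else (right, left)
    other_cols = sorted({c for row in other for c in row} - {key})

    result = []
    for row in preserved:
        matches = [o for o in other if o[key] == row[key]]
        if matches:
            # One output row per matching pair.
            for m in matches:
                result.append({**m, **row})
        else:
            # No match: fill the other side's columns with None (null).
            result.append({**{c: None for c in other_cols}, **row})
    return result


df1_rows = [
    {"song_id": "s1", "artist": "artist1"},
    {"song_id": "s2", "artist": "artist2"},
    {"song_id": "s3", "artist": "artist3"},
]
df2_rows = [
    {"song_id": "s1", "duration": "2"},
    {"song_id": "s1", "duration": "3"},
    {"song_id": "s4", "duration": "4"},
    {"song_id": "s4", "duration": "5"},
]

# df1 on the left with a right join preserves df2 ...
right = outer_join(df1_rows, df2_rows, "song_id", "right")
# ... and so does df2 on the left with a left join.
left_swapped = outer_join(df2_rows, df1_rows, "song_id", "left")
# Keeping df1 on the left with a left join preserves df1 instead.
left = outer_join(df1_rows, df2_rows, "song_id", "left")

print(right == left_swapped)
print([r["song_id"] for r in left])
```

In this sketch, swapping both the table positions and the join type selects the same preserved side, which is why the two calls agree.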

In the code

df1.join(df2, on="song_id", how="right_outer").show()

df1 is the left table (DataFrame), df2 is the right table, and the join type is right_outer, so it shows all rows of df2 and the matching rows of df1.

Similarly, in

df2.join(df1, on="song_id", how="left").show()

df2 is the left table, df1 is the right table, and the join type is left, so it shows all records of df2 and the matching records of df1.

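Both statements therefore return the same rows; only the column order differs. If you keep the tables in the same positions and change only the join type, the results are different: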

df1.join(df2, on="song_id", how="right_outer").show()
df1.join(df2, on="song_id", how="left").show()

In the code above, df1 is the left table in both queries, so the two join types preserve different tables. The results are as follows:

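+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s4|   null|       4|
|     s4|   null|       5|
+-------+-------+--------+

+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s2|artist2|    null|
|     s3|artist3|    null|
+-------+-------+--------+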

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join

You can refer to the documentation linked above for the full list of join types.

