What would be a PySpark equivalent of the SQL statement NOT IN?

If I have Table A and Table B, and I want to select the IDs from Table A that are not in Table B, I can run the following SQL command:

Select ID
from Table A where ID not in (Select ID from Table B)

What would be the equivalent code in PySpark?

You can use the "left_anti" join option to perform a "left anti-join":
A_df.show()
# +-----+---+
# | type| id|
# +-----+---+
# |type1| 10|
# |type2| 20|
# +-----+---+


B_df.show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  1|name1|  10|
# |  2|name2|  30|
# |  3|name3|  20|
# +---+-----+----+


B_df.join(A_df, B_df.type == A_df.id, "left_anti").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+

This is equivalent to select * from B_df where type not in (select id from A_df). (Strictly speaking, the two agree here because A_df.id contains no NULLs; SQL's NOT IN returns no rows at all if the subquery produces a NULL, whereas an anti-join simply ignores unmatched NULLs.)
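As an alternative, when the distinct ids on the right-hand side are few enough to collect to the driver, NOT IN can also be emulated with ~isin. This is only a sketch under that small-table assumption; the anti-join above is the scalable approach:

from pyspark.sql import functions as F

# collect the distinct ids from A_df to the driver (assumes they fit in memory)
ids = [row.id for row in A_df.select("id").distinct().collect()]

# keep only the rows of B_df whose type is not in that list
B_df.filter(~F.col("type").isin(ids)).show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+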

In a SQL context (see spark sql anti-join):

# register the dataframes as temporary views so they can be queried with SQL
A_df.createOrReplaceTempView("A_table")
B_df.createOrReplaceTempView("B_table")

spark.sql("SELECT * FROM B_table LEFT ANTI JOIN A_table ON B_table.type = A_table.id").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+
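Spark SQL also supports NOT IN subqueries directly, so with the views registered the query from the question can be written essentially verbatim:

spark.sql("SELECT * FROM B_table WHERE type NOT IN (SELECT id FROM A_table)").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+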

Here is how I created the dataframes:

A = [("type1", 10),
     ("type2", 20)]
AColumns = ["type", "id"]
A_df = spark.createDataFrame(data=A, schema=AColumns)
A_df.printSchema()
A_df.show(truncate=False)

B = [(1, "name1", 10),
     (2, "name2", 30),
     (3, "name3", 20)]
BColumns = ["id", "name", "type"]
B_df = spark.createDataFrame(data=B, schema=BColumns)
B_df.printSchema()
B_df.show(truncate=False)
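For completeness, a minimal self-contained sketch (assuming a local PySpark installation; the app name is arbitrary) that builds the SparkSession the snippets above rely on and runs the anti-join end to end:

from pyspark.sql import SparkSession

# build the local session the snippets above assume as `spark`
spark = SparkSession.builder.master("local[*]").appName("not-in-demo").getOrCreate()

A_df = spark.createDataFrame([("type1", 10), ("type2", 20)], ["type", "id"])
B_df = spark.createDataFrame([(1, "name1", 10), (2, "name2", 30), (3, "name3", 20)],
                             ["id", "name", "type"])

# rows of B_df whose type has no matching id in A_df
B_df.join(A_df, B_df.type == A_df.id, "left_anti").show()

spark.stop()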