What would be a PySpark equivalent of the SQL statement NOT IN
If I have Table A and Table B, and I want to select certain IDs from Table A that are not in Table B, I can run the following SQL command:
Select ID
From Table A
Where ID Not In (Select ID From Table B)
What would the equivalent code be in PySpark?
You can perform a "left anti-join" using the join type "left_anti" (the shorthand "anti" used below is an alias for it):
A_df.show()
# +-----+---+
# | type| id|
# +-----+---+
# |type1| 10|
# |type2| 20|
# +-----+---+
B_df.show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# | 1|name1| 10|
# | 2|name2| 30|
# | 3|name3| 20|
# +---+-----+----+
B_df.join(A_df, B_df.type == A_df.id, "anti").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# | 2|name2| 30|
# +---+-----+----+
This is equivalent to select * from B_df where type not in (select id from A_df), with one caveat: if the subquery column contains NULLs, SQL NOT IN returns no rows at all, whereas an anti-join still returns the non-matching rows, so the two only agree when the join columns are NULL-free.
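If you want to stay on the DataFrame API but express the NOT IN literally, a minimal sketch is to collect the id column of A_df to the driver and filter B_df with isin; this assumes A_df is small enough to collect:
from pyspark.sql import functions as F

# collect the ids of A_df to the driver (assumes A_df is a small lookup table)
a_ids = [row["id"] for row in A_df.select("id").distinct().collect()]

# keep the rows of B_df whose type is not among those ids
B_df.filter(~F.col("type").isin(a_ids)).show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+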
In a SQL context (see Spark SQL anti-join):
# register the dataframes as temporary views so they can be queried with SQL
A_df.createOrReplaceTempView("A_table")
B_df.createOrReplaceTempView("B_table")
spark.sql("SELECT * FROM B_table LEFT ANTI JOIN A_table ON B_table.type = A_table.id").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# | 2|name2| 30|
# +---+-----+----+
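Spark SQL (2.0+) also accepts a NOT IN subquery directly, so once the temp views exist the original query carries over almost verbatim; the same NULL caveat from above applies:
spark.sql("SELECT * FROM B_table WHERE type NOT IN (SELECT id FROM A_table)").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+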
Here is how I created the DataFrames:
A = [("type1", 10),
     ("type2", 20)]
AColumns = ["type", "id"]
A_df = spark.createDataFrame(data=A, schema=AColumns)
A_df.printSchema()
A_df.show(truncate=False)

B = [(1, "name1", 10),
     (2, "name2", 30),
     (3, "name3", 20)]
BColumns = ["id", "name", "type"]
B_df = spark.createDataFrame(data=B, schema=BColumns)
B_df.printSchema()
B_df.show(truncate=False)
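As a side note, if you only need the distinct non-matching values of the single column rather than the whole rows, a set-difference sketch with subtract expresses the same NOT IN (subtract behaves like SQL EXCEPT DISTINCT):
# distinct type values of B_df that never appear as an id in A_df
B_df.select("type").subtract(A_df.select("id")).show()
# +----+
# |type|
# +----+
# |  30|
# +----+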