Comparing two Dataframe columns and showing the result that is available in df1 and not in df2

I want to compare two dataframes, df1 (recent data) and df2 (previous data), which are derived from the same table at different timestamps, and extract from df1 the rows whose column value (id) is not present in df2.

I used row numbers to extract the recent and previous data and store them in df1 (recent data) and df2 (previous data). I have tried a left join and subtract, but I'm not sure I'm on the right track.
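For context, a minimal sketch of how such a row-number split could be produced (assuming a source dataframe named src with ID and Timestamp columns; both names are placeholders):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank each ID's rows from most recent (1) to oldest
w = Window.partitionBy("ID").orderBy(F.col("Timestamp").desc())
ranked = src.withColumn("RowNum", F.row_number().over(w))

df1 = ranked.where(F.col("RowNum") == 1)  # most recent data
df2 = ranked.where(F.col("RowNum") == 2)  # previous data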

df1 =

+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  1|2019-04-03 14:45:...|     1|
|  2|2019-04-03 14:45:...|     1|
|  3|2019-04-03 14:45:...|     1|
+---+--------------------+------+

df2 =

+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  2|2019-04-03 13:45:...|     2|
|  3|2019-04-03 13:45:...|     2|
+---+--------------------+------+


%%spark
result2 = df1.join(df2.select(['id']), ['id'], how='left')
result2.show(10)

but it didn't give the desired output.
Required Output:

+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  1|2019-04-03 14:45:...|     1|
+---+--------------------+------+

Try this.

scala> val df1 = Seq(("1","2019-04-03 14:45:00","1"),("2","2019-04-03 14:45:00","1"),("3","2019-04-03 14:45:00","1")).toDF("ID","Timestamp","RowNum")
df1: org.apache.spark.sql.DataFrame = [ID: string, Timestamp: string ... 1 more field]

scala> df1.show
+---+-------------------+------+
| ID|          Timestamp|RowNum|
+---+-------------------+------+
|  1|2019-04-03 14:45:00|     1|
|  2|2019-04-03 14:45:00|     1|
|  3|2019-04-03 14:45:00|     1|
+---+-------------------+------+

scala> val df2 = Seq(("2","2019-04-03 13:45:00","2"),("3","2019-04-03 13:45:00","2")).toDF("ID","Timestamp","RowNum")
df2: org.apache.spark.sql.DataFrame = [ID: string, Timestamp: string ... 1 more field]

scala> df2.show
+---+-------------------+------+
| ID|          Timestamp|RowNum|
+---+-------------------+------+
|  2|2019-04-03 13:45:00|     2|
|  3|2019-04-03 13:45:00|     2|
+---+-------------------+------+

scala> val idDiff = df1.select("ID").except(df2.select("ID"))
idDiff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string]

scala> idDiff.show
+---+
| ID|
+---+
|  1|
+---+


scala> val outputDF = df1.join(idDiff, "ID")
outputDF: org.apache.spark.sql.DataFrame = [ID: string, Timestamp: string ... 1 more field]

scala> outputDF.show
+---+-------------------+------+
| ID|          Timestamp|RowNum|
+---+-------------------+------+
|  1|2019-04-03 14:45:00|     1|
+---+-------------------+------+
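The same except-then-join idea in PySpark would be (a minimal sketch; subtract is the PySpark DataFrame method for set difference, equivalent to except here):

# IDs present in df1 but not in df2
idDiff = df1.select("ID").subtract(df2.select("ID"))

# Keep the full df1 rows for those IDs
outputDF = df1.join(idDiff, "ID")
outputDF.show()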

You can use the left_anti join type to do what you want:

result2 = df1.join(df2, ['id'], how='left_anti')

It is not explained very well in the Spark documentation itself, but you can find more information about this join type elsewhere, for example here.
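Applied to the sample data above, a minimal end-to-end sketch (assuming an existing SparkSession named spark):

df1 = spark.createDataFrame(
    [("1", "2019-04-03 14:45:00", "1"),
     ("2", "2019-04-03 14:45:00", "1"),
     ("3", "2019-04-03 14:45:00", "1")],
    ["ID", "Timestamp", "RowNum"])

df2 = spark.createDataFrame(
    [("2", "2019-04-03 13:45:00", "2"),
     ("3", "2019-04-03 13:45:00", "2")],
    ["ID", "Timestamp", "RowNum"])

# left_anti keeps only the df1 rows that have no matching ID in df2
result2 = df1.join(df2, ["ID"], how="left_anti")
result2.show()
# +---+-------------------+------+
# | ID|          Timestamp|RowNum|
# +---+-------------------+------+
# |  1|2019-04-03 14:45:00|     1|
# +---+-------------------+------+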

There are two ways to achieve this:

1. IS NOT IN - build a list (df2_list) from the lookup dataframe and use that list with isin() == False:
from pyspark.sql.functions import col

df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C"), (4, "D")], ['id', 'item'])
df2 = spark.createDataFrame([(1, 10), (2, 20)], ['id', 'otherItem'])

# Collect the lookup ids into a plain Python list on the driver
df2_list = df2.select('id').rdd.map(lambda row: row[0]).collect()

# Keep only the df1 rows whose id is NOT in that list
df1.where(col('id').isin(df2_list) == False).show()

2. Left Anti Join - keep the master table on the left side:

df1.join(df2, df1.id == df2.id, 'left_anti').show()
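Of the two, the left anti join is usually the safer choice: the isin() approach collect()s every lookup id onto the driver, which is fine for a small lookup dataframe but won't scale, while the anti join stays fully distributed.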