Azure Databricks: How do I inner-join two DataFrames that have a one-to-many relationship and select particular columns from both?

I read the data from JSON files as follows:

import os,shutil,glob,time
from pyspark.sql.functions import trim 

#Get Data DF1
df1 = spark.read.format("json").load("/mnt/coi/df1.json")

#Get Data DF2
df2 = spark.read.format("json").load("/mnt/coi/df2.json")

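I am joining the data and selecting columns from both DataFrames, but the final result is incorrect and does not contain all of the data: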

df = df2.join(df1,df2.Number == df1.Number,how="inner").select(df1.abc,df2.xyz)

The DF1 JSON has unique values in the Number column:

{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0"} 
{"Number":80216884,"Type":"8","ID":2,"Code":"1010","abc":"MT"} 
{"Number":80216885,"Type":"8","ID":2,"Code":"1295","abc":"MS"} 

The DF2 JSON has duplicate Number values:

{"Number":80216883,"DateTime":"2019-11-16","Year":2020,"Quarter":2,"xyz":5,"abc":"M0"}
{"Number":80216883,"DateTime":"2018-11-20","Year":2020,"Quarter":2,"xyz":5,"abc":"M0"}
{"Number":80216884,"DateTime":"2019-11-09","Year":2020,"Quarter":2,"xyz":5,"abc":"MT"}

The result I want is:

{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0","DateTime":"2018-11-16","Year":2020,"Quarter":2,"xyz":5}
{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0","DateTime":"2018-11-20","Year":2020,"Quarter":2,"xyz":5}

How can I do an inner join of two DataFrames that have a one-to-many relationship and select particular columns from both of them?

When I do the join, some of the Number values that exist in both DataFrames are missing from the final output JSON.

Also, when I merge the part files into a single file, only the last part file is copied to the final data. Please find the code below:

dfAll.write.format("json").save("/mnt/coi/DataModel")

#Read Part files
path = glob.glob("/dbfs/mnt/coi/DataModel/part-000*.json")


#Move file to FinalData folder in blob
for file in path: 
      shutil.move(file,"/dbfs/mnt/coi/FinalData/FinalData.json")

To get the result you expect, and given that you only want the values that form a one-to-many relationship, my approach is as follows:

from pyspark.sql.functions import col

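# Inner join on Number, keeping only the df2 columns
df = df2.join(df1, df2.Number == df1.Number, how="inner").select(df2.DateTime, df2.Number, df2.Quarter, df2.Year, df2.abc, df2.xyz)

# Numbers that occur more than once, i.e. the "many" side of the relationship
df3 = df.groupBy("Number").count().filter(col("count") > 1).select("Number")

# Join back on the column name to keep every row for those Numbers
df4 = df3.join(df, on="Number", how="inner")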

display(df4)

Please let me know if this helps.
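
On the second part of the question: the loop moves every part file to the same destination path, so each `shutil.move` overwrites the previous file and only the last part survives. A minimal sketch of one way around this is to append the part files to the destination instead. The helper name `merge_part_files` and the default pattern are hypothetical, and the sketch assumes the part files are newline-terminated JSON lines, as Spark's JSON writer normally produces.

```python
import glob
import os
import shutil


def merge_part_files(src_dir, dest_file, pattern="part-*.json"):
    """Concatenate every matching part file in src_dir into dest_file.

    Returns the number of part files that were merged.
    """
    # Sort so the parts are appended in a stable order.
    parts = sorted(glob.glob(os.path.join(src_dir, pattern)))

    # Make sure the destination folder exists before writing.
    os.makedirs(os.path.dirname(dest_file) or ".", exist_ok=True)

    # Open the destination once and append each part to it,
    # rather than moving each part onto the same path.
    with open(dest_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)
```

With the paths from the question, the call would look like `merge_part_files("/dbfs/mnt/coi/DataModel", "/dbfs/mnt/coi/FinalData/FinalData.json")`. Another option, if the data is small enough for a single task, is `dfAll.coalesce(1).write.format("json").save(...)`, which should leave only one part file to move.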