Combine two dataframes with structs in pyspark

Question

I have two dataframes (A and B) with the following schema:

 root
     |-- AUTHOR_ID: integer (nullable = false)
     |-- NAME: string (nullable = true)
     |-- Books: array (nullable = false)
     |    |-- element: struct (containsNull = false)
     |    |    |-- BOOK_ID: integer (nullable = false)
     |    |    |-- Chapters: array (nullable = true) 
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- NAME: string (nullable = true)
     |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)

What is the best and cleanest way to merge the two dataframes so that each one ends up as a struct field in a new column, giving this result:

+---------+---------+------------+
|AUTHOR_ID| A       | B          |
+---------+---------+------------+
|  1      | {}      |   {}       |   keep the nested structs in the new column
|         |         |            |

Answer 1

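Perhaps the best approach is a simple join,

followed by adding a few new columns.

Note: this assumes that the two dataframes df_A and df_B have the same schema.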

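  from pyspark.sql import functions as f

  # the function name is illustrative; the original snippet only showed the body
  def combine_as_structs(df_A, df_B):
      columns = df_A.columns

      # rename B's key so that a missing match can be detected after the left join
      b_ = df_B.withColumnRenamed('AUTHOR_ID', 'ref_id')
      return df_A.withColumn('A', f.struct(*columns))\
              .select('AUTHOR_ID', 'A')\
              .join(b_, df_A['AUTHOR_ID'] == b_.ref_id, 'left')\
              .withColumn('B', f.when(f.col('ref_id').isNotNull(), f.struct(*columns)).otherwise(f.lit(None)))\
              .select('AUTHOR_ID', 'A', 'B')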

apache-spark

pyspark