How to shorten table entries for duplicates

Question

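I have a source dataset with 200,000 records. There are two columns, and I want to count the occurrences of each distinct value in them.

This is what I have: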

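import org.apache.spark.sql.functions.{col, lit}

// Named bp so that it matches the references in the groupBy calls below
val bp = spark.read.parquet("src1.parquet")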
val dAppr = bp.groupBy("approver").count().toDF("name","Role1")
val cols1 = dAppr.columns.toSet

val dRevr = bp.groupBy("submitter").count().toDF("name","Role2")
val cols2 = dRevr.columns.toSet

val Signers1 = cols1 ++ cols2

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}

dAppr.select(expr(cols1, Signers1):_*).unionAll(dRevr.select(expr(cols2, Signers1):_*)).show()

I get:

+--------+---------+----------+
|    name|    Role1|     Role2|
+--------+---------+----------+
|Person A|    19421|      null|
|Person B|    41993|      null|
|Person C|    58822|      null|
|Person D|    48920|      null|
|Person A|     null|     53615|
|Person B|     null|     55904|
|Person C|     null|       118|
|Person D|     null|     59519|
+--------+---------+----------+

What I want (or at least, what I think I want):

+--------+---------+---------+
|    name|    Role1|    Role2|
+--------+---------+---------+
|Person A|    19421|    53615|
|Person B|    41993|    55904|
|Person C|    58822|      118|
|Person D|    48920|    59519|
+--------+---------+---------+

Answer 1

You should use join instead of union in the last line of your code.

It looks like a full join on the name column would give you one row per person.

join

Joins with another DataFrame, using the given join expression

union

Return a new DataFrame containing union of rows in this and another frame

Edit:

You can remove these lines:

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}

dAppr.select(expr(cols1, Signers1):_*).unionAll(dRevr.select(expr(cols2, Signers1):_*)).show()

and replace them with a full join, followed by a select on the relevant columns.
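The suggestion above could be sketched roughly as follows. This is a minimal, untested sketch rather than a definitive solution: the small in-memory `bp` DataFrame is a hypothetical stand-in for the parquet source, and the object name `RoleCounts` and the local SparkSession settings are assumptions made only so the example is self-contained.

```scala
import org.apache.spark.sql.SparkSession

object RoleCounts {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only for illustration
    val spark = SparkSession.builder()
      .appName("role-counts")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for src1.parquet
    val bp = Seq(
      ("Person A", "Person B"),
      ("Person A", "Person C"),
      ("Person B", "Person A")
    ).toDF("approver", "submitter")

    // Same per-role counts as in the question
    val dAppr = bp.groupBy("approver").count().toDF("name", "Role1")
    val dRevr = bp.groupBy("submitter").count().toDF("name", "Role2")

    // Full outer join on the shared name column, so each person gets one row
    val signers = dAppr
      .join(dRevr, Seq("name"), "full_outer")
      .select("name", "Role1", "Role2")

    signers.show()
    spark.stop()
  }
}
```

Joining with `Seq("name")` keeps a single name column in the result. With a full outer join, a person who appears in only one of the two roles still gets a row, with null in the other count column.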


union

join

apache-spark