Spark SQL - do an aggregation based on the result of another aggregation

I have the dataframe below. What I want to achieve is that the values in the `code` column are aggregated into a list only when the name has at least one row with a 'Y' flag. I tried the SQL below, but it didn't work. How can this be done? I've added comments in the expected output below. Your help is much appreciated.

Input:

name            code    flag
big bird          A       Y
elmo              B       N     
cookie monster    C       Y
cookie monster    D       N

Expected output:

name              hasYflag    Codelist
big bird            Y          A
elmo                N               // elmo has no Codelist because its flag is N
cookie monster      Y          C,D  // cookie monster has a Codelist because one of its rows (row 3 above) has flag Y

I tried the query below, but it didn't work. I'd like to use Spark SQL rather than the DataFrame API:

select name,
case when max(flag) = "Y" then "Y" else "N" end as hasYflag
case when max(flag) = "Y" then sort_array(collect_set(code)) else null as Codelist
from df
groupby name

Try this:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    data
      // collect every code for the name, regardless of flag
      .withColumn("code", collect_set("code") over Window.partitionBy("name"))
      // rank rows so that a "Y" flag, if present, sorts first
      .withColumn("rank", row_number() over Window.partitionBy("name").orderBy(col("flag").desc))
      .where(col("rank") === 1)
      // keep the code list only when the top-ranked flag is "Y"; otherwise null
      .withColumn("code", when(col("flag") === lit("Y"), concat_ws(",", col("code"))))
      .withColumnRenamed("flag", "hasYFlag")
      .withColumnRenamed("code", "codeList")
      .select("name", "codeList", "hasYFlag")

Output:

+--------------+--------+--------+
|          name|codeList|hasYFlag|
+--------------+--------+--------+
|cookie monster|     C,D|       Y|
|      big bird|       A|       Y|
|          elmo|    null|       N|
+--------------+--------+--------+

As requested (edit), the same logic in Spark SQL:

    sparkSession.sql("""
      WITH dups AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY flag DESC) AS rn
        FROM (
          SELECT name, flag,
                 CONCAT_WS(',', COLLECT_SET(code) OVER (PARTITION BY name)) AS code
          FROM tmp_table)
      )
      SELECT name, flag AS hasYFlag, IF(flag = 'Y', code, NULL) AS codeList
      FROM dups
      WHERE rn = 1
    """)
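
The grouping logic used in both versions can be sketched in plain Scala collections, with no Spark required, using the sample rows from the question. `Row` here is a hypothetical case class for illustration, not Spark's `Row`:

```scala
// A minimal sketch of the aggregation semantics: group by name,
// flag "Y" if any row in the group carries it, and build the code
// list (distinct, sorted, comma-joined) only in that case.
case class Row(name: String, code: String, flag: String)

val rows = Seq(
  Row("big bird", "A", "Y"),
  Row("elmo", "B", "N"),
  Row("cookie monster", "C", "Y"),
  Row("cookie monster", "D", "N")
)

val result: Map[String, (String, Option[String])] =
  rows.groupBy(_.name).map { case (name, group) =>
    val hasY     = group.exists(_.flag == "Y")
    val flagCol  = if (hasY) "Y" else "N"
    // mirrors collect_set + sort_array + concat_ws, gated on the flag
    val codeList =
      if (hasY) Some(group.map(_.code).distinct.sorted.mkString(","))
      else None
    (name, (flagCol, codeList))
  }
```

This mirrors the Spark plan: `groupBy(_.name)` plays the role of the window partition, and the `hasY` gate corresponds to the `when(col("flag") === "Y", ...)` / `IF(flag = 'Y', ...)` step.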