PySpark case statement over a window function

Question

I have a DataFrame in which I need to check the following three columns in order to select the correct row.

Given input DataFrame:

customer_number acct_registration_ts          last_login_ts acct_create_ts
28017150        null                           null         2018-02-13T00:43:26.747+0000
28017150        null                           null         2014-09-11T15:58:29.593+0000
28017150        2014-05-14T23:11:40.167+0000   null         2014-05-12T00:00:00.000+0000

Expected output DataFrame:

customer_number acct_registration_ts          last_login_ts acct_create_ts
28017150        2014-05-14T23:11:40.167+0000   null         2014-05-12T00:00:00.000+0000

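Filter conditions:

If acct_registration_ts is NOT NULL, take the row with the maximum acct_registration_ts.
If acct_registration_ts is NULL, check last_login_ts; if last_login_ts is NOT NULL, take the row with the maximum last_login_ts.
If both acct_registration_ts and last_login_ts are NULL, take the row with the maximum acct_create_ts.

I need to group by the customer_number column and then apply the three rules above. I tried using PySpark window functions but did not get the expected output. Any help would be appreciated.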
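Read per customer_number group, the three rules can be written down in plain Python to make the intended result unambiguous. This is only a minimal sketch of one reading of the rules (the one that matches the expected output above): `pick_row` and `PRIORITY_COLS` are hypothetical names, rows are assumed to be dicts, and timestamps are assumed to be same-format ISO-8601 strings so that string comparison follows chronological order.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Columns in the priority order given by the three rules.
PRIORITY_COLS = ("acct_registration_ts", "last_login_ts", "acct_create_ts")


def pick_row(group: List[Row]) -> Row:
    """Apply the three rules to the rows of one customer_number (hypothetical helper)."""
    for col in PRIORITY_COLS:
        # Rows where this column is populated; if there are any, this column decides.
        candidates = [r for r in group if r[col] is not None]
        if candidates:
            # Assumption: same-format ISO-8601 strings compare chronologically.
            return max(candidates, key=lambda r: r[col])
    raise ValueError("all three timestamp columns are NULL in every row")


group = [
    {"acct_registration_ts": None, "last_login_ts": None,
     "acct_create_ts": "2018-02-13T00:43:26.747+0000"},
    {"acct_registration_ts": None, "last_login_ts": None,
     "acct_create_ts": "2014-09-11T15:58:29.593+0000"},
    {"acct_registration_ts": "2014-05-14T23:11:40.167+0000", "last_login_ts": None,
     "acct_create_ts": "2014-05-12T00:00:00.000+0000"},
]

chosen = pick_row(group)
print(chosen)
```

For the sample data, only the third row has a non-NULL acct_registration_ts, so the first rule selects it.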

Answer 1

You can use a window ordered by all three columns:

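from pyspark.sql import functions as F, Window

# Order each customer's rows by the three timestamp columns, in priority order,
# descending with NULLs last. df.columns[1:] assumes those columns follow customer_number.
w = Window.partitionBy('customer_number').orderBy(*[F.desc_nulls_last(c) for c in df.columns[1:]])

# Keep the top-ranked row(s); dense_rank keeps every row tied on all three columns.
df2 = df.withColumn('rn', F.dense_rank().over(w)).filter('rn = 1')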

df2.show(truncate=False)
+---------------+----------------------------+-------------+----------------------------+---+
|customer_number|acct_registration_ts        |last_login_ts|acct_create_ts              |rn |
+---------------+----------------------------+-------------+----------------------------+---+
|28017150       |2014-05-14T23:11:40.167+0000|null         |2014-05-12T00:00:00.000+0000|1  |
+---------------+----------------------------+-------------+----------------------------+---+

