Using SQL/Presto/Athena, merge records by ID, keeping the most recent value by timestamp

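Is it possible to merge records that share the same ID so that each column keeps its value from the most recent timestamp?

For example, suppose a table has the following records for user_id = 1: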

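user_id | name     | address    | city    | state | op_id | benefit | phone    | insert_date_timestamp
1       |          |            |         |       | 0     |         |          | 2021-06-22 15:06:29.083534
1       |          |            |         |       |       |         | 99999999 | 2021-06-22 15:06:29.153258
1       |          |            |         |       |       | N       |          | 2021-06-22 15:03:29.153258
1       |          |            |         |       | 1     |         |          | 2021-06-22 15:01:29.153258
1       |          | 999 street | Phoenix | AZ    |       |         |          | 2021-06-22 14:06:29.153258
1       | John Doe |            |         |       |       |         |          | 2021-06-21 15:06:29.153258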

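As you can see, several new entries were inserted over time. If I merge all the records from the oldest to the newest, the current record would be:

Result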

user_id | name     | address    | city    | state | op_id | benefit | phone    | insert_date_timestamp
1       | John Doe | 999 street | Phoenix | AZ    | 0     | N       | 99999999 | 2021-06-22 15:06:29.083534

How can I achieve this using SQL? Is it possible to produce the same result with a Presto/Athena query?

PS: I know this can be done with PySpark, pandas, etc., but my use case is Athena.

Thanks!!

Solution

You can use first_value():

-- Sort non-null values first, then newest first, so first_value() returns the latest non-null value.
select distinct user_id,
       first_value(name) over (partition by user_id order by (name is not null) desc, insert_date_timestamp desc rows between unbounded preceding and unbounded following) as name,
       first_value(address) over (partition by user_id order by (address is not null) desc, insert_date_timestamp desc rows between unbounded preceding and unbounded following) as address,
       . . .
from t;

Or, if you prefer ignore nulls:

-- ignore nulls skips null values, so ordering newest first is enough.
select distinct user_id,
       first_value(name) ignore nulls over (partition by user_id order by insert_date_timestamp desc rows between unbounded preceding and unbounded following) as name,
       first_value(address) ignore nulls over (partition by user_id order by insert_date_timestamp desc rows between unbounded preceding and unbounded following) as address,
       . . .
from t;
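
To sanity-check the idea without an Athena account, here is a minimal sketch that runs the first_value() approach (non-null values sorted first, then newest first) against the sample rows from the question. It rests on a few assumptions: SQLite (through Python's sqlite3 module, with a SQLite build recent enough to support window functions) stands in for Athena, the table is called t, and the column each sparse value belongs to is inferred from the expected result. Python only loads the data and runs the query; the merge itself is plain SQL. SQLite does not accept ignore nulls, so only the first variant is exercised here.

```python
import sqlite3

# Sample rows for user_id = 1, newest first. Which column each sparse value
# belongs to is inferred from the expected result (an assumption).
# (user_id, name, address, city, state, op_id, benefit, phone, insert_date_timestamp)
ROWS = [
    (1, None, None, None, None, 0, None, None, "2021-06-22 15:06:29.083534"),
    (1, None, None, None, None, None, None, "99999999", "2021-06-22 15:06:29.153258"),
    (1, None, None, None, None, None, "N", None, "2021-06-22 15:03:29.153258"),
    (1, None, None, None, None, 1, None, None, "2021-06-22 15:01:29.153258"),
    (1, None, "999 street", "Phoenix", "AZ", None, None, None, "2021-06-22 14:06:29.153258"),
    (1, "John Doe", None, None, None, None, None, None, "2021-06-21 15:06:29.153258"),
]

# Columns to merge; user_id is the partition key.
COLUMNS = ["name", "address", "city", "state", "op_id", "benefit", "phone"]


def merged_column(col):
    # Non-null values sort first, then newest first, so first_value()
    # returns the most recent non-null value of the column.
    return (
        f"first_value({col}) over ("
        f"partition by user_id "
        f"order by ({col} is not null) desc, insert_date_timestamp desc "
        f"rows between unbounded preceding and unbounded following) as {col}"
    )


QUERY = (
    "select distinct user_id,\n       "
    + ",\n       ".join(merged_column(c) for c in COLUMNS)
    + "\nfrom t"
)


def merge_latest(rows):
    """Load rows into an in-memory table t and return one merged row per user_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (user_id integer, name text, address text, city text, "
        "state text, op_id integer, benefit text, phone text, "
        "insert_date_timestamp text)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(QUERY)
    print(merge_latest(ROWS))
```

Timestamps are stored as ISO-formatted text here, which sorts correctly as strings. In Athena the column would normally be a real timestamp, and the generated query text should carry over, since it uses only first_value() and a boolean sort key.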