Using SQL/Presto/Athena, merge records by ID, keeping the most recent value of each column by timestamp
Is it possible to merge records that share the same ID, based on the most recent timestamp?
For example, suppose a table has the following records for user_id = 1.
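user_id | name | address | city | state | op_id | benefit | phone | insert_date_timestamp |
---|---|---|---|---|---|---|---|---|
1 | | | | | 0 | | | 2021-06-22 15:06:29.083534 |
1 | | | | | | | 99999999 | 2021-06-22 15:06:29.153258 |
1 | | | | | | N | | 2021-06-22 15:03:29.153258 |
1 | | | | | 1 | | | 2021-06-22 15:01:29.153258 |
1 | | 999 St | Phoenix | AZ | | | | 2021-06-22 14:06:29.153258 |
1 | John | Doe | | | | | | 2021-06-21 15:06:29.153258 |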
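As you can see, several new entries were inserted over time. If I merge all of the records, from the oldest to the most recent, the current record would be: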
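Result
user_id | name | address | city | state | op_id | benefit | phone | insert_date_timestamp |
---|---|---|---|---|---|---|---|---|
1 | John | 999 St | Phoenix | AZ | 0 | N | 99999999 | 2021-06-22 15:06:29.083534 |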
How can I achieve this with SQL? Can a Presto/Athena query produce the same result?
PS: I know this can be done with PySpark, pandas, etc., but my use case is Athena.
Thanks!!
Solution
You can use first_value():

    select distinct user_id,
           -- non-null rows sort first, then newest first, so first_value() returns the latest non-null value
           -- (insert_date is the question's insert_date_timestamp column)
           first_value(name) over (partition by user_id order by (name is not null) desc, insert_date desc
                                   rows between unbounded preceding and unbounded following) as name,
           first_value(address) over (partition by user_id order by (address is not null) desc, insert_date desc
                                      rows between unbounded preceding and unbounded following) as address,
           . . .
    from t;
Or, if you prefer ignore nulls:

    select distinct user_id,
           first_value(name) ignore nulls over (partition by user_id order by insert_date desc
                                                rows between unbounded preceding and unbounded following) as name,
           first_value(address) ignore nulls over (partition by user_id order by insert_date desc
                                                   rows between unbounded preceding and unbounded following) as address,
           . . .
    from t;
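
For a quick sanity check, here is a self-contained sketch of the first_value() approach that sorts non-null values first. It is not a drop-in query. The inline rows stand in for the real table t and cover only three of the question's columns. The column name benefit is an assumed English name for the question's seventh column. Placing the lone 1 in op_id is also an assumption. Timestamps are kept as ISO-formatted strings, which sort chronologically.

```sql
-- Sketch only: inline sample rows replace the real table t.
-- Assumption: the "1" in the 15:01 row belongs to op_id.
with t (user_id, op_id, benefit, phone, insert_date) as (
    values
        (1, '0',                   cast(null as varchar), cast(null as varchar), '2021-06-22 15:06:29.083534'),
        (1, cast(null as varchar), cast(null as varchar), '99999999',            '2021-06-22 15:06:29.153258'),
        (1, cast(null as varchar), 'N',                   cast(null as varchar), '2021-06-22 15:03:29.153258'),
        (1, '1',                   cast(null as varchar), cast(null as varchar), '2021-06-22 15:01:29.153258')
)
select distinct
    user_id,
    -- (col is not null) desc puts non-null rows first; insert_date desc then picks the newest of them
    first_value(op_id) over (partition by user_id
                             order by (op_id is not null) desc, insert_date desc
                             rows between unbounded preceding and unbounded following) as op_id,
    first_value(benefit) over (partition by user_id
                               order by (benefit is not null) desc, insert_date desc
                               rows between unbounded preceding and unbounded following) as benefit,
    first_value(phone) over (partition by user_id
                             order by (phone is not null) desc, insert_date desc
                             rows between unbounded preceding and unbounded following) as phone
from t;
```

With these rows, every row in the user_id = 1 partition gets op_id 0, benefit N and phone 99999999, so select distinct collapses them into a single row. The older op_id value 1 is superseded by the newer 0.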