LAST_VALUE with an IF statement inside is not backfilling its partition --> losing last values when selecting the first row of each partition (BigQuery/SQL)
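I'm having trouble with window functions. For a dataset containing events tied to users, I want to pick the FIRST_VALUE for some columns and the LAST_VALUE for others, and condense the result to one row per user.
Using a FIRST_VALUE/LAST_VALUE approach, partitioning by user and ordering by date/timestamp, I get a satisfactory result with FIRST_VALUE (= the value from the first row populates the whole column). In the LAST_VALUE clause, I include an IF statement to create a column stating the time of account deletion. It doesn't work at all. Any suggestions for a way to solve this?
I've included a minimal example table below, with the expected output further down.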
WITH dataset_table AS (
SELECT DATE '2020-01-01' date , 1 user, 'german' user_language, 'created_account' event UNION ALL
SELECT '2020-01-02', 1, 'german', 'successful_login' UNION ALL
SELECT '2020-01-03', 1, 'english', 'screen_view' UNION ALL
SELECT '2020-01-04', 1, 'english', 'deleted_account' UNION ALL
SELECT '2020-01-01', 2, 'english', 'login' UNION ALL
SELECT '2020-01-02', 2, 'english', 'settings' UNION ALL
SELECT '2020-01-03', 2, 'english', 'NULL' UNION ALL
SELECT '2020-01-04', 2, 'french', 'screen_view'
),
user_info AS (
SELECT
`date`,
user,
-- record first value for language = signup demographics
FIRST_VALUE(user_language IGNORE NULLS) OVER time_order user_language,
-- record last value for app removal - want to know if the user deleted their account and didn't return
LAST_VALUE(IF(event = 'deleted_account', `date`, NULL)) OVER time_order deleted_account,
ROW_NUMBER() OVER time_order row_idx
FROM dataset_table
WINDOW time_order AS (PARTITION BY user ORDER BY date)
)
SELECT
*
FROM user_info
WHERE row_idx = 1
-- Here I select the first row, but deleted_account hasn't been populated with the last value for user 1.
-- The same test for FIRST_VALUE does populate the whole column with 'german'.
-- Using row_idx = 4 would give the correct answer for this example, but in reality each user has
-- a different number of events, so I want to use row_idx = 1 to pick out the right row.
Expected output:
date        user  user_language  deleted_account  row_idx
2020-01-01  1     german         2020-01-04       1
2020-01-01  2     english        null             1
I think you want:
WITH dataset_table AS (...),
user_info AS (
SELECT
`date`,
user,
FIRST_VALUE(user_language IGNORE NULLS) OVER (PARTITION BY user ORDER BY date) user_language,
MAX(IF(event = 'deleted_account', `date`, NULL)) OVER (PARTITION BY user) deleted_account,
ROW_NUMBER() OVER (PARTITION BY user ORDER BY date) row_idx
FROM dataset_table
)
SELECT *
FROM user_info
WHERE row_idx = 1
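To add the reasoning behind this: when a window has an ORDER BY but no explicit frame, BigQuery's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. On the row with row_idx = 1, LAST_VALUE therefore only sees that first row, while FIRST_VALUE still works because the first row is always inside the frame. MAX(...) OVER (PARTITION BY user) has no ORDER BY, so it looks at the whole partition. If you would rather keep LAST_VALUE, a minimal sketch could look like the following. It assumes a hypothetical window name whole_partition, uses a trimmed copy of your sample data, and adds IGNORE NULLS (not in your original query) so the non-deletion rows are skipped:

```sql
WITH dataset_table AS (
  -- Trimmed copy of the sample data from the question
  SELECT DATE '2020-01-01' AS `date`, 1 AS user, 'german' AS user_language, 'created_account' AS event UNION ALL
  SELECT DATE '2020-01-02', 1, 'german', 'successful_login' UNION ALL
  SELECT DATE '2020-01-03', 1, 'english', 'screen_view' UNION ALL
  SELECT DATE '2020-01-04', 1, 'english', 'deleted_account' UNION ALL
  SELECT DATE '2020-01-01', 2, 'english', 'login' UNION ALL
  SELECT DATE '2020-01-04', 2, 'french', 'screen_view'
),
user_info AS (
  SELECT
    `date`,
    user,
    -- First non-null language per user
    FIRST_VALUE(user_language IGNORE NULLS) OVER whole_partition AS user_language,
    -- Last non-null deletion date per user; the explicit frame lets every row see the whole partition
    LAST_VALUE(IF(event = 'deleted_account', `date`, NULL) IGNORE NULLS) OVER whole_partition AS deleted_account,
    -- Numbering functions do not accept a frame clause, so ROW_NUMBER gets its own window
    ROW_NUMBER() OVER (PARTITION BY user ORDER BY `date`) AS row_idx
  FROM dataset_table
  WINDOW whole_partition AS (  -- hypothetical name
    PARTITION BY user
    ORDER BY `date`
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
)
SELECT *
FROM user_info
WHERE row_idx = 1
ORDER BY user;
```

On the sample data this should give deleted_account = 2020-01-04 for user 1 and NULL for user 2. Note that dropping IGNORE NULLS changes the meaning: the column is then only populated when deleted_account is the user's final event, which may be closer to "deleted their account and didn't return".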