Google BigQuery: retrieve the latest version of each row
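I have a Google BigQuery table that contains every version of each resource. Each time a resource is created, updated, or deleted, a new row is added with an incremented version number (this number will be the timestamp at which the row was added).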
+-------+------------+--------+-------+-------------+
| ID | ResourceID | Action | Count | Timestamp |
+-------+------------+--------+-------+-------------+
| ABC_1 | ABC | CREATE | 10 | {timestamp} |
| ABC_2 | ABC | UPDATE | 8 | {timestamp} |
| ABC_3 | ABC | UPDATE | 4 | {timestamp} |
| ABC_4 | ABC | DELETE | 4 | {timestamp} |
| - | | | | |
| DEF_1 | DEF | CREATE | 10 | {timestamp} |
| DEF_2 | DEF | DELETE | 10 | {timestamp} |
| - | | | | |
| GHJ_1 | GHJ | CREATE | 10 | {timestamp} |
| - | | | | |
| KLM_1 | KLM | CREATE | 10 | {timestamp} |
| KLM_2 | KLM | UPDATE | 5 | {timestamp} |
+-------+------------+--------+-------+-------------+
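- ID: unique ID of the row, made of the ResourceID plus a version identifier
- ResourceID: ID of the resource the action was performed on
- Action: the action performed on the resource
- Count: a value associated with the resource
- Timestamp: the timestamp at which the row was added (the same one used in the unique ID)
I need to write a query that retrieves the latest version of every resource: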
+-------+------------+--------+-------+-------------+
| ID | ResourceID | Action | Count | Timestamp |
+-------+------------+--------+-------+-------------+
| ABC_4 | ABC | DELETE | 4 | {timestamp} |
| DEF_2 | DEF | DELETE | 10 | {timestamp} |
| GHJ_1 | GHJ | CREATE | 10 | {timestamp} |
| KLM_2 | KLM | UPDATE | 5 | {timestamp} |
+-------+------------+--------+-------+-------------+
In addition, every resource that is in the DELETE state needs to be ignored.
So this is the final output I am looking for:
+-------+------------+--------+-------+-------------+
| ID | ResourceID | Action | Count | Timestamp |
+-------+------------+--------+-------+-------------+
| GHJ_1 | GHJ | CREATE | 10 | {timestamp} |
| KLM_2 | KLM | UPDATE | 5 | {timestamp} |
+-------+------------+--------+-------+-------------+
Here is the query I wrote:
SELECT ResourceID, Count
FROM worklog_*
WHERE ID IN (
  SELECT MAX(ID)
  FROM worklog_*
  GROUP BY ResourceID
) AND Action != 'DELETE'
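This is not the real BigQuery query, but it is enough to understand the behavior.
The query works as long as the values of the ID column can be compared, which is why I chose to build the ID by joining ResourceID and Timestamp: MAX() will then always return the last state.
Is this the best approach? Does anyone have a suggestion for a better way to do this kind of extraction?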
For BigQuery Standard SQL:
#standardSQL
WITH worklog AS (
  -- Sample rows from the question (Timestamp column omitted)
  SELECT 'ABC_1' AS ID, 'ABC' AS ResourceID, 'CREATE' AS Action, 10 AS Count UNION ALL
  SELECT 'ABC_2', 'ABC', 'UPDATE', 8 UNION ALL
  SELECT 'ABC_3', 'ABC', 'UPDATE', 4 UNION ALL
  SELECT 'ABC_4', 'ABC', 'DELETE', 4 UNION ALL
  SELECT 'DEF_1', 'DEF', 'CREATE', 10 UNION ALL
  SELECT 'DEF_2', 'DEF', 'DELETE', 10 UNION ALL
  SELECT 'GHJ_1', 'GHJ', 'CREATE', 10 UNION ALL
  SELECT 'KLM_1', 'KLM', 'CREATE', 10 UNION ALL
  SELECT 'KLM_2', 'KLM', 'UPDATE', 5
)
SELECT * EXCEPT(Last)
FROM (
  -- Rank every version, DELETE rows included, newest first within each resource
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY ResourceID ORDER BY ID DESC) AS Last
  FROM worklog
)
-- Keep the newest version only, then drop resources whose newest version is a DELETE.
-- Filtering DELETE before ranking would instead return ABC_3 and DEF_1.
WHERE Last = 1
  AND Action != 'DELETE'
-- ORDER BY ID
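The same "rank, then filter" idea can also be sketched without the subquery, assuming the QUALIFY clause is available in your BigQuery environment. This is only a sketch on the sample rows from the question, not a tested production query. The `WHERE TRUE` line is a harmless placeholder, added on the assumption that QUALIFY may need a WHERE, GROUP BY, or HAVING clause next to it.

```sql
#standardSQL
WITH worklog AS (
  -- Same sample rows as in the question; timestamps are left out
  SELECT 'ABC_1' AS ID, 'ABC' AS ResourceID, 'CREATE' AS Action, 10 AS Count UNION ALL
  SELECT 'ABC_2', 'ABC', 'UPDATE', 8 UNION ALL
  SELECT 'ABC_3', 'ABC', 'UPDATE', 4 UNION ALL
  SELECT 'ABC_4', 'ABC', 'DELETE', 4 UNION ALL
  SELECT 'DEF_1', 'DEF', 'CREATE', 10 UNION ALL
  SELECT 'DEF_2', 'DEF', 'DELETE', 10 UNION ALL
  SELECT 'GHJ_1', 'GHJ', 'CREATE', 10 UNION ALL
  SELECT 'KLM_1', 'KLM', 'CREATE', 10 UNION ALL
  SELECT 'KLM_2', 'KLM', 'UPDATE', 5
)
SELECT *
FROM worklog
WHERE TRUE  -- placeholder so that QUALIFY has a WHERE clause next to it
QUALIFY
  -- newest version per resource; DELETE rows take part in the ranking
  ROW_NUMBER() OVER (PARTITION BY ResourceID ORDER BY ID DESC) = 1
  -- then discard resources whose newest version is a DELETE
  AND Action != 'DELETE'
```

On the sample rows this should leave only GHJ_1 and KLM_2. Note that `ORDER BY ID DESC` compares IDs as strings, so a suffix like `_10` sorts before `_2`. If the version suffix is not fixed-width, ordering by the Timestamp column instead (`ORDER BY Timestamp DESC`) is probably safer.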