Find duplicate rows based on functions and not the original contents
I have a BigQuery table, logs, with two columns holding log messages:
time TIMESTAMP
message STRING
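I want to select all messages that match the pattern job .+ got machine (\d+) and whose machine appears in more than one matching row. For example, given the rows: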
10000, "job foo got machine 10"
10010, "job bar got machine 10"
10010, "job baz got machine 20"
the query should select the first two rows.
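To pin down the expected result, here is a minimal Python sketch of the selection rule applied to the three sample rows. It is only a reference for the semantics, not a BigQuery query; the helper names machine_id and rows_with_duplicate_machines are made up for this illustration.

```python
import re
from collections import Counter

# Same pattern as in the question; the capture group is the machine id.
PATTERN = re.compile(r"job .+ got machine (\d+)")


def machine_id(message):
    """Return the captured machine id, or None if the message does not match."""
    match = PATTERN.search(message)
    return match.group(1) if match else None


def rows_with_duplicate_machines(rows):
    """Keep (time, message) rows whose machine id occurs in more than one row."""
    ids = [machine_id(message) for _, message in rows]
    # Count only matching rows, so unrelated messages never count as duplicates.
    counts = Counter(i for i in ids if i is not None)
    return [row for row, i in zip(rows, ids) if i is not None and counts[i] > 1]


rows = [
    (10000, "job foo got machine 10"),
    (10010, "job bar got machine 10"),
    (10010, "job baz got machine 20"),
]
selected = rows_with_duplicate_machines(rows)
```

With the sample rows, selected holds the two rows for machine 10 and drops the row for machine 20.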
I can select the duplicated machines with this query:
SELECT
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
GROUP BY
machine_id
HAVING
COUNT(message) > 1
But I don't know how to get from here to the rows that contain those machines. I tried the following:
SELECT
[time],
message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
HAVING
machine_id IN (
SELECT
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
GROUP BY
machine_id
HAVING
COUNT(message) > 1)
But this gives the error "Error: Field 'machine_id' not found".
Is it possible to do what I want in a single query?
Don't use HAVING in that context; just use WHERE:
SELECT
[time],
message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
AND REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') IN (
SELECT
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
GROUP BY
machine_id
HAVING
COUNT(message) > 1)
I was able to solve this with the following query:
SELECT
[time],
message
FROM (
SELECT
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
GROUP BY
machine_id
HAVING
COUNT(message) > 1) AS A
JOIN (
SELECT
[time],
message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')) AS B
ON
A.machine_id = B.machine_id
It feels a bit clunky, but it seems to do the job.
Try the following:
SELECT [time], message
FROM (
SELECT [time], message, machine_id,
COUNT(1) OVER(PARTITION BY machine_id) AS dups
FROM (
SELECT [time], message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
)
)
WHERE dups > 1
No join, and less clunky.
Or simplified further:
SELECT [time], message FROM (
SELECT [time], message,
REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id,
COUNT(1) OVER(PARTITION BY machine_id) AS dups
FROM
[logs]
WHERE
REGEXP_MATCH(message, r'job .+ got machine \d+')
)
WHERE dups > 1
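The window-function idea can be tried locally with a small Python sketch. This is an emulation under stated assumptions, not BigQuery itself: sqlite3 (3.25 or newer, for window functions) stands in for BigQuery, and machine_of is a hypothetical user-defined function registered here to play the role of REGEXP_EXTRACT. Two non-matching rows are added to the sample data to show why rows with a NULL machine_id have to be filtered out before counting.

```python
import re
import sqlite3

PATTERN = re.compile(r"job .+ got machine (\d+)")


def machine_of(message):
    """Hypothetical stand-in for REGEXP_EXTRACT: captured id or None (SQL NULL)."""
    match = PATTERN.search(message or "")
    return match.group(1) if match else None


conn = sqlite3.connect(":memory:")
conn.create_function("machine_of", 1, machine_of)
conn.execute("CREATE TABLE logs (time INTEGER, message TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [
        (10000, "job foo got machine 10"),
        (10010, "job bar got machine 10"),
        (10010, "job baz got machine 20"),
        (10020, "unrelated line"),
        (10030, "another unrelated line"),
    ],
)

query = """
SELECT time, message
FROM (
  SELECT time, message,
         COUNT(*) OVER (PARTITION BY machine_id) AS dups
  FROM (
    SELECT time, message, machine_of(message) AS machine_id
    FROM logs
  )
  -- Non-matching rows have a NULL machine_id and would otherwise
  -- share one partition and look like duplicates of each other.
  WHERE machine_id IS NOT NULL
)
WHERE dups > 1
ORDER BY time, message
"""
result = conn.execute(query).fetchall()
```

Here result contains only the two rows for machine 10. Each row keeps its own time and message because the count is attached per row by the window, which is why no join back to the table is needed.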