Detect Outliers using BigQuery with Standard Deviation
I currently have a table in BigQuery that contains some outliers.
Sample table:
port - qty - datetime
--------------------------------
TCP1 - 13 - 2018/06/11 11:20:23
UDP2 - 15 - 2018/06/11 11:24:24
TCP3 - 14 - 2018/06/11 11:24:27
TCP1 - 2 - 2018/06/11 11:24:26
UDP2 - 15 - 2018/06/11 11:35:32
TCP3 - 13 - 2018/06/11 11:45:23
TCP3 - 14 - 2018/06/11 11:54:22
TCP3 - 30 - 2018/06/11 11:55:33
I would like to use SQL and standard deviation to filter out the outliers for each port on 2018/06/11.
Expected result:
TCP1 - 2 - 2018/06/11 11:24:26
TCP3 - 30 - 2018/06/11 11:55:33
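I did some research and found that standard deviation can help filter out outliers. However, I do not know how to write the SQL query to do this. Any help would be appreciated.
(This is the closest thread I could find on the topic: Using BigQuery to find outliers with standard deviation results combined with WHERE clause)
The example below is for BigQuery Standard SQL: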
#standardSQL
WITH stats AS (
  SELECT DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime)) dt,
    AVG(qty) - 1.5 * STDDEV(qty) down,
    AVG(qty) + 1.5 * STDDEV(qty) up
  FROM `project.dataset.table`
  GROUP BY dt
)
SELECT port, qty, datetime
FROM `project.dataset.table`
JOIN stats
ON dt = DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime))
WHERE NOT qty BETWEEN down AND up
You can test and play with the above query using the dummy data from your question:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'TCP1' port, 13 qty, '2018/06/11 11:20:23' datetime UNION ALL
  SELECT 'UDP2', 15, '2018/06/11 11:24:24' UNION ALL
  SELECT 'TCP3', 14, '2018/06/11 11:24:27' UNION ALL
  SELECT 'TCP1', 2 , '2018/06/11 11:24:26' UNION ALL
  SELECT 'UDP2', 15, '2018/06/11 11:35:32' UNION ALL
  SELECT 'TCP3', 13, '2018/06/11 11:45:23' UNION ALL
  SELECT 'TCP3', 14, '2018/06/11 11:54:22' UNION ALL
  SELECT 'TCP3', 30, '2018/06/11 11:55:33'
), stats AS (
  SELECT DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime)) dt,
    AVG(qty) - 1.5 * STDDEV(qty) down,
    AVG(qty) + 1.5 * STDDEV(qty) up
  FROM `project.dataset.table`
  GROUP BY dt
)
SELECT port, qty, datetime
FROM `project.dataset.table`
JOIN stats
ON dt = DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime))
WHERE NOT qty BETWEEN down AND up
The result is:
Row port qty datetime
1 TCP1 2 2018/06/11 11:24:26
2 TCP3 30 2018/06/11 11:55:33
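Note that the query above computes one pair of thresholds per day across all ports (for the sample data: average 14.5, sample standard deviation of about 7.58, so roughly 3.13 to 25.87). If the thresholds should instead be computed separately for each port, a possible variant is sketched below. This is an assumption about what "for each port" means, not part of the original answer; the CTE names `parsed` and `scored` and the columns `avg_qty` and `sd_qty` are made up for illustration.

```sql
#standardSQL
-- Sketch only: per-port, per-day thresholds using analytic (window) functions.
WITH `project.dataset.table` AS (
  SELECT 'TCP1' port, 13 qty, '2018/06/11 11:20:23' datetime UNION ALL
  SELECT 'UDP2', 15, '2018/06/11 11:24:24' UNION ALL
  SELECT 'TCP3', 14, '2018/06/11 11:24:27' UNION ALL
  SELECT 'TCP1', 2 , '2018/06/11 11:24:26' UNION ALL
  SELECT 'UDP2', 15, '2018/06/11 11:35:32' UNION ALL
  SELECT 'TCP3', 13, '2018/06/11 11:45:23' UNION ALL
  SELECT 'TCP3', 14, '2018/06/11 11:54:22' UNION ALL
  SELECT 'TCP3', 30, '2018/06/11 11:55:33'
), parsed AS (
  -- Parse the string timestamp once so the date can be reused below.
  SELECT port, qty, datetime,
    DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime)) AS dt
  FROM `project.dataset.table`
), scored AS (
  -- Attach each row's own port/day average and sample standard deviation.
  SELECT port, qty, datetime,
    AVG(qty) OVER (PARTITION BY port, dt) AS avg_qty,
    STDDEV(qty) OVER (PARTITION BY port, dt) AS sd_qty
  FROM parsed
)
SELECT port, qty, datetime
FROM scored
-- A port/day group with a single row has a NULL standard deviation,
-- so the predicate is NULL and that row is not reported.
WHERE NOT qty BETWEEN avg_qty - 1.5 * sd_qty AND avg_qty + 1.5 * sd_qty
```

With only two to four rows per port, this variant flags nothing on the sample data: TCP3, for example, has an average of 17.75 and a sample standard deviation of about 8.18, which puts the upper bound at about 30.02, just above the value 30. Per-port thresholds are likely to be meaningful only when each port has many more observations, and the 1.5 multiplier may need tuning for real data.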