How can I group data by id and get the median value using SQL?
I have a table containing the opening hours of a given store over several days, as shown below (OPENING_HOUR is in 24-hour format, so all times in the table are AM).
>>> BUSINESS_HOURS
DATE | STORE_ID | OPENING_HOUR
________________________________________
0 2021-06-01 | 222 | 11
1 2021-06-02 | 222 | 11
2 2021-06-03 | 222 | 11
3 2021-06-04 | 222 | 11
4 2021-06-05 | 222 | 11
5 2021-06-06 | 222 | 11
6 2021-06-07 | 222 | 12
7 2021-06-08 | 222 | 11
8 2021-06-09 | 222 | 11
9 2021-06-10 | 222 | 12
Now I need to group the data by id and determine which opening_hour occurs most often. In this example, 11 AM appears in 80% of the rows, so I need something like this:
>>> DATA_GROUPED
STORE_ID | OPENING_HOUR | FREQUENCY
________________________________________
0 222 | 11 | 0.8
Is this possible using only SQL? Thanks for your help, guys!
You can use window functions:
select store_id, opening_hour, count(*) as cnt,
       count(*) * 1.0 / sum(count(*)) over () as ratio
from t
where store_id = 222  -- the store id used in the sample data
group by store_id, opening_hour
order by cnt desc
limit 1;
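To try the single-store query without a warehouse account, here is a minimal sketch that runs the same shape of query against an in-memory SQLite database from Python. The table name `business_hours`, the column types, and the helper `most_frequent_hour` are assumptions made for this illustration. The original answer uses the placeholder table `t`.

```python
import sqlite3

# Sample data from the question: ten days for store 222.
HOURS = [11, 11, 11, 11, 11, 11, 12, 11, 11, 12]
ROWS = [(f"2021-06-{day:02d}", 222, hour) for day, hour in enumerate(HOURS, start=1)]

# Same shape as the single-store query; `business_hours` is an assumed table name.
QUERY = """
    SELECT store_id, opening_hour, COUNT(*) AS cnt,
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS ratio
    FROM business_hours
    WHERE store_id = ?
    GROUP BY store_id, opening_hour
    ORDER BY cnt DESC
    LIMIT 1
"""


def most_frequent_hour(store_id):
    """Return (store_id, opening_hour, cnt, ratio) for the most common hour, or None."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE business_hours (date TEXT, store_id INTEGER, opening_hour INTEGER)"
        )
        conn.executemany("INSERT INTO business_hours VALUES (?, ?, ?)", ROWS)
        return conn.execute(QUERY, (store_id,)).fetchone()
    finally:
        conn.close()


print(most_frequent_hour(222))
```

For store 222 this should return opening hour 11 with a count of 8 and a ratio of 0.8. A store id that is not in the table yields no row at all. The sketch needs an SQLite build with window-function support (3.25 or later).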
If you want this for all stores, you can use window functions:
select t.* except (seqnum)
from (select store_id, opening_hour, count(*) as cnt,
             -- partition by store_id so each ratio is relative to that store's own rows
             count(*) * 1.0 / sum(count(*)) over (partition by store_id) as ratio,
             row_number() over (partition by store_id order by count(*) desc) as seqnum
      from t
      group by store_id, opening_hour
     ) t
where seqnum = 1;
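The per-store version is easiest to check with more than one store. The sketch below is one way to do that in SQLite from Python, under a few assumptions: the table is called `business_hours`, store 333 and its hours are invented for the example, and the grouping is split into CTEs instead of BigQuery's `select * except`, which SQLite does not have. The ratio is divided by a per-store total, so each store's frequency is relative to its own rows.

```python
import sqlite3

# Store 222 is the sample from the question; store 333 is made up for illustration.
ROWS = (
    [(222, hour) for hour in [11, 11, 11, 11, 11, 11, 12, 11, 11, 12]]
    + [(333, hour) for hour in [9, 9, 10, 9]]
)

QUERY = """
    WITH counts AS (
        -- one row per (store, hour) with its number of occurrences
        SELECT store_id, opening_hour, COUNT(*) AS cnt
        FROM business_hours
        GROUP BY store_id, opening_hour
    ),
    ranked AS (
        -- share within the store, plus a rank by descending count
        SELECT store_id, opening_hour,
               cnt * 1.0 / SUM(cnt) OVER (PARTITION BY store_id) AS ratio,
               ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY cnt DESC) AS seqnum
        FROM counts
    )
    SELECT store_id, opening_hour, ratio
    FROM ranked
    WHERE seqnum = 1
    ORDER BY store_id
"""


def top_hour_per_store(rows):
    """Return one (store_id, opening_hour, ratio) row per store."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE business_hours (store_id INTEGER, opening_hour INTEGER)")
        conn.executemany("INSERT INTO business_hours VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


for row in top_hour_per_store(ROWS):
    print(row)
```

Note that `ROW_NUMBER()` picks an arbitrary winner when two hours are tied within a store. If ties matter, adding a tiebreaker to the `ORDER BY` is one option.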
I found a way to do it using window functions and a CTE.
WITH Q1 AS (
    SELECT
        DISTINCT STORE_ID,
        OPENING_HOUR,
        COUNT(OPENING_HOUR) AMOUNT,
        ROW_NUMBER() OVER(PARTITION BY STORE_ID ORDER BY COUNT(OPENING_HOUR) DESC) as RANK
    FROM T1
    GROUP BY 1, 2
)
SELECT
    STORE_ID,
    OPENING_HOUR,
    ROUND((AMOUNT/SUM(AMOUNT) OVER(PARTITION BY STORE_ID)),2) AS SHARE
FROM Q1
-- WHERE RANK = 1
Not the shortest answer, but it works!
Here is one solution using window functions:
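One detail about the commented-out `WHERE RANK = 1`: window functions are evaluated after the `WHERE` clause, so filtering in the same `SELECT` that computes the share leaves only one row per store in the window, and the share comes out as 1. The sketch below shows the difference in SQLite from Python. The table name `business_hours`, the column alias `rnk`, and the helper `run` are assumptions made for this illustration, not part of the answer.

```python
import sqlite3

HOURS = [11, 11, 11, 11, 11, 11, 12, 11, 11, 12]  # store 222 from the question

# Shared CTEs: per-hour counts plus a rank within each store.
Q1 = """
    WITH counts AS (
        SELECT store_id, opening_hour, COUNT(*) AS amount
        FROM business_hours
        GROUP BY store_id, opening_hour
    ),
    q1 AS (
        SELECT store_id, opening_hour, amount,
               ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY amount DESC) AS rnk
        FROM counts
    )
"""

# Filter and window in the same SELECT: SUM only sees the row that survives WHERE.
FILTER_FIRST = Q1 + """
    SELECT store_id, opening_hour,
           ROUND(amount * 1.0 / SUM(amount) OVER (PARTITION BY store_id), 2) AS share
    FROM q1
    WHERE rnk = 1
"""

# Compute the share over all rows first, then filter in an outer query.
SHARE_FIRST = Q1 + """
    , shares AS (
        SELECT store_id, opening_hour, rnk,
               ROUND(amount * 1.0 / SUM(amount) OVER (PARTITION BY store_id), 2) AS share
        FROM q1
    )
    SELECT store_id, opening_hour, share
    FROM shares
    WHERE rnk = 1
"""


def run(query):
    """Load the sample rows into a fresh in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE business_hours (store_id INTEGER, opening_hour INTEGER)")
        conn.executemany(
            "INSERT INTO business_hours VALUES (?, ?)", [(222, hour) for hour in HOURS]
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


print("filter in the same SELECT:", run(FILTER_FIRST))
print("share computed before filtering:", run(SHARE_FIRST))
```

Computing the share in its own CTE and filtering afterwards keeps the denominator at the store's full row count, which is the behaviour the question asks for.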
WITH business_hours as (
    SELECT DATE("2021-06-01") as date, 222 as store_id, 11 as opening_hour
    UNION ALL
    SELECT "2021-06-02", 222, 11
    UNION ALL
    SELECT "2021-06-03", 222, 11
    UNION ALL
    SELECT "2021-06-04", 222, 11
    UNION ALL
    SELECT "2021-06-05", 222, 11
    UNION ALL
    SELECT "2021-06-06", 222, 11
    UNION ALL
    SELECT "2021-06-07", 222, 12
    UNION ALL
    SELECT "2021-06-08", 222, 11
    UNION ALL
    SELECT "2021-06-09", 222, 11
    UNION ALL
    SELECT "2021-06-10", 222, 12
),
agg as (
    -- store_id is included in both partitions so counts stay per store and per month
    SELECT DISTINCT store_id, opening_hour,
        COUNT(store_id) OVER (partition by store_id, opening_hour, EXTRACT(MONTH FROM date)) as total_open_per_hour,
        COUNT(store_id) OVER (partition by store_id, EXTRACT(MONTH FROM date)) as total_open
    from business_hours
)
SELECT store_id, opening_hour, safe_divide(total_open_per_hour, total_open) frequency
FROM agg
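Result: two rows for store 222, with a frequency of 0.8 for opening hour 11 and 0.2 for opening hour 12.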
Consider the approach below:
select * from (
  select distinct store_id, opening_hour,
    -- count per store so the frequency is relative to each store's own rows
    count(1) over(partition by store_id, opening_hour) / count(1) over(partition by store_id) frequency
  from business_hours
)
where true
qualify row_number() over(partition by store_id order by frequency desc) = 1
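This gives you the most frequent opening_hour for each store.
If applied to the sample data in your question, the output is a single row: store_id 222, opening_hour 11, frequency 0.8.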