SQL - '1' IF hour in month EXISTS, '0' IF NOT EXISTS

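I have a table aggregated to the hour level, YYYYMMDDHH. The data is aggregated and loaded by an external process that I have no control over. I want to test the data on a monthly basis.

The question I am trying to answer is: does every hour in the month exist?

I want to produce output that returns 1 if the hour exists or 0 if it does not.

The aggregation table looks something like this...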

YYYYMM  YYYYMMDD    YYYYMMDDHH  DATA_AGG
201911  20191101    2019110100  100
201911  20191101    2019110101  125
201911  20191101    2019110103  135
201911  20191101    2019110105  95
…   …   …   …
201911  20191130    2019113020  100
201911  20191130    2019113021  110
201911  20191130    2019113022  125
201911  20191130    2019113023  135

and is defined as...

CREATE TABLE YYYYMMDDHH_DATA_AGG (
    YYYYMM      VARCHAR,
    YYYYMMDD    VARCHAR,
    YYYYMMDDHH  VARCHAR,
    DATA_AGG    INT
);

I am looking to produce the following...

YYYYMMDDHH     HOUR_EXISTS
2019110100     1
2019110101     1
2019110102     0
2019110103     1
2019110104     0
2019110105     1
...            ...

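In the example above, two hours do not exist: 2019110102 and 2019110104.

I assume I would have to join the aggregation table against a computed table that contains all the YYYYMMDDHH combinations?

The database is Snowflake, but I assume most generic ANSI SQL queries will work.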

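The following might help get you started. I'm guessing you also want 'synthetic' [YYYYMMDD] values? Otherwise, if a value doesn't exist, it shouldn't appear in the list. Note that this is written in T-SQL (SQL Server), so it would need adapting for Snowflake.

DROP TABLE IF EXISTS #_hours
DROP TABLE IF EXISTS #_temp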

--Populate a table with hours ranging from 00 to 23
CREATE TABLE #_hours ([hour_value] VARCHAR(2))
DECLARE @_i INT = 0
WHILE (@_i < 24)
    BEGIN
        INSERT INTO #_hours
        SELECT FORMAT(@_i, '0#')
        SET @_i += 1
    END

-- Replicate OP's sample data set
CREATE TABLE #_temp (
    [YYYYMM] INTEGER
    ,   [YYYYMMDD] INTEGER
    ,   [YYYYMMDDHH] INTEGER
    ,   [DATA_AGG] INTEGER
)
INSERT INTO #_temp
VALUES 
(201911, 20191101, 2019110100, 100),
(201911, 20191101, 2019110101, 125),
(201911, 20191101, 2019110103, 135),
(201911, 20191101, 2019110105, 95),
(201911, 20191130, 2019113020, 100),
(201911, 20191130, 2019113021, 110),
(201911, 20191130, 2019113022, 125),
(201911, 20191130, 2019113023, 135)



SELECT X.YYYYMM, X.YYYYMMDD, X.YYYYMMDDHH
    -- Case: If 'target_hours' doesn't exist, then 0, else 1
,   CASE WHEN X.target_hours IS NULL THEN '0' ELSE '1' END AS [HOUR_EXISTS]
FROM (
    -- Select right 2 characters from converted [YYYYMMDDHH] to act as 'target values'
    SELECT T.*
    ,   RIGHT(CAST(T.[YYYYMMDDHH] AS VARCHAR(10)), 2) AS [target_hours]
    FROM #_temp AS T
) AS X
-- Right join to keep all of our hours and only the target hours that match.
RIGHT JOIN #_hours AS H ON H.hour_value = X.target_hours

Sample output:

YYYYMM  YYYYMMDD    YYYYMMDDHH  HOUR_EXISTS
201911  20191101    2019110100  1
201911  20191101    2019110101  1
NULL    NULL        NULL        0
201911  20191101    2019110103  1
NULL    NULL        NULL        0
201911  20191101    2019110105  1
NULL    NULL        NULL        0

With (almost) standard SQL, you can cross join the distinct values of YYYYMMDD to a list of all possible hours and then left join the result to the table:

select concat(d.YYYYMMDD, h.hour) as YYYYMMDDHH,
  case when t.YYYYMMDDHH is null then 0 else 1 end as hour_exists
from (select distinct YYYYMMDD from tablename) as d
cross join (
  select '00' as hour union all select '01' union all
  select '02' union all select '03' union all     
  select '04' union all select '05' union all
  select '06' union all select '07' union all
  select '08' union all select '09' union all
  select '10' union all select '11' union all
  select '12' union all select '13' union all
  select '14' union all select '15' union all
  select '16' union all select '17' union all
  select '18' union all select '19' union all
  select '20' union all select '21' union all
  select '22' union all select '23'
) as h 
left join tablename as t
on concat(d.YYYYMMDD, h.hour) = t.YYYYMMDDHH
order by concat(d.YYYYMMDD, h.hour)

Maybe in Snowflake you can build the list of hours more easily with a sequence instead of all those UNION ALLs.
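As a rough sketch of that idea, the 24 hour strings could be produced with Snowflake's GENERATOR table function and used in place of the UNION ALL subquery. This assumes Snowflake; the column alias hour simply mirrors the query above, and ROW_NUMBER() is used rather than SEQ4() alone because SEQ4() is not guaranteed to be gap-free.

```sql
-- Sketch only (Snowflake): one row per hour of the day, '00' through '23'.
-- ROW_NUMBER() starts at 1, so subtract 1 and left-pad to two characters.
select lpad(to_varchar(row_number() over (order by seq4()) - 1), 2, '0') as hour
from table(generator(rowcount => 24));
```

Dropping this in as the derived table h should leave the rest of the cross join and left join unchanged.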

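You can get what you want with a recursive CTE.

The recursive CTE generates the list of possible hours. A simple left outer join then gives you the flag telling you whether you have any records matching each hour.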

WITH RECURSIVE CTE (YYYYMMDDHH) as
(
SELECT YYYYMMDDHH
FROM YYYYMMDDHH_DATA_AGG
WHERE YYYYMMDDHH = (SELECT MIN(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG)

UNION ALL 

SELECT TO_VARCHAR(DATEADD(HOUR, 1, TO_TIMESTAMP(C.YYYYMMDDHH, 'YYYYMMDDHH')), 'YYYYMMDDHH') YYYYMMDDHH
FROM CTE C
WHERE  TO_VARCHAR(DATEADD(HOUR, 1, TO_TIMESTAMP(C.YYYYMMDDHH, 'YYYYMMDDHH')), 'YYYYMMDDHH') <= (SELECT MAX(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG)
)

SELECT 
    C.YYYYMMDDHH,
    IFF(A.YYYYMMDDHH IS NOT NULL, 1, 0) HOUR_EXISTS
FROM CTE C
LEFT OUTER JOIN YYYYMMDDHH_DATA_AGG A
    ON C.YYYYMMDDHH = A.YYYYMMDDHH;

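If your time range is too long, you will run into problems with too much CTE recursion. In that case you can create a table or temporary table that holds all the possible hours. For example: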

CREATE OR REPLACE TEMPORARY TABLE HOURS (YYYYMMDDHH VARCHAR) AS
SELECT TO_VARCHAR(DATEADD(HOUR, SEQ4(), TO_TIMESTAMP((SELECT MIN(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG), 'YYYYMMDDHH')), 'YYYYMMDDHH')
  FROM TABLE(GENERATOR(ROWCOUNT => 10000)) V 
  ORDER BY 1;

SELECT 
    H.YYYYMMDDHH,
    IFF(A.YYYYMMDDHH IS NOT NULL, 1, 0) HOUR_EXISTS
FROM HOURS H
LEFT OUTER JOIN YYYYMMDDHH_DATA_AGG A
    ON H.YYYYMMDDHH = A.YYYYMMDDHH
WHERE H.YYYYMMDDHH <= (SELECT MAX(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG);

You can then fiddle with the generator row count to make sure you have enough hours.
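One way to pick that row count, offered as a sketch rather than a tested recipe, is to measure the span between the earliest and latest loaded hour first. The alias HOURS_NEEDED is made up for illustration, and the format string uses HH24, Snowflake's documented 24-hour element, which is an assumption that departs from the 'YYYYMMDDHH' format used above.

```sql
-- Sketch only (Snowflake): number of hourly slots between the first and
-- last loaded hour, inclusive. MIN/MAX on the fixed-width VARCHAR column
-- sort the same way as the timestamps they encode.
SELECT DATEDIFF(
         HOUR,
         TO_TIMESTAMP(MIN(YYYYMMDDHH), 'YYYYMMDDHH24'),
         TO_TIMESTAMP(MAX(YYYYMMDDHH), 'YYYYMMDDHH24')
       ) + 1 AS HOURS_NEEDED
FROM YYYYMMDDHH_DATA_AGG;
```

Any ROWCOUNT at or above HOURS_NEEDED should be enough, since the final query already filters out hours past the maximum.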

You can generate a table with every hour of every month and LEFT OUTER JOIN your aggregation to it:

WITH EVERY_HOUR AS (
  SELECT TO_CHAR(DATEADD(HOUR, HH, TO_DATE(YYYYMM::TEXT, 'YYYYMM')),
                 'YYYYMMDDHH')::NUMBER YYYYMMDDHH
  FROM (SELECT DISTINCT YYYYMM FROM YYYYMMDDHH_DATA_AGG) t
  CROSS JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY NULL) - 1 HH
    FROM TABLE(GENERATOR(ROWCOUNT => 745))
  ) h
  QUALIFY YYYYMMDDHH < (YYYYMM + 1) * 10000
)
SELECT h.YYYYMMDDHH, NVL2(a.YYYYMM, 1, 0) HOUR_EXISTS
FROM EVERY_HOUR h
LEFT OUTER JOIN YYYYMMDDHH_DATA_AGG a ON a.YYYYMMDDHH = h.YYYYMMDDHH
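
If only a yes/no answer per month is needed (does every hour in the month exist?), a count comparison may be enough. The following is a minimal sketch, assuming Snowflake, assuming YYYYMM is stored as VARCHAR as in the table definition, and assuming every stored YYYYMMDDHH is a valid hour within its own month. The names MONTH_COUNTS, HOURS_PRESENT, HOURS_EXPECTED and MONTH_COMPLETE are invented for illustration.

```sql
-- Sketch only (Snowflake): compare the distinct hours loaded per month
-- with the number of hours that month should contain (days * 24).
WITH MONTH_COUNTS AS (
  SELECT YYYYMM,
         COUNT(DISTINCT YYYYMMDDHH) AS HOURS_PRESENT,
         DAY(LAST_DAY(TO_DATE(YYYYMM || '01', 'YYYYMMDD'))) * 24 AS HOURS_EXPECTED
  FROM YYYYMMDDHH_DATA_AGG
  GROUP BY YYYYMM
)
SELECT YYYYMM,
       HOURS_PRESENT,
       HOURS_EXPECTED,
       IFF(HOURS_PRESENT = HOURS_EXPECTED, 1, 0) AS MONTH_COMPLETE
FROM MONTH_COUNTS
ORDER BY YYYYMM;
```

This does not say which hours are missing; the per-hour queries remain the way to find those.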

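This version covers whole spans of days, months, and years. It is a simple cross join of the set of possible dates with the possible hours of the day, left joined to the actual data.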

set first = (select min(yyyymmdd::number) from YYYYMMDDHH_DATA_AGG);
set last  = (select max(yyyymmdd::number) from YYYYMMDDHH_DATA_AGG);

with
hours as (select row_number() over (order by null) - 1 h from table(generator(rowcount=>24))),
days as  (
  select 
    row_number() over (order by null) - 1 as n,
    to_date($first::text, 'YYYYMMDD')::date + n as d,
    to_char(d, 'YYYYMMDD') as yyyymmdd
  from table(generator(rowcount=>($last-$first+1)))
)
select days.yyyymmdd || lpad(hours.h,2,0) as YYYYMMDDHH, nvl2(t.yyyymmddhh,1,0) as HOUR_EXISTS
from days cross join hours
left join YYYYMMDDHH_DATA_AGG t on t.yyyymmddhh = days.yyyymmdd || lpad(hours.h,2,0)
order by 1
;

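If you prefer, you can fold $first and $last into subqueries. Note that $last - $first + 1 subtracts YYYYMMDD numbers rather than counting days, so a range that crosses a month boundary generates extra trailing dates (all flagged 0); filter on yyyymmdd <= $last if that matters.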