Complex Db2/SQL query with time sampling, grouping, mapping, join, and CSV export

I have data in a table (named TESTING) on a dashDB instance on IBM Bluemix (Db2 Warehouse on Cloud) that looks like this:

ID     TIMESTAMP                  NAME     VALUE
abc    2017-12-21 19:55:38.762    test1    123
abc    2017-12-21 19:55:42.762    test2    456
abc    2017-12-21 19:57:38.762    test1    789
abc    2017-12-21 19:58:38.762    test3    345
def    2017-12-21 19:59:38.762    test1    678

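I am looking for a query that:

  1. Samples the data (for each NAME) onto a given time grid (e.g. 1-minute timestamps)
  2. Averages the VALUEs that fall into the same time bucket (the same minute); empty buckets should be NULL

For 1. and 2., something like this works (but only for a single NAME):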

    with dummy(temporaer) as (
      select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
      union all
      select temporaer + 1 MINUTES from dummy where temporaer < TIMESTAMP('2018-02-01')
    )
    select temporaer, avg(VALUE) as test1 from dummy
    LEFT OUTER JOIN TESTING ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='test1'
    group by temporaer
    ORDER BY temporaer ASC;

  3. Joins all distinct NAMEs column-wise into a matrix, e.g.:

    TIMESTAMP               test1    test2    test3
    2017-12-01 00:00:00     null     null     null
    ...
    2017-12-21 19:55:00     123      456      null
    2017-12-21 19:56:00     null     null     null
    2017-12-21 19:57:00     789      null     null
    2017-12-21 19:58:00     678      null     345
    ...
    2018-01-31 23:59:00     null     null     null

  4. Exports the query result as CSV, or returns it as a CSV string

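Does anyone know how to do this in a single query, or otherwise in a simple and fast way? Or is it necessary to store the data in a different table format? Any hint would be appreciated.

Here is a code snippet that does the job, but it takes a very long time: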

WITH
    -- get all distinct names in table:
    header(names) AS (SELECT DISTINCT name
                      FROM FIELDTEST
                      WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$') AND DATE(TIMESTAMP)>='2017-12-19' AND DATE(TIMESTAMP)<'2017-12-24'),

    -- select the numeric data (names, values) from the table, truncating each timestamp to the coarser time interval (here: minutes):
    dummie(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
                                    FROM FIELDTEST
                                    WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$')),

    -- generate a range of times from date to date in defined steps:
    dummy(time, rangeEnd) AS (SELECT a, a + 1 MINUTE
                                           FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
                                           UNION ALL
                                           SELECT rangeEnd, rangeEnd + 1 MINUTE
                                           FROM dummy
                                           WHERE rangeEnd < TIMESTAMP('2017-12-24')),

    -- add each name (from header) to each time/row (in dummy):
    dumpy(time, names) AS (SELECT Dummy.time, Header.names
                                FROM Dummy
                                LEFT OUTER JOIN Header
                                ON Dummy.time IS NOT NULL),

    -- averages values by name and timeinterval and sorts result to dummy:
    dummj(time, names, avgvalues) AS (SELECT Dummy.time, Dummie.names, AVG(Dummie.values)
                                      FROM Dummy
                                      LEFT OUTER JOIN Dummie
                                      ON Dummie.time = Dummy.time
                                      GROUP BY Dummie.names, Dummy.time),

    -- joins the averages (by time, name) values to the times and names in dumpy (on empty value use -9999):
    testo(time, names, avgvalues) AS (SELECT Dumpy.time, Dumpy.names, COALESCE(Dummj.avgvalues,-9999)
                                    FROM Dumpy
                                    LEFT OUTER JOIN Dummj
                                    ON Dummj.time = Dumpy.time AND Dummj.names = Dumpy.names),

    -- converts the high amount of rows to less rows with delimited strings:
    test(time, names, avgvalues) AS (SELECT time, LISTAGG(names,';') WITHIN GROUP(ORDER BY names), LISTAGG(avgvalues,';') WITHIN GROUP(ORDER BY names)
                                     FROM Testo
                                     GROUP BY time)

SELECT * FROM test ORDER BY time ASC, names ASC;

The performance problem is in the "testo" subquery. Does anyone know what is going wrong here, or how to improve the query?

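Well, one problem I see is that you keep applying functions to columns (for example DATE(TIMESTAMP) and date_trunc('minute', TIMESTAMP)), which generally prevents the optimizer from using an index on those columns, although if id is fairly selective that shouldn't hurt too much. If this query is run often, it may also be worth building and indexing a permanent range table. You will probably need a couple of indexes (starting with FieldTest.id), but you could also try this version: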

-- let's name things properly, too, to keep them straight.
WITH
    -- generate a range of times from date to date in defined steps:
   Range (rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
                                    FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
                                    UNION ALL
                                    SELECT rangeEnd, rangeEnd + 1 MINUTE
                                    FROM Range
                                    WHERE rangeEnd < TIMESTAMP('2017-12-24')),
    -- get all distinct names in table:
    Header(names) AS (SELECT DISTINCT name
                      FROM FieldTest
                      WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54' 
                            -- just make the white space check part of the regex
                            AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$') 
                            AND timestamp >= TIMESTAMP('2017-12-19')
                            AND timestamp < TIMESTAMP('2017-12-24')),
    -- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
    Data (rangeStart, name, averaged) AS (SELECT Range.rangeStart, Header.names, COALESCE(AVG(FieldTest.value), -9999)
                                          FROM Range
                                          CROSS JOIN Header
                                          LEFT JOIN FieldTest
                                                 ON FieldTest.id = '7b9bbe44d45d8f2ac324849a4951da54'
                                                    AND FieldTest.name = Header.names
                                                    AND FieldTest.timestamp >= Range.rangeStart
                                                    AND FieldTest.timestamp < Range.rangeEnd
                                          GROUP BY Range.rangeStart, Header.names)
-- I can't recall if DB2 allows using the new column name this way, you may need to wrap this again
SELECT rangeStart,
       -- collapses the many rows into fewer rows of delimited strings:
       LISTAGG(name, ';') WITHIN GROUP(ORDER BY name) AS names,
       LISTAGG(averaged, ';') WITHIN GROUP(ORDER BY name) AS avgvalues
FROM Data
GROUP BY rangeStart
ORDER BY rangeStart, names

(Untested)
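
To see the shape of this approach on the small sample from the question, here is a minimal, runnable sketch. It is a stand-in under stated assumptions, not the Db2 query itself: it runs on Python's built-in sqlite3 module, so `datetime(..., '+1 minute')` takes the place of Db2's `+ 1 MINUTE`, the timestamp column is renamed `TS` and stored as text, and the time window is shortened to four minutes.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Db2 (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TESTING (ID TEXT, TS TEXT, NAME TEXT, VALUE REAL);
INSERT INTO TESTING VALUES
  ('abc', '2017-12-21 19:55:38.762', 'test1', 123),
  ('abc', '2017-12-21 19:55:42.762', 'test2', 456),
  ('abc', '2017-12-21 19:57:38.762', 'test1', 789),
  ('abc', '2017-12-21 19:58:38.762', 'test3', 345),
  ('def', '2017-12-21 19:59:38.762', 'test1', 678);
""")

QUERY = """
WITH RECURSIVE
  -- one row per minute: [rangeStart, rangeEnd)
  TimeRange(rangeStart, rangeEnd) AS (
    SELECT '2017-12-21 19:55:00', datetime('2017-12-21 19:55:00', '+1 minute')
    UNION ALL
    SELECT rangeEnd, datetime(rangeEnd, '+1 minute')
    FROM TimeRange
    WHERE rangeEnd < '2017-12-21 19:59:00'
  ),
  -- all distinct names for the chosen ID
  Header(name) AS (
    SELECT DISTINCT NAME FROM TESTING WHERE ID = 'abc'
  )
SELECT r.rangeStart, h.name, AVG(t.VALUE) AS averaged
FROM TimeRange AS r
CROSS JOIN Header AS h            -- every (minute, name) pair
LEFT JOIN TESTING AS t            -- range comparison, no function on t.TS
       ON t.ID = 'abc'
      AND t.NAME = h.name
      AND t.TS >= r.rangeStart
      AND t.TS <  r.rangeEnd
GROUP BY r.rangeStart, h.name
ORDER BY r.rangeStart, h.name
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each minute appears once per name, and pairs with no matching data carry NULL (`None` in Python), which is the long-format input the LISTAGG step then collapses.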

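The CROSS JOIN was definitely a great hint. However, I could not implement the subsequent LEFT JOIN the way you suggested. I found a workaround that I am sure still has room for improvement, but it is acceptable for me at the moment (it runs about 30 times faster than my first solution). Here is the actual code: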

WITH
    -- generate a range of times from date to date in defined steps:
    TimeRange(rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
                                                            FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
                                                            UNION ALL
                                                            SELECT rangeEnd, rangeEnd + 1 MINUTE
                                                            FROM TimeRange
                                                            WHERE rangeEnd < TIMESTAMP('2017-12-24')),

    -- get all distinct names in table:
    Header(names) AS (SELECT DISTINCT name
                                     FROM FIELDTEST
                                     WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
                                                   AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$') 
                                                   AND timestamp >= TIMESTAMP('2017-12-19')
                                                   AND timestamp < TIMESTAMP('2017-12-24')),

    -- select the numeric data (names, values) from the table, truncating each timestamp to the coarser time interval (here: minutes):
    rawData(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
                                     FROM FIELDTEST
                                     WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
                                        AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')),

    -- average the values per time interval and name; empty combinations get the placeholder -9999:
    Data(rangeStart, name, averaged) AS (SELECT TimeRange.rangeStart, Header.names, COALESCE(AVG(rawData.values), -9999)
                                         FROM TimeRange
                                         CROSS JOIN Header
                                         LEFT JOIN rawData
                                            ON rawData.names = Header.names
                                                AND rawData.time = TimeRange.rangeStart
                                         GROUP BY TimeRange.rangeStart, Header.names),

    test(time, names, avgvalues) AS (SELECT Data.rangeStart,
                                            LISTAGG(Data.name,';') WITHIN GROUP(ORDER BY name),
                                            LISTAGG(Data.averaged,';') WITHIN GROUP(ORDER BY name)
                                    FROM Data
                                    GROUP BY Data.rangeStart)

-- build my own delimited export-string:
SELECT CONCAT(CONCAT(SUBSTR(REPLACE(time,'.',':'),1,19),';'), REPLACE(CAST(avgvalues AS VARCHAR(3980)),'-9999',''))
FROM test
UNION ALL
SELECT CONCAT(CAST('TIME;' AS VARCHAR(5)), CAST(LISTAGG(names,';') WITHIN GROUP(ORDER BY names) AS VARCHAR(3980)))
FROM Header;
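
If a client program is an option, the delimited string could also be assembled outside the database, which avoids the -9999 placeholder and the UNION ALL for the header line (a result without ORDER BY has no guaranteed row order). The following is only a sketch under that assumption: `rows_to_csv` is a hypothetical helper, and the sample rows stand in for the (time, name, average) rows a query like the `Data` step would return.

```python
import csv
import io


def rows_to_csv(rows, delimiter=";"):
    """Pivot (time, name, value) rows into one delimited line per time bucket."""
    names = sorted({name for _, name, _ in rows})
    table = {}
    for time, name, value in rows:
        table.setdefault(time, {})[name] = value

    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["TIME", *names])          # header line first
    for time in sorted(table):                 # then the buckets in time order
        cells = [table[time].get(name) for name in names]
        # NULL (None) becomes an empty field
        writer.writerow([time] + ["" if c is None else c for c in cells])
    return buf.getvalue()


# Illustrative rows, shaped like the long-format query result.
sample = [
    ("2017-12-21 19:55:00", "test1", 123.0),
    ("2017-12-21 19:55:00", "test2", 456.0),
    ("2017-12-21 19:55:00", "test3", None),
    ("2017-12-21 19:56:00", "test1", None),
    ("2017-12-21 19:56:00", "test2", None),
    ("2017-12-21 19:56:00", "test3", None),
]

csv_text = rows_to_csv(sample)
print(csv_text)
```

The trade-off is that more rows cross the wire (one per time and name instead of one per time), so for very long ranges the in-database LISTAGG variant may still be preferable.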