Speeding up partitioning query on ancient SQL Server version

The setup

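I'm having performance and conceptual problems getting a query right on SQL Server 7, running on a dual-core 2GHz machine with 2GB RAM - and, as you might expect, there is no chance of getting that changed :-/.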

The situation

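I'm working with a legacy database and I need to mine the data to get various insights. I have the all_stats table, which contains all the stats data for a thingy in a specific context. These contexts are grouped with the help of the group_contexts table. A simplified schema: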

+--------------------------------------------------------------------+
| thingies                                                           |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| all_stats                                                          |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
| context_id  | INT FOREIGN KEY REFERENCES contexts(id)              |
| value       | FLOAT NULL                                           |
| some_date   | DATETIME NOT NULL                                    |
| thingy_id   | INT NOT NULL FOREIGN KEY REFERENCES thingies(id)     |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| group_contexts                                                     |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
| group_id    | INT NOT NULL FOREIGN KEY REFERENCES groups(group_id) |
| context_id  | INT NOT NULL FOREIGN KEY REFERENCES contexts(id)     |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| contexts                                                           |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| groups                                                             |
+--------------------------------------------------------------------+
| group_id    | INT PRIMARY KEY IDENTITY(1,1)                        |
+--------------------------------------------------------------------+

The problem

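The task is, for a given set of thingies, to find and aggregate the 3 most recent stats (by all_stats.some_date) of each thingy, for every group the thingy has stats in. I know it sounds easy, but I can't work out how to do this properly in SQL - I'm not exactly a prodigy.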

My bad solution (no, really, it's bad...)

My current solution is to fill a temp table with all the required data and then UNION ALL the pieces I need:

-- Before building this SQL I retrieve the relevant groups
-- so that I can build the `UNION ALL`s at the bottom.
-- I also retrieve the thingies that are relevant in this context
-- beforehand and include their ids as a comma separated list -
-- I said it would be awful ...

-- Creating the temp table holding all stats data rows
-- for a thingy in a specific group
CREATE TABLE #stats
(id INT PRIMARY KEY IDENTITY(1,1),
 group_id INT NOT NULL,
 thingy_id INT NOT NULL,
 value FLOAT NOT NULL,
 some_date DATETIME NOT NULL)

-- Filling the temp table
INSERT INTO #stats(group_id,thingy_id,value,some_date)
SELECT filtered.group_id, filtered.thingy_id, filtered.value, filtered.some_date
FROM
   (SELECT joined.group_id,joined.thingy_id,joined.value,joined.some_date
    FROM
       (SELECT groups.group_id,data.value,data.thingy_id,data.some_date
        FROM
            -- Getting the groups associated with the contexts
            -- of all the stats available
           (SELECT DISTINCT groupcontext.group_id
            FROM all_stats AS stat
            INNER JOIN group_contexts AS groupcontext
                ON groupcontext.context_id = stat.context_id
           ) AS groups
        INNER JOIN
            -- Joining the available groups with the actual
            -- stat data of the group for a thingy
           (SELECT groupcontext.group_id,stat.value,stat.some_date,stat.thingy_id
            FROM all_stats AS stat
            INNER JOIN group_contexts AS groupcontext
                ON groupcontext.context_id = stat.context_id
            WHERE stat.value IS NOT NULL
              AND stat.value >= 0) AS data
        ON data.group_id = groups.group_id) AS joined
    ) AS filtered
-- I already have the thingies beforehand but if it would be possible
-- to include/query for them in another way that'd be OK by me
WHERE filtered.thingy_id in (/* somewhere around 10000 thingies are available */)

-- Now I'm building the `UNION ALL`s for each thingy as well as
-- the group the stat of the thingy belongs to

-- thingy 42 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 982
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
   (SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
    FROM #stats AS s
    WHERE s.group_id = 982
      AND s.thingy_id = 42
    ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3

UNION ALL

-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 314159
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
   (SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
    FROM #stats AS s
    WHERE s.group_id = 314159
      AND s.thingy_id = 42
    ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
-- }

UNION ALL

-- thingy 21 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 21 in group 982
/* you get the idea */

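This works - slowly, but it works - for small data sets (say, 100 thingies with 10 stats attached to each), but the problem domain it eventually has to handle is 10000+ thingies with potentially hundreds of stats per thingy. As a side note: the generated SQL query is ridiculously large. A fairly small query involving 350 thingies that have data in 3 context groups amounts to more than 250 000 lines of SQL - and executes in a stunning 5 minutes.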

So if anyone has an idea how to solve this, I would really, really appreciate your help :-).

On your ancient SQL Server version you need to use an old-style scalar subquery to get the last three rows for all thingies in a single query :-)

SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
 (
   SELECT s.group_id,s.thingy_id,s.value
   FROM #stats AS s
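   -- keep a row only if at most 3 rows of the same group/thingy
   -- are equally recent or newer, i.e. it is among the 3 latest
   WHERE (SELECT COUNT(*) FROM #stats AS s2
          WHERE s.group_id = s2.group_id
            AND s.thingy_id = s2.thingy_id
            AND s.some_date <= s2.some_date
         ) <= 3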
 ) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
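
To see why the count filter picks the three latest rows, here is a small sketch on made-up data. The #demo table and its four rows are purely illustrative and are not part of the question's data; the group and thingy ids are borrowed from the question's example.

```sql
-- Hypothetical demo table: one thingy (42) in one group (982), four stats.
CREATE TABLE #demo
(group_id INT NOT NULL,
 thingy_id INT NOT NULL,
 value FLOAT NOT NULL,
 some_date DATETIME NOT NULL)

INSERT INTO #demo VALUES (982, 42, 10.0, '20090101')
INSERT INTO #demo VALUES (982, 42, 20.0, '20090102')
INSERT INTO #demo VALUES (982, 42, 30.0, '20090103')
INSERT INTO #demo VALUES (982, 42, 40.0, '20090104')

-- For each row, count the rows of the same group/thingy that are
-- equally recent or newer. The oldest row gets 4, the newest gets 1.
SELECT s.some_date, s.value,
       (SELECT COUNT(*) FROM #demo AS s2
        WHERE s.group_id = s2.group_id
          AND s.thingy_id = s2.thingy_id
          AND s.some_date <= s2.some_date) AS rows_as_recent
FROM #demo AS s
ORDER BY s.some_date

DROP TABLE #demo
```

Only the rows with a count of 1, 2 or 3 pass the `<= 3` filter, so the average is taken over 20, 30 and 40. One caveat: if two rows of the same thingy and group share a some_date, both get the same count, so the filter can let through more or fewer than three rows.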

To get better performance you need to add a clustered index to the #stats table, probably on (group_id,thingy_id,some_date desc,value).
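
A sketch of what that could look like, under a few assumptions. The index name ix_stats_lookup is hypothetical. The primary key is declared NONCLUSTERED here because a PRIMARY KEY otherwise claims the table's single clustered index by default. The DESC on some_date is left out because, as far as I know, descending index keys only arrived with SQL Server 2000, so check whether your version accepts it.

```sql
-- Sketch only: same columns as the question's #stats table, but the
-- primary key is NONCLUSTERED so the clustered index slot stays free.
CREATE TABLE #stats
(id INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
 group_id INT NOT NULL,
 thingy_id INT NOT NULL,
 value FLOAT NOT NULL,
 some_date DATETIME NOT NULL)

-- ix_stats_lookup is a made-up name; the key columns follow the
-- suggestion above, minus the DESC (assumed unsupported on version 7).
CREATE CLUSTERED INDEX ix_stats_lookup
    ON #stats (group_id, thingy_id, some_date, value)
```

With this index the correlated subquery can seek on group_id and thingy_id instead of scanning the whole temp table for every row.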

If (group_id,thingy_id,some_date) is unique, you should remove the useless ID column. Otherwise, ORDER BY group_id,thingy_id,some_date DESC during the INSERT/SELECT into #stats and use the ID instead of some_date to find the last three rows.
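
A rough sketch of that second variant, using the #stats table as defined in the question (with its IDENTITY id column). It assumes the identity values are handed out in the ORDER BY order of the INSERT, which is worth verifying on SQL Server 7. The insert is also simplified: the question's join against the DISTINCT list of groups looks redundant, since every row of the data subquery already carries a group_id from the same join. The thingy filter is omitted for brevity.

```sql
-- Fill #stats ordered so that, within each group/thingy, the newest
-- row receives the smallest id (assumption: ids follow the ORDER BY).
INSERT INTO #stats (group_id, thingy_id, value, some_date)
SELECT gc.group_id, st.thingy_id, st.value, st.some_date
FROM all_stats AS st
INNER JOIN group_contexts AS gc
    ON gc.context_id = st.context_id
WHERE st.value IS NOT NULL
  AND st.value >= 0
ORDER BY gc.group_id, st.thingy_id, st.some_date DESC

-- Same shape as the query above, but ranking by id instead of
-- some_date, so equal dates no longer produce equal counts.
SELECT x.group_id, x.thingy_id, AVG(x.value) AS avg_value
FROM
  (SELECT s.group_id, s.thingy_id, s.value
   FROM #stats AS s
   WHERE (SELECT COUNT(*)
          FROM #stats AS s2
          WHERE s2.group_id = s.group_id
            AND s2.thingy_id = s.thingy_id
            AND s2.id <= s.id) <= 3) AS x
GROUP BY x.group_id, x.thingy_id
HAVING COUNT(*) >= 3
```

Because the ids are unique, each group/thingy pair yields exactly three rows whenever it has at least three stats, and the HAVING clause drops pairs with fewer.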