Performance issues with UNION of large tables

Question

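I have seven large tables that can hold anywhere from 100 to 1 million rows at any given time. I'll call them LargeTable1, LargeTable2, LargeTable3, LargeTable4 ... LargeTable7. These tables are mostly static: there are no updates and no new inserts. They change only once every two weeks or once a month, when they are truncated and a new batch of records is inserted into each of them.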

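All of these tables have three fields in common: Headquarter, Country and File. Headquarter and Country are numbers in the format '000', although in two of these tables they are stored as int because some other system needs them that way.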

I also have another, much smaller table called Headquarters, which holds the information about each headquarter. This table has very few entries: 1000 at most.

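Now I need to create a stored procedure that returns all the headquarters that appear in the large tables but are either absent from the Headquarters table or have been deleted there (rows in this table are deleted logically: it has a DeletionDate field to check this).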

This is the query I've tried:

CREATE PROCEDURE deletedHeadquarters
AS
BEGIN
    DECLARE @headquartersFiles TABLE
    (
        headquarter int,
        countryFile varchar(MAX)
    );

    SET NOCOUNT ON

    INSERT INTO @headquartersFiles
    SELECT headquarter, CONCAT(country, ' (', file, ')')
    FROM
    (
        SELECT DISTINCT CONVERT(int, headquarter) as headquarter,
                        CONVERT(int, country) as country,
                        file
        FROM            LargeTable1     
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable2
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable3
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable4
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable5
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable6
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable7
    ) TC

    SELECT  RIGHT('000' + CAST(st.headquarter AS VARCHAR(3)), 3) as headquarter,
            MAX(s.deletionDate) as deletionDate,
            STUFF
            (
                (SELECT DISTINCT ', ' + st2.countryFile
                FROM @headquartersFiles st2
                WHERE st2.headquarter = st.headquarter
                FOR XML PATH('')),
                1,
                1,
                ''
            ) countryFile
    FROM    @headquartersFiles as st
    LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
    WHERE   s.headquarter IS NULL
       OR   s.deletionDate IS NOT NULL
    GROUP BY st.headquarter

END

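The performance of this sp isn't good enough for our application. It currently takes about 50 seconds to complete, with the following total row counts for each table (just to give you an idea of the sizes):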

LargeTable1: 1516666 rows
LargeTable2: 645740 rows
LargeTable3: 1950121 rows
LargeTable4: 779336 rows
LargeTable5: 1100999 rows
LargeTable6: 16499 rows
LargeTable7: 24454 rows

What can I do to improve performance? I've tried the following, but none of it made much difference:

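Inserting into the local table in batches, excluding the headquarters I had already inserted, and then updating the countryFile field for the ones that are repeated
Creating a view for that UNION query
Creating indexes on the LargeTables for the headquarter field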

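I've also thought about inserting these missing headquarters into a permanent table after the LargeTables change, but the Headquarters table can change more often, and I'd rather not have to change its module to keep these things tidy and up to date. But if that's the best option, I'll go for it.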

Thanks

Answer 1

Take this filter

LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
WHERE   s.headquarter IS NULL
   OR   s.deletionDate IS NOT NULL

and add it to each individual query in the union that inserts into @headquartersFiles.

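It may look like this just creates more filters, but it will actually speed things up, because you are filtering each table before the rows are processed as a union.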

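Also take out all your DISTINCTs. It probably won't speed things up, but they look silly because you are doing a UNION and not a UNION ALL, and UNION already removes duplicates.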
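
As a rough sketch of what that could look like, here are just the first two branches of the union with the filter pushed into each one and the DISTINCTs removed. The table and column names come from the question; the lt alias, the brackets around [file] (FILE is a reserved word in T-SQL) and the CONVERT on LargeTable1 are assumptions, so adjust them to the real column types.

```sql
-- Sketch only: two of the seven branches, each filtered against headquarters
-- before the UNION combines them.
SELECT CONVERT(int, lt.headquarter) AS headquarter,
       CONVERT(int, lt.country)     AS country,
       lt.[file]
FROM   LargeTable1 AS lt
LEFT JOIN headquarters AS s
       ON CONVERT(int, s.headquarter) = CONVERT(int, lt.headquarter)
WHERE  s.headquarter IS NULL          -- headquarter is missing
   OR  s.deletionDate IS NOT NULL     -- headquarter is logically deleted
UNION                                 -- UNION removes duplicates, so no DISTINCT
SELECT lt.headquarter,
       lt.country,
       lt.[file]
FROM   LargeTable2 AS lt
LEFT JOIN headquarters AS s
       ON CONVERT(int, s.headquarter) = lt.headquarter
WHERE  s.headquarter IS NULL
   OR  s.deletionDate IS NOT NULL;
```

The remaining LargeTables would follow the same pattern. The outer query still needs its own join to headquarters if it has to return deletionDate.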

Answer 2

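I would try filtering each individual table first. You just need to account for the fact that a headquarter might appear in one table but not in another. You can do that like this: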

SELECT
    headquarter
FROM
(

    SELECT DISTINCT
        LT.headquarter,
        'table1' AS large_table
    FROM
        LargeTable1 LT
    LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE
        HQ.headquarter IS NULL OR
        HQ.deletion_date IS NOT NULL
    UNION ALL
    SELECT DISTINCT
        LT.headquarter,
        'table2' AS large_table
    FROM
        LargeTable2 LT
    LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE
        HQ.headquarter IS NULL OR
        HQ.deletion_date IS NOT NULL
    UNION ALL
    ...
) SQ
GROUP BY headquarter
HAVING COUNT(*) = 5

This will make sure that it is missing from all five tables.

Answer 3

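Table variables can perform very badly at this size, because SQL Server does not generate statistics for them. Instead of a table variable, try using a temp table, and if headquarter + country + file is unique in the temp table, add a unique clustered constraint in the temp table definition (this creates a clustered index; a plain UNIQUE constraint creates a nonclustered one by default). You can create indexes on a temp table after creating it, but for various reasons SQL Server may ignore them.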

Edit: it turns out you can in fact create indexes on table variables, even non-unique ones in SQL Server 2014 and later.

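Second, try not to use functions in joins or WHERE clauses. Doing so often causes performance problems, because an index on the wrapped column usually can't be used for a seek.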
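
A minimal sketch of the temp table idea, assuming headquarter and country fit in int and that file is a short, non-null string. The varchar(100) type, the NOT NULL constraints and the brackets around [file] (FILE is a reserved word in T-SQL) are assumptions, not something the question states.

```sql
-- Hypothetical temp table replacing the @headquartersFiles table variable.
-- The unnamed UNIQUE CLUSTERED constraint creates the clustered index and
-- avoids constraint-name collisions between sessions in tempdb.
CREATE TABLE #headquartersFiles
(
    headquarter int          NOT NULL,
    country     int          NOT NULL,
    [file]      varchar(100) NOT NULL,   -- assumed type and length
    UNIQUE CLUSTERED (headquarter, country, [file])
);

-- UNION removes duplicates across the branches, so the unique constraint holds.
INSERT INTO #headquartersFiles (headquarter, country, [file])
SELECT CONVERT(int, headquarter), CONVERT(int, country), [file]
FROM   LargeTable1
UNION
SELECT headquarter, country, [file]
FROM   LargeTable2;
-- The remaining LargeTables would be added as further UNION branches.

DROP TABLE #headquartersFiles;
```

Inside a stored procedure the temp table is dropped automatically when the procedure ends, so the explicit DROP is optional there.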

Answer 4

Filter at every step. But first, modify the headquarters table so that it has the right type for what you need, along with an index:

alter table headquarters add headquarter_int as (cast(headquarter as int));
create index idx_headquarters_int on headquarters(headquarter_int);

SELECT DISTINCT headquarter, country, file
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  -- match only active (non-deleted) rows, so missing or deleted headquarters are returned
                  WHERE s.headquarter_int = lt5.headquarter and s.deletiondate is null
                 );

Then you want an index on LargeTable5(headquarter, country, file).

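This should take less than 5 seconds to run. If so, then construct the full query, making sure that the types in the correlated subquery match and that you have the right index on each large table. Use union to remove duplicates between the tables.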
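
As a hedged sketch of how the full query might start, here is the index plus two branches. Each NOT EXISTS looks for an active (non-deleted) headquarters row, so a headquarter is returned when it is missing or logically deleted. The index name, the brackets around [file], and the assumption that LargeTable1 stores headquarter as '000' text in the same type and format as headquarters while LargeTable5 stores it as int are all guesses to be checked against the real schema.

```sql
-- Assumed covering index; repeat for each large table.
-- (This only works if [file] is not a MAX type.)
CREATE INDEX idx_LargeTable5_hq_country_file
    ON LargeTable5 (headquarter, country, [file]);

-- Text-typed branch: compare against the original text column, no function needed.
SELECT CONVERT(int, lt1.headquarter) AS headquarter,
       CONVERT(int, lt1.country)     AS country,
       lt1.[file]
FROM   LargeTable1 AS lt1
WHERE  NOT EXISTS (SELECT 1
                   FROM   headquarters AS s
                   WHERE  s.headquarter = lt1.headquarter
                     AND  s.deletiondate IS NULL)
UNION  -- removes duplicates between the tables
-- int-typed branch: compare against the indexed computed column.
SELECT lt5.headquarter,
       lt5.country,
       lt5.[file]
FROM   LargeTable5 AS lt5
WHERE  NOT EXISTS (SELECT 1
                   FROM   headquarters AS s
                   WHERE  s.headquarter_int = lt5.headquarter
                     AND  s.deletiondate IS NULL);
```

The point of the two variants is that neither side of the comparison inside the subquery is wrapped in a function, so the indexes on both tables stay usable.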

Answer 5

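The real answer is to create a separate INSERT statement for each table, taking care that the data being inserted does not already exist in the destination table.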
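
A rough sketch of that approach for the first two tables, reusing the @headquartersFiles table variable from the question. The src and t aliases, the brackets around [file] (FILE is a reserved word in T-SQL) and the choice of which table needs CONVERT are assumptions.

```sql
DECLARE @headquartersFiles TABLE
(
    headquarter int,
    countryFile varchar(MAX)
);

-- First table: the destination is empty, so DISTINCT alone is enough.
INSERT INTO @headquartersFiles (headquarter, countryFile)
SELECT DISTINCT CONVERT(int, lt.headquarter),
       CONCAT(CONVERT(int, lt.country), ' (', lt.[file], ')')
FROM   LargeTable1 AS lt;

-- Later tables: insert only the rows that are not already in the destination.
INSERT INTO @headquartersFiles (headquarter, countryFile)
SELECT DISTINCT src.headquarter, src.countryFile
FROM  (SELECT lt.headquarter,
              CONCAT(lt.country, ' (', lt.[file], ')') AS countryFile
       FROM   LargeTable2 AS lt) AS src
WHERE  NOT EXISTS (SELECT 1
                   FROM   @headquartersFiles AS t
                   WHERE  t.headquarter = src.headquarter
                     AND  t.countryFile = src.countryFile);
```

LargeTable3 through LargeTable7 would repeat the second statement with the table name swapped.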

