How to match groups of rows taking order into account in TSQL?

Question

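I have a table that stores groups of related rows; the rows of a group are related via a groupIdentifier column. A group can contain any number of rows.

I need to be able to pass in a new set of row groups and then find a mapping from the new groups to existing matching groups. The complication is that the order of the rows within a group is defined by a rowOrdinal value and must be taken into account. The rowOrdinal value is not always 0-based, but the rows in a group are sorted by that value. Also, @existingData contains thousands of potential groups, so the query needs to perform well.

Here is an example input data set: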

declare @existingData table (
    groupIdentifier int,
    rowOrdinal int,
    value varchar(1))

insert into @existingData values 
    (100, 0, 'X'),
    (100, 1, 'Y'),

    (200, 0, 'A'),
    (200, 1, 'B'),
    (200, 2, 'C'),

    (40, 0, 'X'),

    (41, 0, 'Y')


declare @newData table (
    groupIdentifier int,
    rowOrdinal int,
    value varchar(1))

insert into @newData values 
    (1, 55, 'X'),
    (1, 59, 'Y'),

    (2, 0, 'Y'),
    (2, 1, 'X')

-- @newData group 1 matches to @existingData group 100, @newData group 2 has no match in existingData

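The expected result is a result set with two columns, existingGroupIdentifier and newGroupIdentifier. In this case the only result row is 100, 1, where 100 is the @existingData groupIdentifier and 1 is the @newData groupIdentifier.

Edit: here is what I have come up with so far. Assuming a maximum group size of N, I can manually copy and paste T-SQL code that uses pivot and temp tables to compare the groups of each size. However, this limits the system to N and looks ugly, and I would prefer a way to do it in a single query if possible.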

declare @existingData table (
    groupIdentifier int,
    rowOrdinal int,
    value varchar(1))

insert into @existingData values 
    (100, 0, 'X'),
    (100, 1, 'Y'),

    (200, 0, 'A'),
    (200, 1, 'B'),
    (200, 2, 'C'),

    (40, 0, 'X'),

    (41, 0, 'Y')


declare @newData table (
    groupIdentifier int,
    rowOrdinal int,
    value varchar(1))

insert into @newData values 
    (1, 55, 'X'),
    (1, 59, 'Y'),

    (2, 0, 'Y'),
    (2, 1, 'X'),

    (3, 99, 'Y'),

    (5, 4, 'A'),
    (5, 10, 'B'),
    (5, 200, 'C')


-- First build table of the size of each group, limiting @existingData to only potentially matching groups (have at least one member in common)
declare @potentialGroupsInExistingData table (groupIdentifier int, groupSize int)

insert into @potentialGroupsInExistingData
    select
        ExistingData.groupIdentifier, COUNT(ExistingData.groupIdentifier)
    from
        @existingData ExistingData
    where
        exists (select top 1 * from @newData where value = ExistingData.value)
    group by ExistingData.groupIdentifier

declare @groupsInNewData table (groupIdentifier int, groupSize int)

insert into @groupsInNewData
    select
        NewData.groupIdentifier, COUNT(NewData.groupIdentifier)
    from
        @newData NewData
    group by NewData.groupIdentifier


-- handle groups of size one, this is a simpler case of the pivoting used with more than size 1 groups
-----------------------------------
select
    ExistingData.groupIdentifier as ExistingGroupIdentifier,
    NewData.groupIdentifier as NewGroupIdentifier
from
    @potentialGroupsInExistingData PotentialExistingGroup
    cross join @groupsInNewData GroupsInNewData
    inner join @existingData ExistingData on
        ExistingData.groupIdentifier = PotentialExistingGroup.groupIdentifier
    inner join @newData NewData on
        NewData.groupIdentifier = GroupsInNewData.groupIdentifier
        and NewData.value = ExistingData.value
where
    PotentialExistingGroup.groupSize = 1
    and GroupsInNewData.groupSize = 1


-- handle groups of size two
-----------------------------------
declare @existingGroupsOfSizeTwo table (groupIdentifier int, valueOne varchar(1), valueTwo varchar(1))

insert into @existingGroupsOfSizeTwo 
    select
        *
    from
        (select
            ExistingData.groupIdentifier,
            ExistingData.value,
            ROW_NUMBER() over (partition by ExistingData.groupIdentifier order by ExistingData.rowOrdinal desc) as ActualOrdinal
        from
            @potentialGroupsInExistingData PotentialGroup
            inner join @existingData ExistingData on
                ExistingData.groupIdentifier = PotentialGroup.groupIdentifier
        where
            PotentialGroup.groupSize = 2) as T
    pivot ( min(value) for T.ActualOrdinal in ([1], [2]) ) as p

declare @newGroupsOfSizeTwo table (groupIdentifier int, valueOne varchar(1), valueTwo varchar(1))

insert into @newGroupsOfSizeTwo
    select
        *
    from
        (select
            NewData.groupIdentifier,
            NewData.value,
            ROW_NUMBER() over (partition by NewData.groupIdentifier order by NewData.rowOrdinal desc) as ActualOrdinal
        from
            @groupsInNewData NewDataGroup
            inner join @newData NewData on
                NewData.groupIdentifier = NewDataGroup.groupIdentifier
        where
            NewDataGroup.groupSize = 2) as T
    pivot ( min(value) for T.ActualOrdinal in ([1], [2]) ) as p

select
    ExistingData.groupIdentifier as ExistingGroupIdentifier,
    NewData.groupIdentifier as NewGroupIdentifier
from
    @newGroupsOfSizeTwo NewData
    inner join @existingGroupsOfSizeTwo ExistingData on
        ExistingData.valueOne = NewData.valueOne
        and ExistingData.valueTwo = NewData.valueTwo


-- handle groups of size three
-----------------------------------
declare @existingGroupsOfSizeThree table (groupIdentifier int, valueOne varchar(1), valueTwo varchar(1), valueThree varchar(1))

insert into @existingGroupsOfSizeThree 
    select
        *
    from
        (select
            ExistingData.groupIdentifier,
            ExistingData.value,
            ROW_NUMBER() over (partition by ExistingData.groupIdentifier order by ExistingData.rowOrdinal desc) as ActualOrdinal
        from
            @potentialGroupsInExistingData PotentialGroup
            inner join @existingData ExistingData on
                ExistingData.groupIdentifier = PotentialGroup.groupIdentifier
        where
            PotentialGroup.groupSize = 3) as T
    pivot ( min(value) for T.ActualOrdinal in ([1], [2], [3]) ) as p

declare @newGroupsOfSizeThree table (groupIdentifier int, valueOne varchar(1), valueTwo varchar(1), valueThree varchar(1))

insert into @newGroupsOfSizeThree
    select
        *
    from
        (select
            NewData.groupIdentifier,
            NewData.value,
            ROW_NUMBER() over (partition by NewData.groupIdentifier order by NewData.rowOrdinal desc) as ActualOrdinal
        from
            @groupsInNewData NewDataGroup
            inner join @newData NewData on
                NewData.groupIdentifier = NewDataGroup.groupIdentifier
        where
            NewDataGroup.groupSize = 3) as T
    pivot ( min(value) for T.ActualOrdinal in ([1], [2], [3]) ) as p

select
    ExistingData.groupIdentifier as ExistingGroupIdentifier,
    NewData.groupIdentifier as NewGroupIdentifier
from
    @newGroupsOfSizeThree NewData
    inner join @existingGroupsOfSizeThree ExistingData on
        ExistingData.valueOne = NewData.valueOne
        and ExistingData.valueTwo = NewData.valueTwo
        and ExistingData.valueThree = NewData.valueThree

Answer 1

Try this:

declare @existingData table (
    groupIdentifier int,
    rowOrdinal int,
    value varchar(1))

insert into @existingData values 
    (100, 0, 'X'),
    (100, 1, 'Y'),

    (200, 0, 'A'),
    (200, 1, 'B'),
    (200, 2, 'C'),

    (40, 0, 'X'),

    (41, 0, 'Y')


declare @newData table (
    groupIdentifier int,
    rowOrdinal int,
    value varchar(1))

insert into @newData values 
    (1, 55, 'X'),
    (1, 59, 'Y'),

    (2, 0, 'Y'),
    (2, 1, 'X')

declare @results table (
    existingGID int,
    newGID int)

DECLARE @existingGroupID int
DECLARE outer_cursor CURSOR FOR
SELECT DISTINCT groupIdentifier FROM @existingData
OPEN outer_cursor
FETCH NEXT FROM outer_cursor INTO @existingGroupID
WHILE @@FETCH_STATUS = 0
BEGIN
    DECLARE @existingGroupCount int
    SELECT @existingGroupCount = COUNT(value) FROM @existingData WHERE groupIdentifier = @existingGroupID
    DECLARE @newGroupID int
    DECLARE inner_cursor CURSOR FOR
    SELECT DISTINCT groupIdentifier from @newData
    OPEN inner_cursor
    FETCH NEXT FROM inner_cursor INTO @newGroupID
    WHILE @@FETCH_STATUS = 0
    BEGIN
        DECLARE @newGroupCount int
        SELECT @newGroupCount = COUNT(value) FROM @newData WHERE groupIdentifier = @newGroupID
        -- if groups are different sizes, skip
        IF @newGroupCount = @existingGroupCount
        BEGIN
            DECLARE @newStart int = -1
            DECLARE @currentValue varchar(1)
            DECLARE @validGroup bit = 1
            DECLARE equality_cursor CURSOR FOR
            SELECT value FROM @existingData WHERE groupIdentifier = @existingGroupID ORDER BY rowOrdinal
            OPEN equality_cursor
            FETCH NEXT FROM equality_cursor INTO @currentValue
            WHILE @@FETCH_STATUS = 0
            BEGIN
                DECLARE @newValue varchar(1)
                SELECT TOP 1 @newValue = value, @newStart = rowOrdinal FROM @newData WHERE groupIdentifier = @newGroupID AND @newStart < rowOrdinal ORDER BY rowOrdinal
                IF(@newValue <> @currentValue)
                BEGIN
                    SET @validGroup = 0
                    BREAK
                END
                FETCH NEXT FROM equality_cursor INTO @currentValue
            END
            CLOSE equality_cursor
            DEALLOCATE equality_cursor
            IF @validGroup = 1
            BEGIN
                INSERT INTO @results (existingGID, newGID) VALUES (@existingGroupID, @newGroupID)
            END
        END
        FETCH NEXT FROM inner_cursor INTO @newGroupID
    END
    CLOSE inner_cursor
    DEALLOCATE inner_cursor
    FETCH NEXT FROM outer_cursor INTO @existingGroupID
END
CLOSE outer_cursor
DEALLOCATE outer_cursor

SELECT * FROM @results

I have to get going, but I will edit this later with better comments explaining what the code does.

Answer 2

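General idea

The given tables can have several rows for the same group ID. If we had a way to collapse the given tables so that each group ID ends up in one row, with all values of that group in one column, then finding all matching groups would become trivial.

If we do this transformation

@existingData -> @ExistingDataGrouped (ID, DataValues)

@newData -> @NewDataGrouped (ID, DataValues)

then the final query will look like this (note that we are joining on DataValues, not on ID):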

SELECT
    E.ID, N.ID
FROM
    @ExistingDataGrouped AS E
    INNER JOIN @NewDataGrouped AS N ON N.DataValues = E.DataValues

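How to make the grouped tables

Convert the values to XML (search for "group_concat" for SQL Server, e.g. How to make a query with group_concat in sql server).
Use a CLR implementation of a GroupConcat function with an extra parameter to specify the order. I personally used http://groupconcat.codeplex.com/, which could be a good start.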
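
A third option, not part of the original answer, is the built-in STRING_AGG aggregate. The sketch below assumes SQL Server 2017 or later, assumes rowOrdinal is unique within a group, and assumes the ',' delimiter never occurs as a value. The CTE names simply mirror the @ExistingDataGrouped / @NewDataGrouped names used above.

```sql
-- Assumption: SQL Server 2017+ (STRING_AGG is not available in earlier versions)
declare @existingData table (groupIdentifier int, rowOrdinal int, value varchar(1));
insert into @existingData values
    (100, 0, 'X'), (100, 1, 'Y'),
    (200, 0, 'A'), (200, 1, 'B'), (200, 2, 'C'),
    (40, 0, 'X'),
    (41, 0, 'Y');

declare @newData table (groupIdentifier int, rowOrdinal int, value varchar(1));
insert into @newData values
    (1, 55, 'X'), (1, 59, 'Y'),
    (2, 0, 'Y'), (2, 1, 'X');

WITH ExistingDataGrouped AS
(
    SELECT
        groupIdentifier AS ID
        -- cast to varchar(max) so long groups do not hit the 8000-byte limit
        ,STRING_AGG(CAST(value AS varchar(max)), ',')
            WITHIN GROUP (ORDER BY rowOrdinal) AS DataValues
    FROM @existingData
    GROUP BY groupIdentifier
)
,NewDataGrouped AS
(
    SELECT
        groupIdentifier AS ID
        ,STRING_AGG(CAST(value AS varchar(max)), ',')
            WITHIN GROUP (ORDER BY rowOrdinal) AS DataValues
    FROM @newData
    GROUP BY groupIdentifier
)
SELECT
    E.ID AS ExistingGroupIdentifier
    ,N.ID AS NewGroupIdentifier
FROM
    ExistingDataGrouped AS E
    INNER JOIN NewDataGrouped AS N ON N.DataValues = E.DataValues
;
```

With the sample data this should pair existing group 100 ('X,Y') with new group 1 ('X,Y') and nothing else. Keep in mind that the string comparison follows the collation of the data, so under a case-insensitive collation 'x' and 'X' would be treated as equal.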

Some optimization

If the number of source rows is large, some preliminary filtering can be done using CHECKSUM_AGG.

WITH
CTE_ExistingRN
AS
(
    SELECT
        GroupIdentifier
        ,ROW_NUMBER() OVER(PARTITION BY GroupIdentifier ORDER BY RowOrdinal) AS rn
        ,Value
    FROM @ExistingData
)
,CTE_NewRN
AS
(
    SELECT
        GroupIdentifier
        ,ROW_NUMBER() OVER(PARTITION BY GroupIdentifier ORDER BY RowOrdinal) AS rn
        ,Value
    FROM @NewData
)
,CTE_ExistingAgg
AS
(
    SELECT
        GroupIdentifier
        , CHECKSUM_AGG(CHECKSUM(rn, Value)) AS DataValues
    FROM CTE_ExistingRN
    GROUP BY GroupIdentifier
)
,CTE_NewAgg
AS
(
    SELECT
        GroupIdentifier
        , CHECKSUM_AGG(CHECKSUM(rn, Value)) AS DataValues
    FROM CTE_NewRN
    GROUP BY GroupIdentifier
)
SELECT
    CTE_ExistingAgg.GroupIdentifier AS ExistingGroupIdentifier
    , CTE_NewAgg.GroupIdentifier AS NewGroupIdentifier
FROM
    CTE_ExistingAgg
    INNER JOIN CTE_NewAgg ON CTE_NewAgg.DataValues = CTE_ExistingAgg.DataValues
;

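First we renumber all rows so that each group starts from 1 (CTE_ExistingRN and CTE_NewRN).

CHECKSUM(rn, Value) returns an integer for each source row that takes into account both the row number and its value. Different values will usually produce different checksums.

CHECKSUM_AGG combines all the checksums of a group into one.

Result set: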

ExistingGroupIdentifier    NewGroupIdentifier
100                        1
100                        2

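This result will contain all groups that match exactly (100, 1), but it can also contain some groups that do not match and just happen to have the same checksum (100, 2). That is why this step is preliminary. To get accurate results you should compare the actual values, not their checksums. But this step may filter out a large number of groups that definitely don't match.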
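
One way to compare the actual values, sketched here as an illustration rather than as part of the original answer: renumber the rows in both tables, join them position by position, and keep only the pairs of groups where every position matched. The aliases E and N and the cnt column are names made up for this example, and the sketch assumes rowOrdinal is unique within a group and that value is never NULL.

```sql
declare @existingData table (groupIdentifier int, rowOrdinal int, value varchar(1));
insert into @existingData values
    (100, 0, 'X'), (100, 1, 'Y'),
    (200, 0, 'A'), (200, 1, 'B'), (200, 2, 'C'),
    (40, 0, 'X'),
    (41, 0, 'Y');

declare @newData table (groupIdentifier int, rowOrdinal int, value varchar(1));
insert into @newData values
    (1, 55, 'X'), (1, 59, 'Y'),
    (2, 0, 'Y'), (2, 1, 'X');

WITH E AS
(
    SELECT
        groupIdentifier
        ,value
        -- position of the row inside its group, starting from 1
        ,ROW_NUMBER() OVER (PARTITION BY groupIdentifier ORDER BY rowOrdinal) AS rn
        -- size of the group, repeated on every row of the group
        ,COUNT(*) OVER (PARTITION BY groupIdentifier) AS cnt
    FROM @existingData
)
,N AS
(
    SELECT
        groupIdentifier
        ,value
        ,ROW_NUMBER() OVER (PARTITION BY groupIdentifier ORDER BY rowOrdinal) AS rn
        ,COUNT(*) OVER (PARTITION BY groupIdentifier) AS cnt
    FROM @newData
)
SELECT
    E.groupIdentifier AS ExistingGroupIdentifier
    ,N.groupIdentifier AS NewGroupIdentifier
FROM
    E
    INNER JOIN N ON
        N.cnt = E.cnt          -- only groups of the same size can match
        AND N.rn = E.rn        -- same position within the group
        AND N.value = E.value  -- same value at that position
GROUP BY E.groupIdentifier, N.groupIdentifier
-- a pair of groups matches only if every position joined successfully
HAVING COUNT(*) = MIN(E.cnt)
;
```

For the sample data this should return only the pair 100, 1. New group 2 has the same values as group 100 but in the opposite order, so none of its positions join. In practice the checksum query could be used first to narrow down the candidate pairs, but that combination has not been measured here.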

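Solution using XML

This solution converts the values of each group into XML and will give accurate results. I personally had never used FOR XML before and was curious to see how it works.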

WITH
CTE_ExistingGroups
AS
(
    SELECT DISTINCT GroupIdentifier
    FROM @ExistingData
)
,CTE_NewGroups
AS
(
    SELECT DISTINCT GroupIdentifier
    FROM @NewData
)
,CTE_ExistingAgg
AS
(
    SELECT
        GroupIdentifier
        ,CA_Data.XML_Value AS DataValues
    FROM
        CTE_ExistingGroups
        CROSS APPLY
        (
            SELECT Value+','
            FROM @ExistingData
            WHERE GroupIdentifier = CTE_ExistingGroups.GroupIdentifier
            ORDER BY RowOrdinal FOR XML PATH(''), TYPE
        ) AS CA_XML(XML_Value)
        CROSS APPLY
        (
            SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
        ) AS CA_Data(XML_Value)
)
,CTE_NewAgg
AS
(
    SELECT
        GroupIdentifier
        ,CA_Data.XML_Value AS DataValues
    FROM
        CTE_NewGroups
        CROSS APPLY
        (
            SELECT Value+','
            FROM @NewData
            WHERE GroupIdentifier = CTE_NewGroups.GroupIdentifier
            ORDER BY RowOrdinal FOR XML PATH(''), TYPE
        ) AS CA_XML(XML_Value)
        CROSS APPLY
        (
            SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
        ) AS CA_Data(XML_Value)
)
SELECT
    CTE_ExistingAgg.GroupIdentifier AS ExistingGroupIdentifier
    , CTE_NewAgg.GroupIdentifier AS NewGroupIdentifier
FROM
    CTE_ExistingAgg
    INNER JOIN CTE_NewAgg ON CTE_NewAgg.DataValues = CTE_ExistingAgg.DataValues
;

Result set:

ExistingGroupIdentifier    NewGroupIdentifier
100                        1
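
For anyone else new to FOR XML, the concatenation step can be looked at in isolation. The snippet below is a minimal sketch that uses only group 200 from the sample data. Because the selected expression has no column name, PATH('') emits the values as bare text with no element tags around them.

```sql
declare @existingData table (groupIdentifier int, rowOrdinal int, value varchar(1));
insert into @existingData values (200, 0, 'A'), (200, 1, 'B'), (200, 2, 'C');

-- Unnamed column + PATH('') gives plain concatenated text: A,B,C,
SELECT value + ','
FROM @existingData
WHERE groupIdentifier = 200
ORDER BY rowOrdinal
FOR XML PATH('');

-- With TYPE the result is an xml value; .value() converts it back to a string,
-- which also undoes XML escaping (e.g. a stored '<' would otherwise appear as '&lt;')
SELECT
(
    SELECT value + ','
    FROM @existingData
    WHERE groupIdentifier = 200
    ORDER BY rowOrdinal
    FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)') AS DataValues;
```

The trailing comma is harmless here because both sides of the join are built the same way.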
