How to match groups of rows taking order into account in T-SQL?
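I have a table that stores groups of related rows; the rows of a group are tied together by the groupIdentifier column. A group can contain any number of rows.
I need to be able to pass in a new set of row groups and find the mapping from each new group to its matching existing group. The complication is that the order of the rows within a group, defined by the rowOrdinal value, must be taken into account. The rowOrdinal values are not always 0-based, but the rows in a group are ordered by that value. Also, @existingData contains thousands of potential groups, so the query needs to perform well.
Here is a sample input data set: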
declare @existingData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @existingData values
(100, 0, 'X'),
(100, 1, 'Y'),
(200, 0, 'A'),
(200, 1, 'B'),
(200, 2, 'C'),
(40, 0, 'X'),
(41, 0, 'Y')
declare @newData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @newData values
(1, 55, 'X'),
(1, 59, 'Y'),
(2, 0, 'Y'),
(2, 1, 'X')
-- @newData group 1 matches to @existingData group 100, @newData group 2 has no match in existingData
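The desired result is a result set with two columns, existingGroupIdentifier and newGroupIdentifier. In this case the only result row is 100, 1, where 100 is the @existingData groupIdentifier and 1 is the @newData groupIdentifier.
Edit
Here is what I have come up with so far. Assuming my maximum group size is N, I can manually copy and paste T-SQL code that uses pivot and temp tables to compare the groups of each size. However, this limits the system to N and looks ugly; if possible, I would prefer a way to do it in a single query.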
declare @existingData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @existingData values
(100, 0, 'X'),
(100, 1, 'Y'),
(200, 0, 'A'),
(200, 1, 'B'),
(200, 2, 'C'),
(40, 0, 'X'),
(41, 0, 'Y')
declare @newData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @newData values
(1, 55, 'X'),
(1, 59, 'Y'),
(2, 0, 'Y'),
(2, 1, 'X'),
(3, 99, 'Y'),
(5, 4, 'A'),
(5, 10, 'B'),
(5, 200, 'C')
-- First build table of the size of each group, limiting @existingData to only potentially matching groups (have at least one member in common)
declare @potentialGroupsInExistingData table (groupIdentifier int, groupSize int)
insert into @potentialGroupsInExistingData
select
ExistingData.groupIdentifier, COUNT(ExistingData.groupIdentifier)
from
@existingData ExistingData
where
exists (select top 1 * from @newData where value = ExistingData.value)
group by ExistingData.groupIdentifier
declare @groupsInNewData table (groupIdentifier int, groupSize int)
insert into @groupsInNewData
select
NewData.groupIdentifier, COUNT(NewData.groupIdentifier)
from
@newData NewData
group by NewData.groupIdentifier
-- handle groups of size one, this is a simpler case of the pivoting used with more than size 1 groups
-----------------------------------
select
ExistingData.groupIdentifier as ExistingGroupIdentifier,
NewData.groupIdentifier as NewGroupIdentifier
from
@potentialGroupsInExistingData PotentialExistingGroup
cross join @groupsInNewData GroupsInNewData
inner join @existingData ExistingData on
ExistingData.groupIdentifier = PotentialExistingGroup.groupIdentifier
inner join @newData NewData on
NewData.groupIdentifier = GroupsInNewData.groupIdentifier
and NewData.value = ExistingData.value
where
PotentialExistingGroup.groupSize = 1
and GroupsInNewData.groupSize = 1
-- handle groups of size two
-----------------------------------
declare @existingGroupsOfSizeTwo table (groupIdentifier int, valueOne varchar(1), valueTwo varchar(1))
insert into @existingGroupsOfSizeTwo
select
*
from
(select
ExistingData.groupIdentifier,
ExistingData.value,
ROW_NUMBER() over (partition by ExistingData.groupIdentifier order by ExistingData.rowOrdinal desc) as ActualOrdinal
from
@potentialGroupsInExistingData PotentialGroup
inner join @existingData ExistingData on
ExistingData.groupIdentifier = PotentialGroup.groupIdentifier
where
PotentialGroup.groupSize = 2) as T
pivot ( min(value) for T.ActualOrdinal in ([1], [2]) ) as p
declare @newGroupsOfSizeTwo table (groupIdentifier int, valueOne varchar(1), valueTwo varchar(1))
insert into @newGroupsOfSizeTwo
select
*
from
(select
NewData.groupIdentifier,
NewData.value,
ROW_NUMBER() over (partition by NewData.groupIdentifier order by NewData.rowOrdinal desc) as ActualOrdinal
from
@groupsInNewData NewDataGroup
inner join @newData NewData on
NewData.groupIdentifier = NewDataGroup.groupIdentifier
where
NewDataGroup.groupSize = 2) as T
pivot ( min(value) for T.ActualOrdinal in ([1], [2]) ) as p
select
ExistingData.groupIdentifier as ExistingGroupIdentifier,
NewData.groupIdentifier as NewGroupIdentifier
from
@newGroupsOfSizeTwo NewData
inner join @existingGroupsOfSizeTwo ExistingData on
ExistingData.valueOne = NewData.valueOne
and ExistingData.valueTwo = NewData.valueTwo
-- handle groups of size three
-----------------------------------
declare @existingGroupsOfSizeThree table (groupIdentifier int, valueOne varchar(1), valueTwo varchar(1), valueThree varchar(1))
insert into @existingGroupsOfSizeThree
select
*
from
(select
ExistingData.groupIdentifier,
ExistingData.value,
ROW_NUMBER() over (partition by ExistingData.groupIdentifier order by ExistingData.rowOrdinal desc) as ActualOrdinal
from
@potentialGroupsInExistingData PotentialGroup
inner join @existingData ExistingData on
ExistingData.groupIdentifier = PotentialGroup.groupIdentifier
where
PotentialGroup.groupSize = 3) as T
pivot ( min(value) for T.ActualOrdinal in ([1], [2], [3]) ) as p
declare @newGroupsOfSizeThree table (groupIdentifier int, valueOne varchar(1), valueTwo varchar(1), valueThree varchar(1))
insert into @newGroupsOfSizeThree
select
*
from
(select
NewData.groupIdentifier,
NewData.value,
ROW_NUMBER() over (partition by NewData.groupIdentifier order by NewData.rowOrdinal desc) as ActualOrdinal
from
@groupsInNewData NewDataGroup
inner join @newData NewData on
NewData.groupIdentifier = NewDataGroup.groupIdentifier
where
NewDataGroup.groupSize = 3) as T
pivot ( min(value) for T.ActualOrdinal in ([1], [2], [3]) ) as p
select
ExistingData.groupIdentifier as ExistingGroupIdentifier,
NewData.groupIdentifier as NewGroupIdentifier
from
@newGroupsOfSizeThree NewData
inner join @existingGroupsOfSizeThree ExistingData on
ExistingData.valueOne = NewData.valueOne
and ExistingData.valueTwo = NewData.valueTwo
and ExistingData.valueThree = NewData.valueThree
Try this:
declare @existingData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @existingData values
(100, 0, 'X'),
(100, 1, 'Y'),
(200, 0, 'A'),
(200, 1, 'B'),
(200, 2, 'C'),
(40, 0, 'X'),
(41, 0, 'Y')
declare @newData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @newData values
(1, 55, 'X'),
(1, 59, 'Y'),
(2, 0, 'Y'),
(2, 1, 'X')
declare @results table (
existingGID int,
newGID int)
DECLARE @existingGroupID int
DECLARE outer_cursor CURSOR FOR
SELECT DISTINCT groupIdentifier FROM @existingData
OPEN outer_cursor
FETCH NEXT FROM outer_cursor INTO @existingGroupID
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @existingGroupCount int
SELECT @existingGroupCount = COUNT(value) FROM @existingData WHERE groupIdentifier = @existingGroupID
DECLARE @newGroupID int
DECLARE inner_cursor CURSOR FOR
SELECT DISTINCT groupIdentifier from @newData
OPEN inner_cursor
FETCH NEXT FROM inner_cursor INTO @newGroupID
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @newGroupCount int
SELECT @newGroupCount = COUNT(value) FROM @newData WHERE groupIdentifier = @newGroupID
-- if groups are different sizes, skip
IF @newGroupCount = @existingGroupCount
BEGIN
DECLARE @newStart int = -1
DECLARE @currentValue varchar(1)
DECLARE @validGroup bit = 1
DECLARE equality_cursor CURSOR FOR
SELECT value FROM @existingData WHERE groupIdentifier = @existingGroupID ORDER BY rowOrdinal
OPEN equality_cursor
FETCH NEXT FROM equality_cursor INTO @currentValue
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @newValue varchar(1)
SELECT TOP 1 @newValue = value, @newStart = rowOrdinal FROM @newData WHERE groupIdentifier = @newGroupID AND @newStart < rowOrdinal ORDER BY rowOrdinal
IF(@newValue <> @currentValue)
BEGIN
SET @validGroup = 0
BREAK
END
FETCH NEXT FROM equality_cursor INTO @currentValue
END
CLOSE equality_cursor
DEALLOCATE equality_cursor
IF @validGroup = 1
BEGIN
INSERT INTO @results (existingGID, newGID) VALUES (@existingGroupID, @newGroupID)
END
END
FETCH NEXT FROM inner_cursor INTO @newGroupID
END
CLOSE inner_cursor
DEALLOCATE inner_cursor
FETCH NEXT FROM outer_cursor INTO @existingGroupID
END
CLOSE outer_cursor
DEALLOCATE outer_cursor
SELECT * FROM @results
I have to head out now, but I will edit this later with better comments explaining what the code does.
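General idea
The given tables can have several rows for the same group ID.
If we had a way to collapse each table so that every group ID occupies a single row, with all the values of that group combined in one column, then finding all matching groups would become trivial.
If we do this transformation
@existingData -> @ExistingDataGrouped (ID, DataValues)
@newData -> @NewDataGrouped (ID, DataValues)
then the final query will look like this (note that we join on DataValues, not on ID):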
SELECT
E.ID, N.ID
FROM
@ExistingDataGrouped AS E
INNER JOIN @NewDataGrouped AS N ON N.DataValues = E.DataValues
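How to make the grouped tables
- Convert the values to XML (search for "group_concat" for SQL Server, e.g. How to make a query with group_concat in sql server).
- Use a CLR implementation of a GroupConcat function with an extra parameter to specify the order. I personally used http://groupconcat.codeplex.com/, which may be a good starting point.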
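On SQL Server 2017 or later, the built-in STRING_AGG function with WITHIN GROUP (ORDER BY ...) might be a third way to build the grouped tables, without XML or a CLR function. The following is only a minimal sketch under that version assumption. The CTE names ExistingGrouped and NewGrouped and the comma separator are illustrative choices, not part of the original answer. It also assumes that the separator never occurs in the values and that rowOrdinal is unique within a group.

```sql
declare @existingData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @existingData values
(100, 0, 'X'),
(100, 1, 'Y'),
(200, 0, 'A'),
(200, 1, 'B'),
(200, 2, 'C'),
(40, 0, 'X'),
(41, 0, 'Y')
declare @newData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @newData values
(1, 55, 'X'),
(1, 59, 'Y'),
(2, 0, 'Y'),
(2, 1, 'X')

;WITH
ExistingGrouped
AS
(
    -- One row per existing group; the values are concatenated in rowOrdinal order.
    -- The cast to varchar(max) avoids the 8000-byte limit on the aggregated string.
    SELECT
        groupIdentifier AS ID
        ,STRING_AGG(CAST(value AS varchar(max)), ',')
            WITHIN GROUP (ORDER BY rowOrdinal) AS DataValues
    FROM @existingData
    GROUP BY groupIdentifier
)
,NewGrouped
AS
(
    -- Same transformation for the incoming groups.
    SELECT
        groupIdentifier AS ID
        ,STRING_AGG(CAST(value AS varchar(max)), ',')
            WITHIN GROUP (ORDER BY rowOrdinal) AS DataValues
    FROM @newData
    GROUP BY groupIdentifier
)
SELECT
    E.ID AS ExistingGroupIdentifier
    ,N.ID AS NewGroupIdentifier
FROM
    ExistingGrouped AS E
    INNER JOIN NewGrouped AS N ON N.DataValues = E.DataValues
;
```

For the sample data this should return the single row 100, 1. Note that STRING_AGG skips NULL values, so if value can be NULL the rows would need a placeholder before aggregation.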
Some optimization
If the number of source rows is large, you can use CHECKSUM_AGG to do some preliminary filtering.
WITH
CTE_ExistingRN
AS
(
SELECT
GroupIdentifier
,ROW_NUMBER() OVER(PARTITION BY GroupIdentifier ORDER BY RowOrdinal) AS rn
,Value
FROM @ExistingData
)
,CTE_NewRN
AS
(
SELECT
GroupIdentifier
,ROW_NUMBER() OVER(PARTITION BY GroupIdentifier ORDER BY RowOrdinal) AS rn
,Value
FROM @NewData
)
,CTE_ExistingAgg
AS
(
SELECT
GroupIdentifier
, CHECKSUM_AGG(CHECKSUM(rn, Value)) AS DataValues
FROM CTE_ExistingRN
GROUP BY GroupIdentifier
)
,CTE_NewAgg
AS
(
SELECT
GroupIdentifier
, CHECKSUM_AGG(CHECKSUM(rn, Value)) AS DataValues
FROM CTE_NewRN
GROUP BY GroupIdentifier
)
SELECT
CTE_ExistingAgg.GroupIdentifier AS ExistingGroupIdentifier
, CTE_NewAgg.GroupIdentifier AS NewGroupIdentifier
FROM
CTE_ExistingAgg
INNER JOIN CTE_NewAgg ON CTE_NewAgg.DataValues = CTE_ExistingAgg.DataValues
;
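First we renumber all rows so that each group starts from 1 (CTE_ExistingRN and CTE_NewRN).
CHECKSUM(rn, Value) returns an integer for each source row, taking into account both the row number and its value. Different values usually produce different checksums.
CHECKSUM_AGG combines all the checksums of a group into a single value.
Result set: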
ExistingGroupIdentifier NewGroupIdentifier
100 1
100 2
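This result contains all groups that match exactly (100, 1), but it can also contain groups that do not match and merely happen to have the same checksum (100, 2). That is why this step is only preliminary. To get an accurate result you should compare the actual values, not their checksums. Still, this step can filter out a large number of groups that definitely do not match.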
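One possible way to do that exact comparison is sketched below. It is not part of the original answer. It renumbers the rows as before, joins the two sides position by position, and keeps a pair of groups only when every position matched. The CTE names ExistingRN and NewRN and the groupSize column are made up for this illustration. The sketch assumes that rowOrdinal is unique within a group and that value is never NULL. The same join could also be restricted to the candidate pairs produced by the checksum query.

```sql
declare @existingData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @existingData values
(100, 0, 'X'),
(100, 1, 'Y'),
(200, 0, 'A'),
(200, 1, 'B'),
(200, 2, 'C'),
(40, 0, 'X'),
(41, 0, 'Y')
declare @newData table (
groupIdentifier int,
rowOrdinal int,
value varchar(1))
insert into @newData values
(1, 55, 'X'),
(1, 59, 'Y'),
(2, 0, 'Y'),
(2, 1, 'X')

;WITH
ExistingRN
AS
(
    -- Position of each row inside its group, plus the size of the group.
    SELECT
        groupIdentifier
        ,value
        ,ROW_NUMBER() OVER (PARTITION BY groupIdentifier ORDER BY rowOrdinal) AS rn
        ,COUNT(*) OVER (PARTITION BY groupIdentifier) AS groupSize
    FROM @existingData
)
,NewRN
AS
(
    SELECT
        groupIdentifier
        ,value
        ,ROW_NUMBER() OVER (PARTITION BY groupIdentifier ORDER BY rowOrdinal) AS rn
        ,COUNT(*) OVER (PARTITION BY groupIdentifier) AS groupSize
    FROM @newData
)
SELECT
    E.groupIdentifier AS ExistingGroupIdentifier
    ,N.groupIdentifier AS NewGroupIdentifier
FROM
    ExistingRN AS E
    INNER JOIN NewRN AS N
        ON  N.groupSize = E.groupSize   -- only groups of equal size can match
        AND N.rn = E.rn                 -- compare the same position
        AND N.value = E.value           -- and require the same value there
GROUP BY
    E.groupIdentifier
    ,N.groupIdentifier
    ,E.groupSize
-- A pair matches only if all of its positions survived the join.
HAVING COUNT(*) = E.groupSize
;
```

For the sample data this should return only the row 100, 1. The pair 100, 2 joins no rows at all, because neither position holds the same value on both sides.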
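Solution using XML
This solution converts the values of each group to XML and gives an accurate result. I personally had never used FOR XML before and was curious to see how it works.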
WITH
CTE_ExistingGroups
AS
(
SELECT DISTINCT GroupIdentifier
FROM @ExistingData
)
,CTE_NewGroups
AS
(
SELECT DISTINCT GroupIdentifier
FROM @NewData
)
,CTE_ExistingAgg
AS
(
SELECT
GroupIdentifier
,CA_Data.XML_Value AS DataValues
FROM
CTE_ExistingGroups
CROSS APPLY
(
SELECT Value+','
FROM @ExistingData
WHERE GroupIdentifier = CTE_ExistingGroups.GroupIdentifier
ORDER BY RowOrdinal FOR XML PATH(''), TYPE
) AS CA_XML(XML_Value)
CROSS APPLY
(
SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
) AS CA_Data(XML_Value)
)
,CTE_NewAgg
AS
(
SELECT
GroupIdentifier
,CA_Data.XML_Value AS DataValues
FROM
CTE_NewGroups
CROSS APPLY
(
SELECT Value+','
FROM @NewData
WHERE GroupIdentifier = CTE_NewGroups.GroupIdentifier
ORDER BY RowOrdinal FOR XML PATH(''), TYPE
) AS CA_XML(XML_Value)
CROSS APPLY
(
SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
) AS CA_Data(XML_Value)
)
SELECT
CTE_ExistingAgg.GroupIdentifier AS ExistingGroupIdentifier
, CTE_NewAgg.GroupIdentifier AS NewGroupIdentifier
FROM
CTE_ExistingAgg
INNER JOIN CTE_NewAgg ON CTE_NewAgg.DataValues = CTE_ExistingAgg.DataValues
;
Result set:
ExistingGroupIdentifier NewGroupIdentifier
100 1