Evenly distributing related records

I have a table with over 100k mailboxes and the users (trustees) who have permissions on them.

+---------+---------+
| Mailbox | Trustee |
+---------+---------+
| smb1    | mbx1    |
| smb2    | mbx1    |
| smb2    | mbx2    |
| smb2    | mbx3    |
| smb3    | mbx4    |
| smb3    | mbx5    |
| mbx1    | mbx6    |
| mbx7    | mbx4    |
| smb4    | mbx8    |
| smb4    | mbx9    |
| mbx8    | mbx10   |
+---------+---------+

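I need to group the trustees together with the mailboxes (Mailbox column) they have access to. For example, mbx1, mbx2 and mbx3 are related through their access to smb2, so they go into bucket 1. mbx1 going into bucket 1 means smb1 also goes into bucket 1, because mbx1 is a trustee of smb1. Further down, mbx6 has a relationship with mbx1, so it also goes into bucket 1. Hopefully the other buckets make sense. Note that a trustee can have access to either an smb (shared mailbox) or an mbx (mailbox).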

The table I am selecting from has only Mailbox and Trustee, and I want to write the result to a temp table like the one below.

+---------+---------+--------+
| Mailbox | Trustee | Bucket |
+---------+---------+--------+
| smb1    | mbx1    |      1 |
| smb2    | mbx1    |      1 |
| smb2    | mbx2    |      1 |
| smb2    | mbx3    |      1 |
| smb3    | mbx4    |      2 |
| smb3    | mbx5    |      2 |
| mbx1    | mbx6    |      1 |
| mbx7    | mbx4    |      2 |
| smb4    | mbx8    |      3 |
| smb4    | mbx9    |      3 |
| mbx8    | mbx10   |      3 |
+---------+---------+--------+

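Then I want to combine the buckets, using their counts, into evenly sized groups. The idea is that I can specify a maximum count, say 100, and get groups of buckets that each hold at most 100 users.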

+---------+---------+-------+
| Groups  | Buckets | Count |
+---------+---------+-------+
|       1 |       1 |     5 |
|       2 |     2,3 |     6 |
+---------+---------+-------+
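
A minimal sketch of this grouping step, assuming Count means the number of rows per bucket (which is what the example table shows). The names @BucketSize, Cnt, Grp and @MaxCount are hypothetical, and STRING_AGG requires SQL Server 2017 or later. The loop walks the buckets in order and starts a new group whenever the next bucket would push the running total over the cap. It is a simple sequential fill, not an optimal packing, and a bucket larger than the cap still gets a group of its own.

```sql
-- Hypothetical input: one row per bucket with its row count
declare @BucketSize table
(
    Bucket  int primary key,
    Cnt     int,
    Grp     int null
);

insert into @BucketSize (Bucket, Cnt) values (1, 5), (2, 3), (3, 3);

declare @MaxCount int = 6,   -- assumed cap per group (100 in the question)
        @Grp      int = 1,   -- current group number
        @Running  int = 0,   -- rows already placed in the current group
        @Bucket   int,
        @Cnt      int;

select  @Bucket = min(Bucket) from @BucketSize;

while   @Bucket is not null
begin
        select  @Cnt = Cnt from @BucketSize where Bucket = @Bucket;

        -- start a new group when this bucket would overflow the current one
        if  @Running > 0 and @Running + @Cnt > @MaxCount
        begin
            set @Grp     = @Grp + 1;
            set @Running = 0;
        end

        update  @BucketSize set Grp = @Grp where Bucket = @Bucket;
        set     @Running = @Running + @Cnt;

        -- min() without a match yields NULL, which ends the loop
        select  @Bucket = min(Bucket) from @BucketSize where Bucket > @Bucket;
end

select  Grp         as [Groups],
        string_agg(cast(Bucket as varchar(10)), ',')
            within group (order by Bucket) as [Buckets],
        sum(Cnt)    as [Count]
from    @BucketSize
group by Grp
order by Grp;
```

With the sample counts and a cap of 6, bucket 1 stays alone in group 1 (count 5) and buckets 2 and 3 share group 2 (count 6), matching the table above.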

Edit: I have got this far: I can pass in a mailbox and get all of its trustees, and then the other mailboxes those trustees have access to.

DECLARE @int int = 1;
WITH Buckets_CTE (Trustee)
AS (
    SELECT DISTINCT Trustee
    FROM EXOPerms
    WHERE Mailbox = 'smb1'
)
SELECT DISTINCT Mailbox, Trustee
FROM EXOPerms
WHERE Trustee IN (
    SELECT DISTINCT Trustee
    FROM Buckets_CTE)
ORDER BY Trustee

The DECLARE @int at the top is currently redundant; it was only there to see whether I could implement the bucket functionality.

Here is a WHILE loop solution. It simply iterates over each row and updates Bucket.

An ID column is added so the data can be looped through row by row.

To check whether a mailbox or trustee also appears in another row, test i.Mailbox in (m.Mailbox, m.Trustee), and likewise for i.Trustee:

from @mailbox i
     inner join @mailbox m  
     on  i.ID   <> m.ID   -- don't compare the same row
     and (
            i.Mailbox   in (m.Mailbox, m.Trustee) 
         or i.Trustee   in (m.Mailbox, m.Trustee) 
         )

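Note that when Bucket is updated, the new value is compared with the current Bucket and only the smaller one is kept. This handles cases like the one below, where the relationship between earlier rows is not known until a later row is processed.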

ID  MailBox  Trustee 
1   a        b      
2   c        d
3   e        f
4   c        f

IDs 1, 2 and 3 are each assigned a separate Bucket when processed in sequence. Only when ID 4 is processed are IDs 2 and 3 linked together.
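
To see which rows the join condition links in this small example, the following sketch runs the same predicate on the four rows. The table variable @t is a hypothetical stand-in for @mailbox.

```sql
-- Hypothetical four-row sample from the example above
declare @t table
(
    ID      int,
    Mailbox varchar(5),
    Trustee varchar(5)
);

insert into @t (ID, Mailbox, Trustee) values
(1, 'a', 'b'),
(2, 'c', 'd'),
(3, 'e', 'f'),
(4, 'c', 'f');

-- list every pair of distinct rows that share a mailbox or trustee
select  i.ID as CurrentID,
        m.ID as LinkedID
from    @t i
        inner join @t m
        on  i.ID    <> m.ID
        and (
                i.Mailbox   in (m.Mailbox, m.Trustee)
            or  i.Trustee   in (m.Mailbox, m.Trustee)
            )
order by i.ID, m.ID;
```

The pairs returned are (2, 4), (3, 4), (4, 2) and (4, 3). ID 1 links to nothing, and IDs 2 and 3 are connected only through ID 4.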


The complete query:

declare @mailbox table
(
    ID      int        identity,
    Mailbox varchar(5),
    Trustee varchar(5),
    Bucket  int
)

insert into @mailbox (Mailbox, Trustee) values
( 'smb1',    'mbx1' ),
( 'smb2',    'mbx1' ),
( 'smb2',    'mbx2' ),
( 'smb2',    'mbx3' ),
( 'smb3',    'mbx4' ),
( 'smb3',    'mbx5' ),
( 'mbx1',    'mbx6' ),
( 'mbx7',    'mbx4' ),
( 'smb4',    'mbx8' ),
( 'smb4',    'mbx9' ),
( 'mbx8',    'mbx10');

declare @ID     int,
        @Bucket int = 1    -- start from 1

-- get the minimum ID for start
select  @ID = min(ID) from @mailbox where Bucket is null

while   exists 
        (
            select  *
            from    @mailbox
            where   ID >= @ID
        )
begin
        -- if the mailbox is found in other row with Bucket value
        -- (Bucket is not null)
        if  exists
        (
            select  *
            from    @mailbox i
                    inner join @mailbox m   
                    on  i.ID    <> m.ID
                    and (
                            i.Mailbox   in (m.Mailbox, m.Trustee) 
                        or  i.Trustee   in (m.Mailbox, m.Trustee) 
                        )
            where   i.ID    = @ID
            and     m.Bucket    is not null
        )
        begin
            -- Update Bucket from other row
            update  i
            set     Bucket  = case  when i.Bucket is null
                                    or   i.Bucket > m.Bucket
                                    then m.Bucket
                                    else i.Bucket
                                    end
            from    @mailbox i
                    inner join @mailbox m   
                    on  i.ID    <> m.ID
                    and (
                            i.Mailbox   in (m.Mailbox, m.Trustee) 
                        or  i.Trustee   in (m.Mailbox, m.Trustee) 
                        )
            where   i.ID    = @ID
            and     m.Bucket    is not null

            -- Update other rows that might be linked to the current ID
            update  m
            set     Bucket  = case  when    i.Bucket > m.Bucket
                                    then    m.Bucket
                                    else    i.Bucket
                                    end
            from    @mailbox i
                    inner join @mailbox m   
                    on  i.ID    <> m.ID
                    and (
                            i.Mailbox   in (m.Mailbox, m.Trustee) 
                        or  i.Trustee   in (m.Mailbox, m.Trustee) 
                        )
            where   i.ID        = @ID
        end
        else
        begin
            -- no other row found with same mailbox. 
            -- Assign Bucket from @Bucket, increment @Bucket
            update  m
            set     Bucket  = @Bucket
            from    @mailbox m
            where   m.ID    = @ID;

            select  @Bucket = @Bucket + 1;
        end

        -- Get next ID
        select  @ID = min(ID) from @mailbox where ID > @ID;
end

select  *
from    @mailbox
order by ID
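
A single pass may miss some merges. When the current row links to several rows that already carry different Bucket values, an UPDATE ... FROM with more than one matching row takes its value from an arbitrary one of them, and rows that are only indirectly connected are not revisited. As a hedged alternative (a sketch, not part of the query above), the version below seeds every row with its own ID as Bucket and repeatedly lowers each row to the smallest Bucket among the rows it is linked to, until no row changes. DENSE_RANK then renumbers the buckets 1, 2, 3. The table variable @perm is a hypothetical copy of the sample data, and the self-join with OR may be slow on 100k rows, so treat this as an illustration of the idea rather than a tuned solution.

```sql
-- Hypothetical copy of the sample data
declare @perm table
(
    ID      int identity,
    Mailbox varchar(5),
    Trustee varchar(5),
    Bucket  int
);

insert into @perm (Mailbox, Trustee) values
( 'smb1',    'mbx1' ),
( 'smb2',    'mbx1' ),
( 'smb2',    'mbx2' ),
( 'smb2',    'mbx3' ),
( 'smb3',    'mbx4' ),
( 'smb3',    'mbx5' ),
( 'mbx1',    'mbx6' ),
( 'mbx7',    'mbx4' ),
( 'smb4',    'mbx8' ),
( 'smb4',    'mbx9' ),
( 'mbx8',    'mbx10');

-- every row starts in its own bucket
update  @perm set Bucket = ID;

declare @changed int = 1;

while   @changed > 0
begin
        -- lower each row to the smallest Bucket among its linked rows
        update  i
        set     Bucket  = x.MinBucket
        from    @perm i
                inner join
                (
                    select  a.ID, min(b.Bucket) as MinBucket
                    from    @perm a
                            inner join @perm b
                            on  a.Mailbox   in (b.Mailbox, b.Trustee)
                            or  a.Trustee   in (b.Mailbox, b.Trustee)
                    group by a.ID
                ) x
                on  x.ID    = i.ID
        where   x.MinBucket < i.Bucket;

        -- stop once a full pass changes nothing
        set @changed = @@rowcount;
end

-- renumber the surviving bucket values as 1, 2, 3, ...
select  Mailbox,
        Trustee,
        dense_rank() over (order by Bucket) as Bucket
from    @perm
order by ID;
```

On the sample data this yields buckets 1, 1, 1, 1, 2, 2, 1, 2, 3, 3, 3 in ID order, which is the result table requested in the question.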