在 PostgreSQL 中按数组重叠分组
Group by array overlap in PostgreSQL
我想编写一个查询,通过重叠数组对行进行分组。考虑以下示例:
id | name | family_ids
--------------------------
1 | Alice | [f1, f2]
2 | Bob | [f1]
3 | Freddy | [f2, f3]
4 | James | [f3]
5 | Joe | [f4, f5]
6 | Tim | [f5]
Alice 和 Bob 是同一个家庭的成员,f1
。在 Alice 的姻亲 (f2
) 中,她也与 Freddy 有亲戚关系。而且考虑到Freddy的姻亲(f3
),James也跟他们有亲戚关系
所以,基本上,我想按 family_ids
中完全重叠的数组进行分组。但是,请注意 f2 -> f3
也应该被发现,这对于 1 个简单的 group by
查询是不可能的。
ids | family_ids
----------------------------
[1, 2, 3, 4] | [f1, f2, f3]
[5, 6] | [f4, f5]
我经常使用 inner join
s,group by t1.family_ids && t2.family_ids
,但似乎找不到性能良好的解决方案。 table 现在的大小约为 100k 行。将来,这个 table 将增长到 ~500k-1M 行。
这是一个图行走问题。
一种常见的方法是取消嵌套数组以生成节点,然后自连接匹配的族以计算所有边。然后我们可以使用递归查询遍历图,同时注意不要访问同一个节点两次,然后聚合生成组。最后一步就是恢复对应的family ids。
with recursive
nodes as (
select t.id, x.family_id
from mytable t
cross join lateral unnest(t.family_ids) as x(family_id)
),
edges as (
select n1.id as id1, n2.id as id2
from nodes n1
inner join nodes n2 using (family_id)
),
cte as (
select id1, id2, array[id1] as visited
from edges
where id1 = id2
union all
select c.id1, e.id2, c.visited || e.id2
from cte c
inner join edges e on e.id1 = c.id2
where e.id2 <> all(c.visited)
),
res as (
select id1, array_agg(distinct id2 order by id2) as id2s
from cte
group by id1
)
select
array_agg(distinct n.id order by n.id) as ids,
array_agg(distinct n.family_id order by n.family_id) as family_ids
from res r
inner join nodes n on n.id = r.id1
group by r.id2s
ids | family_ids
:-------- | :---------
{1,2,3,4} | {f1,f2,f3}
{5,6} | {f4,f5}
我想编写一个查询,通过重叠数组对行进行分组。考虑以下示例:
id | name | family_ids
--------------------------
1 | Alice | [f1, f2]
2 | Bob | [f1]
3 | Freddy | [f2, f3]
4 | James | [f3]
5 | Joe | [f4, f5]
6 | Tim | [f5]
Alice 和 Bob 是同一个家庭的成员,f1
。在 Alice 的姻亲 (f2
) 中,她也与 Freddy 有亲戚关系。而且考虑到Freddy的姻亲(f3
),James也跟他们有亲戚关系
所以,基本上,我想按 family_ids
中完全重叠的数组进行分组。但是,请注意 f2 -> f3
也应该被发现,这对于 1 个简单的 group by
查询是不可能的。
ids | family_ids
----------------------------
[1, 2, 3, 4] | [f1, f2, f3]
[5, 6] | [f4, f5]
我经常使用 inner join
s,group by t1.family_ids && t2.family_ids
,但似乎找不到性能良好的解决方案。 table 现在的大小约为 100k 行。将来,这个 table 将增长到 ~500k-1M 行。
这是一个图行走问题。
一种常见的方法是取消嵌套数组以生成节点,然后自连接匹配的族以计算所有边。然后我们可以使用递归查询遍历图,同时注意不要访问同一个节点两次,然后聚合生成组。最后一步就是恢复对应的family ids。
with recursive
nodes as (
select t.id, x.family_id
from mytable t
cross join lateral unnest(t.family_ids) as x(family_id)
),
edges as (
select n1.id as id1, n2.id as id2
from nodes n1
inner join nodes n2 using (family_id)
),
cte as (
select id1, id2, array[id1] as visited
from edges
where id1 = id2
union all
select c.id1, e.id2, c.visited || e.id2
from cte c
inner join edges e on e.id1 = c.id2
where e.id2 <> all(c.visited)
),
res as (
select id1, array_agg(distinct id2 order by id2) as id2s
from cte
group by id1
)
select
array_agg(distinct n.id order by n.id) as ids,
array_agg(distinct n.family_id order by n.family_id) as family_ids
from res r
inner join nodes n on n.id = r.id1
group by r.id2s
ids | family_ids :-------- | :--------- {1,2,3,4} | {f1,f2,f3} {5,6} | {f4,f5}