以 COO 格式构建图形连接矩阵

Construct graph connectivity matrices in COO format

我在处理图形数据时遇到了以下子任务:

我需要构建 COO 格式的图连接矩阵,用于具有来自“边界”索引数组的多个完全连接组件的图。

例如,给定数组

borders = [0, 2, 5]

生成的 COO 矩阵应该是

coo_matrix = [[0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
              [0, 1, 0, 1, 2, 3, 4, 2, 3, 4, 2, 3, 4]].

也就是说,borders 数组包含 ranges 个应该形成全连接子图的节点(包括起始索引和排除结束索引) .

我提出了以下算法,但我怀疑性能可以提高:

import numpy as np

def get_coo(borders):

    edge_list = []
    for s, e in zip(borders, borders[1:]):
        
        # create fully-connected subgraph
        arr = np.arange(s, e)
        t = np.array(np.meshgrid(arr, arr)).T.reshape(-1, 2)
        t = t.T

        edge_list.append(t)

    edge_list = np.concatenate(edge_list, axis=1)

    return edge_list

我觉得可能有一个更快的解决方案,也许使用一些 numpy 向量化操作。

有没有人有什么想法?

由于您的目标是获得比现有解决方案更快的解决方案,因此您可以探索 itertools 以高效解决此问题。在更大的 border 列表上测试时,此方法的基准测试 比您当前的方法快 大约 25 倍。

import numpy as np
from itertools import product, chain

def get_coo(borders):
    edges = chain(*[product(range(*i),repeat=2) for i in zip(borders, borders[1:])])
    return list(edges)

output = get_coo(borders)

## NOTE: Remember to can convert to array and Transpose for all approaches below to get Coo format.
np.array(output).T
array([[0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
       [0, 1, 0, 1, 2, 3, 4, 2, 3, 4, 2, 3, 4]])

替代方法和基准:

Note: These have been benchmarked on your current small list as well as on a larger border list as generated by borders = np.arange(300)[np.random.randint(0,2,(300,),dtype=bool)]

完全图的不相交并集

您要做的实际上是组合不相交的完整图。此类图的邻接矩阵具有沿其对角线的选择性项目的完整连接。您可以使用 networkx 来解决这些问题。

虽然比您当前的解决方案慢,但您会发现处理这些图形对象比使用 NumPy 表示图形要容易得多,也更有收获。

方法一:

  1. 假设节点是有序的,计算每个子图中的节点个数为i
  2. 创建一个用 1 填充形状 i*i
  3. 的完整矩阵
  4. 使用nx.disjoint_union_all
  5. 合并图表
  6. 获取此图的边
import numpy as np
import networkx as nx

def get_coo(borders):
    graphs = [nx.from_numpy_matrix(np.ones((i,i))).to_directed() for i in np.diff(borders)]
    edges = nx.disjoint_union_all(graphs).edges()
    return edges

%timeit get_coo(borders)

#Small- 277 µs ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#Large- 300 ms ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

方法二:

  1. 使用 zip
  2. 迭代 borders 的滚动 2 元组
  3. 使用来自这些元组的 ranges(start, end) 创建 nx.complete_graph
  4. 使用nx.disjoint_union_all
  5. 合并图表
  6. 获取此图的边
import numpy as np
import networkx as nx

def get_coo(borders):
    graphs = [nx.complete_graph(range(*i),nx.DiGraph()) for i in zip(borders, borders[1:])]
    edges = nx.disjoint_union_all(graphs).edges()
    return edges

%timeit get_coo(borders)

#Small- 116 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Large- 207 ms ± 35.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

输出比以前快一点,但缺少节点上的自循环,必须单独添加

使用itertools.product

方法三:

  1. 使用 zip
  2. 迭代 borders 的滚动 2 元组
  3. 对每个子图使用itertools.product个完全连接的边列表
  4. 使用itertools.chain“附加”两个迭代器
  5. Return 它们作为边缘
import numpy as np
from itertools import product, chain

def get_coo(borders):
    edges = chain(*[product(range(*i),repeat=2) for i in zip(borders, borders[1:])])
    return list(edges)

%timeit get_coo(borders)

#Small- 3.91 µs ± 787 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#Large- 183 µs ± 21.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

这种方法比您当前的方法快大约 25 倍

您当前的方法 - 基准测试

def get_coo(borders):

    edge_list = []
    for s, e in zip(borders, borders[1:]):
        
        # create fully-connected subgraph
        arr = np.arange(s, e)
        t = np.array(np.meshgrid(arr, arr)).T.reshape(-1, 2)
        t = t.T

        edge_list.append(t)

    edge_list = np.concatenate(edge_list, axis=1)

    return edge_list

%timeit get_coo(borders)

#Small- 95.1 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Large- 3.91 ms ± 962 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)