Fast Python outer difference of lists

Question

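I would like to compute the difference between every pair of lists in a Python list of equal-length lists and put the results in a Numpy array.

By the difference between two lists, I mean the number of positions at which their corresponding elements differ (that is, the Hamming distance). Here is an example difference function using a list comprehension: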

def list_difference(list_a, list_b):
    len_lists = len(list_a)
    assert len_lists == len(list_b), "Lists must be the same length."
    return sum([list_a[i] != list_b[i] for i in range(len_lists)])
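As a quick worked example of this definition, here is a minimal sketch that counts mismatches for the first two sequences used below. The name `count_mismatches` is hypothetical; it is a `zip`-based variant of the function above, not a replacement for it.

```python
def count_mismatches(list_a, list_b):
    """Count positions where two equal-length lists differ (hypothetical helper)."""
    if len(list_a) != len(list_b):
        raise ValueError("Lists must be the same length.")
    # zip pairs up corresponding elements; each mismatch contributes 1 to the sum.
    return sum(a != b for a, b in zip(list_a, list_b))


mismatches = count_mismatches(["A", "A", "A", "B", "C"], ["B", "A", "B", "A", "B"])
print(mismatches)  # 4: only the second position ("A" vs "A") matches
```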

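I then call the difference function on every pair in the list of lists and put the results in a numpy array. You could call it the outer difference of a list of lists, by analogy with the outer product. I do this in a naive loop: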

import numpy as np
import time

sequences = [
    ["A", "A", "A", "B", "C"],
    ["B", "A", "B", "A", "B"],
    ["B", "A", "C", "C", "B"],
    ["B", "A", "C", "C", "C"],
]

start = time.time()
n_seq = len(sequences)
dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
    for col in range(n_seq):
        if row >= col:
            continue
        dists[row, col] = list_difference(sequences[row], sequences[col])
dists += dists.T
print(dists)
print(f"Time: {time.time() - start} seconds")

This results in

[[0. 4. 4. 3.]
 [4. 0. 2. 3.]
 [4. 2. 0. 1.]
 [3. 3. 1. 0.]]
Time: 0.0003669261932373047 seconds

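This example is already fast enough on my computer (the time above is the best of three runs). However, running it on 64 lists (sequences) of length 1008 each takes 293.572190 seconds, which is a long time. Is there a faster way?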

Attempts:

1 I tried putting the inner for loop into a list comprehension:

dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
    # Fill only the strict upper triangle (col > row); the transpose is added below.
    dist_row = [list_difference(sequences[row], sequences[col]) for col in range(row + 1, n_seq)]
    dists[row, row + 1:] = dist_row
dists += dists.T

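But this actually made it slower, taking 0.000523 seconds (0.70x the original speed). On my larger dataset it takes 319.2168769 seconds (0.92x the original speed).

2 I wondered whether doing the for loops in native Python and copying to Numpy only at the end would help.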

_dists = [ [0]*n_seq for i in range(n_seq)]
for row in range(n_seq):
    for col in range(n_seq):
        if col <= row:
            continue
        _dists[row][col] = list_difference(sequences[row], sequences[col])
dists = np.array(_dists)
dists += dists.T

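This takes 0.0001862 seconds, roughly twice as fast as the original code. On my larger dataset the speedup is less significant, at 229.945951 seconds (1.28x the original speed), but it is still an improvement.

3 It just occurred to me that there may be a way to do the two outer for loops directly in Numpy.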

Answer 1

A cleaner, simpler, and faster way is to use numpy broadcasting:

sequences = np.array(sequences)
dists = (sequences[:, None] != sequences).sum(axis=2)

Output:

>>> dists
array([[0, 4, 4, 3],
       [4, 0, 2, 3],
       [4, 2, 0, 1],
       [3, 3, 1, 0]])
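To make the broadcasting step easier to follow, here is a sketch that splits the one-liner into its intermediate arrays and prints their shapes. The variable names `left` and `mismatch` are introduced here only for illustration.

```python
import numpy as np

sequences = np.array([
    ["A", "A", "A", "B", "C"],
    ["B", "A", "B", "A", "B"],
    ["B", "A", "C", "C", "B"],
    ["B", "A", "C", "C", "C"],
])

# Insert a length-1 axis so each sequence can be compared with every other one.
left = sequences[:, None]      # shape (4, 1, 5)
mismatch = left != sequences   # broadcasts to shape (4, 4, 5), dtype bool
dists = mismatch.sum(axis=2)   # count mismatches per pair -> shape (4, 4)

print(left.shape, mismatch.shape, dists.shape)
print(dists)
```

Note that the intermediate boolean array holds one element per pair of sequences per position. For 64 sequences of length 1008 that is 64 * 64 * 1008 = 4,128,768 booleans, about 4 MB, so it should fit comfortably in memory. For much larger inputs, processing the rows in chunks may be necessary.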
