Computing distances on inhomogeneous vectors: python lists vs numpy arrays
In my project (about clustering algorithms, specifically k-medoids) being able to compute pairwise distances efficiently is crucial. I have a dataset of ~60,000 objects. The problem is that distances must be computed between inhomogeneous vectors, i.e. vectors which may differ in length (in that case, missing items are treated as 0).
Here is a minimal working example:
# %%
MAX_LEN = 11
N = 100

import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    # Trailing items of the longer vector are compared against 0.
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

# Random test data: N vectors of length 5..MAX_LEN with entries in 0..10.
data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()

import numpy as np

def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

def compute_distances_np():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

# Zero-pad every vector to MAX_LEN so all arrays have the same length.
data_np = [np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data]

%timeit compute_distances_np()
I am benchmarking my Python-list implementation against the numpy implementation.
Here are the results (computation times):

- Python lists: 79.6 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
- numpy arrays: 226 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Why is there such a big difference? I thought numpy arrays were supposed to be really fast.
Is there a way to improve my code? Am I misunderstanding how numpy works internally?
EDIT: In the future I may need to compute pairwise distances with a custom distance function. The method should also work on datasets of 60,000 objects without running out of memory (for reference, a full 60,000 × 60,000 float64 distance matrix alone takes about 28.8 GB).
I believe you can just make your arrays denser and set the unused trailing elements to 0.
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

def batch_pdist(x, metric, batchsize=1000):
    # Fill the full distance matrix one batchsize x batchsize block at a time.
    dists = np.zeros((len(x), len(x)))
    for i in range(0, len(x), batchsize):
        for j in range(0, len(x), batchsize):
            dist_batch = cdist(x[i:i+batchsize], x[j:j+batchsize], metric=metric)
            dists[i:i+batchsize, j:j+batchsize] = dist_batch
    return dists

MIN_LEN = 5
MAX_LEN = 11
N = 10000
M = 10

# Dense, zero-padded data: unused trailing elements stay 0.
data = np.zeros((N, MAX_LEN))
for i in range(N):
    num_nonzero = np.random.randint(MIN_LEN, MAX_LEN)
    data[i, :num_nonzero] = np.random.randint(0, M, num_nonzero)

dists = squareform(pdist(data, metric='cityblock'))
dists2 = batch_pdist(data, metric='cityblock', batchsize=500)
print((dists == dists2).all())
Timing output:
%timeit squareform(pdist(data, metric='cityblock'))
43.8 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
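Since batch_pdist already fills the result one block at a time, the same structure extends naturally to a disk-backed output, which should help with the out-of-memory issue mentioned in the question's edit. A minimal sketch, assuming a float64 result; batch_pdist_memmap and the path argument are illustrative names, not part of the timing above:

def batch_pdist_memmap(x, metric, path, batchsize=1000):
    # Same block structure as batch_pdist, but the result array is backed
    # by a file on disk, so only one batchsize x batchsize block needs to
    # be computed in RAM at a time.
    n = len(x)
    dists = np.memmap(path, dtype=np.float64, mode='w+', shape=(n, n))
    for i in range(0, n, batchsize):
        for j in range(0, n, batchsize):
            dists[i:i+batchsize, j:j+batchsize] = cdist(
                x[i:i+batchsize], x[j:j+batchsize], metric=metric)
    dists.flush()
    return dists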
EDIT: For custom distance functions, see the very bottom of this documentation.
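A minimal sketch of such a callable metric (here just re-implementing cityblock so the result can be checked against the built-in; my_manhattan is an illustrative name):

def my_manhattan(u, v):
    # pdist/cdist call this once per pair, with two 1-D arrays.
    return np.abs(u - v).sum()

dists_custom = squareform(pdist(data, metric=my_manhattan))
print(np.allclose(dists_custom, dists))

Keep in mind that a Python callable runs once per pair, so it is much slower than the built-in C metrics such as 'cityblock'.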
I finally found the most straightforward way to solve this problem, without changing much of the code and relying on computation rather than memory (since keeping everything in memory may not be feasible for very large datasets).
Following juanpa.arrivillaga's suggestion, I tried numba, a library that speeds up array-oriented and math-heavy Python code and is mainly targeted at numpy. A good guide on optimizing Python code can be found here: https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/.
MAX_LEN = 11
N = 100

# Pure Python lists implementation.
import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()

# numpy+numba implementation.
import numpy as np
from numba import jit

@jit
def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

@jit
def compute_distances_np():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

data_np = np.array([np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data])

%timeit compute_distances_np()
Timing output:
%timeit compute_distances()
78.4 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit compute_distances_np()
4.1 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, the numpy + numba optimization is roughly 19× faster (with no other code optimizations involved).
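A further tweak that is likely to help, though it is not benchmarked here: forcing numba's nopython mode via njit and writing the distance as an explicit loop, which avoids allocating the temporary vec1 - vec2 array on every call. A sketch under those assumptions:

import numpy as np
from numba import njit  # shorthand for jit(nopython=True)

@njit
def manhattan_distance_loop(vec1, vec2):
    # Explicit loop: compiled to tight machine code, no per-call temporary.
    dist = 0.0
    for i in range(vec1.shape[0]):
        dist += abs(vec1[i] - vec2[i])
    return dist

@njit
def compute_distances_loop(data_np):
    # Passing data_np as an argument (rather than reading a global Python
    # list) keeps the whole function compilable in nopython mode.
    n = data_np.shape[0]
    for i in range(n):
        for j in range(n):
            manhattan_distance_loop(data_np[i], data_np[j])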