How to find the nearest neighbour index from one series to another

Question

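I have a target array A, which represents isobaric pressure levels in NCEP reanalysis data. I also have the pressure at which a cloud is observed, as a long time series, B.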

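What I am looking for is a k-nearest-neighbour lookup that returns the indices of those nearest neighbours, something like knnsearch in Matlab that could be expressed the same way in Python, for example: indices, distance = knnsearch(A, B, n), where indices holds the n nearest indices in A for every value in B, and distance is how far each value in B is from those nearest values in A. A and B can be of different lengths (this is the bottleneck I have found with most solutions so far, whereby I would have to loop over each value in B to return my indices and distance).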

import numpy as np

A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10]) # this is a fixed 17-by-1 array
B = np.array([923, 584.2, 605.3, 153.2]) # this can be any n-by-1 array
n = 2

What I would like returned from indices, distance = knnsearch(A, B, n) is this:

indices = [[1, 2],[4, 5] etc...]

where 923 in B is matched first to A[1]=925 and then to A[2]=850, and 584.2 in B is matched first to A[4]=600 and then to A[5]=500

distance = [[2, 73],[15.8, 84.2] etc...]

where 2 is the distance between the query value in B and the nearest value in A, e.g. distance[0, 0] == np.abs(B[0] - A[1])

The only solution I could come up with is:

import numpy as np


def knnsearch(A, B, n):
    indices = np.zeros((len(B), n), dtype=int)
    distances = np.zeros((len(B), n))

    for i in range(len(B)):
        # float copy of the distances, so chosen neighbours can be masked out
        dif = np.abs(A - B[i]).astype(float)
        for N in range(n):
            ind = np.argmin(dif)

            indices[i, N] = ind
            distances[i, N] = dif[ind]
            # remove this neighbour from future consideration
            dif[ind] = np.inf

    return indices, distances


array_A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10])
array_B = np.array([923, 584.2, 605.3, 153.2])
neighbours = 2

indices, distances = knnsearch(array_A, array_B, neighbours)

print(indices)
print(distances)

returns:

[[ 1  2]
 [ 4  5]
 [ 4  3]
 [10  9]]

[[ 2.  73. ]
 [15.8 84.2]
 [ 5.3 94.7]
 [ 3.2 46.8]]

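There must be a way to remove the for loops, as I need performance should my A and B arrays contain many thousands of elements with many nearest neighbours...

Please help! Thanks :)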

Answer 1

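The second loop can easily be vectorized. The most straightforward way is to use np.argsort and select the indices corresponding to the n smallest dif values. However, for large arrays, since only n values need to be sorted, it is better to use np.argpartition.

Therefore, the code would look something like this: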

def vector_knnsearch(A, B, n):
    indices = np.empty((len(B), n), dtype=int)
    distances = np.empty((len(B), n))

    for i, b in enumerate(B):
        dif = np.abs(A - b)
        # Indices of the n smallest distances, not necessarily in sorted order
        min_ind = np.argpartition(dif, n)[:n]
        # argpartition does not order its output, so sort the n candidates by distance
        ind = min_ind[np.argsort(dif[min_ind])]
        indices[i, :] = ind
        distances[i, :] = dif[ind]

    return indices, distances
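
As a quick check of the partition-then-sort step, here is a minimal sketch for a single query value taken from the question (b = 605.3, n = 2). The variable names full_sort, candidates and partial_sort are only illustrative. It compares a full np.argsort with np.argpartition followed by a sort of just the n candidates.

```python
import numpy as np

# Pressure levels from the question and one query value.
A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200,
              150, 100, 70, 50, 30, 20, 10])
b = 605.3
n = 2

dif = np.abs(A - b)

# Full sort: orders all len(A) distances, then keeps the first n.
full_sort = np.argsort(dif)[:n]

# Partial sort: argpartition only guarantees that the n smallest distances
# sit in the first n slots (in arbitrary order), so only those n are sorted.
candidates = np.argpartition(dif, n)[:n]
partial_sort = candidates[np.argsort(dif[candidates])]

print(full_sort)     # [4 3]
print(partial_sort)  # [4 3]
```

Both give the same neighbours here because there are no tied distances; with ties, the two approaches may pick different but equally close indices.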

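As mentioned in the comments, the first loop can also be removed using a meshgrid. However, the extra memory and computation time needed to build the meshgrid made this approach slower for the dimensions I tried (this will probably get worse for large arrays and eventually end in a memory error). In addition, the readability of the code decreases. Overall, this probably makes the approach less pythonic.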

def mesh_knnsearch(A, B, n):
    m = len(B)
    rng = np.arange(m).reshape((m,1))
    Amesh, Bmesh = np.meshgrid(A,B)
    dif = np.abs(Amesh-Bmesh)
    min_ind = np.argpartition(dif,n,axis=1)[:,:n]
    ind = min_ind[rng,np.argsort(dif[rng,min_ind],axis=1)]

    return ind, dif[rng,ind]

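Note that it is important to define rng as a 2D array in order to retrieve a[rng[0], ind[0]], a[rng[1], ind[1]], etc. and keep the dimensions of the array, as opposed to a[:, ind], which retrieves a[:, ind[0]], a[:, ind[1]], etc.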
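
The difference between the two indexing forms can be seen on a small made-up array. This is a sketch with hypothetical values (dif and ind below are not taken from the question), and it also shows np.take_along_axis, which I believe expresses the same row-wise pairing without building rng.

```python
import numpy as np

# Hypothetical distance matrix: one row per query, one column per target.
dif = np.array([[5.0, 1.0, 3.0],
                [2.0, 9.0, 4.0]])
# Per-row column indices to pick, shape (2, 2).
ind = np.array([[1, 2],
                [0, 2]])

# Column vector of row numbers, shape (2, 1).
rng = np.arange(dif.shape[0]).reshape(-1, 1)

# rng broadcasts against ind, so row i is indexed only with ind[i].
paired = dif[rng, ind]
# Without rng, every row is indexed with every row of ind.
crossed = dif[:, ind]

print(paired.shape)   # (2, 2)
print(crossed.shape)  # (2, 2, 2)

# take_along_axis performs the same row-wise selection as dif[rng, ind].
same = np.take_along_axis(dif, ind, axis=1)
print(np.array_equal(paired, same))  # True
```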

