有效地找到具有切割和 return 索引的邻居

Question

我在x,y平面上有很多点，长度在10000左右，每个点(x,y)都有一个固有半径r。这个小数据集只是我整个数据集中的一个小角落。我有一个感兴趣的点(x1,y1)，我想在1内找到(x1,y1)附近的点，并且满足(x,y)和(x1,y1)之间的距离小于[=14的条件=].我要return那些优点的索引，而不是优点本身。

import numpy as np
np.random.seed(2000)
x = 20.*np.random.rand(10000)
y = 20.*np.random.rand(10000)
r = 0.3*np.random.rand(10000)
x1 = 10.  ### (x1,y1) is an interest point 
y1 = 12.
def index_finder(x,y,r,x1,y1):
    idx = (abs(x - x1) < 1.) & (abs(y - y1) < 1.)  ### This cut will probably cut 90% of the data
    x_temp = x[idx]   ### but if I do like this, then I lose the track of the original index
    y_temp = y[idx]
    dis_square = (x_temp - x1)*(x_temp - x1) + (y_temp - y1)*(y_temp - y1)
    idx1 = dis_square < r*r    ### after this cut, there are only a few left 
    x_good = x_temp[idx1]
    y_good = y_temp[idx1]

在这个函数中，我可以找到(x1,y1)周围的好点，但找不到那些好点的索引。但是，我需要原始索引，因为原始索引用于提取与坐标 (x,y) 关联的其他数据。正如我提到的，样本数据集只是我整个数据集的一个小角落，我将对我的整个数据集调用上述函数大约 1,000,000 次，因此上述 index_finder 函数的效率也是一个考虑因素.

对这样的任务有什么想法吗？

Answer 1

方法 #1

我们可以简单地索引到第一个掩码，它有自己的掩码，用于从第二阶段选择 True 地方掩码值，就像这样 -

idx[idx] = idx1

因此，idx 将具有对应于原始数组 x 和 y 的最终有效屏蔽值/有价值的位置，即 -

x_good = x[idx]
y_good = y[idx]

此掩码可用于索引问题中提到的其他数组。

方法 #2

作为另一种方法，我们可以使用两个条件语句，从而用它们创建两个掩码。最后，将它们与 AND-ing 组合以获得组合掩码，可以将其索引到 x 和 y 数组中以获得最终输出。我们不需要以这种方式获取实际索引，因此这是另一个好处。

因此，实施 -

X = x-x1
Y = y-y1
mask1 = (np.abs(X) < 1.) & (np.abs(Y) < 1.)
mask2 = X**2 + Y*2 < r**2
comb_mask = mask1 & mask2

x_good = x[comb_mask]
y_good = y[comb_mask]

如果出于某种原因，您仍然需要相应的索引，只需执行 -

comb_idx = np.flatnonzero(comb_mask)

如果您对同一 x 和 y 数据集的不同 x1 和 y1 对执行这些操作，我建议使用 broadcasting通过所有这些 x1、y1 配对数据集对其进行矢量化，如 .

所示

Answer 2

您可以使用索引掩码，如下所示：

def index_finder(x,y,r,x1,y1):
    idx = np.nonzero((abs(x - x1) < 1.) & (abs(y - y1) < 1.))  #numerical, not boolean
    mask = (x[idx] - x1)*(x[idx] - x1) + (y[idx] - y1)*(y[idx] - y1) < r*r
    idx1 = [i[mask] for i in idx]
    x_good = x_temp[idx1]
    y_good = y_temp[idx1]

现在 idx1 是您要提取的索引。

通常更快的方法是使用 scipy.spatial.KDTree

from scipy.spatial import KDTree

xy = np.stack((x,y))
kdt = KDTree(xy)
kdt.query_ball_point([x1, y1], r)

如果您有很多点要针对同一个数据集进行查询，这将比顺序调用您的 index_finder 应用程序快很多。

x1y1 = np.stack((x1, y1)) #`x1` and `y1` are arrays of coordinates.
kdt.query_ball_point(x1y1, r)

也错了：如果每个点的距离不同，你可以这样做：

def query_variable_ball(kdtree, x, y, r):
    out = []
    for x_, y_, r_ in zip(x, y, r):
        out.append(kdt.query_ball_point([x_, y_], r_)
    return out

xy = np.stack((x,y))
kdt = KDTree(xy)
query_variable_ball(kdt, x1, y1, r)

编辑 2： 这应该适用于每个点的不同 r 值

from scipy.spatial import KDTree

def index_finder_kd(x, y, r, x1, y1):  # all arrays
    xy = np.stack((x,y), axis = -1)
    x1y1 = np.stack((x1, y1), axis = -1)
    xytree = KDTree(xy)
    d, i = xytree.query(x1y1, k = None, distance_upper_bound = 1.)
    good_idx = np.zeros(x.size, dtype = bool)
    for idx, dist in zip(i, d):
        good_idx[idx] |= r[idx] > dist
    x_good = x[good_idx]
    y_good = y[good_idx]
    return x_good, y_good, np.flatnonzero(good_idx)

只有一对 (x1, y1) 这非常慢，因为 KDTree 需要一段时间来填充。但如果你有数百万对，这会快得多。

（我假设你想要所有 (x1, y1) 的 (x, y) 数据中所有好的点的联合，如果你想要它们分开，也可以使用类似的方法，删除元素i[j] 基于是否 d[j] < r[i[j]])

Answer 3

numpy.where 似乎是为了找到索引

矢量化范数计算 + np.where() 可能比循环更快

sq_norm = (x - x1)**2 + (y - y1)**2  # no need to take 10000 sqrt
idcs = np.where(sq_norm < 1.)

len(idcs[0])
Out[193]: 69

np.stack((idcs[0], x[idcs], y[idcs]), axis=1)[:5]
Out[194]: 
array([[  38.        ,    9.47165956,   11.94250173],
       [  39.        ,    9.6966941 ,   11.67505453],
       [ 276.        ,   10.68835317,   12.11589316],
       [ 288.        ,    9.93632584,   11.07624915],
       [ 344.        ,    9.48644057,   12.04911857]])

范数计算也可以包含r数组，第2步？

r_sq_norm = (x[idcs] - x1)**2 + (y[idcs] - y1)**2 - r[idcs]**2
r_idcs = np.where(r_sq_norm < 0.)

idcs[0][r_idcs]
Out[11]: array([1575, 3476, 3709], dtype=int64)

您可能想要计时 2 步测试与在第一个向量化范数计算中包含 r？

sq_norm = (x - x1)**2 + (y - y1)**2 - r**2
idcs = np.where(sq_norm < 0.)

idcs[0]
Out[13]: array([1575, 3476, 3709], dtype=int64)

有效地找到具有切割和 return 索引的邻居

Find neighbors with cuts efficiently and return index

python

indexing

numpy

nearest-neighbor