Efficient way of selecting elements of an array based on values from another array in Python?

Question

I have two arrays, for example one of labels and one of distances:

labels = array([3, 1, 0, 1, 3, 2, 3, 2, 1, 1, 3, 1, 2, 1, 3, 2, 2, 3, 3, 3, 2, 3,
        0, 3, 3, 2, 3, 2, 3, 2,...])

distances = array([2.32284095, 0.36254613, 0.95734965, 0.35429638, 2.79098656,
        5.45921793, 2.63795657, 1.34516461, 1.34028463, 1.10808795,
        1.60549826, 1.42531201, 1.16280383, 1.22517273, 4.48511033,
        0.71543217, 0.98840598,...])

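What I want to do is group the values in distances into N arrays, where N is the number of unique label values (N=4 in this case). So all values with label = 3 go into one array, all values with label = 2 go into another, and so on.

I can think of a simple brute-force approach with loops and if-conditions, but it slows down badly on large arrays. I feel there must be a better way using a native list comprehension, numpy, or something else, I am just not sure what. What would be the best, most efficient approach?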

"Brute force" example for reference. Note that len(labels) == len(distances):

import numpy as np

all_distance_arrays = []
for label in np.unique(labels):
    # collect every distance whose label matches the current one
    sorted_distances = []
    for index in range(len(labels)):
        if label == labels[index]:
            sorted_distances.append(distances[index])
    all_distance_arrays.append(sorted_distances)

Answer 1

A simple list comprehension is clean and fast:

groups = [distances[labels == i] for i in np.unique(labels)]

Output:

>>> groups
[array([0.95734965]),
 array([0.36254613, 0.35429638, 1.34028463, 1.10808795, 1.42531201,
        1.22517273]),
 array([5.45921793, 1.34516461, 1.16280383, 0.71543217, 0.98840598]),
 array([2.32284095, 2.79098656, 2.63795657, 1.60549826, 4.48511033])]
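If you also need to know which group belongs to which label, the same comprehension can build a dict instead of a list. This is only a minimal sketch: it uses the first 17 values shown in the question (the arrays there are truncated), and the name `groups_by_label` is just illustrative.

```python
import numpy as np

# First 17 values from the question; the original arrays are truncated.
labels = np.array([3, 1, 0, 1, 3, 2, 3, 2, 1, 1, 3, 1, 2, 1, 3, 2, 2])
distances = np.array([2.32284095, 0.36254613, 0.95734965, 0.35429638, 2.79098656,
                      5.45921793, 2.63795657, 1.34516461, 1.34028463, 1.10808795,
                      1.60549826, 1.42531201, 1.16280383, 1.22517273, 4.48511033,
                      0.71543217, 0.98840598])

# One boolean mask per unique label; keys are converted to plain Python ints.
groups_by_label = {int(i): distances[labels == i] for i in np.unique(labels)}

print(sorted(groups_by_label))   # the labels that were found
print(groups_by_label[0])        # distances whose label is 0
```

Keep in mind that every mask `labels == i` scans the whole array, so the cost grows with len(labels) times the number of unique labels. For a handful of labels that is usually fine.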

Answer 2

Using only NumPy:

_, counts = np.unique(labels, return_counts=True)  # counts is the number of occurrences of each label
sor = labels.argsort()
sections = np.cumsum(counts)                       # end index of slices
labels_sor = np.split(labels[sor], sections)[:-1]
distances_sor = np.split(distances[sor], sections)[:-1]
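Here is a small self-contained variant of the same idea, as a sketch rather than a drop-in replacement. The data is made up (each distance equals its original index so the grouping is easy to read). Two assumed tweaks: `kind="stable"` keeps the original order inside each group, and the last cumulative count is dropped up front so `np.split` does not produce an empty trailing array.

```python
import numpy as np

labels = np.array([3, 1, 0, 1, 3, 2, 3, 2, 1, 1])
distances = np.arange(10, dtype=float)  # made-up values: distance == original index

uniq, counts = np.unique(labels, return_counts=True)
order = labels.argsort(kind="stable")   # stable sort preserves order within each label
sections = np.cumsum(counts)[:-1]       # interior split points only
distances_sor = np.split(distances[order], sections)

for lab, grp in zip(uniq, distances_sor):
    print(lab, grp)
```

`uniq` is already sorted, so it lines up one-to-one with the entries of `distances_sor`.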

Answer 3

With a reasonable number of labels, "brute force" seems to be good enough:

from collections import defaultdict

dist_group = defaultdict(list)
for lb, ds in zip(labels, distances):
    dist_group[lb].append(ds)

It is hard to say why this would not suit your purpose.
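If you want arrays rather than lists at the end, the dict can be converted in one more step. A minimal sketch with made-up data; `group_arrays` is an illustrative name, and calling `.tolist()` first is my own assumption that it helps (it gives plain Python ints as keys and usually makes the loop cheaper than iterating over NumPy scalars).

```python
from collections import defaultdict

import numpy as np

labels = np.array([3, 1, 0, 1, 3, 2])
distances = np.array([2.3, 0.4, 1.0, 0.35, 2.8, 5.5])  # made-up values

dist_group = defaultdict(list)
# Single pass over both arrays; each distance is appended to its label's list.
for lb, ds in zip(labels.tolist(), distances.tolist()):
    dist_group[lb].append(ds)

# Convert each list to an array, with the labels in sorted order.
group_arrays = {lb: np.asarray(vals) for lb, vals in sorted(dist_group.items())}

print(group_arrays)
```

Unlike the mask-per-label approach, this touches each element once, whatever the number of labels.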

Answer 4

You can do this using only numpy functions. First sort the two arrays in sync (this is what np.unique does under the hood), then split them wherever the label changes:

i = np.argsort(labels)
labels = labels[i]
distances = distances[i]
split_points = np.flatnonzero(np.diff(labels)) + 1
result = np.split(distances, split_points)
unique_labels = labels[np.r_[0, split_points]]
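To check the sort-and-split idea against the boolean-mask approach, it can be wrapped in a function that leaves the inputs untouched. This is a sketch under assumptions: `group_by_label` is a hypothetical helper name, the data is made up, the input is assumed to be non-empty, and `kind="stable"` is added so that each group keeps its original order.

```python
import numpy as np

def group_by_label(labels, values):
    """Hypothetical helper: return (unique_labels, list of value arrays)."""
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    sorted_values = values[order]
    # Positions where the sorted label differs from its predecessor.
    split_points = np.flatnonzero(np.diff(sorted_labels)) + 1
    groups = np.split(sorted_values, split_points)
    # First element of each run is that run's label (requires non-empty input).
    unique_labels = sorted_labels[np.r_[0, split_points]]
    return unique_labels, groups

labels = np.array([3, 1, 0, 1, 3, 2, 3, 2, 1, 1])
distances = np.arange(10, dtype=float)  # made-up values: distance == original index

unique_labels, groups = group_by_label(labels, distances)

# Cross-check against one boolean mask per label.
expected = [distances[labels == i] for i in unique_labels]
print(all(np.array_equal(g, e) for g, e in zip(groups, expected)))
```

The sort costs O(n log n) once, after which the split is a single pass, so the total does not depend on how many distinct labels there are.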

Tags: python, arrays, performance, numpy