查找数组中重复元素的函数

Question

我有一个类似列表的 python 正整数对象，我想获取该列表中的哪些位置具有重复值。例如如果输入是 [0,1,1] 函数应该 return [1,2] 因为值 1，即输入数组位置 1 和 2 的元素出现了两次。同样：

[0,13,13] 应该 return [[1, 2]]

[0,1,2,1,3,4,2,2] 应该 return [[1, 3], [2, 6, 7]] 因为 1 出现了两次，在输入数组的位置 [1, 3] 并且 2 出现了 3 次在位置 [2, 6, 7]

[1, 2, 3] 应该 return 一个空数组 []

我写的是这样的：

def get_locations(labels):
    out = []
    label_set = set(labels)
    for label in list(label_set):
        temp = [i for i, j in enumerate(labels) if j == label]
        if len(temp) > 1:
            out.append(np.array(temp))

    return np.array(out)

虽然它适用于小型输入数组，但当大小增加时会变得太慢。例如，下面的代码在我的电脑上，从 n=1000 时的 0.14secs 飙升到 n = 10000

时的 12secs

from timeit import default_timer as timer
start = timer()
n = 10000
a = np.arange(n)
b = np.append(a, a[-1]) # append the last element to the end
out = get_locations(b)
end = timer()
print(out)
print(end - start) # Time in seconds

请问我怎样才能加快速度？高度赞赏任何想法

Answer 1

您可以尝试使用“collections”模块中的“计数器”功能

from collections import Counter
list1 = [1,1,2,3,4,4,4]
Counter(list1)

您将得到类似于此的输出

Counter({4: 3, 1: 2, 2: 1, 3: 1})

Answer 2

这本质上是一个 time-complexity 问题。您的算法嵌套了循环遍历列表两次的循环，因此时间复杂度约为 n^2，其中 n 是列表的大小。因此，当您将列表的大小乘以 10（从 1,000 到 10,000）时，您会看到大约时间增加了 10^2 = 100。这就是它从 0.14 秒到 12 秒的原因。

这是一个简单的解决方案，不需要额外的库：

def get_locations(labels):
    locations = {}
    for index, label in enumerate(labels):
        if label in locations:
            locations[label].append(index)
        else:
            locations[label] = [index]

    return [locations[i] for i in locations if len(locations[i]) > 1]

由于 for 循环没有嵌套，时间复杂度约为 2n，所以当问题规模增加一倍时，您应该会看到大约 4 倍的时间增加。

Answer 3

由于嵌套的 for 循环，您的代码运行缓慢。您可以使用另一种数据结构以更有效的方式解决此问题：

from collections import defaultdict
mylist = [0,1,2,1,3,4,2,2]

output = defaultdict(list)
# Loop once over mylist, store the indices of all unique elements
for i, el in enumerate(mylist):
    output[el].append(i)

# Filter out elements that occur only once
output = {k:v for k, v in output.items() if len(v) > 1}

这会为您的示例生成以下输出 b：

{1: [1, 3], 2: [2, 6, 7]}

您可以将此结果转换为所需的格式：

list(output.values())
> [[1, 3], [2, 6, 7]]

但是请注意，这依赖于字典的插入顺序，只有 python 3.6 才会如此。

Answer 4

您的嵌套循环导致时间复杂度为 O(n ^ 2)。您可以改为创建一个列表字典来将索引映射到每个标签，并且仅当子列表的长度大于 1 时才提取字典的子列表，这样可以将时间复杂度降低到 O (n):

def get_locations(labels):
    positions = {}
    for index, label in enumerate(labels):
        positions.setdefault(label, []).append(index)
    return [indices for indices in positions.values() if len(indices) > 1]

所以 get_locations([0, 1, 2, 1, 3, 4, 2, 2]) returns:

[[1, 3], [2, 6, 7]]

Answer 5

这是我实现的代码。它以线性时间运行：

l = [0,1,2,1,3,4,2,2]
dict1 = {}

for j,i in enumerate(l): # O(n)
    temp = dict1.get(i) # O(1) most cases
    if not temp:
        dict1[i] = [j]
    else:
        dict1[i].append(j) # O(1) 

print([item for item in dict1.values() if len(item) > 1]) # O(n)

输出：

[[1, 3], [2, 6, 7]]

查找数组中重复元素的函数

lookup function for repeated elements in an array

python

arrays

big-o

numpy

python-3.x