Faster algorithm for removing duplicates from a dictionary, comparing two approaches

Question

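I am currently working with the notMNIST dataset in Python 2.7 and trying to remove duplicate images. I computed an MD5 hash for each image and stored the hashes in a dictionary, image_hash.

My first approach works, but the dataset contains 500,000 images in total and it took nearly an hour.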
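
For context, a dictionary like image_hash might be built roughly as follows. This is only a sketch: the helper name build_image_hash, the use of integer indices as keys, and the in-memory byte strings standing in for image files are all assumptions, not details given in the question.

```python
import hashlib


def build_image_hash(images):
    """Map each image's index to the MD5 hex digest of its raw bytes.

    `images` is assumed to be a sequence of byte strings (hypothetical input).
    """
    return {index: hashlib.md5(data).hexdigest()
            for index, data in enumerate(images)}


# Three stand-in "images"; the first and third have identical bytes.
sample_images = [b"\x00\x01\x02", b"\xff\xfe", b"\x00\x01\x02"]
sample_image_hash = build_image_hash(sample_images)
```

Images with identical bytes get identical digests, so duplicates show up as repeated values in the dictionary.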

image_hash_identical = {}
for key,value in image_hash.items():
    if value not in image_hash_identical.values():
        image_hash_identical[key] = value

I tried a second approach that uses the 'set' function to speed things up:

image_hash_set_values = list(set(image_hash.values()))
for i in range(len(image_hash_set_values)):
    for j in range(i, len(image_hash)):
        if image_hash[j] == image_hash_set_values[i]:
            image_hash_identical[i] = image_hash[j]
            break

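However, this code did not speed things up, because the 'set' function shuffles the order of image_hash. Is there a way to suppress the shuffling done by 'set', or a faster algorithm that handles this case?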

Answer 1

Why not use a set to keep track of the values you have already seen:

image_hash_identical, seen = {}, set()
for key, value in image_hash.items():
    if value not in seen:  # set membership test: O(1) on average
        image_hash_identical[key] = value
        seen.add(value)
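
The same idea can be wrapped in a function and checked on a small dictionary. This is a minimal sketch; the function name dedupe_by_value and the sample data are made up for illustration. The original loop is slow because `value not in image_hash_identical.values()` scans every value kept so far (and in Python 2.7 `.values()` also builds a new list on each call), so the total work grows quadratically with the number of images, whereas a set lookup takes constant time on average. A plain dict in Python 2.7 has no guaranteed order, so which key survives among duplicates is arbitrary; iterating over sorted keys, as assumed here, makes that choice deterministic.

```python
def dedupe_by_value(image_hash):
    """Return a new dict that keeps one key per distinct value."""
    unique, seen = {}, set()
    # Sorting the keys makes "first occurrence" well defined even where
    # dicts are unordered (e.g. Python 2.7); the sort costs O(n log n).
    for key in sorted(image_hash):
        value = image_hash[key]
        if value not in seen:  # average O(1) membership test
            unique[key] = value
            seen.add(value)
    return unique


# Hypothetical hashes: indices 0 and 2 are duplicates, as are 1 and 3.
example = {0: "aaa", 1: "bbb", 2: "aaa", 3: "bbb", 4: "ccc"}
result = dedupe_by_value(example)
print(result)
```

If it does not matter which key is kept, the sort can be dropped and the loop can iterate over image_hash.items() directly, as in the snippet above.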


Tags: python, dictionary, image, duplicates, computer-vision