
Remove duplicate JSON objects from list in python

I have a list of dictionaries in which a particular value is repeated several times, and I want to remove the duplicates.

My list:

te = [
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      }
    ]

Function to remove the duplicates:

def removeduplicate(it):
    seen = set()
    for x in it:
        if x not in seen:
            yield x
            seen.add(x)

When I call this function, I get a generator object:

<generator object removeduplicate at 0x0170B6E8>

When I try to iterate over the generator, I get TypeError: unhashable type: 'dict'

Is there a way to remove the duplicate values, or to iterate over the generator?

You can easily remove the duplicates with a dict comprehension, because a dictionary does not allow duplicate keys. Note that this keys on the Name field, so only one entry per Name is kept, like so -

te = [
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
          "Name": "Bala1",
          "phone": "None"
      }      
    ]

unique = list({ each['Name'] : each for each in te }.values())

print(unique)

Output -

[{'Name': 'Bala', 'phone': 'None'}, {'Name': 'Bala1', 'phone': 'None'}]
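If you want to deduplicate on the entire dictionary rather than just the Name field, a variant of the same idea (my sketch, not part of the original answer) is to key the comprehension on the dict's sorted items, which are hashable:

```python
te = [
    {"Name": "Bala", "phone": "None"},
    {"Name": "Bala", "phone": "None"},
    {"Name": "Bala1", "phone": "None"},
]

# tuple(sorted(d.items())) is a hashable, order-insensitive fingerprint
# of the whole dict, so equal dicts collapse onto one key.
unique = list({tuple(sorted(d.items())): d for d in te}.values())
print(unique)
# [{'Name': 'Bala', 'phone': 'None'}, {'Name': 'Bala1', 'phone': 'None'}]
```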

That is because you can't add a dict to a set. From this question:

You're trying to use a dict as a key to another dict or in a set. That does not work because the keys have to be hashable.

As a general rule, only immutable objects (strings, integers, floats, frozensets, tuples of immutables) are hashable (though exceptions are possible).

>>> foo = dict()
>>> bar = set()
>>> bar.add(foo)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>> 

Instead, since you are already using if x not in seen, just use a list:

>>> te = [
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       }
...     ]

>>> def removeduplicate(it):
...     seen = []
...     for x in it:
...         if x not in seen:
...             yield x
...             seen.append(x)

>>> removeduplicate(te)
<generator object removeduplicate at 0x7f3578c71ca8>

>>> list(removeduplicate(te))
[{'phone': 'None', 'Name': 'Bala'}]
>>> 

You can still use a set for duplicate detection; you just need to convert the dictionary into something hashable, such as a tuple. Your dictionaries can be converted to tuples with tuple(d.items()), where d is the dictionary. Applying that to your generator function:

def removeduplicate(it):
    seen = set()
    for x in it:
        t = tuple(x.items())
        if t not in seen:
            yield x
            seen.add(t)

>>> for d in removeduplicate(te):
...    print(d)
{'phone': 'None', 'Name': 'Bala'}

>>> te.append({'Name': 'Bala', 'phone': '1234567890'})
>>> te.append({'Name': 'Someone', 'phone': '1234567890'})

>>> for d in removeduplicate(te):
...    print(d)
{'phone': 'None', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Someone'}

This gives faster lookups (average O(1)) than the "seen" list (O(n)). Whether the extra work of converting each dict to a tuple is worthwhile depends on how many dictionaries you have and how many duplicates there are. With many duplicates the "seen" list grows very large, and testing whether a dict has already been seen can become an expensive operation. That may justify the tuple conversion - you would have to test/profile it.
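One caveat: tuple(x.items()) preserves insertion order, so two dicts with equal contents whose keys were inserted in a different order produce different tuples and would not be detected as duplicates. If that matters for your data, a frozenset of the items is an order-insensitive alternative (a sketch of that variation, not from the answer above):

```python
def removeduplicate(it):
    seen = set()
    for x in it:
        key = frozenset(x.items())  # hashable and independent of key insertion order
        if key not in seen:
            yield x
            seen.add(key)

data = [
    {"Name": "Bala", "phone": "None"},
    {"phone": "None", "Name": "Bala"},  # same contents, different key order
]
print(list(removeduplicate(data)))  # only the first dict survives
```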

I just compare everything with md5.

import hashlib
import json

filtered_json = []
md5_list = []

# json_fin is the input list of dictionaries
for item in json_fin:
    # Hash the serialized dict; keep the item only if the hash is new
    md5_result = hashlib.md5(json.dumps(item, separators=(',', ':')).encode("utf-8")).hexdigest()
    if md5_result not in md5_list:
        md5_list.append(md5_result)
        filtered_json.append(item)
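One thing to watch with this approach: json.dumps serializes keys in insertion order by default, so two dicts with the same contents but different key order get different hashes. Passing sort_keys=True avoids that. The helper below is my own wrapper around the same idea (dedupe_by_md5 is a hypothetical name, not from the snippet above):

```python
import hashlib
import json

def dedupe_by_md5(items):
    """Keep the first occurrence of each distinct dict, keyed by the md5 of its JSON."""
    seen, out = set(), []
    for item in items:
        # sort_keys=True makes the serialization (and hence the hash) key-order independent
        digest = hashlib.md5(
            json.dumps(item, sort_keys=True, separators=(',', ':')).encode('utf-8')
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(item)
    return out

print(dedupe_by_md5([{'a': 1, 'b': 2}, {'b': 2, 'a': 1}]))  # [{'a': 1, 'b': 2}]
```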