Remove duplicates in a JSON file based on two attributes

I have a JSON file that contains nested JSON. I want to remove duplicates based on two keys.

JSON example:

"books": [
{
            "id": "1",
            "story": {
                "title": "Lonely lion"
            },
            "description": [
                {
                    "release": false,
                    "author": [
                        {
                            "name": "John",
                            "main": 1
                        },
                        {
                            "name": "Jeroge",
                            "main": 0
                        },
                        {
                            "name": "Peter",
                            "main": 0
                        }
                    ]
                }
            ]
        },
    {
            "id": "2",
            "story": {
                "title": "Lonely lion"
            },
            "description": [
                {
                    "release": false,
                    "author": [
                        {
                            "name": "Jeroge",
                            "main": 1
                        },
                        {
                            "name": "Peter",
                            "main": 0
                        },
                        {
                            "name": "John",
                            "main": 0
                        }
                    ]
                }
            ]
        },
{
            "id": "3",
            "story": {
                "title": "Lonely lion"
            },
            "description": [
                {
                    "release": false,
                    "author": [
                        {
                            "name": "John",
                            "main": 1
                        },
                        {
                            "name": "Jeroge",
                            "main": 0
                        }
                        
                    ]
                }
            ]
        }
]

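Here I am trying to match on the title and the author names. For example, id 1 and id 2 are duplicates, because the title is the same and the author names are the same (the order of the authors does not matter, and the main attribute need not be considered). So the output JSON should keep only one of id:1 or id:2, together with id:3. In the final output I need two files.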

Output_JSON:
"books": [
{
            "id": "1",
            "story": {
                "title": "Lonely lion"
            },
            "description": [
                {
                    "release": false,
                    "author": [
                        {
                            "name": "John",
                            "main": 1
                        },
                        {
                            "name": "Jeroge",
                            "main": 0
                        },
                        {
                            "name": "Peter",
                            "main": 0
                        }
                    ]
                }
            ]
        },
 
{
            "id": "3",
            "story": {
                "title": "Lonely lion"
            },
            "description": [
                {
                    "release": false,
                    "author": [
                        {
                            "name": "John",
                            "main": 1
                        },
                        {
                            "name": "Jeroge",
                            "main": 0
                        }

                    ]
                }
            ]
        }
]
duplicatedID.csv:

1-2

I tried the following, but it does not give the correct result:

list= []
duplicate_Id = []
for data in (json_data['books'])[:]:   
    
    elements= []
    id = data['id']
    title = data['story']['title']
    elements.append(title)
    for i in (data['description'][0]['author']):
        name = (i['name'])
        elements.append(name)
    
    if not list:
        list.append(elements)
    
    
    else:
        for j in list:
            if set(elements) == set(j):
                duplicate_Id.append(id)
                elements = []
     

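The general idea is:

  • Get the groups identified by some function that collects duplicates.
  • Then return the first entry of each group, which ensures there are no duplicates.
  • Define the key function as the sorted list of author names, since the author list is by definition the unique key but may appear in any order.
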
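import json
from itertools import groupby

# Assumption: the JSON is stored in a file named books.json
with open('books.json') as f:
    j = json.load(f)


def getAuthors(book):
    # Sorted author names, so the order in the file does not matter
    authors = book['description'][0]['author']
    return sorted(author['name'] for author in authors)


def transform(books):
    # groupby only merges adjacent items with equal keys, so sort first
    ordered = sorted(books, key=getAuthors)
    groups = [list(group) for _, group in groupby(ordered, key=getAuthors)]
    # Keep the first entry of each group
    return [group[0] for group in groups]


print(transform(j['books']))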
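
The question asks to match on the title as well as the author names. A minimal sketch of that variant follows. It swaps groupby for a plain dict, so the input does not need to be sorted and the original order of the books is kept. The names book_key, group_books and dedupe are invented here for illustration, and like the code above it assumes each book has exactly one description entry.

```python
from collections import defaultdict


def book_key(book):
    # Hypothetical key: the title plus the set of author names.
    # A frozenset ignores the author order and drops the "main" flag.
    names = frozenset(
        author['name'] for author in book['description'][0]['author']
    )
    return (book['story']['title'], names)


def group_books(books):
    # Dicts preserve insertion order, so groups come out in first-seen order.
    groups = defaultdict(list)
    for book in books:
        groups[book_key(book)].append(book)
    return list(groups.values())


def dedupe(books):
    # Keep the first book of every group.
    return [group[0] for group in group_books(books)]
```

Calling dedupe(j['books']) on the sample data should keep id 1 and id 3, since id 2 has the same title and the same set of author names as id 1.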

If we want to get the duplicates instead, we do the same computation but return every sublist with length > 1, since by our definition that is duplicated data.

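def transform(books):
    # Sort first: groupby only groups adjacent items with equal keys
    ordered = sorted(books, key=getAuthors)
    groups = [list(group) for _, group in groupby(ordered, key=getAuthors)]
    return [group for group in groups if len(group) > 1]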

Here j['books'] is the JSON you provided, wrapped in an object.
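
The question also asks for two output files: the deduplicated JSON and a duplicatedID.csv listing the ids of each duplicate group. A rough sketch of that step is below. The function names split_duplicates and write_outputs and the path parameters are assumptions made for this example, and the "1-2" row format is copied from the expected output in the question. The key used is the title plus the set of author names.

```python
import csv
import json
from collections import defaultdict


def split_duplicates(books):
    # Group books by (title, set of author names); order of authors is ignored.
    groups = defaultdict(list)
    for book in books:
        names = frozenset(a['name'] for a in book['description'][0]['author'])
        groups[(book['story']['title'], names)].append(book)
    # First book of every group is kept; groups with more than one
    # member are reported as duplicates.
    kept = [group[0] for group in groups.values()]
    duplicate_ids = [
        [b['id'] for b in group]
        for group in groups.values() if len(group) > 1
    ]
    return kept, duplicate_ids


def write_outputs(books, json_path, csv_path):
    kept, duplicate_ids = split_duplicates(books)
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump({'books': kept}, f, indent=4)
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for ids in duplicate_ids:
            # One row per duplicate group, ids joined with "-"
            writer.writerow(['-'.join(ids)])
    return kept, duplicate_ids
```

For example, write_outputs(j['books'], 'output.json', 'duplicatedID.csv') would write both files in one pass; the output filenames here are placeholders.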