Removing duplicates in a JSON file based on two attributes
I have a JSON file containing nested JSON. I want to remove duplicates based on two keys.
JSON example:
"books": [
  {
    "id": "1",
    "story": {
      "title": "Lonely lion"
    },
    "description": [
      {
        "release": false,
        "author": [
          {
            "name": "John",
            "main": 1
          },
          {
            "name": "Jeroge",
            "main": 0
          },
          {
            "name": "Peter",
            "main": 0
          }
        ]
      }
    ]
  },
  {
    "id": "2",
    "story": {
      "title": "Lonely lion"
    },
    "description": [
      {
        "release": false,
        "author": [
          {
            "name": "Jeroge",
            "main": 1
          },
          {
            "name": "Peter",
            "main": 0
          },
          {
            "name": "John",
            "main": 0
          }
        ]
      }
    ]
  },
  {
    "id": "3",
    "story": {
      "title": "Lonely lion"
    },
    "description": [
      {
        "release": false,
        "author": [
          {
            "name": "John",
            "main": 1
          },
          {
            "name": "Jeroge",
            "main": 0
          }
        ]
      }
    ]
  }
]
Here I am trying to match on the title and the author names. For example, id 1 and id 2 are duplicates: the title is the same and the author names are the same (the order of the authors does not matter, and the main attribute does not need to be considered). So the output JSON should keep only one of id:1 or id:2, together with id:3. In the final output I need two files.
Output_JSON:
"books": [
  {
    "id": "1",
    "story": {
      "title": "Lonely lion"
    },
    "description": [
      {
        "release": false,
        "author": [
          {
            "name": "John",
            "main": 1
          },
          {
            "name": "Jeroge",
            "main": 0
          },
          {
            "name": "Peter",
            "main": 0
          }
        ]
      }
    ]
  },
  {
    "id": "3",
    "story": {
      "title": "Lonely lion"
    },
    "description": [
      {
        "release": false,
        "author": [
          {
            "name": "John",
            "main": 1
          },
          {
            "name": "Jeroge",
            "main": 0
          }
        ]
      }
    ]
  }
]
duplicatedID.csv:
1-2
I tried the following, but it does not give the correct result:
list= []
duplicate_Id = []
for data in (json_data['books'])[:]:
    elements= []
    id = data['id']
    title = data['story']['title']
    elements.append(title)
    for i in (data['description'][0]['author']):
        name = (i['name'])
        elements.append(name)
    if not list:
        list.append(elements)
    else:
        for j in list:
            if set(elements) == set(j):
                duplicate_Id.append(id)
    elements = []
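The general idea is:
- Get the groups identified by some function that collects duplicates.
- Then return the first entry of each group, which ensures there are no duplicates.
- Define the key function as the sorted list of authors, since the author list is by definition the unique key but may appear in any order.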
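import json
from itertools import groupby

# Assumption: the JSON object is stored in a file named books.json.
with open('books.json') as f:
    j = json.load(f)

def getAuthors(book):
    # Sorted author names, so the order of the authors does not affect the key.
    authors = book['description'][0]['author']
    return sorted(author['name'] for author in authors)

def transform(books):
    # groupby only merges adjacent items, so sort by the same key first.
    books = sorted(books, key=getAuthors)
    groups = [list(group) for _, group in groupby(books, key=getAuthors)]
    return [group[0] for group in groups]

print(transform(j['books']))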
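One caveat worth illustrating: `itertools.groupby` only groups consecutive items with equal keys, so duplicates that are not next to each other in the input end up in separate groups unless the input is sorted by the same key first. A minimal sketch on hypothetical toy data (the `items` list is not from the question):

```python
from itertools import groupby

# Hypothetical toy data: "a" appears twice, but not consecutively.
items = ["a", "b", "a"]

# Without sorting, groupby starts a new group whenever the key changes.
unsorted_groups = [(key, len(list(group))) for key, group in groupby(items)]

# After sorting, equal items are adjacent and fall into one group.
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(items))]

print(unsorted_groups)  # [('a', 1), ('b', 1), ('a', 1)]
print(sorted_groups)    # [('a', 2), ('b', 1)]
```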
If we want the duplicates instead, we do the same computation but return every sublist with length > 1, since by our definition those are the duplicated entries.
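def transform(books):
    # Sort first: groupby only groups adjacent items with equal keys.
    books = sorted(books, key=getAuthors)
    groups = [list(group) for _, group in groupby(books, key=getAuthors)]
    return [group for group in groups if len(group) > 1]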
Here j['books'] is the JSON you provided, wrapped in an object.
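Since the question asks to match on both the title and the author names and to produce two files, the idea above could be extended roughly as follows. This is only a sketch under assumptions: the function names `book_key`, `dedupe` and `write_outputs` and the file paths are hypothetical, every book is assumed to have at least one `description` entry, and a dictionary keyed on (title, set of author names) is used in place of `groupby` so that the original order of the books is preserved.

```python
import csv
import json


def book_key(book):
    """Title plus the set of author names (order and 'main' are ignored)."""
    authors = book["description"][0]["author"]
    return (book["story"]["title"], frozenset(a["name"] for a in authors))


def dedupe(books):
    """Return (unique_books, duplicate_id_groups), keeping first occurrences."""
    ids_by_key = {}   # key -> ids of all books sharing that key
    unique = []       # first book seen for each key, in input order
    for book in books:
        key = book_key(book)
        if key not in ids_by_key:
            ids_by_key[key] = []
            unique.append(book)
        ids_by_key[key].append(book["id"])
    duplicates = [ids for ids in ids_by_key.values() if len(ids) > 1]
    return unique, duplicates


def write_outputs(data, json_path, csv_path):
    """Write the deduplicated JSON and one 'id-id' line per duplicate group."""
    unique, duplicates = dedupe(data["books"])
    with open(json_path, "w") as f:
        json.dump({"books": unique}, f, indent=2)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for ids in duplicates:
            writer.writerow(["-".join(ids)])
```

With the sample from the question loaded into `data`, `dedupe(data["books"])` should keep the books with id 1 and id 3 and report the duplicate group `["1", "2"]`, which `write_outputs` would write to the CSV file as `1-2`.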