Splitting a dictionary into a list of dictionaries based on overlapping values
I have a dictionary of chromosome coordinates, as in the example below:
First_dict = {'Key1': ['chr10', 19010495, 19014590, 19014064],
              'Key2': ['chr10', 19010495, 19014658],
              'Key3': ['chr10', 19010502, 19014641],
              'Key4': ['chr10', 37375766, 37377526],
              'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963],
              'Key6': ['chr11', 14806147, 14814006]}
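I would like to create a list of dictionaries in which the keys whose minimum and maximum chromosome coordinates (the dictionary values) overlap by, let's say, at least 1000 are combined into one new dictionary, while the remaining keys become separate dictionaries in the new list.
Ideally it would look like this:
New_list = [
    {'Key1': ['chr10', 19010495, 19014590, 19014064],
     'Key2': ['chr10', 19010495, 19014658],
     'Key3': ['chr10', 19010502, 19014641]},
    {'Key4': ['chr10', 37375766, 37377526]},
    {'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963]},
    {'Key6': ['chr11', 14806147, 14814006]}]
Here Key1, Key2 and Key3 are combined into one new dictionary in New_list because their chromosome coordinates overlap, whereas Key4, Key5 and Key6 are separate dictionaries within New_list because they do not overlap at all.
My initial idea was to split First_dict into a list of dictionaries using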
[{k: v} for (k, v) in First_dict.items()]
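and then to iterate over each dictionary, compare its minimum and maximum values with those of the previous dictionary to check for overlap, and build a new list from that. However, I ran into several problems that I could not solve.
I have also looked at other questions about grouping dictionaries, for example:
Grouping Python dictionary keys as a list and create a new dictionary with this list as a value
But my problem is that my values are not always exactly the same, as the example above shows. I also have to take the chromosome into account when checking for overlap.
Can anyone help with this, or suggest something to try? Many thanks.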
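This problem may be better suited to a graph-based solution, because nothing prevents multiple ranges from overlapping at different intervals.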
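To make the graph idea concrete, here is a minimal sketch (not the approach taken by the script that follows). It treats every key as a node and links two keys when their [min, max] spans lie on the same chromosome and share at least `min_overlap` positions, then reads the groups off with a small union-find. The function name `group_by_overlap`, the parameter `min_overlap`, and this particular reading of "overlap by at least 1000" are assumptions made for illustration.

```python
from itertools import combinations


def group_by_overlap(d, min_overlap=1000):
    """Group keys whose [min, max] spans share >= min_overlap positions
    on the same chromosome (hypothetical helper, pairwise O(n^2))."""
    parent = {k: k for k in d}

    def find(k):
        # Follow parent links to the representative, halving the path.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # (chromosome, lowest coordinate, highest coordinate) per key;
    # assumes every value holds at least one coordinate.
    spans = {k: (v[0], min(v[1:]), max(v[1:])) for k, v in d.items()}

    for a, b in combinations(d, 2):
        chrom_a, lo_a, hi_a = spans[a]
        chrom_b, lo_b, hi_b = spans[b]
        shared = min(hi_a, hi_b) - max(lo_a, lo_b)
        if chrom_a == chrom_b and shared >= min_overlap:
            parent[find(b)] = find(a)  # union the two components

    grouped = {}
    for k in d:
        grouped.setdefault(find(k), {})[k] = d[k]
    return list(grouped.values())


First_dict = {
    'Key1': ['chr10', 19010495, 19014590, 19014064],
    'Key2': ['chr10', 19010495, 19014658],
    'Key3': ['chr10', 19010502, 19014641],
    'Key4': ['chr10', 37375766, 37377526],
    'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963],
    'Key6': ['chr11', 14806147, 14814006],
}
result = group_by_overlap(First_dict)
```

With the example data this puts Key1, Key2 and Key3 in one dictionary and leaves Key4, Key5 and Key6 on their own. The pairwise comparison is quadratic in the number of keys, which is fine for small inputs; for large ones, sorting the spans per chromosome first would be the usual refinement.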
#!/usr/bin/env python3
from pprint import pprint

def mapper(d, overlap=1000):
    """Emit one record per chromosome coordinate in each dictionary value.

    Each record pairs the coordinate with the +/-overlap window in which
    another coordinate may match it:

        ((el-overlap, el+overlap), (dict-key, chromosome), el)
    """
    for key, ch in d.items():
        for el in ch[1:]:  # ch[0] is the chromosome label
            yield ((el-overlap, el+overlap), (key, ch[0]), el)

def sorted_mapper(d, overlap=1000):
    """Yield the mapper records sorted by their (low, high) window."""
    yield from sorted(mapper(d, overlap), key=lambda x: x[0])

def groups(iter_):
    """Collect consecutive records whose coordinate lies inside the
    window of the first record of the current group."""
    iter_ = iter(iter_)
    previous = next(iter_, None)
    if previous is None:  # nothing to group
        return
    retval = [previous]
    for chrm in iter_:
        if previous[0][0] <= chrm[-1] <= previous[0][1]:
            retval.append(chrm)
        else:
            yield retval
            previous = chrm
            retval = [previous]
    yield retval

def reduce_phase1(iter_):
    """Turn each group into {key: [chromosome, coord, ...]}, dropping the windows."""
    for l in iter_:
        retval = {}
        for (minc, maxc), (key, lbl), chrm in l:
            x = retval.get(key, [lbl])
            x.append(chrm)
            retval[key] = x
        yield retval

def update_dict(d1, d2):
    """Append d2's coordinates to the matching keys of d1 (in place)."""
    retval = d1
    for key, value in d2.items():
        if key in d1:
            retval[key].extend(value[1:])  # skip the chromosome label
    return retval

def reduce_phase2(iter_):
    """Merge partial dictionaries whose keys are a subset of an earlier entry."""
    iter_ = iter(iter_)
    first = next(iter_, None)
    if first is None:  # no partial dictionaries at all
        return []
    retval = [first]
    retval_keys = [set(first)]
    for d in iter_:
        keyset = set(d)
        isnew = True
        for i, e in enumerate(retval_keys):
            if keyset <= e:
                isnew = False
                retval[i] = update_dict(retval[i], d)
        if isnew:
            retval.append(d)
            retval_keys.append(keyset)
    return retval

First_dict = {'Key1': ['chr10', 19010495, 19014590, 19014064],
              'Key2': ['chr10', 19010495, 19014658],
              'Key3': ['chr10', 19010502, 19014641],
              'Key4': ['chr10', 37375766, 37377526],
              'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963],
              'Key6': ['chr11', 14806147, 14814006]}

# The grouping asked for in the question, kept here for reference only.
New_list = [
    {
        "Key1": ['chr10', 19010495, 19014590, 19014064],
        "Key2": ['chr10', 19010495, 19014658],
        "Key3": ['chr10', 19010502, 19014641]
    },
    {"Key4": ['chr10', 37375766, 37377526]},
    {"Key5": ['chr10', 76310389, 76315990, 76312224, 76312963]},
    {"Key6": ['chr11', 14806147, 14814006]}
]

pprint(First_dict)
print('-'*40)
# sorted_mapper is the sorting generator defined above
p1 = reduce_phase1(groups(sorted_mapper(First_dict)))
p2 = reduce_phase2(p1)
pprint(p2)
Output
{'Key1': ['chr10', 19010495, 19014590, 19014064],
'Key2': ['chr10', 19010495, 19014658],
'Key3': ['chr10', 19010502, 19014641],
'Key4': ['chr10', 37375766, 37377526],
'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963],
'Key6': ['chr11', 14806147, 14814006]}
----------------------------------------
[{'Key6': ['chr11', 14806147, 14814006]},
{'Key1': ['chr10', 19010495, 19014064, 19014590],
'Key2': ['chr10', 19010495, 19014658],
'Key3': ['chr10', 19010502, 19014641]},
{'Key4': ['chr10', 37375766, 37377526]},
{'Key5': ['chr10', 76310389, 76312224, 76312963, 76315990]}]
TLDR;
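Mapper output
The mapper emits one record for each dictionary key and chromosome element. Each record carries an associated range within which other elements can match it.
((el-1000, el+1000), (dict-key, chromosome), el)
(el-1000, el+1000) is the range within which any other chromosome element can match.
(dict-key, chromosome) identifies the original dictionary key and the chromosome this element came from.
el is a single chromosome coordinate.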
((19009495, 19011495), ('Key1', 'chr10'), 19010495)
((19013590, 19015590), ('Key1', 'chr10'), 19014590)
((19013064, 19015064), ('Key1', 'chr10'), 19014064)
((19009495, 19011495), ('Key2', 'chr10'), 19010495)
((19013658, 19015658), ('Key2', 'chr10'), 19014658)
((19009502, 19011502), ('Key3', 'chr10'), 19010502)
((19013641, 19015641), ('Key3', 'chr10'), 19014641)
((37374766, 37376766), ('Key4', 'chr10'), 37375766)
((37376526, 37378526), ('Key4', 'chr10'), 37377526)
((76309389, 76311389), ('Key5', 'chr10'), 76310389)
((76314990, 76316990), ('Key5', 'chr10'), 76315990)
((76311224, 76313224), ('Key5', 'chr10'), 76312224)
((76311963, 76313963), ('Key5', 'chr10'), 76312963)
((14805147, 14807147), ('Key6', 'chr11'), 14806147)
((14813006, 14815006), ('Key6', 'chr11'), 14814006)
Note: the mapper output is not sorted.
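As a quick check of the record layout, the following sketch runs a stand-alone copy of the mapper on a single entry taken from the example data. The names `sample` and `records` are only for illustration.

```python
def mapper(d, overlap=1000):
    # One record per coordinate: (window, (key, chromosome), coordinate)
    for key, ch in d.items():
        for el in ch[1:]:  # ch[0] is the chromosome label
            yield ((el - overlap, el + overlap), (key, ch[0]), el)


sample = {'Key4': ['chr10', 37375766, 37377526]}
records = list(mapper(sample))
for record in records:
    print(record)
# ((37374766, 37376766), ('Key4', 'chr10'), 37375766)
# ((37376526, 37378526), ('Key4', 'chr10'), 37377526)
```

The two windows do not contain each other's coordinate, which is why Key4 ends up in two separate groups further down before the second reduce phase joins them again.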
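Sort
We need to sort the transformed data using (el-1000, el+1000) as the key.
This lets us check whether the next value falls within the range of the previous one. Because the keys are in sorted order, we can chain together values that lie within the specified overlap range.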
((14805147, 14807147), ('Key6', 'chr11'), 14806147)
((14813006, 14815006), ('Key6', 'chr11'), 14814006)
((19009495, 19011495), ('Key1', 'chr10'), 19010495)
((19009495, 19011495), ('Key2', 'chr10'), 19010495)
((19009502, 19011502), ('Key3', 'chr10'), 19010502)
((19013064, 19015064), ('Key1', 'chr10'), 19014064)
((19013590, 19015590), ('Key1', 'chr10'), 19014590)
((19013641, 19015641), ('Key3', 'chr10'), 19014641)
((19013658, 19015658), ('Key2', 'chr10'), 19014658)
((37374766, 37376766), ('Key4', 'chr10'), 37375766)
((37376526, 37378526), ('Key4', 'chr10'), 37377526)
((76309389, 76311389), ('Key5', 'chr10'), 76310389)
((76311224, 76313224), ('Key5', 'chr10'), 76312224)
((76311963, 76313963), ('Key5', 'chr10'), 76312963)
((76314990, 76316990), ('Key5', 'chr10'), 76315990)
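One thing worth noticing in the sorted listing is that the sort key is the numeric window alone, so the chr11 records come first simply because their coordinates are smaller. If coordinates on different chromosomes could fall close together in your data, a possible variation (an assumption on my part, not something the script does) is to sort by chromosome first and window second:

```python
records = [
    ((76309389, 76311389), ('Key5', 'chr10'), 76310389),
    ((14805147, 14807147), ('Key6', 'chr11'), 14806147),
    ((19009495, 19011495), ('Key1', 'chr10'), 19010495),
]

# Key used in the walkthrough: the (low, high) window only.
by_range = sorted(records, key=lambda r: r[0])

# Variation: chromosome label first, then the window.
by_chrom_then_range = sorted(records, key=lambda r: (r[1][1], r[0]))

print([r[1][0] for r in by_range])             # ['Key6', 'Key1', 'Key5']
print([r[1][0] for r in by_chrom_then_range])  # ['Key1', 'Key5', 'Key6']
```

The chromosome labels are compared as plain strings here, so 'chr10' sorts before 'chr2'. That does not matter for grouping, which only needs records of the same chromosome to sit next to each other, but the grouping step would also have to compare chromosome labels for the variation to have any effect.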
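Group
Group the values that fall within the specified overlap range.
Each resulting list holds the records whose coordinate falls
within the range of the first record in that list.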
[((14805147, 14807147), ('Key6', 'chr11'), 14806147)]
----------------------------------------
[((14813006, 14815006), ('Key6', 'chr11'), 14814006)]
----------------------------------------
[((19009495, 19011495), ('Key1', 'chr10'), 19010495),
((19009495, 19011495), ('Key2', 'chr10'), 19010495),
((19009502, 19011502), ('Key3', 'chr10'), 19010502)]
----------------------------------------
[((19013064, 19015064), ('Key1', 'chr10'), 19014064),
((19013590, 19015590), ('Key1', 'chr10'), 19014590),
((19013641, 19015641), ('Key3', 'chr10'), 19014641),
((19013658, 19015658), ('Key2', 'chr10'), 19014658)]
----------------------------------------
[((37374766, 37376766), ('Key4', 'chr10'), 37375766)]
----------------------------------------
[((37376526, 37378526), ('Key4', 'chr10'), 37377526)]
----------------------------------------
[((76309389, 76311389), ('Key5', 'chr10'), 76310389)]
----------------------------------------
[((76311224, 76313224), ('Key5', 'chr10'), 76312224),
((76311963, 76313963), ('Key5', 'chr10'), 76312963)]
----------------------------------------
[((76314990, 76316990), ('Key5', 'chr10'), 76315990)]
----------------------------------------
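The grouping step compares every record against the window of the first record in the current group, not against its immediate predecessor. A small sketch with made-up coordinates shows the consequence; the key 'K' and chromosome 'chr1' are placeholders, and `groups` here is a stand-alone copy of the same logic.

```python
def groups(records):
    records = iter(records)
    first = next(records, None)
    if first is None:  # empty input: no groups
        return
    group = [first]
    for rec in records:
        low, high = first[0]  # window of the group's first record
        if low <= rec[-1] <= high:
            group.append(rec)
        else:
            yield group
            first = rec
            group = [first]
    yield group


# Three coordinates 900 apart, each with a +/-1000 window, already sorted.
records = [((c - 1000, c + 1000), ('K', 'chr1'), c) for c in (5000, 5900, 6800)]
grouped = [[rec[-1] for rec in g] for g in groups(records)]
print(grouped)  # [[5000, 5900], [6800]]
```

6800 is within 1000 of 5900 but outside the window of 5000, so it starts a new group. Whether that anchoring or a chained comparison is what you want depends on how "overlap" should be read for your data.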
Reduce - Phase 1
Clean up the data by removing the engineered features (the ranges added by the mapper).
{'Key6': ['chr11', 14806147]}
----------------------------------------
{'Key6': ['chr11', 14814006]}
----------------------------------------
{'Key1': ['chr10', 19010495],
'Key2': ['chr10', 19010495],
'Key3': ['chr10', 19010502]}
----------------------------------------
{'Key1': ['chr10', 19014064, 19014590],
'Key2': ['chr10', 19014658],
'Key3': ['chr10', 19014641]}
----------------------------------------
{'Key4': ['chr10', 37375766]}
----------------------------------------
{'Key4': ['chr10', 37377526]}
----------------------------------------
{'Key5': ['chr10', 76310389]}
----------------------------------------
{'Key5': ['chr10', 76312224, 76312963]}
----------------------------------------
{'Key5': ['chr10', 76315990]}
----------------------------------------
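For a single group, phase 1 amounts to collecting the coordinates per key and throwing the windows away. A compact sketch of that step, using `dict.setdefault` in place of the get/append/assign sequence and one group copied from the listing above (the name `drop_windows` is hypothetical):

```python
def drop_windows(group):
    """Collapse one group of records into {key: [chromosome, coord, ...]}."""
    out = {}
    for _window, (key, chrom), coord in group:
        # Start each key's list with its chromosome label, then add coords.
        out.setdefault(key, [chrom]).append(coord)
    return out


group = [
    ((19013064, 19015064), ('Key1', 'chr10'), 19014064),
    ((19013590, 19015590), ('Key1', 'chr10'), 19014590),
    ((19013641, 19015641), ('Key3', 'chr10'), 19014641),
    ((19013658, 19015658), ('Key2', 'chr10'), 19014658),
]
partial = drop_windows(group)
```

Here `partial` maps Key1 to ['chr10', 19014064, 19014590], Key2 to ['chr10', 19014658] and Key3 to ['chr10', 19014641], the same content as the fourth dictionary in the phase 1 listing.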
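Reduce - Phase 2
Match the partial dictionaries back to their original dictionary keys,
appending the corresponding chromosome values
wherever the dictionary keys match.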
{'Key6': ['chr11', 14806147, 14814006]}
----------------------------------------
{'Key1': ['chr10', 19010495, 19014064, 19014590],
'Key2': ['chr10', 19010495, 19014658],
'Key3': ['chr10', 19010502, 19014641]}
----------------------------------------
{'Key4': ['chr10', 37375766, 37377526]}
----------------------------------------
{'Key5': ['chr10', 76310389, 76312224, 76312963, 76315990]}
----------------------------------------
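The merge rule in phase 2 can be tried in isolation. The sketch below is a simplified variant under my own assumptions: it stops at the first earlier entry whose keys contain all keys of the incoming partial dictionary, and it copies the lists so the inputs are not modified. The name `merge_partials` is hypothetical.

```python
def merge_partials(partials):
    merged = []
    for part in partials:
        for entry in merged:
            if set(part) <= set(entry):  # all keys already known to this entry
                for key, values in part.items():
                    entry[key].extend(values[1:])  # skip the chromosome label
                break
        else:
            # No earlier entry covers these keys: start a new one.
            merged.append({k: list(v) for k, v in part.items()})
    return merged


partials = [
    {'Key4': ['chr10', 37375766]},
    {'Key4': ['chr10', 37377526]},
    {'Key5': ['chr10', 76310389]},
    {'Key5': ['chr10', 76312224, 76312963]},
    {'Key5': ['chr10', 76315990]},
]
result = merge_partials(partials)
print(result)
# [{'Key4': ['chr10', 37375766, 37377526]},
#  {'Key5': ['chr10', 76310389, 76312224, 76312963, 76315990]}]
```

Note that the subset test is one-directional: a partial dictionary that brings in a key not seen before starts a new entry even if it shares other keys with an existing one, so it is worth checking the result on your real data.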