Create a nested statistics dictionary from a list of nested dictionaries
I have a list containing many nested dictionaries. Each dictionary represents a Windows OS and looks like this:
windows1 = {"version": "windows 10",
            "installed apps": {"chrome": "installed",
                               "python": {"python version": "2.7",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}
windows2 = {"version": "windows XP",
            "installed apps": {"chrome": "not installed",
                               "python": {"python version": "not installed",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}
My goal is to create one final nested dictionary that stores statistics about the list, like this:
stats_dic = {"version": {"windows 10": 20,
                         "windows 7": 4,
                         "windows XP": 11},
             "installed apps": {"chrome": {"installed": 12,
                                           "not installed": 6},
                                "python": {"python version": {"2.7": 4, "3.6": 8, "3.7": 2}},
                                "minecraft": {"installed": 15,
                                              "not installed": 2}}}
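As you can see, I'm trying to take all the values from every windows dictionary in the list (except the python folder) and use them as keys in the final nested statistics dictionary. The values of those keys are their counts, and they must keep the same nesting as before.
After some reading, I understand this can be done with a recursive function. I've tried several without success; the closest I got (it doesn't handle the python folder) is: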
stats_dic = {}
windows_list = [s1, s2.....]

def update_recursive(s, d):
    for k, v in s.iteritems():
        if isinstance(v, dict):
            update_recursive(v, d)
        else:
            if v in d.keys():
                d[v] += 1
            else:
                d.update({v: 1})
    return d

for window in windows_list:
    stats_dic = update_recursive(window, stats_dic)
For windows1 and windows2 this gives me:
{'windows XP': 1, 'windows 10': 1, '2.7': 1, 'not installed': 2, 'c:\python27': 1, 'installed': 1}
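As you can see, it doesn't keep the nested form, and it also merges identical values (the 'not installed' of chrome and minecraft).
Everything else I tried either didn't increment the counters or only kept the nesting one level deep. I know I'm not close, but what am I missing?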
Here is a recursive function that does what I think you want it to do.
from pprint import pp  # Skip if you're not running Python >= 3.8

def combiner(inp, d=None):
    if d is None:
        d = {}
    for key, value in inp.items():
        if isinstance(value, str):
            # Leaf value: count it under its own key
            x = d.setdefault(key, {})
            x.setdefault(value, 0)
            x[value] += 1
        elif isinstance(value, dict):
            # Nested dictionary: recurse into the matching sub-dictionary
            x = d.setdefault(key, {})
            combiner(value, x)
        else:
            raise TypeError("Unexpected type '{}' for 'value'".format(type(value)))
    return d
windows1 = {"version": "windows 10",
            "installed apps": {"chrome": "installed",
                               "python": {"python version": "2.7",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}
windows2 = {"version": "windows XP",
            "installed apps": {"chrome": "not installed",
                               "python": {"python version": "not installed",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}
windowsList = [windows1, windows2]

x = {}
for comp in windowsList:
    combiner(comp, x)
pp(x)  # Use print if you're not running Python >= 3.8
Output:
{'version': {'windows 10': 1, 'windows XP': 1},
 'installed apps': {'chrome': {'installed': 1, 'not installed': 1},
                    'python': {'python version': {'2.7': 1, 'not installed': 1},
                               'folder': {'c:\\python27': 2}},
                    'minecraft': {'not installed': 2}}}
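The output above still counts the python folder, which the question wanted left out. One way to handle that might be to skip selected keys during the recursion. The following is only a sketch under that assumption: the function name `combine_counts` and the `skip` parameter are hypothetical and not part of the answer above.

```python
def combine_counts(inp, stats=None, skip=("folder",)):
    """Count string values per key, keeping the nesting; keys listed in `skip` are ignored."""
    if stats is None:
        stats = {}
    for key, value in inp.items():
        if key in skip:
            # Hypothetical exclusion rule: drop the key at any nesting level
            continue
        if isinstance(value, dict):
            # Recurse into the matching sub-dictionary of the statistics
            combine_counts(value, stats.setdefault(key, {}), skip)
        elif isinstance(value, str):
            bucket = stats.setdefault(key, {})
            bucket[value] = bucket.get(value, 0) + 1
        else:
            raise TypeError("Unexpected type '{}' for 'value'".format(type(value)))
    return stats


machines = [
    {"version": "windows 10",
     "installed apps": {"chrome": "installed",
                        "python": {"python version": "2.7",
                                   "folder": r"c:\python27"},
                        "minecraft": "not installed"}},
    {"version": "windows XP",
     "installed apps": {"chrome": "not installed",
                        "python": {"python version": "not installed",
                                   "folder": r"c:\python27"},
                        "minecraft": "not installed"}},
]

result = {}
for machine in machines:
    combine_counts(machine, result)
print(result)
```

Matching on the bare key name is the simplest rule; if a key called "folder" may legitimately appear elsewhere, the check would need to look at the full path instead.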
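Here is another solution to your request.
The answer has three parts:
- Flatten the input dictionaries
- Create a table (pandas DataFrame)
- Compute the statistics and build the output
(To see the full code without the explanatory steps, scroll to the very bottom.)
Explanation
What does flattening the input dictionaries mean? Simply put, a flat dictionary is one that is not nested, so all of its key-value pairs sit on a single level.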
# Flat dictionary vs. nested dictionary
flat = {'a':1, 'b':2, 'c':3}
nested = {'a':1, 'b':{'c':2, 'd':3}} # 'b' has another dictionary as value
1.
# Flatten input dictionaries
# The following function returns a one-dimensional dictionary in which
# the former nested structure is still recognizable in its keys,
# in the form parent.child.subchild...
# Note: it modifies the dictionary that is passed in.
def flatten(dic):
    # Look for the first value that is itself a dictionary
    update = False
    for key, val in dic.items():
        if isinstance(val, dict):
            update = True
            break
    if update:
        # Replace that entry by its children, keyed as 'parent.child', then repeat
        val_key_tree = dict([(f'{key}.{k}', v) for k, v in val.items()])
        dic.update(val_key_tree); dic.pop(key); flatten(dic)
    return dic
# Example
windows1 = {"version": "windows 10",
            "installed apps": {"chrome": "installed",
                               "python": {"python version": "2.7",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}
flatten(windows1)
>>> {'version': 'windows 10',
     'installed apps.chrome': 'installed',
     'installed apps.minecraft': 'not installed',
     'installed apps.python.python version': '2.7',
     'installed apps.python.folder': 'c:\\python27'}
Keeping the nested structure in the keys will come in handy later, when recreating the structure of the original dictionaries.
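Note that flatten changes the dictionary it receives, so the nested original is gone after the call. If the input should stay intact, a non-mutating variant could look like the sketch below. The name `flatten_copy` and its `parent` parameter are hypothetical, and the key order of the result can differ from the output shown above.

```python
def flatten_copy(dic, parent=""):
    """Return a new flat dictionary with 'parent.child' keys; the input is left unchanged."""
    flat = {}
    for key, val in dic.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(val, dict):
            # Merge the flattened children under the extended path
            flat.update(flatten_copy(val, path))
        else:
            flat[path] = val
    return flat


windows1 = {"version": "windows 10",
            "installed apps": {"chrome": "installed",
                               "python": {"python version": "2.7",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}

flat = flatten_copy(windows1)
print(flat)
print(windows1["installed apps"]["chrome"])  # the nested original is still usable
```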
2.
# Create table (pandas DataFrame)
# With one-dimensional dictionaries, it is easy to create a pandas DataFrame where each row represents a dictionary
import pandas as pd

# Input
windows1 = {"version": "windows 10",
            "installed apps": {"chrome": "installed",
                               "python": {"python version": "2.7",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}
windows2 = {"version": "windows XP",
            "installed apps": {"chrome": "not installed",
                               "python": {"python version": "not installed",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}
dics = [windows1, windows2]

# Create DataFrame
frames = [pd.DataFrame(flatten(dic), index=[0]) for dic in dics]
df = pd.concat(frames, ignore_index=True)
df  # one row per input dictionary, one column per flattened key
3.
# Statistics
# Thanks to the DataFrame it is relatively simple to count how many times a value appears within a column
import re

for c in df.columns:  # iterate over dataframe columns
    if 'folder' in c:  # exclude certain columns (for example when 'folder' appears in the column name)
        continue
    uniques = df[c].unique()  # all different values from a column
    # Count how many times a value appears per column
    counts = {}
    for u in uniques:
        tmp_u = re.escape(u)  # escape regex metacharacters such as the backslash and '.'
        counts[u] = int(df[c].str.count('^' + tmp_u + '$').sum())  # exact matches, counted with the column's str.count method
    print(counts)  # output from variable counts for each iteration
>>> {'windows 10': 1, 'windows XP': 1}
    {'installed': 1, 'not installed': 1}
    {'not installed': 2}
    {'2.7': 1, 'not installed': 1}
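As a cross-check that needs neither pandas nor regular expressions, the same per-column counts can probably be obtained with collections.Counter applied directly to the flattened dictionaries. This is a sketch, not the method used in this answer: the helper name `count_columns` and its `exclude` parameter are hypothetical, and the input is assumed to be already flattened as in part 1.

```python
from collections import Counter


def count_columns(flat_dicts, exclude="folder"):
    """Count values per flattened key, skipping keys that contain `exclude`."""
    counts = {}
    for flat in flat_dicts:
        for column, value in flat.items():
            if exclude in column:
                continue
            counts.setdefault(column, Counter())[value] += 1
    # Convert the Counter objects to plain dictionaries for printing
    return {column: dict(counter) for column, counter in counts.items()}


flat_dicts = [
    {"version": "windows 10",
     "installed apps.chrome": "installed",
     "installed apps.minecraft": "not installed",
     "installed apps.python.python version": "2.7",
     "installed apps.python.folder": r"c:\python27"},
    {"version": "windows XP",
     "installed apps.chrome": "not installed",
     "installed apps.minecraft": "not installed",
     "installed apps.python.python version": "not installed",
     "installed apps.python.folder": r"c:\python27"},
]

column_counts = count_columns(flat_dicts)
for column, values in column_counts.items():
    print(column, values)
```

Counting by equality rather than by pattern also avoids any question of how a value such as '2.7' is interpreted as a regular expression.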
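From here we can recreate the structure of the original dictionaries, because the DataFrame column names carry those references.
The following function creates a nested dictionary whose structure mirrors the original dictionaries and holds the statistics computed above: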
# Recreate structured dictionary
def build_nested(struct, tree, res):
    # Split off the first path segment: 'parent.child.sub' -> ['parent', 'child.sub']
    tree_split = tree.split('.', 1)
    if len(tree_split) < 2:
        # Last segment: store the counts here
        struct.setdefault(tree_split[0], {}).update(res)
    else:
        # Descend into (or create) the sub-dictionary for the first segment
        build_nested(struct.setdefault(tree_split[0], {}), tree_split[1], res)
    return struct
So instead of printing during each iteration as in part 3 above, we can pass the computed counts to the build_nested function above:
# Statistics
import re

stats = {}
for c in df.columns:
    if 'folder' in c:
        continue
    uniques = df[c].unique()
    # Count how many times a value appears per column
    counts = {}
    for u in uniques:
        tmp_u = re.escape(u)  # escape regex metacharacters such as the backslash and '.'
        counts[u] = int(df[c].str.count('^' + tmp_u + '$').sum())
    # Recreate the structure of nested dictionary
    build_nested(stats, c, counts)
stats
>>> {'version': {'windows 10': 1, 'windows XP': 1},
     'installed apps': {'chrome': {'installed': 1, 'not installed': 1},
                        'minecraft': {'not installed': 2},
                        'python': {'python version': {'2.7': 1, 'not installed': 1}}}}
Full code
# Whole process put together
import json
import re

import pandas as pd

# Helper functions
def flatten(dic):
    # Look for the first value that is itself a dictionary
    update = False
    for key, val in dic.items():
        if isinstance(val, dict):
            update = True
            break
    if update:
        # Replace that entry by its children, keyed as 'parent.child', then repeat
        val_key_tree = dict([(f'{key}.{k}', v) for k, v in val.items()])
        dic.update(val_key_tree); dic.pop(key); flatten(dic)
    return dic

def build_nested(struct, tree, res):
    # Split off the first path segment: 'parent.child.sub' -> ['parent', 'child.sub']
    tree_split = tree.split('.', 1)
    if len(tree_split) < 2:
        # Last segment: store the counts here
        struct.setdefault(tree_split[0], {}).update(res)
    else:
        # Descend into (or create) the sub-dictionary for the first segment
        build_nested(struct.setdefault(tree_split[0], {}), tree_split[1], res)
    return struct

# 1. & 2. Flatten input dictionaries and create table (pandas DataFrame)
windows1 = {"version": "windows 10",
            "installed apps": {"chrome": "installed",
                               "python": {"python version": "2.7",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}
windows2 = {"version": "windows XP",
            "installed apps": {"chrome": "not installed",
                               "python": {"python version": "not installed",
                                          "folder": r"c:\python27"},
                               "minecraft": "not installed"}}
dics = [windows1, windows2]
frames = [pd.DataFrame(flatten(dic), index=[0]) for dic in dics]
df = pd.concat(frames, ignore_index=True)

# 3. Recreate nested dictionary with statistics
stats = {}
for c in df.columns:
    if 'folder' in c:
        continue
    uniques = df[c].unique()
    # Count how many times a value appears per column
    counts = {}
    for u in uniques:
        tmp_u = re.escape(u)  # escape regex metacharacters such as the backslash and '.'
        counts[u] = int(df[c].str.count('^' + tmp_u + '$').sum())
    # Recreate the structure of nested dictionary
    build_nested(stats, c, counts)
print(json.dumps(stats, indent=5))
>>>
{
     "version": {
          "windows 10": 1,
          "windows XP": 1
     },
     "installed apps": {
          "chrome": {
               "installed": 1,
               "not installed": 1
          },
          "minecraft": {
               "not installed": 2
          },
          "python": {
               "python version": {
                    "2.7": 1,
                    "not installed": 1
               }
          }
     }
}