How to group data by labels and return a list of calculated values?
Suppose I have two lists with the same number of elements. The first contains only floats, the second contains string labels. For example:
[1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
[ "ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]
Suppose I also have an ordered list of the unique labels:
["ABC", "LMN", "XYZ"]
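I want to write the most efficient Python code that:
- groups the values by label
- applies a specific function to those values (e.g. sum, mean, standard deviation)
- returns a list of the computed values in the same order as the list of labels.
For example, if the function is sum, I expect a list of three values:
[sum(1.98, 9.35, 6.23), sum(5.56, 7.49), sum(4.34, 2.54, 8.31)]
If the function is mean, I expect a list of three values:
[mean(1.98, 9.35, 6.23), mean(5.56, 7.49), mean(4.34, 2.54, 8.31)]
Any hints?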
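First, reshape the data with a dictionary so that the values are grouped under each key. Since Python 3.7, dictionaries are guaranteed to preserve insertion order, so building the dictionary from the ordered label list fixes the order of the results.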
values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
keys = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]
order = ["ABC", "LMN", "XYZ"]
out = {k:[] for k in order}
for key, value in zip(keys, values):
    out[key].append(value)
Output:
>>> out
{'ABC': [1.98, 9.35, 6.23], 'LMN': [5.56, 7.49], 'XYZ': [4.34, 2.54, 8.31]}
Then apply whatever transformation you want:
# sum
[round(sum(v),2) for v in out.values()]
#[17.56, 13.05, 15.19]
# mean
from statistics import mean
[round(mean(v),2) for v in out.values()]
# [5.85, 6.53, 5.06]
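The two steps above (group, then apply) can be wrapped in one function that returns the results in the order of the label list, which is what the question asks for. This is only a minimal sketch: the name `aggregate_by_label` is hypothetical, and it assumes every key in `keys` also appears in `order` (otherwise a `KeyError` is raised).

```python
from statistics import mean


def aggregate_by_label(values, keys, order, func):
    """Group values by key, then apply func to each group, following `order`."""
    # Pre-create one empty list per label so the result order is fixed by `order`.
    groups = {label: [] for label in order}
    # Single pass over the paired data.
    for key, value in zip(keys, values):
        groups[key].append(value)
    return [func(groups[label]) for label in order]


values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
keys = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]
order = ["ABC", "LMN", "XYZ"]

sums = aggregate_by_label(values, keys, order, sum)
means = aggregate_by_label(values, keys, order, mean)

print([round(s, 2) for s in sums])  # [17.56, 13.05, 15.19]
print(means)
```

Any callable that accepts a list works as `func`, for example `max`, `min` or `statistics.stdev`.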
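Order of the initial keys
If needed, you can also keep the keys in the order in which they first appear in the keys list, without needing an explicit list of labels: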
from collections import defaultdict
values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
keys = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]
out = defaultdict(list)
for key, value in zip(keys, values):
    out[key].append(value)
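To make the "order of first appearance" behaviour visible, here is a small sketch that runs the same loop on the question's data reversed (the reversal is only for illustration, it is not part of the original answer). The groups then come out as LMN, XYZ, ABC, because that is the order in which the labels are first seen.

```python
from collections import defaultdict

# The question's data, reversed so that first-appearance order differs
# from alphabetical order (illustrative assumption).
values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49][::-1]
keys = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"][::-1]

out = defaultdict(list)
for key, value in zip(keys, values):
    out[key].append(value)

first_seen = list(out)  # keys in insertion (first-appearance) order
sums = [round(sum(v), 2) for v in out.values()]

print(first_seen)  # ['LMN', 'XYZ', 'ABC']
print(sums)        # [13.05, 15.19, 17.56]
```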
data = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
keys = [ "ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]
labels = ["ABC", "LMN", "XYZ"]
func = max
result = [func([data[ind] for ind in [i for i, x in enumerate(keys) if x == label]]) for label in labels]
print(result)
You can change the function to whatever you need. My code chains the loops into a single line.
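This is a common use case: one obtains a list of labels for a list/array of values from some model's predictions (e.g. when using the predict function in sklearn) and wants to see, for each distinct label, which values the model assigned to that label.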
values = [ 1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49 ]
labels = [ "ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]
Group the values by label
from collections import defaultdict
d = defaultdict(list)
for k, v in zip(labels, values):
    d[k].append(v)
d
# defaultdict(list,
# {'ABC': [1.98, 9.35, 6.23],
# 'LMN': [5.56, 7.49],
# 'XYZ': [4.34, 2.54, 8.31]})
Display the grouped values by sorted label
for label in sorted(d):
    print('{} {}'.format(label, d[label]))
# ABC [1.98, 9.35, 6.23]
# LMN [5.56, 7.49]
# XYZ [4.34, 2.54, 8.31]
Sum the values by label
for label in d:
    print('{} {:.4f}'.format(label, sum(d[label])))
# ABC 17.5600
# LMN 13.0500
# XYZ 15.1900
Compute the mean by label
d_avg = {k: sum(d[k])/len(d[k]) for k in d}
for label in d_avg:
    print('{} {:.4f}'.format(label, d_avg[label]))
# ABC 5.8533
# LMN 6.5250
# XYZ 5.0633
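The question also mentions the standard deviation. Following the same pattern, a possible sketch uses `statistics.stdev` from the standard library (the sample standard deviation; `statistics.pstdev` would give the population version). The name `d_std` is just an illustrative choice, and note that `stdev` raises `StatisticsError` for a group with fewer than two values.

```python
from collections import defaultdict
from statistics import stdev

values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
labels = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]

# Group the values by label, as above.
d = defaultdict(list)
for k, v in zip(labels, values):
    d[k].append(v)

# Sample standard deviation per label; each group needs at least two values.
d_std = {k: stdev(v) for k, v in d.items()}
for label in d_std:
    print('{} {:.4f}'.format(label, d_std[label]))
```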