How to group data by labels and return a list of calculated values?

Suppose I have two lists with the same number of elements. The first contains only floats; the second contains string labels. For example:

[1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
[ "ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]

Suppose I also have an ordered list of the unique labels:

["ABC", "LMN", "XYZ"]

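I want to write the most efficient Python code that:

  1. groups the values by label,
  2. applies a given function to each group (e.g. sum, mean, standard deviation),
  3. returns a list of the calculated values in the same order as the label list.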

For example, if the function is sum, I expect a list of three values:

[sum([1.98, 9.35, 6.23]), sum([5.56, 7.49]), sum([4.34, 2.54, 8.31])]

If the function is mean, I expect a list of three values:

[mean([1.98, 9.35, 6.23]), mean([5.56, 7.49]), mean([4.34, 2.54, 8.31])]

Any hints?

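First, reshape the data with a dictionary so that the values are grouped under each key. Since Python 3.7, dictionaries are guaranteed to preserve insertion order.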


values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
keys = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]

order = ["ABC", "LMN", "XYZ"]

out = {k:[] for k in order}

for key, value in zip(keys, values):
    out[key].append(value)

Output:

>>> out
{'ABC': [1.98, 9.35, 6.23], 'LMN': [5.56, 7.49], 'XYZ': [4.34, 2.54, 8.31]}

Then apply whatever transformation you want:

# sum
[round(sum(v),2) for v in out.values()]
# [17.56, 13.05, 15.19]

# mean
from statistics import mean
[round(mean(v),2) for v in out.values()]
# [5.85, 6.53, 5.06]
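
The two steps above can be wrapped in one small helper, which is a sketch of what the question asks for rather than a definitive implementation. The name aggregate_by_label is made up for this illustration, and it assumes every key appears in the order list.

```python
from statistics import mean


def aggregate_by_label(values, keys, order, func):
    """Group values by key, then apply func to each group, following `order`.

    Hypothetical helper for illustration; assumes every key is listed in `order`.
    """
    # Pre-create one empty group per label so the dict follows `order`.
    groups = {label: [] for label in order}
    for key, value in zip(keys, values):
        groups[key].append(value)  # raises KeyError for a key not in `order`
    return [func(groups[label]) for label in order]


values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
keys = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]
order = ["ABC", "LMN", "XYZ"]

sums = aggregate_by_label(values, keys, order, sum)
means = aggregate_by_label(values, keys, order, mean)
maxima = aggregate_by_label(values, keys, order, max)
print(maxima)  # [9.35, 7.49, 8.31]
```

Each call makes a single pass over the data, so the cost grows with the number of values rather than with values times labels.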
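Order of the initial keys

If needed, you can also keep the keys in the order in which they first appear in the keys list, without an explicit list of ordered keys: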

from collections import defaultdict

values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
keys = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]

out = defaultdict(list)
for key, value in zip(keys, values):
    out[key].append(value)
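
To make the ordering visible, here is a small sketch with the labels deliberately shuffled. The reordered keys list is a hypothetical input, not the data from the question.

```python
from collections import defaultdict

values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
# Hypothetical reordering of the labels: "XYZ" now appears first.
keys = ["XYZ", "ABC", "LMN", "ABC", "ABC", "XYZ", "XYZ", "LMN"]

out = defaultdict(list)
for key, value in zip(keys, values):
    out[key].append(value)

# Keys come back in the order they were first seen, not sorted.
first_seen = list(out)
print(first_seen)  # ['XYZ', 'ABC', 'LMN']

# The computed values follow that same first-seen order.
maxima = [max(group) for group in out.values()]
print(maxima)  # [8.31, 9.35, 7.49]
```

If the result must follow a specific label order instead, index the dictionary with that list, for example [max(out[k]) for k in order].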
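
data = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
keys = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]
labels = ["ABC", "LMN", "XYZ"]
func = max  # any function that takes a list of values, e.g. sum or statistics.mean
# For each label, collect the values whose key matches, then apply func.
result = [func([v for k, v in zip(keys, data) if k == label]) for label in labels]
print(result)  # [9.35, 7.49, 8.31]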

You can change the function as needed. My code combines the loops into a single line.
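
One thing to keep in mind: the one-liner scans the whole keys list once per label. As a possible alternative that stays compact, the sketch below sorts the pairs once and uses itertools.groupby. It assumes every label in labels occurs at least once in keys.

```python
from itertools import groupby
from operator import itemgetter

data = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
keys = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]
labels = ["ABC", "LMN", "XYZ"]
func = max

# groupby only merges adjacent items, so sort the (key, value) pairs by key first.
# sorted() is stable, so values keep their original relative order within a key.
pairs = sorted(zip(keys, data), key=itemgetter(0))

# Apply func to the values of each run of equal keys.
grouped = {
    key: func([value for _, value in group])
    for key, group in groupby(pairs, key=itemgetter(0))
}

# Read the results back in the requested label order
# (raises KeyError if a label never occurs in keys).
result = [grouped[label] for label in labels]
print(result)  # [9.35, 7.49, 8.31]
```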

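This is a common use case: you get a list of labels for a list or array of values from a model's predictions (for example, from the predict function in sklearn) and want to see, for each distinct label, the values the model assigned to it.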

values = [ 1.98,   5.56,  4.34,  9.35,  6.23,  2.54,  8.31, 7.49 ]
labels = [ "ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]

Group the values by label

from collections import defaultdict
d = defaultdict(list)
for k,v in zip(labels, values):
    d[k].append(v)

d
# defaultdict(list,
#             {'ABC': [1.98, 9.35, 6.23],
#              'LMN': [5.56, 7.49],
#              'XYZ': [4.34, 2.54, 8.31]})

Display the grouped values by sorted label

for label in sorted(d):
    print('{} {}'.format(label, d[label]))
# ABC [1.98, 9.35, 6.23]
# LMN [5.56, 7.49]
# XYZ [4.34, 2.54, 8.31]

Sum the values by label

for label in d:
    print('{} {:.4f}'.format(label, sum(d[label])))
# ABC 17.5600
# LMN 13.0500
# XYZ 15.1900

Compute the mean by label

d_avg = {k: sum(d[k])/len(d[k]) for k in d}
for label in d_avg:
    print('{} {:.4f}'.format(label, d_avg[label]))
# ABC 5.8533
# LMN 6.5250
# XYZ 5.0633
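
Standard deviation by label

The question also mentions standard deviation, which follows the same pattern. This is a sketch that assumes the sample standard deviation (statistics.stdev) is the one wanted; use statistics.pstdev for the population version.

```python
from collections import defaultdict
from statistics import stdev

values = [1.98, 5.56, 4.34, 9.35, 6.23, 2.54, 8.31, 7.49]
labels = ["ABC", "LMN", "XYZ", "ABC", "ABC", "XYZ", "XYZ", "LMN"]

d = defaultdict(list)
for k, v in zip(labels, values):
    d[k].append(v)

# stdev() raises StatisticsError for a group with fewer than two values.
d_std = {k: stdev(d[k]) for k in d}
for label in d_std:
    print('{} {:.4f}'.format(label, d_std[label]))
```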