Remove duplicates in one list and average corresponding list entries of another list

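I have a very large dataset, so I need to write something efficient. My data holds the album release years of various artists in one list and the average song length of each album in another list.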

Here is some made-up example data. Song lengths are in minutes.

release_year=[2017,2017,2019,2020,2020,2021]
avg_songlength=[3,5,3,4,2,3]

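I want a dataset in which the duplicates in the release_year list are removed and, for each duplicated year, the corresponding song lengths are averaged. So the result I want is: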

years_without_duplicates=[2017,2019,2020,2021]
avg_length_of_year=[(3+5)/2,3,(4+2)/2,3]

I found that set() removes duplicates efficiently, but I don't know how to combine the matching entries from the other list. Is there a simple way to do this?

One option is to use itertools.groupby:

release_year=[2017,2017,2019,2020,2020,2021] 
avg_songlength=[3,5,3,4,2,3]

from itertools import groupby
from statistics import mean

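# Sort the (year, length) pairs so that equal years are adjacent,
# then average the lengths within each group.
pairs = sorted(zip(release_year, avg_songlength))
years_without_duplicates, avg_length_of_year = zip(*(
    (year, mean(length for _, length in group))
    for year, group in groupby(pairs, key=lambda pair: pair[0])
))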

years_without_duplicates, avg_length_of_year
# ((2017, 2019, 2020, 2021), (4, 3, 3, 3))
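
Note that groupby only merges adjacent items with equal keys, which is why the pairs are sorted first. A small sketch of the difference, using deliberately unsorted made-up years (the variable names are illustrative only):

```python
from itertools import groupby

years = [2020, 2017, 2020, 2017]  # made-up, deliberately unsorted

# Without sorting, equal years that are not adjacent land in separate groups.
unsorted_keys = [key for key, _ in groupby(years)]

# After sorting, every year appears exactly once.
sorted_keys = [key for key, _ in groupby(sorted(years))]

print(unsorted_keys)  # [2020, 2017, 2020, 2017]
print(sorted_keys)    # [2017, 2020]
```

The sort costs O(n log n), so for very large inputs that are not already ordered by year, a dictionary-based pass may be the cheaper choice.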

Or use collections.defaultdict:

from collections import defaultdict

out = defaultdict(lambda : [0, 0]) # sum / count

for year, sl in zip(release_year, avg_songlength):
    out[year][0] += sl  # add length
    out[year][1] += 1   # increment counter of occurrences 
    
d = {k: v[0]/v[1] for k,v in out.items()} # avg = sum / count
years_without_duplicates, avg_length_of_year = zip(*d.items())
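
Because a dict keeps insertion order, the years come out in first-seen order rather than sorted. If the input is not already ordered by year, sorting the items before splitting them is one option. A self-contained sketch, assuming shuffled made-up data (the names `totals`, `years` and `averages` are illustrative, not part of the answer above):

```python
from collections import defaultdict

release_year = [2020, 2017, 2021, 2017, 2019, 2020]  # shuffled sample data
avg_songlength = [4, 3, 3, 5, 3, 2]

totals = defaultdict(lambda: [0, 0])  # year -> [sum of lengths, count]
for year, length in zip(release_year, avg_songlength):
    totals[year][0] += length
    totals[year][1] += 1

# Sort by year, then split back into two parallel tuples.
years, averages = zip(*sorted((y, s / n) for y, (s, n) in totals.items()))

print(years)     # (2017, 2019, 2020, 2021)
print(averages)  # (4.0, 3.0, 3.0, 3.0)
```

Only the unique years are sorted here, which is usually far fewer items than the full input.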

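Here is a simple way to solve this in base Python. The idea is to store the years we have seen in a dictionary and, for each year, keep track of the running total of song length and the number of entries that contribute to that total. At the end we iterate over the dictionary and convert each entry into an average. Using a dictionary also keeps the data more structured than two separate lists.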

release_year=[2017,2017,2019,2020,2020,2021]
avg_songlength=[3,5,3,4,2,3]

year_averages = dict()
for year, length in zip(release_year, avg_songlength):
    if year in year_averages:
        year_averages[year][0] += length
        year_averages[year][1] += 1
    else:
        year_averages[year] = [length, 1]

year_averages = {year: lst[0]/lst[1] for year, lst in year_averages.items()}
print(year_averages)

Output:

{2017: 4.0, 2019: 3.0, 2020: 3.0, 2021: 3.0}

Convert to a Pandas DataFrame and use np.mean as the aggregation function:

import pandas as pd
import numpy as np

df = pd.DataFrame({"release_year":[2017,2017,2019,2020,2020,2021],"avg_song_length":[3,5,3,4,2,3]})

print(df)

print(df.groupby("release_year",as_index=False).agg(avg_length_of_year=("avg_song_length",np.mean)))

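Here is a simple approach that uses one dictionary to store the sum of the values for each year and another dictionary to count how many values were added.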

avg_dict = {}
count_dict = {}

for i in range(0, len(release_year)):
    if str(release_year[i]) in avg_dict:
        avg_dict[str(release_year[i])] = avg_dict[str(release_year[i])] + avg_songlength[i]
        count_dict[str(release_year[i])] = count_dict[str(release_year[i])] + 1
    else:
        avg_dict[str(release_year[i])] = avg_songlength[i]
        count_dict[str(release_year[i])] = 1

for key in avg_dict:
    avg_dict[key] = avg_dict[key] / count_dict[key]

print(avg_dict) # {'2017': 4.0, '2019': 3.0, '2020': 3.0, '2021': 3.0}
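
The keys in the result above are strings because of the str() calls. If integer keys are preferred, the same two-dictionary idea can be written without the conversion. A minimal sketch using dict.get (the names `sum_dict` and `avg_by_year` are illustrative, not from the answer):

```python
release_year = [2017, 2017, 2019, 2020, 2020, 2021]
avg_songlength = [3, 5, 3, 4, 2, 3]

sum_dict = {}    # year -> sum of lengths seen so far
count_dict = {}  # year -> number of entries seen so far
for year, length in zip(release_year, avg_songlength):
    sum_dict[year] = sum_dict.get(year, 0) + length
    count_dict[year] = count_dict.get(year, 0) + 1

avg_by_year = {year: sum_dict[year] / count_dict[year] for year in sum_dict}
print(avg_by_year)  # {2017: 4.0, 2019: 3.0, 2020: 3.0, 2021: 3.0}
```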