Using setdefault in Python 3.6 to display (name, ID, and frequency count) using info from two different files

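I am trying to read two .dat files and write a program that uses the values of aid2name as keys in a dictionary whose values are the keys and values of aid2numplays. The goal is to produce a result of the form (artist name, artist ID, play frequency). Note that the first file provides the artist name and artist ID, while the second provides the user ID, artist ID, and play frequency for each user. Any ideas on how to aggregate these per-user frequencies and then display them in (artist name, artist ID, play frequency) format? Here is what I have managed so far: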

import codecs
aid2name = {}
d2 = {}
fp = codecs.open("artists.dat", encoding = "utf-8")
fp.readline()  #skip first line of headers
for line in fp:
    line = line.strip()
    fields = line.split('\t')
    aid = int(fields[0])
    name = fields[1]
    aid2name = {int(aid), name}
    d2.setdefault(fields[1], {})
    #print (aid2name)
# do other processing
    #print(dictionary)

aid2numplays = {}
fp = codecs.open("user_artists.dat", encoding = "utf-8")
fp.readline()  #skip first line of headers
for line in fp:
    line = line.strip()
    fields = line.split('\t')
    uid = int(fields[0])
    aid = int(fields[1])
    weight = int(fields[2])
    aid2numplays = [int(aid), int(weight)]
    #print(aid2numplays)
    #print(uid, aid, weight)

for (d2.fields[1], value) in d2:
    group = d2.setdefault(d2.fields[1], {}) # key might exist already
    group.append(aid2numplays)

print(group)

Edit: Regarding the use of setdefault, if you want to group the user data by artistID, you can do this:

grouped_data = {}
for u in users:
    k, v = u[1], {'userID': u[0], 'weight': u[2]}
    grouped_data.setdefault(k, []).append(v)

This is essentially the same as writing:

grouped_data = {}
for u in users:
    k, v = u[1], {'userID': u[0], 'weight': u[2]}
    if k in grouped_data:
        grouped_data[k].append(v)
    else:
        grouped_data[k] = [v]
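
As a quick check of the grouping idiom, here is a minimal sketch on a few sample rows in the `[userID, artistID, weight]` layout. Apart from the first row, the values are made up for illustration. It also shows `collections.defaultdict`, a standard-library alternative that was not part of the suggestion above.

```python
from collections import defaultdict

# Hypothetical sample rows: [userID, artistID, weight]
users = [
    ['2', '51', '13883'],
    ['3', '51', '200'],
    ['3', '52', '75'],
]

# Grouping with dict.setdefault: create the list on first sight of a key
grouped_data = {}
for u in users:
    k, v = u[1], {'userID': u[0], 'weight': u[2]}
    grouped_data.setdefault(k, []).append(v)

# Same grouping with defaultdict(list): missing keys start as empty lists
grouped_dd = defaultdict(list)
for u in users:
    grouped_dd[u[1]].append({'userID': u[0], 'weight': u[2]})

print(grouped_data['51'])
```

Both approaches build the same mapping; `setdefault` keeps a plain `dict`, while `defaultdict` avoids repeating the default at every call site.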

As an example of how to count the number of times an artist appears in different users' data, you can read the data into lists of lists:

import codecs

with codecs.open("artists.dat", encoding = "utf-8") as f:
    artists = f.readlines()

with codecs.open("user_artists.dat", encoding = "utf-8") as f:
    users = f.readlines()

artists = [x.strip().split('\t') for x in artists][1:]  # [['1', 'MALICE MIZER', ..
users = [x.strip().split('\t') for x in users][1:]  # [['2', '51', '13883'], ..]
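
If you would rather not split the lines by hand, the standard `csv` module can parse tab-separated data. The sketch below uses an in-memory `io.StringIO` stand-in instead of the real file so that it runs on its own; the header names and the second data row are assumptions.

```python
import csv
import io

# Stand-in for open("user_artists.dat", encoding="utf-8", newline="");
# the header names and second row are assumed for illustration
sample = io.StringIO("userID\tartistID\tweight\n2\t51\t13883\n2\t52\t11690\n")

reader = csv.reader(sample, delimiter='\t')
next(reader)          # skip the header row
users = list(reader)  # each row is a list of strings: [userID, artistID, weight]
print(users)
```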

Iterate over the artists to build a dictionary keyed by artistID, adding a placeholder for the play statistics:

data = {}
for a in artists:
    artistID, name = a[0], a[1]
    data[artistID] = {'name': name, 'plays': 0}

Iterate over the users, updating the dictionary for each row:

for u in users:
    artistID = u[1]
    data[artistID]['plays'] += 1

The resulting contents of data:

{'1': {'name': 'MALICE MIZER', 'plays': 3},
 '2': {'name': 'Diary of Dreams', 'plays': 12},
 '3': {'name': 'Carpathian Forest', 'plays': 3},  ..}
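
Note that the loop above counts how many user rows mention each artist. If what you want is the total play frequency from the question, that is, the sum of the `weight` column per artist, something along these lines might work. The sample rows are made up for illustration, and silently skipping artist IDs that are missing from `artists.dat` is an assumption about how you want to handle them.

```python
# Hypothetical sample rows standing in for the parsed files
artists = [['1', 'MALICE MIZER'], ['2', 'Diary of Dreams']]
users = [['2', '1', '100'], ['3', '1', '50'], ['3', '2', '7'], ['4', '99', '5']]

# artistID -> artist name
aid2name = {a[0]: a[1] for a in artists}

# artistID -> weight summed across all users
aid2numplays = {}
for userID, artistID, weight in users:
    aid2numplays[artistID] = aid2numplays.setdefault(artistID, 0) + int(weight)

# (artist name, artist ID, play frequency), skipping unknown artist IDs
result = [(aid2name[aid], aid, plays)
          for aid, plays in aid2numplays.items()
          if aid in aid2name]

for row in result:
    print(row)
```

Here `setdefault(artistID, 0)` seeds the running total the first time an artist is seen; `aid2numplays.get(artistID, 0)` would do the same job.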

Edit: To iterate over the user data and build a dictionary of all the artists associated with each user, we can do this:

# here artists and users are the raw lines returned by readlines()
artist_list = [x.strip().split('\t') for x in artists][1:]
user_stats_list = [x.strip().split('\t') for x in users][1:]

artists = {}
for a in artist_list:
    artistID, name = a[0], a[1]
    artists[artistID] = name

grouped_user_stats = {}
for u in user_stats_list:
    userID, artistID, weight = u
    if userID not in grouped_user_stats:
        grouped_user_stats[userID] = { artistID: {'name': artists[artistID], 'plays': 1} }
    else:
        if artistID not in grouped_user_stats[userID]:
            grouped_user_stats[userID][artistID] = {'name': artists[artistID], 'plays': 1}
        else:
            grouped_user_stats[userID][artistID]['plays'] += 1
            print('this never happens')
            # it looks like the same artist is never listed twice for the same user

Output:

{'2': {'100': {'name': 'ABC', 'plays': 1},
       '51': {'name': 'Duran Duran', 'plays': 1},
       '52': {'name': 'Morcheeba', 'plays': 1},
       '53': {'name': 'Air', 'plays': 1}, .. }, 
 ..
}
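
Since the question is about setdefault, the nested if/else above can also be written with two chained `setdefault` calls. This is only a sketch on a small sample: the IDs and names mirror the output above, but the weights and the second user are invented.

```python
# Hypothetical sample data: artistID -> name, and [userID, artistID, weight] rows
artists = {'51': 'Duran Duran', '52': 'Morcheeba'}
user_stats_list = [['2', '51', '13883'], ['2', '52', '11690'], ['3', '51', '10']]

grouped_user_stats = {}
for userID, artistID, weight in user_stats_list:
    # First level: one inner dict per user
    per_user = grouped_user_stats.setdefault(userID, {})
    # Second level: one entry per artist, starting at zero plays
    entry = per_user.setdefault(artistID, {'name': artists[artistID], 'plays': 0})
    entry['plays'] += 1

print(grouped_user_stats)
```

One caveat: the default passed to `setdefault` is evaluated on every call, so `artists[artistID]` still raises `KeyError` for an artist ID that is missing from the artists file.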