Sort and iterate over the first letters of a large list

I have an input list of strings, each made up of comma-separated parts, like this:

list_to_split = ['flyer, black and white', 'flyer, blue', 'fly-swatter, black', 'helmet, heavy',
'armlet, silver and gold', 'cherry, black', 'violin, very old', 'concrete, grey']

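I want to iterate over the items that start with the same letter and use them to fill an empty dictionary, so that the desired output looks like this: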

letter_ordered_dict = {'a': ['armlet'], 'c': ['cherry', 'concrete'], 'f': ['flyer', 'fly-swatter'],
'h': ['helmet'], 'v': ['violin']}

My first attempt was to use a list comprehension to take the first element of each item in the original list:

list_split_by_first_element = [elm.split(',')[0] for elm in list_to_split]
list_split_by_first_element.sort()

This produces the output:

['armlet', 'cherry', 'concrete', 'fly-swatter', 'flyer', 'flyer', 'helmet', 'violin']

The part I am stuck on is how to group these elements by their first letter and skip duplicates to produce the output above.

Is there a better way?

This should do the job:

import itertools
tmp = sorted(e.split(',')[0] for e in list_to_split) # list_split_by_first_element
letter_ordered_dict = {k:list(set(v)) for k,v in itertools.groupby(tmp, lambda item: item[0])}

The result in letter_ordered_dict (the order of words within each list may vary, because set is unordered):

{'a': ['armlet'],
 'c': ['concrete', 'cherry'],
 'f': ['fly-swatter', 'flyer'],
 'h': ['helmet'],
 'v': ['violin']}
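
As a hedged variation on the groupby answer above, here is a minimal sketch that removes duplicates before sorting, so the words in each list come out in a predictable order. The names unique_words and grouped are made up for this illustration and are not from the original post.

```python
import itertools

list_to_split = ['flyer, black and white', 'flyer, blue', 'fly-swatter, black', 'helmet, heavy',
                 'armlet, silver and gold', 'cherry, black', 'violin, very old', 'concrete, grey']

# Deduplicate with a set comprehension first, then sort,
# so groupby sees each word exactly once and in order.
unique_words = sorted({phrase.split(',')[0] for phrase in list_to_split})

# groupby only merges *adjacent* items that share a key, hence the sort above.
grouped = {
    letter: list(words)
    for letter, words in itertools.groupby(unique_words, key=lambda w: w[0])
}
print(grouped)
# {'a': ['armlet'], 'c': ['cherry', 'concrete'], 'f': ['fly-swatter', 'flyer'], 'h': ['helmet'], 'v': ['violin']}
```

Note that 'fly-swatter' sorts before 'flyer' because '-' has a lower code point than 'e'.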

So I wrote my code starting from the point where you already have list_of_words:
list_of_words = ['armlet', 'cherry', 'concrete', 'flyer', 'flyer', 'fly-swatter', 'helmet', 'violin']

list_of_words = list(set(list_of_words))

first_char_dict = dict()

for word in list_of_words:
    if word[0] in first_char_dict:
        first_char_dict[word[0]].append(word)
    else:
        first_char_dict[word[0]] = [word]
        
print(first_char_dict)

Output: {'h': ['helmet'], 'v': ['violin'], 'f': ['fly-swatter', 'flyer'], 'c': ['cherry', 'concrete'], 'a': ['armlet']}

I would point out, though, that you keep only one word when you split each string. Is that what you need?

For this kind of scanning/aggregation problem, you usually want a loop rather than a list comprehension. Assuming list_split_by_first_element is sorted, this should work:

letter_ordered_dict = dict()
prev_word = ''
for word in list_split_by_first_element:
    if word == prev_word:
        # skip repeated words (duplicates are adjacent because the list is sorted)
        continue
    prev_word = word
    letter = word[0]
    letter_ordered_dict.setdefault(letter, []).append(word)

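Note that dict.setdefault returns the value for the key if it exists; otherwise it inserts the key with the given default and returns that default, which is exactly what you want here.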
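
To make the setdefault behaviour concrete, here is a small standalone sketch. The variable names groups, first and second are chosen only for this illustration.

```python
groups = {}

# Key missing: setdefault stores the default list and returns that same list.
first = groups.setdefault('f', [])
first.append('flyer')

# Key present: the new default is ignored and the stored list is returned.
second = groups.setdefault('f', [])
second.append('fly-swatter')

print(groups)           # {'f': ['flyer', 'fly-swatter']}
print(first is second)  # True
```

Because both calls return the same list object, appending through either name updates the dictionary entry.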

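For very long lists, or lists with many repeated words, you may find it faster to sort the per-letter sub-lists than to sort the whole list. Something like this would then work: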

list_to_split = ['flyer, black and white', 'flyer, blue', 'fly-swatter, black', 'helmet, heavy',
'armlet, silver and gold', 'cherry, black', 'violin, very old', 'concrete, grey']

set_dict = dict()
for phrase in list_to_split:
    word, rest = phrase.split(',', 1)
    set_dict.setdefault(word[0], set()).add(word)
letter_ordered_dict = {
    letter: sorted(words)
    for letter, words in set_dict.items()
}

If you do not need each letter's sub-list to be sorted internally, you can save some time by replacing sorted with list in the second example.
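
One possible way to package the set-based approach, as a sketch rather than a definitive implementation: the function name group_by_first_letter and its sort_words flag are hypothetical, and collections.defaultdict is swapped in for dict.setdefault. It also assumes you want surrounding whitespace stripped and empty entries skipped, which the original code does not do.

```python
from collections import defaultdict

def group_by_first_letter(phrases, sort_words=True):
    """Group the part before the first comma of each phrase by its first letter."""
    sets_by_letter = defaultdict(set)
    for phrase in phrases:
        # Taking index 0 also works for phrases that contain no comma at all.
        word = phrase.split(',', 1)[0].strip()
        if word:  # skip empty entries, which have no first letter
            sets_by_letter[word[0]].add(word)
    # sorted gives ordered sub-lists; list skips the sort when order is irrelevant.
    convert = sorted if sort_words else list
    return {letter: convert(words) for letter, words in sets_by_letter.items()}

phrases = ['flyer, black and white', 'flyer, blue', 'fly-swatter, black', 'helmet, heavy']
result = group_by_first_letter(phrases)
print(result)  # {'f': ['fly-swatter', 'flyer'], 'h': ['helmet']}
```

Passing sort_words=False returns the same groups, with the words inside each list in arbitrary order.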