Permutations with very large list

Question

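I am trying to run a very large permutation using Python. The goal is to combine items into groups of four or fewer, joined by 1) periods, 2) dashes, and 3) no separator at all. Order matters.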

# input
food = ['', 'apple', 'banana', 'bread', 'tomato', 'yogurt', ...]

# ideal output would be a list that contains strings like the following:
apple-banana-bread (no dashes before or after!)
apple.banana.bread (using periods)
applebananabread (no spaces)
apple-banana (by combining with the first item in the list, I also get shorter groups but need to delete empty items before joining)
... for all the possible groups of 4, order is important

# Requirements:
# Avoiding a symbol at the beginning or end of a resulting string
# Also creating groups of length 1, 2, and 3

I created an itertools.chain (perms) using itertools.permutations. However, this fails with a MemoryError when I remove the empty elements after converting to a list, even on a machine with a lot of RAM.

import itertools

food = ['', 'apple', 'banana', 'bread', 'tomato', 'yogurt', ...]
perms = itertools.permutations(food, 4)
perms = [list(filter(None, tup)) for tup in perms]     # remove empty nested elements, to prevent two symbols in a row or a symbol before/after
perms = list(filter(None, perms))                      # remove empty lists; kept as a list because it is iterated three times below

names_t = (
    ['.'.join(group) for group in perms] +     # join using periods
    ['-'.join(group) for group in perms] +     # join using dashes
    [''.join(group) for group in perms]        # join without spaces
)

names_t = list(set(names_t))                   # remove all duplicates

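How can I make this code more memory-efficient so that it doesn't crash on large lists? If needed, I can run the code separately for each item separator (comma, period, directly joined).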

Answer 1

Caveat: I'm not sure what you are going to do with a saved list of 6B things, but I think you have 2 strategies if you want to go forward.
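
To get a feel for how fast the result grows, here is a rough counting sketch. The function name count_names and the sample sizes are my own, not from the question, and the result is only an upper bound because it ignores strings that happen to collide after joining.

```python
from math import perm  # available in Python 3.8+

def count_names(n_items, max_len=4, n_separators=3):
    """Upper bound on the number of joined strings (hypothetical helper)."""
    # Groups of length 1 look the same for every separator, so count them once.
    singles = perm(n_items, 1)
    # Longer groups produce one string per separator.
    longer = sum(perm(n_items, k) for k in range(2, max_len + 1))
    return singles + n_separators * longer

print(count_names(5))    # 605 for a 5-item list
print(count_names(200))  # already in the billions for an assumed 200-item list
```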

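First, you could reduce the size of the things in the list by substituting something like a numpy uint8 for each item. That would shrink the resulting list, but you wouldn't have the format you want.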

In [15]: import sys                                                             

In [16]: import numpy as np                                                     

In [17]: list_of_strings = ['dog food'] * 1000000                               

In [18]: list_of_uint8s = np.ones(1000000, dtype=np.uint8)                      

In [19]: sys.getsizeof(list_of_strings)                                         
Out[19]: 8000056

In [20]: sys.getsizeof(list_of_uint8s)                                          
Out[20]: 1000096
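
A minimal sketch of the same idea applied to permutations, assuming the word list has at most 256 entries so each index fits in one byte. I am using the standard-library array module here instead of numpy to keep it dependency-free; the names words, k, encoded and decode are made up for illustration.

```python
from array import array
from itertools import permutations

words = ['apple', 'banana', 'bread']  # hypothetical sample; at most 256 items
k = 2                                 # fixed group length for this sketch

# Store each permutation as k one-byte indices in a flat unsigned-byte array.
encoded = array('B')
for idx_group in permutations(range(len(words)), k):
    encoded.extend(idx_group)

def decode(position, sep='-'):
    """Rebuild the string for the permutation stored at `position`."""
    start = position * k
    return sep.join(words[i] for i in encoded[start:start + k])

print(len(encoded))  # k bytes per stored permutation
print(decode(0))     # apple-banana
```

This costs k bytes per permutation instead of a full string, but with billions of permutations it can still be too large for RAM, and you only get the final strings back by decoding.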

Second, if you just want to "save" the items to some kind of large file, you don't need to build the list in memory at all. Just use itertools.permutations and write the objects to the file on the fly.

In [48]: from itertools import permutations                                     

In [49]: stuff = ['dog', 'cat', 'mouse']                                        

In [50]: perms = permutations(stuff, 2)                                         

In [51]: with open('output.csv', 'w') as tgt: 
    ...:     for p in perms: 
    ...:         line = '-'.join(p) 
    ...:         tgt.write(line) 
    ...:         tgt.write('\n') 
    ...:                                                                        

In [52]: %more output.csv                                                       
dog-cat
dog-mouse
cat-dog
cat-mouse
mouse-dog
mouse-cat
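
Applied to the question, a streaming version might look like the sketch below. It loops over group sizes 1 to 4 directly, so the empty-string placeholder in the list is not needed, and it never holds more than one name in memory. The names iter_names and write_names and the output path are my own assumptions, and nothing here removes duplicates.

```python
from itertools import permutations

def iter_names(items, max_len=4, separators=('.', '-', '')):
    """Lazily yield every joined name, one at a time (hypothetical helper)."""
    for size in range(1, max_len + 1):
        for group in permutations(items, size):
            if size == 1:
                # A single item looks the same for every separator.
                yield group[0]
                continue
            for sep in separators:
                yield sep.join(group)

def write_names(items, path, max_len=4):
    """Stream the names to a text file, one per line; return how many were written."""
    count = 0
    with open(path, 'w') as tgt:
        for name in iter_names(items, max_len):
            tgt.write(name + '\n')
            count += 1
    return count

if __name__ == '__main__':
    food = ['apple', 'banana', 'bread']   # small sample list without the '' entry
    print(write_names(food, 'names.txt'))
```

Because the names are never collected into a set, any duplicates (for example two different groups that concatenate to the same string) end up in the file. If that matters, they could be removed afterwards with a disk-based tool such as sort -u rather than in Python memory.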
