Permutations with a very large list
I am trying to run a very large permutation in Python. The goal is to combine four or fewer items, joined by 1) periods, 2) dashes, or 3) nothing at all. Order matters.
# input
food = ['', 'apple', 'banana', 'bread', 'tomato', 'yogurt', ...]
# ideal output would be a list that contains strings like the following:
apple-banana-bread (no dashes before or after!)
apple.banana.bread (using periods)
applebananabread (no spaces)
apple-banana (by combining with the first item in the list, I also get shorter groups but need to delete empty items before joining)
... for all the possible groups of 4, order is important
# Requirements:
# Avoiding a symbol at the beginning or end of a resulting string
# Also creating groups of length 1, 2, and 3
I used itertools.permutations to create an itertools.chain (perms). However, removing the empty elements after converting it to a list fails with a MemoryError, even on a machine with a lot of RAM.
import itertools

food = ['', 'apple', 'banana', 'bread', 'tomato', 'yogurt', ...]
perms = itertools.permutations(food, 4)
perms = [list(filter(None, tup)) for tup in perms]  # remove empty nested elements, to prevent two symbols in a row or a symbol before/after
perms = list(filter(None, perms))  # remove empty lists; list() so the result can be iterated three times below
names_t = (
    ['.'.join(group) for group in perms] +  # join using periods
    ['-'.join(group) for group in perms] +  # join using dashes
    [''.join(group) for group in perms]  # join without any separator
)
names_t = list(set(names_t))  # remove all duplicates
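How can I make this code more memory efficient so that it does not crash on large lists? If needed, I can run the code separately for each item separator (comma, period, direct concatenation).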
I'm not quite sure what you are going to do with a saved list of 6B things, but if you want to press ahead, I think you have 2 strategies.
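First, you can reduce the size of the things in the list by substituting something like a numpy uint8 for each item. This will reduce the size of the resulting list, but you won't have the format you want.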
In [15]: import sys
In [16]: import numpy as np
In [17]: list_of_strings = ['dog food'] * 1000000
In [18]: list_of_uint8s = np.ones(1000000, dtype=np.uint8)
In [19]: sys.getsizeof(list_of_strings)
Out[19]: 8000056
In [20]: sys.getsizeof(list_of_uint8s)
Out[20]: 1000096
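To make the first strategy more concrete, here is a minimal sketch, assuming each item can be represented by its index in the input list and that the list has at most 256 entries (the range of uint8). The short `food` list, the group size `k = 3`, and the `decode` helper are hypothetical and only for illustration. The array is filled straight from the iterator, so no intermediate Python list of tuples is built.

```python
from itertools import chain, permutations
from math import perm  # Python 3.8+

import numpy as np

# Hypothetical short input; uint8 indices only work for up to 256 items.
food = ['apple', 'banana', 'bread', 'tomato', 'yogurt']
k = 3  # group size used for this illustration

n = len(food)
# Flatten the index permutations and fill the array directly from the iterator.
flat = np.fromiter(
    chain.from_iterable(permutations(range(n), k)),
    dtype=np.uint8,
    count=perm(n, k) * k,
)
index_perms = flat.reshape(-1, k)  # one row per ordered group, one byte per item


def decode(row, sep='-'):
    """Turn one row of indices back into a joined string (hypothetical helper)."""
    return sep.join(food[i] for i in row)
```

Each group then costs `k` bytes, and a string is only built when a row is decoded, for example `decode(index_perms[0], '.')`.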
Second, if you just want to "save" the items to some kind of large file, you don't need to materialize the list in memory. Just use itertools.permutations and write the objects to the file on the fly.
In [48]: from itertools import permutations
In [49]: stuff = ['dog', 'cat', 'mouse']
In [50]: perms = permutations(stuff, 2)
In [51]: with open('output.csv', 'w') as tgt:
    ...:     for p in perms:
    ...:         line = '-'.join(p)
    ...:         tgt.write(line)
    ...:         tgt.write('\n')
    ...:
In [52]: %more output.csv
dog-cat
dog-mouse
cat-dog
cat-mouse
mouse-dog
mouse-cat
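The same streaming idea can be extended to the requirements in the question. The following is a sketch under some assumptions, not a definitive implementation: the names `iter_names` and `write_names` are made up here, the empty-string placeholder is dropped and the shorter groups are generated directly with group sizes 1 to 4, and duplicates are not removed (joining without a separator can produce the same string from different groups, so deduplication would have to happen in a separate step outside memory).

```python
from itertools import permutations


def iter_names(items, max_len=4, seps=('.', '-', '')):
    """Lazily yield every ordered group of 1..max_len items, joined by each separator."""
    items = [s for s in items if s]  # drop the '' placeholder; shorter groups come from k below
    for k in range(1, max_len + 1):
        for group in permutations(items, k):
            if k == 1:
                # A single item looks the same for every separator, so emit it once.
                yield group[0]
                continue
            for sep in seps:
                yield sep.join(group)  # the separator never lands at the start or end


def write_names(items, path, max_len=4):
    """Stream all names to a file, one per line, and return how many were written."""
    count = 0
    with open(path, 'w') as tgt:
        for name in iter_names(items, max_len):
            tgt.write(name + '\n')
            count += 1
    return count
```

Because `iter_names` is a generator, only one string exists in memory at a time, so the size of the input list affects the running time and the size of the output file rather than RAM usage.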