确定有序子列表是否在大型列表列表中的最快方法?
Fastest way to determine if an ordered sublist is in a large lists of lists?
假设我有一个 my_huge_list_of_lists,其中包含 2,000,000 个列表,每个列表的长度约为 50。
我想通过丢弃序列中不包含两个元素的子列表来缩短 2,000,000 my_huge_list_of_lists
。
到目前为止我有:
#
def check_if_list_is_sublist(lst, sublst):
#checks if a list appears in order in another larger list.
n = len(sublst)
return any((sublst == lst[i:i + n]) for i in xrange(len(lst) - n + 1))
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [a,b])]
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [b,a])]
搜索词 [a,b] 或 [b,a] 的连续性很重要,所以我不能使用 set.issubset()
。
我觉得这很慢。我想加快速度。我考虑了一些选项,例如使用 'early exit' 和语句:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if (a in x and not check_if_list_is_sublist(x, [a,b]))]
在 for
循环中使用 or
语句的次数更少:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not (check_if_list_is_sublist(x, [a,b])
or check_if_list_is_sublist(x, [b,a]))]
并且还在努力加速功能(WIP)
#
def check_if_list_is_sublist(lst, sublst):
checks if a list appears in order in another larger list.
set_of_sublists = {tuple(sublst) for sublist in lst}
并在 Stack Overflow 上进行了一些搜索;但是想不出办法因为check_if_list_is_sublist()
被调用的次数是len(my_huge_list) * 2
.
编辑:根据要求添加一些用户数据
from random import randint
from string import ascii_lowercase
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(2000000)]
my_neighbor_search_fwd = [i,c]
my_neighbor_search_rev = my_neighbor_search_fwd.reverse()
将 n-sized 子序列中的项目解包到 n 个变量中。然后写一个列表理解来过滤列表,在 sub-list.e.g.
中检查 a, b 或 b, a
a, b = sublst
def checklst(lst, a, b):
l = len(lst)
start = 0
while True:
try:
a_index = lst.index(a, start)
except ValueError:
return False
try:
return a_index > -1 and lst[a_index+1] == b
except IndexError:
try:
return a_index > -1 and lst[a_index-1] == b
except IndexError:
start = a_index + 1
if start == l:
return False
continue # keep looking at the next a
%timeit found = [l for l in lst if checklst(l, a, b)]
1.88 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit found = [x for x in lst if (a in x and not check_if_list_is_sublist(x, [a,b]))]
22.1 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
所以,我想不出任何巧妙的算法检查来真正减少这里的工作量。但是,您在代码中进行了大量分配,并且迭代过多。所以,只是将一些声明移出函数有点让我
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))
def check_if_list_is_sublist(lst):
for i in range(len(lst) - (l -1)):
if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
return True
if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
return True
return False
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x)]
这将上面示例代码的 run-time 减少了大约 50%。对于这种大小的列表,产生更多的进程并划分工作也可能会提高性能。想不出任何方法来真正减少比较量...
对于在一个大列表中搜索匹配,我相信散列(元素)然后建立索引将是一个很好的解决方案。
您将获得的收益:
建立一次索引,节省您的时间以备将来使用(不需要为每次搜索一次又一次地循环)。
甚至,我们可以在启动程序时建立索引,然后在程序退出时释放它,
下面的代码使用了两种获取hash值的方法:hash()和str();有时你应该根据你的具体场景自定义一个哈希函数。
如果使用str(),代码看起来很简单,不需要考虑hash冲突。但它可能会导致内存炸弹。
对于hash(),我使用列表保存了所有具有相同哈希值的sub_lst。并且可以使用 hash(sub_lst)%designed_length 来控制哈希大小(但会增加哈希冲突率)。
以下代码的输出:
By Hash: 0.00023986603994852955
By str(): 0.00022884208565612796
By OP's: 0.3001317172469765
[Finished in 1.781s]
测试代码:
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
#print(my_huge_list_of_lists)
test_lst = [['a', 'b', 'c' ], ['a', 'b', 'c'] ]
#Solution 1: By using built-in hash function
def prepare1(huge_list, interval=1): #use built-in hash function
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append(huge_list[index:index+interval])
else:
hash_db[hash_sub] = [huge_list[index:index+interval]]
return hash_db
hash_db = prepare1(my_huge_list_of_lists, interval=2)
def check_sublist1(hash_db, sublst): #use built-in hash function
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return any([sublst == item for item in hash_db[hash_sub]])
return False
print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): #use str() as hash function
return { str(huge_list[index:index+interval]):huge_list[index:index+interval] for index in range(len(huge_list) - interval + 1)}
hash_db = prepare2(my_huge_list_of_lists, interval=2)
def check_sublist2(hash_db, sublst): #use str() as hash function
hash_sub = str(sublst)
if hash_sub in hash_db:
return sublst == hash_db[hash_sub]
return False
print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 3: OP's current solution
def check_if_list_is_sublist(lst, sublst):
#checks if a list appears in order in another larger list.
n = len(sublst)
return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))
print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))
如果您想从一个列表中删除匹配的元素,这是可行的,但结果是您可能必须为新列表重建索引。除非列表是链表,否则将指针保存在索引中的每个元素。我只是 google Python how to get the pointer for one element of a list
,但找不到任何有用的东西。如果有人知道该怎么做,请不要犹豫,分享您的解决方案。谢谢
下面是一个示例:(它生成一个新列表而不是 return 原始列表,有时我们仍然需要从原始列表中过滤一些东西名单)
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
#print(my_huge_list_of_lists)
test_lst = [[['a', 'b'], ['a', 'b'] ], [['b', 'a'], ['a', 'b']]]
#Solution 1: By using built-in hash function
def prepare(huge_list, interval=1): #use built-in hash function
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append({'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]})
else:
hash_db[hash_sub] = [{'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]}]
return hash_db
hash_db = prepare(my_huge_list_of_lists, interval=2)
def check_sublist(hash_db, sublst): #use built-in hash function
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return [ item for item in hash_db[hash_sub] if sublst == item['data'] ]
return []
def remove_if_match_sublist(target_list, hash_db, sublsts):
matches = []
for sublst in sublsts:
matches += check_sublist(hash_db, sublst)
#make sure delete elements from end to begin
sorted_match = sorted(matches, key=lambda item:item['beg'], reverse=True)
new_list = list(target_list)
for item in sorted_match:
del new_list[item['beg']:item['end']]
return new_list
print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))
虽然这不是您所谓的 "answer" 本身,但它是一个基准测试框架,可以帮助您确定完成所需任务的最快方法,因为它允许相对容易的修改,因为以及添加不同的方法。
我已经将当前发布的答案以及 运行 的结果放入其中。
注意事项:请注意,我尚未验证其中所有经过测试的答案在某种意义上都是"correct"他们实际上做你想做的事,也不知道他们在这个过程中会消耗多少内存——这可能是另一个考虑因素。
目前@Oluwafemi Sule 的答案似乎比最接近的竞争对手快一个数量级(10 倍)。
from __future__ import print_function
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback
N = 10 # Number of executions of each "algorithm".
R = 3 # Number of repetitions of those N executions.
from random import randint, randrange, seed
from string import ascii_lowercase
a, b = 'a', 'b'
NUM_SUBLISTS = 1000
SUBLIST_LEN = 50
PERCENTAGE = 50 # Percentage of sublist that should get removed.
seed(42) # Initialize random number so the results are reproducible.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for __ in range(SUBLIST_LEN)]
for __ in range(NUM_SUBLISTS)]
# Put the target sequence in percentage of the sublists so they'll be removed.
for __ in range(NUM_SUBLISTS*PERCENTAGE // 100):
list_index = randrange(NUM_SUBLISTS)
sublist_index = randrange(SUBLIST_LEN)
my_huge_list_of_lists[list_index][sublist_index:sublist_index+2] = [a, b]
# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
from __main__ import a, b, my_huge_list_of_lists, NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE
""")
class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
""" A test case is composed of separate setup and test code fragments. """
def __new__(cls, setup, test):
""" Dedent code fragment in each string argument. """
return tuple.__new__(cls, (dedent(setup), dedent(test)))
testcases = {
"OP (Nas Banov)": TestCase("""
#
def check_if_list_is_sublist(lst, sublst):
''' Checks if a list appears in order in another larger list. '''
n = len(sublst)
return any((sublst == lst[i:i+n]) for i in range(len(lst) - n + 1))
""", """
shortened = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [a, b])]
"""
),
"Sphinx Solution 1 (hash)": TestCase("""
#
# Solution 1: By using built-in hash function.
def prepare1(huge_list, interval=1): # Use built-in hash function.
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append(huge_list[index:index+interval])
else:
hash_db[hash_sub] = [huge_list[index:index+interval]]
return hash_db
def check_sublist1(hash_db, sublst): # Use built-in hash function.
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return any([sublst == item for item in hash_db[hash_sub]])
return False
""", """
hash_db = prepare1(my_huge_list_of_lists, interval=2)
shortened = [x for x in my_huge_list_of_lists
if check_sublist1(hash_db, x)]
"""
),
"Sphinx Solution 2 (str)": TestCase("""
#
#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): # Use str() as hash function.
return {str(huge_list[index:index+interval]):huge_list[index:index+interval]
for index in range(len(huge_list) - interval + 1)}
def check_sublist2(hash_db, sublst): #use str() as hash function
hash_sub = str(sublst)
if hash_sub in hash_db:
return sublst == hash_db[hash_sub]
return False
""", """
hash_db = prepare2(my_huge_list_of_lists, interval=2)
shortened = [x for x in my_huge_list_of_lists
if check_sublist2(hash_db, x)]
"""
),
"Paul Becotte": TestCase("""
#
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))
def check_if_list_is_sublist(lst):
for i in range(len(lst) - (l -1)):
if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
return True
if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
return True
return False
""", """
shortened = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x)]
"""
),
"Oluwafemi Sule": TestCase("""
#
def checklst(lst, a, b):
try:
a_index = lst.index(a)
except ValueError:
return False
try:
return a_index > -1 and lst[a_index+1] == b
except IndexError:
try:
return a_index > -1 and lst[a_index-1] == b
except IndexError:
return False
""", """
shortened = [x for x in my_huge_list_of_lists
if not checklst(x, a, b)]
"""
),
}
# Collect timing results of executing each testcase multiple times.
try:
results = [
(label,
min(timeit.repeat(testcases[label].test,
setup=COMMON_SETUP + testcases[label].setup,
repeat=R, number=N)),
) for label in testcases
]
except Exception:
traceback.print_exc(file=sys.stdout) # direct output to stdout
sys.exit(1)
# Display results.
print('Results for {:,d} sublists of length {:,d} with {}% percent of them matching.'
.format(NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE))
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
'({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()
longest = max(len(result[0]) for result in results) # length of longest label
ranked = sorted(results, key=lambda t: t[1]) # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
print('{:>{width}} : {:9.6f} secs, rel speed {:5.2f}x, {:6.2f}% slower '
''.format(
result[0], result[1], round(result[1]/fastest, 2),
round((result[1]/fastest - 1) * 100, 2),
width=longest))
print()
输出:
Results for 1,000 sublists of length 50 with 50% percent of them matching
Fastest to slowest execution speeds using Python 3.6.4
(10 executions, best of 3 repetitions)
Oluwafemi Sule : 0.006441 secs, rel speed 1.00x, 0.00% slower
Paul Becotte : 0.069462 secs, rel speed 10.78x, 978.49% slower
OP (Nas Banov) : 0.082758 secs, rel speed 12.85x, 1184.92% slower
Sphinx Solution 2 (str) : 0.152119 secs, rel speed 23.62x, 2261.84% slower
Sphinx Solution 1 (hash) : 0.154562 secs, rel speed 24.00x, 2299.77% slower
假设我有一个 my_huge_list_of_lists,其中包含 2,000,000 个列表,每个列表的长度约为 50。
我想通过丢弃序列中不包含两个元素的子列表来缩短 2,000,000 my_huge_list_of_lists
。
到目前为止我有:
#
def check_if_list_is_sublist(lst, sublst):
#checks if a list appears in order in another larger list.
n = len(sublst)
return any((sublst == lst[i:i + n]) for i in xrange(len(lst) - n + 1))
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [a,b])]
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [b,a])]
搜索词 [a,b] 或 [b,a] 的连续性很重要,所以我不能使用 set.issubset()
。
我觉得这很慢。我想加快速度。我考虑了一些选项,例如使用 'early exit' 和语句:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if (a in x and not check_if_list_is_sublist(x, [a,b]))]
在 for
循环中使用 or
语句的次数更少:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not (check_if_list_is_sublist(x, [a,b])
or check_if_list_is_sublist(x, [b,a]))]
并且还在努力加速功能(WIP)
#
def check_if_list_is_sublist(lst, sublst):
checks if a list appears in order in another larger list.
set_of_sublists = {tuple(sublst) for sublist in lst}
并在 Stack Overflow 上进行了一些搜索;但是想不出办法因为check_if_list_is_sublist()
被调用的次数是len(my_huge_list) * 2
.
编辑:根据要求添加一些用户数据
from random import randint
from string import ascii_lowercase
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(2000000)]
my_neighbor_search_fwd = [i,c]
my_neighbor_search_rev = my_neighbor_search_fwd.reverse()
将 n-sized 子序列中的项目解包到 n 个变量中。然后写一个列表理解来过滤列表,在 sub-list.e.g.
中检查 a, b 或 b, aa, b = sublst
def checklst(lst, a, b):
l = len(lst)
start = 0
while True:
try:
a_index = lst.index(a, start)
except ValueError:
return False
try:
return a_index > -1 and lst[a_index+1] == b
except IndexError:
try:
return a_index > -1 and lst[a_index-1] == b
except IndexError:
start = a_index + 1
if start == l:
return False
continue # keep looking at the next a
%timeit found = [l for l in lst if checklst(l, a, b)]
1.88 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit found = [x for x in lst if (a in x and not check_if_list_is_sublist(x, [a,b]))]
22.1 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
所以,我想不出任何巧妙的算法检查来真正减少这里的工作量。但是,您在代码中进行了大量分配,并且迭代过多。所以,只是将一些声明移出函数有点让我
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))
def check_if_list_is_sublist(lst):
for i in range(len(lst) - (l -1)):
if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
return True
if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
return True
return False
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x)]
这将上面示例代码的 run-time 减少了大约 50%。对于这种大小的列表,产生更多的进程并划分工作也可能会提高性能。想不出任何方法来真正减少比较量...
对于在一个大列表中搜索匹配,我相信散列(元素)然后建立索引将是一个很好的解决方案。
您将获得的收益: 建立一次索引,节省您的时间以备将来使用(不需要为每次搜索一次又一次地循环)。 甚至,我们可以在启动程序时建立索引,然后在程序退出时释放它,
下面的代码使用了两种获取hash值的方法:hash()和str();有时你应该根据你的具体场景自定义一个哈希函数。
如果使用str(),代码看起来很简单,不需要考虑hash冲突。但它可能会导致内存炸弹。
对于hash(),我使用列表保存了所有具有相同哈希值的sub_lst。并且可以使用 hash(sub_lst)%designed_length 来控制哈希大小(但会增加哈希冲突率)。
以下代码的输出:
By Hash: 0.00023986603994852955
By str(): 0.00022884208565612796
By OP's: 0.3001317172469765
[Finished in 1.781s]
测试代码:
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
#print(my_huge_list_of_lists)
test_lst = [['a', 'b', 'c' ], ['a', 'b', 'c'] ]
#Solution 1: By using built-in hash function
def prepare1(huge_list, interval=1): #use built-in hash function
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append(huge_list[index:index+interval])
else:
hash_db[hash_sub] = [huge_list[index:index+interval]]
return hash_db
hash_db = prepare1(my_huge_list_of_lists, interval=2)
def check_sublist1(hash_db, sublst): #use built-in hash function
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return any([sublst == item for item in hash_db[hash_sub]])
return False
print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): #use str() as hash function
return { str(huge_list[index:index+interval]):huge_list[index:index+interval] for index in range(len(huge_list) - interval + 1)}
hash_db = prepare2(my_huge_list_of_lists, interval=2)
def check_sublist2(hash_db, sublst): #use str() as hash function
hash_sub = str(sublst)
if hash_sub in hash_db:
return sublst == hash_db[hash_sub]
return False
print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 3: OP's current solution
def check_if_list_is_sublist(lst, sublst):
#checks if a list appears in order in another larger list.
n = len(sublst)
return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))
print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))
如果您想从一个列表中删除匹配的元素,这是可行的,但结果是您可能必须为新列表重建索引。除非列表是链表,否则将指针保存在索引中的每个元素。我只是 google Python how to get the pointer for one element of a list
,但找不到任何有用的东西。如果有人知道该怎么做,请不要犹豫,分享您的解决方案。谢谢
下面是一个示例:(它生成一个新列表而不是 return 原始列表,有时我们仍然需要从原始列表中过滤一些东西名单)
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
#print(my_huge_list_of_lists)
test_lst = [[['a', 'b'], ['a', 'b'] ], [['b', 'a'], ['a', 'b']]]
#Solution 1: By using built-in hash function
def prepare(huge_list, interval=1): #use built-in hash function
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append({'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]})
else:
hash_db[hash_sub] = [{'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]}]
return hash_db
hash_db = prepare(my_huge_list_of_lists, interval=2)
def check_sublist(hash_db, sublst): #use built-in hash function
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return [ item for item in hash_db[hash_sub] if sublst == item['data'] ]
return []
def remove_if_match_sublist(target_list, hash_db, sublsts):
matches = []
for sublst in sublsts:
matches += check_sublist(hash_db, sublst)
#make sure delete elements from end to begin
sorted_match = sorted(matches, key=lambda item:item['beg'], reverse=True)
new_list = list(target_list)
for item in sorted_match:
del new_list[item['beg']:item['end']]
return new_list
print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))
虽然这不是您所谓的 "answer" 本身,但它是一个基准测试框架,可以帮助您确定完成所需任务的最快方法,因为它允许相对容易的修改,因为以及添加不同的方法。
我已经将当前发布的答案以及 运行 的结果放入其中。
注意事项:请注意,我尚未验证其中所有经过测试的答案在某种意义上都是"correct"他们实际上做你想做的事,也不知道他们在这个过程中会消耗多少内存——这可能是另一个考虑因素。
目前@Oluwafemi Sule 的答案似乎比最接近的竞争对手快一个数量级(10 倍)。
from __future__ import print_function
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback
N = 10 # Number of executions of each "algorithm".
R = 3 # Number of repetitions of those N executions.
from random import randint, randrange, seed
from string import ascii_lowercase
a, b = 'a', 'b'
NUM_SUBLISTS = 1000
SUBLIST_LEN = 50
PERCENTAGE = 50 # Percentage of sublist that should get removed.
seed(42) # Initialize random number so the results are reproducible.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for __ in range(SUBLIST_LEN)]
for __ in range(NUM_SUBLISTS)]
# Put the target sequence in percentage of the sublists so they'll be removed.
for __ in range(NUM_SUBLISTS*PERCENTAGE // 100):
list_index = randrange(NUM_SUBLISTS)
sublist_index = randrange(SUBLIST_LEN)
my_huge_list_of_lists[list_index][sublist_index:sublist_index+2] = [a, b]
# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
from __main__ import a, b, my_huge_list_of_lists, NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE
""")
class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
""" A test case is composed of separate setup and test code fragments. """
def __new__(cls, setup, test):
""" Dedent code fragment in each string argument. """
return tuple.__new__(cls, (dedent(setup), dedent(test)))
testcases = {
"OP (Nas Banov)": TestCase("""
#
def check_if_list_is_sublist(lst, sublst):
''' Checks if a list appears in order in another larger list. '''
n = len(sublst)
return any((sublst == lst[i:i+n]) for i in range(len(lst) - n + 1))
""", """
shortened = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [a, b])]
"""
),
"Sphinx Solution 1 (hash)": TestCase("""
#
# Solution 1: By using built-in hash function.
def prepare1(huge_list, interval=1): # Use built-in hash function.
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append(huge_list[index:index+interval])
else:
hash_db[hash_sub] = [huge_list[index:index+interval]]
return hash_db
def check_sublist1(hash_db, sublst): # Use built-in hash function.
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return any([sublst == item for item in hash_db[hash_sub]])
return False
""", """
hash_db = prepare1(my_huge_list_of_lists, interval=2)
shortened = [x for x in my_huge_list_of_lists
if check_sublist1(hash_db, x)]
"""
),
"Sphinx Solution 2 (str)": TestCase("""
#
#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): # Use str() as hash function.
return {str(huge_list[index:index+interval]):huge_list[index:index+interval]
for index in range(len(huge_list) - interval + 1)}
def check_sublist2(hash_db, sublst): #use str() as hash function
hash_sub = str(sublst)
if hash_sub in hash_db:
return sublst == hash_db[hash_sub]
return False
""", """
hash_db = prepare2(my_huge_list_of_lists, interval=2)
shortened = [x for x in my_huge_list_of_lists
if check_sublist2(hash_db, x)]
"""
),
"Paul Becotte": TestCase("""
#
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))
def check_if_list_is_sublist(lst):
for i in range(len(lst) - (l -1)):
if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
return True
if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
return True
return False
""", """
shortened = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x)]
"""
),
"Oluwafemi Sule": TestCase("""
#
def checklst(lst, a, b):
try:
a_index = lst.index(a)
except ValueError:
return False
try:
return a_index > -1 and lst[a_index+1] == b
except IndexError:
try:
return a_index > -1 and lst[a_index-1] == b
except IndexError:
return False
""", """
shortened = [x for x in my_huge_list_of_lists
if not checklst(x, a, b)]
"""
),
}
# Collect timing results of executing each testcase multiple times.
try:
results = [
(label,
min(timeit.repeat(testcases[label].test,
setup=COMMON_SETUP + testcases[label].setup,
repeat=R, number=N)),
) for label in testcases
]
except Exception:
traceback.print_exc(file=sys.stdout) # direct output to stdout
sys.exit(1)
# Display results.
print('Results for {:,d} sublists of length {:,d} with {}% percent of them matching.'
.format(NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE))
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
'({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()
longest = max(len(result[0]) for result in results) # length of longest label
ranked = sorted(results, key=lambda t: t[1]) # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
print('{:>{width}} : {:9.6f} secs, rel speed {:5.2f}x, {:6.2f}% slower '
''.format(
result[0], result[1], round(result[1]/fastest, 2),
round((result[1]/fastest - 1) * 100, 2),
width=longest))
print()
输出:
Results for 1,000 sublists of length 50 with 50% percent of them matching
Fastest to slowest execution speeds using Python 3.6.4
(10 executions, best of 3 repetitions)
Oluwafemi Sule : 0.006441 secs, rel speed 1.00x, 0.00% slower
Paul Becotte : 0.069462 secs, rel speed 10.78x, 978.49% slower
OP (Nas Banov) : 0.082758 secs, rel speed 12.85x, 1184.92% slower
Sphinx Solution 2 (str) : 0.152119 secs, rel speed 23.62x, 2261.84% slower
Sphinx Solution 1 (hash) : 0.154562 secs, rel speed 24.00x, 2299.77% slower