python 中列表的时间高效过滤

Question

我有一个名为“do_not_call”的数据库 table，其中包含有关以递增顺序保存 10 位 phone 数字的文件的信息。 “filename”列包含包含从“first_phone”到“last_phone”的数字范围的文件的名称。 'do_not_call' table.

中大约有 2500 条记录

而且我有一个 sqlalchemy 记录列表。我需要找到哪个文件保存了这些记录的 'phone' 字段。所以我创建了一个函数，它接收 sqlalchemy 记录和 returns 字典，其中键是文件名，值是来自 sqlalchemy 记录的 phone 数字列表，这些数字落在范围内第一个和最后一个 phone 个数字，包含在文件中。

def get_file_mappings(dbcursor, import_records):
    start_time = datetime.now()
    phone_list = [int(rec.phone) for rec in import_records]
    dnc_sql = "SELECT * from do_not_call;"
    dbcursor.execute(dnc_sql)
    dnc_result = dbcursor.fetchall()
    file_mappings = {}
    for file_info in dnc_result:
        first_phone = int(file_info.get('first_phone'))
        last_phone = int(file_info.get('last_phone'))
        phone_ranges = list(filter(lambda phone: phone in range(first_phone, last_phone), phone_list))
        if phone_ranges:
            file_mappings.update({file_info.get('filename'): phone_ranges})
            phone_list = list(set(phone_list) - set(phone_ranges))
    # print(file_mappings)
    print("Time = ", datetime.now() - start_time)
    return file_mappings

例如，如果 phone_list 是 [2023143300, 2024393100, 2027981539, 2022760321, 2026416368, 2027585911]，返回的file_mappings会是

    {'1500000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2023143300, 2022760321],
     '1700000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2024393100], 
'1900000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2027981539, 2026416368, 2027585911]}

这里的问题是执行起来要花很多时间。平均而言，1000 条记录大约需要 1.5 秒。有没有更好的approach/algorithm解决这个问题。感谢任何帮助。

Answer 1

这是一种将事物分箱到排序列表中的非常低效的方法。您没有利用您的垃圾箱已排序（或者如果未排序则可以轻松排序）这一事实。您正在通过使用 lambda 语句测试 phone 数字来在此处创建一个大的嵌套循环。

您可以通过与 set 使用保持一致来做出一些边际改进（见下文。）但最后，您 could/should 只是找到每个 phone 在列出有效的搜索，如二分法。请参阅下面的示例，其中包含原始时间、集合实现和二分插入。

如果您的 phone_list 非常庞大，那么其他方法可能会更有利，例如找到截止 bin 适合 phone 列表的排序副本的位置...但下面是处理 1,000 或 10,000 条记录的速度比现在快 500 倍

# phone sorter
import random
import bisect
import time
from collections import defaultdict

# make some fake data of representative size
low_phone = 200_000_0000
data = []   # [file, low_phone, high_phone]
for idx in range(2500):
    row = []
    row.append(f'file_{idx}')
    row.append(low_phone + idx * 20000000)
    row.append(low_phone + (idx + 1) * 20000000 - 20)  # some gap
    data.append(row)

high_phone = data[-1][-1]

# generate some random phone numbers in range
num_phones = 10000
phone_list_orig = [random.randint(low_phone, high_phone) for t in range(num_phones)]

# orig method...
phone_list = phone_list_orig[:]
tic = time.time()
results = {}
for row in data:
    low = row[1]
    high = row[2]
    phone_ranges = list(filter(lambda phone: phone in range(low, high), phone_list))
    if phone_ranges:
        results.update({row[0]:phone_ranges})
        phone_list = list(set(phone_list) - set(phone_ranges))
toc = time.time()
print(f'orig time: {toc-tic:.3f}')

# with sets across the board...
phone_list = set(phone_list_orig)
tic = time.time()
results2 = {}
for row in data:
    low = row[1]
    high = row[2]
    phone_ranges = set(filter(lambda phone: phone in range(low, high), phone_list))
    if phone_ranges:
        results2.update({row[0]:phone_ranges})
        phone_list = phone_list - phone_ranges
toc = time.time()
print(f'using sets time: {toc-tic:.3f}')

# using bisection search
phone_list = set(phone_list_orig)
tic = time.time()
results3 = defaultdict(list)
lows = [t[1] for t in data]
for phone in phone_list:
    location = bisect.bisect(lows, phone) - 1
    if phone <= data[location][2]:  # it is within the high limit of bin
        results3[data[location][0]].append(phone)
toc = time.time()
print(f'using bisection sort time: {toc-tic:.3f}')

# for k in sorted(results3):
#   print(k, ':', results.get(k))

assert(results==results2==results3)

结果：

orig time: 5.236
using sets time: 4.597
using bisection sort time: 0.012
[Finished in 9.9s]

python 中列表的时间高效过滤

Time efficient filtering of list in python

list

filter

time-complexity

python-3.x

结果：