Faster approach than a double for loop when iterating over a large list (18,895 elements)
Here is the code:
import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()
    for row in reader:
        for city in city_lst:
            city = city.strip()
            match = re.search((r'\b{0}\b').format(city), row[0])
            if match:
                writer.writerow(row)
                break
"alcohol_rehab_ltp.csv" 有 145 行,"cities2.txt" 有 18,895 行(转换为列表时变为 18,895)。这个过程需要一段时间 运行,我没有计时,但大概 5 分钟左右。我在这里忽略了一些简单(或更复杂)的东西,可以使这个脚本 运行 更快。我将使用其他 .csv 文件 运行 来对抗 "cities.txt" 的大文本文件,这些 csv 文件可能有多达 1000 行。任何关于如何加快速度的想法将不胜感激!
这是 csv file:Keywords (144),Avg. CPC、本地搜索、广告商竞争
[alcohol rehab san diego],.54,90,High
[alcohol rehab dallas],.48,110,High
[alcohol rehab atlanta],.93,50,High
[free alcohol rehab centers],.88,110,High
[christian alcohol rehab centers],–,70,High
[alcohol rehab las vegas],.40,70,High
[alcohol rehab cost],.37,110,High
Some lines from the text file:
san diego
dallas
atlanta
dallas
los angeles
denver
Although I don't think the loop/IO is the big bottleneck here, you can still start with them.
I can offer two tips:
(r'\b{0}\b').format(c.strip())
can be moved outside the loop, which gives some performance improvement because we don't have to call strip() and format() on every iteration.
Also, you don't have to write the output inside every iteration; instead, you can create a result list output_list, collect the results during the loop, and write them all at once after the loop.
import csv
import re
import datetime

start = datetime.datetime.now()

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    # strip() and format() are applied once, outside the row loop
    city_lst = [(r'\b{0}\b').format(c.strip()) for c in cities.readlines()]
    output_list = []
    for row in reader:
        for city in city_lst:
            match = re.search(city, row[0])
            if match:
                output_list.append(row)
                break
    # write all matched rows once, after the loop
    writer.writerows(output_list)

end = datetime.datetime.now()
print end - start
Note that I assume you can find a better way than re.search to locate the city in a row, since cities are usually separated by delimiters such as spaces; otherwise the complexity is worse than O(n*m).
One approach is to use a hash table.
ht = [0] * MAX
Read all the cities (assume there are thousands of them) and fill the hash table:
ht[hash(city) % MAX] = 1
Now, as you iterate over each row in reader:
for row in reader:
    for word in row:
        if ht[hash(word) % MAX] == 1:
            # found, do stuff here
            pass
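As a rough, runnable sketch of that idea (file names taken from the question), a Python set can stand in for the manual table, since sets are hash-based. Note that this word-by-word lookup would miss multi-word cities such as "san diego", so it only illustrates the hashing approach:
import csv

# sketch only: a set is a hash table under the hood
with open('cities2.txt', 'rb') as cities:
    city_set = set(line.strip() for line in cities)

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    reader = csv.reader(csv_f)
    writer = csv.writer(out_csv, delimiter=",")
    for row in reader:
        # strip the surrounding brackets and test each word against the set
        words = row[0].strip('[]').split()
        if any(word in city_set for word in words):
            writer.writerow(row)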
I think you can use a set and indexing:
with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    # make a set of all city names, lookups are O(1)
    city_set = {line.rstrip() for line in cities}
    output_list = []
    header = next(reader)  # skip header
    for row in reader:
        try:
            # names are either first or last with two words preceding or following,
            # so split twice on whitespace from either direction
            if row[0].split(None, 2)[-1].rstrip("]") in city_set or row[0].rsplit(None, 2)[0][1:] in city_set:
                output_list.append(row)
        except IndexError as e:
            print(e, row[0])
    writer.writerows(output_list)
The running time is now O(n) instead of quadratic.
First, as @Shawn Zhang suggested, (r'\b{0}\b').format(c.strip()) can be moved outside the loop, and you can build a result list so that you don't write to the file on every iteration.
Second, you can try re.compile to pre-compile the regular expressions, which may improve your regex performance.
Third, try profiling it a bit to find the bottleneck, for example with timeit or another profiler.
Also, if the city is always in the first column (which I'll assume is named 'City'), why not read the csv with csv.DictReader()? I'm sure it's faster than the regex.
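A minimal sketch of the re.compile suggestion, assuming the same file names as in the question; each pattern is compiled once up front, and only searched inside the row loop:
import csv
import re

with open('cities2.txt', 'rb') as cities:
    # compile each city pattern once, instead of on every search
    city_patterns = [re.compile(r'\b{0}\b'.format(c.strip())) for c in cities]

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    reader = csv.reader(csv_f)
    writer = csv.writer(out_csv, delimiter=",")
    output_list = []
    for row in reader:
        if any(p.search(row[0]) for p in city_patterns):
            output_list.append(row)
    # write the matches once, after the loop
    writer.writerows(output_list)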
EDIT
Based on the file samples you provided, I removed re (since you don't really seem to need it) and got more than a 10x speedup with the following code:
import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    output_list = []
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()
    for row in reader:
        for city in city_lst:
            city = city.strip()
            if city in row[0]:
                output_list.append(row)
                # stop at the first matching city so a row isn't appended twice
                break
    writer.writerows(output_list)
Build a single regular expression containing all of the city names:
city_re = re.compile(r'\b(' + '|'.join(c.strip() for c in cities.readlines()) + r')\b')
Then do:
for row in reader:
    match = city_re.search(row[0])
    if match:
        writer.writerow(row)
This reduces the number of loop iterations from 18,895 x 145 to just 145, while the regex engine does what it does best: string prefix matching against the 18,895 city names.
For your convenience and testing, here is the full listing:
import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    city_re = re.compile(r'\b(' + '|'.join(c.strip() for c in cities.readlines()) + r')\b')
    for row in reader:
        match = city_re.search(row[0])
        if match:
            writer.writerow(row)
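One hedged caveat on this approach: if any entry in cities2.txt happened to contain a regex metacharacter, it could break or distort the combined pattern. Applying re.escape while joining is a safe variant (under the assumption of plain city names, the matches are unchanged):
city_re = re.compile(r'\b(' + '|'.join(re.escape(c.strip()) for c in cities.readlines()) + r')\b')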