Most efficient way to loop through a file's values and check a dictionary for any/all corresponding instances
I have a file with user names, one per line, and I need to compare each name in the file against all of the values in a csv file, recording every instance where the user name appears in the csv file. I need to make the search as efficient as possible, since the csv file is 40K lines long.
My example persons.txt file:
Smith, Robert
Samson, David
Martin, Patricia
Simpson, Marge
My example locations.csv file:
GreaterLocation,LesserLocation,GroupName,DisplayName,InBook
NorthernHemisphere,UnitedStates,Pilots,"Wilbur, Andy, super pilot",Yes
WesternHemisphere,China,Pilots,"Kirby, Mabry, loves pizza",Yes
WesternHemisphere,Japan,Drivers,"Samson, David, big kahuna",Yes
NorthernHemisphere,Canada,Drivers,"Randos, Jorge",Yes
SouthernHemisphere,Australia,Mechanics,"Freeman, Gordon",Yes
NorthernHemisphere,Mexico,Pilots,"Simpson, Marge",Yes
SouthernHemisphere,New Zealand,Mechanics,"Samson, David",Yes
My code:
import csv

def parse_files():
    with open('data_file/persons.txt', 'r') as user_list:
        lines = user_list.readlines()
        for user_row in lines:
            new_user = user_row.strip()
            per = []
            with open('data_file/locations.csv', newline='') as target_csv:
                DictReader_loc = csv.DictReader(target_csv)
                for loc_row in DictReader_loc:
                    if new_user.lower() in loc_row['DisplayName'].lower():
                        per.append(DictReader_loc.line_num)
                        print(DictReader_loc.line_num, loc_row['DisplayName'])
            if len(per) > 0:
                print("\n"+new_user, per)
    print("Parse Complete")

def main():
    parse_files()

main()
My code currently works. Based on the sample data in the example files, it matches 2 instances of "Samson, David" and 1 instance of "Simpson, Marge" in the locations.csv file. I'm hoping someone can guide me on how to transform the persons.txt file or the locations.csv file (40K+ lines) so that the process is as efficient as possible. Right now it takes 10-15 minutes. I know the loop isn't the most efficient approach, but I do need to check every name and see where it appears in the csv file.
I think @Tomalak's SQLite solution is very useful, but if you want something closer to your original code, see the version below.
Effectively, it reduces the amount of file opening/closing/reading going on, and hopefully that will speed things up. Note that a csv reader can only be traversed once, so the version below makes a single pass over the rows and checks every user against each row. Since your sample is small, I couldn't take any real measurements.
Going forward, you might consider pandas for this kind of task - it is very convenient for working with CSVs and better optimized than the csv module (see the sketch at the end of this answer).
import csv

def parse_files():
    with open('persons.txt', 'r') as user_list:
        # sets are faster to match against than lists;
        # do the lower() here to avoid repeating it for every row
        user_set = set(u.strip().lower() for u in user_list)

    # open the CSV once instead of once per user
    target_csv = open('locations.csv', 'r', newline='')
    DictReader_loc = csv.DictReader(target_csv)

    # the reader can only be consumed once, so iterate the rows in a
    # single pass and test every user against each row
    per = {user: [] for user in user_set}
    for loc_row in DictReader_loc:
        display_name = loc_row['DisplayName'].lower()
        for user in user_set:
            if user in display_name:
                per[user].append(DictReader_loc.line_num)
                print(DictReader_loc.line_num, loc_row['DisplayName'])

    for user, matches in per.items():
        if len(matches) > 0:
            print("\n" + user, matches)
    print("Parse Complete")
    target_csv.close()

def main():
    parse_files()

main()
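For reference, here is a minimal sketch of what the SQLite route mentioned above might look like. This is only an illustration, not necessarily @Tomalak's exact code; the in-memory database and the file names are assumptions:

import csv
import sqlite3

def parse_files():
    # load the 40K-row CSV into an in-memory table once
    con = sqlite3.connect(':memory:')
    con.execute("CREATE TABLE loc (line_num INTEGER, display_name TEXT)")
    with open('locations.csv', newline='') as target_csv:
        reader = csv.DictReader(target_csv)
        con.executemany(
            "INSERT INTO loc VALUES (?, ?)",
            ((reader.line_num, row['DisplayName']) for row in reader))

    # let SQLite do the case-insensitive substring search per user
    with open('persons.txt') as user_list:
        for user in (u.strip() for u in user_list if u.strip()):
            rows = con.execute(
                "SELECT line_num FROM loc"
                " WHERE lower(display_name) LIKE '%' || lower(?) || '%'",
                (user,)).fetchall()
            if rows:
                print("\n" + user, [r[0] for r in rows])
    print("Parse Complete")
    con.close()

parse_files()

The LIKE pattern still forces a table scan per user, but the scan happens inside SQLite's C code rather than in a Python loop, which is usually noticeably faster.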
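And a rough sketch of the pandas approach suggested above, under the same file-name assumptions. The +2 converts pandas' 0-based data index back to DictReader-style line numbers (1 for the header row, 1 for 1-based counting); this assumes no multi-line quoted fields:

import pandas as pd

def parse_files():
    with open('persons.txt') as user_list:
        users = [u.strip().lower() for u in user_list if u.strip()]

    # read the whole CSV once; matching is then vectorized per user
    df = pd.read_csv('locations.csv')
    display_names = df['DisplayName'].str.lower()

    for user in users:
        # substring test over the entire column in one call
        mask = display_names.str.contains(user, regex=False)
        if mask.any():
            # index 0 is the first data row, i.e. line 2 of the file
            print("\n" + user, [i + 2 for i in df.index[mask]])
    print("Parse Complete")

parse_files()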