Most efficient way to loop through a file's values and check a dictionary for any/all corresponding instances

I have a file containing usernames, one per line, and I need to compare each name in that file against all the values in a csv file, recording each time a username appears in the csv file. I need to make the search as efficient as possible, since the csv file is 40K+ lines long.

My example persons.txt file:

Smith, Robert
Samson, David
Martin, Patricia
Simpson, Marge

My example locations.csv file:

GreaterLocation,LesserLocation,GroupName,DisplayName,InBook
NorthernHemisphere,UnitedStates,Pilots,"Wilbur, Andy, super pilot",Yes
WesternHemisphere,China,Pilots,"Kirby, Mabry, loves pizza",Yes
WesternHemisphere,Japan,Drivers,"Samson, David, big kahuna",Yes
NortherHemisphere,Canada,Drivers,"Randos, Jorge",Yes
SouthernHemispher,Australia,Mechanics,"Freeman, Gordon",Yes
NortherHemisphere,Mexico,Pilots,"Simpson, Marge",Yes
SouthernHemispher,New Zealand,Mechanics,"Samson, David",Yes

My code:

import csv

def parse_files():

    with open('data_file/persons.txt', 'r') as user_list:
        lines = user_list.readlines()
        for user_row in lines:
            new_user = user_row.strip()
            per = []
            with open('data_file/locations.csv', newline='') as target_csv:
                DictReader_loc = csv.DictReader(target_csv)
            
                for loc_row in DictReader_loc:
                    if new_user.lower() in loc_row['DisplayName'].lower():
                        per.append(DictReader_loc.line_num)
                        print(DictReader_loc.line_num, loc_row['DisplayName'])
            if len(per) > 0:
                print("\n"+new_user, per)
    print("Parse Complete")
        
def main():
    parse_files()

main()

My code currently works. With the sample data in the example files, it matches the 2 instances of "Samson, David" and the 1 instance of "Simpson, Marge" in the locations.csv file. I'm hoping someone can guide me on how to transform the persons.txt file or the locations.csv file (40K+ lines) so that the process is as efficient as possible. Right now it takes around 10-15 minutes. I know looping isn't the most efficient approach, but I really do need to check each name and find where it appears in the csv file.

I think @Tomalak's SQLite solution is very useful, but if you want to keep it closer to your original code, see the version below.

Essentially, it cuts down on the amount of file opening/closing/reading going on, which should speed processing up.

Since your sample is small, I couldn't take any real measurements.

Going forward, you might consider pandas for tasks like this; it is very convenient for working with CSVs and is more optimized than the csv module.

import csv

def parse_files():
    with open('persons.txt', 'r') as user_list:
        # sets are faster to match against than lists
        # do the lower() here to avoid repetition
        user_set = {u.strip().lower() for u in user_list}

    # read the CSV once up front; a csv reader is a one-shot iterator,
    # so store (line_num, DisplayName) pairs in a list that can be
    # scanned again for every user
    with open('locations.csv', 'r', newline='') as target_csv:
        DictReader_loc = csv.DictReader(target_csv)
        loc_rows = [(DictReader_loc.line_num, row['DisplayName'])
                    for row in DictReader_loc]

    for user in user_set:
        per = []
        for line_num, display_name in loc_rows:
            if user in display_name.lower():
                per.append(line_num)
                print(line_num, display_name)
        if per:
            print("\n" + user, per)
    print("Parse Complete")

def main():
    parse_files()

main()
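To illustrate the pandas suggestion above, here is a minimal sketch of the same substring search done with a DataFrame. The function name `find_users`, its parameters, and the dict it returns are my own illustrative choices, not part of the original post; it assumes the same file layouts as the example files.

```python
import pandas as pd

def find_users(persons_path, locations_path):
    """Return {user: [csv line numbers]} for users found in DisplayName."""
    with open(persons_path) as user_list:
        users = {u.strip() for u in user_list if u.strip()}

    # parse the whole CSV once; pandas reads it with a C parser,
    # which is much faster than looping with the csv module
    locations = pd.read_csv(locations_path)
    display = locations['DisplayName'].str.lower()

    matches = {}
    for user in users:
        # vectorized literal-substring test over all rows at once
        hits = locations.index[display.str.contains(user.lower(), regex=False)]
        if len(hits) > 0:
            # index 0 is the first data row, i.e. file line 2 (after the
            # header), matching what DictReader.line_num reports
            matches[user] = [i + 2 for i in hits]
    return matches
```

The per-user loop remains, but each pass is a single vectorized operation over the 40K rows instead of 40K Python-level iterations.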