How to compare each element in a list with each element in another list?
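I want to compare a list of extracted promo codes with a list of correct promo codes.
If a promo code in extracted_list does not match any promo code in the correct_promo_code list, the promo code contains an error. To find the right promo code in the correct_promo_codes list, I need the promo code with the minimum edit distance (Levenshtein distance) to the promo code being compared (the one from extracted_list).
Code so far: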
import csv

# Python 2: csv files are opened in binary mode
with open("all_correct_promo.csv","rb") as file1:
    reader1 = csv.reader(file1)
    correctPromoList = list(reader1)  # each row is a list of strings
    #print correctPromoList

with open("all_extracted_promo.csv","rb") as file2:
    reader2 = csv.reader(file2)
    extractedPromoList = list(reader2)
    #print extractedPromoList

incorrectPromo = []
count = 0  # number of extracted rows that match a correct row exactly
for extracted in extractedPromoList:
    if(extracted not in correctPromoList):
        incorrectPromo.append(extracted)
    else:
        count = count + 1
#print incorrectPromo

for promos in incorrectPromo:
    print promos
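One detail worth noting about the snippet above: csv.reader yields every row as a list of strings, so both lists hold rows rather than plain promo code strings, and a string distance function cannot be applied to them directly. The following is a minimal Python 3 sketch, assuming each file holds one promo code per row in the first column (the real file layout is not shown in the question). The helper name `load_codes` and the sample codes are hypothetical, and in-memory files stand in for the two CSV files.

```python
import csv
import io

def load_codes(file_obj):
    """Return the first column of every non-empty CSV row as a stripped string."""
    return [row[0].strip() for row in csv.reader(file_obj) if row and row[0].strip()]

# In-memory stand-ins for the two CSV files (hypothetical sample data).
correct_file = io.StringIO("SAVE10\nFREESHIP\nWELCOME5\n")
extracted_file = io.StringIO("SAVE10\nFREESH1P\nWELC0ME5\n")

correct_codes = load_codes(correct_file)
extracted_codes = load_codes(extracted_file)

# A set gives constant-time membership tests for the exact-match check.
correct_set = set(correct_codes)
incorrect_codes = [code for code in extracted_codes if code not in correct_set]
print(incorrect_codes)  # ['FREESH1P', 'WELC0ME5']
```

With real files, `open("all_correct_promo.csv", newline="")` would take the place of the `io.StringIO` objects in Python 3.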
nltk.metrics.distance.edit_distance(s1, s2, transpositions=False)
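Compute the Levenshtein edit distance between two strings. The edit distance is the number of characters that need to be substituted, inserted, or deleted to transform s1 into s2. For example, transforming "rain" into "shine" requires three steps, consisting of two substitutions and one insertion: "rain" -> "sain" -> "shin" -> "shine". These operations could have been done in other orders, but at least three steps are needed.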
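To make the definition above concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach to Levenshtein distance. It is an illustration only, not NLTK's implementation, and the function name `levenshtein` is chosen here for the example. It handles substitutions, insertions, and deletions, but not transpositions.

```python
def levenshtein(s1, s2):
    """Return the minimum number of single-character substitutions,
    insertions, and deletions needed to turn s1 into s2."""
    # previous[j] is the distance between the prefix of s1 processed so far
    # (initially empty) and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            substitution = previous[j - 1] + (c1 != c2)  # free if the characters match
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

print(levenshtein("rain", "shine"))  # 3
```

The table is kept one row at a time, so memory use grows with the length of s2 only.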
Coming to your code, I think a few changes to the lower half will capture the edit distance:
from nltk.metrics import distance # slow to load

extractedPromoList = ['abc','acd','abd'] # dummy stand-in for the csv of extracted promo codes
correctPromoList = ['abc','aba','xbz','abz','abx'] # dummy stand-in for the csv of real promo codes

def find_min_edit(str_,list_):
    distances = {}
    min_dist = float('inf') # larger than any possible distance
    for correct_promo in list_:
        # edit distance; transpositions=True also counts swapping two adjacent characters as one edit
        dist = distance.edit_distance(str_,correct_promo,True)
        distances[correct_promo] = dist # store each score for real promo codes
        if dist<min_dist:
            min_dist = dist # store min distance
    # extract all real promo codes with minimum edit distance
    nearest_correct_promos = [i[0] for i in distances.items() if i[1]==min_dist]
    return ','.join(nearest_correct_promos) # return a comma separated string of nearest real promo codes

incorrectPromo = {}
count = 0
for extracted in extractedPromoList:
    print 'Computing %dth promo code...' % count
    incorrectPromo[extracted] = find_min_edit(extracted,correctPromoList) # get comma separated str of real promo codes nearest to extracted
    count+=1
print incorrectPromo
Output
Computing 0th promo code...
Computing 1th promo code...
Computing 2th promo code...
{'abc': 'abc', 'abd': 'abx,aba,abz,abc', 'acd': 'abx,aba,abz,abc'}
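As the output shows, exact matches such as 'abc' are also searched, and ties come back as a single comma-joined string. A possible Python 3 variation is sketched below. It skips exact matches and returns the minimum distance together with a list of tied candidates. The names `nearest_codes` and `hamming` are hypothetical. `hamming` is only a stand-in distance so that the example runs without NLTK, and it is valid only for equal-length strings. In practice one would likely pass `distance.edit_distance` from NLTK as `distance_fn` instead.

```python
def hamming(s1, s2):
    """Stand-in distance: count differing positions (equal-length strings only)."""
    if len(s1) != len(s2):
        raise ValueError("hamming() needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def nearest_codes(code, candidates, distance_fn):
    """Return (minimum distance, sorted list of candidates at that distance)."""
    distances = {candidate: distance_fn(code, candidate) for candidate in candidates}
    min_dist = min(distances.values())
    return min_dist, sorted(c for c, d in distances.items() if d == min_dist)

extracted_codes = ['abc', 'acd', 'abd']
correct_codes = ['abc', 'aba', 'xbz', 'abz', 'abx']
correct_set = set(correct_codes)

# Only codes that are not exact matches need a nearest-neighbour search.
suggestions = {
    code: nearest_codes(code, correct_codes, hamming)
    for code in extracted_codes
    if code not in correct_set
}
print(suggestions)
# {'acd': (2, ['aba', 'abc', 'abx', 'abz']), 'abd': (1, ['aba', 'abc', 'abx', 'abz'])}
```

Returning the distance alongside the candidates makes it possible to reject suggestions that are too far away, and a list of ties leaves the choice between equally close codes to the caller.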