Fuzzy logic for Excel data - Pandas

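I have two dataframes: DF (~100k rows), which is a raw data file, and DF1 (15k rows), which is a mapping file. I am trying to match the DF.address and DF.Name columns against DF1.Address and DF1.Name. Once a match is found, DF1.ID should be populated into DF.ID (if DF1.ID is not None); otherwise DF1.top_ID should be populated into DF.ID.

I am able to match the address and the name with the help of fuzzy logic, but I am stuck on how to connect the results obtained to populate the ID.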
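
For reference, the fill-in rule described above (take ID when it is present, otherwise fall back to top_ID) can be resolved on the mapping frame before any matching happens. This is only a minimal sketch: the rows are invented for illustration, and `resolved_ID` is a hypothetical helper column that does not exist in the original files.

```python
import pandas as pd

# Hypothetical mapping rows; the values are invented for illustration
df1 = pd.DataFrame({
    "Name": ["A Hospital", "B Clinic"],
    "ID": ["ID-1", None],
    "top_ID": ["TOP-1", "TOP-2"],
})

# Use ID where it is present, otherwise fall back to top_ID
# (resolved_ID is a hypothetical helper column, not part of the source data)
df1["resolved_ID"] = df1["ID"].fillna(df1["top_ID"])

print(df1[["Name", "resolved_ID"]])
```

With the fallback resolved up front, the matching step only has to copy a single column into DF.ID.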

DF1 - mapping file

DF - raw data file

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# read_excel takes no `index` argument, so it is dropped here
df = pd.read_excel("Test1")
df1 = pd.read_excel("Test2")

# Only rows that still lack an ID need to be matched
df = df[df['ID'].isnull()]
zip_code = df['Zip'].tolist()
Facility_city = df['City'].tolist()
Address = df['Address'].tolist()
Name_list = df['Name'].tolist()


def fuzzy_match(x, choice, scorer, cutoff):
    # With a Series as `choice`, the result is (match, score, index label), or None
    return process.extractOne(x,
                              choices=choice,
                              scorer=scorer,
                              score_cutoff=cutoff)

for pin, city, Add, Name in zip(zip_code, Facility_city, Address, Name_list):
    #====Address Matching=====#
    choice = df1.loc[(df1['Zip'] == pin) & (df1['City'] == city), 'Address1']
    result = fuzzy_match(Add, choice, fuzz.ratio, 70)
    # The score is element 1 of the result tuple
    if result is None or result[1] <= 70:
        continue
    #====Name Matching========#
    choice_1 = df1.loc[(df1['Zip'] == pin) & (df1['City'] == city), 'Name']
    result_1 = fuzzy_match(Name, choice_1, fuzz.ratio, 95)
    if result_1 is None or result_1[1] <= 95:
        continue
    # Here populating the matching ID
    print("ok")

IIUC, here is a solution:

from fuzzywuzzy import fuzz
import pandas as pd

#Read raw data from clipboard
raw = pd.read_clipboard()

#Read map data from clipboard
mp = pd.read_clipboard()

#Merge raw data and mp data as follows
dfr = mp.merge(raw, on=['Hospital Name', 'City', 'Pincode'], how='outer')

#dfr will have many duplicate rows - eliminate duplicates
#To eliminate duplicates, score Address_x against Address_y with token_sort_ratio
dfr['SCORE'] = dfr.apply(lambda x: fuzz.token_sort_ratio(x['Address_x'], x['Address_y']), axis=1)

#Keep only the max-score row within each Address_x group
#idxmax returns index labels, so select with .loc rather than .iloc
dfr1 = dfr.loc[dfr.groupby('Address_x')['SCORE'].idxmax()]
#dfr1 should hold the desired result

The link contains sample data for testing the provided solution.
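
To connect a match back to the ID column, one option is to keep the index label of the winning mapping row and copy its ID from there. The sketch below makes several assumptions: the two frames are invented toy data using the column names from the question's code, `resolved_ID`, `ratio` and `best_match` are hypothetical helpers, and `difflib.SequenceMatcher` from the standard library stands in for `fuzz.ratio` so the example runs without extra dependencies. It also checks the name of the same mapping row that won the address match, which is a design choice rather than something the question specifies.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    """0-100 similarity score; a stand-in for fuzz.ratio in this sketch."""
    return round(100 * SequenceMatcher(None, str(a), str(b)).ratio())


def best_match(query, choices, cutoff):
    """Return (index label, score) of the best choice scoring >= cutoff, else None."""
    best = None
    for label, text in choices.items():
        score = ratio(query, text)
        if score >= cutoff and (best is None or score > best[1]):
            best = (label, score)
    return best


# Hypothetical toy frames; the values are invented for illustration
df = pd.DataFrame({
    "Name": ["City Hospital", "Green Clinic", "Lake Hospital"],
    "Address": ["12 Main Street", "45 Park Road", "9 Hill Lane"],
    "City": ["Pune", "Pune", "Delhi"],
    "Zip": [411001, 411001, 110001],
    "ID": [None, None, None],
})
df1 = pd.DataFrame({
    "Name": ["City Hospital", "Green Clinic"],
    "Address1": ["12 Main St", "45 Park Road"],
    "City": ["Pune", "Pune"],
    "Zip": [411001, 411001],
    "ID": ["H-1", None],
    "top_ID": ["T-1", "T-2"],
})

# Resolve the value to copy once: ID where present, otherwise top_ID
df1["resolved_ID"] = df1["ID"].fillna(df1["top_ID"])

for idx, row in df[df["ID"].isnull()].iterrows():
    # Candidates are mapping rows with the same Zip and City
    cand = df1[(df1["Zip"] == row["Zip"]) & (df1["City"] == row["City"])]
    addr = best_match(row["Address"], cand["Address1"], 70)
    if addr is None:
        continue
    # Check the name of the mapping row that won the address match
    label = addr[0]
    if ratio(row["Name"], cand.at[label, "Name"]) >= 95:
        df.at[idx, "ID"] = df1.at[label, "resolved_ID"]

print(df[["Name", "ID"]])
```

On this toy data the first two raw rows receive H-1 and T-2, and the third stays empty because no mapping row shares its Zip and City. With fuzzywuzzy, `process.extractOne` called on a Series returns the index label as the third element of its result, so it could play the same role as `label` here.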