String match within a string, then add a column of matched substrings
I have two dataframes, df and df_ref:
I am using the following code:
df['match']=df[df['swift_codes'].str.contains('|'.join(df_ref['Hfrase_best_search_string']))]
This gives me this result:
The result I would like to see is this:
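For what it's worth, the line above assigns a row-filtered DataFrame to the new column, so it can never hold the matched text itself. A minimal sketch of one way to get the matched substring, assuming the goal is "first reference string found in each code", is `Series.str.extract` with an alternation pattern. The sample values below are made up for illustration; only the column names come from the post.

```python
import re
import pandas as pd

# Hypothetical sample data; the real frames are not shown in the post.
df = pd.DataFrame({"swift_codes": ["ABCDUS33XXX", "WXYZGB2LABC", "QQQQFR00000"]})
df_ref = pd.DataFrame({"Hfrase_best_search_string": ["ABCD", "ABCDUS33", "GB2L"]})

# Escape each search string and put longer ones first, so the most specific
# alternative wins when several of them match at the same position.
terms = sorted(df_ref["Hfrase_best_search_string"].dropna().unique(), key=len, reverse=True)
pattern = "(" + "|".join(re.escape(t) for t in terms) + ")"

# str.extract returns the first match per row, or NaN when nothing matches.
df["match"] = df["swift_codes"].str.extract(pattern, expand=False)
print(df)
```

Sorting by length is a design choice, not something the post asks for: regex alternation takes the first alternative that matches, so without it "ABCD" would shadow "ABCDUS33".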
So I was able to figure this out with the approach below. The part where I actually begin the matching and merging process starts at the comment block labelled:
"# CREATE DATAFRAMES WITH TARGETED VALUES TO COMPARE AND MERGE TO ULTIMATELY
# YIELD A NEW DATAFRAME WITH THE 'BANK_SHORT_NAME' FROM TEAM'S RECORDS"

import pandas as pd


def find_swift(df, col_name='ORGNTR_BNK_NM'):
    """Reads in raw data and isolates possible new swift codes, which are then analysed
    to determine if they are in fact new swift codes. To support decisions, extracted
    11-character substrings are referenced against established swift code substrings
    already compiled in records. Then the function extracts bank codes and country codes
    and puts them in their own columns. The function will produce an Excel file with two
    tabs, one with swift codes and their associated bank names and another with the
    rejected strings."""
    # NOTE: df1 (team records) and df2 (ISO country codes) are not parameters;
    # they are looked up in the enclosing (module) scope.

    # BEGIN CLEANING AND ORGANIZING RAW DATA, SPLIT SUBSTRINGS TO THEIR OWN ROWS__________
    # AND THEN STACK ALL STRINGS ON TOP OF EACH OTHER_____________________________________
    df_col = df[col_name].dropna()

    # take out all non letters and digits from raw data
    # (regex=True is needed on pandas 2.x, where the default is a literal match)
    df_col = df_col.str.replace(r'[^0-9a-zA-Z]', ' ', regex=True).to_frame()

    # perform boolean test to see if elements are all numeric or not, results saved
    # in new column 'numeric'; create a new dataframe with only non-numeric elements
    df_col['numeric'] = df_col.iloc[:, 0].map(lambda x: all(i.isdecimal() for i in x.split()))
    df_non_num = df_col[~df_col['numeric']]
    df_col = df_col.drop(columns=['numeric'])

    # count number of splits and assign the maximum number of splits to a variable
    # so they can be put into a new dataframe where each substring has its own column
    max_cols = int(df_non_num.iloc[:, 0].str.split().str.len().max())

    # create a new dataframe where each substring has its own column
    # (n must be passed by keyword on pandas 2.x)
    df_split = df_non_num.iloc[:, 0].str.split(' ', n=max_cols, expand=True)

    # BEGIN EXTRACTING 11-CHARACTER STRINGS FOR ANALYSIS__________________________________
    # APPEND ALL THE RESULTING DATAFRAMES INTO ONE________________________________________
    # select only the 11-character strings from each split column, drop NaN values
    # and duplicates, and save the resulting series in the list split2
    split2 = []
    for column in df_split:
        series = df_split[column]
        series = series[series.str.len() == 11]
        split2.append(series.dropna().drop_duplicates())

    # iterate over each series to identify all the probable swift codes by comparing
    # portions of its strings to verify they correspond to bank codes, country codes
    # and branch codes found in precompiled records (df1, df2); a string is kept when
    # at least two of the three portions match; results are stored in the list split3
    split3 = []
    for series in split2:
        bank_ok = series.str[0:4].isin(df1['swift_bank_code'])
        country_ok = series.str[4:6].isin(df2['ISO ALPHA-2 Code'])
        branch_ok = series.str[8:11].isin(df1['swift_branch_code'])
        series = series[(bank_ok & country_ok) | (branch_ok & bank_ok) | (country_ok & branch_ok)]
        split3.append(series.drop_duplicates())

    # stack all the series into one dataframe with a single 'swift_codes' column
    # (Series.append was removed in pandas 2.0, so pd.concat is used instead)
    s = pd.concat(split3).dropna().astype(str).to_frame('swift_codes')

    # CREATE DATAFRAMES WITH TARGETED VALUES TO COMPARE AND MERGE TO ULTIMATELY___________
    # YIELD A NEW DATAFRAME WITH THE 'BANK_SHORT_NAME' FROM TEAM'S RECORDS________________
    # create a dataframe from df1 with only the 'short bank name' and
    # 'best search string' columns for easier dataframe management
    df1b = df1[['Hfrase_short_name', 'Hfrase_best_search_string']]

    # for each prefix length, take that portion of every swift code, select the
    # records whose search string has the same length, and left-merge the two so
    # the bank short name from previously compiled records is attached
    all_dfs = []
    for n in (4, 6, 8, 10, 11):
        prefixes = s['swift_codes'].str[0:n].to_frame()
        records = df1b[df1b['Hfrase_best_search_string'].str.len() == n]
        merged = (prefixes.reset_index()
                  .merge(records, how='left', left_on='swift_codes',
                         right_on='Hfrase_best_search_string')
                  .set_index('index'))
        all_dfs.append(merged)

    # stack the merged dataframes, drop NaN and duplicate values, and sort by index
    swift_result = pd.concat(all_dfs).dropna().drop_duplicates().sort_index()

    # keep only the short bank name, then join it onto all the swift codes found earlier
    swift_result = swift_result[['Hfrase_short_name']]
    swift_bank = s.join(swift_result, how='left', lsuffix='_original')
    swift_bank.columns = ['swift_codes', 'bank_short_name']

    # create new columns to store the country code and bank code extracted
    # from column 'swift_codes'
    swift_bank['swift_country_code'] = swift_bank['swift_codes'].str[4:6]
    swift_bank['swift_bank_code'] = swift_bank['swift_codes'].str[0:4]

    # drop duplicates
    swift = swift_bank.drop_duplicates()

    # merge the swift codes back onto the original data (new_df) and onto the
    # cleaned column (new_skinny_df) by index
    new_df = pd.merge(df, swift, how='outer', left_index=True, right_index=True)
    new_skinny_df = pd.merge(df_col, swift, how='outer', left_index=True, right_index=True)
    return swift, new_df, new_skinny_df
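The cleaning and filtering half of the function can probably be written more compactly. The following is only a sketch, not a drop-in replacement: `candidate_codes` and `plausible_swift` are hypothetical helper names, the sample strings and code lists are invented, and unlike the function above it does not first discard rows that are entirely numeric. It tokenises each row with `str.split` plus `explode`, keeps 11-character tokens, and then keeps tokens where at least two of the three portions (bank, country, branch) are known, which is the same rule as the three OR-ed conditions in the function.

```python
import pandas as pd

def candidate_codes(raw: pd.Series) -> pd.Series:
    """Return every 11-character alphanumeric token, keeping the source row index."""
    tokens = (
        raw.dropna()
           .str.replace(r"[^0-9a-zA-Z]", " ", regex=True)  # strip punctuation
           .str.split()                                     # list of tokens per row
           .explode()                                       # one token per row
           .dropna()                                        # rows that had no tokens
    )
    return tokens[tokens.str.len() == 11]

def plausible_swift(codes, bank_codes, country_codes, branch_codes):
    """Keep codes where at least two of bank/country/branch portions are known."""
    hits = (
        codes.str[0:4].isin(bank_codes).to_numpy().astype(int)
        + codes.str[4:6].isin(country_codes).to_numpy().astype(int)
        + codes.str[8:11].isin(branch_codes).to_numpy().astype(int)
    )
    return codes[hits >= 2]

# Hypothetical raw values and reference lists, for illustration only.
raw = pd.Series([
    "Bank of X / ABCDUS33XXX ref 12345678901",
    None,
    "no code here",
    "WXYZGB2LABC;ABCDUS33XXX",
])
codes = candidate_codes(raw)
kept = plausible_swift(codes, ["ABCD"], ["US", "GB"], ["XXX"])
print(kept)
```

Because `explode` repeats the original index, each surviving code can still be joined back to its source row by index, as the function does at the end.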
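I know it's a lot, but I can explain. Let me know if anyone wants an explanation.
The part where I start breaking the two files apart for the merge operations that give me the result I want begins at the comment block labelled "# CREATE DATAFRAMES WITH TARGETED VALUES TO COMPARE AND MERGE TO ULTIMATELY".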
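The merge step boils down to a prefix lookup: cut each code to 4, 6, 8, 10 or 11 characters and see whether that prefix equals a reference search string of the same length. A small sketch of that idea follows, with one deliberate difference that is my assumption rather than the behaviour above: it returns only the longest matching prefix per code, whereas the merge approach keeps every prefix length that matches. The helper name `longest_prefix_name` and the sample rows are hypothetical; the column names are the ones used in the post.

```python
import pandas as pd

def longest_prefix_name(codes: pd.Series, ref: pd.DataFrame,
                        key: str = "Hfrase_best_search_string",
                        name: str = "Hfrase_short_name") -> pd.Series:
    """For each code, return the name whose search string is its longest prefix."""
    ref = ref.dropna(subset=[key, name])
    lookup = dict(zip(ref[key], ref[name]))
    # Try the longest reference strings first so the most specific name wins.
    lengths = sorted({len(k) for k in lookup}, reverse=True)

    def name_for(code):
        for n in lengths:
            hit = lookup.get(code[:n])
            if hit is not None:
                return hit
        return None  # no reference string is a prefix of this code

    return codes.map(name_for)

# Hypothetical reference records and codes, for illustration only.
ref = pd.DataFrame({
    "Hfrase_best_search_string": ["ABCD", "ABCDUS33", "WXYZGB"],
    "Hfrase_short_name": ["Alpha Bank", "Alpha Bank US", "Wxyz UK"],
})
codes = pd.Series(["ABCDUS33XXX", "ABCDGB2LXXX", "WXYZGB2LABC", "QQQQFR00000"])
names = longest_prefix_name(codes, ref)
print(pd.DataFrame({"swift_codes": codes, "bank_short_name": names}))
```

A plain dictionary lookup per row is slower than a merge on very large frames, but it avoids building five intermediate dataframes and works unchanged when the index contains duplicates.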