Python/Pandas: matching a substring within another substring
I am trying to find a common key stored in two different substrings of two different DataFrames, and then output a third column:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name':['John','Michael','Dan','George', 'Adam'], 'Code1':['AAA OO','BBB UU','JJ',np.nan,'II']})
df2 = pd.DataFrame({'Second Name':['Smith','Cohen','Moore','Kas', 'Faber'], 'code2':['UU HHH','AAA GGG',np.nan , 'TT II', np.nan]})
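Expected output:
I have done my research, and my question is very similar to an existing one. However, the key there has only one item, whereas in my example both keys contain two items.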
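This assumes your codes are always separated by spaces.
You can use list comprehensions to check whether each code in the Code1 column is present in the code2 column. By retrieving the indices of the matching codes, we can build a DataFrame containing the rows with overlapping codes.
We can then update the original DataFrame to get the expected output.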
# Create a list of matching codes: one list of booleans per df1 row,
# with one entry per df2 row
list_of_matches = df1['Code1'].apply(lambda x: [
    any(word in str(list_of_words).split()
        for word in str(x).split())
    for list_of_words in df2['code2']])
# Get the indices of matching codes
i, j = np.where(list_of_matches.values.tolist())
# Create a new dataframe with name and second name of rows with matching code,
# indexed by the df1 row so that update() aligns on the right rows.
# Drop rows with NA: str(NaN) is 'nan', so missing codes match each other
df3 = pd.DataFrame(np.column_stack([df1.loc[i], df2.loc[j]]),
                   columns=df1.columns.append(df2.columns),
                   index=i).dropna()
# Create columns in your original dataframe to be able to update the dataframe
# (object dtype, so they can hold the strings written by update())
df1['Second Name'] = pd.Series(np.nan, index=df1.index, dtype=object)
df1['code2'] = pd.Series(np.nan, index=df1.index, dtype=object)
# Update dataframe with matching rows
df1.update(df3)
Output
      Name   Code1 Second Name    code2
0     John  AAA OO       Cohen  AAA GGG
1  Michael  BBB UU       Smith   UU HHH
2      Dan      JJ         NaN      NaN
3   George     NaN         NaN      NaN
4     Adam      II         Kas    TT II
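As an alternative sketch (not part of the answer above), the same overlap can be found by splitting each code string into tokens, exploding them into one row per token, and merging the two frames on the token. The helper explode_codes and the column names row, row1, row2 and token are made up for illustration. The sketch assumes both frames use the default integer index and keeps only the first match for each df1 row.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Michael', 'Dan', 'George', 'Adam'],
                    'Code1': ['AAA OO', 'BBB UU', 'JJ', np.nan, 'II']})
df2 = pd.DataFrame({'Second Name': ['Smith', 'Cohen', 'Moore', 'Kas', 'Faber'],
                    'code2': ['UU HHH', 'AAA GGG', np.nan, 'TT II', np.nan]})


def explode_codes(df, col):
    """Hypothetical helper: one row per space-separated token of `col`.

    Returns a frame with the original row label in 'row' and the token
    in 'token'. Rows whose code is NaN are dropped.
    """
    tokens = df[col].str.split().explode().dropna().rename('token')
    return tokens.reset_index().rename(columns={'index': 'row'})


left = explode_codes(df1, 'Code1').rename(columns={'row': 'row1'})
right = explode_codes(df2, 'code2').rename(columns={'row': 'row2'})

# Inner merge on the token gives every (df1 row, df2 row) pair that shares
# a code; keep only the first match per df1 row (an assumption of this sketch)
pairs = left.merge(right, on='token').drop_duplicates('row1')

# Pick the matching df2 rows and relabel them with the df1 row they belong to
matched = df2.loc[pairs['row2'].to_numpy()].copy()
matched.index = pairs['row1'].to_numpy()

# Left join on the index: df1 rows without a match get NaN
result = df1.join(matched)
print(result)
```

For the sample frames this gives the same rows as the output above. Because missing codes are dropped before the merge, NaN never matches NaN, so no dropna cleanup of the matched rows is needed.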