Map values from another data frame under two conditions
I need to do a substring search within a string column, conditional on the value in a second column. I have two data frames:
df1
df2
(Step 1) For the first row in df1, the N_Product column is VALVE.
(Step 2) Look for VALVE in the N_Product column of every row of df2. This finds 3 matching pairs:
df2['N_Product'] (VALVE) - df2['M_Product'] (DONE),
df2['N_Product'] (VALVE) - df2['M_Product'] (PRESSURE),
df2['N_Product'] (VALVE) - df2['M_Product'] ('').
(Step 3) Then check whether df1['Descr'] contains any of the M_Product values from these pairs:
df2['N_Product'] (VALVE) - df2['M_Product'] (DONE),
df2['N_Product'] (VALVE) - df2['M_Product'] (PRESSURE),
df2['N_Product'] (VALVE) - df2['M_Product'] ('')
If it does, write N_Product + ":" + M_Product + ";". If it does not, write only N_Product + ';'. For 'VALVE', only the df2['M_Product'] values "DONE", "PRESSURE" and "" should be searched for in df1['Descr']; the others are not needed. For N_Product 'GEEKU', only "ELECTRICAL", "OVERBOARD" and "" should be searched for, and so on, depending on which M_Product values correspond to that N_Product. M_Product values that belong to other N_Product values must not be searched for in df1['Descr'].
df1 = {'Descr': ["VALVE, DONE", "pump ttf", "pump electrical", "Valve, ww","Geeku MBA , electrical","valve PRESSURE, OVERBOARD","VALVE, Electrical DONE","Geeku electrical OVERBOARD","Geeku OVERBOARD , electrical"],
'N_Product': ["VALVE", "PUMP", "PUMP", "VALVE", "GEEKU","VALVE","VALVE", "GEEKU", "GEEKU"],
}
df2 = {'N_Product': ["GEEKU","GEEKU","GEEKU", "PUMP", "PUMP","VALVE", "VALVE","VALVE"],
'M_Product': ["ELECTRICAL", "OVERBOARD","", "TTF","", "DONE","PRESSURE",""],
}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
desired result
I use this code, but it searches all values of df2['M_Product'], whereas it should search only those where df1['N_Product'] == df2['N_Product']. I would be grateful for any help with this.
def foo(x):
    descr = x['Descr'].upper()
    match = None
    if x['N_Product'].upper() in list(df2['N_Product']):
        for mStr in df2['M_Product'].str.upper():
            if mStr in descr:
                match = mStr
                break
    if match is None:
        return x['N_Product'] + ';'
    else:
        return x['N_Product'] + ': ' + match + ';'
df1['Result'] = df1.apply(foo, axis = 1)
I added a picture to visualize what needs to be done, using the df1['N_Product'] value "VALVE" as an example. The same has to be done for all values:
picture
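Based on the result you describe with the picture in your question, here is my understanding of what you are trying to do:
- Each N_Product value has an associated list of M_Product values in df2.
- Each N_Product value in df1 has a Descr value, which is a csv-style list of the form: N_Product followed by one or more candidate M_Product values for that row.
- Objective: append a Result column to df1 that contains, for each row, the N_Product value n together with the first M_Product value m from the corresponding Descr such that (n, m) is found in df2.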
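The first point can be made concrete as a lookup table. A minimal sketch, assuming the df2 from the question and treating the empty string as "no M_Product" (the name `allowed` is made up for this illustration and is not used in the code further down):

```python
import pandas as pd

df2 = pd.DataFrame({
    'N_Product': ["GEEKU", "GEEKU", "GEEKU", "PUMP", "PUMP", "VALVE", "VALVE", "VALVE"],
    'M_Product': ["ELECTRICAL", "OVERBOARD", "", "TTF", "", "DONE", "PRESSURE", ""],
})

# Drop the empty placeholders, then collect the M_Product values per N_Product.
# `allowed` is a hypothetical name; values keep their df2 order within each group.
allowed = (
    df2[df2['M_Product'] != '']
    .groupby('N_Product')['M_Product']
    .agg(list)
    .to_dict()
)
print(allowed)
# {'GEEKU': ['ELECTRICAL', 'OVERBOARD'], 'PUMP': ['TTF'], 'VALVE': ['DONE', 'PRESSURE']}
```

Only the values listed under a row's own N_Product are candidates for that row, which is the restriction the question asks for.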
Here is some code that I believe does what you ask:
import pandas as pd
df1 = {'Descr': ["VALVE, DONE", "pump ttf", "pump electrical", "Valve, ww","Geeku MBA , electrical","valve PRESSURE, OVERBOARD","VALVE, Electrical DONE","Geeku electrical OVERBOARD","Geeku OVERBOARD , electrical"],
'N_Product': ["VALVE", "PUMP", "PUMP", "VALVE", "GEEKU","VALVE","VALVE", "GEEKU", "GEEKU"],
}
df2 = {'N_Product': ["GEEKU","GEEKU","GEEKU", "PUMP", "PUMP","VALVE", "VALVE","VALVE"],
'M_Product': ["ELECTRICAL", "OVERBOARD","", "TTF","", "DONE","PRESSURE",""],
}
df1 = pd.DataFrame(df1).apply(lambda x: x.astype(str).str.upper())
df2 = pd.DataFrame(df2).apply(lambda x: x.astype(str).str.upper())
print('df1:'); print(df1)
print('df2:'); print(df2)
df1['M_Product'] = df1['Descr'].apply(lambda x: [val.strip(',') for val in x.split() if val.strip(',')]).str.slice(start=1)
df1['df1_row'] = df1.index
df3 = df1[['df1_row', 'N_Product', 'M_Product']].explode('M_Product')
df5 = df3.merge(df2, on=['N_Product', 'M_Product']).groupby('df1_row').nth(0)
df1['M_Product'] = df5['M_Product']
df1['Result'] = df1['N_Product'] + (~df1['M_Product'].isna()) * (': ' + df1['M_Product'].astype(str).str.strip()) + ';'
df1 = df1.drop(columns=['M_Product', 'df1_row'])
print('result:'); print(df1)
Output:
df1:
                          Descr  N_Product
0                   VALVE, DONE      VALVE
1                      PUMP TTF       PUMP
2               PUMP ELECTRICAL       PUMP
3                     VALVE, WW      VALVE
4        GEEKU MBA , ELECTRICAL      GEEKU
5     VALVE PRESSURE, OVERBOARD      VALVE
6        VALVE, ELECTRICAL DONE      VALVE
7    GEEKU ELECTRICAL OVERBOARD      GEEKU
8  GEEKU OVERBOARD , ELECTRICAL      GEEKU
df2:
   N_Product   M_Product
0      GEEKU  ELECTRICAL
1      GEEKU   OVERBOARD
2      GEEKU
3       PUMP         TTF
4       PUMP
5      VALVE        DONE
6      VALVE    PRESSURE
7      VALVE
result:
                          Descr  N_Product              Result
0                   VALVE, DONE      VALVE        VALVE: DONE;
1                      PUMP TTF       PUMP          PUMP: TTF;
2               PUMP ELECTRICAL       PUMP               PUMP;
3                     VALVE, WW      VALVE              VALVE;
4        GEEKU MBA , ELECTRICAL      GEEKU  GEEKU: ELECTRICAL;
5     VALVE PRESSURE, OVERBOARD      VALVE    VALVE: PRESSURE;
6        VALVE, ELECTRICAL DONE      VALVE        VALVE: DONE;
7    GEEKU ELECTRICAL OVERBOARD      GEEKU  GEEKU: ELECTRICAL;
8  GEEKU OVERBOARD , ELECTRICAL      GEEKU  GEEKU: ELECTRICAL;
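To see what the code builds before the merge, it can help to print the intermediate long-format frame with one row per Descr token. A reduced sketch on three of the rows; the helper `tokens_after_first` and the name `long_form` are made up here (the code above calls this frame df3), and the tokenizing is done in plain Python rather than with the `.str` accessor:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Descr': ["VALVE, DONE", "Geeku MBA , electrical", "Geeku OVERBOARD , electrical"],
    'N_Product': ["VALVE", "GEEKU", "GEEKU"],
})

def tokens_after_first(descr):
    """Upper-case, split on whitespace, strip commas, drop empties and the first token."""
    parts = [p.strip(',') for p in descr.upper().split()]
    return [p for p in parts if p][1:]

# One list of candidate tokens per row, then one row per token via explode().
long_form = (
    df1.assign(df1_row=df1.index, M_Product=df1['Descr'].map(tokens_after_first))
    [['df1_row', 'N_Product', 'M_Product']]
    .explode('M_Product')
)
print(long_form)
```

Each original row now appears once per token, and df1_row records which df1 row a token came from, so matches can be traced back after the merge.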
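Explanation:
- Upper-case everything in df1 and df2 to simplify matching.
- Split Descr into tokens, drop the first token (which just repeats N_Product), and store the rest as a list in a new df1 column named M_Product.
- Record the original df1 row index in a column named df1_row.
- Use explode() to create a data frame df3 with one row for each value in the M_Product lists above.
- Use merge() to select the rows of df3 whose (N_Product, M_Product) pair matches a row in df2.
- Use groupby() on df1_row with nth(0) to keep the 0th of these matching (N_Product, M_Product) pairs for each df1 row.
- Add the M_Product column from this new data frame back to df1.
- Fill a new Result column in df1 with (1) N_Product + ';' if the M_Product column is empty (isna()), or (2) N_Product + ': ' + M_Product + ';' if there is an M_Product match.
- Drop the intermediate columns we no longer need (M_Product, df1_row).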
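For comparison, the original foo() from the question can also be repaired directly by restricting the candidates to the df2 rows with the same N_Product. This is only a sketch under assumptions that the question does not state: matching is done on whole tokens separated by commas or whitespace, empty M_Product values are ignored, and when several candidates appear in Descr the one listed first in df2 wins. The names `candidates`, `tokenize` and `make_result` are invented for the example.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Descr': ["VALVE, DONE", "pump ttf", "pump electrical", "Valve, ww",
              "Geeku MBA , electrical", "valve PRESSURE, OVERBOARD",
              "VALVE, Electrical DONE", "Geeku electrical OVERBOARD",
              "Geeku OVERBOARD , electrical"],
    'N_Product': ["VALVE", "PUMP", "PUMP", "VALVE", "GEEKU",
                  "VALVE", "VALVE", "GEEKU", "GEEKU"],
})
df2 = pd.DataFrame({
    'N_Product': ["GEEKU", "GEEKU", "GEEKU", "PUMP", "PUMP", "VALVE", "VALVE", "VALVE"],
    'M_Product': ["ELECTRICAL", "OVERBOARD", "", "TTF", "", "DONE", "PRESSURE", ""],
})

# N_Product -> non-empty M_Product values, in df2 order (assumed priority).
candidates = {}
for n, m in zip(df2['N_Product'].str.upper(), df2['M_Product'].str.upper()):
    if m:
        candidates.setdefault(n, []).append(m)

def tokenize(descr):
    """Upper-case Descr and split it into a set of tokens on commas and whitespace."""
    return set(descr.upper().replace(',', ' ').split())

def make_result(row):
    """Return 'N: M;' for the first df2 candidate found in Descr, else 'N;'."""
    n = row['N_Product'].upper()
    tokens = tokenize(row['Descr'])
    for m in candidates.get(n, []):
        if m in tokens:
            return n + ': ' + m + ';'
    return n + ';'

df1['Result'] = df1.apply(make_result, axis=1)
print(df1)
```

On the sample data this gives the same Result values as the output shown above. Because the priority between several matching candidates is spelled out in the loop, the result does not depend on the order in which merge() returns rows, which may differ between pandas versions.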