How to match and merge two dataframes whose values are completely different except for the numbers in one column?
I have a dataframe ABC with these values:
id | price | type
0 easdca | Rs.1,599.00 was trasn by you | unknown
1 vbbngy | txn of INR 191.00 using | unknown
2 awerfa | Rs.190.78 credits was used by you | unknown
3 zxcmo5 | DLR.2000 credits was used by you | unknown
and another dataframe XYZ with these values:
price | type
0 190.78 | food
1 191.00 | movie
2 2,000 | football
3 1,599.00 | basketball
How can I map XYZ onto ABC so that type in ABC is updated with type from XYZ, matching on the numeric value found in the price columns?
The output I need:
id | price | type
0 easdca | Rs.1,599.00 was trasn by you | basketball
1 vbbngy | txn of INR 191.00 using | movie
2 awerfa | Rs.190.78 credits was used by you | food
3 zxcmo5 | DLR.2,000 credits was used by you | football
I tried this:
d = dict(zip(XYZ['PRICE'],XYZ['TYPE']))
pat = (r'({})'.format('|'.join(d.keys())))
ABC['TYPE']=ABC['PRICE'].str.extract(pat,expand=False).map(d)
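But values such as 190.78 and 191.00 end up mismatched.
For example, when handling a large amount of data, 190.78 should match food, yet values like 190.77, which have other types assigned, get matched to food. And 198.78, which should match food, gets matched to something else.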
You can do the following:
import pandas as pd

'''
First we make an artificial key column to be able to merge.
We extract the floating-point number from the string
and convert it to type float.
'''
df1['price_key'] = df1['price'].str.replace(',', '').str.extract(r'(\d+\.\d+)', expand=False).astype(float)
# After that we merge on price_key and price, then drop the columns we don't need.
# This assumes df2['price'] is already numeric (float), not a string.
df_final = pd.merge(df1, df2, left_on='price_key', right_on='price', suffixes=('', '_2'))
df_final = df_final.drop(['type', 'price_key', 'price_2'], axis='columns')
Output
id price type_2
0 easdca Rs.1,599.00 was trasn by you basketball
1 vbbngy txn of INR 191.00 using movie
2 awerfa Rs.190.78 credits was used by you food
3 zxcmo5 DLR.2000.78 credits was used by you football
I guess there is a typo in your xyz table: the third price should be 2000.78 instead of 2000.
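If the 2000 in the question is not a typo, the same merge idea can still work with a pattern whose decimal part is optional. The following is only a sketch under assumptions: the data is copied from the question, `df2['price']` is assumed to hold strings with thousands separators, and `to_number` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Sample data copied from the question (df1 plays the role of ABC, df2 of XYZ).
df1 = pd.DataFrame({
    "id": ["easdca", "vbbngy", "awerfa", "zxcmo5"],
    "price": [
        "Rs.1,599.00 was trasn by you",
        "txn of INR 191.00 using",
        "Rs.190.78 credits was used by you",
        "DLR.2000 credits was used by you",
    ],
    "type": ["unknown"] * 4,
})
df2 = pd.DataFrame({
    "price": ["190.78", "191.00", "2,000", "1,599.00"],
    "type": ["food", "movie", "football", "basketball"],
})


def to_number(text: pd.Series) -> pd.Series:
    """Hypothetical helper: strip commas, take the first number, cast to float."""
    cleaned = text.str.replace(",", "", regex=False)
    # The decimal part is optional, so "2000" is captured as well as "190.78".
    return cleaned.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)


# Build the same numeric key on both sides so the merge compares like with like.
df1["price_key"] = to_number(df1["price"])
df2["price_key"] = to_number(df2["price"])

# A left merge keeps every row of df1, in its original order.
merged = df1.drop(columns="type").merge(
    df2[["price_key", "type"]], on="price_key", how="left"
)
result = merged.drop(columns="price_key")
print(result)
```

Because both keys are parsed from the same cleaned text, identical numbers produce identical floats. Rows of `df1` with no counterpart in `df2` would end up with a missing `type` rather than being dropped.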
df
id price type
0 easdca Rs.1,599.00 was trasn by you unknown
1 vbbngy txn of INR 191.00 using unknown
2 awerfa Rs.190.78 credits was used by you unknown
3 zxcmo5 DLR.2000 credits was used by you unknown
df2
price type
0 190.78 food
1 191.00 movie
2 2,000 football
3 1,599.00 basketball
Using re
import re
df['price_'] = df['price'].apply(lambda x: re.findall(r'(?<=[\.\s])[\d\.]+', x.replace(',', ''))[0])
df2.columns = ['price_', 'type']
df2['price_'] = df2['price_'].str.replace(',', '')
Change the type to float
df2['price_'] = df2['price_'].astype(float)
df['price_'] = df['price_'].astype(float)
Using pd.merge
df = df.merge(df2, on='price_')
df = df.drop('type_x', axis=1)
Output
id price price_ type_y
0 easdca Rs.1,599.00 was trasn by you 1599.00 basketball
1 vbbngy txn of INR 191.00 using 191.00 movie
2 awerfa Rs.190.78 credits was used by you 190.78 food
3 zxcmo5 DLR.2000 credits was used by you 2000 football
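The question asks for the existing type column to be updated rather than for a new type_y column. One possible variation, sketched here under assumptions, is to build a lookup and use `Series.map` instead of a merge. The data is copied from the question; `to_cents` and `lookup` are hypothetical names, and converting to integer cents is a design choice made here to avoid comparing floats, not something taken from the answers above.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": ["easdca", "vbbngy", "awerfa", "zxcmo5"],
    "price": [
        "Rs.1,599.00 was trasn by you",
        "txn of INR 191.00 using",
        "Rs.190.78 credits was used by you",
        "DLR.2000 credits was used by you",
    ],
    "type": ["unknown"] * 4,
})
df2 = pd.DataFrame({
    "price": ["190.78", "191.00", "2,000", "1,599.00"],
    "type": ["food", "movie", "football", "basketball"],
})

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def to_cents(text):
    """Hypothetical helper: first number in the text as integer cents, or None."""
    match = NUMBER.search(text.replace(",", ""))
    return round(float(match.group()) * 100) if match else None


# Lookup from the price in cents to the type, built from df2.
lookup = {to_cents(p): t for p, t in zip(df2["price"], df2["type"])}

# Map each row of df to its type; rows with no match keep their existing type.
keys = df["price"].map(to_cents)
df["type"] = keys.map(lookup).fillna(df["type"])
print(df)
```

The whole number is compared, so a price such as 190.77 cannot be picked up by the entry for 190.78.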