Replace messy str with clean str from another dataframe
I have two DataFrames, and I want to clean df1['Fruits'] wherever it contains a string from df2['Fruits'].
df1
Name Fruits
--------------
Dina Pineapple, [Y*]
Maria PTC*, Apple
Johny Durian, 1-6
Johny 5,6 Rambutan
Maria Apple (Red), [Y] *
Dina [Y] *, Peach88
Dina Kiwi/Qiwi, PS*
df2
Fruits tag
-------------
Apple 20
Pineapple 30
Rambutan 40
Durian 50
Apple (Red) 25
Peach88 55
Kiwi/Qiwi 25
I tried:
df1.loc[df1['Fruits'].contains(df2['Fruits']),'Fruits'] = df2['Fruits']
but it raises:
'Series' object has no attribute 'contains'
So what I hope to get is:
df1
Name Fruits
--------------
Dina Pineapple
Maria Apple
Johny Durian
Johny Rambutan
Maria Apple (Red)
Dina Peach88
Dina Kiwi/Qiwi
Use pandas.Series.str.extract:
# Build a regex alternation from df2['Fruits']
reg = '(%s)' % '|'.join(df2['Fruits'])
df1['Fruits'] = df1['Fruits'].str.extract(reg)
Output:
Name Fruits
0 Dina Pineapple
1 Maria Apple
2 Johny Durian
3 Johny Rambutan
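The output above lists four rows, while df2 in the question also holds names such as Apple (Red), whose parentheses are regex metacharacters and would open a second capture group if joined as-is. The following is a minimal sketch of one way to cover the full sample. The use of re.escape and the longest-first ordering are my assumptions, not part of the original answer, and the DataFrames are rebuilt from the tables in the question.

```python
import re
import pandas as pd

# Sample data copied from the tables in the question
df1 = pd.DataFrame({
    "Name": ["Dina", "Maria", "Johny", "Johny", "Maria", "Dina", "Dina"],
    "Fruits": [
        "Pineapple, [Y*]", "PTC*, Apple", "Durian, 1-6", "5,6 Rambutan",
        "Apple (Red), [Y] *", "[Y] *, Peach88", "Kiwi/Qiwi, PS*",
    ],
})
df2 = pd.DataFrame({
    "Fruits": ["Apple", "Pineapple", "Rambutan", "Durian",
               "Apple (Red)", "Peach88", "Kiwi/Qiwi"],
    "tag": [20, 30, 40, 50, 25, 55, 25],
})

# Assumption: try longer names first, so 'Apple (Red)' is tested before 'Apple'
names = sorted(df2["Fruits"], key=len, reverse=True)

# Assumption: re.escape keeps '(' and ')' literal instead of opening a new group
reg = "(%s)" % "|".join(re.escape(name) for name in names)

# expand=False returns a Series when the pattern has a single capture group
df1["Fruits"] = df1["Fruits"].str.extract(reg, expand=False)
print(df1)
```

Rows that match none of the names come back as NaN, so it may be worth checking for missing values before overwriting the original column.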
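Explanation of '(%s)' % '|'.join(df2['Fruits']):
- '|'.join(df2['Fruits']): joins the names with |, the "or" operator in a regular expression. For the four names Apple, Pineapple, Rambutan and Durian it returns Apple|Pineapple|Rambutan|Durian
- '(%s)' % ...: this is called string formatting, and is equivalent to:
  - str.format: '({})'.format('|'.join(df2['Fruits']))
  - or, more explicit (but less pythonic): '(' + '|'.join(df2['Fruits']) + ')'
- All of these return (Apple|Pineapple|Rambutan|Durian), a capturing group, which pd.Series.str.extract needs in order to know what to extract.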
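As a quick check of the explanation above, the sketch below builds the pattern in the three ways mentioned and shows what happens when the parentheses are left out. The variable names (fruits, messy and so on) are hypothetical, and the data is a four-name subset of the question's tables.

```python
import pandas as pd

# Hypothetical four-name subset of df2['Fruits']
fruits = pd.Series(["Apple", "Pineapple", "Rambutan", "Durian"])

# The three spellings of the same pattern
percent_style = "(%s)" % "|".join(fruits)
format_style = "({})".format("|".join(fruits))
concat_style = "(" + "|".join(fruits) + ")"
print(percent_style)

# Two messy values taken from df1['Fruits']
messy = pd.Series(["Pineapple, [Y*]", "PTC*, Apple"])
extracted = messy.str.extract(percent_style, expand=False)
print(extracted.tolist())

# Without the parentheses the pattern has no capture group,
# and str.extract raises ValueError instead of guessing
no_group_rejected = False
try:
    messy.str.extract("|".join(fruits))
except ValueError as exc:
    no_group_rejected = True
    print("ValueError:", exc)
```

All three strings compare equal, so the choice between them is purely stylistic; the parentheses are the part that matters to str.extract.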