How do I change the values in a pandas column that are selected by a regex?
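I'm cleaning up data for a personal project and standardizing a large number of categories. The low-hanging fruit seems to be the names that are similar enough, for example: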
'SUSPECIOUS CRAFT', 'SUSPECTED MILITANTS', 'SUSPECTED PIRATE',
'SUSPECTED TERRORISTS', 'SUSPICICIOUS APPROACH', 'SUSPICIOPUS APPROACH',
'SUSPICIOUS APPRAOCH', 'SUSPICIOUS APPROACH', 'SUSPICIOUS BOAT',
'SUSPICIOUS BOATS', 'SUSPICIOUS CRAFT', 'SUSPICIOUS CRAFTS',
'SUSPICIOUS VESSEL', 'SUSPICOUS APPROACH', 'SUSPICUIOUS APPROACH',
'SUSPIPICIOUS APPROACH', 'SUSPISIOUC CRAFT', 'SUS[ICIOUS APPROACH'
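There are others as well, including lowercase and mixed-case variants, so I'm using a regex. I can select what I'm looking for (note that I added row #8619 to the output):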
df[df["hostility"].str.contains(r"^Su(s|c)(p|])(i|e)", regex=True, case=False)]
       year  hostility                           victim
878    2018  Suspicious Approach                 Tug
7060   2001  SUSPICIOUS CRAFT                    MERCHANT VESSEL
7068   2001  Suspicious group onboard a trawler  YACHT
7723   2000  SUSPICIOUS CRAFT                    MERCHANT VESSEL
8619   2004  Protest                             tug
10001  2003  SUSPICIOUS CRAFT                    MERCHANT VESSEL
But I'm stuck on replacing all of the variants so that they look like this:
       year  hostility            victim
878    2018  Suspicious Approach  Tug
7060   2001  Suspicious Approach  MERCHANT VESSEL
7068   2001  Suspicious Approach  YACHT
7723   2000  Suspicious Approach  MERCHANT VESSEL
8619   2004  Protest              tug
10001  2003  Suspicious Approach  MERCHANT VESSEL
What is the most efficient way to do this?
You can use the vectorized Series.str.replace method directly to replace the whole string that starts with the pattern of your choice. Note that it is not efficient to use groups with single-character alternatives; regex offers character classes for that. For example, do not use (c|d), use [cd] instead, which is much more efficient (see "Why is a character class faster than alternation?").
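As a quick sanity check, the sketch below uses Python's built-in re module to confirm that an alternation group and the corresponding character class select the same strings. The sample strings are hypothetical, loosely based on the values in the question, and the variable names are only for illustration.

```python
import re

# Hypothetical sample strings, loosely based on the values in the question.
samples = ["SUSPICIOUS CRAFT", "Sucpicious boat", "Protest", "Sunk"]

alternation = re.compile(r"^Su(s|c)", re.IGNORECASE)  # single-character alternation in a group
char_class = re.compile(r"^Su[sc]", re.IGNORECASE)    # equivalent character class

alt_hits = [s for s in samples if alternation.search(s)]
cls_hits = [s for s in samples if char_class.search(s)]

print(alt_hits == cls_hits)  # True: both patterns select the same strings
print(cls_hits)              # ['SUSPICIOUS CRAFT', 'Sucpicious boat']

# The character class also avoids creating a capturing group.
print(alternation.groups, char_class.groups)  # 1 0
```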
So, you can use
df['hostility'] = df['hostility'].str.replace(r'(?i)^Su[sc][][p][ie].*', 'Suspicious Approach', regex=True)
Note that the regex is case-insensitive because of the (?i) inline modifier, and regex=True makes the method treat the search argument as a regular expression.
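For comparison, here is a minimal sketch of two other ways to request case-insensitive matching through the case and flags parameters of Series.str.replace. It assumes a small hypothetical Series rather than the question's full column, and it only illustrates that the three spellings give the same result on this sample.

```python
import re
import pandas as pd

# Hypothetical three-value sample; the real column comes from the question's df.
s = pd.Series(["SUSPICIOUS CRAFT", "suspicous approach", "Protest"])
pattern = r"^Su[sc][][p][ie].*"
target = "Suspicious Approach"

inline = s.str.replace(r"(?i)" + pattern, target, regex=True)           # inline modifier
by_case = s.str.replace(pattern, target, case=False, regex=True)         # case=False
by_flag = s.str.replace(pattern, target, flags=re.IGNORECASE, regex=True)  # re flag

print(inline.tolist() == by_case.tolist() == by_flag.tolist())  # True
print(inline.tolist())  # ['Suspicious Approach', 'Suspicious Approach', 'Protest']
```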
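Details:
(?i) - turns on the case-insensitive modifier
^ - start of string
Su - the Su string
[sc] - s or c
[][p] - a ], [ or p character (note that you do not have to escape [ inside a character class, nor ] if it is at the start of the character class)
[ie] - i or e
.* - the rest of the line (if you need to match line breaks as well, replace (?i) with (?si) so that . also matches line break characters).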
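To tie the pattern breakdown together, the following sketch rebuilds the six rows shown in the question and applies the replacement. It also shows a mask-based variant using Series.str.contains with .loc, which is an alternative added here for illustration and not part of the original answer. The names replaced, mask and masked are hypothetical.

```python
import pandas as pd

# Rows copied from the question's sample output; the index holds the row labels shown there.
df = pd.DataFrame(
    {
        "year": [2018, 2001, 2001, 2000, 2004, 2003],
        "hostility": [
            "Suspicious Approach",
            "SUSPICIOUS CRAFT",
            "Suspicious group onboard a trawler",
            "SUSPICIOUS CRAFT",
            "Protest",
            "SUSPICIOUS CRAFT",
        ],
        "victim": ["Tug", "MERCHANT VESSEL", "YACHT", "MERCHANT VESSEL", "tug", "MERCHANT VESSEL"],
    },
    index=[878, 7060, 7068, 7723, 8619, 10001],
)

pattern = r"(?i)^Su[sc][][p][ie]"

# Variant 1: rewrite each matching string in full with one vectorized call.
replaced = df["hostility"].str.replace(pattern + ".*", "Suspicious Approach", regex=True)

# Variant 2 (assumption, not from the answer): build a boolean mask, then assign via .loc.
# na=False treats missing values as non-matching.
mask = df["hostility"].str.contains(pattern, regex=True, na=False)
masked = df.copy()
masked.loc[mask, "hostility"] = "Suspicious Approach"

print(replaced.tolist() == masked["hostility"].tolist())  # True
print(masked)
```

On this sample both variants leave the Protest row untouched and normalize the other five rows.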