How do I change the values in a pandas column that are selected by a regex?

Question

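I am cleaning data for a personal project and standardizing a large number of categories. The apparent low-hanging fruit are categories whose names are similar enough, for example: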

'SUSPECIOUS CRAFT', 'SUSPECTED MILITANTS', 'SUSPECTED PIRATE','SUSPECTED TERRORISTS', 'SUSPICICIOUS APPROACH', 'SUSPICIOPUS APPROACH', 'SUSPICIOUS APPRAOCH','SUSPICIOUS APPROACH', 'SUSPICIOUS BOAT', 'SUSPICIOUS BOATS', 'SUSPICIOUS CRAFT', 'SUSPICIOUS CRAFTS', 'SUSPICIOUS VESSEL', 'SUSPICOUS APPROACH', 'SUSPICUIOUS APPROACH','SUSPIPICIOUS APPROACH', 'SUSPISIOUC CRAFT', 'SUS[ICIOUS APPROACH'

There are others as well, including lowercase and mixed-case variants, so I am using a regex. I can select what I am looking for (note that I added #8619):

df[df["hostility"].str.contains(r"^Su(s|c)(p|])(i|e)", regex=True, case=False)]

        year    hostility                victim
878     2018    Suspicious Approach     Tug
7060    2001    SUSPICIOUS CRAFT        MERCHANT VESSEL
7068    2001    Suspicious group onboard a trawler      YACHT
7723    2000    SUSPICIOUS CRAFT        MERCHANT VESSEL
8619    2004    Protest                 tug 
10001   2003    SUSPICIOUS CRAFT        MERCHANT VESSEL

But I am stuck on replacing all the variants so that they look like this:

        year    hostility               victim
878     2018    Suspicious Approach     Tug
7060    2001    Suspicious Approach     MERCHANT VESSEL
7068    2001    Suspicious Approach     YACHT
7723    2000    Suspicious Approach     MERCHANT VESSEL
8619    2004    Protest                 tug 
10001   2003    Suspicious Approach     MERCHANT VESSEL

What is the most efficient way to do this?

Answer 1

You can use the vectorized Series.str.replace method directly to replace the whole string that starts with the pattern of your choice. Note that groups with single-character alternatives are not efficient; regex offers character classes for that. For example, do not use (c|d); use [cd] instead, which is much more efficient (see Why is a character class faster than alternation?).
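As a quick sanity check, the sketch below compares the question's alternation pattern with a character-class spelling of it. The list of sample strings is copied from the question plus one non-matching value; the variable names are made up for illustration. It only checks that both spellings select the same strings. It does not measure speed, and any speed difference depends on the regex engine.

```python
import re

# Pattern from the question (single-character alternations in groups).
alternation = re.compile(r"^Su(s|c)(p|])(i|e)", re.IGNORECASE)
# Same pattern written with character classes; "]" is literal when it comes first.
char_class = re.compile(r"^Su[sc][]p][ie]", re.IGNORECASE)

# A few category names from the question, plus one that should not match.
samples = [
    "SUSPECIOUS CRAFT",
    "SUSPECTED PIRATE",
    "SUSPICIOUS APPROACH",
    "Suspicious group onboard a trawler",
    "Protest",
]

# Record, for each sample, whether each spelling finds a match.
results = [
    (s, bool(alternation.search(s)), bool(char_class.search(s))) for s in samples
]
for name, by_alternation, by_class in results:
    print(f"{name!r}: alternation={by_alternation}, class={by_class}")
```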

So, you can use

df['hostility'] = df['hostility'].str.replace(r'(?i)^Su[sc][][p][ie].*', 'Suspicious Approach', regex=True)

Note that the (?i) inline modifier makes the regex case-insensitive, and regex=True makes the method treat the search argument as a regular expression.
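To see the replacement end to end, here is a minimal sketch on a small frame. The rows are hypothetical, modeled on the output shown in the question. The second half shows an alternative that is not part of this answer: build a boolean mask with str.contains, as in the question, and assign through .loc.

```python
import pandas as pd

# Hypothetical sample rows modeled on the question's output.
df = pd.DataFrame(
    {
        "year": [2018, 2001, 2001, 2004],
        "hostility": [
            "Suspicious Approach",
            "SUSPICIOUS CRAFT",
            "Suspicious group onboard a trawler",
            "Protest",
        ],
        "victim": ["Tug", "MERCHANT VESSEL", "YACHT", "tug"],
    }
)

pattern = r"(?i)^Su[sc][][p][ie].*"

# Approach from the answer: rewrite every matching string in one vectorized call.
replaced = df["hostility"].str.replace(pattern, "Suspicious Approach", regex=True)

# Alternative (an assumption, not from the answer): boolean mask plus .loc assignment.
mask = df["hostility"].str.contains(r"^Su[sc][][p][ie]", regex=True, case=False)
masked = df.copy()
masked.loc[mask, "hostility"] = "Suspicious Approach"

df["hostility"] = replaced
print(df)
```

Both variants leave the "Protest" row untouched, because the pattern is anchored to the start of the string.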

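Details:

(?i) - case-insensitive inline modifier
^ - start of string
Su - the Su string
[sc] - s or c
[][p] - a ], [ or p character (note that you do not have to escape [ inside a character class, nor ] if it is at the start of the character class)
[ie] - i or e
.* - the rest of the line (if you need to match line breaks, replace (?i) with (?si) so that . also matches line breaks).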
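The pieces listed above can be tried out with Python's re module alone. This is a small sketch: the sample strings come from the question, while the two-line string is a made-up value used only to show the effect of the s flag.

```python
import re

pattern = re.compile(r"(?i)^Su[sc][][p][ie].*")

# Category names from the question; "SUS[ICIOUS APPROACH" exercises the "[" in [][p].
samples = ["SUSPECIOUS CRAFT", "SUS[ICIOUS APPROACH", "Suspicious Approach", "Protest"]
cleaned = [pattern.sub("Suspicious Approach", s) for s in samples]
print(cleaned)

# Hypothetical two-line value: without the s flag, .* stops at the line break.
two_lines = "Suspicious craft\nsecond line"
without_s = re.sub(r"(?i)^Su[sc][][p][ie].*", "Suspicious Approach", two_lines)
with_s = re.sub(r"(?si)^Su[sc][][p][ie].*", "Suspicious Approach", two_lines)
print(repr(without_s))
print(repr(with_s))
```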

regex

dataframe

pandas