How do I remove the first instance of a regex capture group from strings in a pandas Series?

Question

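I want to remove the first instance of a regex capture group from a Series of strings in pandas, the opposite of pandas.Series.str.extract. Extraction works with the code below, where I extract the first word that starts with a double underscore: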

import pandas as pd

regex = r"^(?:.*?)(__\S+)"

t = pd.Series(['no match', '__match at start', 'match __in_the middle', 'the end __matches', 'string __with two __captures'])
t.str.extract(regex)

           0
0        NaN
1    __match
2   __in_the
3  __matches
4     __with
dtype: object

But if I use pandas.Series.str.replace with an empty string, it replaces the non-capturing group as well:

t.str.replace(regex, '', regex=True)

0        no match
1        at start
2          middle
3
4  two __captures
dtype: object

How can I get the following output?

0                no match
1                at start
2           match  middle
3                the end
4  string  two __captures
dtype: object

Answer 1

You can use

t.str.replace(r'^(.*?)__\S+', r'\1', regex=True)

or:

t.str.replace(r'__\S+', '', n=1, regex=True)

See the Series.str.replace documentation:

Replace each occurrence of pattern/regex in the Series/Index.
...
n : int, default -1 (all)
Number of replacements to make from start.

Output:

0         no match
1         at start
2    match  middle
3         the end 
dtype: object

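Details:

^ - start of string
(.*?) - capturing group 1: any zero or more characters other than line break characters, as few as possible
__ - two underscores
\S+ - one or more non-whitespace characters.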

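(?:...) is a non-capturing group; the text it matches is still consumed (that is, it is added to the match value and is therefore affected by a regex replacement). "Non-capturing" does not mean "non-matching".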
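A minimal sketch of that point, using only Python's re module and the pattern from the question. The variable names are illustrative; the example shows that the overall match still contains the text matched by the non-capturing group, which is why replacing the whole match removes it too.

```python
import re

# Pattern from the question: lazy non-capturing prefix, then the captured word
pattern = re.compile(r"^(?:.*?)(__\S+)")
m = pattern.search("match __in_the middle")

whole_match = m.group(0)  # everything the pattern consumed, prefix included
captured = m.group(1)     # only the text inside the capturing group

print(repr(whole_match))  # 'match __in_the'
print(repr(captured))     # '__in_the'
```

str.replace operates on the whole match (group 0), so the prefix disappears unless it is put back in the replacement.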

The \1 in the replacement pattern is a replacement backreference; it refers to the text captured in capturing group 1.

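The point is that we capture the text we want to keep in a group and restore it with a backreference in the replacement pattern. The text we want to replace or remove is only matched, not captured.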
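Both approaches can be tried side by side on the Series from the question. This is a sketch assuming a pandas version in which str.replace accepts the regex and n keyword arguments; the names with_backref and with_limit are made up for illustration.

```python
import pandas as pd

t = pd.Series([
    'no match',
    '__match at start',
    'match __in_the middle',
    'the end __matches',
    'string __with two __captures',
])

# Approach 1: capture the prefix and restore it with the \1 backreference
with_backref = t.str.replace(r'^(.*?)__\S+', r'\1', regex=True)

# Approach 2: match only the unwanted word and stop after one replacement
with_limit = t.str.replace(r'__\S+', '', n=1, regex=True)

print(with_backref.tolist())
print(with_limit.tolist())
```

Both calls leave the second `__captures` in the last string untouched, because the first pattern is anchored at the start of the string and the second is limited by n=1.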

Answer 2

According to the pandas docs, you can specify the number of replacements to make from the start. If it works like Python's re.sub, you can take @Wiktor Stribiżew's code with one modification:

t.str.replace(r'__\S+', '', 1, regex=True)

If that is not how it works in pandas, then:

import re

# Remove only the first "__word" from each string and keep the results
result = t.apply(lambda s: re.sub(r"__\S+", "", s, count=1))
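To check the assumption that n in str.replace behaves like count in re.sub, the two can be compared on a small sample. This is only a sketch; the shortened Series and the names via_pandas and via_re are illustrative.

```python
import re
import pandas as pd

t = pd.Series(['no match', 'the end __matches', 'string __with two __captures'])

# pandas: limit to one replacement per string with n=1
via_pandas = t.str.replace(r'__\S+', '', n=1, regex=True)

# plain re: limit to one replacement per string with count=1
via_re = t.map(lambda s: re.sub(r'__\S+', '', s, count=1))

same = via_pandas.tolist() == via_re.tolist()
print(same)
```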
