How do I use regex to remove the substring before a pipe in a pandas dataframe?

Question

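In the cna pandas dataframe, for every value in the Hugo_Symbol column, if there is a pipe (|) followed by "ENSG*", remove everything before the pipe, along with the pipe itself.

My code: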

import re
cna["Hugo_Symbol"] = [re.sub(r"^\|.*", "", str(x)) for x in cna["Hugo_Symbol"]]

Current cna dataframe

	Hugo_Symbol	TCGA_1	TCGA_2	TCGA_3
0	GENEID|ENSG12345	0.1	0.2	0.3
1	GENEA	0.4	0.5	0.6
2	ANOTHERGENEID|ENSG6789	0.7	0.8	0.9
3	GENEB	1.0	1.1	1.2

Expected output

	Hugo_Symbol	TCGA_1	TCGA_2	TCGA_3
0	ENSG12345	0.1	0.2	0.3
1	GENEA	0.4	0.5	0.6
2	ENSG6789	0.7	0.8	0.9
3	GENEB	1.0	1.1	1.2

Answer 1

You can use a simple regex with str.replace:

cna['Hugo_Symbol'] = cna['Hugo_Symbol'].str.replace(r'^(.*\|)', '', regex=True)

Output:

  Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0   ENSG12345     0.1     0.2     0.3
1       GENEA     0.4     0.5     0.6
2    ENSG6789     0.7     0.8     0.9
3       GENEB     1.0     1.1     1.2

regex demo
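
The pattern above strips the prefix from any value that contains a pipe. If the prefix should go only when the pipe is followed by ENSG, as the question words it, a lookahead is one way to express that. This is a minimal sketch, not part of the original answer; the "FOO|BAR" row is a hypothetical value added to show a pipe that is left alone.

```python
import pandas as pd

# Sample frame based on the question's data; "FOO|BAR" is a hypothetical extra row.
cna = pd.DataFrame({
    "Hugo_Symbol": ["GENEID|ENSG12345", "GENEA", "ANOTHERGENEID|ENSG6789", "GENEB", "FOO|BAR"],
    "TCGA_1": [0.1, 0.4, 0.7, 1.0, 1.3],
})

# (?=ENSG) is a lookahead: the pipe must be followed by "ENSG",
# but "ENSG" itself is not consumed, so it survives the replacement.
cleaned = cna["Hugo_Symbol"].str.replace(r"^.*\|(?=ENSG)", "", regex=True)

print(cleaned.tolist())
# ['ENSG12345', 'GENEA', 'ENSG6789', 'GENEB', 'FOO|BAR']
```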

Answer 2

You need to use Series.str.replace:

cna["Hugo_Symbol"] = cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True)

Details:

^ - start of string
[^|]* - zero or more characters other than |
\| - a literal | character.

See the regex demo.
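
To see how the pieces behave outside pandas, here is a small sketch with the standard re module. It compares the pattern from the question, the pattern from this answer, and the greedy `^.*\|` variant. The value "A|B|ENSG1" is a hypothetical input with two pipes, included only to show where the two working patterns differ.

```python
import re

values = ["GENEID|ENSG12345", "GENEA", "A|B|ENSG1"]  # the last value is hypothetical

# The question's pattern requires the pipe to be the first character,
# so it matches none of these values and leaves them unchanged.
attempt = [re.sub(r"^\|.*", "", v) for v in values]

# [^|]* cannot cross a pipe, so this removes everything up to the FIRST pipe.
first_pipe = [re.sub(r"^[^|]*\|", "", v) for v in values]

# .* is greedy, so this removes everything up to the LAST pipe.
last_pipe = [re.sub(r"^.*\|", "", v) for v in values]

print(attempt)     # ['GENEID|ENSG12345', 'GENEA', 'A|B|ENSG1']
print(first_pipe)  # ['ENSG12345', 'GENEA', 'B|ENSG1']
print(last_pipe)   # ['ENSG12345', 'GENEA', 'ENSG1']
```

For values with a single pipe, such as those in the question, both working patterns give the same result.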

Pandas test:

import pandas as pd
cna = pd.DataFrame({'Hugo_Symbol':['GENEID|ENSG12345', 'GENEA'], 'TCGA_1':[0.1, 0.4]})
cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True)
0    ENSG12345
1        GENEA
Name: Hugo_Symbol, dtype: object

Notes on regex=True:

According to the Pandas 1.2.0 release notes:

The default value of regex for Series.str.replace() will change from True to False in a future release. In addition, single character regular expressions will not be treated as literal strings when regex=True is set (GH24804).
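
As a rough illustration of why passing regex explicitly matters, the sketch below runs the same call with both settings. The variable names are made up for this example.

```python
import pandas as pd

s = pd.Series(["GENEID|ENSG12345", "GENEA"])

# regex=True: the pattern is interpreted as a regular expression.
as_regex = s.str.replace(r"^[^|]*\|", "", regex=True)

# regex=False: the pattern is searched for as a literal string.
# No value contains the literal text "^[^|]*\|", so nothing changes.
as_literal = s.str.replace(r"^[^|]*\|", "", regex=False)

# A literal replacement is still handy for plain characters such as the pipe.
no_pipe = s.str.replace("|", "", regex=False)

print(as_regex.tolist())    # ['ENSG12345', 'GENEA']
print(as_literal.tolist())  # ['GENEID|ENSG12345', 'GENEA']
print(no_pipe.tolist())     # ['GENEIDENSG12345', 'GENEA']
```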

regex

pandas