How do I use regex to remove the substring before a pipe in a pandas DataFrame?

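In the pandas DataFrame cna, for every value in the Hugo_Symbol column, if there is a pipe (|) followed by "ENSG*", I want to remove everything up to and including the pipe.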

My code:

import re
cna["Hugo_Symbol"] = [re.sub(r"^\|.*", "", str(x)) for x in cna["Hugo_Symbol"]]

Current cna DataFrame:

              Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0        GENEID|ENSG12345     0.1     0.2     0.3
1                   GENEA     0.4     0.5     0.6
2  ANOTHERGENEID|ENSG6789     0.7     0.8     0.9
3                   GENEB     1.0     1.1     1.2

Desired output:

  Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0   ENSG12345     0.1     0.2     0.3
1       GENEA     0.4     0.5     0.6
2    ENSG6789     0.7     0.8     0.9
3       GENEB     1.0     1.1     1.2

You can use str.replace with a simple regex:

cna['Hugo_Symbol'] = cna['Hugo_Symbol'].str.replace(r'^(.*\|)', '', regex=True)

Output:

  Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0   ENSG12345     0.1     0.2     0.3
1       GENEA     0.4     0.5     0.6
2    ENSG6789     0.7     0.8     0.9
3       GENEB     1.0     1.1     1.2
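
The pattern above strips everything up to the last pipe, whatever comes after it. If the removal should only happen when the pipe is followed by ENSG, as the question describes, one option is to capture the ENSG part and substitute it back. This is only a sketch: the value 'OTHER|LABEL' is hypothetical and was added to show a pipe that should be left alone.

```python
import pandas as pd

cna = pd.DataFrame({
    # 'OTHER|LABEL' is a hypothetical value, not taken from the question
    'Hugo_Symbol': ['GENEID|ENSG12345', 'GENEA', 'OTHER|LABEL'],
    'TCGA_1': [0.1, 0.4, 0.7],
})

# Match the whole value only when a pipe is directly followed by "ENSG",
# then keep just the captured identifier (group 1).
cleaned = cna['Hugo_Symbol'].str.replace(r'^.*\|(ENSG.*)$', r'\1', regex=True)

print(cleaned.tolist())
```

Values without a pipe, or with a pipe that is not followed by ENSG, do not match the pattern and are returned unchanged.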

You need to use Series.str.replace:

cna["Hugo_Symbol"] = cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True)

Details:

  • ^ - start of string
  • [^|]* - zero or more characters other than |
  • \| - a literal | character.
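
To see how this pattern differs from the greedy ^.*\| form, here is a small sketch using only the standard re module. The value 'A|B|ENSG1' is hypothetical and is included to show what happens when a value contains more than one pipe.

```python
import re

# Start of string, any run of non-pipe characters, then one literal pipe.
FIRST_PIPE = re.compile(r'^[^|]*\|')
# Greedy variant: start of string, anything, then the last pipe in the value.
LAST_PIPE = re.compile(r'^.*\|')

# The first two values come from the question; 'A|B|ENSG1' is hypothetical.
samples = ['GENEID|ENSG12345', 'GENEA', 'A|B|ENSG1']

up_to_first = [FIRST_PIPE.sub('', s) for s in samples]
up_to_last = [LAST_PIPE.sub('', s) for s in samples]

print(up_to_first)  # ['ENSG12345', 'GENEA', 'B|ENSG1']
print(up_to_last)   # ['ENSG12345', 'GENEA', 'ENSG1']
```

With a single pipe per value, as in the question's data, both patterns give the same result.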

Pandas test:

import pandas as pd
cna = pd.DataFrame({'Hugo_Symbol':['GENEID|ENSG12345', 'GENEA'], 'TCGA_1':[0.1, 0.4]})
cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True)
0    ENSG12345
1        GENEA
Name: Hugo_Symbol, dtype: object

A note on regex=True:

According to the Pandas 1.2.0 release notes:

The default value of regex for Series.str.replace() will change from True to False in a future release. In addition, single character regular expressions will not be treated as literal strings when regex=True is set (GH24804).
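
In practice this means it is safer to pass regex explicitly. A minimal sketch of the difference, using the pattern from this answer: with regex=False the pattern is searched for as a literal string, so nothing in the sample data matches.

```python
import pandas as pd

s = pd.Series(['GENEID|ENSG12345', 'GENEA'], name='Hugo_Symbol')
pattern = r'^[^|]*\|'

# Interpreted as a regular expression: the prefix up to the pipe is removed.
as_regex = s.str.replace(pattern, '', regex=True)

# Interpreted literally: the text "^[^|]*\|" does not occur in any value,
# so the Series comes back unchanged.
as_literal = s.str.replace(pattern, '', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```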