Pandas: Get the substring between a start and an end character

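I am trying to get the substring between different start and end characters. I have tried several regex notations and am close to the output I need, but it is not quite right. What should I do to fix this?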

Data (sample2.csv):

ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling

Code:

import pandas as pd
df = pd.read_csv('sample2.csv')
print(df)

Attempt:

df['TEST'].astype(str).str.extract('(1#.*(?=#))')

Output from the code above. It does not pick up '1#USA', which sits at the end of the line:

1#London4#Harry Potter#5Rowling#
1#England
NaN
NaN
NaN
1#China

Desired output:

1#London
1#England
1#USA
NaN
NaN
1#China

You can try:

# capture `1#` plus all following characters that are
# neither `#` nor digits (raw string so `\d` is not treated as a string escape)
df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)

Output:

0     1#London
1    1#England
2        1#USA
3          NaN
4          NaN
5      1#China
Name: TEST, dtype: object
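
For a check that does not depend on `sample2.csv`, here is a minimal sketch that rebuilds the question's data inline with `io.StringIO` and runs both the original attempt and the pattern above. The names `csv_text`, `attempt`, and `result` are only for illustration. It also suggests why the attempt misses `1#USA`: the lookahead `(?=#)` requires a `#` after the match, and the greedy `.*` runs up to the last `#` it can find.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the snippet runs on its own.
csv_text = """ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
"""
df = pd.read_csv(io.StringIO(csv_text))

# Original attempt: greedy `.*` plus a lookahead that requires a trailing `#`.
attempt = df['TEST'].str.extract(r'(1#.*(?=#))', expand=False)

# Suggested fix: `1#`, then everything up to the next `#` or digit.
result = df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)

# Show both columns side by side for comparison.
print(pd.DataFrame({'attempt': attempt, 'result': result}))
```

Because the negated class stops at the first `#` or digit, it needs no terminating character, so a value at the very end of the string is still captured.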

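Alternatively, you can match only ASCII letters after `1#` (with the default `expand=True`, the result is a one-column DataFrame):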

>>> df.TEST.str.extract("(1#[a-zA-Z]*)")
           0
0   1#London
1  1#England
2      1#USA
3        NaN
4        NaN
5    1#China
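
The two patterns agree on the sample data but can differ on other inputs. The sketch below uses the standard `re` module, on the assumption that `str.extract` returns the first match much like `re.search` does. The input strings are made up for illustration and do not come from the question. It covers two cases: a name containing a space, and a `1#` followed by no letters at all.

```python
import re

NEGATED = re.compile(r'1#[^#\d]+')    # pattern from the first answer
LETTERS = re.compile(r'1#[a-zA-Z]*')  # pattern from the second answer


def first_match(pattern, text):
    """Return the first match of `pattern` in `text`, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Hypothetical inputs, not taken from the question's data.
samples = [
    '4#Harry Potter1#New York#5Rowling',  # name with a space
    '4#Harry Potter1##5Rowling',          # `1#` with nothing after it
]

for s in samples:
    print(repr(first_match(NEGATED, s)), repr(first_match(LETTERS, s)))
```

If names may contain spaces or non-ASCII letters, the negated class is the more forgiving choice. Changing `*` to `+` in the letters-only pattern would make it return no match instead of a bare `1#`.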