Pandas: Get substring between start and end of the characters

Question

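I am trying to extract the substring that lies between a given start character sequence and an end character. I have tried several regex patterns and I am close to the output I need, but it is not quite right. What should I change to fix this?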

Data (CSV):

ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling

Code:

import pandas as pd
df = pd.read_csv('sample2.csv')
print(df)

Attempt:

df['TEST'].astype(str).str.extract('(1#.*(?=#))')

Output from the code above (it does not pick up '1#USA' when that value sits at the end of the line):

1#London4#Harry Potter#5Rowling#
1#England
NaN
NaN
NaN
1#China
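
The behaviour above can be reproduced outside pandas. The following is a minimal sketch using the standard `re` module; the helper name `first_group` is hypothetical and used only for illustration. It suggests two causes: `.*` is greedy, so it runs to the last `#` on the line, and the lookahead `(?=#)` requires a `#` after the match, which '1#USA' at the end of a line does not have.

```python
import re

# Same pattern as in the attempt: greedy `.*` followed by a lookahead for `#`.
pattern = re.compile(r'(1#.*(?=#))')

def first_group(text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

rows = [
    '1#London4#Harry Potter#5Rowling##',  # greedy match runs up to the last '#'
    '4#Harry Potter#5Rowling##1#USA',     # no '#' after '1#USA', so no match
]
results = [first_group(row) for row in rows]
print(results)
```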

Desired output:

1#London
1#England
1#USA
NaN
NaN
1#China

Answer 1

You can try:

# capture all characters that are neither `#` nor digits
# following 1#
df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)

Output:

0     1#London
1    1#England
2        1#USA
3          NaN
4          NaN
5      1#China
Name: TEST, dtype: object
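
For anyone who wants to run this without the CSV file, here is a self-contained sketch. It assumes the sample data from the question is loaded from an in-memory string via `io.StringIO` instead of 'sample2.csv'; the variable names `csv_text` and `extracted` are illustrative only.

```python
import io

import pandas as pd

# Sample data copied from the question, read from memory instead of a file.
csv_text = """ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
"""
df = pd.read_csv(io.StringIO(csv_text))

# After the literal "1#", capture one or more characters that are
# neither '#' nor a digit. Rows without a match (or with NaN) yield NaN.
extracted = df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)
print(extracted)
```

Because the character class excludes both `#` and digits, the match stops at whichever comes first, so no trailing `#` is needed and '1#USA' at the end of a line is captured.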

Answer 2

You can do it this way:

>>> df.TEST.str.extract("(1#[a-zA-Z]*)")
           0
0   1#London
1  1#England
2      1#USA
3        NaN
4        NaN
5    1#China
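
The two answers agree on the sample data but may differ on other inputs. The sketch below uses two hypothetical values that are not in the question (a name containing a space, and a bare '1#' at the end of a string) to illustrate the difference: `[a-zA-Z]*` stops at the first non-letter and also matches zero letters, while `[^#\d]+` keeps spaces and requires at least one character.

```python
import pandas as pd

# Hypothetical inputs, not taken from the question's data.
s = pd.Series(['1#New York#5Rowling', '4#Harry Potter1#'])

# Answer 1 style: anything except '#' and digits, at least one character.
negated = s.str.extract(r'(1#[^#\d]+)', expand=False)

# Answer 2 style: ASCII letters only, zero or more.
letters = s.str.extract(r'(1#[a-zA-Z]*)', expand=False)

print(pd.DataFrame({'negated': negated, 'letters': letters}))
```

Which pattern is preferable depends on what the values after '1#' can contain; if they may include spaces or non-ASCII letters, the negated character class is likely the safer choice.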
