Pandas: Get the substring between a start and an end character
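I am trying to get the substring between a start marker and an end marker. I have tried several regex patterns and I am close to the output I need, but it is not quite right. How can I fix this?
Data (CSV):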
ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
Code:
import pandas as pd
df = pd.read_csv('sample2.csv')
print(df)
Attempt:
df['TEST'].astype(str).str.extract('(1#.*(?=#))')
Output from the code above (it does not pick up '1#USA' at the end of the line):
1#London4#Harry Potter#5Rowling#
1#England
NaN
NaN
NaN
1#China
Desired output:
1#London
1#England
1#USA
NaN
NaN
1#China
You can try:
# after the literal `1#`, capture one or more characters
# that are neither `#` nor digits
# (raw string so that `\d` reaches the regex engine unchanged)
df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)
Output:
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
Name: TEST, dtype: object
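For anyone who wants to reproduce this without the CSV file, here is a minimal sketch that loads the question's sample rows from an inline string and runs the asker's greedy pattern next to the negated character class. The names `csv_text`, `greedy`, and `bounded` are made up for this illustration, and `io.StringIO` simply stands in for `sample2.csv`.

```python
import io

import pandas as pd

# Inline copy of the sample data from the question (stands in for sample2.csv).
csv_text = """ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
"""
df = pd.read_csv(io.StringIO(csv_text))

# Pattern from the question: `.*` is greedy and the lookahead demands a
# trailing `#`, so it over-matches on row 0 and finds nothing on row 2.
greedy = df['TEST'].str.extract(r'(1#.*(?=#))', expand=False)

# Pattern from this answer: stop at the first `#` or digit after `1#`.
bounded = df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)

print(pd.DataFrame({'greedy': greedy, 'bounded': bounded}))
```

The negated class does not need anything to follow the match, which is why `1#USA` at the end of the string is picked up as well.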
You could do this:
>>> df.TEST.str.extract("(1#[a-zA-Z]*)")
0
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
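The two suggested patterns agree on the sample data but can differ on other inputs. The sketch below uses the standard `re` module to compare them, plus a `+` variant of the letters-only pattern. The second and third sample strings are hypothetical and do not come from the question, and `first_match` is a helper invented for this illustration.

```python
import re

# The letters-only pattern from this answer, a `+` variant of it,
# and the negated-class pattern from the other answer.
letters_star = re.compile(r'1#[a-zA-Z]*')
letters_plus = re.compile(r'1#[a-zA-Z]+')
not_hash_digit = re.compile(r'1#[^#\d]+')


def first_match(pattern, text):
    """Return the first match of `pattern` in `text`, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


samples = [
    '4#Harry Potter1#China#5Rowling',  # row from the question
    '1#New York#5Rowling',             # hypothetical: value contains a space
    '1#4#Harry Potter',                # hypothetical: nothing follows `1#`
]

for text in samples:
    print(text)
    for name, pattern in [('letters*', letters_star),
                          ('letters+', letters_plus),
                          ('not # or digit', not_hash_digit)]:
        print('   ', name, '->', first_match(pattern, text))
```

If values may contain spaces, the letters-only class cuts them off at the first space, and with `*` a bare `1#` still counts as a match. Which behaviour is right depends on the real data.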