Split string elements based on multiple word delimiters in Python

Question

Given a list of strings as follows:

text = ["Nanjing Office and Retail Market Overview 2019 Second Quarter", 
"Xi'an Office and Retail Market Overview 2020 Q1", 
"Suzhou office and retail overview 2019 fourth quarter DTZ Research", 
"marketbeat Shanghai office Second quarter of 2020 Future New Grade A office supply in non-core business districts One-year trend Although the epidemic in Shanghai has been controlled in a timely and effective manner, the negative impact of the epidemic on Shanghai's commercial real estate The impact continues.", 
"Shanghai office September 2019 marketbeats 302.7 -0.4% 12.9% rent rent growth vacancy"]

I want to split each string element on any of several words (Market, quarter, marketbeats) and keep only the first part, including the delimiter word:

for string in text:
    # string = string.lower()
    split_str = re.split(r"[Market|quarter|marketbeats]", string)
    print(split_str)

Output:

['n', 'njing offic', ' ', 'nd ', '', '', '', 'il ', '', '', '', '', '', ' ov', '', 'vi', 'w 2019 ', '', 'cond ', '', '', '', '', '', '', '']
["xi'", 'n offic', ' ', 'nd ', '', '', '', 'il ', '', '', '', '', '', ' ov', '', 'vi', 'w 2020 ', '1'],
...

But the expected result is:

"Nanjing Office and Retail Market", 
"Xi'an Office and Retail Market", 
"Suzhou office and retail overview 2019 fourth quarter", 
"marketbeat Shanghai office Second quarter", 
"Shanghai office September 2019 marketbeats"

How can I get the correct result in Python? Thanks.

Answer 1

You can use re.findall here:

import re
text = ["Nanjing Office and Retail Market Overview 2019 Second Quarter", "Xi'an Office and Retail Market Overview 2020 Q1", "Suzhou office and retail overview 2019 fourth quarter DTZ Research", "marketbeat Shanghai office Second quarter of 2020 Future New Grade A office supply in non-core business districts One-year trend Although the epidemic in Shanghai has been controlled in a timely and effective manner, the negative impact of the epidemic on Shanghai's commercial real estate The impact continues.", "Shanghai office September 2019 marketbeats 302.7 -0.4% 12.9% rent rent growth vacancy"]
# Lazily match from the start up to the first whole delimiter word (or the end of the string)
output = [re.findall(r'^.*?\b(?:Market|quarter|marketbeats|$)\b', x)[0] for x in text]
print(output)

This prints:

['Nanjing Office and Retail Market',
 "Xi'an Office and Retail Market",
 'Suzhou office and retail overview 2019 fourth quarter',
 'marketbeat Shanghai office Second quarter',
 'Shanghai office September 2019 marketbeats']
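The pattern in the question fails because square brackets define a character class: [Market|quarter|marketbeats] matches any single one of the listed characters (including the literal |), so re.split cuts the string at nearly every letter. Whole-word alternatives need a group such as (?:Market|quarter|marketbeats). As a minimal sketch of the same idea using re.search instead of re.findall (the names DELIMITERS and cut_after_delimiter are made up here for illustration and are not part of the original answer):

```python
import re

# Alternation of whole words. Square brackets would instead define a
# character class that matches any single listed character.
DELIMITERS = re.compile(r"\b(?:Market|quarter|marketbeats)\b")

def cut_after_delimiter(s, pattern=DELIMITERS):
    """Return s up to and including the first delimiter word, or s unchanged."""
    m = pattern.search(s)
    return s[:m.end()] if m else s

samples = [
    "Nanjing Office and Retail Market Overview 2019 Second Quarter",
    "Shanghai office September 2019 marketbeats 302.7 -0.4%",
    "no delimiter here",
]
for s in samples:
    print(cut_after_delimiter(s))
```

Matching here is case-sensitive, as in the answer above; compiling the pattern with re.IGNORECASE would also treat "Quarter" and "marketbeat"-prefixed words differently, so whether to add it depends on the data.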

Tags: regex, string, split, python-3.x, python-re