Python regex or other solution to extract text items from a string?

Question

I have a string that looks like this:

\nInhaltse / techn. Angaben*\n\nAQUA • COCO-GLUCOSIDE • COCOSULFATE • SODIUM\n\n\

and I need to get the list of items between the dots, like this:

AQUA COCO-GLUCOSIDE COCOSULFATE  SODIUM

I have tried regex and other tools, but I cannot find a correct, flexible* answer.

*flexible = the list may contain 1 to N elements

Answer 1

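You should define more precisely which inputs are possible and which rules you want to apply.
I think a rule like 'any word made up only of uppercase letters or dashes, at least 2 characters long, preceded and followed by a space or \n' could fit your case. If so, here is your regex: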

import re

my_string = "\nInhaltse / techn. Angaben*\n\nAQUA • COCO-GLUCOSIDE • COCOSULFATE • SODIUM\n\n"

print(re.findall(r"(?<=\n|\s)[A-Z-]{2,}(?=\n|\s)", my_string))

Output:

['AQUA', 'COCO-GLUCOSIDE', 'COCOSULFATE', 'SODIUM']

Here is how to read the regex:

(?<=\n|\s) means preceded by (?<=) a newline (\n) or (|) a whitespace character (\s)
[A-Z-]{2,} means at least two ({2,}) uppercase letters or dashes ([A-Z-])
(?=\n|\s) means followed by (?=) a newline (\n) or (|) a whitespace character (\s)
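
To make the reading above easier to check, here is a minimal sketch of the same rule written with re.VERBOSE so each part carries its own comment. The name TOKEN_RE is only illustrative, and the sketch assumes one simplification that is not in the answer's pattern: because \s already matches \n, the "\n|\s" alternation is shortened to \s.

```python
import re

# Same rule as the pattern above, annotated piece by piece.
# Assumption: "\n|\s" is shortened to "\s", since \s already matches \n.
TOKEN_RE = re.compile(
    r"""
    (?<=\s)      # look-behind: the previous character is whitespace
    [A-Z-]{2,}   # two or more uppercase letters or dashes
    (?=\s)       # look-ahead: the next character is whitespace
    """,
    re.VERBOSE,
)

my_string = "\nInhaltse / techn. Angaben*\n\nAQUA • COCO-GLUCOSIDE • COCOSULFATE • SODIUM\n\n"

tokens = TOKEN_RE.findall(my_string)
print(tokens)
```

Note that both look-arounds require an actual whitespace character, so an item at the very start or very end of the string (with nothing before or after it) would not be matched by this rule.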

Or, to better match your requirement:

get a list of the items between dots

you can use:

r"(?<=\n\n|\•\s)[A-Z-\s]{2,}(?=\n\n|\s\•)"

This means:

at least 2 uppercase letters, dashes or whitespace characters, preceded by two newlines or by a dot (•) and a space, and followed by two newlines or by a space and a dot (•)
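
As a rough check of the "1 to N elements" requirement, the second pattern can be wrapped in a small helper and tried on lists of different lengths. This is only a sketch: the names ITEM_RE and extract_items are hypothetical, the bullet is written without the backslash (it needs no escaping), and it assumes the list is always delimited by a blank line ("\n\n") on both sides, as in the sample string.

```python
import re

# Second pattern from the answer; both look-behind alternatives are exactly
# two characters wide, which Python's re module requires for look-behinds.
ITEM_RE = re.compile(r"(?<=\n\n|•\s)[A-Z-\s]{2,}(?=\n\n|\s•)")


def extract_items(text):
    """Return the uppercase items separated by ' • ' in text (hypothetical helper)."""
    return ITEM_RE.findall(text)


sample = "\nInhaltse / techn. Angaben*\n\nAQUA • COCO-GLUCOSIDE • COCOSULFATE • SODIUM\n\n"

many = extract_items(sample)                            # N items
single = extract_items("\n\nAQUA\n\n")                  # exactly 1 item
multi_word = extract_items("\n\nAQUA • SODIUM CHLORIDE\n\n")

print(many)
print(single)
print(multi_word)
```

Because the character class also allows whitespace, an item made of several words is returned as one entry rather than being split at the space.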

python

text

beautifulsoup

text-mining