Split a string with one delimiter but multiple conditions

Question

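Good morning,

I found several threads about splitting a string with multiple delimiters, but none about one delimiter with multiple conditions.

I want to split the following string into sentences: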

desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."

If I do this:

desc.split('. ')

I get:

['Dr', 'Anna Pytlik is an expert in conservative and aesthetic dentistry', 'She speaks both English and Polish.']

I don't want to split at the first dot, the one after 'Dr'. How can I add a list of substrings for which .split('.') should not apply?

Thanks!

Answer 1

You can use re.split with a negative lookbehind:

>>> import re
>>> desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."
>>> re.split(r"(?<!Dr|Mr)\. ", desc)
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry',
 'She speaks both English and Polish.']

Just add more "exceptions", separated by |.

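Update: It seems that a negative lookbehind in Python's re module requires all alternatives to have the same length, so this does not work with both "Dr." and "Prof." One workaround might be to pad the patterns with ., e.g. (?<!..Dr|..Mr|Prof). You could easily write a helper method that pads each title with as many . as needed. However, this may break if the very first word of the text is Dr., because there is nothing before it for the .. to match, so the lookbehind does not block the split.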
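A different way around the fixed-width restriction, not part of the original answer, is to chain one negative lookbehind per title instead of putting them all in a single alternation. Each lookbehind then has its own fixed width, and a title at the very start of the text is still protected. The sketch below assumes a hypothetical TITLES list and a hypothetical helper named build_splitter; adjust both as needed.

```python
import re

# Hypothetical list of titles that must not end a sentence; extend as needed.
TITLES = ["Dr", "Mr", "Mrs", "Prof"]

def build_splitter(titles):
    # Build one fixed-width negative lookbehind per title,
    # e.g. (?<!Dr)(?<!Mr)(?<!Mrs)(?<!Prof), followed by the delimiter ". ".
    guards = "".join("(?<!{})".format(re.escape(t)) for t in titles)
    return re.compile(guards + r"\. ")

splitter = build_splitter(TITLES)
text = ("Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. "
        "Prof. Miller speaks both English and Polish.")
sentences = splitter.split(text)
print(sentences)
```

With this text the split happens only after "dentistry", so both titles stay attached to their sentences.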

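Another workaround might be to first replace all titles with placeholders, e.g. "Dr." -> "{DR}" and "Prof." -> "{PROF}", then split, and then swap the original titles back in. This way you don't even need regular expressions.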

pairs = (("Dr.", "{DR}"), ("Prof.", "{PROF}")) # and some more
def subst_titles(s, reverse=False):
    for x, y in pairs:
        s = s.replace(*(x, y) if not reverse else (y, x))
    return s

Example:

>>> text = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. Prof. Miller speaks both English and Polish."
>>> [subst_titles(s, True) for s in subst_titles(text).split(". ")]
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry', 'Prof. Miller speaks both English and Polish.']

Answer 2

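You can split first and then re-join the pieces that end in Dr/Mr/... This doesn't need a complicated regex and may be faster (you should benchmark it to choose the best option).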
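The answer gives no code, so here is a minimal sketch of one way the split-then-rejoin idea could look. The TITLES set and the function name split_sentences are assumptions made for illustration: a piece whose last word is a known title is held back and glued onto the next piece.

```python
# Hypothetical set of titles (without the trailing dot); extend as needed.
TITLES = {"Dr", "Mr", "Mrs", "Prof"}

def split_sentences(text, titles=TITLES):
    sentences = []
    pending = ""  # piece that ended in a title and waits for its continuation
    for part in text.split(". "):
        if pending:
            # Re-join with the delimiter that split() removed.
            part = pending + ". " + part
            pending = ""
        last_word = part.rsplit(" ", 1)[-1]
        if last_word in titles:
            pending = part  # not a real sentence end, keep accumulating
        else:
            sentences.append(part)
    if pending:
        sentences.append(pending)
    return sentences

desc = ("Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. "
        "She speaks both English and Polish.")
print(split_sentences(desc))
```

Because the check looks only at the last word of each piece, titles in the middle of a sentence ("I met Prof. Miller.") are re-joined as well.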

