Split a string with one delimiter but multiple conditions

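Good morning,

I found several threads that deal with splitting a string on multiple delimiters, but none that use one delimiter and multiple conditions.

I want to split the following string into sentences: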

desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."

If I do:

desc.split('. ')

I get:

['Dr', 'Anna Pytlik is an expert in conservative and aesthetic dentistry', 'She speaks both English and Polish.']

I don't want to split on the first period, the one after 'Dr'. How can I add a list of substrings for which .split('. ') should not apply?

Thanks!

You can use re.split with a negative lookbehind:

>>> import re
>>> desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."
>>> re.split(r"(?<!Dr|Mr)\. ", desc)
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry',
 'She speaks both English and Polish.']

Just add more "exceptions", separated by |.


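Update: It seems that a negative lookbehind requires all alternatives to have the same length, so this does not work with both "Dr." and "Prof." One workaround might be to pad the patterns with ., e.g. (?<!..Dr|..Mr|Prof). You could easily write a helper method that pads each title with as many . as needed. However, this may break if the very first word of the text is Dr., because the .. then has nothing to match.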
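A minimal sketch of such a padding helper, assuming a hypothetical title list TITLES and the made-up helper names padded_splitter and chained_splitter (none of these appear in the answer). The second function is a different technique, one negative lookbehind per title, which sidesteps the equal-length requirement:

```python
import re

TITLES = ("Dr", "Mr", "Prof")  # hypothetical list; extend as needed

def padded_splitter(titles=TITLES):
    # Left-pad every title with "." (any character) up to the longest title,
    # so that all lookbehind alternatives have the same width.
    width = max(len(t) for t in titles)
    padded = ["." * (width - len(t)) + re.escape(t) for t in titles]
    return re.compile(r"(?<!" + "|".join(padded) + r")\. ")

def chained_splitter(titles=TITLES):
    # Alternative (not from the answer): one fixed-width lookbehind per title.
    lookbehinds = "".join("(?<!" + re.escape(t) + ")" for t in titles)
    return re.compile(lookbehinds + r"\. ")

text = "He met Dr. Smith. Prof. Miller agreed. Done."
print(padded_splitter().split(text))
# ['He met Dr. Smith', 'Prof. Miller agreed', 'Done.']
print(chained_splitter().split(text))
# ['He met Dr. Smith', 'Prof. Miller agreed', 'Done.']
```

With the padded pattern, a text that begins with "Dr. " is still split after "Dr", which is the limitation mentioned above. The chained variant keeps "Dr. " intact in that case.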

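Another workaround might be to first replace all the titles with placeholders, e.g. "Dr." -> "{DR}" and "Prof." -> "{PROF}", then split, and finally swap the original titles back in. This way you don't even need regular expressions.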

pairs = (("Dr.", "{DR}"), ("Prof.", "{PROF}"))  # and some more

def subst_titles(s, reverse=False):
    # Swap each title for its placeholder, or back again when reverse=True.
    for x, y in pairs:
        s = s.replace(y, x) if reverse else s.replace(x, y)
    return s

Example:

>>> text = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. Prof. Miller speaks both English and Polish."
>>> [subst_titles(s, True) for s in subst_titles(text).split(". ")]
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry', 'Prof. Miller speaks both English and Polish.']

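You can split first and then re-join the pieces that end in Dr/Mr/... This doesn't need a complicated regex and may be faster (you should benchmark it to pick the best option).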
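One way the split-then-rejoin idea could look, as a sketch only. The title set TITLES and the function name split_sentences are assumptions for illustration, not part of the answer:

```python
TITLES = {"Dr", "Mr", "Prof"}  # hypothetical set of titles; extend as needed

def split_sentences(text, titles=TITLES, sep=". "):
    sentences = []
    buffer = []  # pieces waiting to be glued back together
    for piece in text.split(sep):
        buffer.append(piece)
        words = piece.split()
        # If the piece ends with a title, the split was wrong:
        # keep buffering so the next piece is re-joined to it.
        if words and words[-1] in titles:
            continue
        sentences.append(sep.join(buffer))
        buffer = []
    if buffer:  # text ended right after a title
        sentences.append(sep.join(buffer))
    return sentences

desc = ("Dr. Anna Pytlik is an expert in conservative and aesthetic "
        "dentistry. She speaks both English and Polish.")
print(split_sentences(desc))
# ['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry',
#  'She speaks both English and Polish.']
```

Because the check looks at whole words, a title only blocks a split when it is the last word before the period.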