Split a string with one delimiter but multiple conditions
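Good morning,
I found several threads that deal with splitting a string on multiple delimiters, but none that use one delimiter with multiple conditions.
I want to split the following string into sentences: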
desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."
If I do this:
desc.split('. ')
I get:
['Dr', 'Anna Pytlik is an expert in conservative and aesthetic dentistry', 'She speaks both English and Polish.']
I do not want to split on the first period, the one after 'Dr'. How can I add a list of substrings after which .split('. ') should not apply?
Thanks!
You can use re.split with a negative lookbehind:
>>> import re
>>> desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."
>>> re.split(r"(?<!Dr|Mr)\. ", desc)
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry',
'She speaks both English and Polish.']
Just add more "exceptions", separated by |.
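Update: It seems that a negative lookbehind requires all of its alternatives to have the same length, so this does not work with both "Dr." and "Prof." One workaround might be to pad the patterns with ., for example (?<!..Dr|..Mr|Prof). You could easily write a helper method that pads each title with as many . as needed. However, this may break if the very first word of the text is Dr., because the .. will then have nothing to match.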
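As an alternative to padding (this is not from the answer above, just a sketch), one fixed-width lookbehind per title can be chained, which also works when the title is the very first word. The names TITLES, build_splitter and mixed_width_fails are made up for illustration.

```python
import re

TITLES = ("Dr", "Mr", "Prof")  # hypothetical list of abbreviations to protect


def build_splitter(titles=TITLES):
    # Each lookbehind has a fixed width of its own, so titles of
    # different lengths can be combined by chaining them.
    lookbehinds = "".join(f"(?<!{re.escape(t)})" for t in titles)
    return re.compile(lookbehinds + r"\. ")


def mixed_width_fails():
    # Alternatives of different lengths inside ONE lookbehind are
    # rejected by Python's re module when the pattern is compiled.
    try:
        re.compile(r"(?<!Dr|Prof)\. ")
    except re.error:
        return True
    return False


text = ("Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. "
        "Prof. Miller speaks both English and Polish.")
sentences = build_splitter().split(text)
print(sentences)
```

Every title adds one more lookbehind to the pattern, so for a long list of titles the placeholder approach may be easier to maintain.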
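Another workaround might be to first replace all the titles with placeholders, e.g. "Dr." -> "{DR}" and "Prof." -> "{PROF}", then split, and then swap the original titles back in. This way you do not even need regular expressions.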
pairs = (("Dr.", "{DR}"), ("Prof.", "{PROF}"))  # and some more

def subst_titles(s, reverse=False):
    # Swap each title for its placeholder, or back again when reverse=True.
    for x, y in pairs:
        s = s.replace(*((y, x) if reverse else (x, y)))
    return s
Example:
>>> text = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. Prof. Miller speaks both English and Polish."
>>> [subst_titles(s, True) for s in subst_titles(text).split(". ")]
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry', 'Prof. Miller speaks both English and Polish.']
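You can split first and then re-join the pieces that follow Dr/Mr/...
This does not need a complex regular expression and may be faster (you should benchmark it to choose the best option).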
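A minimal sketch of the split-then-rejoin idea, assuming the titles are known in advance; TITLES and split_sentences are hypothetical names, not part of the answer:

```python
TITLES = ("Dr", "Mr", "Prof")  # hypothetical list of abbreviations to protect


def split_sentences(text, titles=TITLES):
    sentences = []
    pending = []  # pieces that ended in a title and wait for the next piece
    for part in text.split(". "):
        pending.append(part)
        last_word = part.rsplit(" ", 1)[-1]
        if last_word not in titles:
            # A real sentence end: glue the waiting pieces back together.
            sentences.append(". ".join(pending))
            pending = []
    if pending:  # the text ended right after a title
        sentences.append(". ".join(pending))
    return sentences


text = ("Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. "
        "Prof. Miller speaks both English and Polish.")
print(split_sentences(text))
```

This only checks the last word of each piece, so a sentence that genuinely ends in one of the listed words would be merged with the sentence after it.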