How to split on multiple unicode delimiters but still keep the delimiters in the list?

Question

Given the string:

老師說：「你們要記住國父說的『青年要立志做大事，不要做大官』這句話。」

The task is to split the string on a set of delimiting punctuation marks, i.e.

puncts = [u'!', u'"', u'#', u'$', u'%', u'&', u"'", u'(', u')', u'*', u'+', u',', u'-', u'.', u'/', u':', u';', u'<', u'=', u'>', u'?', u'@', u'[', u'\\', u']', u'^', u'_', u'`', u'{', u'|', u'}', u'~', u'\u2022', u'\u2026', u'\u3001', u'\u3002', u'\u300a', u'\u300b', u'\u300c', u'\u300d', u'\u300e', u'\u300f', u'\uff01', u'\uff08', u'\uff09', u'\uff0c', u'\uff1a', u'\uff1b', u'\uff1f']

The desired output is:

[u'\u8001\u5e2b\u8aaa', u'\uff1a', u'\u300c', u'\u4f60\u5011\u8981\u8a18\u4f4f\u570b\u7236\u8aaa\u7684', u'\u300e', u'\u9752\u5e74\u8981\u7acb\u5fd7\u505a\u5927\u4e8b', u'\uff0c', u'\u4e0d\u8981\u505a\u5927\u5b98', u'\u300f', u'\u9019\u53e5\u8a71', u'\u3002', u'\u300d']

I've looked at "Python: Split string with multiple delimiters", and the solution using re.split is neat:

>>> x = u'\u8001\u5e2b\u8aaa\uff1a\u300c\u4f60\u5011\u8981\u8a18\u4f4f\u570b\u7236\u8aaa\u7684\u300e\u9752\u5e74\u8981\u7acb\u5fd7\u505a\u5927\u4e8b\uff0c\u4e0d\u8981\u505a\u5927\u5b98\u300f\u9019\u53e5\u8a71\u3002\u300d'
>>> [i for i in re.split(u"[{}]".format("|".join(puncts)), x, re.U)]
[u'\u8001\u5e2b\u8aaa', None, u'', None, u'\u4f60\u5011\u8981\u8a18\u4f4f\u570b\u7236\u8aaa\u7684', None, u'\u9752\u5e74\u8981\u7acb\u5fd7\u505a\u5927\u4e8b', None, u'\u4e0d\u8981\u505a\u5927\u5b98', None, u'\u9019\u53e5\u8a71', None, u'', None, u'']

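Note: sorry, for some reason SO treats the printed strings as spam, so you will have to make do with the escaped form =(

But the result of re.split discards the delimiters, which I need to keep.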

Is there a way to keep the delimiters from `re.split`?

Are there other ways to split the string using the `puncts` list as multiple delimiters and achieve the desired output?

I've also tried padding every punctuation mark with spaces first and then splitting on whitespace:

>>> y = x
>>> for p in puncts:
...     y = y.replace(p, u' {} '.format(p))
... 
>>> y
u'\u8001\u5e2b\u8aaa    \uff1a       \u300c   \u4f60\u5011\u8981\u8a18\u4f4f\u570b\u7236\u8aaa\u7684   \u300e   \u9752\u5e74\u8981\u7acb\u5fd7\u505a\u5927\u4e8b    \uff0c    \u4e0d\u8981\u505a\u5927\u5b98   \u300f   \u9019\u53e5\u8a71    \u3002       \u300d   '
>>> y.split()
[u'\u8001\u5e2b\u8aaa', u'\uff1a', u'\u300c', u'\u4f60\u5011\u8981\u8a18\u4f4f\u570b\u7236\u8aaa\u7684', u'\u300e', u'\u9752\u5e74\u8981\u7acb\u5fd7\u505a\u5927\u4e8b', u'\uff0c', u'\u4e0d\u8981\u505a\u5927\u5b98', u'\u300f', u'\u9019\u53e5\u8a71', u'\u3002', u'\u300d']

Is there a simpler way to achieve the same desired output?

Answer 1

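From the documentation of re.split: if capturing parentheses are used in the pattern, the text of all groups in the pattern is also returned as part of the resulting list:

>>> re.split(r'(\W+)', 'Words, words, words.')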
['Words', ', ', 'words', ', ', 'words', '.', '']
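Applied to the string from the question, the same idea can be sketched as follows. This is a minimal illustration that assumes Python 3 and uses only the handful of delimiters that actually occur in the sample string rather than the full `puncts` list; the names `pattern`, `parts` and `tokens` are chosen here for illustration.

```python
import re

text = "老師說：「你們要記住國父說的『青年要立志做大事，不要做大官』這句話。」"

# A small subset of the question's punctuation, written as a character class.
# The parentheses around the class make re.split keep each matched delimiter.
pattern = re.compile("([：「」『』，。])")

parts = pattern.split(text)

# Adjacent delimiters (and a delimiter at the very end) yield empty strings,
# so filter those out.
tokens = [p for p in parts if p]
print(tokens)
# ['老師說', '：', '「', '你們要記住國父說的', '『', '青年要立志做大事', '，', '不要做大官', '』', '這句話', '。', '」']
```

Note also that the call in the question passes `re.U` as the third positional argument, which `re.split` interprets as `maxsplit`; flags have to be passed as `flags=...`.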

Answer 2

You can convert the puncts list into a regular expression to split on, as follows:

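# -*- coding: utf-8 -*-
import re

text = u"老師說：「你們要記住國父說的『青年要立志做大事，不要做大官』這句話。」"
# The delimiter list from the question (the backslash must be written as u'\\').
puncts = [u'!', u'"', u'#', u'$', u'%', u'&', u"'", u'(', u')', u'*', u'+', u',', u'-', u'.', u'/', u':', u';', u'<', u'=', u'>', u'?', u'@', u'[', u'\\', u']', u'^', u'_', u'`', u'{', u'|', u'}', u'~', u'\u2022', u'\u2026', u'\u3001', u'\u3002', u'\u300a', u'\u300b', u'\u300c', u'\u300d', u'\u300e', u'\u300f', u'\uff01', u'\uff08', u'\uff09', u'\uff0c', u'\uff1a', u'\uff1b', u'\uff1f']
# Escape each delimiter so that regex metacharacters such as '(' or '|' match literally.
puncts = [re.escape(x) for x in puncts]
# The capturing group around the alternation makes split() keep the delimiters.
my_re = re.compile(u'({})'.format(u'|'.join(puncts)))

# Drop the empty strings produced between adjacent delimiters (Python 2 print statement).
print [x for x in my_re.split(text) if len(x)]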

This gives you:

[u'\u8001\u5e2b\u8aaa', u'\uff1a', u'\u300c', u'\u4f60\u5011\u8981\u8a18\u4f4f\u570b\u7236\u8aaa\u7684', u'\u300e', u'\u9752\u5e74\u8981\u7acb\u5fd7\u505a\u5927\u4e8b', u'\uff0c', u'\u4e0d\u8981\u505a\u5927\u5b98', u'\u300f', u'\u9019\u53e5\u8a71', u'\u3002', u'\u300d']

The list comprehension at the end removes any empty strings from the result.
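A variant that avoids the filtering step is to match the tokens directly with re.findall instead of splitting. The sketch below swaps in that different technique and is not the code from the answer above; it assumes Python 3, and the names `CJK_PUNCTS`, `PUNCTS` and `tokenize` are hypothetical. The first 32 entries of the question's list happen to be exactly `string.punctuation`, so that constant is reused here.

```python
import re
import string

# The question's list: ASCII punctuation followed by CJK / full-width marks.
CJK_PUNCTS = (
    "\u2022\u2026\u3001\u3002\u300a\u300b\u300c\u300d\u300e\u300f"
    "\uff01\uff08\uff09\uff0c\uff1a\uff1b\uff1f"
)
PUNCTS = string.punctuation + CJK_PUNCTS


def tokenize(text, delimiters=PUNCTS):
    """Return single delimiters and runs of non-delimiter text, in order."""
    # Escape every character so ']', '\\', '^' and '-' are safe inside [...].
    cls = "".join(re.escape(ch) for ch in delimiters)
    # Match either exactly one delimiter or a maximal run of non-delimiters.
    pattern = re.compile("[{0}]|[^{0}]+".format(cls))
    return pattern.findall(text)


text = "老師說：「你們要記住國父說的『青年要立志做大事，不要做大官』這句話。」"
print(tokenize(text))
```

Because every character of the input is either a delimiter or part of a non-delimiter run, joining the returned tokens reproduces the original string, and no empty strings appear in the result.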

python

string

unicode

split

delimiter