Splitting text further while preserving line breaks

Question

I am splitting the text para with the following code, which preserves the line breaks \n:

from nltk import SpaceTokenizer

para = "\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*"
# SpaceTokenizer splits on single spaces only, so "\n" stays inside the tokens
sent = SpaceTokenizer().tokenize(para)

print(sent) then gives me the following:

['\n[STUFF]\n', '', 'comma,', '', 'with', 'period.', 'the', 'new', 'question?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']

My goal is to get the following output:

['\n[STUFF]\n', '', 'comma', ',', '', 'with', 'period', '.', 'the', 'new', 'question', '?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']

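That is, I want to split 'comma,' into 'comma', ','; 'period.' into 'period', '.'; and 'question?' into 'question', '?', all while preserving \n.

I have tried word_tokenize. It does split 'comma,' into 'comma', ',' and so on, but it does not preserve \n.

How can I split sent further while preserving \n?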

Answer 1

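https://docs.python.org/3/library/re.html#re.split is probably what you want.

However, judging by your desired output, you will need to do more processing on the string than just applying a single function to it.

I would start by replacing every \n with a placeholder string such as new_line_goes_here, then split the string, and finally replace new_line_goes_here with \n again once the splitting is done.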
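
A minimal sketch of that placeholder idea, under some assumptions: the placeholder text never occurs in the real input, and the `re.findall` pattern is only a stand-in for whichever tokenizer you actually use (anything that would otherwise throw the whitespace away). The names `PLACEHOLDER`, `masked` and `restored` are made up for illustration.

```python
import re

para = "\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*"

# Hypothetical placeholder; it must not occur in the real text.
PLACEHOLDER = "new_line_goes_here"

# 1. Hide every newline behind the placeholder.
masked = para.replace("\n", PLACEHOLDER)

# 2. Tokenize. This regex is only a stand-in for a tokenizer that drops
#    whitespace: it returns runs of word characters and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", masked)

# 3. Put the newlines back into the tokens that carried the placeholder.
restored = [tok.replace(PLACEHOLDER, "\n") for tok in tokens]
print(restored)
```

Because the placeholder consists of word characters, it stays glued to an adjacent word, so tokens such as '\n\nthe\n' survive intact. This stand-in tokenizer also splits '[STUFF]' and 'char*' and drops the empty strings, so the result is close to, but not identical with, the list asked for in the question.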

Answer 2

Following @randy's suggestion to look at https://docs.python.org/3/library/re.html#re.split:

import re
para = re.split(r'(\W+)', '\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*')
print(para)

Output (close to what I was looking for):

['', '\n[', 'STUFF', ']\n  ', 'comma', ',  ', 'with', ' ', 'period', '. ', 'the', ' ', 'new', ' ', 'question', '? \n\n', 'the', '\n  \n', 'line', '\n ', 'new', ' ', 'char', '*', '']
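
To get from "close" to the exact list in the question, one possible sketch (not taken from either answer) is to split on single spaces, which is what SpaceTokenizer does, and then peel one trailing ',', '.' or '?' off any token that is otherwise a plain word. The helper name `split_keep_newlines` is hypothetical, and the set of punctuation marks is an assumption read off the desired output.

```python
import re

para = "\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*"

# A plain word followed by exactly one of , . ? (assumed from the desired output)
WORD_PUNCT = re.compile(r"(\w+)([,.?])")

def split_keep_newlines(text):
    """Split on single spaces, then separate one trailing , . or ? from words."""
    tokens = []
    for tok in text.split(" "):       # same splitting rule as SpaceTokenizer
        match = WORD_PUNCT.fullmatch(tok)
        if match:
            tokens.extend(match.groups())   # 'comma,' -> 'comma', ','
        else:
            tokens.append(tok)              # '\n[STUFF]\n', '', 'char*' stay as-is
    return tokens

sent = split_keep_newlines(para)
print(sent)
```

Tokens that contain \n, the empty strings produced by double spaces, and 'char*' do not match the pattern, so they pass through untouched. If other punctuation should be split off as well, the character class is the place to extend.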

python

string

split

tokenize

nltk