How to put a newline after every n-th sentence?
I am struggling with how to add a new line after every 5th sentence in a long text string.
Sample input:
text = 'The puppy is cute. Summer is great. Happy Friday. Sentence4. Sentence5. Sentence6. Sentence7.'
Desired output:
The puppy is cute. Summer is great. Happy Friday. Sentence4. Sentence5.
Sentence6. Sentence7.
Can anyone help?
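Try this:
text = 'The puppy is cute. Summer is great. Happy friday. sentence4. sentence5. sentence6. sentence7.'
# split on periods; the last element is the empty string after the final period
splittext = text.split(".")
# prefix every 5th piece with a newline and drop its leading space
for x in range(5, len(splittext), 5):
    splittext[x] = "\n" + splittext[x].lstrip()
text = ".".join(splittext)
print(text)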
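Use a regular expression: add \n after every 5 matches of "non-period characters followed by a period".
import re
text = 'The puppy is cute. Summer is great. Happy friday. sentence4. sentence5. sentence6. sentence7.'
# \1 re-inserts the five matched sentences; without it they would be replaced by the newline
print(re.sub(r'((?:[^.]+\.\s*){5})', r'\1\n', text))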
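The same regex idea can be parameterized over n. This is only a minimal sketch: the function name `break_every_n` is made up for illustration, and it still treats every period as a sentence end. Unlike the one-liner above, it swallows the whitespace after the n-th sentence and only inserts a newline when more text follows.

```python
import re

def break_every_n(text: str, n: int = 5) -> str:
    # n runs of "non-period characters + period", then whitespace that is
    # followed by more text (so no trailing newline is added at the end).
    pattern = re.compile(r'((?:[^.]+\.){%d})\s+(?=\S)' % n)
    # Keep the matched sentences (\1) and replace the whitespace with a newline.
    return pattern.sub(r'\1\n', text)

print(break_every_n('A. B. C. D. E. F. G.', 5))
```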
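A more advanced regex sentence matcher that handles abbreviations and other punctuation by matching on the closing punctuation.
Reference: https://mikedombrowski.com/2017/04/regex-sentence-splitter/
Note: there are still edge cases that fail. For example, "T.V." followed by "Mr." needs two spaces to indicate a separate sentence, and quotations that contain sentences will be split.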
import re
sentence_regex = r'((.*?([\.\?!][\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+)){5})'
text = 'The puppy is cute. Watch T.V. Mr. Summers is great. Say "my name." My name is. Or not... Happy friday? Sentence4. Sentence5. Sentence6. Sentence7.'
text += " " + text
# \1 keeps the five matched sentences and appends the newline after them
print(re.sub(sentence_regex, r'\1\n', text))
If it gets more complicated than this, you may want to look into a natural language processing toolkit.
Here is a simple function that adds a newline at the end of every 5th sentence:
def new_line(sentence: str):
    # characters that mark the end of a sentence ('...' also ends in '.')
    end_of_sentence_markers = ('.', '!', '?')
    # after n sentences insert a newline
    n = 5
    # number of sentences seen since the last newline
    count = 0
    # final string as list for efficiency
    final_str = []
    # split at space
    sentence_split = sentence.split(' ')
    # traverse the words
    for word in sentence_split:
        # if the word ends a sentence then increase count
        # (endswith is safe for the empty strings produced by double spaces)
        if word.endswith(end_of_sentence_markers):
            count += 1
        # if count is equal to n then add newline otherwise add space
        if count == n:
            final_str.append(word + '\n')
            count = 0
        else:
            final_str.append(word + ' ')
    # return the string version of the list
    return ''.join(final_str)
Here is a modified version:
def new_line_better(sentence: str, n: int):
    # final string as list for efficiency
    final_str = []
    # split at period, strip extra spaces, and skip empty pieces
    # (such as the empty string left after the final period)
    sentence_split = [s.strip() for s in sentence.split('.') if s.strip()]
    # number of sentences seen since the last newline
    count = 0
    # traverse the sentences
    for part in sentence_split:
        count += 1
        if count == n:
            count = 0
            final_str.append(part + '.\n')
        else:
            final_str.append(part + '. ')
    # return the string version of the list
    return ''.join(final_str)
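Another approach:
text = 'The puppy is cute. Summer is great. Happy friday. sentence4. sentence5. sentence6. sentence7.'
out = ''
for i, e in enumerate(text.split(".")):
    # start a new line before every 5th piece, except the first
    if i > 0 and i % 5 == 0:
        out = out + '\n'
    out = out + e + '.'
out
Result (note the leading space after the newline and the doubled final period):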
'The puppy is cute. Summer is great. Happy friday. sentence4. sentence5.\n sentence6. sentence7..'
With a list comprehension:
text = 'The puppy is cute. Summer is great. Happy friday. sentence4. sentence5. sentence6. sentence7.'
lines = text.split(".")
result = ".".join([l if i % 5 else "\n"+l for (i, l) in enumerate(lines)]).lstrip()
print(result)
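Most of the answers above follow the same three steps: split into sentences, group them n at a time, and join the groups with newlines. The sketch below makes that structure explicit using only the standard library. The name `wrap_sentences` is hypothetical, and the sentence pattern is a deliberate simplification: any run of `.`, `!` or `?` ends a sentence, so abbreviations such as "Mr." are still miscounted.

```python
import re

def wrap_sentences(text: str, n: int = 5) -> str:
    # A "sentence" is text up to a run of . ! ? (or up to the end of the string).
    pieces = re.findall(r'[^.!?]+(?:[.!?]+|$)', text)
    # Strip surrounding whitespace and discard whitespace-only leftovers.
    sentences = [p.strip() for p in pieces if p.strip()]
    # Join each group of n sentences with spaces, then the groups with newlines.
    groups = [' '.join(sentences[i:i + n]) for i in range(0, len(sentences), n)]
    return '\n'.join(groups)

text = 'The puppy is cute. Summer is great! Happy Friday? S4. S5. S6. S7.'
print(wrap_sentences(text, 5))
```

Because each sentence is stripped before joining, the output has no leading space after the newline and no doubled period at the end.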