Repair sentences that have line breaks in the middle of them: Python is \n is fun

Question

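I am currently using Apache Tika to extract text from PDFs, and NLTK for named entity recognition and other tasks. My problem is that sentences in the PDF documents are extracted with line breaks in the middle of them. For example: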

I am a sentence that has a python line \nbreak in the middle of it.

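The pattern is usually a space followed by a line break, <space>\n, or sometimes <space>\n<space>. I want to repair these sentences so that I can run a sentence tokenizer on them.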

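I am trying to use the regex pattern (.+?)(?:\r\n|\n)(.+[.!?]+[\s|$]) to replace the \n with an empty string.

Problems:

1. Sentences that start on the same line where another sentence ends are not matched.
2. How do I match sentences whose line breaks span several lines? In other words, how do I allow multiple occurrences of (?:\r\n|\n)?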

text = """
Random Data, Company
2015

This is a sentence that has line 
break in the middle of it due to extracting from a PDF.

How do I support
3 line sentence 
breaks please?

HEADER HERE

The first sentence will 
match. However, this line will not match
for some reason 
that I cannot figure out.

Portfolio: 
http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 

Full Name 
San Francisco, CA  
94000

1500 testing a number as the first word in
a broken sentence.

Match sentences with capital letters on the next line like 
Wi-Fi.

This line has 
trailing spaces after exclamation mark!       
"""
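import re

# Attempted fix: capture the text before and after a line break and re-join it.
# Both strings are raw so that \g<1> and \g<2> reach re.sub as group references
# instead of being parsed as (invalid) string escapes.
new_text = re.sub(pattern=r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])', repl=r'\g<1>\g<2>', string=text, flags=re.MULTILINE)
print(new_text)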

expected_result = """
Random Data, Company
2015

This is a sentence that has line break in the middle of it due to extracting from a PDF.

How do I support 3 line sentence breaks please?

HEADER HERE

The first sentence will match. However, this line will not match for some reason that I cannot figure out.

Portfolio: 
http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 

Full Name 
San Francisco, CA  
94000

1500 testing a number as the first word in a broken sentence.

Match sentences with capital letters on the next line like Wi-Fi.

This line has trailing spaces after exclamation mark!       
"""

Demo at regex101.com

Answer 1

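The regex does not match lines that end with a space, which is the case for the sentence split across 3 lines. As a result, that sentence is not merged into one.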
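
To see the effect on the three-line sentence in isolation, the question's pattern can be run against just those lines. This is only a small check, not part of the original answer; the variable names are made up for illustration. Because the middle line contains no `.`, `!` or `?`, an attempt starting on the first line cannot complete, so only the last two lines end up joined.

```python
import re

# The pattern from the question, unchanged.
pattern = r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])'

# The three-line sentence from the question's sample text.
three_lines = "How do I support\n3 line sentence \nbreaks please?\n"
result = re.sub(pattern, r'\g<1>\g<2>', three_lines)

# The first line break survives; only the second one is removed.
print(repr(result))
```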

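Here is an alternative regex that merges all lines between two blank lines into one, making sure there is exactly one space between the merged lines: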

# The new regex
(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)
# The replacement string:
\1 \2

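Explanation: this searches for any non-space character \S followed by a line break, then optional spaces, then another \S. It replaces the line break and the spaces between the two \S characters with a single space. Space and tab are listed explicitly because \s would also match line breaks. Here is the demo link.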
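
A minimal Python sketch of how the regex above might be applied with `re.sub`. The function name `join_wrapped_lines` and the sample string are illustrative, not from the answer, and the replacement `\1 \2` is assumed from the explanation (the two captured characters with a single space between them).

```python
import re

# Non-whitespace char, optional spaces/tabs, ONE line break,
# optional spaces/tabs, then another non-whitespace char.
LINE_BREAK = re.compile(r'(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)')


def join_wrapped_lines(text):
    """Join lines separated by a single line break, using one space."""
    return LINE_BREAK.sub(r'\1 \2', text)


sample = (
    "How do I support\n"
    "3 line sentence \n"
    "breaks please?\n"
    "\n"
    "HEADER HERE\n"
)
print(join_wrapped_lines(sample))
```

Blank lines are preserved because the pattern requires a non-whitespace character immediately after the single line break. Note that this joins every pair of adjacent non-blank lines, so blocks such as `Random Data, Company` followed by `2015` would also be merged, which differs from the `expected_result` shown in the question.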

python

regex

nltk