Repair sentences that have line breaks in the middle of them: Python is \n is fun
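I am currently using Apache Tika to extract text from PDFs and NLTK for named entity recognition and other tasks. My problem is that sentences in the PDF documents are extracted with line breaks in the middle of them. For example:
I am a sentence that has a python line \nbreak in the middle of it.
The pattern is usually a space followed by a line break, <space>\n, or sometimes <space>\n<space>. I want to repair these sentences so that I can run a sentence tokenizer on them.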
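I am trying to use the regex pattern (.+?)(?:\r\n|\n)(.+[.!?]+[\s|$]) to replace the \n with an empty string.
Problems:
1. Sentences that start on the same line where another sentence ends are not matched.
2. How do I match sentences that break across more than two lines? In other words, how do I allow multiple occurrences of (?:\r\n|\n)?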
text = """
Random Data, Company
2015
This is a sentence that has line
break in the middle of it due to extracting from a PDF.
How do I support
3 line sentence
breaks please?
HEADER HERE
The first sentence will
match. However, this line will not match
for some reason
that I cannot figure out.
Portfolio:
http://DoNotMatchMeBecauseIHaveAPeriodInMe.com
Full Name
San Francisco, CA
94000
1500 testing a number as the first word in
a broken sentence.
Match sentences with capital letters on the next line like
Wi-Fi.
This line has
trailing spaces after exclamation mark!
"""
import re
new_text = re.sub(pattern=r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])', repl=r'\g<1>\g<2>', string=text, flags=re.MULTILINE)
print(new_text)
expected_result = """
Random Data, Company
2015
This is a sentence that has line break in the middle of it due to extracting from a PDF.
How do I support 3 line sentence breaks please?
HEADER HERE
The first sentence will match. However, this line will not match for some reason that I cannot figure out.
Portfolio:
http://DoNotMatchMeBecauseIHaveAPeriodInMe.com
Full Name
San Francisco, CA
94000
1500 testing a number as the first word in a broken sentence.
Match sentences with capital letters on the next line like Wi-Fi.
This line has trailing spaces after exclamation mark!
"""
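The regex does not match lines that end with a space, which is the case for the sentence that is split across 3 lines. As a result, that sentence is not merged into one.
Here is an alternative regex that merges all lines between two blank lines into one, making sure there is exactly one space between the merged lines: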
# The new regex
(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)
# The replacement string:
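Explanation: this searches for any non-whitespace character \S, followed by optional spaces or tabs, a line break, optional spaces or tabs, and then another \S. It replaces the line break and the spaces between the two \S with a single space. Space and tab are listed explicitly because \s also matches newlines.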
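To make the explanation concrete, here is a minimal sketch of how the alternative regex could be applied with re.sub. The replacement string did not survive in the text above, so `r'\1 \2'` is an assumption inferred from the explanation (keep both captured characters and put a single space between them). The names `LINE_BREAK`, `join_broken_lines`, and `sample` are hypothetical and used only for illustration.

```python
import re

# Pattern from the answer: a non-whitespace character, optional spaces/tabs,
# one line break (\r\n or \n), optional spaces/tabs, another non-whitespace character.
LINE_BREAK = re.compile(r'(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)')


def join_broken_lines(text):
    """Join lines separated by a single line break, leaving blank lines alone."""
    # Assumed replacement: the two captured characters with one space between them.
    return LINE_BREAK.sub(r'\1 \2', text)


sample = (
    "How do I support \n"      # trailing space before the break
    "3 line sentence \n"
    " breaks please?\n"        # leading space after the break
    "\n"                       # blank line: acts as a paragraph boundary
    "HEADER HERE\n"
)

print(join_broken_lines(sample))
```

Because a blank line puts two line breaks in a row, the second \S cannot match there, so paragraphs separated by blank lines stay apart. One caveat of the capturing form: both \S characters are consumed by each match, so a line that consists of a single non-whitespace character is joined to the line before it but not to the line after it.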