Oracle regexp_replace - Adding space to separate sentences
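I am using Oracle to fix some text. The problem is that my data contains sentences that are not separated by a space. For example:
Sentence without space.Between sentences
Sentence with question mark?Second sentence
I tested the following replace statement on regex101 and it seems to work there, but I can't work out why it doesn't work in Oracle:
regexp_replace(review_text, '([^\s\.])([\.!\?]+)([^\s\.\d])', '\1\2 \3')
This should let me find sentence-ending periods, exclamation points, and question marks (single or grouped) and add the necessary space between sentences. I realize there are other ways to separate sentences, but the pattern above should cover most use cases. The \d in the third capture group is there to make sure I don't accidentally change numeric values like "4.5" to "4. 5".
Test cases before the change: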
Sentence without space.Between sentences
Sentence with space. Between sentences
Sentence with multiple periods...Between sentences
False positive sentence with 4.5 Liters
Sentence with!Exclamation point
Sentence with!Question mark
After the change they should look like this:
Sentence without space. Between sentences
Sentence with space. Between sentences
Sentence with multiple periods... Between sentences
False positive sentence with 4.5 Liters
Sentence with! Exclamation point
Sentence with! Question mark
Regex101 link: https://regex101.com/r/dC9zT8/1
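All of the replacements work as expected on regex101, but in Oracle my third and fourth test cases do not. Oracle does not add a space after the group of multiple periods (the ellipsis), and regexp_replace does add a space inside "4.5". I'm not sure why this happens, but perhaps I'm missing some peculiarity of Oracle's regexp_replace.
Any and all insights are appreciated. Thanks!
This may get you started. It looks for any combination of . ? ! followed by zero or more spaces and a capital letter, and replaces the "zero or more spaces" with exactly one space. It will not split decimal numbers, but it will miss sentences that begin with anything other than a capital letter. You can start adding conditions from here - if you run into difficulty, write back and we will try to help. Referring to other regular expression dialects may be helpful, but it is probably not the fastest way to get an answer.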
with
  -- test data: one row per test case from the question
  inputs ( str ) as (
    select 'Sentence without space.Between sentences' from dual union all
    select 'Sentence with space. Between sentences' from dual union all
    select 'Sentence with multiple periods...Between sentences' from dual union all
    select 'False positive sentence with 4.5 Liters' from dual union all
    select 'Sentence with!Exclamation point' from dual union all
    select 'Sentence with!Question mark' from dual
  )
-- \1 = the punctuation run, \2 = the capital letter; any spaces between them become one space
select regexp_replace(str, '([.!?]+)\s*([A-Z])', '\1 \2') as new_str
from   inputs;
NEW_STR
-------------------------------------------------------
Sentence without space. Between sentences
Sentence with space. Between sentences
Sentence with multiple periods... Between sentences
False positive sentence with 4.5 Liters
Sentence with! Exclamation point
Sentence with! Question mark
6 rows selected.
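A likely explanation for the difference from regex101, offered as an assumption rather than a confirmed diagnosis: Oracle follows POSIX rules inside bracket expressions, where a backslash is an ordinary character. Under that reading, [^\s\.] excludes a literal backslash, the letter s, and a period, so the "s" at the end of "periods" blocks the match before the ellipsis, while [^\s\.\d] excludes the letter d rather than digits, so the "5" in "4.5" is accepted. The sketch below keeps the three-group pattern from the question but swaps in the POSIX classes [:space:] and [:digit:]. The inputs CTE and the str column are borrowed from the answer above, and the last test row is a hypothetical addition to exercise a sentence that starts with a lowercase letter.

```sql
-- Sketch only: the question's pattern rewritten with POSIX character classes,
-- assuming Oracle treats "\" literally inside [...] (so \s and \d do not work there).
with
  inputs ( str ) as (
    select 'Sentence without space.Between sentences' from dual union all
    select 'Sentence with multiple periods...Between sentences' from dual union all
    select 'False positive sentence with 4.5 Liters' from dual union all
    -- hypothetical extra row, not from the original post
    select 'lowercase start.next sentence' from dual
  )
select regexp_replace(
         str,
         -- \1: last character of the sentence (not whitespace, not a period)
         -- \2: one or more sentence-ending marks
         -- \3: first character of the next sentence (not whitespace, period, or digit)
         '([^[:space:].])([.!?]+)([^[:space:].[:digit:]])',
         '\1\2 \3'
       ) as new_str
from   inputs;
```

Unlike the capital-letter pattern, this version does not depend on how the next sentence starts, at the cost of also splitting things such as abbreviations or file names written with an inner period.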