How to select the substring that best matches another string
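Suppose the full string is
The following example examines the string, looking for the first substring bounded by comas
and the substring is
substing bounded
Is there any way, using SQL, to check whether the full string contains a substring that matches it to about 90%, as substing bounded matches substring bounded in my example?
The substring may be a combination of several words, so I cannot simply split the full string into words.
Experiment with the SOUNDEX function.
I haven't tested this, but it may help you: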
WITH strings AS (
select regexp_substr('The following example examines the string, looking for the first substring bounded by comas','[^ ]+', 1, level) ss /* [^ ]+ matches a run of non-space characters, i.e. one word */
from dual
connect by regexp_substr('The following example examines the string, looking for the first substring bounded by comas', '[^ ]+', 1, level) is not null
)
SELECT ss
FROM strings
WHERE SOUNDEX(ss) = SOUNDEX( 'commas' ) ;
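REGEXP_SUBSTR and CONNECT BY split the long string into words (by space); modify the delimiter as needed to include punctuation etc.
Here we rely on the built-in SOUNDEX matching our expectations.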
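To see why the SOUNDEX comparison can catch this particular misspelling, it may help to look at the codes directly. This is a minimal sketch, assuming Oracle's SOUNDEX follows the standard Soundex rules; the column aliases are only illustrative.

```sql
-- Compare the phonetic codes of the misspelled and the correctly spelled word.
-- Under standard Soundex both keep the initial C, then encode m as 5 and s as 2;
-- the doubled m in 'commas' collapses to a single code.
select soundex('comas')  as code_comas,
       soundex('commas') as code_commas
from dual;
```

Because SOUNDEX produces a short code driven by the first few consonants, it is probably a poor fit for multi-word substrings such as substing bounded.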
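First convert your text into a table of words. You will find lots of posts on SO about this topic, for example here.
You have to adjust the delimiter list so that only the words are extracted.
Here is a sample query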
with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual ),
t2 as (select rownum colnum from dual connect by level < 16 /* (max) number of words */),
t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum))) col from t1, t2
where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
select col from t3;
COL
----------
The
following
example
examines
...
In the next step, use the Levenshtein distance to find the closest words.
with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual ),
t2 as (select rownum colnum from dual connect by level < 16 /* (max) number of words */),
t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum))) col from t1, t2
where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
select col, str, UTL_MATCH.EDIT_DISTANCE(col, str) distance
from t3
cross join (select 'commas' str from dual)
order by 3;
COL        STR      DISTANCE
---------- ------ ----------
comas      commas          1
for        commas          5
examines   commas          6
...
Check the definition of the Levenshtein distance and define a distance threshold to get your candidate words.
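One way to apply such a threshold is sketched below, under the assumption that a distance of at most 2 is acceptable. The cutoff value is purely illustrative and should be tuned to your data.

```sql
-- Split the text into words, then keep only words within the chosen distance.
with t1 as (select 'The following example examines the string, looking for the first substring bounded by comas' col from dual),
t2 as (select level colnum from dual connect by level < 16 /* (max) number of words */),
t3 as (select regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) col from t1, t2
       where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
select col, UTL_MATCH.EDIT_DISTANCE(col, 'commas') distance
from t3
where UTL_MATCH.EDIT_DISTANCE(col, 'commas') <= 2 /* assumed threshold */
order by distance;
```

A fixed cutoff treats short and long words alike, so a threshold relative to the word length may work better for mixed input.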
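For matching independent of word boundaries, simply scan through your input and take all substrings whose length is that of your match string adjusted for the expected difference, e.g. increased by 10%.
You can limit the candidates by keeping only the substrings that start on a word boundary. The rest is the same distance calculation.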
with txt as (select 'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
str as (select 'substing bounded' str from dual),
t1 as (select substr(txt, rownum, (select length(str) * 1.1 from str)) substr, /* add 10% length for the match */
(select str from str) str
from txt connect by level < (select length(txt) from txt) - (select length(str) from str))
select SUBSTR, STR,
UTL_MATCH.EDIT_DISTANCE(SUBSTR, STR) distance
from t1
order by 3;
SUBSTR               STR                DISTANCE
-------------------- ---------------- ----------
substring bounded    substing bounded          1
ubstring bounded     substing bounded          3
substring bounde     substing bounded          3
t substring bound    substing bounded          5
...
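Since the question asks for a 90% match, the same scan can be filtered on a normalized score rather than a raw distance. The following is a sketch only: it uses UTL_MATCH.EDIT_DISTANCE_SIMILARITY, which returns a value from 0 (no match) to 100 (identical), and the 90 cutoff and the alias sub are assumptions taken from the question, not part of the answer above.

```sql
-- Slide a window of roughly 110% of the search string's length over the text
-- and keep only the windows whose normalized similarity reaches the cutoff.
with txt as (select 'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
str as (select 'substing bounded' str from dual),
t1 as (select substr(txt.txt, level, trunc(length(str.str) * 1.1)) sub, /* add 10% length for the match */
              str.str
       from txt cross join str
       connect by level <= length(txt.txt) - length(str.str) + 1)
select sub, str,
       UTL_MATCH.EDIT_DISTANCE_SIMILARITY(sub, str) similarity
from t1
where UTL_MATCH.EDIT_DISTANCE_SIMILARITY(sub, str) >= 90 /* assumed cutoff */
order by similarity desc;
```

The window length affects the score, because any extra characters in the window count as edits, so the 10% padding and the cutoff should be chosen together.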