How to select the best-matching substring of another string

Suppose the full string is:

The following example examines the string, looking for the first substring bounded by comas

and the substring is:

substing bounded

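Is there any way, using SQL, to check whether the full string contains a substring that matches at least 90%?

In my example that would be 'substing bounded' versus 'substring bounded'.

The substring may be a combination of several words, so I cannot simply split the full string into words.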

Experiment with the SOUNDEX function.

I haven't tested this, but it may help you:

    WITH strings AS (
      -- '[^ ]+' matches each run of non-space characters, i.e. one word per level
      select regexp_substr('The following example examines the string, looking for the first substring bounded by comas', '[^ ]+', 1, level) ss
      from dual
      connect by regexp_substr('The following example examines the string, looking for the first substring bounded by comas', '[^ ]+', 1, level) is not null
    )
    SELECT ss
    FROM strings
    WHERE SOUNDEX(ss) = SOUNDEX('commas');

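REGEXP_SUBSTR together with CONNECT BY splits the long string into words (by space). Modify the delimiter as needed to include punctuation, etc.

Here we rely on the built-in SOUNDEX behaving as we expect.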
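One way to check that assumption is to look at the codes SOUNDEX produces for the two spellings. A minimal sketch using only the built-in function; the column aliases are made up for illustration:

```sql
-- Compare the phonetic codes of the misspelled word in the text
-- and the correctly spelled search word.
select soundex('comas')  as code_in_text,
       soundex('commas') as code_searched
from dual;
```

Both spellings should come out as C520, which is why the equality test matches here. Keep in mind that SOUNDEX keeps the first letter and only encodes the next few consonants, so it is a coarse test and is of limited use for multi-word phrases.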

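First convert your text into a table of words. You will find many posts on this topic on SO.

You must adjust the delimiter list so that only the words are extracted.

Here is a sample query: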

    with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual),
         t2 as (select rownum colnum from dual connect by level < 16 /* (max) number of words */),
         t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum))) col
                from t1, t2
                where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
    select col from t3 order by colnum;

COL      
----------
The        
following  
example    
examines
...

In the next step, you use the Levenshtein distance to find the closest word.

    with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual),
         t2 as (select rownum colnum from dual connect by level < 16 /* (max) number of words */),
         t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum))) col
                from t1, t2
                where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
    select col, str, UTL_MATCH.EDIT_DISTANCE(col, str) distance
    from t3
    cross join (select 'commas' str from dual)
    order by distance;

COL        STR      DISTANCE
---------- ------ ----------
comas      commas          1 
for        commas          5 
examines   commas          6 
...

Check the definition of the Levenshtein distance and define a distance threshold to obtain the candidate words.
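A threshold filter could look roughly like the following. This is only a sketch: the limit of 2 edits is an arbitrary assumption for illustration, and the CTE name words is made up. UTL_MATCH.EDIT_DISTANCE_SIMILARITY is shown alongside because it normalizes the distance to a 0 to 100 score, which is closer to the "90% match" wording of the question.

```sql
-- Keep only the words within a chosen edit distance of the search word.
-- The limit of 2 is an illustrative assumption, not a recommended value.
with words as (
  select regexp_substr(txt, '[^ ,]+', 1, level) word
  from (select 'The following example examines the string, looking for the first substring bounded by comas' txt from dual)
  connect by regexp_substr(txt, '[^ ,]+', 1, level) is not null
)
select word,
       utl_match.edit_distance(word, 'commas')            distance,
       utl_match.edit_distance_similarity(word, 'commas') similarity
from words
where utl_match.edit_distance(word, 'commas') <= 2
order by distance;
```

For short words a single edit already lowers the similarity score noticeably, so an absolute distance limit and a percentage limit will not always select the same candidates.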

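To match independently of word boundaries, simply scan your input and take all substrings whose length is that of the match string, adjusted for the expected difference, e.g. increased by 10%.

You can limit the candidates by keeping only the substrings that start at a word boundary. The rest is the same distance calculation.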

    with txt as (select 'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
         str as (select 'substing bounded' str from dual),
         t1 as (select substr(t.txt, level, length(s.str) * 1.1) substr, /* add 10% length for the match */
                       s.str
                from txt t cross join str s
                /* one candidate per start position, including a match at the very end of the text */
                connect by level <= length(t.txt) - length(s.str) + 1)
    select SUBSTR, STR,
           UTL_MATCH.EDIT_DISTANCE(SUBSTR, STR) distance
    from t1
    order by distance;

SUBSTR               STR                DISTANCE
-------------------- ---------------- ----------
substring bounded    substing bounded          1 
ubstring bounded     substing bounded          3 
 substring bounde    substing bounded          3 
t substring bound    substing bounded          5 
...
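
The word-boundary restriction and the 90% requirement from the question can be combined along these lines. This is a sketch under assumptions: space and comma are taken to be the only delimiters, the names cand, pos, prev_char and candidate are made up for illustration, and UTL_MATCH.EDIT_DISTANCE_SIMILARITY (a 0 to 100 score) is used as the measure of "percent match".

```sql
with txt as (select 'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
     str as (select 'substing bounded' str from dual),
     cand as (
       select level pos,
              -- character before the start position; a space stands in at position 1
              case when level = 1 then ' ' else substr(t.txt, level - 1, 1) end prev_char,
              substr(t.txt, level, length(s.str) * 1.1) candidate, /* add 10% length */
              s.str
       from txt t cross join str s
       connect by level <= length(t.txt) - length(s.str) + 1
     )
select pos, candidate, str,
       utl_match.edit_distance_similarity(candidate, str) similarity
from cand
where prev_char in (' ', ',')                   -- candidate starts right after a delimiter
  and substr(candidate, 1, 1) not in (' ', ',') -- and does not itself start with one
  and utl_match.edit_distance_similarity(candidate, str) >= 90
order by similarity desc;
```

With the example data this should keep the candidate 'substring bounded' and drop the shifted windows around it. Filtering on word starts also cuts the number of distance calculations, which matters for long texts.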