Searching in CLOB for words in a list/table

Question

I have a large table with a CLOB column (100,000+ rows), and I need to search it for specific words within a certain time frame.

select id, clob_field,
       dbms_lob.instr(clob_field, '.doc', 1, 1)  as doc,   -- ideally want .doc
       dbms_lob.instr(clob_field, '.docx', 1, 1) as docx,  -- ideally want .docx
       dbms_lob.instr(clob_field, '.DOC', 1, 1)  as DOC,   -- ideally want .DOC
       dbms_lob.instr(clob_field, '.DOCX', 1, 1) as DOCX   -- ideally want .DOCX
  from clob_table, search_words s
 -- compare date_entered as a date (not a to_char string); covers all of September 2018
 where date_entered >= to_date('01-SEP-2018', 'DD-MON-YYYY')
   and date_entered <  to_date('01-OCT-2018', 'DD-MON-YYYY')
   and contains(clob_field, s.words) > 0;

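The set of words is '.doc', '.DOC', '.docx' and '.DOCX'. When I use CONTAINS() it seems to ignore the dot, so it returns a lot of rows that contain no document extension at all. It picks up emails whose address contains .doc, so that doc has a period on either side of it.

i.e. mail.doc.george@here.com

I don't want those rows. I have tried adding a space at the end of the word, but the space is ignored. I also put the words into a search table I created, as shown above, and the space is still ignored. Any suggestions?

Thanks!!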

Answer 1

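Two suggestions here.

The simple, inefficient way is to use something other than CONTAINS. Context indexes are notoriously tricky to get right. So instead of the last line, you could do: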

AND regexp_instr(clob_field, '\.docx', 1,1,0,'i') > 0
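As a minimal sketch of how that condition might be narrowed, the pattern below accepts both .doc and .docx and requires whitespace or the end of the text right after the extension, so an address like mail.doc.george@here.com is skipped. The sample rows are made up for illustration, and the trailing-whitespace rule is an assumption: an extension followed by a comma or a closing quote would be missed.

```sql
-- Hypothetical sample rows; in the real query these come from clob_table.
with sample_rows as (
  select 1 as id, to_clob('see attached report.doc for details') as clob_field from dual
  union all
  select 2, to_clob('final version: Notes.DOCX') from dual
  union all
  select 3, to_clob('contact mail.doc.george@here.com') from dual
)
select id
  from sample_rows
 -- \.docx?          a literal dot, then "doc" with an optional "x"
 -- ([[:space:]]|$)  followed by whitespace or the end of the text
 -- 'i'              case-insensitive, so .DOC and .DOCX match too
 where regexp_like(clob_field, '\.docx?([[:space:]]|$)', 'i');
```

With these rows the query should return ids 1 and 2 and leave out the email address in row 3.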

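I think the regexp_instr condition should work, but it may be slow on a table this size. That's when you'd use an index, which is the second suggestion. Oracle Text indexes are more complicated than normal indexes, though. This old doc explains that punctuation characters (as defined in the index parameters) are not indexed, because the point of Oracle Text is to index words. If you want special characters to be indexed as part of the word, you need to add them to the set of printjoin characters. This doc explains how, but I'll paste it here. You need to drop your existing CONTEXT index and recreate it with this preference: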

begin
ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
ctx_ddl.set_attribute('mylex', 'printjoins', '._-'); -- periods, underscores, dashes can be parts of words
end;
/

CREATE INDEX myindex on clob_table(clob_field) INDEXTYPE IS CTXSYS.CONTEXT
  parameters ('LEXER mylex');
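One way to check that the new lexer took effect is to look at the tokens the index actually stored. This sketch assumes the index is named myindex as above, in which case Oracle Text keeps its tokens in a table named DR$MYINDEX$I; adjust the name if your index is called something else.

```sql
-- List the stored tokens that contain ".DOC". With the default
-- case-insensitive lexer, tokens are stored in uppercase.
select distinct token_text
  from dr$myindex$i
 where token_text like '%.DOC%'
 order by token_text;
```

If printjoins is working, whole tokens such as REPORT.DOC should show up here instead of separate REPORT and DOC tokens.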

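Keep in mind that CONTEXT indexes are case-insensitive by default. I think that's what you want, but FYI you can change it by setting the 'mixed_case' attribute to 'Y' on the lexer, right below where you set the printjoins attribute above.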
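For reference, a case-sensitive variant of the preference might look like the sketch below. The preference name mylex_cs is hypothetical, and the flag is written as 'YES' here; as far as I know 'Y' and 'YES' are both accepted for this kind of attribute. The index would have to be recreated with 'LEXER mylex_cs' for it to take effect.

```sql
begin
  -- mylex_cs is a made-up name for a case-sensitive lexer preference
  ctx_ddl.create_preference('mylex_cs', 'BASIC_LEXER');
  -- periods, underscores and dashes stay part of the word
  ctx_ddl.set_attribute('mylex_cs', 'printjoins', '._-');
  -- keep tokens in their original case instead of uppercasing them
  ctx_ddl.set_attribute('mylex_cs', 'mixed_case', 'YES');
end;
/
```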

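You also seem to be searching for words that end in .docx, but CONTAINS is not INSTR: by default it matches whole words, not strings of characters. You'd probably want to modify your query to do AND contains(clob_field, '%.docx')>0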
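Putting the pieces together, the whole query might look something like this sketch. It assumes the index has been recreated with the printjoins lexer, and it hard-codes the two extensions rather than joining to the search_words table; the September 2018 range is taken from the question.

```sql
select id, clob_field
  from clob_table
 where date_entered >= date '2018-09-01'
   and date_entered <  date '2018-10-01'
   -- match any indexed word that ends in .doc or .docx (case-insensitive by default)
   and contains(clob_field, '%.doc OR %.docx') > 0;
```

A leading % has to be expanded against every token in the index, so this can still be slow on a large index; treat it as a starting point rather than a tuned query.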
