文本解析优化 - Oracle SQL

Optimization of text parsing - Oracle SQL

我正在处理一个 SQL 查询,该查询将计算 "long text" 中某些单词的出现次数,或者一个 CLOB 数据类型的巨大文本字段。

我的数据集(很大,约 500 万行)看起来像这样:

http://sqlfiddle.com/#!4/2c13d/1

我有一个查询,像这样:

SELECT
  TheTask AS Tasking,
  SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%LONG%' THEN 1 ELSE 0 END) AS LongCount,
  SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%TEXT%' THEN 1 ELSE 0 END) AS TextCount,
  SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%ENGLISH%' THEN 1 ELSE 0 END) AS EnglishCount
FROM
  example
GROUP BY
  TheTask

但是,运行 需要很长时间才能完成完整的数据集(约 3 小时左右)。我相信这是由于 LIKE optimization issues, but I am unsure of how else to achieve this goal dataset. I have tried researching other articles on how to optimize like,但 REGEX 或其他东西是否可能更快?我希望通过评估 LIKE 性能来优化此查询。

CONTEXT索引类型用于索引长文本。您可以使用:

CREATE INDEX idx_TheTaskTxt ON example(TRIM(UPPER(TheTaskText))) INDEXTYPE IS CTXSYS.CONTEXT;

并收集优化器生效的统计信息:

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'EXAMPLE', cascade=>TRUE);

并致电

SELECT
  TheTask AS Tasking,
  SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'LONG', 1) > 0 THEN 1 ELSE 0 END) AS LongCount,
  SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'TEXT', 1) > 0 THEN 1 ELSE 0 END) AS TextCount,
  SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END) AS EnglishCount
FROM example
GROUP BY TheTask
HAVING 
       SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'LONG', 1) > 0 THEN 1 ELSE 0 END) * 
       SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'TEXT', 1) > 0 THEN 1 ELSE 0 END) *
       SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END) 
       IN (0,1)