Optimization of text parsing - Oracle SQL
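I am working on a SQL query that counts the occurrences of certain words in a "long text", that is, a very large text field of the CLOB data type.
My dataset (large, about 5 million rows) looks like this:
http://sqlfiddle.com/#!4/2c13d/1
I have a query like this: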
SELECT
TheTask AS Tasking,
SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%LONG%' THEN 1 ELSE 0 END) AS LongCount,
SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%TEXT%' THEN 1 ELSE 0 END) AS TextCount,
SUM(CASE WHEN TRIM(UPPER(TheTaskText)) LIKE '%ENGLISH%' THEN 1 ELSE 0 END) AS EnglishCount
FROM
example
GROUP BY
TheTask
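However, it takes a very long time to run against the full dataset (about 3 hours). I believe this is due to LIKE optimization issues, but I am unsure how else to get this result. I have tried researching other articles on how to optimize LIKE, but would REGEX or something else be faster? I am hoping to optimize this query by evaluating LIKE performance.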
The CONTEXT index type is meant for indexing long text. A CONTEXT index is created on the column itself rather than on an expression, and its default lexer is case-insensitive, so TRIM(UPPER(...)) is not needed. You can use:
CREATE INDEX idx_TheTaskTxt ON example(TheTaskText) INDEXTYPE IS CTXSYS.CONTEXT;
Then gather statistics so the optimizer can take the index into account:
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'EXAMPLE', cascade=>TRUE);
And query with:
SELECT
TheTask AS Tasking,
SUM(CASE WHEN CONTAINS(TheTaskText, 'LONG', 1) > 0 THEN 1 ELSE 0 END) AS LongCount,
SUM(CASE WHEN CONTAINS(TheTaskText, 'TEXT', 1) > 0 THEN 1 ELSE 0 END) AS TextCount,
SUM(CASE WHEN CONTAINS(TheTaskText, 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END) AS EnglishCount
FROM example
GROUP BY TheTask
HAVING
SUM(CASE WHEN CONTAINS(TheTaskText, 'LONG', 1) > 0 THEN 1 ELSE 0 END) *
SUM(CASE WHEN CONTAINS(TheTaskText, 'TEXT', 1) > 0 THEN 1 ELSE 0 END) *
SUM(CASE WHEN CONTAINS(TheTaskText, 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END)
IN (0,1)
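As a rough sketch of the more conventional way to call CONTAINS, the query below puts it in the WHERE clause, where the optimizer can drive the lookup from the text index, and counts one word per branch. It assumes a CONTEXT index built on the TheTaskText column itself with the default lexer; the SearchWord and MatchCount column names are made up for illustration. Note that CONTAINS matches whole tokens, so 'LONG' would not match "longer" the way LIKE '%LONG%' does, and the counts may differ from the original query.

```sql
-- Sketch only: assumes a CONTEXT index exists on example(TheTaskText).
-- Each branch counts, per task, the rows containing one word as a whole token.
SELECT TheTask AS Tasking, 'LONG' AS SearchWord, COUNT(*) AS MatchCount
FROM example
WHERE CONTAINS(TheTaskText, 'LONG') > 0
GROUP BY TheTask
UNION ALL
SELECT TheTask AS Tasking, 'TEXT' AS SearchWord, COUNT(*) AS MatchCount
FROM example
WHERE CONTAINS(TheTaskText, 'TEXT') > 0
GROUP BY TheTask
UNION ALL
SELECT TheTask AS Tasking, 'ENGLISH' AS SearchWord, COUNT(*) AS MatchCount
FROM example
WHERE CONTAINS(TheTaskText, 'ENGLISH') > 0
GROUP BY TheTask;
```

This returns one row per task and word instead of one row per task, and tasks with no matching rows for a word do not appear at all. Also keep in mind that a CONTEXT index is not refreshed automatically after inserts or updates unless it was created with a sync option; it can be synchronized manually with CTX_DDL.SYNC_INDEX('idx_TheTaskTxt').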