PL SQL Remove HTML Encoding

I know you can strip HTML tags with a command like this:

REGEXP_REPLACE(overview, '<.+?>')

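However, some of the text contains actual HTML encoding, where the application encoded the content itself, e.g. single quotes stored as entities such as &#39; or &rsquo;

I assume these are fairly standard. Is there a way to remove them and replace them with the actual characters, or am I stuck using REPLACE and listing them all out?

Many thanks!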

You can use utl_i18n.unescape_reference():

utl_i18n.unescape_reference(regexp_replace(overview, '<.+?>'))

As a demo:

-- sample data
with t (overview) as (
  select '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>'
  from dual
)
select REGEXP_REPLACE(overview, '<.+?>') as result1,
  utl_i18n.unescape_reference(regexp_replace(overview, '<.+?>')) as result2
from t

which gives:

RESULT1                                             | RESULT2
Some entities: &amp; &#39; &lt; &gt; to be handled  | Some entities: & ' < > to be handled

db<>fiddle
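
If this is needed in more than one query, the two steps could be wrapped in a small helper. The following is only a sketch: the function name strip_html and its VARCHAR2 signature are assumptions for illustration, not part of the original answer. The order of the calls matters: tags are stripped first and entities decoded second, so that text which was legitimately encoded (such as &lt;b&gt;) is not mistaken for a tag and removed.

```sql
-- Hypothetical helper; the name strip_html and the VARCHAR2 signature are assumptions.
create or replace function strip_html (p_text in varchar2)
  return varchar2
  deterministic
is
begin
  -- Step 1: remove anything that looks like a tag (non-greedy match).
  -- Step 2: decode character references such as &amp; or &#39;.
  return utl_i18n.unescape_reference(regexp_replace(p_text, '<.+?>'));
end strip_html;
/

-- Comparing the two possible orderings on text that contains encoded angle brackets:
with t (overview) as (
  select '<p>&lt;b&gt;not a tag&lt;/b&gt;</p>' from dual
)
select strip_html(overview) as strip_then_decode,
       regexp_replace(utl_i18n.unescape_reference(overview), '<.+?>') as decode_then_strip
from t;
```

In the second column the decoded <b> and </b> are removed along with the real tags, which is usually not what is wanted.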


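I'm not endorsing (or attacking) the idea of using regular expressions here; that has been handled, refuted, and discussed elsewhere. I'm only addressing the part about encoded entities.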

Use a proper XML parser:

with t (overview) as (
  SELECT '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>' from dual UNION ALL
  SELECT '<html><head><title>Test</title></head><body><p>&lt;test&gt;</p></body></html>' from dual
)
SELECT x.*
FROM   t
       CROSS JOIN LATERAL (
         SELECT LISTAGG(value) WITHIN GROUP (ORDER BY ROWNUM) AS text
         FROM   XMLTABLE(
                  '//*'
                  PASSING XMLTYPE(t.overview)
                  COLUMNS
                    value CLOB PATH './text()'
                )
      ) x

Which outputs:

TEXT
Some entities: & ' < > to be handled
Test<test>

db<>fiddle here
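
A possible variation on the query above, offered as a sketch rather than a drop-in replacement: it selects the text nodes directly with '//text()' and orders them with a FOR ORDINALITY column instead of ROWNUM. The column name seq and the VARCHAR2(4000) type are assumptions. This may behave better when an element has mixed content (text interleaved with child elements), since each text node then becomes its own row. Either way, XMLTYPE needs well-formed XML, and XML itself only predefines &amp;, &lt;, &gt;, &quot; and &apos;, so HTML-only named entities such as &rsquo; or &nbsp; are likely to fail parsing unless they are handled beforehand.

```sql
with t (overview) as (
  select '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>' from dual
)
select x.text
from   t
       cross join lateral (
         -- Concatenate all text nodes in document order.
         select listagg(n.value) within group (order by n.seq) as text
         from   xmltable(
                  '//text()'
                  passing xmltype(t.overview)
                  columns
                    seq   for ordinality,          -- position of the node in the result sequence
                    value varchar2(4000) path '.'  -- the text node itself, already decoded by the parser
                ) n
       ) x
```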