How to remove empty HTML tags (which contain whitespace and/or their HTML codes)

Question

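I need a regular expression for preg_replace.

This question was not answered in "another question", because not all of the tags I want to remove are strictly empty.

I have to remove not only empty tags from an HTML structure, but also tags that contain only line breaks, whitespace, and/or their HTML codes.

Possible codes are, for example, <br />, &nbsp; and &thinsp;.

Before removing the matching tags: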

<div> 
  <h1>This is a html structure.</h1> 
  <p>This is not empty.</p> 
  <p></p> 
  <p><br /></p>
  <p> <br /> &thinsp;</p>
  <p>&nbsp;</p> 
  <p> &nbsp; </p> 
</div>

After removing the matching tags:

<div> 
  <h1>This is a html structure.</h1> 
  <p>This is not empty.</p> 
</div>

Answer 1

You can use the following:

<([^>\s]+)[^>]*>(?:\s*(?:<br \/>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;)\s*)*<\/\1>

And replace with '' (an empty string).

See DEMO

Note: This also works for empty HTML tags that have attributes.
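
To make the answer easier to try out, here is a minimal sketch that runs the pattern over the sample markup from the question. It assumes the closing tag is matched with a backreference to the first capture group (<\/\1>). The ~ delimiters, the variable names, and the second preg_replace call (which only drops the blank lines left behind) are assumptions of this sketch, not part of the answer.

```php
<?php
// Pattern from the answer; \1 refers back to the tag name captured by ([^>\s]+).
$pattern = '~<([^>\s]+)[^>]*>(?:\s*(?:<br \/>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;)\s*)*<\/\1>~';

// Sample markup from the question (nowdoc, so nothing is interpolated).
$html = <<<'HTML'
<div>
  <h1>This is a html structure.</h1>
  <p>This is not empty.</p>
  <p></p>
  <p><br /></p>
  <p> <br /> &thinsp;</p>
  <p>&nbsp;</p>
  <p> &nbsp; </p>
</div>
HTML;

// Step 1: remove every element whose content is empty or consists only of the listed codes.
$stripped = preg_replace($pattern, '', $html);

// Step 2 (cosmetic, an addition of this sketch): drop lines that are now blank.
$cleaned = preg_replace('~^[ \t]*\R~m', '', $stripped);

echo $cleaned, PHP_EOL;
```

With the sample input this should leave only the div, the h1, and the non-empty p. Two things to keep in mind: a tag that contains nothing but plain whitespace, such as <p> </p>, is not matched by this pattern, and if removing an element leaves its parent empty, the replacement would have to be repeated until the string stops changing.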

Answer 2

Use tidy, with the following function:

function cleaning($string, $tidyConfig = null) {
    $out = array();
    // Default tidy options, used when the caller passes no configuration.
    $config = array(
        'indent' => true,
        'show-body-only' => false,
        'clean' => true,
        'output-xhtml' => true,
        'preserve-entities' => true
    );
    if ($tidyConfig == null) {
        $tidyConfig = $config;
    }
    // Repair the markup (requires the PHP tidy extension).
    $tidy = new tidy();
    $out['full'] = $tidy->repairString($string, $tidyConfig, 'UTF8');
    unset($tidy);
    unset($tidyConfig);
    // Strip everything up to the <body> tag and everything from </body> on.
    $out['body'] = preg_replace("/.*<body[^>]*>|<\/body>.*/si", "", $out['full']);
    // Strip everything outside <style>...</style> and re-wrap the remaining CSS.
    $out['style'] = '<style type="text/css">' . preg_replace("/.*<style[^>]*>|<\/style>.*/si", "", $out['full']) . '</style>';
    return $out;
}

Answer 3

I'm not very good at regex, but try this:

\<.*\>\s*\&.*sp;\s*\<\/.*\>|\<.*\>\s*\<\s*br\s*\/\>\s*\&.*sp;\s*\<\/.*\>|\<.*\>\s*\&.*sp;\s*\<\s*br\s*\/\>\<\/.*\>

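It basically matches:

tags that contain HTML space entities, or
tags that have a break preceding the HTML space entities, or
tags that have a break following the HTML space entities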
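
For anyone who wants to see how this pattern behaves, here is a small sketch that applies it to a few one-line samples. The ~ delimiters, the sample strings, and the $results array are assumptions made for illustration. Because every .* is greedy, a line that holds a non-empty tag followed by an empty one is expected to be removed as a whole, and tags that contain no space entity at all are left alone.

```php
<?php
// Pattern from the answer, wrapped in "~" delimiters (an assumption of this sketch).
$pattern = '~\<.*\>\s*\&.*sp;\s*\<\/.*\>|\<.*\>\s*\<\s*br\s*\/\>\s*\&.*sp;\s*\<\/.*\>|\<.*\>\s*\&.*sp;\s*\<\s*br\s*\/\>\<\/.*\>~';

// Hypothetical one-line samples, mostly taken from the question's markup.
$samples = [
    '<p>&nbsp;</p>',                 // space entity only
    '<p> <br /> &thinsp;</p>',       // break followed by a space entity
    '<p></p>',                       // truly empty, contains no entity
    '<p><br /></p>',                 // break only, contains no entity
    '<p>Keep me.</p><p>&nbsp;</p>',  // non-empty tag followed by an empty one
];

$results = [];
foreach ($samples as $sample) {
    // Replace whatever the pattern matches with an empty string.
    $results[$sample] = preg_replace($pattern, '', $sample);
    printf("%-32s => %s\n", $sample, var_export($results[$sample], true));
}
```

The first two samples come back as empty strings, the third and fourth come back unchanged, and the last one is wiped out entirely, including the paragraph that should have been kept. A pattern that avoids .* inside the tags, for example by using [^>]* instead, would not have that last problem.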

html

php

regex

preg-replace