Extracting the body HTML and cleaning comments using PHP and regex

Question

I want to use PHP and a regular expression to strip comments and some other junk or tags from the <body> part of an HTML document, but my code does not work:

$str=preg_replace_callback('/<body>(.*?)<\/body>/s', 
    function($matches){
        return '<body>'.preg_replace(array(
            '/<!--(.|\s)*?-->/',
        ),
        array(
            '',
        ), $matches[1]).'</body>';
    }, $str);

The problem is that nothing happens. The comments stay where they are, and none of the cleanup is applied. Can you help me? Thanks!

Edit:

Thanks to @mhall, I found that my regex did not work because of the attributes on the <body> tag. I took his code and updated it:

$str = preg_replace_callback('/(?=<body(.*?)>)(.*?)(?<=<\/body>)/s',
    function($matches) {
        return preg_replace('/<!--.*?-->/s', '', $matches[2]);
    }, $str);

This works perfectly!

Thanks, everyone!
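
To illustrate the cause described in the edit, here is a minimal sketch (the sample string is made up for the demonstration). When the pattern finds no match, preg_replace_callback returns the subject unchanged, which is why nothing appeared to happen:

```php
<?php
// Hypothetical sample input: a body tag that carries an attribute.
$html = '<body class="home"><!-- note -->text</body>';

// The literal <body> never matches a tag that has attributes.
var_dump(preg_match('/<body>(.*?)<\/body>/s', $html));      // int(0)

// Allowing any characters up to the closing > makes the tag match.
var_dump(preg_match('/<body[^>]*>(.*?)<\/body>/s', $html)); // int(1)
```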

Answer 1

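Try this. I changed the preg_replace_callback pattern so that the body tags are anchored with lookarounds instead of being re-added by hand in the callback, and replaced (.|\s)*? with .*? in preg_replace. I also dropped the array syntax and added the /s modifier: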

$str = <<<EOS
<html>
    <body>
        <p>
             Here is some <!-- One comment --> text
             with a few <!--
                Another comment
             -->
             Comments in it
        </p>
    </body>
</html>
EOS;

$str = preg_replace_callback('/(?=<body>)(.*?)(?<=<\/body>)/s',
    function($matches) {
        return preg_replace('/<!--.*?-->/s', '', $matches[1]);
    }, $str);

echo $str, PHP_EOL;

Output:

<html>
    <body>
        <p>
             Here is some  text
             with a few 
             Comments in it
        </p>
    </body>
</html>
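
As a variation on the code above (not the answerer's exact pattern), the following sketch also accepts attributes on the body tag and shows that comments outside the body are left alone. The function name stripBodyComments, the sample input, and the use of \b, [^>]* and the /i flag are assumptions added for illustration:

```php
<?php
// Hypothetical helper: strip HTML comments inside <body>...</body> only.
function stripBodyComments(string $html): string
{
    // <body\b[^>]*> also accepts attributes; /i ignores tag case,
    // /s lets the dot span newlines.
    $result = preg_replace_callback(
        '/<body\b[^>]*>.*?<\/body>/is',
        function (array $m): string {
            // Remove every comment, including multi-line ones.
            return preg_replace('/<!--.*?-->/s', '', $m[0]) ?? $m[0];
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$html = "<html><head><!-- keep me --></head>"
      . "<body class=\"home\"><p>a<!-- drop\nme -->b</p></body></html>";

echo stripBodyComments($html), PHP_EOL;
// <html><head><!-- keep me --></head><body class="home"><p>ab</p></body></html>
```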

Answer 2

Aren't you overcomplicating this? You don't need to jump in and out through a callback, because preg_replace already replaces every match:

$parts = explode("<body", $str, 2);
$clean = preg_replace('/<!--.*?-->/s', '', $parts[1]);
$str = $parts[0]."<body".$clean;

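Splitting the string into head and body excludes the head from the replacement without a lot of messy regex. Note the s after the pattern: '/.../s'. It makes the dot in the regex match embedded newlines as well as other characters.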
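
The explode approach can be wrapped in a small function, sketched here under a few assumptions: the name stripCommentsAfterBodyTag is hypothetical, and the guard for input without a "<body" substring is an addition that the snippet above does not have.

```php
<?php
// Hypothetical wrapper around the explode() approach.
function stripCommentsAfterBodyTag(string $html): string
{
    // Split once at the first "<body"; everything before it is left untouched.
    $parts = explode('<body', $html, 2);
    if (count($parts) < 2) {
        return $html; // no "<body" found: return the input unchanged
    }

    // /s lets the dot match newlines, so multi-line comments are removed too.
    $clean = preg_replace('/<!--.*?-->/s', '', $parts[1]);

    // preg_replace returns null on a PCRE error; fall back to the uncleaned part.
    return $parts[0] . '<body' . ($clean ?? $parts[1]);
}

$html = "<head><!-- h --></head><body id=\"x\">a<!-- c\n-->b</body>";
echo stripCommentsAfterBodyTag($html), PHP_EOL;
// <head><!-- h --></head><body id="x">ab</body>
```

Two trade-offs of this design: explode is case-sensitive, so an uppercase <BODY> tag would not be found, and comments that appear after </body> are removed as well, because everything following the first "<body" is cleaned.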
