Regular expression pattern to remove parentheses (and any parentheses inside)

The input is the first paragraph of a Wikipedia page. I want to remove the parentheses themselves and everything between them.

However, sometimes (often) the HTML content inside the parentheses itself contains one or more parentheses, usually in a link's href="".

Take the following content:

<p>
    The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> (from Greek σαρξ <i>sarx</i>, flesh, and πτερυξ <i>pteryx</i>, fin) – sometimes considered synonymous with <b>Crossopterygii</b> ("fringe-finned fish", from Greek κροσσός <i>krossos</i>, fringe) – constitute a <a href="/wiki/Clade" title="Clade">clade</a> (traditionally a <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> or subclass) of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>

I want the final result to be:

<p>
    The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> – sometimes considered synonymous with <b>Crossopterygii</b> – constitute a <a href="/wiki/Clade" title="Clade">clade</a> of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>

But when I use the preg_replace pattern below, it doesn't work: it gets confused by the parentheses inside the parentheses.

public function removeParentheses( $content ) {

    $pattern = '@\(.*?\)@';
    $content = preg_replace( $pattern, '', $content );
    $content = str_replace( ' .', '.', $content );
    $content = str_replace( '  ', ' ', $content );
    return $content;
}

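Secondly, how can the parentheses inside the links' href="" and title="" attributes be preserved? These (when they are not inside text parentheses) are important.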

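You can replace all links with placeholders, then remove all parentheses, and finally swap the placeholders back for their original values.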

This is done with preg_replace_callback(), passing an occurrences counter and a replacements array to keep track of the links, then using your own removeParentheses() to get rid of the parentheses, and finally using str_replace() with array_keys() and array_values() to restore your links:

<?php
$string = '<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> (from Greek σαρξ <i>sarx</i>, flesh, and πτερυξ <i>pteryx</i>, fin) – sometimes considered synonymous with <b>Crossopterygii</b> ("fringe-finned fish", from Greek κροσσός <i>krossos</i>, fringe) – constitute a <a href="/wiki/Clade" title="Clade">clade</a> (traditionally a <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> or subclass) of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>';
$occurrences = 0;
$replacements = [];
$replacedString = preg_replace_callback("/<a .*?>.*?<\/a>/i", function($el) use (&$occurrences, &$replacements) {
    // Delimit the counter on both sides so "|||1|||" can never match inside "|||10|||"
    $placeholder = "|||".$occurrences++."|||";
    $replacements[$placeholder] = $el[0];
    return $placeholder;
}, $string);
function removeParentheses( $content ) {
    $pattern = '@\(.*?\)@';
    $content = preg_replace( $pattern, '', $content );
    $content = str_replace( ' .', '.', $content );
    $content = str_replace( '  ', ' ', $content );
    return $content;
}
$replacedString = removeParentheses($replacedString);
$replacedString = str_replace(array_keys($replacements), array_values($replacements), $replacedString); // get your links back
echo $replacedString;

Demo

Result

<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> – sometimes considered synonymous with <b>Crossopterygii</b> – constitute a <a href="/wiki/Clade" title="Clade">clade</a> of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>

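But this seems a bit fragile to me. As others have told you in the comments, you shouldn't parse HTML with regular expressions, because the markup can change and you can get unexpected results. Still, this might get you moving in the right direction.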
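If you want to lean less on one big pattern, here is a rough sketch of a different technique (not the placeholder approach above): split the string into tag and text tokens, track the parenthesis depth only in the text tokens, and drop everything while the depth is above zero. The function name removeParenthesesOutsideTags() is made up for this example. It assumes that parentheses never enclose only half of an element (an opening tag without its closing tag), and it leaves the whitespace cleanup to the str_replace() calls you already have.

```php
<?php
// Sketch only: hypothetical helper, not a full HTML parser.
function removeParenthesesOutsideTags(string $html): string
{
    // Split into alternating text and tag tokens; the tags are kept as tokens.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $depth = 0; // current nesting level of text parentheses
    $out = '';

    foreach ($tokens as $token) {
        if (preg_match('/^<[^>]*>$/', $token)) {
            // A tag: parentheses in its attributes are never counted.
            // Keep the tag only when we are outside text parentheses.
            if ($depth === 0) {
                $out .= $token;
            }
            continue;
        }

        // A text token: split on "(" and ")" and walk the pieces.
        foreach (preg_split('/([()])/', $token, -1, PREG_SPLIT_DELIM_CAPTURE) as $piece) {
            if ($piece === '(') {
                $depth++;
            } elseif ($piece === ')') {
                if ($depth > 0) {
                    $depth--;
                } else {
                    $out .= $piece; // stray ")" with no opener: keep it
                }
            } elseif ($depth === 0) {
                $out .= $piece;
            }
        }
    }

    return $out;
}

$sample = '<p>The fish (from <i>sarx</i>, fin) are a '
        . '<a href="/wiki/Class_(biology)" title="Class (biology)">class</a> (or (sub)class).</p>';
echo removeParenthesesOutsideTags($sample), "\n";
```

Because the depth is a counter, this also copes with parentheses nested inside parentheses. It is still string munging rather than real HTML parsing, so treat it as a starting point.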

Edit: regarding parentheses inside parentheses, you can use a recursive pattern. Take a look at this great answer by Bart Kiers:

function removeParentheses( $content ) {
    $pattern = '@\(([^()]|(?R))*\)@';
    $content = preg_replace( $pattern, '', $content );
    $content = str_replace( ' .', '.', $content );
    $content = str_replace( '  ', ' ', $content );
    return $content;
}

Demo
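To see what the recursive pattern does with nested and unbalanced input, here is a small standalone check. The helper name stripNestedParentheses() is made up for this example, and the pattern is a slight variation on the one above (non-capturing group, possessive quantifier), which should behave the same on balanced input. Note that preg_replace() returns null on failure, so the sketch falls back to the original text in that case.

```php
<?php
// Hypothetical helper: the recursive pattern from the answer, lightly tightened.
function stripNestedParentheses(string $text): string
{
    // (?R) recurses into the whole pattern, so each "(" must find its own ")".
    $result = preg_replace('@\((?:[^()]++|(?R))*\)@', '', $text);

    // preg_replace() returns null on error (for example, a PCRE limit was hit).
    return $result ?? $text;
}

echo stripNestedParentheses('a (b (c) d) e'), "\n"; // prints "a  e" (two spaces remain)
echo stripNestedParentheses('a (b (c d) e'), "\n";  // unbalanced: only "(c d)" is removed
```

The leftover double space is what the str_replace() calls in removeParentheses() clean up afterwards.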