Regular expression pattern to remove parentheses (and any parentheses inside them)
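The input is the first paragraph of a Wikipedia page. I want to remove the parentheses along with everything between them.
However, sometimes (often) the HTML inside the parentheses itself contains one or more parentheses, usually in a link's href="".
Take the following: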
<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> (from Greek σαρξ <i>sarx</i>, flesh, and πτερυξ <i>pteryx</i>, fin) – sometimes considered synonymous with <b>Crossopterygii</b> ("fringe-finned fish", from Greek κροσσός <i>krossos</i>, fringe) – constitute a <a href="/wiki/Clade" title="Clade">clade</a> (traditionally a <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> or subclass) of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>
I want the end result to be:
<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> – sometimes considered synonymous with <b>Crossopterygii</b> – constitute a <a href="/wiki/Clade" title="Clade">clade</a> of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>
But when I use the preg_replace pattern below it doesn't work; it gets confused by the parentheses inside the parentheses.
public function removeParentheses( $content ) {
$pattern = '@\(.*?\)@';
$content = preg_replace( $pattern, '', $content );
$content = str_replace( ' .', '.', $content );
$content = str_replace( '  ', ' ', $content );
return $content;
}
Secondly, how can the parentheses in links' href="" and title="" be preserved? These (when not inside text parentheses) are important.
You can replace all the links with placeholders, then remove all the parentheses, and finally swap the placeholders back for the original values.
This is done with preg_replace_callback(), passing an occurrences counter and a replacements array by reference to keep track of the links, then using your own removeParentheses() to get rid of the parentheses, and finally using str_replace() with array_keys() and array_values() to restore your links:
<?php
$string = '<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> (from Greek σαρξ <i>sarx</i>, flesh, and πτερυξ <i>pteryx</i>, fin) – sometimes considered synonymous with <b>Crossopterygii</b> ("fringe-finned fish", from Greek κροσσός <i>krossos</i>, fringe) – constitute a <a href="/wiki/Clade" title="Clade">clade</a> (traditionally a <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> or subclass) of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>';
$occurrences = 0;
$replacements = [];
$replacedString = preg_replace_callback("/<a .*?>.*?<\/a>/i", function($el) use (&$occurrences, &$replacements) {
    // the ||| are just to avoid unwanted matches; the closing ||| keeps |||1 from also matching inside |||10
    $key = "|||" . $occurrences++ . "|||";
    $replacements[$key] = $el[0]; // remember the original link under its placeholder
    return $key;
}, $string);
function removeParentheses( $content ) {
$pattern = '@\(.*?\)@';
$content = preg_replace( $pattern, '', $content );
$content = str_replace( ' .', '.', $content );
$content = str_replace( '  ', ' ', $content ); // collapse the double space left where a parenthetical was removed
return $content;
}
$replacedString = removeParentheses($replacedString);
$replacedString = str_replace(array_keys($replacements), array_values($replacements), $replacedString); // get your links back
echo $replacedString;
Result:
<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> – sometimes considered synonymous with <b>Crossopterygii</b> – constitute a <a href="/wiki/Clade" title="Clade">clade</a> of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>
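But this is a bit fragile in my opinion. As others have told you in the comments, you shouldn't parse HTML with regular expressions. Things can change and you can get unexpected results. Still, this may get you headed in the right direction.
Edit: as for parentheses inside parentheses, you can use a recursive pattern. See this great answer by Bart Kiers: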
function removeParentheses( $content ) {
$pattern = '@\(([^()]|(?R))*\)@';
$content = preg_replace( $pattern, '', $content );
$content = str_replace( ' .', '.', $content );
$content = str_replace( '  ', ' ', $content ); // collapse the double space left where a parenthetical was removed
return $content;
}
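To make the difference between the two patterns concrete, here is a minimal sketch that runs both on a small nested string and then combines the recursive pattern with the placeholder idea from above. The helper names stripLazy(), stripNested() and stripOutsideLinks() are made up for illustration, the possessive [^()]++ is a small variation on the pattern above rather than something taken from the linked answer, and the placeholder format with a closing ||| is an assumption. It is still regex-on-HTML, so the same fragility caveat applies.

```php
<?php
// Hypothetical helpers for illustration; only the two patterns come from the thread.

// Lazy pattern from the question: .*? stops at the FIRST ")" it meets,
// so a nested "(" leaves a dangling ")" behind.
function stripLazy(string $s): string {
    return preg_replace('@\(.*?\)@', '', $s);
}

// Recursive pattern: (?R) re-enters the whole pattern for every nested "(",
// so only balanced groups are removed.
function stripNested(string $s): string {
    return preg_replace('@\((?:[^()]++|(?R))*\)@', '', $s);
}

// Placeholder approach combined with the recursive pattern.
function stripOutsideLinks(string $html): string {
    $links = [];
    // Swap every <a ...>...</a> for a placeholder so parentheses in href/title are hidden.
    $masked = preg_replace_callback('@<a\s.*?>.*?</a>@is', function ($m) use (&$links) {
        $key = '|||' . count($links) . '|||'; // assumed placeholder format
        $links[$key] = $m[0];
        return $key;
    }, $html);
    // Remove balanced parentheticals, then collapse runs of spaces left behind.
    $clean = preg_replace('@ {2,}@', ' ', stripNested($masked));
    // Put back the links that survived (those inside parentheses are gone with them).
    return str_replace(array_keys($links), array_values($links), $clean);
}

$sample = 'a (b (c) d) e';
echo stripLazy($sample), "\n";   // a  d) e   (stray ")" left over)
echo stripNested($sample), "\n"; // a  e

$html = '<a href="/wiki/Class_(biology)" title="Class (biology)">class</a> '
      . '(see (also) <a href="/x_(y)">z</a>) end.';
echo stripOutsideLinks($html), "\n";
// <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> end.
```

The recursive pattern only helps when the parentheses are balanced; a stray "(" or ")" in the text, or in an attribute that was not masked, will still throw it off.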