How to add a restriction in a regex
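I have a regex function that lets me replace the Xth occurrence of a word in a text.
I am trying to add a condition: if the word is inside an <h1>, <h2>, or <h3> tag, or inside an image's alt attribute, it should not be replaced. Could someone help me edit the function to add this condition?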
public function str_ireplace_n($search, $replace, $subject, $occurrence)
{
    $search = preg_quote($search);
    return preg_replace("/^((?:(?:.*?$search){" . --$occurrence . "}.*?))$search/i", "$replace", $subject);
}
Example:
$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et <h2>Lorem ipsum dolor sit</h2> justo non quam laoreet euismod. Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.';
// Replace the second "Lorem" in this text with a link
$text = $this->str_ireplace_n('Lorem', ' <a href="' . $domain . '" alt="">Lorem</a> ', $text, 2); // 2 for the second occurrence
// The result adds a link to the "Lorem" inside the <h1>, and I want to avoid this.
// I want the regex to do nothing when the keyword is inside h1, h2, or the alt of an image.
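I don't choose which "Lorem" gets replaced; the occurrence is picked at random. When it falls inside <h1>/<h2> or an image alt, I must make sure nothing is done.
Thanks in advance.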
Personally, I would start with something like preg_split:
$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et <h2>Lorem ipsum dolor sit</h2> justo non quam laoreet euismod. Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.';
$split = preg_split('/(<[^\/]+(?:\/|<\/[^>]+)>)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
This gives you the following (which is the basic thing we need):
Array
(
[0] => Lorem ipsum dolor sit amet, consectetur adipiscing elit.
[1] => <h1>Lorem ipsum dolor sit</h1>
[2] => Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et
[3] => <h2>Lorem ipsum dolor sit</h2>
[4] => justo non quam laoreet euismod. Ut eget dapibus ligula.
[5] => <img src="url" alt="Lorem ipsum dolor sit"/>
[6] => Vestibulum vestibulum.
)
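Now the pieces that sit inside tags are isolated. We can loop over this collection and check whether the leading character of each piece is < to tell whether it is inside or outside a tag. This should work as long as your tags end with </...> or />.
Basically, the HTML tag plus its content becomes the delimiter, and we capture it as well.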
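The loop described above can be sketched as follows. This is only a minimal illustration on a shortened sample string; the helper name classify_pieces is hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper: label each piece returned by preg_split as 'tag' or 'text'.
function classify_pieces(string $html): array
{
    // Same pattern as above; -1 means "no limit".
    $pieces = preg_split('/(<[^\/]+(?:\/|<\/[^>]+)>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $labelled = [];
    foreach ($pieces as $piece) {
        // A piece that starts with "<" is a captured delimiter (tag plus content).
        $kind = (strpos($piece, '<') === 0) ? 'tag' : 'text';
        $labelled[] = [$kind, $piece];
    }
    return $labelled;
}

$string = 'Lorem ipsum. <h1>Lorem ipsum dolor sit</h1> Proin libero. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum.';

foreach (classify_pieces($string) as [$kind, $piece]) {
    echo $kind, ': ', trim($piece), PHP_EOL;
}
```

Only the pieces labelled 'text' are candidates for replacement; the 'tag' pieces are passed through unchanged.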
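The point is that a simple regex cannot parse HTML, because HTML is not a regular language. So we have to do some work in PHP to tie things together. We can break the problem down and simplify it with simple regexes, as I have done here.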
$subject = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et <h2>Lorem ipsum dolor sit</h2> Lorem justo non quam laoreet euismod. Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.';
// word to replace
$search = 'Lorem';
// stuff to replace with
$replace = '<a href="Lorem">foo</a>';
// which match to replace
$occurrence = 2;
function str_ireplace_n($search, $replace, $subject, $occurrence){
    // escape the search term, including the "/" delimiter used below
    $search = preg_quote($search, '/');
    // separate the HTML from the "body" text (-1 means no limit)
    $split = preg_split('/(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)/', $subject, -1, PREG_SPLIT_DELIM_CAPTURE);
    // the number of current matches (counted per text piece, not per individual word)
    $match = 0;
    foreach($split as &$s){
        // if strpos of < is 0 it's the first character, meaning it's part of the HTML (we don't want that)
        // otherwise, check whether the piece matches the search
        if(0 !== strpos($s,'<') && preg_match('/\b'.$search.'\b/i', $s)){
            // increment the match counter
            ++$match;
            // replace the match if it's the nth one
            if($match == $occurrence) $s = preg_replace('/\b'.$search.'\b/i',$replace,$s);
        }
    }
    unset($s); // break the reference left over from the foreach
    return implode('', $split);
}
echo str_ireplace_n($search, $replace, $subject, $occurrence);
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1>
Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et
<h2>Lorem ipsum dolor sit</h2> <a href="Lorem">foo</a> justo non quam laoreet euismod.
Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.
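This is the part that was replaced: <a href="Lorem">foo</a>
I added a few line breaks to the output for readability, and I added another "Lorem" to the input because there was no second match outside the HTML tags. In any case, notice that nothing inside the HTML tags was modified. Only the second match was changed.
It is not 100% clear what you need (which is usually the case with these types of questions), so I tried to explain how to do it rather than just doing it.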
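One caveat with the function above: it counts text pieces that contain the word, not individual matches, and it replaces every match inside the chosen piece. If a single stretch of text can contain the word more than once, a variant along the following lines counts each match separately. This is only a sketch under that assumption. The name replace_nth_outside_tags is hypothetical, it swaps preg_replace for preg_replace_callback, and it tells delimiters from text by array index rather than by the leading < character.

```php
<?php
// Hypothetical variant: replace only the $occurrence-th match found outside
// <h1>, <h2>, <h3> and <img> tags, counting every match individually.
function replace_nth_outside_tags(string $search, string $replace, string $subject, int $occurrence): string
{
    $word = '/\b' . preg_quote($search, '/') . '\b/i';

    // Split as before: the protected tags (with their content) become captured delimiters.
    $pieces = preg_split('/(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)/', $subject, -1, PREG_SPLIT_DELIM_CAPTURE);

    $seen = 0;
    foreach ($pieces as $i => $piece) {
        // With one capture group and no other flags, odd indexes hold the delimiters.
        if ($i % 2 === 1) {
            continue; // protected tag: leave untouched
        }
        $pieces[$i] = preg_replace_callback($word, function (array $m) use (&$seen, $occurrence, $replace) {
            // Count this match; swap in the replacement only for the nth one.
            return (++$seen === $occurrence) ? $replace : $m[0];
        }, $piece);
    }
    return implode('', $pieces);
}

echo replace_nth_outside_tags('Lorem', '<a href="#">Lorem</a>', 'Lorem and Lorem <h1>Lorem</h1> Lorem', 2), PHP_EOL;
// prints: Lorem and <a href="#">Lorem</a> <h1>Lorem</h1> Lorem
```

Because the callback returns the replacement as a plain string, characters such as $ or \ in the replacement are not interpreted as backreferences. The same restriction as before still applies: the protected tags must end with </...> or />.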