PHP Regex/str_replace strange non-match

Question

This one is baffling me a bit. I cannot work out why http://www.example.com/a/b/c ends up as https://www.example.net/newlink//b/c. My best guess is that it conflicts with the first match, but why?

Code:

$contents = '
<a href="http://www.example.com/a">Works</a>
<a href="http://www.example.com/a/b/c">Doesnt Work</a>
<a href="http://www.example.com/x/y/z">Works</a>';

$regexp = "/<a\s[^>]*href=\"([^\"]*)\"[^>]*>(.*)<\/a>/siU";
if (preg_match_all($regexp, $contents, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        print_r($match);
        if (!empty($match[1])) {
            $urlString = 'https://www.example.net/newlink/';
            $contents = str_replace($match[1], $urlString, $contents);
        }
    }
}

echo $contents;

Output:

Array
(
    [0] => <a href="http://www.example.com/a">Works</a>
    [1] => http://www.example.com/a
    [2] => Works
)
Array
(
    [0] => <a href="http://www.example.com/a/b/c">Doesnt Work</a>
    [1] => http://www.example.com/a/b/c
    [2] => Doesnt Work
)
Array
(
    [0] => <a href="http://www.example.com/x/y/z">Works</a>
    [1] => http://www.example.com/x/y/z
    [2] => Works
)

    <a href="https://www.example.net/newlink/">Works</a>
    <a href="https://www.example.net/newlink//b/c">Doesnt Work</a>
    <a href="https://www.example.net/newlink/">Works</a>

https://eval.in/528426

Answer 1

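See the manual for str_replace().

On the first iteration it replaces both occurrences of http://www.example.com/a with https://www.example.net/newlink/.
After that it can no longer find http://www.example.com/a/b/c, because by then that URL has already become https://www.example.net/newlink//b/c.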
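
To make the double replacement visible, here is a minimal sketch that runs only the first iteration's replacement on two of the links from the question. It relies on the optional fourth parameter of str_replace(), which receives the number of replacements performed; the variable names $result and $count are just illustrative.

```php
<?php
// Two of the links from the question: the first URL is a prefix of the second.
$contents = '<a href="http://www.example.com/a">Works</a>
<a href="http://www.example.com/a/b/c">Doesnt Work</a>';

// str_replace() replaces every occurrence of the search string,
// including the one that is only the start of the longer URL.
$result = str_replace(
    'http://www.example.com/a',
    'https://www.example.net/newlink/',
    $contents,
    $count // filled in by str_replace() with the number of replacements
);

echo $count, "\n"; // prints 2
echo $result, "\n";
```

The second link is left as https://www.example.net/newlink//b/c, which is why the later iteration has nothing to match.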

Edit: This should work: $contents = str_replace('"'.$match[1].'"', '"'.$urlString.'"', $contents); // include the quotes in search/replace
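
As a rough sketch, this is how the quoted search could look when dropped into the loop from the question (the print_r() call is omitted here). It assumes every href is wrapped in double quotes, as in the sample input; single-quoted or unquoted attributes would not be matched by this approach.

```php
<?php
// Sample input taken from the question.
$contents = '<a href="http://www.example.com/a">Works</a>
<a href="http://www.example.com/a/b/c">Doesnt Work</a>
<a href="http://www.example.com/x/y/z">Works</a>';

$urlString = 'https://www.example.net/newlink/';
$regexp = "/<a\s[^>]*href=\"([^\"]*)\"[^>]*>(.*)<\/a>/siU";

if (preg_match_all($regexp, $contents, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        if (!empty($match[1])) {
            // Search for the URL together with its surrounding quotes, so that
            // "http://www.example.com/a" cannot match inside a longer URL.
            $contents = str_replace('"' . $match[1] . '"', '"' . $urlString . '"', $contents);
        }
    }
}

echo $contents;
```

With the closing quote as part of the search string, the first iteration only touches the link whose href is exactly http://www.example.com/a, and the longer URL is still intact when its own iteration comes around.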

Answer 2

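The problem is that 2 replacements are performed in $contents during the first iteration, because it contains 2 occurrences of the substring http://www.example.com/a.

A possible solution is to use preg_replace_callback with a pattern that captures all the parts that need to be kept and merely matches, without capturing, the part that needs to be replaced:

See IDEONE demo: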

$contents = '<a href="http://www.example.com/a">Works</a>
<a href="http://www.example.com/a/b/c">Doesnt Work</a>
<a href="http://www.example.com/x/y/z">Works</a>';
$regexp = "/(<a\s[^>]*href=\")[^\"]*(\"[^>]*>.*<\/a>)/siU";
$contents = preg_replace_callback($regexp, function($m) {
  return $m[1] . 'https://www.example.net/newlink/' . $m[2];
}, $contents);
echo $contents;

However, if you are dealing with HTML, I would rather use a DOM-based solution. Here is how to set all links to point to https://www.example.net/newlink/:

$html = <<<DATA
<a href="http://www.example.com/a">Works</a>
<a href="http://www.example.com/a/b/c">Doesnt Work</a>
<a href="http://www.example.com/x/y/z">Works</a>
DATA;

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
$links = $xpath->query('//a');

foreach($links as $link) { 
   $link->setAttribute('href', 'https://www.example.net/newlink/');
}
echo $dom->saveHTML();

See another demo.

php

regex

html-parsing