Add trailing slash to links with preg_replace
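Some internal links in my site's content have no trailing "/", which is causing crawl issues for me. I want to search for those links and replace them, so https://www.example.com/slug should become https://www.example.com/slug/. I am using the following function, which takes the full content of a page and replaces all the necessary links on it: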
function str_replace_links($subject, &$count) {
//match the first part of the link http://www.example.com{/slug}
$regex = '/(https:\/\/www.example.com)(\/[a-zA-Z_0-9\-]*)*';
//check for the trailing '/' or if it is a file
$regex .= '([^(\/|\.js|\.css|\.xml|\.less|\.png|\.jpg|\.jpeg|\.gif|\.pdf|\.doc|\.txt|\.ico|\.rss|\.zip|\.mp3|\.rar|\.exe|\.wmv|\.doc|\.avi|\.ppt|\.mpg|\.mpeg|\.tif|\.wav|\.mov|\.psd|\.ai|\.xls|\.mp4|\.m4a|\.swf|\.dat|\.dmg|\.iso|\.flv|\.torrent|\.ttf|\.woff|\.svg|\.eot|\.woff2)])';
//finish off the regex
$regex .= '/i';
$i; // counter for # changed
$content = preg_replace($regex, '/', $subject, 1, $i);
$count += $i;
return $content;
}
I have tried testing it with a string containing several links:
$string ='
<a href="https://www.example.com/slug1/page">1</a><br/>
<a href="https://www.example.com/slug2/page">2</a><br/>
<a href="https://www.example.com/slug1/page/">3</a><br/>
<a href="https://www.example.com/slug2/page/">4</a><br/>
<a href="https://www.example.com/">5</a><br/>
<a href="https://www.example.com">5b</a><br/>
<a href="https://www.example.com/style.css">6</a><br/>
<a href="https://www.example.com/style.jpg">7</a><br/>
<a href="https://www.example.com/style.png">8</a><br/>
<a href="https://www.example.com/style.pdf">9</a><br/>
';
echo str_replace_links($string, $switch);
However, this does not produce the correct result:
<a href="https://www.example.com/page/>1</a><br/>
<a href="https://www.example.com/page/>2</a><br/>
<a href="https://www.example.com//>3</a><br/>
<a href="https://www.example.com//>4</a><br/>
<a href="https://www.example.com//>5</a><br/>
<a href="https://www.example.com/>5b</a><br/>
<a href="https://www.example.com/st/le.css">6</a><br/>
<a href="https://www.example.com/st/le.jpg">7</a><br/>
<a href="https://www.example.com/st/le.png">8</a><br/>
<a href="https://www.example.com/st/le.pdf">9</a><br/>
Any help with the regex would be greatly appreciated.
You can do this with a URL validator regex that has been adapted for the purpose.
~(?i)(?<=")((?!mailto:)(?:[a-z]*:\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{a1}-\x{ffff}]{2,})))|localhost)(:\d{2,5})?(?:\/(?:[^\s/]*/)*[^\s/.]+)?)(?=")~
https://regex101.com/r/GcT8ZU/1
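As a rough sketch of how this pattern could be applied in PHP: the whole URL is capture group 1, so replacing each match with `$1/` appends the slash. The `u` modifier is my addition, on the assumption that PCRE only accepts `\x{ffff}` in UTF mode. The function name `add_trailing_slash` and the sample input are made up for illustration.

```php
<?php
// Hypothetical wrapper around the pattern above; the name is illustrative.
function add_trailing_slash($html, &$count = 0)
{
    // Same pattern as above, with the "u" modifier appended
    // (assumption: needed so that \x{ffff} compiles).
    $regex = '~(?i)(?<=")((?!mailto:)(?:[a-z]*:\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{a1}-\x{ffff}]{2,})))|localhost)(:\d{2,5})?(?:\/(?:[^\s/]*/)*[^\s/.]+)?)(?=")~u';

    // Group 1 is the matched URL; write it back and append "/".
    // No limit is passed, so every match in the subject is replaced.
    $result = preg_replace($regex, '$1/', $html, -1, $count);

    // preg_replace() returns null on error; fall back to the input unchanged.
    return $result === null ? $html : $result;
}

// Made-up sample input, modeled on the test string in the question.
$html = '<a href="https://www.example.com/slug1/page">1</a>' . "\n"
      . '<a href="https://www.example.com/slug1/page/">3</a>' . "\n"
      . '<a href="https://www.example.com/style.css">6</a>' . "\n";

$changed = 0;
echo add_trailing_slash($html, $changed);
```

Only the first link should be rewritten here: the second already ends in a slash, and the last path segment of the third contains a dot, which the pattern excludes.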
Formatted
(?i)
(?<= " )
(                                   # (1 start)
  (?! mailto: )
  (?: [a-z]* :\/\/ )?
  (?:
    \S+
    (?: : \S* )?
    @
  )?
  (?:
    (?:
      (?:
        [1-9] \d?
        | 1 \d\d
        | 2 [01] \d
        | 22 [0-3]
      )
      (?:
        \.
        (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
      ){2}
      (?:
        \.
        (?:
          [1-9] \d?
          | 1 \d\d
          | 2 [0-4] \d
          | 25 [0-4]
        )
      )
      | (?:
        (?: [a-z\x{a1}-\x{ffff}0-9]+ -? )*
        [a-z\x{a1}-\x{ffff}0-9]+
      )
      (?:
        \.
        (?: [a-z\x{a1}-\x{ffff}0-9]+ -? )*
        [a-z\x{a1}-\x{ffff}0-9]+
      )*
      (?:
        \.
        (?: [a-z\x{a1}-\x{ffff}]{2,} )
      )
    )
    | localhost
  )
  ( : \d{2,5} )?                    # (2)
  (?:
    \/
    (?: [^\s/]* / )*
    [^\s/.]+
  )?
)                                   # (1 end)
(?= " )
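If the full validator is more than you need, a narrower alternative might look like the sketch below. It is not the pattern above but a different technique: `preg_replace_callback` with a host-specific pattern, where the decision is made in plain PHP. It assumes every internal link is a double-quoted `https://www.example.com` URL, and that a dot in the last path segment means a file. The function name `add_slash_to_internal_links` is hypothetical.

```php
<?php
// Hypothetical helper: appends "/" to internal links that lack one.
function add_slash_to_internal_links($html, &$count = 0)
{
    // Match a double-quoted URL on the assumed host. The path may not
    // contain "?" or "#", so URLs with a query or fragment are skipped.
    $pattern = '~(?<=")https://www\.example\.com(?:/[^"\s?#]*)?(?=")~i';

    return preg_replace_callback(
        $pattern,
        function (array $m) use (&$count) {
            $url = $m[0];

            // Already ends with a slash: leave it alone.
            if (substr($url, -1) === '/') {
                return $url;
            }

            // A dot in the last path segment is treated as a file extension.
            $path = (string) parse_url($url, PHP_URL_PATH);
            if (strpos(basename($path), '.') !== false) {
                return $url;
            }

            $count++;
            return $url . '/';
        },
        $html
    );
}

// Made-up sample input, modeled on the test string in the question.
$sample = '<a href="https://www.example.com/slug1/page">1</a>' . "\n"
        . '<a href="https://www.example.com/slug1/page/">3</a>' . "\n"
        . '<a href="https://www.example.com">5b</a>' . "\n"
        . '<a href="https://www.example.com/style.css">6</a>' . "\n";

$changed = 0;
echo add_slash_to_internal_links($sample, $changed);
```

Doing the check in a callback avoids having to list every file extension inside the regex, at the cost of treating any dotted last segment as a file.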