preg_replace to strip all query strings for a matching host within an HTML file

Question

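Given a static HTML file containing links with query strings from multiple hosts/domains, how can I use preg_replace to strip all query strings for one specific host only?

DOMDocument is not an option in this case.
Every query string on that host should be removed, leaving only a URL segment such as /foo/ or a file path such as http://example.com/bar.jpg

Example input: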

<span><a href="http://domainneedingstripping.com/path/file.jpg?string=blah">x</a>
</span><img src="http://otherdomain.com?dontStripThis=true" />
<p>And much more content as in a full HTML doc</p>

Expected output:

<span><a href="http://domainneedingstripping.com/path/file.jpg">x</a>
</span><img src="http://otherdomain.com?dontStripThis=true" />
<p>And much more content as in a full HTML doc</p>

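^ Note that only one domain's query strings need to be stripped; URLs from any other host keep whatever query strings they contain.

I have found regex examples that remove the query string from a single URL, but none that work across a full document. I assume that, starting from one of them, I could work out how to restrict it to a specific host/domain.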

Answer 1

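In PHP, you can load the contents of the HTML file with the file_get_contents function:

$content = file_get_contents("myFile.html");

The $_SERVER superglobal can then give you the client's IP address:

$ip = $_SERVER['REMOTE_ADDR'];

After that, you can apply preg_replace() to $content however you like.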
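
To make that last step concrete, here is a minimal sketch of what the preg_replace() call could look like. The function name stripQueryStrings and its parameters are hypothetical, not part of the original answer, and the sketch assumes that attribute values are quoted and that URLs start with http:// or https://.

```php
<?php
// Hypothetical helper: removes the query string from quoted URLs on $host only.
function stripQueryStrings(string $html, string $host): string
{
    // preg_quote() escapes the dots in the host name; "~" is the pattern delimiter.
    // The path must be empty or start with "/", and neither the path nor the
    // query may contain a quote, so a match cannot run past the attribute.
    $pattern = '~([\'"])(https?://' . preg_quote($host, '~') . ')((?:/[^\'"?]*)?)\?[^\'"]*\1~i';

    // Keep the opening quote, scheme + host, and path; reuse the same quote to close.
    return preg_replace($pattern, '$1$2$3$1', $html);
}
```

With this in place, the $content loaded above could be passed as stripQueryStrings($content, 'domainneedingstripping.com') and the result written back with file_put_contents().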

Answer 2

$string = '
    <span><a href="http://domainneedingstripping.com/path/file.jpg?string=blah">x</a></span>
    <img src="http://otherdomain.com?dontStripThis=true" />
    <p>And much more content as in a full HTML doc</p>
    <span><a href="http://domainneedingstripping.com/otherpath/otherfile.jpg?string=blah">x</a></span>';

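// Match a quoted URL on the target host. The path and the query are limited to
// non-quote characters, so a match cannot run past the end of the attribute,
// and \1 requires the closing quote to be the same as the opening one.
$pattern = '~([\'"])(http://domainneedingstripping\.com)((?:/[^\'"?]*)?)\?[^\'"]*\1~i';

// Single quotes stop PHP from interpolating $1, $2 and $3 as variables.
$replacement = '$1$2$3$1';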

echo preg_replace($pattern, $replacement, $string);

The result is:

<span><a href="http://domainneedingstripping.com/path/file.jpg">x</a></span>
<img src="http://otherdomain.com?dontStripThis=true" />
<p>And much more content as in a full HTML doc</p>
<span><a href="http://domainneedingstripping.com/otherpath/otherfile.jpg">x</a></span>

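This may work as a solution, but the markup in an HTML file can vary a great deal, so I would suggest using http://simplehtmldom.sourceforge.net/

This solution only works for one specific domain.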
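
If more than one host ever needs stripping, the same idea could be extended by building an alternation from a list of hosts. This is only a sketch under the same assumptions (quoted attribute values, http:// URLs); the function name stripQueryStringsForHosts and any host names passed to it are hypothetical.

```php
<?php
// Hypothetical helper: strips query strings from quoted URLs on any host in $hosts.
function stripQueryStringsForHosts(string $html, array $hosts): string
{
    if ($hosts === []) {
        return $html; // nothing to strip
    }

    // Escape each host for use inside the pattern, then join them with "|".
    $quoted = array_map(function ($host) {
        return preg_quote($host, '~');
    }, $hosts);
    $alternation = implode('|', $quoted);

    // Same shape as the single-host pattern, with the host replaced by a group.
    $pattern = '~([\'"])(http://(?:' . $alternation . '))((?:/[^\'"?]*)?)\?[^\'"]*\1~i';

    return preg_replace($pattern, '$1$2$3$1', $html);
}
```

URLs on hosts that are not in the list are left untouched, query string included.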
