PHP DOM: get all script src URLs from a website
I want to use curl and DOM to get all script src links from a website.
I have this code:
// $dom is a DOMDocument already loaded with the page HTML (e.g. fetched with curl)
$scripts = $dom->getElementsByTagName('script');
foreach ($scripts as $script) {
    $src = $script->getAttribute('src');
    if ($src !== '') {
        echo $src, "\n";
    }
}
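The snippet above assumes that $dom has already been populated. A minimal sketch of that setup might look like the following; the helper name getScriptSrcs and the sample markup are assumptions made for illustration, and the literal string stands in for whatever your curl request returns.

```php
<?php
// Hypothetical helper (not from the original post): returns the src
// attribute of every <script> element found in the given HTML.
function getScriptSrcs(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings internally
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $srcs = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        // getAttribute() returns an empty string when the attribute is absent.
        $src = $script->getAttribute('src');
        if ($src !== '') {
            $srcs[] = $src;
        }
    }
    return $srcs;
}

// Sample markup standing in for a page fetched with curl (assumption).
$html = '<html><head><script src="/js/app.js"></script>'
      . '<script>var inline = 1;</script></head><body></body></html>';

print_r(getScriptSrcs($html));
```

Only the first tag contributes a result here, because the second one is an inline script with no src attribute.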
This code works fine, but what if the site has a script tag like this:
<script type="text/javascript">
window._wpemojiSettings = {"source":{"concatemoji":"http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4"}}; ........
</script>
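I need to get this script URL as well. How can I do that?
If your first parser comes up empty, I would build a second one using a regular expression, for example: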
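$html = file_get_contents("http://somesite.com/");
// Capture the first http...js URL (with optional query string) after each "<script"
preg_match_all('/<script.*?(http.*?\.js(?:\?.*?)?)"/si', $html, $matches, PREG_PATTERN_ORDER);
foreach ($matches[1] as $url) {
    // Undo the JSON-style "\/" escaping used inside inline scripts
    echo str_replace('\\/', '/', $url), "\n";
}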
You may need to adjust the regex to work with different sites, but the code above should give you an idea of what you need.
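One possible way to combine the two approaches is to keep the DOM loop for src attributes and only run a regex over the text of inline scripts. This is a variation on the answer, not the answer's own method: the helper name getAllScriptUrls, the stricter URL pattern, and the sample markup are all assumptions, so treat it as a sketch to adapt.

```php
<?php
// Hypothetical helper: src attributes via DOM, plus .js URLs that appear
// inside inline <script> bodies via a regex.
function getAllScriptUrls(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        $src = $script->getAttribute('src');
        if ($src !== '') {
            $urls[] = $src;
            continue;
        }
        // Inline script: undo JSON-style "\/" escaping first.
        $code = str_replace('\\/', '/', $script->textContent);
        // Assumed pattern: an http(s) URL without quotes or whitespace that
        // contains ".js", optionally followed by a query string.
        $pattern = '~https?://[^"\'\s]+?\.js(?:\?[^"\'\s]*)?~i';
        if (preg_match_all($pattern, $code, $m)) {
            $urls = array_merge($urls, $m[0]);
        }
    }
    // Drop duplicates and reindex.
    return array_values(array_unique($urls));
}

// Sample markup (assumption) that mixes an external and an inline script.
$html = '<html><head>'
      . '<script src="/js/app.js"></script>'
      . '<script type="text/javascript">window._wpemojiSettings = {"source":{"concatemoji":"http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4"}};</script>'
      . '</head><body></body></html>';

print_r(getAllScriptUrls($html));
```

Restricting the regex to inline script text avoids matching across unrelated tags, at the cost of missing URLs that are assembled at runtime. Note that the pattern would also stop at the ".js" inside a longer extension such as ".json", so it may need tightening for your pages.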
Regex explanation:
<script.*?(http.*?\.js(?:\?.*?)?)"
----------------------------------
Match the character string “<script” literally «<script»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regex below and capture its match into backreference number 1 «(http.*?\.js(?:\?.*?)?)»
   Match the character string “http” literally «http»
   Match any single character «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character “.” literally «\.»
   Match the character string “js” literally «js»
   Match the regular expression below «(?:\?.*?)?»
      Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
      Match the character “?” literally «\?»
      Match any single character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “"” literally «"»
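To see what the breakdown above means in practice, here is a small sketch that runs the same pattern against the inline tag from the question and compares the full match with capture group 1. The sample string is hard-coded so no network access is needed; the variable names are only illustrative.

```php
<?php
// The inline tag from the question, as a literal string. Single quotes keep
// the "\/" sequences exactly as they appear in the page source.
$html = '<script type="text/javascript">' . "\n"
      . 'window._wpemojiSettings = {"source":{"concatemoji":"http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4"}};' . "\n"
      . '</script>';

// s: "." also matches line breaks, i: case-insensitive.
// With PREG_PATTERN_ORDER, $matches[0] holds the full matches and
// $matches[1] holds what capture group 1 matched.
preg_match_all('/<script.*?(http.*?\.js(?:\?.*?)?)"/si', $html, $matches, PREG_PATTERN_ORDER);

$fullMatch = $matches[0][0]; // everything from "<script" up to the closing quote
$rawUrl    = $matches[1][0]; // the still-escaped URL
$url       = str_replace('\\/', '/', $rawUrl); // unescaped URL

echo $url, "\n";
```

The full match runs from "<script" through the double quote that closes the URL, while group 1 holds just the URL together with its query string, still carrying the "\/" escapes until str_replace removes them.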