Can't automatically download all the links as separate HTML files using PHP

Question

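I'm trying to create a page in PHP that reads the source of a web page, finds all the links, and then, for each link (if it is HTML), automatically downloads the file to my computer (preferably without asking where...).

Here is my code: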

<?php

$srcUrl= 'http://www.justdogbreeds.com/all-dog-breeds.html';

$html = file_get_contents($srcUrl);

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);

//finding the a tag
$hrefs = $xpath->evaluate("/html/body//a");

$testo = '<table width="100%" border="1" cellspacing="2" cellpadding="2" summary="layout">
  <caption>
    List of links
  </caption>
  <tr>
    <th scope="col">&nbsp;</th>
        <th scope="col">&nbsp;</th>
  </tr>';

//Loop to display all the links and download
for ($i = 0; $i < $hrefs->length; $i++) {

       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');

 //if real link
       if($url!='#')  

       {

 //Code to get the file...
 $data = file_get_contents($url);

 //save as?
 $filename = $url;

 /*save the file...
 $fh = fopen($filename,"w");
 fwrite($fh,$data);
 fclose($fh);*/

        $hfile = fopen($data ,"r");
        if($hfile){
            while(!feof($hfile)){
                $html=fgets($hfile,1024);
            }
        }
 $fh = fopen($filename,"w");
 fwrite($fh,$html);
 fclose($fh);

//download automatically (better if without asking where... maybe in download folder)
header('Content-disposition: attachment; filename=' . $filename);
header("Content-Type: application/force-download");
header('Content-type: text/html');

 //display link to the file you just saved...
    $testo.='<tr>
    <td>'.$url.'</td>
    <td></td>
    </tr>';
       }

}

$testo.='</table>';

echo $testo;

?>

What am I doing wrong? Thanks.

Answer 1

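You're mixing up two things. Here is what your current code does:

Load the content of the source URL
Find the links
For each link:
1. Download the content ($data = file_get_contents($url);) - this is fine.
2. Open a new file for reading ($hfile = fopen($data ,"r");) - I'm not sure why you need this, and it actually does nothing, because the file name you are trying to open is the content downloaded in step 1. You don't need to read anything here - you already have the content in $data.
3. Write the content you just read to a file (the lines from $fh = fopen to fclose) - but there is a problem here: the name of the file you are trying to create is a URL (e.g. http://somedomain.sometld/somefile.html?t=1&r=2), and you can't create a file with that name. You need to generate a random file name.
4. Send headers that make the browser download an HTML file named after the file you just saved.
  There are a couple of problems here. First, the headers are sent once for every link found on the page, which you don't need; send them only once. Second, you have the same file name problem.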
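
To illustrate point 3, here is a minimal sketch of turning a URL into a file name that can actually be created. The helper name urlToFilename is hypothetical and not part of the original code. It uses an MD5 hash of the URL rather than a random number, which is one possible alternative: the same URL always maps to the same file name.

```php
<?php
// Hypothetical helper (not in the original code): derive a safe file name from a URL.
function urlToFilename(string $url): string
{
    // A URL contains characters such as "/", "?" and ":" that are awkward or
    // invalid in file names, so hash the URL instead of using it directly.
    return md5($url) . '.html';
}

// Example call with the URL from the explanation above.
$name = urlToFilename('http://somedomain.sometld/somefile.html?t=1&r=2');
echo $name, "\n"; // 32 hexadecimal characters followed by ".html"
```

With a helper like this, file_put_contents(urlToFilename($url), $data) could replace the fopen/fwrite/fclose lines.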

I made a few changes to your code; it should work:

<?php
$srcUrl= 'http://www.justdogbreeds.com/all-dog-breeds.html';

$html = file_get_contents($srcUrl);

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);

//finding the a tag
$hrefs = $xpath->evaluate("/html/body//a");

$testo = '<table width="100%" border="1" cellspacing="2" cellpadding="2" summary="layout">
  <caption>
    List of links
  </caption>
  <tr>
    <th scope="col">&nbsp;</th>
        <th scope="col">&nbsp;</th>
  </tr>';

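// Send the download headers once, before any output.
// Only one Content-Type is kept: a second header() call with the same name replaces the first.
$filename = 'list-of-links.html';
header('Content-Disposition: attachment; filename="' . $filename . '"');
header('Content-Type: text/html');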

//Loop to display all the links and download
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    //if real link
    if($url!='#') {
        //Code to get the file...
        $data = file_get_contents($url);

        //save as?
        $filename = mt_rand(10000000, 90000000) . ".html";
        file_put_contents($filename, $data);

        //display link to the file you just saved...
        $testo.='<tr>
        <td>'.$url.'</td>
        <td></td>
        </tr>';
    }
}
$testo.='</table>';
echo $testo;
?>

I recommend sleeping for a few seconds after each request so that you don't put too much load on the server.
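
A minimal sketch of that pause, under some assumptions: the helper name fetchPolitely, the two-second default delay, and the optional $fetch callback are all illustrative and not part of the code above.

```php
<?php
// Hypothetical helper: fetch each URL in turn, pausing between requests.
// $fetch defaults to file_get_contents; $delaySeconds is an illustrative value.
function fetchPolitely(array $urls, int $delaySeconds = 2, ?callable $fetch = null): array
{
    $fetch = $fetch ?? 'file_get_contents';
    $pages = [];
    foreach (array_values($urls) as $i => $url) {
        if ($i > 0) {
            sleep($delaySeconds); // wait before every request except the first
        }
        $data = $fetch($url);
        if ($data !== false) {    // file_get_contents returns false on failure
            $pages[$url] = $data;
        }
    }
    return $pages;
}
```

In the loop above, the simplest equivalent is a sleep(2); call right after file_put_contents($filename, $data);.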
