Skip the first two statements of a site when extracted by a PHP web crawler

Question

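I have a PHP web crawler, and it works fine (for now).

It extracts forum questions and their links from a site and pastes them into my site.

Now I have been trying to make it do the same thing, except this time I want it to skip the first 2 rows of the extracted site. So instead of taking every statement from the site, it should start from statement 3.

My code is below: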

<?php
    function get_data($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_URL,$url);
        $result=curl_exec($ch);
        curl_close($ch);
        return $result;
    }
    $returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
    $first_step = explode( '<tbody id="threadbits_forum_26"' , $returned_content );
    $second_step = explode('</tbody>', $first_step[1]);
    $third_step = explode('<tr>', $second_step[0]);
    // print_r($third_step);
    foreach ($third_step as $key=>$element) {
        $child_first = explode( '<td class="alt1"' , $element );
        $child_second = explode( '</td>' , $child_first[1] );
        $child_third = explode( '<a href=' , $child_second[0] );
        $child_fourth = explode( '</a>' , $child_third[1] );
        $final = "<a href=".$child_fourth[0]."</a></br>";
        echo '<li target="_blank" class="itemtitle">';
        if($key < 5 && $key > 2 && rand(0,1) == 1) {
            echo '<span class="item_new">new</span>';
        }
        echo $final;
        echo '</li>';
        if($key==10) {
            break;
        }
    }
?>

Any help is appreciated.

Answer 1

You can introduce a variable $i and increment it on each foreach step. Then execute your code only after it has been incremented twice:

<?php
    function get_data($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_URL,$url);
        $result=curl_exec($ch);
        curl_close($ch);
        return $result;
    }
    $returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
    $first_step = explode( '<tbody id="threadbits_forum_26"' , $returned_content );
    $second_step = explode('</tbody>', $first_step[1]);
    $third_step = explode('<tr>', $second_step[0]);
    // print_r($third_step);
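    $i = 1; // counts loop iterations, starting at 1
    foreach ($third_step as $key=>$element) {
        // Skip the first two iterations; processing starts with the third.
        if ($i < 3) {
            $i++;
            continue;
        }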
        $child_first = explode( '<td class="alt1"' , $element );
        $child_second = explode( '</td>' , $child_first[1] );
        $child_third = explode( '<a href=' , $child_second[0] );
        $child_fourth = explode( '</a>' , $child_third[1] );
        $final = "<a href=".$child_fourth[0]."</a></br>";
        echo '<li target="_blank" class="itemtitle">';
        if($key < 5 && $key > 2 && rand(0,1) == 1) {
            echo '<span class="item_new">new</span>';
        }
        echo $final;
        echo '</li>';
        if($key==10) {
            break;
        }
    }
?>
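
As a variation on the same idea, the first two entries can be dropped before the loop rather than counted inside it. The following is only a minimal sketch: $rows is a made-up stand-in for $third_step, and skip_rows is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: drop the first $skip entries and optionally cap the result.
function skip_rows(array $rows, int $skip, ?int $limit = null): array
{
    // array_slice() re-indexes numeric keys from 0 by default.
    return array_slice($rows, $skip, $limit);
}

// Made-up sample data standing in for $third_step.
$rows = ['row 1', 'row 2', 'row 3', 'row 4', 'row 5'];

$kept = skip_rows($rows, 2);

foreach ($kept as $row) {
    echo $row, PHP_EOL;
}
```

Because the kept entries are re-indexed, any later check on $key (such as the $key==10 limit in the loop above) would count from the first kept row rather than from the first original row.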

Answer 2

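I'm not too sure about the logic behind your <span>new</span> randomizer, but I can assure you that slicing HTML data with string functions is unreliable (when it fails, it fails silently). Instead, I recommend using DOMDocument with XPath to accomplish your task.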
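
To make the silent-failure point concrete, here is a small sketch. The markup is invented for illustration: the cell carries an extra class value, so the exact substring the string-based approach looks for never appears.

```php
<?php
// Hypothetical markup: class="alt1 first" does not contain the literal
// text '<td class="alt1"', so the delimiter below is never found.
$html = '<tr><td class="alt1 first"><a href="http://www.example.com">test</a></td></tr>';

$parts = explode('<td class="alt1"', $html);

// explode() does not report a missing delimiter. It returns the whole
// input as a single element, so index 1 simply does not exist.
var_dump(count($parts));     // int(1)
var_dump(isset($parts[1]));  // bool(false)
```

Reading $parts[1] at this point would yield null along with a notice or warning, depending on the PHP version and error settings, and every later explode() in the chain would then operate on that empty value.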

Code:

$dom=new DOMDocument; 
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = '';
foreach ($xpath->evaluate("//td[@class='alt1']/a") as $i => $node) {  // target a tags that have <td class="alt1"> as parent
    if ($i > 1) {  // disqualify first two nodes
        $result .= "<li class=\"itemtitle\"><a href=\"{$node->getAttribute('href')}\" target=\"_blank\">{$node->nodeValue}</a></li>";
        if ($i == 12) { break; }  // stop after node index 12, so at most 11 rows are kept (#3 to #13)
    }
}
if ($result) {
    echo "<ul>$result</ul>";
}

Sample input: (because I don't want to scrape the posted URL)

$html = <<<HTML
<table>
    <tbody id="threadbits_forum_26">
        <tr>
            <td class="alt1">
                <a href="http://www.example1.com">test1</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example2.com">test2</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example3.com">test3</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example4.com">test4</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example5.com">test5</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example6.com">test6</a>
            </td>
        </tr>
    </tbody>
</table>
HTML;

Output:

<ul>
    <li class="itemtitle"><a href="http://www.example3.com" target="_blank">test3</a></li>
    <li class="itemtitle"><a href="http://www.example4.com" target="_blank">test4</a></li>
    <li class="itemtitle"><a href="http://www.example5.com" target="_blank">test5</a></li>
    <li class="itemtitle"><a href="http://www.example6.com" target="_blank">test6</a></li>
</ul>
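
If this is later pointed at a live page rather than the tidy sample input, loadHTML may emit warnings for markup it does not like. The sketch below wraps the same XPath query in a function and collects libxml errors internally. The function name extract_links and its $skip and $limit parameters are assumptions made for illustration, not part of the original answer.

```php
<?php
/**
 * Hypothetical helper: collect links found directly inside <td class="alt1">,
 * skipping the first $skip matches and returning at most $limit of the rest.
 */
function extract_links(string $html, int $skip = 2, int $limit = 10): array
{
    // Collect parser complaints internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected errors and restore the earlier setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query("//td[@class='alt1']/a") as $i => $node) {
        if ($i < $skip) {
            continue; // still inside the skipped prefix
        }
        if (count($links) >= $limit) {
            break; // enough rows collected
        }
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => trim($node->nodeValue),
        ];
    }
    return $links;
}

// Made-up sample markup in the same shape as the sample input above.
$html = '<table><tbody>'
    . '<tr><td class="alt1"><a href="http://www.example1.com">test1</a></td></tr>'
    . '<tr><td class="alt1"><a href="http://www.example2.com">test2</a></td></tr>'
    . '<tr><td class="alt1"><a href="http://www.example3.com">test3</a></td></tr>'
    . '<tr><td class="alt1"><a href="http://www.example4.com">test4</a></td></tr>'
    . '<tr><td class="alt1"><a href="http://www.example5.com">test5</a></td></tr>'
    . '</tbody></table>';

$links = extract_links($html, 2, 2);

foreach ($links as $link) {
    echo $link['href'], ' ', $link['text'], PHP_EOL;
}
```

Returning plain arrays keeps the extraction separate from the markup that is eventually printed. When the values are echoed into a page, passing them through htmlspecialchars is a sensible precaution, since the scraped text comes from a third party.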

