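Skip the first two statements of a site when extracted by a PHP web crawler
I have a PHP web crawler that works fine (for now).
It extracts forum questions and their links from a site and pastes them into my site.
Now I have been trying to make it do the same thing, except this time I want it to skip the first 2 rows of the extracted site.
So instead of fetching every statement from the site, it should start at statement 3.
My code is below: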
<?php
function get_data($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
$first_step = explode('<tbody id="threadbits_forum_26"', $returned_content);
$second_step = explode('</tbody>', $first_step[1]);
$third_step = explode('<tr>', $second_step[0]);
// print_r($third_step);
foreach ($third_step as $key => $element) {
    $child_first = explode('<td class="alt1"', $element);
    $child_second = explode('</td>', $child_first[1]);
    $child_third = explode('<a href=', $child_second[0]);
    $child_fourth = explode('</a>', $child_third[1]);
    $final = "<a href=".$child_fourth[0]."</a></br>";
    echo '<li target="_blank" class="itemtitle">';
    if ($key < 5 && $key > 2 && rand(0,1) == 1) {
        echo '<span class="item_new">new</span>';
    }
    echo $final;
    echo '</li>';
    if ($key == 10) {
        break;
    }
}
?>
Any help is appreciated.
You can introduce a variable $i and increment it on every foreach step, then run your code only after it has been incremented twice:
<?php
function get_data($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
$first_step = explode('<tbody id="threadbits_forum_26"', $returned_content);
$second_step = explode('</tbody>', $first_step[1]);
$third_step = explode('<tr>', $second_step[0]);
// print_r($third_step);
$i = 1;
foreach ($third_step as $key => $element) {
    // Skip the first two iterations.
    if ($i < 3) {
        $i++;
        continue;
    }
    $child_first = explode('<td class="alt1"', $element);
    $child_second = explode('</td>', $child_first[1]);
    $child_third = explode('<a href=', $child_second[0]);
    $child_fourth = explode('</a>', $child_third[1]);
    $final = "<a href=".$child_fourth[0]."</a></br>";
    echo '<li target="_blank" class="itemtitle">';
    if ($key < 5 && $key > 2 && rand(0,1) == 1) {
        echo '<span class="item_new">new</span>';
    }
    echo $final;
    echo '</li>';
    if ($key == 10) {
        break;
    }
}
?>
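As an alternative to the counter, a minimal sketch: array_slice can drop the leading entries before the loop starts. The $rows array below is a hypothetical stand-in for $third_step, not real scraped data, and the row strings are placeholders.

```php
<?php
// Hypothetical stand-in for $third_step (the pieces produced by explode('<tr>', ...)).
$rows = ['row1', 'row2', 'row3', 'row4', 'row5'];

// Offset 2 drops the first two entries; a null length keeps everything after the offset.
// preserve_keys = true keeps the original indexes, so later checks on $key still line up.
$kept = array_slice($rows, 2, null, true);

foreach ($kept as $key => $element) {
    echo $key, ': ', $element, "\n";
}
```

Because the keys are preserved, a check such as $key == 10 inside the loop would still refer to the original position in the unsliced array.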
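I'm not quite sure about the logic behind your <span>new</span> randomizer, but I can assure you that cutting up HTML with string functions is unreliable (and when it fails, it fails silently). Instead, I would recommend DOMDocument with XPath for your task.
Code: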
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = '';
foreach ($xpath->evaluate("//td[@class='alt1']/a") as $i => $node) {  // target a tags that have <td class="alt1"> as parent
    if ($i > 1) {  // disqualify first two nodes
        $result .= "<li class=\"itemtitle\"><a href=\"{$node->getAttribute('href')}\" target=\"_blank\">{$node->nodeValue}</a></li>";
        if ($i == 11) { break; }  // set a limit of 10 rows of data (#3 to #12)
    }
}
if ($result) {
    echo "<ul>$result</ul>";
}
Sample input: (because I don't want to scrape the posted URL)
$html = <<<HTML
<table>
<tbody id="threadbits_forum_26">
<tr>
<td class="alt1">
<a href="http://www.example1.com">test1</a>
</td>
</tr>
<tr>
<td class="alt1">
<a href="http://www.example2.com">test2</a>
</td>
</tr>
<tr>
<td class="alt1">
<a href="http://www.example3.com">test3</a>
</td>
</tr>
<tr>
<td class="alt1">
<a href="http://www.example4.com">test4</a>
</td>
</tr>
<tr>
<td class="alt1">
<a href="http://www.example5.com">test5</a>
</td>
</tr>
<tr>
<td class="alt1">
<a href="http://www.example6.com">test6</a>
</td>
</tr>
</tbody>
</table>
HTML;
Output:
<ul>
<li class="itemtitle"><a href="http://www.example3.com" target="_blank">test3</a></li>
<li class="itemtitle"><a href="http://www.example4.com" target="_blank">test4</a></li>
<li class="itemtitle"><a href="http://www.example5.com" target="_blank">test5</a></li>
<li class="itemtitle"><a href="http://www.example6.com" target="_blank">test6</a></li>
</ul>
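The skip and the limit could also be pushed into the XPath expression itself. The following is a rough sketch under assumptions: the markup is generated to resemble the sample input above, and the cap of ten links is an assumed requirement rather than something the question states precisely.

```php
<?php
// Hypothetical markup shaped like the sample input: six rows, one link each.
$html = '<table><tbody>';
for ($n = 1; $n <= 6; $n++) {
    $html .= "<tr><td class=\"alt1\"><a href=\"http://www.example{$n}.com\">test{$n}</a></td></tr>";
}
$html .= '</tbody></table>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The parentheses make position() count across the whole result set
// rather than within each <td>; keep matches 3 through 12 (skip two, cap at ten).
$query = "(//td[@class='alt1']/a)[position() > 2 and position() <= 12]";

$links = [];
foreach ($xpath->query($query) as $node) {
    $links[] = $node->getAttribute('href');
}
print_r($links);
```

With only six rows in the sample, the upper bound never triggers and the four links from example3 to example6 are collected.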