PHP Simple DOMDocument scraping: exclude td cells by colspan attribute
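I just want to get the data from all <td> elements that reside inside <tr> elements. My problem is that, because of the structure of the table I am scraping, I need to exclude every element that has a colspan attribute, i.e. <td colspan="12">.
As the code below shows, getting the table data is straightforward, but because of the table structure I need to exclude all cells that carry a colspan attribute.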
<?php
$html = file_get_contents('http://www.superxv.com/fixtures/'); // get the HTML returned from this URL
$game_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); // collect libxml errors instead of printing them
if (!empty($html)) { // if any HTML is actually returned
    $game_doc->loadHTML($html);
    libxml_clear_errors(); // discard the collected errors
    $xpath = new DOMXPath($game_doc);
    // Modify the XPath query to match the content
    foreach ($xpath->query('//table')->item(0)->getElementsByTagName('tr') as $rows) {
        $cells = $rows->getElementsByTagName('td');
        //$cells2 = $rows->getElementsByTagName('th');
        echo '<pre>';
        // Get scraped columns
        echo $dayDateBye[] = $cells->item(0)->textContent;
        echo $homeTeam[] = $cells->item(1)->textContent;
        echo $awayTeam[] = $cells->item(2)->textContent;
        echo $venue[] = $cells->item(3)->textContent;
        echo $timeGMT[] = $cells->item(5)->textContent;
        echo $timeZA[] = $cells->item(10)->textContent;
        echo '</pre>';
    }
}
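Here you can see the table structure: it shows five or so rows of fixtures and then changes structure when a new week starts. The elements I can use to identify and skip this structure change are the <td colspan="12"> elements. This makes it tricky, because the TD elements have no class name, only that attribute to identify them.
Any advice is appreciated.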
You can skip those rows based on the number of <td> cells they contain (the length of the node list):
<?php
$html = file_get_contents('http://www.superxv.com/fixtures/'); // get the HTML returned from this URL
$game_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); // collect libxml errors instead of printing them
if (!empty($html)) { // if any HTML is actually returned
    $game_doc->loadHTML($html);
    libxml_clear_errors(); // discard the collected errors
    $xpath = new DOMXPath($game_doc);
    // Modify the XPath query to match the content
    foreach ($xpath->query('//table')->item(0)->getElementsByTagName('tr') as $rows) {
        $cells = $rows->getElementsByTagName('td');
        if ($cells->length > 1) { // skip rows that hold only a single (spanning) cell
            //$cells2 = $rows->getElementsByTagName('th');
            echo '<pre>';
            // Get scraped columns
            echo $dayDateBye[] = $cells->item(0)->textContent;
            echo $homeTeam[] = $cells->item(1)->textContent;
            echo $awayTeam[] = $cells->item(2)->textContent;
            echo $venue[] = $cells->item(3)->textContent;
            echo $timeGMT[] = $cells->item(5)->textContent;
            echo $timeZA[] = $cells->item(10)->textContent;
            echo '</pre>';
        }
    }
}
?>
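To try the cell-count check without fetching the live page, here is a minimal sketch that runs against an inline table. The markup, the team names, and the `$fixtures` array are made up for illustration and are not taken from the real site. It assumes the week-heading rows contain exactly one spanning cell.

```php
<?php
// Hypothetical sample markup standing in for the real fixtures page.
$html = <<<HTML
<table>
  <tr><td colspan="12">Week 1</td></tr>
  <tr><td>Day 1</td><td>Team A</td><td>Team B</td><td>Ground 1</td></tr>
  <tr><td colspan="12">Week 2</td></tr>
  <tr><td>Day 2</td><td>Team C</td><td>Team D</td><td>Ground 2</td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$fixtures = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length <= 1) {
        continue; // heading row: a single cell spanning the table
    }
    $fixtures[] = [
        'date'  => trim($cells->item(0)->textContent),
        'home'  => trim($cells->item(1)->textContent),
        'away'  => trim($cells->item(2)->textContent),
        'venue' => trim($cells->item(3)->textContent),
    ];
}

print_r($fixtures);
```

Collecting each row into one associative array, rather than echoing into six parallel arrays, keeps the columns of a fixture together. That is a design choice of this sketch, not something the original answer requires.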
Use XPath to exclude elements that have a colspan attribute.
So instead of:
$cells = $rows->getElementsByTagName('td');
use:
$cells = $xpath->query('td[not(@colspan)]', $rows);
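One thing to watch with the XPath filter: for a row whose only cell has a colspan, the query returns an empty node list, so `item(0)` is `null` and reading `textContent` from it fails. The sketch below adds a guard for that case. The inline markup, team names, and the `$matches` array are hypothetical and used only for illustration.

```php
<?php
// Hypothetical sample markup; the real page layout may differ.
$html = <<<HTML
<table>
  <tr><td colspan="12">Week 1</td></tr>
  <tr><td>Day 1</td><td>Team A</td><td>Team B</td></tr>
  <tr><td colspan="12">Week 2</td></tr>
  <tr><td>Day 2</td><td>Team C</td><td>Team D</td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$matches = [];
// (//table)[1] selects the first table in the document.
foreach ($xpath->query('(//table)[1]//tr') as $row) {
    // Relative query: only <td> children of this row without a colspan attribute.
    $cells = $xpath->query('td[not(@colspan)]', $row);
    if ($cells->length === 0) {
        continue; // every cell in this row was filtered out
    }
    $matches[] = [
        'date' => trim($cells->item(0)->textContent),
        'home' => trim($cells->item(1)->textContent),
        'away' => trim($cells->item(2)->textContent),
    ];
}

print_r($matches);
```

An alternative is to filter at row level with a query such as `(//table)[1]//tr[not(td[@colspan])]`, which drops the heading rows before the loop body runs.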