How to find specific data using simple html dom php
When I scrape the tables, the set of tr and td rows varies from page to page. The original tables are shown below.
<table class="scoretable">
<tbody>
<tr><td class="jdhead">Name</td><td class="fullhead">John</td></tr>
<tr><td class="jdhead">Age</td><td class="fullhead">30</td></tr>
<tr><td class="jdhead">Phone</td><td class="fullhead">91234988788</td></tr>
<tr><td class="jdhead">Location</td><td class="fullhead">Madrid</td></tr>
<tr><td class="jdhead">Country</td><td class="fullhead">Spain</td></tr>
<tr><td class="jdhead">Role</td><td class="fullhead">Manager</td></tr>
</tbody>
</table>
<table class="scoretable">
<tbody>
<tr><td class="jdhead">Name</td><td class="fullhead">John</td></tr>
<tr><td class="jdhead">Age</td><td class="fullhead">30</td></tr>
<tr><td class="jdhead">Phone</td><td class="fullhead">91234988788</td></tr>
<tr><td class="jdhead">Role</td><td class="fullhead">Manager</td></tr>
</tbody>
</table>
The two tables above come from different pages. I need to scrape the Name, Phone, and Role values.
$url = "http://name.com/listings";
$html = file_get_html( $url );
$posts1 = $html->find('td[class=fullhead]',1);
foreach ( $posts1 as $post1 ) {
$poster1 = $post1->outertext;
echo $poster1;
}
I would use preg_match to pull the required values out of the HTML, like this:
<?php
$url = 'http://name.com/listings';
$html = file_get_contents($url);
if (preg_match('~<tr><td class="jdhead">Name</td><td class="fullhead">([^<]*)</td></tr>~', $html, $matches)) {
echo $matches[1]; // here is your name
}
if (preg_match('~<tr><td class="jdhead">Phone</td><td class="fullhead">([^<]*)</td></tr>~', $html, $matches)) {
echo $matches[1]; // here is your phone
}
if (preg_match('~<tr><td class="jdhead">Role</td><td class="fullhead">([^<]*)</td></tr>~', $html, $matches)) {
echo $matches[1]; // here is your role
}
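The three near-identical calls above could be folded into one helper. What follows is only a sketch under assumptions: `extract_field` is a hypothetical name that does not appear in the original answer, the sample HTML is inlined from the question instead of being fetched, and the pattern is loosened to tolerate whitespace and letter-case differences between pages.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the text of
// the "fullhead" cell that follows the "jdhead" cell holding $label, or null
// when the page has no such row.
function extract_field(string $html, string $label): ?string
{
    $pattern = '~<td\s+class="jdhead">\s*' . preg_quote($label, '~')
             . '\s*</td>\s*<td\s+class="fullhead">([^<]*)</td>~i';

    return preg_match($pattern, $html, $m) ? trim($m[1]) : null;
}

// Inline sample mirroring the second table from the question.
$html = '<table class="scoretable"><tbody>
<tr><td class="jdhead">Name</td><td class="fullhead">John</td></tr>
<tr><td class="jdhead">Age</td><td class="fullhead">30</td></tr>
<tr><td class="jdhead">Phone</td><td class="fullhead">91234988788</td></tr>
<tr><td class="jdhead">Role</td><td class="fullhead">Manager</td></tr>
</tbody></table>';

// Collect only the fields the question asks for.
$fields = [];
foreach (['Name', 'Phone', 'Role'] as $label) {
    $fields[$label] = extract_field($html, $label);
}

print_r($fields);
```

Because each field is matched by its label rather than by its position, rows such as Location or Country that appear on only some pages do not shift the result.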
Update (see the comments below):
<?php
$url = 'http://jobsearch.naukri.com/job-listings-010915006292';
$html = file_get_contents($url);
if (preg_match('~<TR VALIGN="top"> <TD CLASS="jdHead">Job Posted </TD> <TD VALIGN="top" CLASS="detailJob">([^<]*)</TD> </TR>~', $html, $matches)) {
echo 'Job Posted: ' . $matches[1] . '<br><br>';
}
if (preg_match('~<TR VALIGN="top"> <TD CLASS="jdHead">Job Description</TD> <TD VALIGN="top" CLASS="detailJob">(.*?)</TD> </TR>~', $html, $matches)) {
echo 'Job Description: ' . $matches[1] . '<br><br>';
}
Here is a sample solution for you:
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("tabledata.html");
// required nodes
$required_data = ['Name', 'Phone', 'Role'];
$tbody_elements = $doc->getElementsByTagName('tbody');
// xpath object
$xpath = new DOMXPath($doc);
// array for final data
$finaldata = [];
// each tbody is one user
foreach($tbody_elements as $key => $tbody)
{
// iterate through the required data
foreach($required_data as $data)
{
$return = $xpath->query("tr[td[text()='$data']]", $tbody);
foreach($return as $node)
{
$finaldata[$key][$data] = $node->textContent;
}
}
}
Output:
array(2) {
[0]=>
array(3) {
["Name"]=>
string(8) "NameJohn"
["Phone"]=>
string(16) "Phone91234988788"
["Role"]=>
string(11) "RoleManager"
}
[1]=>
array(3) {
["Name"]=>
string(8) "NameJohn"
["Phone"]=>
string(16) "Phone91234988788"
["Role"]=>
string(11) "RoleManager"
}
}
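In the output above each value is prefixed with its label ("NameJohn") because `textContent` is read from the whole `tr`. A possible variation, sketched here under assumptions, selects only the value cell with the XPath `following-sibling` axis. The HTML is inlined from the question and loaded with `loadHTML` instead of reading the `tabledata.html` file, and `libxml_use_internal_errors` is added only to keep parser warnings quiet.

```php
<?php
// Sample markup: both tables from the question, inlined for the sketch.
$html = '<table class="scoretable"><tbody>
<tr><td class="jdhead">Name</td><td class="fullhead">John</td></tr>
<tr><td class="jdhead">Age</td><td class="fullhead">30</td></tr>
<tr><td class="jdhead">Phone</td><td class="fullhead">91234988788</td></tr>
<tr><td class="jdhead">Location</td><td class="fullhead">Madrid</td></tr>
<tr><td class="jdhead">Country</td><td class="fullhead">Spain</td></tr>
<tr><td class="jdhead">Role</td><td class="fullhead">Manager</td></tr>
</tbody></table>
<table class="scoretable"><tbody>
<tr><td class="jdhead">Name</td><td class="fullhead">John</td></tr>
<tr><td class="jdhead">Age</td><td class="fullhead">30</td></tr>
<tr><td class="jdhead">Phone</td><td class="fullhead">91234988788</td></tr>
<tr><td class="jdhead">Role</td><td class="fullhead">Manager</td></tr>
</tbody></table>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$required_data = ['Name', 'Phone', 'Role'];
$finaldata = [];

foreach ($doc->getElementsByTagName('tbody') as $key => $tbody) {
    foreach ($required_data as $label) {
        // Find the label cell, then step to the cell immediately after it.
        $query = "tr/td[normalize-space(.)='$label']/following-sibling::td[1]";
        $cell = $xpath->query($query, $tbody)->item(0);
        if ($cell !== null) {
            $finaldata[$key][$label] = trim($cell->textContent);
        }
    }
}

var_dump($finaldata);
```

With this query the stored value for Name is "John" rather than "NameJohn", and a label missing from a table is simply left out of that table's entry.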