Randomly missing nodes in HTML when scraping with GuzzleClient
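I'm dealing with a scraping problem here: the child elements are inconsistent, sometimes present and sometimes missing.
Because I save the values by their index in the $values[] array, I find that $values[18] is sometimes the email address and other times the phone or fax number.
A sample of the arrays from 3 iterations: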
[0] => [
[1] => Firm: The Firm One Name
[2] => Firm:
[3] => The Firm One Name
[4] => Office: 5th Av. 18980, NY
[5] => Office:
[6] => 5th Av. 18980, NY
[7] => City: New York
[8] => City:
[9] => New York
[10] => Country: USA
[11] => Country:
[12] => USA
[13] => Tel: +123 4 567 890
[14] => Tel:
[15] => +123 4 567 890
[16] => Email: person.one@example.com
[17] => Email:
[18] => person.one@example.com
],
[1] => [
[1] => Firm: The Firm Two Name
[2] => Firm:
[3] => The Firm Two Name
[4] => Office: 5th Av. 342680, NY
[5] => Office:
[6] => 5th Av. 342680, NY
[7] => City: New York
[8] => City:
[9] => New York
[10] => Country: USA
[11] => Country:
[12] => USA
[13] => Tel: +123 4 567 890
[14] => Tel:
[15] => +123 4 567 890
[16] => Fax: +123 4 567 891
[17] => Fax:
[18] => +123 4 567 891
[19] => Email: person.two@example.com
[20] => Email:
[21] => person.two@example.com
],
[2] => [
[1] => Firm: The Firm Three Name
[2] => Firm:
[3] => The Firm Three Name
[4] => Office: 5th Av. 89280, NY
[5] => Office:
[6] => 5th Av. 89280, NY
[7] => Country: USA
[8] => Country:
[9] => USA
[10] => Fax: +123 4 567 899
[11] => Fax:
[12] => +123 4 567 899
[13] => Email: person.three@example.com
[14] => Email:
[15] => person.three@example.com
]
As you may notice, when I iterate and save $values[15] from the last array, it is the email address, whereas [0][15] in the first array is the phone number.
My question is: is there an easier way than running a 'crazy loop' over the fields to always save the email as the email rather than as a phone number?
I am using GuzzleClient() and $node->filterXPath() and/or $node->filter(), depending on what I need to scrape.
The HTML structure I am dealing with is quite short, as in the example below, and sometimes nodes are missing:
<div id="profiledtails">
<div class="abc-g">
<div class="abc-gf">
<div class="abc-u first">Firm:</div>
<div class="abc-u">
<a href="http://example.com/123456/" title="More information here" class="Item" abc-tracker="office" abc-tracking="true">Person One</a>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Office:</div>
<div class="abc-u">
<address>
5th Av.<br>18980,<br>NY
</address>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">City:</div>
<div class="abc-u">New York</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Country:</div>
<div class="abc-u">USA</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Tel:</div>
<div class="abc-u">+123 4 567 890</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Fax:</div>
<div class="abc-u">+123 4 567 891</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Email:</div>
<div class="abc-u">
<a href="mailto:mperson.one@example.com">person.one@example.com</a></div>
</div>
</div>
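I have dealt with the same situation before. The only solution that worked for me was a regular expression: since the HTML elements change every time, you cannot keep track of them by position. Here is a workaround: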
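// Match the "Email:" label cell, then capture the link text in the value cell.
// \s* tolerates any whitespace between tags, and [^"]* accepts any address in the href.
$re = '/<div class="abc-u first">Email:<\/div>\s*<div class="abc-u">\s*<a href="mailto:[^"]*">(.*?)<\/a>/';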
$str = '<div id="profiledtails">
<div class="abc-g">
<div class="abc-gf">
<div class="abc-u first">Firm:</div>
<div class="abc-u">
<a href="http://example.com/123456/" title="More information here" class="Item" abc-tracker="office" abc-tracking="true">Person One</a>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Office:</div>
<div class="abc-u">
<address>
5th Av.<br>18980,<br>NY
</address>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">City:</div>
<div class="abc-u">New York</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Country:</div>
<div class="abc-u">USA</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Tel:</div>
<div class="abc-u">+123 4 567 890</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Fax:</div>
<div class="abc-u">+123 4 567 891</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Email:</div>
<div class="abc-u">
<a href="mailto:mperson.one@example.com">person.one@example.com</a></div>
</div>
</div>';
preg_match($re, $str, $matches, PREG_OFFSET_CAPTURE, 0);
// Print the entire match result
var_dump($matches);
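In the same way, you have to prepare a regular expression for each of the other values, and then you are good to go. The code above looks messy, but you can strip the whitespace from the string and the regex to clean it up.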
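This can be done easily with regular expressions. I haven't worked much with PHP, but as far as the regex goes, you can use the following for the key: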
class="abc-u first">(.*):
and this for the value:
class="abc-u">(.*?)</
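As a rough sketch of how those two patterns could be combined in PHP, the function below pairs each label cell with the value cell that follows it and returns the result keyed by label. The function name extractFields is hypothetical, and the pattern assumes the value cell never contains a nested div, which holds for the sample markup in the question but may not hold on the real site.

```php
<?php
// Hypothetical helper: pairs every "abc-u first" label with the "abc-u" value after it.
function extractFields(string $html): array
{
    // Group 1 is the label without its colon, group 2 is the raw value markup.
    // The "s" flag lets "." match newlines, since some values span several lines.
    $pattern = '~<div class="abc-u first">\s*([^<:]+):\s*</div>\s*'
             . '<div class="abc-u">(.*?)</div>~s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $fields = [];
    foreach ($matches as $m) {
        $value = preg_replace('~<br\s*/?>~i', ' ', $m[2]);  // turn <br> into a space
        $value = strip_tags($value);                        // drop <a>, <address>, ...
        $value = trim(preg_replace('/\s+/', ' ', $value));  // collapse whitespace
        $fields[trim($m[1])] = $value;
    }
    return $fields;
}
```

Because the result is keyed by label, a missing row simply means a missing key, so $fields['Email'] is the email address no matter how many rows came before it.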
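After taking a break and rethinking the problem, I found a solution that cleans up the data as needed. In the end it was just a matter of filtering the results and putting the right value in the right place in the array.
Here is what I came up with; it works for every case (adapt it as needed):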
$crawler->filterXPath('//*[@id="profiledetails"]/div')->each(function ($node) use ($data, $start, $i) {
    // get the values, starting from an empty array on every iteration
    $values = [];
    foreach ($node->filter('div') as $v) {
        $values[] = trim($v->nodeValue);
    }

    // sanitise the data (the values were already trimmed above)
    $sanitised = [];
    foreach ($values as $k => $v) {
        // Note: the +1 gets the next node, where the value is set
        $next = $values[$k + 1] ?? null;
        if ($v == 'Firm:') {
            $sanitised['firm_name'] = $next;
        }
        if ($v == 'Office:') {
            $sanitised['address'] = $next;
        }
        if ($v == 'City:') {
            $sanitised['city'] = $next;
        }
        if ($v == 'Country:') {
            $sanitised['country'] = $next;
        }
        if ($v == 'Tel:') {
            $sanitised['phone'] = $next;
        }
        if ($v == 'Fax:') {
            $sanitised['fax'] = $next;
        }
        if ($v == 'Email:') {
            $sanitised['email'] = $next;
        }
    }

    // any field that was not found (or is empty) is stored as null
    $data['firm_name'] = !empty($sanitised['firm_name']) ? $sanitised['firm_name'] : null;
    $data['address'] = !empty($sanitised['address']) ? nl2br($sanitised['address']) : null;
    $data['city'] = !empty($sanitised['city']) ? $sanitised['city'] : null;
    $data['country'] = !empty($sanitised['country']) ? $sanitised['country'] : null;
    $data['phone'] = !empty($sanitised['phone']) ? $sanitised['phone'] : null;
    $data['fax'] = !empty($sanitised['fax']) ? $sanitised['fax'] : null;
    $data['email'] = !empty($sanitised['email']) ? $sanitised['email'] : null;

    // Save the data
    ProfileModel::where('id', $i)->update($data);

    // just a console log to know where we are in case it fails on timeout
    echo "Done for profile id " . $i . PHP_EOL;
});
Even when empty or missing nodes are encountered, each iteration always results in a correct array. It looks like this:
[
    'firm_name' => 'Firm Name One',
    'address' => '5th Av.<br>18980,<br>NY',
    'city' => 'New York',
    'country' => 'USA',
    'phone' => '+123 4 567 890',
    'fax' => null,
    'email' => 'person.one@example.com',
]
Now every row in the database gets its data in the correct column (or NULL).
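For comparison, here is a minimal sketch of the same idea written against PHP's built-in DOMDocument and DOMXPath instead of the crawler, so it runs without any extra packages. It reads each abc-gf row as one label/value pair and looks the label up in a table rather than in a chain of if statements. The function name parseProfile and the lookup table are my own assumptions for illustration; the column names are the ones used above.

```php
<?php
// Hypothetical helper: returns one value per column, null where the row is missing.
function parseProfile(string $html): array
{
    // Assumed mapping from the label text to the database column.
    $map = [
        'Firm:'    => 'firm_name',
        'Office:'  => 'address',
        'City:'    => 'city',
        'Country:' => 'country',
        'Tel:'     => 'phone',
        'Fax:'     => 'fax',
        'Email:'   => 'email',
    ];

    // Start with every column set to null so missing rows stay null.
    $data = array_fill_keys(array_values($map), null);

    libxml_use_internal_errors(true); // real-world markup is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Each div with the class "abc-gf" is one label/value row.
    $rows = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " abc-gf ")]');
    foreach ($rows as $row) {
        $cells = $xpath->query('./div', $row); // direct child divs only
        if ($cells->length < 2) {
            continue; // row without a value cell
        }
        $label = trim($cells->item(0)->textContent);
        $value = trim(preg_replace('/\s+/', ' ', $cells->item(1)->textContent));
        if (array_key_exists($label, $map) && $value !== '') {
            $data[$map[$label]] = $value;
        }
    }
    return $data;
}
```

Since the label decides the column, the position of a row no longer matters, and supporting a new field is a one-line change to the table.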