Using php-spider, is there a standard XPath that might discover the URIs on most web sites?

Question

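I am using a wonderful script called php-spider to scrape the title, description, H1, H2, H3 and H4 from a number of web sites. As part of configuring the script, it is necessary to set an 'XPathExpressionDiscoverer' that tells the script how to find additional hyperlinks to crawl on each page. I assume this refers to the standard XPath query language.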

My goal is to find an XPathExpressionDiscoverer expression that works on most web sites in general, rather than having to customize it for each site.

Here is what I have tried:

I noticed that the example provided by the author uses a very specific XPathExpressionDiscoverer to crawl the given sample site:

// The URI we want to start crawling with
$seed = 'http://dmoztools.net/Computers/Internet/';

// We add an URI discoverer. Without it, the spider wouldn't get past the seed resource.
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//*[@id='cat-list-content-2']/div/a"));

Because my goal is simply to discover any hyperlink on the page, I tried broadening the XPath to something more general ("//a"), like this:

// We add an URI discoverer. Without it, the spider wouldn't get past the seed resource.
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//a"));

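Although this new XPath successfully crawls the sample site (dmoztools.net), it does not seem to work for the other examples I tried (below). It crawls the seed page but fails to discover or crawl any other URIs on that page, even though they have A HREF tags that should match the XPath.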

Example A: https://www.petco.com/shop/en/petcostore/category/fish

Example B: https://www.thetruthaboutcars.com/

Can you see where I am going wrong? Thanks!

Answer 1

The example code contains this line:

$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('http')));

It should be:

$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('http', 'https')));

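Note the addition of https as an allowed scheme. Without it, only URLs with the http scheme are allowed, and the example sites you tried use https, so every link discovered on them is filtered out.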
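The effect can be reproduced without php-spider. The following is a minimal sketch using only PHP's built-in DOMDocument, DOMXPath and parse_url; the HTML snippet, the example.com URLs and the helper functions discoverLinks and filterByScheme are hypothetical stand-ins for illustration, not php-spider's own code. It shows that "//a" does match the links, and that an http-only scheme list then discards the https ones.

```php
<?php
// Hypothetical HTML standing in for a fetched page (not taken from either example site).
$html = <<<HTML
<html><body>
  <a href="https://www.example.com/fish">Fish</a>
  <a href="http://www.example.com/dogs">Dogs</a>
  <a href="mailto:info@example.com">Mail</a>
</body></html>
HTML;

/**
 * Collect the href value of every element matched by an XPath expression.
 * Illustrative helper only; not part of php-spider.
 */
function discoverLinks(string $html, string $xpathExpression): array
{
    $doc = new DOMDocument();
    // Suppress parser warnings that real-world, non-well-formed markup tends to trigger.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $links = [];
    foreach ($xpath->query($xpathExpression) as $node) {
        if ($node instanceof DOMElement && $node->hasAttribute('href')) {
            $links[] = $node->getAttribute('href');
        }
    }
    return $links;
}

/**
 * Keep only URIs whose scheme appears in the allowed list.
 * Illustrative stand-in for a scheme filter; not php-spider's implementation.
 */
function filterByScheme(array $uris, array $allowedSchemes): array
{
    return array_values(array_filter(
        $uris,
        function (string $uri) use ($allowedSchemes): bool {
            $scheme = parse_url($uri, PHP_URL_SCHEME);
            return is_string($scheme)
                && in_array(strtolower($scheme), $allowedSchemes, true);
        }
    ));
}

$links = discoverLinks($html, '//a');

// With only 'http' allowed, the https link is dropped.
print_r(filterByScheme($links, ['http']));

// With both schemes allowed, the http and https links survive; mailto is still dropped.
print_r(filterByScheme($links, ['http', 'https']));
```

On a site that serves every link over https, the first call would return an empty list, which matches the symptom of the spider never getting past the seed page.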

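By the way, while testing this I found a bug where URLs with no path and no trailing slash would sometimes cause a failure. I added a fix for that bug in version 0.4.4. Please upgrade.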
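For reference, this is the kind of URL in question. The sketch below uses only PHP's built-in parse_url and a hypothetical helper pathOf; it does not reproduce the bug itself or show how php-spider handles these URLs internally, it only illustrates that a URL without a trailing slash has no path component at all.

```php
<?php
/**
 * Return the path component of a URL, or null when the URL has none.
 * Hypothetical helper for illustration; not part of php-spider.
 */
function pathOf(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    return is_string($path) ? $path : null;
}

// No trailing slash: there is no path component.
var_dump(pathOf('https://www.thetruthaboutcars.com'));

// With a trailing slash: the path component is "/".
var_dump(pathOf('https://www.thetruthaboutcars.com/'));
```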


php

web-crawler