Scraping links that I don't want to, but I don't know how to exclude them

Question

Suppose I have this structure:

<div data-next="link0">
   <a href="link1"/>
   <a href="link2"/>
   <a href="link3"/>
   <a href="link4"/>
</div>

For my Rule object, I only want to follow link0, not link1, link2, link3, or link4.

How can I do that?

I tried:

Rule(LinkExtractor(restrict_xpaths=('//div[@data-next]/@data-next')), callback='parse_item'),

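But it doesn't work, because restrict_xpaths needs a reference to the element, not directly to the link value. However, if I remove the /@data-next part, link1, link2, link3, and link4 get scraped as well.

So, is there a way to use a Rule object to scrape only link0 in this case?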

Answer 1

Use the following XPath:

//div[@data-next="link0"]

Answer 2

Rule(LinkExtractor(restrict_xpaths='//div[@data-next]', tags='div', attrs='data-next'), callback='parse_item'),

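By default, LinkExtractor looks for <a> tags and the @href attribute. Here, you explicitly specify which tags and attributes the search should consider instead. More information from the Scrapy docs: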

Parameters:

(...)

tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').

attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)


python

xpath

scrapy