Why is restrict_xpaths ignoring hrefs inside <a> tags?
I am scraping a Wikipedia page to extract all of its image URLs. Here is the code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class WikiSpider(CrawlSpider):
    name = 'wiki'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Katy_Perry']

    rules = [Rule(LinkExtractor(restrict_xpaths=('//a[@class="image"]')),
                  callback='parse_item', follow=False),]

    def parse_item(self, response):
        print(response.url)
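When I run the spider, it shows no results, but when I change the XPath inside restrict_xpaths, it prints some random links. I need the hrefs matched by the XPath '//a[@class="image"]', but it does not work. What is the reason? I know I could use a basic Spider instead of CrawlSpider and avoid rules altogether, but I want to know why the XPath I entered does not work, and what kinds of XPath expressions and HTML tags restrict_xpaths accepts.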
The links you are after point to images:
$ scrapy shell "https://en.wikipedia.org/wiki/Katy_Perry" -s USER_AGENT='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
2016-08-19 11:17:05 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
(...)
2016-08-19 11:17:06 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Katy_Perry> (referer: None)
(...)
In [1]: response.xpath('//a[@class="image"]/@href').extract()
Out[1]:
['/wiki/File:Katy_Perry_DNC_July_2016_(cropped).jpg',
'/wiki/File:Katy_Perry_performing.jpg',
'/wiki/File:Katy_Perry%E2%80%93Zenith_Paris.jpg',
'/wiki/File:PWT_Cropped.jpg',
'/wiki/File:Alanis_Morissette_5-19-2014.jpg',
'/wiki/File:Freddie_Mercury_performing_in_New_Haven,_CT,_November_1977.jpg',
'/wiki/File:Katy_Perry_California_Dreams_Tour_01.jpg',
'/wiki/File:Katy_Perry_UNICEF_2012.jpg',
'/wiki/File:Katy_Perry_Hillary_Clinton,_I%27m_With_Her_Concert.jpg',
'/wiki/File:Wikiquote-logo.svg',
'/wiki/File:Commons-logo.svg']
By default, the link extractor filters out a lot of extensions, including images:
In [2]: from scrapy.linkextractors import LinkExtractor
In [3]: LinkExtractor(restrict_xpaths=('//a[@class="image"]')).extract_links(response)
Out[3]: []
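To illustrate why the list comes back empty, here is a rough, standard-library-only approximation of extension-based link filtering. It is not Scrapy's actual code: the names `DENIED`, `has_denied_extension` and `filter_links` are hypothetical, and the deny list is a short stand-in for Scrapy's much longer default list.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, abbreviated deny list for illustration only;
# Scrapy's real default list contains many more extensions.
DENIED = {'.jpg', '.png', '.gif', '.svg', '.pdf'}


def has_denied_extension(url, denied=DENIED):
    """Return True if the URL path ends in one of the denied extensions."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in denied


def filter_links(urls, denied=DENIED):
    """Keep only the URLs whose extension is not in the deny set."""
    return [u for u in urls if not has_denied_extension(u, denied)]


urls = [
    'https://en.wikipedia.org/wiki/File:Katy_Perry_performing.jpg',
    'https://en.wikipedia.org/wiki/File:Commons-logo.svg',
    'https://en.wikipedia.org/wiki/Katy_Perry',
]

# The two "File:" URLs end in .jpg and .svg, so they are dropped.
print(filter_links(urls))
# With an empty deny set, nothing is dropped.
print(filter_links(urls, denied=set()))
```

The point is that the Wikipedia "File:" pages are ordinary HTML pages, but their URLs end in .jpg or .svg, so a filter that only looks at the URL's extension treats them as images and discards them.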
You can use deny_extensions=[] to not filter anything:
In [4]: LinkExtractor(restrict_xpaths=('//a[@class="image"]'), deny_extensions=[]).extract_links(response)
Out[4]:
[Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_DNC_July_2016_(cropped).jpg', text='', fragment='', nofollow=False),
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_performing.jpg', text='', fragment='', nofollow=False),
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry%E2%80%93Zenith_Paris.jpg', text='', fragment='', nofollow=False),
Link(url='https://en.wikipedia.org/wiki/File:PWT_Cropped.jpg', text='', fragment='', nofollow=False),
Link(url='https://en.wikipedia.org/wiki/File:Alanis_Morissette_5-19-2014.jpg', text='', fragment='', nofollow=False),
Link(url='https://en.wikipedia.org/wiki/File:Freddie_Mercury_performing_in_New_Haven,_CT,_November_1977.jpg', text='', fragment='', nofollow=False),
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_California_Dreams_Tour_01.jpg', text='', fragment='', nofollow=False),
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_UNICEF_2012.jpg', text='', fragment='', nofollow=False),
Link(url="https://en.wikipedia.org/wiki/File:Katy_Perry_Hillary_Clinton,_I'm_With_Her_Concert.jpg", text='', fragment='', nofollow=False),
Link(url='https://en.wikipedia.org/wiki/File:Wikiquote-logo.svg', text='', fragment='', nofollow=False),
Link(url='https://en.wikipedia.org/wiki/File:Commons-logo.svg', text='', fragment='', nofollow=False)]