Scrapy XPath not working - only in combination with a CSS selector?
I am trying to scrape the following site with Scrapy and am experimenting in scrapy shell -
This is the base spider:
import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']
    start_urls = ['http://https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html/']

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            pass
This XPath gives me all the relevant sections:
(len(tmpSEC) returns 30, which looks right to me)
tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
Now I want to extract the first href attribute from a section and tried this XPath:
(but the result is just "/")
>>> tmpSEC[0].xpath("//a/@href").get()
'/'
and also
>>> tmpSEC[0].xpath("(//a)[1]/@href").get()
'/'
It only works when I use a CSS selector:
>>> tmpSEC[0].css("a::attr(href)").get()
'/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html'
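Why does this work with the CSS selector but not with the XPath selector?
Here is a working solution using XPath. An expression that starts with // is evaluated from the document root even when it is called on a sub-selector such as tmpSEC[0], so //a matches the first link on the whole page. Prefix the expression with a dot (.) to make it relative to the current section, like this: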
import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html/']

    def parse(self, response):
        tmpSEC = response.xpath(
            "//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        # for elem in tmpSEC:
        yield {
            'link': tmpSEC[0].xpath(".//a/@href").get()
        }
Output:
{'link': '/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html'}