Scrapy / Xpath not working to get href-element?

I am trying to scrape some information from this site while working in the scrapy shell: https://www.tripadvisor.co.uk/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html

The site contains the following piece of code, and I would like to get the href information of all three a-elements:

<div class="fvqxY f dlzPP">
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"
      href="http://www.blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Visit website</span><svg viewBox="0 0 24 24"
        width="16px" height="16px" class="fecdL d Vb wQMPa">
        <path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path>
      </svg></a></div>
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
      href="tel:%2B44%201253%20830830"><span class="WlYyy cacGK Wb">Call</span></a></div>
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
      href="mailto:info@blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Email</span></a></div>
</div>

I tried it with this xpath - which works fine for me in the Chrome inspector - but I only get an empty result:

>>> response.xpath("//div[@class='Lvkmj']//ancestor::a/@href") 
[] 

I also checked the first div with class="Lvkmj" and got this result:

>>> response.xpath("//div[@class='Lvkmj']").get()
'<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"><span class="WlYyy cacGK Wb">Visit website</span><svg viewbox="0 0 24 24" width="16px" height="16px" class="fecdL d Vb wQMPa"><path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path></svg></a></div>'
>>>

At first glance this looks like the whole div element - but on closer inspection it is exactly what the inspector shows, except that for whatever reason the href attribute is missing.

Why is the href attribute missing when using the scrapy shell in that case?

You can find the complete code below:

import scrapy

class ZoosSpider(scrapy.Spider):
  name = 'zoos'
  allowed_domains = ['www.tripadvisor.co.uk']
  start_urls = [
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html",
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
              ]

  def parse(self, response):
    tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
    for elem in tmpSEC:
      link = response.urljoin(elem.xpath(".//a/@href").get())   
      yield response.follow(link, callback=self.parseDetails)             

  def parseDetails(self, response):
    tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
    tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()
    tmpErg = response.xpath("//div[@class='dlzPP']//ancestor::div[@class='WlYyy diXIH dDKKM']/text()").getall()
    
    yield {
      "cat": tmpErg[1],
      "link": tmpLink,
      "name": tmpName ,
    }

UPDATE - At first the solution from @Fazlul worked fine - but after a few more tries the HREFs list no longer produces any output - this is a part of the log I get from scrapy:

2021-11-18 12:16:32 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 140
2021-11-18 12:16:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.co.uk/Attraction_Review-g580423-d3427163-Reviews-Pontefract_Races-Pontefract_West_Yorkshire_England.html> (referer: https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html)
2021-11-18 12:16:32 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/url {"url": "https://www.tripadvisor.co.uk/Attraction_Review-g190792-d3250215-Reviews-Cartmel_Racecourse-Grange_over_Sands_Lake_District_Cumbria_England.html"}
2021-11-18 12:16:34 [urllib3.connectionpool] DEBUG: http://localhost:62481 "POST /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 14
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/source {}
2021-11-18 12:16:34 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/source HTTP/1.1" 200 732552
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/url {}
[1118/121634.931:INFO:CONSOLE(1)] "Evidon -- evidon-notice-link not found on page, cant display the consent link.", source: https://c.evidon.com/sitenotice/evidon-sitenotice-tag.js (1)
2021-11-18 12:16:34 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 156
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.co.uk/Attraction_Review-g190792-d3250215-Reviews-Cartmel_Racecourse-Grange_over_Sands_Lake_District_Cumbria_England.html> (referer: https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html)
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/url {"url": "https://www.tripadvisor.co.uk/Attraction_Review-g186332-d7692268-Reviews-Coral_Island_Blackpool-Blackpool_Lancashire_England.html"}
2021-11-18 12:16:36 [urllib3.connectionpool] DEBUG: http://localhost:62481 "POST /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 14
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/source {}
2021-11-18 12:16:36 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/source HTTP/1.1" 200 872845
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/url {}
[1118/121636.173:INFO:CONSOLE(1)] "Evidon -- evidon-notice-link not found on page, cant display the consent link.", source: https://c.evidon.com/sitenotice/evidon-sitenotice-tag.js (1)
2021-11-18 12:16:36 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 141
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.co.uk/Attraction_Review-g186332-d7692268-Reviews-Coral_Island_Blackpool-Blackpool_Lancashire_England.html> (referer: https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html)
2021-11-18 12:16:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g656899-d13201486-Reviews-Gala_Bingo-Cramlington_Northumberland_England.html>
{'name': 'Gala Bingo', 'HREFs': []}
2021-11-18 12:16:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580427-d2663587-Reviews-Romford_Greyhound_Stadium-Romford_Greater_London_England.html>
{'name': 'Romford Greyhound Stadium', 'HREFs': []}

Your XPath:

//div[@class='Lvkmj']//ancestor::a/@href

does show results in the browser, because your second // tells the XPath engine to find any descendant node of the current node, and ancestor::a then tells the engine to find any ancestor element named a of those descendants. Because the a does have descendants, your XPath yields results there. But there is a better way; just use:

//div[@class='Lvkmj']/a/@href

The /a means: give me the direct child of div[@class='Lvkmj'] that is named a.

But that does not solve your problem.

Your question was: Why is the href attribute missing when using the scrapy shell in that case?

Because, as far as I can tell, Scrapy only uses the source of the document, not the DOM after it has been updated by JavaScript.

If it did use the updated DOM, your line

tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()

would return an array of strings, so you would have to loop through the results - or, if you are only interested in the first match, use:

tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href ").get()

Following the answer from @Siebe Jongebloed (no results - because some JavaScript DOM changes seem to happen), I tried scrapy_selenium to get the data.

So I changed the code to this:

import scrapy
from shutil import which

from scrapy_selenium import SeleniumRequest

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = r"C:\Users\Polzi\Documents\DEV\Python-Private\chromedriver.exe"
SELENIUM_DRIVER_ARGUMENTS=['--headless', "--no-sandbox", "--disable-gpu"]
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

class ZoosSpider(scrapy.Spider):
  name = 'zoos'
  allowed_domains = ['www.tripadvisor.co.uk']
  start_urls = [
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
                ]  
  existList = []  

  def parse(self, response):
    tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
    for elem in tmpSEC:
      link = response.urljoin(elem.xpath(".//a/@href").get())   
      yield SeleniumRequest(url=link, callback=self.parseDetails)  

  def parseDetails(self, response):
    tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
    tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()    
    
    yield {
      "name": tmpName ,
      "HREFs": tmpLink
    }

But the HREFs result list is still empty...

@Rapid1898 This is the working solution so far, using SeleniumRequest:

import scrapy

from scrapy_selenium import SeleniumRequest


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html",
             wait_time= 3,
             callback=self.parse)
    def parse(self, response):
        tmpSEC = response.xpath( "//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            link = response.urljoin(elem.xpath(".//a/@href").get())
            yield SeleniumRequest(url=link, callback=self.parseDetails)

    def parseDetails(self, response):
        tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
        tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()    
    
        yield {
        "name": tmpName ,
        "HREFs": tmpLink
        }

The settings.py file:

#Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

#Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

Output:

{'name': 'Sandown Park Racecourse', 'HREFs': ['http://www.sandown.co.uk/', 'tel:%2B44%201372%20464348', 'mailto:sandown.ticketing@thejockeyclub.co.uk']}
2021-11-18 00:55:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g656899-d13201486-Reviews-Gala_Bingo-Cramlington_Northumberland_England.html>
{'name': 'Gala Bingo', 'HREFs': ['https://www.galabingoclubs.co.uk/club/cramlington.html', 
'tel:%2B44%201670%20739739', 'mailto:Cramlington.club@galaleisure.com']}
2021-11-18 00:55:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580427-d2663587-Reviews-Romford_Greyhound_Stadium-Romford_Greater_London_England.html>
{'name': 'Romford Greyhound Stadium', 'HREFs': []}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g190818-d7364032-Reviews-Wetherby_Racecourse-Wetherby_Leeds_West_Yorkshire_England.html>
{'name': 'Wetherby Racecourse', 'HREFs': ['http://www.wetherbyracing.co.uk', 'tel:%2B44%201937%20582035']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580423-d3427163-Reviews-Pontefract_Races-Pontefract_West_Yorkshire_England.html>
{'name': 'Pontefract Races', 'HREFs': ['http://www.pontefract-races.co.uk/', 'tel:%2B44%201977%20781307', 'mailto:info@pontefract-races.co.uk']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g190792-d3250215-Reviews-Cartmel_Racecourse-Grange_over_Sands_Lake_District_Cumbria_England.html>
{'name': 'Cartmel Racecourse', 'HREFs': ['http://www.cartmel-racecourse.co.uk', 'tel:%2B44%2015395%2036340', 'mailto:info@cartmel-racecourse.co.uk']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g186332-d7692268-Reviews-Coral_Island_Blackpool-Blackpool_Lancashire_England.html>
{'name': 'Coral Island Blackpool', 'HREFs': []}
2021-11-18 00:55:53 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-18 00:55:53 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:60913/session/8821f802ba0aeaa844dec796ad9187b3 {}
2021-11-18 00:55:53 [urllib3.connectionpool] DEBUG: http://127.0.0.1:60913 "DELETE /session/8821f802ba0aeaa844dec796ad9187b3 HTTP/1.1" 200 14
2021-11-18 00:55:53 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 00:55:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 19882232,
 'downloader/response_count': 31,
 'downloader/response_status_count/200': 31

...and so on