Scrapy / XPath not working to get href element?
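I am trying to scrape some data from this site and am working in the Scrapy shell:
https://www.tripadvisor.co.uk/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html
The page contains the following markup, and I want to get the href of all three a elements: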
<div class="fvqxY f dlzPP">
<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"
href="http://www.blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Visit website</span><svg viewBox="0 0 24 24"
width="16px" height="16px" class="fecdL d Vb wQMPa">
<path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path>
</svg></a></div>
<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
href="tel:%2B44%201253%20830830"><span class="WlYyy cacGK Wb">Call</span></a></div>
<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
href="mailto:info@blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Email</span></a></div>
</div>
I tried this XPath, which works fine for me in the Chrome inspector, but in the Scrapy shell I only get an empty result:
>>> response.xpath("//div[@class='Lvkmj']//ancestor::a/@href")
[]
I also checked the first div with class="Lvkmj" and got this result:
>>> response.xpath("//div[@class='Lvkmj']").get()
'<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"><span class="WlYyy cacGK Wb">Visit website</span><svg viewbox="0 0 24 24" width="16px" height="16px" class="fecdL d Vb wQMPa"><path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path></svg></a></div>'
>>>
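At first glance this looked like the whole div element, but then I noticed that, although it otherwise matches what the inspector shows, the href attribute is missing for whatever reason.
Why is the href attribute missing when using the Scrapy shell in this case?
You can find the full code below: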
import scrapy


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = [
        "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html",
        "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
    ]

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            link = response.urljoin(elem.xpath(".//a/@href").get())
            yield response.follow(link, callback=self.parseDetails)

    def parseDetails(self, response):
        tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()
        tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()
        tmpErg = response.xpath("//div[@class='dlzPP']//ancestor::div[@class='WlYyy diXIH dDKKM']/text()").getall()
        yield {
            "cat": tmpErg[1],
            "link": tmpLink,
            "name": tmpName,
        }
Update: at first @Fazlul's solution worked well, but after a few runs the HREFs list no longer contains any output. Here is part of the log I get from Scrapy:
2021-11-18 12:16:32 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 140
2021-11-18 12:16:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.co.uk/Attraction_Review-g580423-d3427163-Reviews-Pontefract_Races-Pontefract_West_Yorkshire_England.html> (referer: https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html)
2021-11-18 12:16:32 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/url {"url": "https://www.tripadvisor.co.uk/Attraction_Review-g190792-d3250215-Reviews-Cartmel_Racecourse-Grange_over_Sands_Lake_District_Cumbria_England.html"}
2021-11-18 12:16:34 [urllib3.connectionpool] DEBUG: http://localhost:62481 "POST /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 14
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/source {}
2021-11-18 12:16:34 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/source HTTP/1.1" 200 732552
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/url {}
[1118/121634.931:INFO:CONSOLE(1)] "Evidon -- evidon-notice-link not found on page, cant display the consent link.", source: https://c.evidon.com/sitenotice/evidon-sitenotice-tag.js (1)
2021-11-18 12:16:34 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 156
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.co.uk/Attraction_Review-g190792-d3250215-Reviews-Cartmel_Racecourse-Grange_over_Sands_Lake_District_Cumbria_England.html> (referer: https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html)
2021-11-18 12:16:34 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/url {"url": "https://www.tripadvisor.co.uk/Attraction_Review-g186332-d7692268-Reviews-Coral_Island_Blackpool-Blackpool_Lancashire_England.html"}
2021-11-18 12:16:36 [urllib3.connectionpool] DEBUG: http://localhost:62481 "POST /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 14
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/source {}
2021-11-18 12:16:36 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/source HTTP/1.1" 200 872845
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://localhost:62481/session/7dacecfe2b35d929244b907b93efd712/url {}
[1118/121636.173:INFO:CONSOLE(1)] "Evidon -- evidon-notice-link not found on page, cant display the consent link.", source: https://c.evidon.com/sitenotice/evidon-sitenotice-tag.js (1)
2021-11-18 12:16:36 [urllib3.connectionpool] DEBUG: http://localhost:62481 "GET /session/7dacecfe2b35d929244b907b93efd712/url HTTP/1.1" 200 141
2021-11-18 12:16:36 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 12:16:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.co.uk/Attraction_Review-g186332-d7692268-Reviews-Coral_Island_Blackpool-Blackpool_Lancashire_England.html> (referer: https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html)
2021-11-18 12:16:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g656899-d13201486-Reviews-Gala_Bingo-Cramlington_Northumberland_England.html>
{'name': 'Gala Bingo', 'HREFs': []}
2021-11-18 12:16:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580427-d2663587-Reviews-Romford_Greyhound_Stadium-Romford_Greater_London_England.html>
{'name': 'Romford Greyhound Stadium', 'HREFs': []}
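Your XPath:
//div[@class='Lvkmj']//ancestor::a/@href
does give results (as you saw in the Chrome inspector), because the second // tells the XPath engine to visit every descendant node of the current node, and ancestor::a then tells the engine to find any ancestor element named a. Since the a element has descendants (the span, for example), your XPath finds it. But there is a simpler way. Just use:
//div[@class='Lvkmj']/a/@href
Here /a means: give me the direct child of div[@class='Lvkmj'] that is named a.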
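As a small illustration of the direct-child form, here is a sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree, which only supports a limited XPath subset (no ancestor axis, no /@href step), so the attribute is read with .get("href") instead. The trimmed markup and the helper name direct_child_hrefs are my own simplifications, not part of the original page or of Scrapy. In Scrapy itself the equivalent would be response.xpath("//div[@class='Lvkmj']/a/@href").getall().

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the markup from the question
# (most class names and the svg icon are dropped for brevity).
SNIPPET = """
<div class="fvqxY f dlzPP">
  <div class="Lvkmj"><a rel="nofollow" href="http://www.blackpoolzoo.org.uk"><span>Visit website</span></a></div>
  <div class="Lvkmj"><a rel="nofollow" href="tel:%2B44%201253%20830830"><span>Call</span></a></div>
  <div class="Lvkmj"><a rel="nofollow" href="mailto:info@blackpoolzoo.org.uk"><span>Email</span></a></div>
</div>
"""


def direct_child_hrefs(markup):
    """Return the href of every <a> that is a direct child of div.Lvkmj."""
    root = ET.fromstring(markup)
    # ".//div[@class='Lvkmj']/a" selects <a> elements that are direct
    # children of a matching <div>, mirroring //div[@class='Lvkmj']/a.
    anchors = root.findall(".//div[@class='Lvkmj']/a")
    return [a.get("href") for a in anchors if a.get("href") is not None]


hrefs = direct_child_hrefs(SNIPPET)
print(hrefs)
```

With the rendered markup this yields the website, tel: and mailto: links in document order. With the markup Scrapy actually received (no href attributes) the list would be empty, whichever XPath form is used.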
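But that will not solve your problem.
Your question: why is the href attribute missing when using the Scrapy shell in this case?
I think it is because Scrapy only works with the source of the document as downloaded, not with the DOM after JavaScript has updated it.
If it did use the updated DOM, your line
tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()
would return a list of strings. So you have to loop through the result or, if you are only interested in the first match, use:
tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href ").get()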
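One way to check that the href is absent from the downloaded source itself, rather than lost by the XPath, is to list every a tag together with its href (or None). The sketch below does this with the standard library's html.parser as a stand-in for the Scrapy selector. The two strings are shortened versions of the markup quoted in the question, and AnchorHrefCollector and anchor_hrefs are hypothetical helper names.

```python
from html.parser import HTMLParser


class AnchorHrefCollector(HTMLParser):
    """Collect the href value (or None) of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; a missing href becomes None.
            self.hrefs.append(dict(attrs).get("href"))


def anchor_hrefs(markup):
    parser = AnchorHrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


# What the Scrapy shell returned (shortened): the <a> has no href attribute.
RAW = '<div class="Lvkmj"><a rel="nofollow" target="_blank"><span>Visit website</span></a></div>'
# What the browser inspector shows after JavaScript has run (shortened).
RENDERED = '<div class="Lvkmj"><a rel="nofollow" target="_blank" href="http://www.blackpoolzoo.org.uk"><span>Visit website</span></a></div>'

raw_hrefs = anchor_hrefs(RAW)
rendered_hrefs = anchor_hrefs(RENDERED)
print(raw_hrefs)       # [None]
print(rendered_hrefs)  # ['http://www.blackpoolzoo.org.uk']
```

If the raw response gives None for these anchors, no XPath can recover the value, and the page has to be rendered first (for example with a browser driver).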
Following @Siebe Jongebloed's answer (no result, because some JavaScript DOM changes seem to be happening), I tried scrapy_selenium to get the data.
So I changed the code to:
import scrapy
from shutil import which
from scrapy_selenium import SeleniumRequest

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = r"C:\Users\Polzi\Documents\DEV\Python-Private\chromedriver.exe"
SELENIUM_DRIVER_ARGUMENTS = ['--headless', "--no-sandbox", "--disable-gpu"]
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = [
        "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
    ]
    existList = []

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            link = response.urljoin(elem.xpath(".//a/@href").get())
            yield SeleniumRequest(url=link, callback=self.parseDetails)

    def parseDetails(self, response):
        tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()
        tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()
        yield {
            "name": tmpName,
            "HREFs": tmpLink
        }
But the HREFs result list is still empty...
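A hedged side note on the listing above: Scrapy reads its settings from the project's settings.py or from a spider's custom_settings class attribute. If the SELENIUM_* and DOWNLOADER_MIDDLEWARES assignments really sit at module level in the spider file, they would most likely be ignored, and the Selenium middleware would never be enabled. That is only one possible cause and is not confirmed in this thread. The sketch below collects the same settings in a plain dictionary that could be assigned to custom_settings. The name SELENIUM_SETTINGS is made up for illustration, and which('chromedriver') assumes the driver is on PATH.

```python
from shutil import which

# Hypothetical container for the scrapy_selenium settings used in this thread.
# Intended use inside the spider class:
#     class ZoosSpider(scrapy.Spider):
#         custom_settings = SELENIUM_SETTINGS
SELENIUM_SETTINGS = {
    "SELENIUM_DRIVER_NAME": "chrome",
    # which() returns None when chromedriver is not found on PATH.
    "SELENIUM_DRIVER_EXECUTABLE_PATH": which("chromedriver"),
    "SELENIUM_DRIVER_ARGUMENTS": ["--headless"],
    "DOWNLOADER_MIDDLEWARES": {
        "scrapy_selenium.SeleniumMiddleware": 800,
    },
}

print(sorted(SELENIUM_SETTINGS))
```

Putting the same keys into settings.py has the same effect for the whole project, while custom_settings keeps them local to one spider.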
@Rapid1898 Here is a solution using SeleniumRequest that works so far:
import scrapy
from scrapy_selenium import SeleniumRequest


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html",
            wait_time=3,
            callback=self.parse)

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            link = response.urljoin(elem.xpath(".//a/@href").get())
            yield SeleniumRequest(url=link, callback=self.parseDetails)

    def parseDetails(self, response):
        tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()
        tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()
        yield {
            "name": tmpName,
            "HREFs": tmpLink
        }
settings.py file:
#Middleware
DOWNLOADER_MIDDLEWARES = {
'scrapy_selenium.SeleniumMiddleware': 800
}
#Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
Output:
{'name': 'Sandown Park Racecourse', 'HREFs': ['http://www.sandown.co.uk/', 'tel:%2B44%201372%20464348', 'mailto:sandown.ticketing@thejockeyclub.co.uk']}
2021-11-18 00:55:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g656899-d13201486-Reviews-Gala_Bingo-Cramlington_Northumberland_England.html>
{'name': 'Gala Bingo', 'HREFs': ['https://www.galabingoclubs.co.uk/club/cramlington.html',
'tel:%2B44%201670%20739739', 'mailto:Cramlington.club@galaleisure.com']}
2021-11-18 00:55:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580427-d2663587-Reviews-Romford_Greyhound_Stadium-Romford_Greater_London_England.html>
{'name': 'Romford Greyhound Stadium', 'HREFs': []}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g190818-d7364032-Reviews-Wetherby_Racecourse-Wetherby_Leeds_West_Yorkshire_England.html>
{'name': 'Wetherby Racecourse', 'HREFs': ['http://www.wetherbyracing.co.uk', 'tel:%2B44%201937%20582035']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580423-d3427163-Reviews-Pontefract_Races-Pontefract_West_Yorkshire_England.html>
{'name': 'Pontefract Races', 'HREFs': ['http://www.pontefract-races.co.uk/', 'tel:%2B44%201977%20781307', 'mailto:info@pontefract-races.co.uk']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g190792-d3250215-Reviews-Cartmel_Racecourse-Grange_over_Sands_Lake_District_Cumbria_England.html>
{'name': 'Cartmel Racecourse', 'HREFs': ['http://www.cartmel-racecourse.co.uk', 'tel:%2B44%2015395%2036340', 'mailto:info@cartmel-racecourse.co.uk']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g186332-d7692268-Reviews-Coral_Island_Blackpool-Blackpool_Lancashire_England.html>
{'name': 'Coral Island Blackpool', 'HREFs': []}
2021-11-18 00:55:53 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-18 00:55:53 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:60913/session/8821f802ba0aeaa844dec796ad9187b3 {}
2021-11-18 00:55:53 [urllib3.connectionpool] DEBUG: http://127.0.0.1:60913 "DELETE /session/8821f802ba0aeaa844dec796ad9187b3 HTTP/1.1" 200 14
2021-11-18 00:55:53 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 00:55:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 19882232,
'downloader/response_count': 31,
'downloader/response_status_count/200': 31
...and so on