Extracting elements from list using xpath

I'm trying to extract the list of cities on TripAdvisor from this page: https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html

I want to do this using only scrapy and xpaths. What I have tried:

def parse(self, response):
    cities = response.xpath('//div[@id="LOCATION_LIST"]')
    for links in cities:
        loader = ItemLoader(AdvisorItem(), selector=links)
        loader.add_xpath('cities', './/ul[@class="geoList"]/li/span[@class="state"]//text()')
        loader.add_xpath('cities_url', './/ul[@class="geoList"]/li/a//@href')
        yield loader.load_item()

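This does not return what I expect: the result is West Yorkshire, which isn't on that page! So I'm not sure where it's getting it from. How can I get the correct xpath for the links and the city names, for all the links on that page?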

You can try correcting the xpath locator so that it selects this way:

//*[@class="geoList"]/li 

This selects the list of elements, one per city. Then, relative to each element:

".//a/text()"

".//a/@href"

These select the city name and the link for each city.
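The idea of selecting one node per city first, and only then querying inside each node, can be tried offline before running a spider. The following is a minimal sketch, not the Scrapy code itself: it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so the link text and `href` are read with `.text` and `.get()` instead of `text()` and `@href`. `SAMPLE_HTML`, `BASE_URL` and `extract_cities` are hypothetical names, and the markup is a simplified assumption about the page, not a copy of it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified markup standing in for the real page.
SAMPLE_HTML = """
<div id="LOCATION_LIST">
  <ul class="geoList">
    <li><a href="/Restaurants-g186408-Bradford_West_Yorkshire_England.html">Bradford Restaurants</a></li>
    <li><a href="/Restaurants-g186258-Plymouth_Devon_England.html">Plymouth Restaurants</a></li>
  </ul>
</div>
"""

# Assumed site root used to resolve relative hrefs.
BASE_URL = "https://www.tripadvisor.co.uk"


def extract_cities(markup, base_url=BASE_URL):
    """Return one dict per <li>, with the city name and an absolute link."""
    root = ET.fromstring(markup)
    rows = []
    # One node per city: every <li> directly under the element with class "geoList".
    for li in root.findall(".//*[@class='geoList']/li"):
        # Relative lookup: only the <a> inside this particular <li>.
        link = li.find(".//a")
        if link is None:
            continue
        rows.append({
            "city": (link.text or "").strip(),
            # urljoin resolves a site-relative href against the base URL.
            "link": urljoin(base_url, link.get("href", "")),
        })
    return rows


if __name__ == "__main__":
    for row in extract_cities(SAMPLE_HTML):
        print(row)
```

Because each lookup is relative to a single `<li>`, every name stays paired with its own link, instead of all names and all links being collected into two separate lists. Inside a spider, `response.urljoin(url)` plays the role that `urljoin` plays here.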

An example implemented in scrapy:

Script:

import scrapy


class TripSpider(scrapy.Spider):
    name = 'trip'
    allowed_domains = ["tripadvisor.co.uk"]
    start_urls = ['https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html']

    def parse(self, response):
        # Each <li> under the "geoList" element is one city entry.
        cities = response.xpath('//*[@class="geoList"]/li')
        for city in cities:
            # XPaths starting with "." are evaluated relative to the current <li>.
            url = city.xpath(".//a/@href").get()
            # The href is site-relative, so prefix the site root.
            abs_url = f'https://www.tripadvisor.co.uk{url}'
            yield {
                'city': city.xpath(".//a/text()").get(),
                'link': abs_url,
            }

Output:

{'city': 'Bradford Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186408-Bradford_West_Yorkshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Plymouth Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186258-Plymouth_Devon_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Southend-on-Sea Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g503790-Southend_on_Sea_Essex_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Swansea Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186466-Swansea_Swansea_County_South_Wales_Wales.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Aberdeen Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186487-Aberdeen_Aberdeenshire_Scotland.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Coventry Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186403-Coventry_West_Midlands_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Portsmouth Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186298-Portsmouth_Hampshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Kingston-upon-Hull Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186317-Kingston_upon_Hull_East_Riding_of_Yorkshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Oxford Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186361-Oxford_Oxfordshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Isle of Wight Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186308-Isle_of_Wight_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Doncaster Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187067-Doncaster_South_Yorkshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Reading Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186363-Reading_Berkshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Cambridge Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186225-Cambridge_Cambridgeshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Milton Keynes Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187055-Milton_Keynes_Buckinghamshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Derby Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187048-Derby_Derbyshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Stockport Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g528793-Stockport_Greater_Manchester_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Northampton Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186349-Northampton_Northamptonshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Bolton Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187053-Bolton_Greater_Manchester_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Bath Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186370-Bath_Somerset_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Preston Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187062-Preston_Lancashire_England.html'}
2021-12-08 23:58:50 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-08 23:58:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 345,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 103132,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 3.084321,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 8, 17, 58, 50, 809225),
 'httpcompression/response_bytes': 384303,
 'httpcompression/response_count': 1,
 'item_scraped_count': 20,