Scrapy returns None when querying by XPath

Question

Hello, I am using Scrapy to scrape the website https://www.centralbankofindia.co.in. I get a response, but when I look up the address by XPath, I get None.

start_urls = [
    "https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page={}".format(
        i
    )
    for i in range(0, 5)
]
brand_name = "Central Bank of India"
spider_type = "chain"

# XPaths copied from the browser for the first three rows:
# //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[1]/td[2]/div/span[2]
# //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[2]/td[2]/div/span[2]
# //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[3]/td[2]/div/span[2]
def parse(self, response, **kwargs):
    """Parse response."""
    # print(response.text)
    for id in range(1, 11):
        address = self.get_text(
            response,
            f'//*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[{id}]/td[2]/div/span[2]',
        )
        print(address)

def get_text(self, response, path):
    # Return the first match for the XPath, or None if nothing matches.
    sol = response.xpath(path).extract_first()
    return sol

The span that holds the address on the website does not have a unique id. Is that what is causing the problem?

Answer 1

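I think the XPath you created is too complex. You should skip some elements and use // instead.

Some browsers show tbody in DevTools, but it may not exist in the HTML that Scrapy gets from the server, so it is better to always skip it.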
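The tbody point can be reproduced without a network request. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors; the markup and the addresses in it are hypothetical, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, as a server might send it: there is no <tbody>.
RAW_HTML = """
<div id="block-cbi-content">
  <table>
    <tr><td>1</td><td><div><span>Address:</span><span>First Street, Mumbai</span></div></td></tr>
    <tr><td>2</td><td><div><span>Address:</span><span>Second Street, Pune</span></div></td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# Path in the style copied from DevTools: it expects a <tbody>
# that the raw markup does not contain, so nothing matches.
strict = root.find("./table/tbody/tr[1]/td[2]/div/span[2]")

# Relaxed path: skip intermediate elements with // and never mention tbody.
relaxed = [span.text for span in root.findall(".//td[2]//span[2]")]

print(strict)   # None
print(relaxed)  # both addresses
```

The strict path fails on a single missing element, while the relaxed path keeps working if the nesting around the table changes.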

Instead of looping over tr[{id}] and calling extract_first() for each row, you can select all rows at once and call extract().

This XPath works for me.

all_items = response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]/text()').extract()
        
for address in all_items:
    print(address)

By the way: I used text() in the XPath to get the address without HTML tags.

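Full working code.

You can put everything in a single file and run it as python script.py without creating a project.

It saves the results in output.csv.

In start_urls I set only the link to the first page, because parse() searches the HTML for the link to the next page, so it can get all pages instead of only range(0, 5).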

#!/usr/bin/env python3

import scrapy

class MySpider(scrapy.Spider):
    
    start_urls = [
        # f"https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page={i}"
        # for i in range(0, 5)
        
        # only the first page; links to the other pages are found in the HTML
        "https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page=0"
    ]
    
    name = "Central Bank of India"
    
    def parse(self, response):
        print(f'url: {response.url}')
        
        all_items = response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]/text()').extract()
        
        for address in all_items:
            print(address)
            yield {'address': address}

        # get link to next page
        
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        
        if next_page:
            print(f'Next Page: {next_page}')
            yield response.follow(next_page)
            
# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(MySpider)
c.start()
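The href read from a rel="next" link is usually relative, and response.follow() resolves it against the URL of the current response before requesting it. A rough sketch of that resolution with the standard library follows; the next_href value is an assumption for illustration, not copied from the real page.

```python
from urllib.parse import urljoin

# URL of the page that parse() is currently handling.
current_url = (
    "https://www.centralbankofindia.co.in/en/branch-locator"
    "?field_state_target_id=All&combine=&page=0"
)

# Hypothetical value of the @href attribute on the rel="next" link.
next_href = "?field_state_target_id=All&combine=&page=1"

# A query-only relative link keeps the scheme, host and path of the base URL.
next_url = urljoin(current_url, next_href)

print(next_url)
```

Because the resolution happens inside response.follow(), the spider does not need to build the page URLs itself.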


python

scrapy

web-scraping