Scrape patents.google with Scrapy fails

Question

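I am trying to use Scrapy to scrape the main heading of this page: https://patents.google.com/patent/CN102093389B/en ("Dioxygen-bridged heterocyclic pterocarpan compounds and preparation method thereof"), but it does not work. I am extracting the heading with a CSS selector. The same selector works in Puppeteer and returns the main heading, but in Scrapy it returns None. This is my code: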

import scrapy

class GooglepatentsspiderSpider(scrapy.Spider):
    name = 'googlePatentsSpider'
    allowed_domains = ['patents.google.com']
    start_urls = ['https://patents.google.com/patent/CN102093389B/en']

    def parse(self, response):
        title = response.css('h1#title::text').get()

        yield {
            'title': title
        }

Answer 1

Your CSS selector is incorrect. Try this instead: response.css('span[itemprop="title"]::text').get()

python

scrapy