How to use a CSS selector on an object created from HtmlResponse

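I am currently developing an application with Scrapy.

I want to get some values with a CSS selector outside of def parse, so I first created an HtmlResponse object and tried to get the values with css(), but I can't get anything...

Inside def parse, I can get the values in the same way.

What should I do outside of def parse?


Here is the code: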

import scrapy
from scrapy.http import HtmlResponse


class SampleSpider(scrapy.Spider):

    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = ['https://sample.com/search']

    my_response = HtmlResponse(url=start_urls[0])

    print('HtmlResponse')
    print(my_response)

    h3s = my_response.css('h3')

    print(str(len(h3s)))

    print('----------')

    def parse(self, response, **kwargs):

        print('def parse')
        print(response)

        h3s = response.css('h3')

        print(str(len(h3s)))

Console output:

HtmlResponse
<200 https://sample.com/search>
0 # <- I want to show '3' here
----------
def parse
<200 https://sample.com/search>
3

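Update

The program I ultimately want to create is the code below:

[Note: the following code does not work; it is shown for reference only]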

import scrapy
from scrapy.http import HtmlResponse


class SampleSpider(scrapy.Spider):

    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = []
    response_url = 'https://sample.com/search'

    my_response = HtmlResponse(url=response_url)
    categories = my_response.css('.categories a::attr(href)').getall()

    for category in categories:
        start_urls.append(category)

    def parse(self, response, **kwargs):
        
        pages = response.css('h3')

        for page in pages:
            print(page.css('::text').get())
        

Python 3.8.5

Scrapy 2.5.0

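I think I understand what you mean: your start URL is the base domain, but you also want to fetch all the category pages and extract the h3 elements from them.
In Scrapy, you can extract data and follow new links in the same parse method. Here is an example.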

import scrapy


class SampleSpider(scrapy.Spider):

    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = ['https://sample.com/search']

    def parse(self, response, **kwargs):

        print('def parse')
        print(response)

        pages = response.css('h3')

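        # Extract data here. Scrapy callbacks must yield items
        # (e.g. dicts) or Requests, not bare strings.
        for page in pages:
            title = page.css('::text').get()
            print(title)
            yield {'title': title}

        # Follow new links here. response.follow() also resolves relative hrefs.
        categories = response.css('.categories a::attr(href)').getall()
        for category in categories:
            yield response.follow(category, callback=self.parse)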

You can read the Scrapy documentation for more information.