How to use a CSS selector on an object created from HtmlResponse
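I am currently developing an application with Scrapy.
I want to get some values with a CSS selector outside of def parse, so I first created an HtmlResponse object and tried to get the values with css(), but I can't get anything...
Inside def parse, the same approach returns the values.
What should I do outside of def parse?
Here is the code: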
import scrapy
from scrapy.http import HtmlResponse


class SampleSpider(scrapy.Spider):
    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = ['https://sample.com/search']

    my_response = HtmlResponse(url=start_urls[0])
    print('HtmlResponse')
    print(my_response)
    h3s = my_response.css('h3')
    print(str(len(h3s)))
    print('----------')

    def parse(self, response, **kwargs):
        print('def parse')
        print(response)
        h3s = response.css('h3')
        print(str(len(h3s)))
Console output:
HtmlResponse
<200 https://sample.com/search>
0 # <- I want to show '3' here
----------
def parse
<200 https://sample.com/search>
3
Update
The program I ultimately want to create is the code below:
[(Note) Do not use the following code as a reference]
import scrapy
from scrapy.http import HtmlResponse


class SampleSpider(scrapy.Spider):
    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = []

    response_url = 'https://sample.com/search'
    my_response = HtmlResponse(url=response_url)
    categories = my_response.css('.categories a::attr(href)').getall()
    for category in categories:
        start_urls.append(category)

    def parse(self, response, **kwargs):
        pages = response.css('h3')
        for page in pages:
            print(page.css('::text').get())
Python 3.8.5
Scrapy 2.5.0
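I see what you mean: your start URL is the base domain, but you also want to visit every category page to extract the h3 elements.
In Scrapy, you can extract data and follow new links in the same parse method. Here is an example.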
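import scrapy


class SampleSpider(scrapy.Spider):
    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = ['https://sample.com/search']

    def parse(self, response, **kwargs):
        print('def parse')
        print(response)
        pages = response.css('h3')
        # Extract data here. A callback must yield items (e.g. dicts) or
        # requests, not bare strings, so the text is wrapped in a dict.
        # The 'title' key is an arbitrary name chosen for illustration.
        for page in pages:
            title = page.css('::text').get()
            print(title)
            yield {'title': title}
        # Follow new links here. response.follow also accepts relative
        # hrefs, which scrapy.Request would reject.
        categories = response.css('.categories a::attr(href)').getall()
        for category in categories:
            yield response.follow(category, callback=self.parse)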
You can read the Scrapy documentation for more information.