Why is scrapy returning None for the "Title" item?
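I'm trying to scrape https://www.jobs.ch/de/stellenangebote/administration-hr-consulting-ceo/ and I'm stuck because Scrapy returns None for the "Title" item, which is the job title. The CSS selector works fine in the shell, and the other items are extracted correctly. I have tried changing the selector and adding a delay, but nothing seems to help. Does anyone have an idea? Code below.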
import scrapy
from jobscraping.items import JobscrapingItem


class GetdataSpider(scrapy.Spider):
    name = 'getdata2'
    start_urls = ['https://www.jobs.ch/de/stellenangebote/administration-hr-consulting-ceo/']

    def parse(self, response):
        for add in response.css('div.sc-AxiKw.VacancySerpItem__ShadowBox-qr45cp-0.hqhfbd'):
            item = JobscrapingItem()
            addpage = response.urljoin(add.css('div.sc-AxiKw.VacancySerpItem__ShadowBox-qr45cp-0.hqhfbd a::attr(href)').get())
            item['link'] = addpage
            request = scrapy.Request(addpage, callback=self.get_addinfos)
            request.meta['item'] = item
            yield request

    def get_addinfos(self, response):
        item = response.meta['item']
        item['Title'] = response.css('.sc-AxhUy.Text__h2-jiiyzm-1.eBKnmN.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.iNTZsv::text').get()
        item['Company'] = response.css('span.sc-fzqNJr.Text__span-jiiyzm-8.kGLBca.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.kjfvVS::text').get()
        item['Location'] = response.css('span.sc-fzqNJr.Text__span-jiiyzm-8.kGLBca.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.WBPTt::text').getall()
        yield item
Here is the items.py file:
import scrapy


class JobscrapingItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    Title = scrapy.Field()
    Company = scrapy.Field()
    Location = scrapy.Field()
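You are using a more complex CSS selector than you need. Remember that you don't always have to select by class or id; you can use other attributes as well. In this case, the attribute data-cy="vacancy-title" seems like a perfect fit.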
item['Title'] = response.css('h1[data-cy="vacancy-title"]::text').get()
That should work. It is simple, and easy to debug and adjust if it ever breaks.
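If you want the spider to be a bit more forgiving, one option is to try several selectors in order and take the first one that yields text. The sketch below is only an illustration: first_match and TITLE_SELECTORS are hypothetical names, and the 'h1::text' fallback is an assumption about the page rather than something checked against it. The helper relies only on the response.css(...).get() call already used in the spider above.

```python
from typing import Iterable, Optional


def first_match(response, selectors: Iterable[str]) -> Optional[str]:
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        value = response.css(selector).get()
        # .get() returns None when nothing matches; also skip whitespace-only text.
        if value and value.strip():
            return value.strip()
    return None


# Most specific selector first. The second entry is an assumed fallback,
# not something verified against the real page.
TITLE_SELECTORS = [
    'h1[data-cy="vacancy-title"]::text',
    'h1::text',
]
```

In get_addinfos this would be used as item['Title'] = first_match(response, TITLE_SELECTORS). If it still returns None, none of the listed selectors matched the downloaded HTML, which narrows down where to look.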