Scrapy CrawlSpider next page isn't working
I want to scrape all the items from each card. The first rule works fine, but the second rule, the pagination rule, doesn't work.
Here is my code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RealtorListSpider(CrawlSpider):
    name = 'realtor_list'
    allowed_domains = ['www.realtor.com']
    start_urls = ['https://www.realtor.com/realestateagents/New-Orleans_LA/pg-1']

    rules = (
        # Rule 1: follow each agent card to its detail page
        Rule(LinkExtractor(restrict_xpaths='//*[@data-testid="component-agentCard"]'), callback='parse_item', follow=False),
        # Rule 2: pagination -- this rule is the one that doesn't work
        Rule(LinkExtractor(restrict_xpaths='//a[@aria-label="Go to next page"]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'name': response.xpath('(//*[@class="jsx-3130164309 profile-Tiltle-main"]/text())[2]').get()
        }
The problem is the element you select in the link extractor, not the pagination rule as such. The XPath expression for the next-page button doesn't select an element the link extractor can pull a link from (the card selection, on the other hand, is correct). That's why I build the pagination into start_urls instead, and it works fine:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RealtorListSpider(CrawlSpider):
    name = 'realtor_list'
    allowed_domains = ['www.realtor.com']
    # Generate the paginated listing URLs up front instead of
    # relying on a next-page rule
    start_urls = [f'https://www.realtor.com/realestateagents/New-Orleans_LA/pg-{x}' for x in range(1, 6)]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@data-testid="component-agentCard"]'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        yield {
            'name': response.xpath('(//*[@class="jsx-3130164309 profile-Tiltle-main"]/text())[2]').get()
        }
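Since the pagination here is just a numeric suffix on the URL, the start_urls list can be built with a plain comprehension and checked on its own, outside of Scrapy. A minimal sketch (the 1-5 page range mirrors the answer above and is an assumption about how many listing pages exist):

```python
# Sketch: building the paginated start_urls up front.
# The range(1, 6) page span mirrors the answer; the real page
# count would need to be checked on the site itself.
BASE = 'https://www.realtor.com/realestateagents/New-Orleans_LA/pg-'

start_urls = [f'{BASE}{page}' for page in range(1, 6)]

for url in start_urls:
    print(url)
```

This prints the five listing-page URLs, pg-1 through pg-5, so you can verify the generated list before handing it to the spider.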