How can I jump to the next page in Scrapy?

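I'm trying to scrape results from here with Scrapy. The problem is that not all of the classes are shown on the page until the 'load more results' tab is clicked.

The problem can be seen here:

My code looks like this: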

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ClassCentralSpider(CrawlSpider):
    name = "class_central"
    allowed_domains = ["www.class-central.com"]
    start_urls = (
        'https://www.class-central.com/courses/recentlyAdded',
    )
    rules = (
        Rule(
            LinkExtractor(
                # allow=("index\d00\.html",),
                restrict_xpaths=('//div[@id="show-more-courses"]',)
            ),
            callback='parse',
            follow=True
        ),
    )

    def parse(self, response):
        x = response.xpath('//span[@class="course-name-text"]/text()').extract()
        item = ClasscentralItem()
        for y in x:
            item['name'] = y
            print(item['name'])

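The second page of the site appears to be generated by an AJAX call. If you look at the Network tab of any browser inspection tool, you'll see something like this: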

In this case, it appears to retrieve a JSON file from https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134

The URL parameter _=1469471093134 doesn't seem to do anything, so you can trim it off, leaving: https://www.class-central.com/maestro/courses/recentlyAdded?page=2
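If you'd rather strip that parameter in code than by hand, something like the following should work. It is only a sketch using the standard library; the helper name `drop_query_param` is made up for illustration and is not part of Scrapy or w3lib.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name):
    """Return `url` with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    # Keep all query pairs except the one we want to drop.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


full_url = (
    "https://www.class-central.com/maestro/courses/recentlyAdded"
    "?page=2&_=1469471093134"
)
print(drop_query_param(full_url, "_"))
```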
The returned JSON contains the HTML code for the next page:

import json
from scrapy import Selector

# so you just need to load it up with
data = json.loads(response.body)
# and convert the HTML under the 'table' key to a scrapy selector
sel = Selector(text=data['table'])
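To get a feel for the shape of that response without running a spider, here is a rough standalone sketch. The sample payload is invented for illustration (only the 'table' key and the course-name-text class come from the above), and it uses the standard library's html.parser in place of a Scrapy Selector so that it runs without Scrapy installed. It also assumes the name spans are not nested.

```python
import json
from html.parser import HTMLParser


class CourseNameParser(HTMLParser):
    """Collects the text inside every <span class="course-name-text">."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False  # True while inside a matching span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "course-name-text" in classes:
            self._in_name = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self.names[-1] += data


def extract_course_names(body):
    """Decode the JSON body and pull course names out of its 'table' HTML."""
    html = json.loads(body)["table"]
    parser = CourseNameParser()
    parser.feed(html)
    parser.close()
    return [name.strip() for name in parser.names]


# Made-up payload that mimics the assumed structure of the AJAX response.
sample_body = json.dumps({
    "table": (
        '<tr><td><span class="course-name-text">Intro to Python</span></td></tr>'
        '<tr><td><span class="course-name-text">Data Structures</span></td></tr>'
    )
})
print(extract_course_names(sample_body))
```

In a real spider the Selector approach is shorter; this version is only meant to show that the JSON wraps ordinary HTML.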

To replicate this in your code, try something like:

import json

from scrapy import Request, Selector
from w3lib.url import add_or_replace_parameter

# AJAX endpoint seen in the Network tab, without the "_" parameter
AJAX_URL = 'https://www.class-central.com/maestro/courses/recentlyAdded'

def parse(self, response):
    # check if response is json, if so convert to selector
    if response.meta.get('is_json', False):
        # convert the json to scrapy.Selector here for parsing
        sel = Selector(text=json.loads(response.body)['table'])
    else:
        sel = Selector(response)
    # parse page here for items
    for name in sel.xpath('//span[@class="course-name-text"]/text()').extract():
        item = ClasscentralItem()
        item['name'] = name
        print(item['name'])
        yield item
    # do next page (assumes the JSON fragment also contains this element)
    next_page_el = sel.xpath("//div[@id='show-more-courses']")
    if next_page_el:  # there is next page
        next_page = response.meta.get('page', 1) + 1
        # make next page url
        url = add_or_replace_parameter(AJAX_URL, 'page', str(next_page))
        yield Request(url, self.parse, meta={'page': next_page, 'is_json': True})
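The page counter in the snippet above travels from request to request through meta. A small standard-library sketch of just that bookkeeping is below; `set_query_param` and `next_page` are hypothetical helper names used for illustration, standing in for w3lib's add_or_replace_parameter and the Request construction.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# AJAX endpoint from the Network tab, without the "_" parameter.
AJAX_URL = "https://www.class-central.com/maestro/courses/recentlyAdded"


def set_query_param(url, name, value):
    """Return `url` with query parameter `name` set to `value` (added or replaced)."""
    parts = urlsplit(url)
    query = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def next_page(meta):
    """Given the current response's meta, return (url, meta) for the next request."""
    # The first (HTML) page carries no 'page' key, so it counts as page 1.
    page = meta.get("page", 1) + 1
    return set_query_param(AJAX_URL, "page", page), {"page": page, "is_json": True}


url, meta = next_page({})    # coming from the initial HTML page
print(url)
url, meta = next_page(meta)  # coming from the first JSON page
print(url)
```

The first call yields the ?page=2 URL and the second ?page=3, which is the sequence the spider should request until no further page is offered.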