Scrapy doesn't crawl
# -*- coding: utf-8 -*-
import scrapy


class ProvasSpider(scrapy.Spider):
    name = 'provas'
    allowed_domains = ['folhadirigida.com.br']
    start_urls = ['https://folhadirigida.com.br/']

    def parse(self, response):  # save the page to a file
        page = response.url.split("/")[-3]
        filename = '%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
When I run this program to crawl this page, I get: .
For example, if I run the same program on this page, I get an exact copy of that page. Why doesn't it work for the first page?
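A quick check of the filename logic shows why the start page produces an unusable file. This is a minimal sketch, assuming the spider is run against the root URL as in `start_urls`:

```python
# Splitting the root URL by "/" shows what each index returns
# (assumption: the spider is crawling 'https://folhadirigida.com.br/').
parts = 'https://folhadirigida.com.br/'.split("/")
print(parts)      # ['https:', '', 'folhadirigida.com.br', '']
print(parts[-3])  # '' -> filename becomes '.html', a hidden, empty-named file
print(parts[-2])  # 'folhadirigida.com.br' -> a usable filename
```

This is why the answer below switches the index from `[-3]` to `[-2]`: for the root URL, `[-3]` picks the empty string between `https:` and the domain.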
The saved page doesn't look right because `<base href="https://folhadirigida.com.br/">` is missing from your HTML: without it, the browser resolves the page's relative links and asset URLs against your local file path, so stylesheets and images fail to load.
import scrapy


class ProvasSpider(scrapy.Spider):
    name = 'provas'
    allowed_domains = ['folhadirigida.com.br']
    start_urls = ['https://folhadirigida.com.br/']

    def parse(self, response):  # save the page to a file
        page = response.url.split("/")[-2]
        filename = '%s.html' % page
        with open(filename, 'wb') as f:
            # Splice a <base> tag into the saved body; the hardcoded
            # offset 42 assumes the opening <head> tag of this particular
            # page ends at byte 42.
            ref_body = response.body[:42] \
                + b'<base href="https://folhadirigida.com.br/">' \
                + response.body[42:]
            f.write(ref_body)
Adding it to your HTML body as in this code will make the saved page render correctly.
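The hardcoded byte offset 42 is brittle: it only works for pages whose `<head>` tag happens to end at that position. A less fragile sketch inserts the tag right after the first literal `<head>` instead (the sample `body` below is a made-up stand-in for `response.body`, and this assumes the page contains a plain `<head>` tag without attributes):

```python
# Insert the <base> tag immediately after the opening <head> tag,
# rather than at a fixed byte offset (sketch; assumes a bare '<head>').
body = b'<html><head><title>t</title></head><body>ok</body></html>'
base_tag = b'<base href="https://folhadirigida.com.br/">'
ref_body = body.replace(b'<head>', b'<head>' + base_tag, 1)
print(ref_body)
```

The `1` argument to `bytes.replace` limits the substitution to the first occurrence, so a page that mentions `<head>` later in its text is not corrupted.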