Scrapy doesn't crawl
# -*- coding: utf-8 -*-
import scrapy


class ProvasSpider(scrapy.Spider):
    name = 'provas'
    allowed_domains = ['folhadirigida.com.br']
    start_urls = ['https://folhadirigida.com.br/']

    def parse(self, response):  # save the page to a file
        page = response.url.split("/")[-3]
        filename = '%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
When I run this program to crawl this page, I get: .
For example, if I run the same program on this page, I get an exact copy of that page. Why doesn't it work for the first page?
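A quick check of the filename logic shows why the start page produces an unusable file. This is a minimal sketch, assuming the spider is run against the root URL as in `start_urls`:

```python
# Splitting the root URL by "/" shows what each index returns
# (assumption: the spider is crawling 'https://folhadirigida.com.br/').
parts = 'https://folhadirigida.com.br/'.split("/")
print(parts)      # ['https:', '', 'folhadirigida.com.br', '']
print(parts[-3])  # '' -> filename becomes '.html', a hidden, empty-named file
print(parts[-2])  # 'folhadirigida.com.br' -> a usable filename
```

This is why the answer below switches the index from `[-3]` to `[-2]`: for the root URL, `[-3]` picks the empty string between `https:` and the domain.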
The saved page doesn't look right because `<base href="https://folhadirigida.com.br/">` is missing from your HTML: without it, the browser resolves the page's relative links and asset URLs against your local file path, so stylesheets and images fail to load.
import scrapy


class ProvasSpider(scrapy.Spider):
    name = 'provas'
    allowed_domains = ['folhadirigida.com.br']
    start_urls = ['https://folhadirigida.com.br/']

    def parse(self, response):  # save the page to a file
        page = response.url.split("/")[-2]
        filename = '%s.html' % page
        with open(filename, 'wb') as f:
            # Splice a <base> tag into the saved body; the hardcoded
            # offset 42 assumes the opening <head> tag of this particular
            # page ends at byte 42.
            ref_body = response.body[:42] \
                + b'<base href="https://folhadirigida.com.br/">' \
                + response.body[42:]
            f.write(ref_body)
Adding it to your HTML body as in this code will make the saved page render correctly.
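The hardcoded byte offset 42 is brittle: it only works for pages whose `<head>` tag happens to end at that position. A less fragile sketch inserts the tag right after the first literal `<head>` instead (the sample `body` below is a made-up stand-in for `response.body`, and this assumes the page contains a plain `<head>` tag without attributes):

```python
# Insert the <base> tag immediately after the opening <head> tag,
# rather than at a fixed byte offset (sketch; assumes a bare '<head>').
body = b'<html><head><title>t</title></head><body>ok</body></html>'
base_tag = b'<base href="https://folhadirigida.com.br/">'
ref_body = body.replace(b'<head>', b'<head>' + base_tag, 1)
print(ref_body)
```

The `1` argument to `bytes.replace` limits the substitution to the first occurrence, so a page that mentions `<head>` later in its text is not corrupted.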