Using Scrapy Shell with FormRequest
I am trying to log in to a CMS member site using code from the Scrapy documentation and related posts, but it keeps failing. My error messages:
2017-03-20 18:18:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/robots.txt> (referer: None)
2017-03-20 18:18:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/wp-login.php> (referer: None)
2017-03-20 18:18:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <POST http://members.com/login.php> from <POST http://members.com/login.php?wpe-login=membersipa>
I tried changing the user agent to:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0'
but then I get these errors instead:
2017-03-20 17:47:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/robots.txt> (referer: None)
2017-03-20 17:47:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/wp-login.php> (referer: None)
2017-03-20 17:47:23 [scrapy.core.engine] DEBUG: Crawled (403) <POST http://members.com/wp-login.php?wpe-login=membersipa> (referer: http://members.com/wp-login.php)
2017-03-20 17:47:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://members.com/wp-login.php?wpe-login=membersipa>: HTTP status code is not handled or not allowed
Here is the code that produces the error:
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'freddy'
    start_urls = ['http://members.com/wlogin.php']

    def parse(self, response):
        # Fill in and submit the login form found on the page.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'log': 'name', 'pwd': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        # Check that the login succeeded before going on.
        # response.text is a str; response.body is bytes on Python 3.
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
        else:
            # scrapy.Request, because the bare name Request was never imported.
            return scrapy.Request(url="http://members.com",
                                  callback=self.parse_ipro)

    def parse_ipro(self, response):
        title = response.xpath('/html/body/div[2]/div/div[1]/div/div/div[2]/div/div/main/article/header/h1').extract_first()
        filename = 'members.html'  # assumed name; the original never defined filename
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Ultimately, I want to use scrapy shell to test my selectors. I tried scrapy shell as well, but ran into problems there too:
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'freddy'
    start_urls = ['http://members.com/wlogin.php']

    def parse(self, response):
        # Fill in and submit the login form found on the page.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'log': 'name', 'pwd': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        # Check that the login succeeded before going on.
        # response.text is a str; response.body is bytes on Python 3.
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
and tested this in the shell:
response.xpath('//title/text()').extract_first()
but I received 301 and 302 redirects.
After adding headers to the request:
def parse(self, response):
    return scrapy.FormRequest.from_response(
        response,
        headers={'Content-Type': 'text/html; charset=UTF-8',
                 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'},
        formdata={'log': 'Name', 'pwd': 'Password'},
        callback=self.after_login
    )
the messages changed to:
2017-03-22 03:46:07 [scrapy.core.engine] INFO: Spider opened
2017-03-22 03:46:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-22 03:46:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
2017-03-22 03:46:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/login.php> (referer: None)
2017-03-22 03:46:08 [scrapy.core.scraper] ERROR: Spider error processing <GET http://members.com/login.php> (referer: None)
Traceback (most recent call last):
Thanks for any help.
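You are most likely missing some headers in your FormRequest.
Open the Network tab in your browser's developer tools, find the request you are interested in, and look under the "request headers" section (see the related question "Can scrapy be used to scrape dynamic content from websites that are using AJAX?"). Some headers are not required and some are already set by FormRequest, but others are not, so you need to copy those yourself.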
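One way to see which headers still need copying is to compare what the browser sent with what the spider sends. The following is only a minimal sketch in plain Python: `missing_headers` is a hypothetical helper, and both dictionaries hold illustrative values typed in by hand, not output captured from the site.

```python
def missing_headers(browser_headers, sent_headers):
    """Return the browser headers whose names do not appear in sent_headers.

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    sent_names = {name.lower() for name in sent_headers}
    return {
        name: value
        for name, value in browser_headers.items()
        if name.lower() not in sent_names
    }


# Illustrative values, as if copied from the browser's Network tab.
browser_headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Referer": "http://members.com/wp-login.php",
    "Origin": "http://members.com",
}

# Illustrative values for what the spider's request currently carries.
sent_headers = {
    "content-type": "application/x-www-form-urlencoded",
    "user-agent": "Scrapy",
}

to_copy = missing_headers(browser_headers, sent_headers)
print(sorted(to_copy))
```

The resulting dictionary could then be passed as `headers=` to the request. Whether any particular header actually matters to the server is something only trial and error will show.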
Usually the Content-Type header is the one that needs to be copied:
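from scrapy import FormRequest

# url and form (a dict of form fields) are assumed to be defined elsewhere.
headers = {
    'Content-Type': 'json/...',
}
req = FormRequest(url, formdata=form, headers=headers)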