How to crawl Factiva data with Python Scrapy?
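I am working on getting data from Factiva in Python 3.5.2, and I have to use my school login to see the data.
I followed this post to try to create a login spider.
However, I got an error.
Here is my code: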
# Test Login Spider
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
login_url = "https://login.proxy.lib.sfu.ca/login?qurl=https%3a%2f%2fglobal.factiva.com%2fen%2fsess%2flogin.asp%3fXSID%3dS002sbj1svr2sVo5DEs5DEpOTAvNDAoODZyMHn0YqYvMq382rbRQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQQAA"
user_name = b"[my_user_name]"
pswd = b"[my_password]"
response_page = "https://global-factiva-com.proxy.lib.sfu.ca/hp/printsavews.aspx?pp=Save&hc=All"

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest(login_url,
                                   formdata={'user': user_name, 'pass': pswd},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # login failed
        if "authentication failed" in response.body:
            print ("Login failed")
        # login succeeded
        else:
            print ('login succeeded')
            # return Request(url=response_page,
            #                callback=self.parse_responsepage)

    def parse_responsepage(self, response):
        hxs = HtmlXPathSelector(response)
        yum = hxs.select('//span/@enHeadline')

def main():
    test_spider = MySpider(scrapy.Spider)
    test_spider.start_requests()

if __name__ == "__main__":
    main()
To run this code, I used the following command from the top-level directory of the project:
scrapy runspider [my_file_path]/auth_spider.py
Do you know how to handle this error?
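You are using Python 3.x, where "authentication failed" is a str while response.body is of type bytes, so the membership test mixes two incompatible types.
To fix this, either perform the test on str:
if "authentication failed" in response.body_as_unicode():
or on bytes:
if b"authentication failed" in response.body: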
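
The type mismatch can be reproduced without Scrapy at all. Below is a minimal sketch in plain Python 3: the body value is a made-up stand-in for response.body, and the helper names failed_bytes and failed_text are hypothetical, introduced only for illustration. Newer Scrapy releases also expose the decoded body as response.text, which should work for the str variant in the same way.

```python
# Hypothetical stand-in for scrapy's response.body, which is bytes on Python 3.
body = b"<html><body>authentication failed</body></html>"

# Mixing str and bytes in a membership test raises TypeError on Python 3.
error_message = None
try:
    "authentication failed" in body
except TypeError as exc:
    error_message = str(exc)


def failed_bytes(raw_body):
    """Compare bytes with bytes: no decoding needed."""
    return b"authentication failed" in raw_body


def failed_text(raw_body, encoding="utf-8"):
    """Decode first, then compare str with str.

    The utf-8 default is an assumption; a real page may use another encoding.
    """
    return "authentication failed" in raw_body.decode(encoding)


print("TypeError raised:", error_message is not None)
print("bytes check:", failed_bytes(body))
print("str check:", failed_text(body))
```

Either check works as long as both sides of the test have the same type. The bytes variant avoids having to guess the page encoding, while the str variant is more convenient if the text is processed further afterwards.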