Login automation and crawling using Scrapy Python

Question

I have been trying to write a script that retrieves my accepted solutions on SPOJ.

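I am stuck on automating the login. I find Scrapy hard to understand. After going through the documentation and code examples several times I have a vague idea of what happens behind the scenes, and this is where I stand now:

(I have commented the code where needed.)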

import os
import os.path
import scrapy
import urllib.request
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from bs4 import BeautifulSoup

class LoginSpider(scrapy.Spider):
    name = 'spoj'
    start_urls = ['http://www.spoj.com/login']
    outputFile = open('output.txt' , 'w')

    def parse(self, response):
        username = input('Enter username\n')
        password = input('Enter password\n')
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': username, 'password': password},
            callback=self.after_login
        )

    def after_login(self, response):

        # Even if I type in the correct username and password, it always leads to
        # authentication failure and the following if statement evaluates to true.

        if str.encode('Authentication failed!') in response.body:
            self.logger.error("Login failed")
            print ('Incorrect credentials')
            return    

        print('lol')  # of course this isn't printed
        return scrapy.Request(url = "http://www.spoj.com/myaccount/" , callback = self.retrieve_codes ) 

    # needless to say, the following function is never called
    def retrieve_codes(self, response):

        print('Hello testing!') 
        url = 'http://www.spoj.com/files/src/16731976/'
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html , 'html.parser')
        self.outputFile.write(str(soup.prettify()))

In the docs the check is if "authentication failed" in response.body: which I changed to

if str.encode('Authentication failed!') in response.body: for two reasons:

1. I got the error: a bytes-like object is required, not 'str'
2. When wrong credentials are entered, SPOJ shows "Authentication failed!" rather than "authentication failed". We need to be exact here.
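
The error in the first point arises because response.body is bytes in Python 3, so a str cannot be searched for in it directly. A small sketch of the two usual ways around it follows; the body value is a hard-coded stand-in rather than a real Scrapy response, and the helper names are made up for illustration. Scrapy text responses also expose the decoded body as response.text, which corresponds to the second helper.

```python
# Stand-in for response.body (hypothetical content, not SPOJ's real page)
body = b"<html><p>Authentication failed!</p></html>"


def found_in_bytes(raw_body, marker):
    """Search the raw bytes by encoding the marker first."""
    return marker.encode("utf-8") in raw_body


def found_in_text(raw_body, marker):
    """Search the decoded text, similar in spirit to using response.text."""
    return marker in raw_body.decode("utf-8")


# Mixing str and bytes is what triggers the error quoted in the question.
try:
    "Authentication failed!" in body
except TypeError as exc:
    print("TypeError:", exc)

print(found_in_bytes(body, "Authentication failed!"))
print(found_in_text(body, "Authentication failed!"))
# The match is case-sensitive, so the lowercase string from the docs is not found.
print(found_in_bytes(body, "authentication failed"))
```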

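Please tell me what I am doing wrong. I have not found any good resource online that discusses form authentication in detail. After seeing this code from the docs, my initial questions were:

Is this the only way?
Does this approach work for all websites? As I understand it, the complexity of the process varies from site to site.
Where can I find a more descriptive explanation of what is happening behind the scenes?

I also tried robobrowser, without success. I was rather hoping for documentation as good as Beautiful Soup's.

Thanks!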

Answer 1

You are using the wrong formdata field names. You need to adapt the sample code from the Scrapy docs to the specific website. Currently you use username and password as the formdata fields.

If you use your browser's developer tools while logging in, you can see that the fields sent in the POST request are named login_user and password.
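
Besides reading the names off the developer tools, the input names can also be listed from the login page's HTML. The following is only a sketch using the standard library; the sample markup is a made-up stand-in and not SPOJ's actual page, and FormFieldLister and list_form_fields are hypothetical names.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect the name attribute of every <input> found inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            name = dict(attrs).get("name")
            if name:  # unnamed inputs are not submitted, so skip them
                self.field_names.append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def list_form_fields(html):
    """Return the input names in document order."""
    parser = FormFieldLister()
    parser.feed(html)
    return parser.field_names


# Hypothetical markup for illustration only.
SAMPLE_LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="text" name="login_user">
  <input type="password" name="password">
  <input type="submit" value="Sign in">
</form>
"""

print(list_form_fields(SAMPLE_LOGIN_PAGE))
```

The names printed here are the keys that formdata would need to use.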

So this should be an easy fix :-)


python

scrapy

scrapy-spider