Python Scrapy: response body shows nothing but "Redirecting..."

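I am trying to build a scraper for a Google developer console account. When I run the spider, it seems to log in successfully and the log output looks normal. But when I request another page and write response.body to a file, I get the following (response.html):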

<!DOCTYPE html><html><head><title>Redirecting...</title><script type="text/javascript" language="javascript">var url = 'https:\/\/accounts.google.com\/ServiceLogin?service\x3dandroiddeveloper\x26passive\x3d1209600\x26continue\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14813004207305910035%23__HASH__\x26followup\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14813004207305910035'; var fragment = ''; if (self.document.location.hash) {fragment = self.document.location.hash.replace(/^#/,'');}url = url.replace(new RegExp("__HASH__", 'g'), encodeURIComponent(fragment));window.location.assign(url);</script><noscript><meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service&#61;androiddeveloper&amp;passive&#61;1209600&amp;continue&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035&amp;followup&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035'"></meta></noscript></head><body></body></html>

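So as far as I can tell, it is a plain HTML page with an empty body and the title "Redirecting...".

My assumption is that the spider starts crawling before the page has even loaded. I did some research and tried adding meta={'handle_httpstatus_list': [302], 'dont_redirect': True} to the Request, but it makes no difference.

Here is my spider: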

import logging

import scrapy
from scrapy.http import FormRequest, Request


class LoginSpider(scrapy.Spider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/%23&followup=https://play.google.com/apps/publish/#identifier']

    def parse(self, response):
        # Fill in and submit the login form found on the sign-in page.
        return [FormRequest.from_response(
            response,
            formdata={'Email': 'devaccnt@gmail.com', 'Passwd': 'devpwd'},
            callback=self.after_login)]

    def after_login(self, response):
        if "wrong" in response.text:
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        print("Login Successful!!")
        return Request(
            url="https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace",
            meta={'handle_httpstatus_list': [302], 'dont_redirect': True},
            callback=self.parse_tastypage)

    def parse_tastypage(self, response):
        # Dump the raw response body to disk for inspection.
        print("---------------------")
        filename = 'response.html'
        print(filename)
        with open(filename, 'wb') as f:
            f.write(response.body)
        print("---------------------")

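I think what is happening is actually the opposite: Scrapy is not following the redirect. Here is a sample scrapy shell session where you can see that the HTTP response code is 200, not 302: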

$ scrapy shell 'https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace'
2017-02-07 10:30:45 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
(...)
2017-02-07 10:30:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace> (referer: None)
>>> print(response.text)
<!DOCTYPE html><html><head><title>Redirecting...</title><script type="text/javascript" language="javascript">var url = 'https:\/\/accounts.google.com\/ServiceLogin?service\x3dandroiddeveloper\x26passive\x3d1209600\x26continue\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14592564207369815%23__HASH__\x26followup\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14592564207369815'; var fragment = ''; if (self.document.location.hash) {fragment = self.document.location.hash.replace(/^#/,'');}url = url.replace(new RegExp("__HASH__", 'g'), encodeURIComponent(fragment));window.location.assign(url);</script><noscript><meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service&#61;androiddeveloper&amp;passive&#61;1209600&amp;continue&#61;https://play.google.com/apps/publish/?dev_acc%3D14592564207369815&amp;followup&#61;https://play.google.com/apps/publish/?dev_acc%3D14592564207369815'"></meta></noscript></head><body></body></html>

Scrapy does not interpret JavaScript, but it should be able to understand this:

<noscript>
<meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service&#61;androiddeveloper&amp;passive&#61;1209600&amp;continue&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035&amp;followup&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035'">
</meta>
</noscript>

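But it does not.

The part of the framework responsible for this kind of meta-refresh redirect is scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware.

The current implementation only looks for meta-refresh information that is not inside <script> or <noscript> elements (see scrapy.utils.response.get_meta_refresh).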
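To make the effect of the ignored tags concrete, here is a rough standard-library sketch of the idea: drop the ignored elements first, then search what is left for a meta refresh. The regular expressions, the find_meta_refresh name and the example.com URL are illustrative assumptions of mine, not the actual Scrapy or w3lib code, which is more thorough.

```python
import re

# Hypothetical, simplified pattern for <meta http-equiv="refresh" content="N; url=...">.
META_REFRESH_RE = re.compile(
    r"""<meta\s[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*"""
    r"""content\s*=\s*["']\s*(?P<delay>[\d.]+)\s*;\s*url\s*=\s*'?(?P<url>[^'">]+)""",
    re.IGNORECASE | re.DOTALL,
)


def find_meta_refresh(html, ignore_tags=("script", "noscript")):
    """Return (delay, url) for the first meta refresh outside ignore_tags, else (None, None)."""
    for tag in ignore_tags:
        # Remove the whole element, content included, before searching.
        html = re.sub(r"<%s\b.*?</%s>" % (tag, tag), "", html,
                      flags=re.IGNORECASE | re.DOTALL)
    match = META_REFRESH_RE.search(html)
    if match is None:
        return None, None
    return float(match.group("delay")), match.group("url")


# Cut-down version of the page from the question (placeholder URL).
page = (
    "<html><head><title>Redirecting...</title>"
    "<noscript><meta http-equiv=\"refresh\" "
    "content=\"0; url='https://example.com/login'\"></meta></noscript>"
    "</head><body></body></html>"
)

# With the default ignore list the <noscript> content is thrown away first.
print(find_meta_refresh(page))
# Ignoring only <script> keeps the <noscript> content, so the target is found.
print(find_meta_refresh(page, ignore_tags=("script",)))
```

With the default ignore list nothing is found, which matches the 200 response being handed to the spider unchanged.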

You can change this with a custom MetaRefreshMiddleware that also looks for meta-refresh inside <noscript> elements, for example by calling w3lib with only script in the ignore list:

>>> import w3lib.html
>>> w3lib.html.get_meta_refresh(response.text, response.url, response.encoding, ignore_tags=('script',))
(0.0, 'https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815&followup=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815')