Why can't I log the request that results in a 404 error?

Question

curl -I -w %{http_code}  http://quotes.money.163.com/f10/gszl_600024.html
HTTP/1.1 404 Not Found
Server: nginx

curl -I -w %{http_code}  http://quotes.money.163.com/f10/gszl_600023.html
HTTP/1.1 200 OK
Server: nginx

This shows that http://quotes.money.163.com/f10/gszl_600024.html does not exist and returns HTTP status code 404, while http://quotes.money.163.com/f10/gszl_600023.html exists and returns HTTP status code 200.

I want to write a spider that logs the requests that result in a 404 error.

I added HTTPERROR_ALLOWED_CODES in middlewares.py.

HTTPERROR_ALLOWED_CODES = [404, 403, 406, 408, 500, 503, 504]

I added the log settings in settings.py.

LOG_LEVEL = "CRITICAL"
LOG_FILE = "mylog"
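
A note on placement: Scrapy's built-in HttpErrorMiddleware reads HTTPERROR_ALLOWED_CODES from the project settings, so a module-level variable in middlewares.py is most likely ignored. A minimal sketch of how both groups of settings could sit together, assuming the default project layout with a settings.py module:

```python
# settings.py (assumed default Scrapy project layout)

# Status codes that the built-in HttpErrorMiddleware passes on to spider callbacks
# instead of filtering them out as errors.
HTTPERROR_ALLOWED_CODES = [404, 403, 406, 408, 500, 503, 504]

# Only CRITICAL messages reach the log file. Scrapy's own DEBUG lines about
# followed redirects are hidden at this level.
LOG_LEVEL = "CRITICAL"
LOG_FILE = "mylog"
```

Temporarily lowering LOG_LEVEL to "DEBUG" is one way to see whether a request was redirected before it reached the callback.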

Then I created a spider.

import logging

import scrapy
from info.items import InfoItem


class InfoSpider(scrapy.Spider):
    # Let 404 responses reach the callback instead of being filtered out
    handle_httpstatus_list = [404]
    name = 'info'
    allowed_domains = ['quotes.money.163.com']
    start_urls = [
        r"http://quotes.money.163.com/f10/gszl_600023.html",
        r"http://quotes.money.163.com/f10/gszl_600024.html",
    ]

    def parse(self, response):
        item = InfoItem()  # the class imported above; not used further here
        if response.status == 200:
            logging.critical("url whose status is 200 : " + response.url)
        if response.status == 404:
            logging.critical("url whose status is 404 : " + response.url)

After running the spider, I opened the mylog file.

2019-04-25 08:47:57 [root] CRITICAL: url whose status is 200 : http://quotes.money.163.com/
2019-04-25 08:47:57 [root] CRITICAL: url whose status is 200 : http://quotes.money.163.com/f10/gszl_600023.html

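Why is the status of http://quotes.money.163.com/ 200? When you enter http://quotes.money.163.com/f10/gszl_600024.html in a browser, the server has no content for that URL, so it redirects to http://quotes.money.163.com/ after 5 seconds, and the HTTP code of http://quotes.money.163.com/ is 200. That is why there are two 200 status lines here.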

What confuses me is that the log file mylog contains no line like this:

2019-04-25 08:47:57 [root] CRITICAL: url whose status is 404 : http://quotes.money.163.com/f10/gszl_600024.html

How can I get the 404 branch, logging.critical("url whose status is 404 : " + response.url), to execute in my Scrapy 1.6?

Answer 1

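You are being redirected from the 404 page to the main page, so you can set dont_redirect in the request meta and you will get the response you need. Try this: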

import logging

import scrapy


class InfoSpider(scrapy.Spider):
    # Let 404 responses reach the callback instead of being filtered out
    handle_httpstatus_list = [404]
    name = 'info'
    allowed_domains = ['quotes.money.163.com']
    start_urls = [
        r"http://quotes.money.163.com/f10/gszl_600023.html",
        r"http://quotes.money.163.com/f10/gszl_600024.html"
    ]

    def start_requests(self):
        for url in self.start_urls:
            # dont_redirect keeps Scrapy from following the redirect,
            # so the original 404 response is passed to parse()
            yield scrapy.Request(url, meta={'dont_redirect': True})

    def parse(self, response):
        if response.status == 200:
            logging.critical("url whose status is 200 : " + response.url)
        if response.status == 404:
            logging.critical("url whose status is 404 : " + response.url)

Now I get this in my log:

2019-04-25 08:09:23 [root] CRITICAL: url whose status is 200 : http://quotes.money.163.com/f10/gszl_600023.html
2019-04-25 08:09:23 [root] CRITICAL: url whose status is 404 : http://quotes.money.163.com/f10/gszl_600024.html
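
If you would rather keep following redirects, another option is to log where a response originally came from. Scrapy's redirect middlewares record the URLs a request passed through under the redirect_urls key of the request meta, which is also reachable as response.meta. The sketch below is only an illustration: original_url and describe are hypothetical helper names, and the SimpleNamespace objects are stand-ins for real Scrapy responses so that the code runs without a crawl. Note that this recovers the URL that was first requested, not its original status code.

```python
import logging
from types import SimpleNamespace


def original_url(response):
    """Return the URL that was requested first, before any redirect was followed."""
    # The "redirect_urls" meta key is absent when no redirect took place.
    redirect_urls = response.meta.get("redirect_urls", [])
    return redirect_urls[0] if redirect_urls else response.url


def describe(response):
    """Build a log message that names the original URL when a redirect happened."""
    first = original_url(response)
    message = "url whose status is %d : %s" % (response.status, response.url)
    if first != response.url:
        message += " (redirected from %s)" % first
    return message


# Stand-in objects (hypothetical, not Scrapy classes) that mimic the three
# attributes the helpers read: status, url and meta.
redirected = SimpleNamespace(
    status=200,
    url="http://quotes.money.163.com/",
    meta={"redirect_urls": ["http://quotes.money.163.com/f10/gszl_600024.html"]},
)
direct = SimpleNamespace(
    status=200,
    url="http://quotes.money.163.com/f10/gszl_600023.html",
    meta={},
)

logging.critical(describe(redirected))
logging.critical(describe(direct))
```

In a spider, the same helper could be called from parse as logging.critical(describe(response)), which would mark the home page response as having been reached from the gszl_600024.html URL.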

scrapy

http-status-code-404

python-3.x