Scrapy cannot follow the full link

Question

scrapy shell "https://www.winemag.com/wine-ratings/2/"
response

However, I get:

2019-02-19 14:16:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-02-19 14:16:35 [scrapy.core.engine] INFO: Spider opened
2019-02-19 14:16:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.winemag.com/robots.txt> (referer: None)
2019-02-19 14:16:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.winemag.com/wine-ratings> from <GET https://www.winemag.com/wine-ratings/2/>
2019-02-19 14:16:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.winemag.com/wine-ratings> from <GET http://www.winemag.com/wine-ratings>
2019-02-19 14:16:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.winemag.com/wine-ratings/> from <GET https://www.winemag.com/wine-ratings>
2019-02-19 14:16:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.winemag.com/wine-ratings/> (referer: None)

<200 https://www.winemag.com/wine-ratings/>

I don't understand why it does not fetch the full link (page 2). Could someone give me a suggestion?

Answer 1

It seems that winemag redirects scrapers back to its main wine-ratings page:

⇾ curl -I 'https://www.winemag.com/wine-ratings/2/'
HTTP/2 301
[...]
location: http://www.winemag.com/wine-ratings
[...]

So this looks like expected behavior for Scrapy: it simply follows the redirect that the site you are requesting sends back.
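The same check can be reproduced from Python without curl. The following is only a sketch using the standard library, not Scrapy itself; the names `NoRedirect`, `make_head_request`, and `first_hop` are made up for illustration. It sends a HEAD request, refuses to follow redirects, and returns the status code and `Location` header of the first response, optionally with an explicit User-Agent so the two cases can be compared.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow redirects (hypothetical helper)."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response
        # instead of silently requesting the new URL.
        return None


def make_head_request(url, user_agent=None):
    """Build a HEAD request, optionally with an explicit User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers, method="HEAD")


def first_hop(url, user_agent=None):
    """Return (status, Location header) of the first response only."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(make_head_request(url, user_agent), timeout=10) as resp:
            return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        # With NoRedirect installed, 3xx responses end up here.
        return err.code, err.headers.get("Location")
```

Calling `first_hop("https://www.winemag.com/wine-ratings/2/")` with and without a browser-like `user_agent` value should show whether the site answers differently depending on the client; what it actually returns depends on the site's current behavior.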

Answer 2

I found the answer. I had to specify USER_AGENT in the settings file.
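For reference, a minimal sketch of what that setting might look like in the project's `settings.py`. The answer does not say which value was used, so the browser string below is purely an example and should be treated as an assumption.

```python
# settings.py of the Scrapy project (fragment)

# Scrapy sends this string as the User-Agent header on its requests.
# The value is only an example browser string, not the one used in the
# original answer; substitute whatever identifies your client.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0"
```

Since `scrapy shell` was run by hand here, the setting can presumably also be overridden for a single session with the `-s` option, for example `scrapy shell -s USER_AGENT="Mozilla/5.0 ..." "https://www.winemag.com/wine-ratings/2/"`.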

python

scrapy

web-scraping

scrapy-shell