Beautiful Soup returns a None object for non-dynamic content

Question

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'cache-control': 'private, max-age=0, no-cache',
}
html = requests.get('https://www.goodreads.com/shelf/show/1', headers=headers).text
soup = BeautifulSoup(html, "html.parser")
# find() returns None when the page has no "leftContainer" element,
# so the chained find_all() call then raises AttributeError.
titles = soup.find(class_="leftContainer").find_all(class_="bookTitle")
print(titles)

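The content is not dynamic, so no JavaScript rendering or anything similar is needed.
So why is a None object sometimes returned?
A while loop that keeps retrying works around the problem, but that is not a real solution.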

Answer 1

The Error

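After reproducing the problem, I concluded that the server returns a different response after roughly 100-200 requests. The difference between the standard, parseable response and the one with no data to scrape is that the latter is an occasional error page served with status code 504. That code essentially means the server timed out or hit an error, so it returns a default HTML page that contains no books. find() then returns None for the missing container, and the code fails.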

A snippet of the response from a failing test case:

<h1>
    Goodreads request took too long.
</h1>
<p>
    The latest request to the Goodreads servers took too long to respond. We have been notified of the issue and are looking into it.
</p>
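One way to tell such an error page apart from a normal shelf page is to check the status code and the result of find() before chaining any further calls. The following is only a minimal sketch: the helper name parse_titles is hypothetical and not part of the original code, and it assumes the caller passes in the status code and body of a response it has already fetched.

```python
from bs4 import BeautifulSoup


def parse_titles(status_code, html):
    """Return the bookTitle tags, or None if this looks like an error page.

    parse_titles is a hypothetical helper introduced for illustration.
    """
    # Any non-200 status (such as 504) means the body is an error page.
    if status_code != 200:
        return None
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_="leftContainer")
    # find() returns None when the element is missing, so check before chaining.
    if container is None:
        return None
    return container.find_all(class_="bookTitle")
```

With this check in place, the timeout page shown above yields None instead of an AttributeError, so the caller can decide whether to retry.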

The Solution

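This may be incidental, or it may be a limit imposed to keep crawlers from using too many system resources, but more likely it is just an ordinary, unavoidable server-side error. Simply wrap the request in a try/except to catch an error response, then retry the request!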
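The try/except retry described here might look roughly like the sketch below. It is not the answerer's exact code: the function name fetch_titles and the max_attempts and delay parameters (with their default values) are assumptions chosen for illustration, and the cap on attempts is added so that a persistent failure does not loop forever.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_titles(url, headers=None, max_attempts=5, delay=2.0):
    """Fetch url and return its bookTitle tags, retrying on error responses.

    fetch_titles, max_attempts and delay are hypothetical names and defaults.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            # Raises requests.HTTPError for 4xx/5xx codes such as 504.
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            # Raises AttributeError if find() returned None (no leftContainer).
            return soup.find(class_="leftContainer").find_all(class_="bookTitle")
        except (requests.RequestException, AttributeError) as exc:
            print(f"Attempt {attempt} failed: {exc!r}")
            if attempt < max_attempts:
                time.sleep(delay)  # pause before retrying
    raise RuntimeError(f"No usable response from {url} after {max_attempts} attempts")
```

Calling fetch_titles('https://www.goodreads.com/shelf/show/1', headers=headers) would then either return the list of title tags or raise RuntimeError once the attempts are used up. Pausing between attempts also keeps the retries from adding load to a server that is already timing out.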

python

beautifulsoup

python-requests