Python: Avoiding downloading html when taken to "page doesn't exist" error

Question

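I'm teaching myself web scraping and want to download a batch of .pgn files (essentially text files) using requests. The filenames are based on dates, but the dates don't follow a strictly regular pattern. I loop over the possible dates, but when a date doesn't correspond to a file, I still end up with filename.pgn saved as a text file containing the HTML of an error page. I'd like to skip those dates instead.

Here's an example:

If I run: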

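import requests

filename = 'games9jul18.pgn'
url = 'https://www.chesspublishing.com/p/9/jul18/' + filename
# payload holds the authentication data (not shown here)
response = requests.post(url, data=payload)
with open(filename, 'wb') as e:
    e.write(response.content)  # bytes, to match the 'wb' mode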

With the appropriate authentication in payload, this saves the correct file, games9jul18.pgn. But if I run:

filename = 'games9aug18.pgn'
url = 'https://www.chesspublishing.com/p/9/aug18/' + filename
response = requests.post(url, data=payload)
with open(filename, 'wb') as e:
    e.write(response.content)  # bytes, to match the 'wb' mode

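I still get a saved file games9aug18.pgn, but instead of a 'real' pgn file it is a text file containing the HTML of the error page. Navigating to the error page in my browser shows no error code, just a block of text saying that the page you asked for may have been removed, or may never have existed.

Unfortunately, because the date structure is inconsistent, I can't loop only over the filenames that correspond to actual files. How can I add a condition so that no .pgn file is created when I land on the error page?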

Answer 1

You should check the response status. "Page not found" is 404, so you can check for that code, or simply check that the request succeeded, i.e. 200:

response = requests.post(url, data=payload)
if response.status_code == 200:  # requests exposes the code as status_code
    with...
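The check above can be wrapped in a small helper so that the file is only created on success. This is a minimal sketch, not the answerer's code: the name `save_if_found` is hypothetical, and it takes the status code and body as plain arguments so it can be tried without a network connection.

```python
from pathlib import Path


def save_if_found(status_code: int, body: bytes, filename: str) -> bool:
    """Write body to filename only when the request succeeded (HTTP 200).

    Returns True if the file was written, False if it was skipped.
    """
    if status_code != 200:
        return False  # e.g. 404: do not create the file at all
    Path(filename).write_bytes(body)
    return True
```

With requests, this would be called as `save_if_found(response.status_code, response.content, filename)`. Passing `response.content` (bytes) rather than `response.text` keeps the write consistent with binary mode.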

Answer 2

You can check the status code of the response: 200 means the request succeeded, and 404 means the page was not found.

if response.status_code == 200:
    with open(filename, 'wb') as e:
        e.write(response.content)  # bytes, to match the 'wb' mode
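The question notes that the browser shows no error code on the error page, so the site may be serving its "page doesn't exist" text with a 200 status, in which case a status check alone would not catch it. The sketch below combines the status check with a rough test for HTML content and applies it across a list of candidate filenames. Everything here is an assumption for illustration: `looks_like_html`, `download_existing`, and the `fetch` callable are hypothetical names, and the HTML test is only a heuristic.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# Hypothetical fetch signature: filename -> (status_code, content_type, body)
Fetch = Callable[[str], Tuple[int, str, bytes]]


def looks_like_html(body: bytes, content_type: str = "") -> bool:
    """Heuristic: guess whether a response is an HTML page rather than PGN text."""
    if "text/html" in content_type.lower():
        return True
    head = body[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def download_existing(fetch: Fetch, filenames: Iterable[str], out_dir: str) -> List[str]:
    """Save only the files whose response is a 200 and does not look like HTML."""
    saved = []
    for name in filenames:
        status_code, content_type, body = fetch(name)
        if status_code != 200 or looks_like_html(body, content_type):
            continue  # skip dates that have no real file
        (Path(out_dir) / name).write_bytes(body)
        saved.append(name)
    return saved
```

With requests, `fetch` could build the URL for a filename, call `requests.post(url, data=payload)`, and return `(r.status_code, r.headers.get("Content-Type", ""), r.content)`. Keeping the network call behind a callable means the skip logic can be checked with canned responses before running it against the real site.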


Tags: python, loops, http-status-code-404, python-requests