Why does requests.get(url) produce all <Response [406]> results?
I'm testing the code below, trying to download roughly 120 Excel files linked from a single URL.
import requests
from bs4 import BeautifulSoup
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
resp = requests.get("https://healthcare.ascension.org/price-transparency/price-transparency-files",headers=headers)
soup = BeautifulSoup(resp.text,"html.parser")
for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        print(link['href'])
        url="https://healthcare.ascension.org"+link['href']
        data=requests.get(url)
        print(data)
        output = open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb')
        output.write(data.content)
        output.close()
This line: data=requests.get(url)
always gives me a <Response [406]> result. Apparently, according to HTTP.CAT and Mozilla, HTTP 406 means "Not Acceptable". I'm not sure what the problem is, but I expected to end up with 120 Excel files containing data. Right now I do have 120 Excel files on my laptop, but none of them contain any data.
The website appears to filter on the user-agent. You already set the header in a dictionary, so you just need to pass it to requests when you call get:
requests.get(url, headers=headers)
It seems to check only the user-agent.
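Applied to the loop in the question, the only change needed is to pass the same headers dict to the inner get call. A minimal sketch of the fixed loop body (path and variable names taken from the question):

        data = requests.get(url, headers=headers)  # reuse the browser User-Agent defined above
        print(data)                                # should now print <Response [200]>
        with open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb') as output:
            output.write(data.content)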
The HTTP 406 error occurs because no User-Agent is specified.
Once that is fixed and the HREFs are parsed appropriately (for format and relevance), the OP's code should work. It will, however, be very slow, because the XLSX files being fetched are several MB each.
A multithreaded approach therefore improves matters considerably, for example:
import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor
import os
import re
import sys

HEADERS = {'User-Agent': 'PostmanRuntime/7.29.0'}
TARGET = '<Your download folder>'
HOST = 'https://healthcare.ascension.org'
XLSX = re.compile('.*xlsx$')

def download(url):
    """Stream one XLSX file into TARGET, reporting any error to stderr."""
    try:
        base = os.path.basename(url)
        print(f'Processing {base}')
        (r := requests.get(url, headers=HEADERS, stream=True)).raise_for_status()
        with open(os.path.join(TARGET, base), 'wb') as xl:
            for chunk in r.iter_content(chunk_size=16 * 1024):
                xl.write(chunk)
    except Exception as e:
        print(e, file=sys.stderr)

# Fetch the index page, then download all matching links concurrently
(r := requests.get(f'{HOST}/price-transparency/price-transparency-files', headers=HEADERS)).raise_for_status()
soup = BS(r.text, 'lxml')
with ThreadPoolExecutor() as executor:
    executor.map(download, [HOST + link['href'] for link in soup.find_all('a', href=XLSX)])
print('Done')
Note:
Requires Python 3.8+ (for the walrus operator := used above).
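For older interpreters (Python 3.6/3.7), a minimal equivalent of download() without the walrus operator might look like this, assuming the same HEADERS and TARGET as above:

def download(url):
    try:
        base = os.path.basename(url)
        print(f'Processing {base}')
        r = requests.get(url, headers=HEADERS, stream=True)  # plain assignment instead of :=
        r.raise_for_status()
        with open(os.path.join(TARGET, base), 'wb') as xl:
            for chunk in r.iter_content(chunk_size=16 * 1024):
                xl.write(chunk)
    except Exception as e:
        print(e, file=sys.stderr)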