Why does requests.get(url) produce only <Response [406]> results?

I am testing this code, which tries to download about 120 Excel files from one URL:
import requests
from bs4 import BeautifulSoup
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
resp = requests.get("https://healthcare.ascension.org/price-transparency/price-transparency-files",headers=headers)
soup = BeautifulSoup(resp.text,"html.parser")

for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        print(link['href'])
        url="https://healthcare.ascension.org"+link['href']
        data=requests.get(url)
        print(data)
        output = open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb')
        output.write(data.content)
        output.close()

This line: data=requests.get(url) always gives me a Response [406] result. Apparently, according to HTTP.CAT and Mozilla, HTTP 406 means "Not Acceptable". I'm not sure what the problem is here, but I expect to download 120 Excel files containing data. Right now I do have 120 Excel files on my laptop, but none of them contain any data.

This site appears to filter on the User-Agent. You have already set the header in a dictionary, so you just need to pass it to requests when calling get:

requests.get(url, headers=headers)
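
Applied to the loop in the question, that looks something like the sketch below (it reuses the headers dict and file path from the question; raise_for_status() is added so a rejected request fails loudly instead of silently writing an empty file):

for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        url = "https://healthcare.ascension.org" + link['href']
        data = requests.get(url, headers=headers)  # pass the same headers here too
        data.raise_for_status()                    # surface the 406 instead of writing empty files
        name = url.split("/")[-1].split(".")[0]
        with open(f'C:/Users/ryans/Downloads/{name}.xls', 'wb') as output:
            output.write(data.content)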

It seems that only the User-Agent is checked.

The HTTP 406 error appears when no browser-like User-Agent is specified (requests sends python-requests/x.y.z by default, which this server rejects).
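
A quick way to verify this, assuming the server's filtering behaviour is unchanged:

import requests

URL = 'https://healthcare.ascension.org/price-transparency/price-transparency-files'
print(requests.get(URL).status_code)  # 406: the default python-requests User-Agent is rejected
print(requests.get(URL, headers={'User-Agent': 'Mozilla/5.0'}).status_code)  # 200 with a browser-like value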

Once that is fixed and the HREFs are parsed appropriately (for format and relevance; see the sketch below), the OP's code should work. However, it will be very slow, since the XLSX files being fetched are several MB each.
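
One way to tighten the href handling (a sketch; soup is the BeautifulSoup object from the question, and urljoin covers relative as well as root-relative links):

from urllib.parse import urljoin

HOST = 'https://healthcare.ascension.org'
for link in soup.find_all('a', href=True):
    href = link['href']
    if href.lower().endswith(('.xls', '.xlsx')):  # match the extension exactly, not just the substring 'xls'
        url = urljoin(HOST, href)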

A multithreaded approach therefore improves things considerably, as follows:

import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor
import os
import re
import sys

HEADERS = {'User-Agent': 'PostmanRuntime/7.29.0'}
TARGET = '<Your download folder>'
HOST = 'https://healthcare.ascension.org'
XLSX = re.compile('.*xlsx$')  # matches hrefs ending in "xlsx"

def download(url):
    try:
        base = os.path.basename(url)
        print(f'Processing {base}')
        # Stream the response so each large file is written in chunks
        # rather than held fully in memory
        (r := requests.get(url, headers=HEADERS, stream=True)).raise_for_status()
        with open(os.path.join(TARGET, base), 'wb') as xl:
            for chunk in r.iter_content(chunk_size=16*1024):
                xl.write(chunk)
    except Exception as e:
        print(e, file=sys.stderr)


# Fetch the index page, then download every matching link concurrently
(r := requests.get(f'{HOST}/price-transparency/price-transparency-files', headers=HEADERS)).raise_for_status()
soup = BS(r.text, 'lxml')
with ThreadPoolExecutor() as executor:
    executor.map(download, [HOST + link['href'] for link in soup.find_all('a', href=XLSX)])
print('Done')
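
Note that stream=True together with iter_content() writes each multi-MB file to disk in 16 KB chunks rather than holding it in memory, and because download() catches its own exceptions, one failed file does not stop the remaining downloads.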

Note:

Requires Python 3.8+ (the code uses the walrus operator :=)