Why does requests.get(url) produce all <Response [406]> results?
I'm testing the code below, trying to download roughly 120 Excel files linked from a single URL.
import requests
from bs4 import BeautifulSoup
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
resp = requests.get("https://healthcare.ascension.org/price-transparency/price-transparency-files",headers=headers)
soup = BeautifulSoup(resp.text,"html.parser")
for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        print(link['href'])
        url="https://healthcare.ascension.org"+link['href']
        data=requests.get(url)
        print(data)
        output = open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb')
        output.write(data.content)
        output.close()
This line: data=requests.get(url)
always gives me a <Response [406]> result. Apparently, according to HTTP.CAT and Mozilla, HTTP 406 means "Not Acceptable". I'm not sure what the problem is, but I expected to end up with 120 Excel files containing data. Right now I do have 120 Excel files on my laptop, but none of them contain any data.
The website appears to filter on the user-agent. You already set the header in a dictionary, so you just need to pass it to requests when you call get:
requests.get(url, headers=headers)
It seems to check only the user-agent.
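Applied to the loop in the question, the only change needed is to pass the same headers dict to the inner get call. A minimal sketch of the fixed loop body (path and variable names taken from the question):

        data = requests.get(url, headers=headers)  # reuse the browser User-Agent defined above
        print(data)                                # should now print <Response [200]>
        with open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb') as output:
            output.write(data.content)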
The HTTP 406 error occurs because no User-Agent is specified.
Once that is fixed and the HREFs are parsed appropriately (for format and relevance), the OP's code should work. It will, however, be very slow, because the XLSX files being fetched are several MB each.
A multithreaded approach therefore improves matters considerably, for example:
import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor
import os
import re
import sys

HEADERS = {'User-Agent': 'PostmanRuntime/7.29.0'}
TARGET = '<Your download folder>'
HOST = 'https://healthcare.ascension.org'
XLSX = re.compile('.*xlsx$')

def download(url):
    """Stream one XLSX file into TARGET, reporting any error to stderr."""
    try:
        base = os.path.basename(url)
        print(f'Processing {base}')
        (r := requests.get(url, headers=HEADERS, stream=True)).raise_for_status()
        with open(os.path.join(TARGET, base), 'wb') as xl:
            for chunk in r.iter_content(chunk_size=16 * 1024):
                xl.write(chunk)
    except Exception as e:
        print(e, file=sys.stderr)

# Fetch the index page, then download all matching links concurrently
(r := requests.get(f'{HOST}/price-transparency/price-transparency-files', headers=HEADERS)).raise_for_status()
soup = BS(r.text, 'lxml')
with ThreadPoolExecutor() as executor:
    executor.map(download, [HOST + link['href'] for link in soup.find_all('a', href=XLSX)])
print('Done')
Note:
Requires Python 3.8+ (for the walrus operator := used above).
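For older interpreters (Python 3.6/3.7), a minimal equivalent of download() without the walrus operator might look like this, assuming the same HEADERS and TARGET as above:

def download(url):
    try:
        base = os.path.basename(url)
        print(f'Processing {base}')
        r = requests.get(url, headers=HEADERS, stream=True)  # plain assignment instead of :=
        r.raise_for_status()
        with open(os.path.join(TARGET, base), 'wb') as xl:
            for chunk in r.iter_content(chunk_size=16 * 1024):
                xl.write(chunk)
    except Exception as e:
        print(e, file=sys.stderr)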