Download PDFs with incomplete URLs in HTML with Beautiful Soup/Requests
I want to download all 259 PDFs listed on the page https://www.mdpi.com/search?authors=University+of+Alabama%2C+Tuscaloosa, which are linked like this:
<a href="/1424-8220/21/19/6384/pdf" class="UD_Listings_ArticlePDF" onclick="if (!window.__cfRLUnblockHandlers) return false; ga('send', 'pageview', '/1424-8220/21/19/6384/pdf');" title="Article PDF" data-cf-modified-fa685c2bcda960230d46973e-="">
<i class="material-icons">get_app</i>
</a>
The href contains only the part of the URL that follows the domain name, so the full URL is https://mdpi.com/1424-8220/21/19/6384/pdf.
When I run this to download the files:
for link in links:
    if ('/pdf' in link.get('href', [])):
        i += 1
        print("Downloading file: ", i)
        response = requests.get(link.get('href'))
I get this traceback:
requests.exceptions.MissingSchema: Invalid URL '/1424-8220/21/19/6384/pdf': No schema supplied. Perhaps you meant http:///1424-8220/21/19/6384/pdf?
Where should the missing part of the URL, "https://mdpi.com", go?
.get() accepts a string, so an f-string that prepends the domain should work.
for link in links:
    if ('/pdf' in link.get('href', [])):
        i += 1
        print("Downloading file: ", i)
        response = requests.get(f"https://mdpi.com{link.get('href')}")
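As an alternative to hard-coding the domain in an f-string, the relative hrefs can be resolved against the address of the page they came from with urljoin from the standard library. This is only a minimal sketch: BASE_URL, resolve_pdf_links, and the sample hrefs list are names made up for illustration, and the hrefs are assumed to have already been collected with link.get('href'). Because the base here is the search page itself, the result uses the www.mdpi.com host.

```python
from urllib.parse import urljoin

# Assumption: the links were scraped from this search page.
BASE_URL = "https://www.mdpi.com/search?authors=University+of+Alabama%2C+Tuscaloosa"


def resolve_pdf_links(hrefs, base_url=BASE_URL):
    """Return absolute URLs for every href that contains '/pdf'.

    Hypothetical helper: relative hrefs are resolved against base_url,
    absolute hrefs pass through unchanged, and missing hrefs (None) are skipped.
    """
    return [urljoin(base_url, href) for href in hrefs if href and "/pdf" in href]


# Sample values standing in for [link.get('href') for link in links].
hrefs = ["/1424-8220/21/19/6384/pdf", None, "/about", "https://example.org/x/pdf"]
pdf_urls = resolve_pdf_links(hrefs)
print(pdf_urls)
```

Each entry in pdf_urls can then be passed to requests.get() as in the loop above. Unlike string concatenation, urljoin also leaves hrefs that are already absolute untouched.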