Issue downloading multiple PDFs
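After running the code below, I can't open the downloaded PDFs. The code runs without errors, but the downloaded PDF files are corrupted.
The error message on my computer is
Unable to open file. it may be damaged or in a format Preview doesn't recognize.
Why are they corrupted, and how can I fix this?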
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"
# If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/ Physics_Solutions'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
The problem is that the file is not closed properly after the open/write.
Just add f.close() at the end of the code.
The issue is that you are requesting the GitHub 'blob' link when you need the 'raw' link. The href you get from the page is:
'/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'
but you want:
'/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'
So adjust the link accordingly. The full code is below:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"
# If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/Physics_Solutions'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # The hrefs on the page point at the HTML 'blob' viewer; switch to the 'raw' file.
    # Only the first '/blob/' path segment is replaced.
    pdf_link = link['href'].replace('/blob/', '/raw/', 1)
    pdf_file = requests.get('https://github.com' + pdf_link)
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(pdf_file.content)
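The link adjustment in the code above can be pulled out into a small helper, which makes it easy to check without touching the network. This is only a sketch: blob_to_raw is a hypothetical name that does not appear in the original script, and it assumes the scraped hrefs are site-relative paths containing a /blob/ segment, as in the example above.

```python
from urllib.parse import urljoin


def blob_to_raw(page_url: str, href: str) -> str:
    """Turn a GitHub 'blob' href into an absolute 'raw' URL (hypothetical helper)."""
    # Replace only the first '/blob/' segment so that a folder or file
    # that happens to be named 'blob' further down the path is left alone.
    raw_href = href.replace("/blob/", "/raw/", 1)
    # urljoin resolves the site-relative href against the page it came from.
    return urljoin(page_url, raw_href)


page = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References"
href = "/sonhuytran/MIT8.01SC.2010F/blob/master/References/A01_YOUN6656_09_ISM_FM.pdf"
print(blob_to_raw(page, href))
# https://github.com/sonhuytran/MIT8.01SC.2010F/raw/master/References/A01_YOUN6656_09_ISM_FM.pdf
```

The page and href values here are shortened for readability; the real ones include the full folder name from the question.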
I had to use soup.select("a[href$=.pdf]") (without the inner quotes) to get it to select the links correctly.
After that, your script works, but what you are downloading is not a PDF, it is an HTML web page! Try visiting one of the URLs: https://github.com/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf
You will see a GitHub web page, not the actual PDF. To get the PDF itself you need the "raw" GitHub URL, which you can see when you hover over the "Download" button: https://github.com/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf
So it looks like you just need to replace blob with raw in the right place to make it work:
href = link['href']
href = href.replace('/blob/', '/raw/')
# .content belongs to the response, not to the URL string returned by urljoin
requests.get(urljoin(url, href)).content
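A quick way to confirm whether a download is a real PDF or an HTML page is to look at its first bytes, since PDF files begin with the marker %PDF-. The following is a minimal sketch, assuming the response body is already in memory as bytes; looks_like_pdf is a hypothetical helper name, not part of requests or the original script.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker (hypothetical helper)."""
    # A PDF file starts with '%PDF-' followed by a version number, e.g. '%PDF-1.4'.
    return data.startswith(b"%PDF-")


html_page = b"<!DOCTYPE html><html><head><title>GitHub</title></head></html>"
pdf_bytes = b"%PDF-1.4\n1 0 obj\n"

print(looks_like_pdf(html_page))  # False
print(looks_like_pdf(pdf_bytes))  # True
```

In the download loop, a check like this could be applied to the response content before writing it to disk, so that an HTML page saved under a .pdf name is caught right away.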