Python: download/scrape SSRN papers from a list of URLs
I have a set of links that are identical except for the final id. All I want to do is loop through each link and download the paper as a PDF via the "Download This Paper" button. Ideally the filename would be the paper's title, but if that's not possible I can rename them later; getting them all downloaded matters more. I have 200 links, but I'll give 5 here as examples.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3860262
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2521007
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3146924
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2488552
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3330134
Is what I want to do feasible? I'm somewhat familiar with looping over URLs to scrape tables, but I've never tried anything involving a download button. I have no sample code because I don't know where to start, but I'm picturing something like:
for url in urls:
(go to each link)
(download as pdf via the "download this paper" button)
(save file as title of paper)
Try:
import requests
from bs4 import BeautifulSoup

urls = [
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3860262",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2521007",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3146924",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2488552",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3330134",
]

# Browser-like User-Agent so the requests aren't rejected as a bot.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}

for url in urls:
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    # The "Download This Paper" button is an <a> tag with a
    # data-abstract-id attribute; its relative href points at the PDF.
    pdf_url = (
        "https://papers.ssrn.com/sol3/"
        + soup.select_one("a[data-abstract-id]")["href"]
    )
    # Name each file after the abstract id at the end of the page URL.
    filename = url.split("=")[-1] + ".pdf"
    print(f"Downloading {pdf_url} as {filename}")
    with open(filename, "wb") as f_out:
        # The Referer header is set so the delivery endpoint serves the PDF.
        f_out.write(
            requests.get(pdf_url, headers={**headers, "Referer": url}).content
        )
Prints:
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID3860262_code1719241.pdf?abstractid=3860262&mirid=1 as 3860262.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2521007_code576529.pdf?abstractid=2521007&mirid=1 as 2521007.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID4066577_code104690.pdf?abstractid=3146924&mirid=1 as 3146924.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2505208_code16198.pdf?abstractid=2488552&mirid=1 as 2488552.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID3506882_code16198.pdf?abstractid=3330134&mirid=1 as 3330134.pdf
And saves the PDFs as:
andrej@PC:~$ ls -alF *pdf
-rw-r--r-- 1 root root 993466 máj 24 01:10 2488552.pdf
-rw-r--r-- 1 root root 3583616 máj 24 01:10 2521007.pdf
-rw-r--r-- 1 root root 1938284 máj 24 01:10 3146924.pdf
-rw-r--r-- 1 root root 685777 máj 24 01:10 3330134.pdf
-rw-r--r-- 1 root root 939157 máj 24 01:10 3860262.pdf
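If you do want the paper title as the filename instead of the abstract id, one option is to pull it out of the page HTML you already fetched and sanitize it. This is a sketch that assumes the abstract page exposes a `citation_title` meta tag (common on scholarly pages, but not verified here for every SSRN page); it falls back to the id when the tag is missing:

```python
import re

def title_filename(html: str, fallback_id: str) -> str:
    """Derive a PDF filename from the paper title in the page HTML,
    falling back to the abstract id if no title tag is found."""
    m = re.search(r'<meta[^>]*name="citation_title"[^>]*content="([^"]*)"', html)
    if not m or not m.group(1).strip():
        return fallback_id + ".pdf"
    # Drop characters that are illegal or awkward in filenames.
    safe = re.sub(r'[\\/:*?"<>|]', "", m.group(1)).strip()
    return safe + ".pdf"

# Example: a page with the tag, and a page without it.
print(title_filename('<meta name="citation_title" content="My Paper">', "3860262"))
print(title_filename("<html></html>", "3860262"))
```

In the loop above you would call it as `filename = title_filename(response.text, url.split("=")[-1])`. With 200 links it's also worth adding a short `time.sleep` between requests so you aren't throttled.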