Loop url from dataframe and download pdf files in Python

Based on the code from , I'm able to crawl the url for each transaction and save them into an excel file, which can be downloaded here.

Now I want to go one step further and click into each url link.

For each url, I need to open it and save the file in pdf format.

How can I do that in Python? Any help would be greatly appreciated.
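
For the looping side of the question, here is a minimal sketch, assuming the excel file is named urls.xlsx and the links sit in a column called url (both names are hypothetical), and, for simplicity, that each url points straight at a pdf; the answers below show how to dig the real pdf link out of a notice page when it doesn't:

import pandas as pd
import requests

df = pd.read_excel("urls.xlsx")        # hypothetical file name
for i, url in enumerate(df["url"]):    # hypothetical column name
    r = requests.get(url)
    r.raise_for_status()               # fail loudly on a broken link
    with open(f"file_{i}.pdf", "wb") as f:
        f.write(r.content)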

Reference code:

import shutil
import os
from bs4 import BeautifulSoup
import requests

url = 'xxx'  # listing-page url with a {} placeholder for the page number
os.makedirs('./files', exist_ok=True)  # make sure the output folder exists
for page in range(6):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.select("h3[class='sv-card-title']>a"):
        r = requests.get(link.get("href"), stream=True)
        r.raw.decode_content = True  # decompress gzip/deflate before copying the raw stream
        with open('./files/' + link.text + '.pdf', 'wb') as f:
            shutil.copyfileobj(r.raw, f)

An example of downloading the pdf files from the excel file you uploaded:

from bs4 import BeautifulSoup
import requests

# Let's assume there is only one page. If you need to download many files, save the urls in a list.

url = 'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

link = soup.select_one(".lookmore")
title = soup.select_one(".newsContent").select_one("h1").text

print(title.strip() + '.pdf')
data = requests.get(link.get("href")).content
with open(title.strip().replace(":", "-") + '.pdf', "wb") as f:  # file names can't contain ':', so replace it with '-'
    f.write(data)

And it downloads successfully.
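
The snippet above handles a single notice page; as the comment in it says, downloading many files just means saving the urls in a list and looping with the same extraction. A sketch of that loop (the list reuses the NoticeContent url from above as a placeholder):

from bs4 import BeautifulSoup
import requests

urls = [
    'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085',
    # ... the other NoticeContent urls from the excel file
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    link = soup.select_one(".lookmore")
    title = soup.select_one(".newsContent").select_one("h1").text
    data = requests.get(link.get("href")).content
    with open(title.strip().replace(":", "-") + '.pdf', "wb") as f:
        f.write(data)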

Here is a somewhat different approach. You don't have to open those urls from the excel file at all, because you can construct the .pdf source url yourself.

For example:

import requests

urls = [
    "http://data.eastmoney.com/notices/detail/871792/AN201909041348533085,JWU2JWEwJTk2JWU5JTljJTllJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/872955/AN201912101371726768,JWU0JWI4JWFkJWU5JTgzJWJkJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/832816/AN202008171399155565,JWU3JWI0JWEyJWU1JTg1JThiJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/831971/AN201505220009713696,JWU1JWJjJTgwJWU1JTg1JTgzJWU3JTg5JWE5JWU0JWI4JTlh.html",
]

for url in urls:
    # the notice id is the last path segment, before the comma
    file_id, _ = url.split('/')[-1].split(',')
    # the pdf itself is served from pdf.dfcfw.com under a predictable name
    pdf_file_url = f"http://pdf.dfcfw.com/pdf/H2_{file_id}_1.pdf"
    print(f"Fetching {pdf_file_url}...")
    with open(f"{file_id}.pdf", "wb") as f:
        f.write(requests.get(pdf_file_url).content)
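
One caveat: the H2_{file_id}_1.pdf naming is inferred from these four examples rather than from any documented API, so it is worth checking each response before writing it to disk. Real pdf files start with the magic bytes %PDF, which gives a cheap sanity check; a sketch:

import requests

def save_pdf(pdf_file_url: str, out_path: str) -> bool:
    """Download pdf_file_url to out_path; return False if the response isn't a pdf."""
    r = requests.get(pdf_file_url)
    if r.status_code != 200 or not r.content.startswith(b"%PDF"):
        return False  # bad id, or the url pattern has changed
    with open(out_path, "wb") as f:
        f.write(r.content)
    return True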