How to grab the header and download link at the same time

Question

I am trying to download all PDF files from a web page, and I want to use the text of each h3 tag as the file name. It works now. Thanks, @Gauri Shankar Badola.

import os
import re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://chemlabs.princeton.edu/macmillan/presentations/"

# Create the download folder if it does not exist yet
folder_location = r'D:/Download'
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
# Each presentation block holds both the title (h3) and the PDF link (a)
for block in soup.find_all("div", class_="presentation__content"):
    anchor = block.find("a", class_="presentation__doc-link")
    h3 = block.find("h3", class_="presentation__title")
    if anchor is None or h3 is None:
        continue  # skip blocks that lack a link or a title
    pdf_url = urljoin(url, anchor["href"])
    header_text = h3.text.strip()
    # Drop characters Windows forbids in file names, then replace spaces
    safe_name = re.sub(r'[<>:"/\\|?*]', "", header_text).replace(" ", "_")
    filename = os.path.join(folder_location, safe_name + ".pdf")
    with open(filename, 'wb') as f:
        f.write(requests.get(pdf_url).content)
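
A note on the urljoin call in the code above: it is what lets the script cope with both relative and absolute href values. The sketch below shows how the links would be resolved against the page URL. The href values and the helper name `resolve_links` are made up for illustration and are not taken from the actual page.

```python
from urllib.parse import urljoin

BASE_URL = "http://chemlabs.princeton.edu/macmillan/presentations/"


def resolve_links(base_url, hrefs):
    """Turn every href into an absolute URL, relative to base_url."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical href values covering the three common cases
sample_hrefs = [
    "slides/talk.pdf",               # relative to the page
    "/files/talk.pdf",               # relative to the site root
    "https://example.com/talk.pdf",  # already absolute, left unchanged
]

for resolved in resolve_links(BASE_URL, sample_hrefs):
    print(resolved)
# http://chemlabs.princeton.edu/macmillan/presentations/slides/talk.pdf
# http://chemlabs.princeton.edu/files/talk.pdf
# https://example.com/talk.pdf
```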

Answer 1

Sorry, I misread the question earlier. I am not familiar with BeautifulSoup, though, so here is a different solution.

import os
from simplified_scrapy import SimplifiedDoc, req, utils

url = "http://chemlabs.princeton.edu/macmillan/presentations/"
folder_location = r'D:/download'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

html = req.get(url)
doc = SimplifiedDoc(html)
# Select the anchors whose href contains ".pdf"
links = doc.selects('a').contains('.pdf', attr='href')
for link in links:
    # Use the text of the next h3 after the link as the file name
    h3 = link.getNext('h3')
    filename = os.path.join(folder_location, h3.text)
    print(filename)

Answer 2

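Instead of collecting every anchor element whose href ends in .pdf, get each div that contains both the anchor with the PDF link and the h3 used as the displayed title.

Updated code: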

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://chemlabs.princeton.edu/macmillan/presentations/"

# Create the download folder if it does not exist yet
folder_location = r'D:/download'
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Find all divs with the presentation__content class
for block in soup.find_all("div", class_="presentation__content"):
    anchor_elements = block.find_all("a", class_="presentation__doc-link")
    h3_elements = block.find_all("h3", class_="presentation__title")
    # Only handle blocks that have both a PDF link and a title
    if anchor_elements and h3_elements:
        pdf_url = urljoin(url, anchor_elements[0].attrs['href'])
        header_text = h3_elements[0].text.strip()
        filename = os.path.join(folder_location, header_text)
        print(filename)

Output on Windows:

D:/download\Decarboxylative and Decarbonylative Couplings of (Hetero)Aryl Carboxylic Acids and Derivatives
D:/download\Boron Homologation
D:/download\Metal-Organic Frameworks (MOFs)
D:/download\Bioceramic Materials
D:/download\The Olifactory System
D:/download\PROteolysis Targeting Chimera (PROTAC) Targeted Intracellular Protein Degradation
D:/download\High Energy Materials
D:/download\Bioisosteres of Common Functional Groups
D:/download\Halogen Bonding
D:/download\Nonperfect Synchronization
D:/download\Total Syntheses Enabled by Cross Coupling
D:/download\Carbenes: multiplicity and reactivity
D:/download\Selective C-F bond Functionalization in Multifluoroarenes and Trifluoroarenes and Trifluoromethylarenes
D:/download\Proximity- and Affinity- Based Labeling Methods for Interactome Mapping
D:/download\Chemistry of First-Row Transition Metal Photocatalysts
D:/download\Switchable Catalysis
D:/download\Linear Free Energy Relationships
D:/download\Machine Learning
D:/download\Polyoxometalate Photocatalysis
D:/download\Cobalt in Organic Synthesis
D:/download\Metal Nanoparticles in Catalysis
D:/download\Ultrafast Spectroscopic Methods: Fundamental Principles and Applications in Photocatalysis
D:/download\Quantum Dots: Applications in Electron and Energy Transfer Processes
D:/download\PET Imaging
D:/download\Spin-Orbit Coupling and Inorganic Photocatalysts
D:/download\Recent Advances in Cross-Coupling by Manganese Catalysis
D:/download\Recent Developments in Nucleophilic Fluorination
D:/download\Advances in Cancer Immunotherapy
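
The per-div pairing described in this answer can be tried offline against a small HTML sample. The sketch below deliberately swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages. The class names are the ones used in the code above, while the sample markup, `PresentationParser`, and `extract_pairs` are hypothetical and only meant to illustrate the idea.

```python
from html.parser import HTMLParser


class PresentationParser(HTMLParser):
    """Collect one (title, href) pair per div.presentation__content block."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (title, href) tuples
        self._depth = 0      # div nesting depth inside the current block
        self._href = None    # first PDF link seen in the current block
        self._title = None   # first h3 text seen in the current block
        self._in_h3 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self._depth:
                self._depth += 1  # nested div inside a block
            elif "presentation__content" in classes:
                self._depth = 1   # a new block starts here
                self._href, self._title = None, None
        elif not self._depth:
            return                # ignore everything outside a block
        elif tag == "a" and "presentation__doc-link" in classes:
            if self._href is None:
                self._href = attrs.get("href")
        elif tag == "h3" and "presentation__title" in classes:
            if self._title is None:
                self._in_h3 = True
                self._title = ""

    def handle_data(self, data):
        if self._in_h3:
            self._title += data

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "div" and self._depth:
            self._depth -= 1
            # Keep the block only if it had both a link and a title
            if self._depth == 0 and self._href and self._title:
                self.pairs.append((self._title.strip(), self._href))


def extract_pairs(html):
    """Return the (title, href) pairs found in the given HTML string."""
    parser = PresentationParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Made-up markup that mimics the structure assumed by the answer
SAMPLE_HTML = """
<div class="presentation__content">
  <h3 class="presentation__title"> Boron Homologation </h3>
  <a class="presentation__doc-link" href="/files/boron.pdf">Download</a>
</div>
<div class="presentation__content">
  <h3 class="presentation__title">No link here</h3>
</div>
<div class="presentation__content">
  <div><a class="presentation__doc-link" href="halogen.pdf">PDF</a></div>
  <h3 class="presentation__title">Halogen Bonding</h3>
</div>
"""

for title, href in extract_pairs(SAMPLE_HTML):
    print(title, "->", href)
```

Because the title and the link are read from the same block, a block that lacks one of them is skipped rather than being paired with a neighbour's data.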

PS: When saving the files, replace the spaces with hyphens. Also, the base location should use Windows backslashes.
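
A minimal sketch of the PS, assuming the titles are used directly as file names. Besides replacing spaces with hyphens, it also strips the characters Windows rejects in file names (for example the colon in "Carbenes: multiplicity and reactivity"), which goes beyond what the PS asks for. The names `INVALID_CHARS` and `safe_filename` are hypothetical, and `PureWindowsPath` is used only to show how pathlib normalizes the separator to a backslash.

```python
import re
from pathlib import PureWindowsPath

# Characters that Windows does not allow in file names
INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')


def safe_filename(title, extension=".pdf"):
    """Build a file name from a heading: drop invalid characters, hyphenate spaces."""
    cleaned = INVALID_CHARS.sub("", title).strip()
    cleaned = re.sub(r"\s+", "-", cleaned)
    return cleaned + extension


print(safe_filename("Carbenes: multiplicity and reactivity"))
# Carbenes-multiplicity-and-reactivity.pdf

# pathlib writes Windows paths with backslashes, whatever the input used
target = PureWindowsPath("D:/download") / safe_filename("Boron Homologation")
print(target)
# D:\download\Boron-Homologation.pdf
```

In a script that actually runs on Windows, `pathlib.Path` could be used in place of `PureWindowsPath`, and the resulting path can be passed straight to `open`.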


python

web-crawler