How do I scrape the name/text of the hyperlinks using Python?

Question

I want to extract the names of the links from this URL, https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html, but I can't work out the next step. Here is my code so far:

import requests as re
from bs4 import BeautifulSoup

URL = "https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html"
page = re.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="post altr")

for result in results:
    print(result)

I still don't know how to proceed. Any help is much appreciated. Thank you.

Answer 1

This code gets the text of every link on the page:

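import requests  # not aliased as "re", which is the name of the stdlib regex module
from bs4 import BeautifulSoup

URL = "https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# Every hyperlink in HTML is an <a> element, so collect them all.
results = soup.find_all('a')

for result in results:
    # .text is the visible link text; strip() trims surrounding whitespace.
    print(result.text.strip())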

Output:

CCDA
port channels
RPVST
Dynamic Trunking Protocol
VTP transparent mode
Layer 3 load balancing
user ports
enable PortFast
the core layer
link redundancy
access layer switches
Gateway Load Balancing Protocol
core switches
distribution switches
redundant paths
campus core
Large Building LANs
LAN Design Types and Models
Shutting Down a BGP Neighbor
Core Layer Functionality - Network Design
Distribution Layer Functionality
Characterizing Types of Traffic Flow for New Network Applications
DHCP Starvation and Spoofing Attacks
How to Start an Ecommerce Business
Reply
About
Contact
Advertise
Privacy Policy
Resources

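This works because hyperlinks in HTML are created with the <a> tag. I believe you want the text that each hyperlink is attached to, but if you want the link targets themselves, do it this way: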

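import requests  # not aliased as "re", which is the name of the stdlib regex module
from bs4 import BeautifulSoup

URL = "https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
for a in soup.find_all('a', href=True):
    print(a['href'])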

Output:

/
/reviews/traffic-xtractor.html



/ccda/
/routing-switching/using-routed-ports-and-portchannels-with-mls.html
/root-bridge/rapid-pervlan-spanning-tree-protocol.html
/network-security-2/dynamic-trunking-protocol-dtp.html
/root-bridge/vtp-modes.html
/root-bridge/configuring-etherchannel-load-balancing.html
/routing-switching-2/switch-security-best-practices-for-unused-and-user-ports.html
/global-configuration/enabling-bpdu-guard.html
/network-design/core-layer-functionality.html
/network-design/designing-link-redundancy.html
/network-design/access-layer-functionality.html
/root-bridge/gateway-load-balancing-protocol.html
/switching/collapsed-core.html
/switching/distribution-layer-switches.html
/switching/backbonefast-redundant-backbone-paths.html
/network-design/campus-core-design-considerations.html
/ccda/largebuilding-lans.html
/ccda/lan-design-types-and-models.html
/cisco-internetworks-2/shutting-down-a-bgp-neighbor.html
/network-design/core-layer-functionality.html
/network-design/distribution-layer-functionality.html
/network-design-2/characterizing-types-of-traffic-flow-for-new-network-applications.html
/snrs-3/dhcp-starvation-and-spoofing-attacks.html
/ecommerce.html
/about/
/contact/
/advertise-with-us/
/privacy-policy/
/resources/

This scrapes only the 'href' of each <a> tag.
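If you want the name and the target together, the two snippets above can be combined. What follows is only a minimal sketch: the helper name extract_links, its container_class parameter, and the SAMPLE_HTML markup are made up for illustration and are not part of the page or of BeautifulSoup. It pairs each link's text with an absolute URL built by urllib.parse.urljoin, skips anchors with no visible text, and can optionally restrict the search to one container, such as the "post" element the question was targeting. It parses an inline sample so it runs without network access; with the real page you would pass page.content instead.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url, container_class=None):
    """Return (text, absolute_url) pairs for the <a href=...> tags in html.

    Hypothetical helper: container_class, if given, limits the search to the
    first element that carries that CSS class.
    """
    soup = BeautifulSoup(html, "html.parser")
    scope = soup.find(class_=container_class) if container_class else soup
    if scope is None:
        return []  # no element with that class on the page
    links = []
    for a in scope.find_all("a", href=True):
        text = a.get_text(strip=True)
        if not text:
            continue  # skip image-only or empty anchors
        # Resolve relative hrefs such as "/ccda/" against the page URL.
        links.append((text, urljoin(base_url, a["href"])))
    return links


# Made-up stand-in markup; with a live page, pass page.content instead.
SAMPLE_HTML = """
<div class="post altr">
  <a href="/ccda/">CCDA</a>
  <a href="/root-bridge/vtp-modes.html"> VTP transparent mode </a>
  <a href="/img.html"><img src="x.png"></a>
</div>
<a href="/about/">About</a>
"""

BASE = "https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html"

for text, url in extract_links(SAMPLE_HTML, BASE, container_class="post"):
    print(text, "->", url)
```

Note that soup.find returns a single element (or None), not a list, so looping over its result directly walks that element's children rather than its links. Calling find_all('a') on the element is what yields the links inside it.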
