Get the proper href link using BeautifulSoup

I am writing a web scraper and am struggling to get an href link from a webpage. The URL is https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp. I am trying to get the href link below:

<div class="mb-2">
    <a href="https://vcnewsdaily.com/Tessera%20Therapeutics/venture-funding.php"> &gt;&gt; Click here for more funding data on Tessera Therapeutics</a>
</div>

Here is my code:

from cgi import print_directory
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup
import re

URL = "https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []

for link in soup.findAll(class_='mb-2'):
    links.append(link.get('href'))
print(links)

When I run the code, it outputs:

[None, None, None, None]

Can someone point me in the right direction?

The variable link is the <div class="mb-2"> element itself, not an <a> tag, so it has no href= attribute and link.get('href') returns None. To select all <a> tags under elements with class .mb-2 you can use, for example, a CSS selector:

import requests
from bs4 import BeautifulSoup

URL = "https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []

for link in soup.select(".mb-2 a"):  # <-- select <a> tags here
    links.append(link.get("href"))
print(links)

Prints:

['https://vcnewsdaily.com/Tessera%20Therapeutics/venture-funding.php', 'https://vcnewsdaily.com/marketing.php']
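As a variation on the selector approach, BeautifulSoup's find_all also accepts href=True, which keeps only <a> tags that actually carry an href attribute. The sketch below runs on an inline copy of the question's HTML (plus one hypothetical link-less div) rather than fetching the live page, so the result here is an assumption about that sample markup, not the real site:

```python
from bs4 import BeautifulSoup

# Inline sample based on the question's markup; the second div is a
# hypothetical link-less sibling to show that it gets skipped.
html = """
<div class="mb-2">
  <a href="https://vcnewsdaily.com/Tessera%20Therapeutics/venture-funding.php"> &gt;&gt; Click here for more funding data on Tessera Therapeutics</a>
</div>
<div class="mb-2">No link in this div</div>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that have an href attribute at all
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```

This prints only the funding URL; the div without a link contributes nothing, so no None values appear in the list.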

Your code almost works: use find to locate the <a> tag inside each div, then read its href attribute. Note that some .mb-2 divs contain no link (which is why you got four Nones), so guard against find returning None:

import requests
from bs4 import BeautifulSoup

URL = "https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []

for div in soup.find_all(class_='mb-2'):
    a = div.find('a')       # find the <a> tag inside this div
    if a is not None:       # some .mb-2 divs contain no link
        links.append(a.get('href'))
print(links)