Get the proper href link using BeautifulSoup
I'm writing a web scraper and am struggling to get an href link from a web page. The URL is https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp. I'm trying to get the href link below:
<div class="mb-2">
<a href="https://vcnewsdaily.com/Tessera%20Therapeutics/venture-funding.php"> >> Click here for more funding data on Tessera Therapeutics</a>
</div>
Here is my code:
from cgi import print_directory
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup
import re
URL = "https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []
for link in soup.findAll(class_='mb-2'):
    links.append(link.get('href'))
print(links)
When I run the code, it outputs:
[None, None, None, None]
Could someone point me in the right direction?
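The variable link holds the <div class="mb-2"> element, not the <a> tag that carries the href= attribute, so link.get('href') returns None. To select all <a> tags under the class .mb-2 you can use, for example, a CSS selector: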
import requests
from bs4 import BeautifulSoup
URL = "https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []
for link in soup.select(".mb-2 a"):  # <-- select <a> tags here
    links.append(link.get("href"))
print(links)
Prints:
['https://vcnewsdaily.com/Tessera%20Therapeutics/venture-funding.php', 'https://vcnewsdaily.com/marketing.php']
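The difference between the two lookups can be reproduced offline. The following is a minimal sketch, assuming a small inline HTML string (hypothetical, modeled on the snippet in the question) in place of the live page; the names HTML, div_hrefs and a_hrefs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML modeled on the snippet from the question,
# so the example runs without a network request.
HTML = """
<div class="mb-2">
  <a href="https://vcnewsdaily.com/Tessera%20Therapeutics/venture-funding.php">more funding data</a>
</div>
<div class="mb-2">no link in this block</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find_all(class_=...) returns the <div> tags, which carry no href attribute,
# so .get("href") falls back to None for each of them.
div_hrefs = [div.get("href") for div in soup.find_all(class_="mb-2")]

# The descendant selector returns the <a> tags nested inside those divs.
a_hrefs = [a.get("href") for a in soup.select(".mb-2 a")]

print(div_hrefs)
print(a_hrefs)
```

Here div_hrefs has one None per matched div, while a_hrefs contains only the URLs of anchors that actually sit inside a .mb-2 element.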
Your code almost works: use find to search each element for its a tag, rather than calling get on the element itself:
import requests
from bs4 import BeautifulSoup
URL = "https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []
for block in soup.find_all(class_="mb-2"):
    anchor = block.find("a")  # first <a> inside this element, or None
    if anchor is not None:
        links.append(anchor.get("href"))  # the URL string, not the whole tag
print(links)
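One thing to watch with the find approach is that find returns None when an element has no matching child, and an <a> tag may lack an href altogether. A minimal sketch of guarding against both cases, assuming a hypothetical inline HTML string rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering three cases: a normal link, a block with
# no anchor at all, and an anchor that has no href attribute.
HTML = """
<div class="mb-2"><a href="/first.php">first</a></div>
<div class="mb-2">plain text, no anchor</div>
<div class="mb-2"><a name="top">anchor without href</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

links = []
for block in soup.find_all(class_="mb-2"):
    anchor = block.find("a")  # None when the block contains no <a>
    if anchor is not None and anchor.has_attr("href"):
        links.append(anchor["href"])

# Roughly equivalent CSS selector form: only <a> tags that have an href.
same_links = [a["href"] for a in soup.select(".mb-2 a[href]")]

print(links)
print(same_links)
```

The two forms agree here, but they are not identical in general: find picks only the first <a> in each block, whereas the selector returns every matching <a>.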