Difficulty crawling anchor tags from HTML using BeautifulSoup in Python 3
I am trying to extract the hrefs from an institution's web page.
I need to extract the department codes so that I can crawl the activities further.
I wrote the following code:
import requests
import re
import urllib
from bs4 import BeautifulSoup
codesurl="http://www.iitkgp.ac.in/academics/?page=acadunits"
response = requests.get(codesurl)
# print(response.content)
soup=BeautifulSoup(response.content)
# print(soup.prettify())
p = re.compile('page=acadunits*')
p1 = re.compile('<a href=.*page=acadunits*')
links=soup.find_all("a")
print(links)
for link in links:
    # if p1.match(link):
    print("%s" %(link))
However, I am not getting all of the hrefs, for example:
<a href="?page=acadunits&&dept=ME">Mechanical Engineering</a>
<a href="?page=acadunits&&dept=MD">Medical Science & Technology</a>
<a href="?page=acadunits&&dept=MT">Metallurgical & Materials Engineering</a>
and many more.
Can someone help me with this? This is my first time crawling.
You can also take a look at the website. I need to extract the dept code from the URL:
dept=ME
dept=MT
dept=MD
My web page contains:
<div class="tab_container">
<div id="tab1" class="tab_content" style="display: block;">
<h3></h3>
<!--Content-->
<img src="./Indian Institute of Technology Kharagpur_files/academicunits.jpg">
<br><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=AE">Aerospace Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=AG">Agricultural & Food Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=AR">Architecture & Regional Planning</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=BT">Biotechnology</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=CH">Chemical Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=CM">Chemistry</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=CE">Civil Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=CS">Computer Science & Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=CR">Cryogenic Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=ED">Center for Educational Technology</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=EE">Electrical Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=EC"> Electronics & Electrical Communication Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=GS">G S Sanyal School of Telecommunications</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=MG">Geology & Geophysics</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=HS">Humanities & Social Sciences</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=IM">Industrial & Systems Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=IT">Information Technology</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=MS">Materials Science</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=MM">Mathematics</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=ME">Mechanical Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=MD">Medical Science & Technology</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=MT">Metallurgical & Materials Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=MI">Mining Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=NA">Ocean Engineering & Naval Architecture</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=N2">Oceans, Rivers, Atmosphere and Land Sciences</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=MP">Physics</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=PK">P K Sinha Centre for Bio Energy</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=RJ">Rajendra Mishra School of Engineering Entrepreneurship</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=RG">Rajiv Gandhi School of Intellectual Property Law</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=ID">Ranbir and Chitra Gupta School of Infrastructure Design and Management</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=RE">Reliability Engineering Centre</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=RT">Rubber Technology Centre</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=RD">Rural Development Centre</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=BS">School of Bioscience</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=ES">School of Energy Science & Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=EF">School of Environmental Science and Technology</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=NT">School of Nano-Science and Technology</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=WM">School of Water Resources</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=SM">Vinod Gupta School of Management</a><br>
<br><br>
<!--Content-->
</div>
But when I do this:
codesurl="http://www.iitkgp.ac.in/academics/?page=acadunits"
response = requests.get(codesurl)
soup=BeautifulSoup(response.text)
soup does not show these hrefs.
Can someone suggest how to extract these href tags?
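The best way is to use parse_qs from the urllib.parse module, applied to the query string of each href:
from urllib.parse import parse_qs, urlparse
for link in links:
    # Parse only the query string; links without a dept parameter are skipped
    qs = parse_qs(urlparse(link.get('href', '')).query)
    if 'dept' in qs:
        print('dept', qs['dept'][0])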
Or use rpartition:
for link in links:
    # Prints the text after the last "&&", e.g. "dept=ME"; assumes every link has an href
    print(link.get('href').rpartition('&&')[-1])
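The parse_qs approach can be tried without touching the network. The following is a minimal sketch, not the answerer's exact code: dept_code is a hypothetical helper name, and the sample hrefs are copied from the question, plus one without a dept parameter to show the fallback.

```python
from urllib.parse import parse_qs, urlparse


def dept_code(href):
    """Return the value of the dept query parameter, or None if absent."""
    # urlparse isolates the query string, so both absolute and relative
    # hrefs are handled the same way.
    query = urlparse(href).query
    # parse_qs ignores the empty pair produced by the doubled "&&".
    values = parse_qs(query).get("dept")
    return values[0] if values else None


# Hrefs in the two forms shown in the question (absolute and relative),
# plus the listing URL itself, which carries no dept parameter.
hrefs = [
    "http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=N2",
    "?page=acadunits&&dept=MT",
    "http://www.iitkgp.ac.in/academics/?page=acadunits",
]
codes = [dept_code(h) for h in hrefs]
print(codes)
```

Returning None rather than raising KeyError keeps the loop usable on pages that mix department links with navigation links.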
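First, the department links are loaded dynamically through a GET request to a separate URL (the one used in the code below), which is why they are missing from the page you requested.
Then, the idea is to find all links whose href attribute value matches a specific pattern, and use that same pattern to extract the department code. Working code: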
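import re
import requests
from bs4 import BeautifulSoup
# The department list is served from this URL, not from ?page=acadunits
codesurl = "http://www.iitkgp.ac.in/academics/academic.php"
response = requests.get(codesurl)
soup = BeautifulSoup(response.content, "lxml")
# Codes may contain digits (e.g. dept=N2), so match letters and digits
pattern = re.compile(r"dept=([A-Z0-9]+)")
# Keep only the anchors whose href matches the pattern
links = soup.find_all("a", href=pattern)
for link in links:
    print(pattern.search(link["href"]).group(1))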
Prints:
AE
AG
AR
...
NT
WM
SM
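To experiment with the pattern-matching idea offline, here is a rough sketch that swaps BeautifulSoup for the standard library's html.parser, so it runs without extra packages or network access. DeptLinkParser and DEPT_PATTERN are hypothetical names, the first two anchors are copied from the page excerpt in the question, and the "#top" link is made up for illustration. The character class accepts digits as well as letters so that a code such as N2 is captured whole.

```python
import re
from html.parser import HTMLParser

# Letters and digits, because the page includes codes such as "N2".
DEPT_PATTERN = re.compile(r"dept=([A-Z0-9]+)")


class DeptLinkParser(HTMLParser):
    """Collect department codes from the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.codes = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href") or ""
        match = DEPT_PATTERN.search(href)
        if match:
            self.codes.append(match.group(1))


# Two anchors from the question's excerpt and one made-up link
# without a dept parameter, which should be ignored.
sample_html = """
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=AE">Aerospace Engineering</a><br>
<a href="http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=N2">Oceans, Rivers, Atmosphere and Land Sciences</a><br>
<a href="#top">Back to top</a>
"""

parser = DeptLinkParser()
parser.feed(sample_html)
print(parser.codes)
```

The same regular expression can be passed to soup.find_all("a", href=...) when BeautifulSoup is available; the parser class here only stands in for that filtering step.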