How to extract links using BeautifulSoup
How can I extract the link from the HTML below?
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
Use a list comprehension and CSS selectors to get a list of links. Select all links whose href ends with .pdf:
[a['href'] for a in soup.select('a[href$=".pdf"]')]
Or, more specifically, select the <a> with an href that is the adjacent sibling of an <i> with class fa-file-pdf:
[a['href'] for a in soup.select('li i.fa-file-pdf + a[href]')]
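To see how the two selectors differ, here is a small sketch on made-up markup (the file names and the extra list items are assumptions for illustration, not taken from the page in the question). The suffix selector matches every href that ends in .pdf, while the sibling selector matches any <a> that directly follows the PDF icon, whatever its href ends with.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
sample = '''
<ul>
  <li><i class="fas fa-file-pdf"></i> <a href="/docs/report.pdf">Report</a></li>
  <li><a href="/docs/plain.pdf">PDF link without an icon</a></li>
  <li><i class="fas fa-file-pdf"></i> <a href="/docs/viewer?id=7">Icon, but no .pdf suffix</a></li>
  <li><i class="fas fa-file-word"></i> <a href="/docs/notes.docx">Word file</a></li>
</ul>
'''
demo = BeautifulSoup(sample, 'html.parser')

# Matches on the href value only.
by_suffix = [a['href'] for a in demo.select('a[href$=".pdf"]')]
# Matches on page structure: an <a> right after the PDF icon inside an <li>.
by_icon = [a['href'] for a in demo.select('li i.fa-file-pdf + a[href]')]

print(by_suffix)  # ['/docs/report.pdf', '/docs/plain.pdf']
print(by_icon)    # ['/docs/report.pdf', '/docs/viewer?id=7']
```

Which one fits depends on whether the page marks PDFs more reliably through the URL or through the icon.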
So if the goal is to extract only the first one:
link = [a['href'] for a in soup.select('a[href$=".pdf"]')][0]
or
link = soup.select_one('a[href$=".pdf"]')['href']
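Both one-liners assume that at least one link matches. If nothing matches, the list version raises IndexError, and select_one returns None, so subscripting it raises TypeError. A minimal sketch of a guard, where first_pdf_link is a hypothetical helper name and the markup strings are made up:

```python
from bs4 import BeautifulSoup

def first_pdf_link(markup, selector='a[href$=".pdf"]'):
    """Return the href of the first matching tag, or None if nothing matches."""
    tag = BeautifulSoup(markup, 'html.parser').select_one(selector)
    return tag['href'] if tag is not None else None

found = first_pdf_link('<a href="/files/a.pdf">A</a><a href="/files/b.pdf">B</a>')
missing = first_pdf_link('<a href="/files/a.docx">A</a>')

print(found)    # /files/a.pdf
print(missing)  # None
```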
Example
from bs4 import BeautifulSoup
html = '''
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid GuessedAtParserWarning
urlList = [a['href'] for a in soup.select('a[href$=".pdf"]')]
Output
['https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf',
'https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf',
'https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf',
'https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf']
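The example repeats the same <li> four times, so the same URL appears four times in the output. If duplicates are not wanted, or if a page uses relative hrefs, something like the following sketch may help. The base_url value and the markup are assumptions for illustration; dict.fromkeys drops repeats while keeping first-seen order, and urljoin resolves relative hrefs against the page URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://example.com/reports/'  # assumed URL of the page being parsed

# Hypothetical markup with a repeated link and a relative link.
markup = '''
<li><a href="/files/a.pdf">A</a></li>
<li><a href="/files/a.pdf">A again</a></li>
<li><a href="b.pdf">B</a></li>
'''
page = BeautifulSoup(markup, 'html.parser')

# Resolve every href against the page URL, then drop repeats in order.
absolute = [urljoin(base_url, a['href']) for a in page.select('a[href$=".pdf"]')]
unique_urls = list(dict.fromkeys(absolute))

print(unique_urls)
# ['https://example.com/files/a.pdf', 'https://example.com/reports/b.pdf']
```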