How to scrape only a single href from a div class?
I want to extract the contents of the first <a href> from this <div>:
<div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> | <a
class="ref nowrap full" href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a> | <a
class="ref nowrap references" href="/doi/ref/10.1080/03066150.2021.1956473">References</a> | <a
class="ref nowrap nocolwiz" target="_blank" title="Opens new window"
href="/doi/pdf/10.1080/03066150.2021.1956473">PDF (2239 KB)</a> | <a class="ref nowrap epub"
href="/doi/epub/10.1080/03066150.2021.1956473" target="_blank">EPUB</a> | <a
href="/servlet/linkout?type=rightslink&url=startPage%3D1%26pageCount%3D28%26author%3DSaturnino%2BM.%2BBorras%2BJr.%252C%2B%252C%2BIan%2BScoones%252C%2Bet%2Bal%26orderBeanReset%3Dtrue%26imprint%3DRoutledge%26volumeNum%3D49%26issueNum%3D1%26contentID%3D10.1080%252F03066150.2021.1956473%26title%3DClimate%2Bchange%2Band%2Bagrarian%2Bstruggles%253A%2Ban%2Binvitation%2Bto%2Bcontribute%2Bto%2Ba%2BJPS%2BForum%26numPages%3D28%26pa%3D%26oa%3DCC-BY-NC-ND%26issn%3D0306-6150%26publisherName%3Dtandfuk%26publication%3DFJPS%26rpt%3Dn%26endPage%3D28%26publicationDate%3D01%252F02%252F2022"
class="rightslink" target="_blank" title="Opens new window">Permissions</a>\xa0</div>
<a href="/doi/abs/10.1080/03066150.2021.1956473">
I'm using BeautifulSoup, and I'm already scraping some other content from the same page. With the following approach, I get None as the result for abstract:
for article_entry in article_list_items:
    title_article = article_entry.find('span', class_='hlFld-Title').text
    author = article_entry.find('span', class_='articleEntryAuthorsLinks').text
    abstract = article_entry.find('a', class_='tocDeliverFormatsLinks')
    print(author, title_article, abstract)
Saturnino M. Borras Jr., Ian Scoones, Amita Baviskar, Marc Edelman, Nancy Lee Peluso & Wendy Wolford Climate change and agrarian struggles: an invitation to contribute to a JPS Forum None
Is there a way to reach the first href with something like 'a'[:1]?
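Your find() call returns None because tocDeliverFormatsLinks is the class of the <div>, not of any <a> inside it. You can either select the list of links and slice it, or use select_one with a CSS selector to pick a single element, like so: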
html_doc = '''<div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> | <a
class="ref nowrap full" href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a> | <a
class="ref nowrap references" href="/doi/ref/10.1080/03066150.2021.1956473">References</a> | <a
class="ref nowrap nocolwiz" target="_blank" title="Opens new window"
href="/doi/pdf/10.1080/03066150.2021.1956473">PDF (2239 KB)</a> | <a class="ref nowrap epub"
href="/doi/epub/10.1080/03066150.2021.1956473" target="_blank">EPUB</a> | <a
href="/servlet/linkout?type=rightslink&url=startPage%3D1%26pageCount%3D28%26author%3DSaturnino%2BM.%2BBorras%2BJr.%252C%2B%252C%2BIan%2BScoones%252C%2Bet%2Bal%26orderBeanReset%3Dtrue%26imprint%3DRoutledge%26volumeNum%3D49%26issueNum%3D1%26contentID%3D10.1080%252F03066150.2021.1956473%26title%3DClimate%2Bchange%2Band%2Bagrarian%2Bstruggles%253A%2Ban%2Binvitation%2Bto%2Bcontribute%2Bto%2Ba%2BJPS%2BForum%26numPages%3D28%26pa%3D%26oa%3DCC-BY-NC-ND%26issn%3D0306-6150%26publisherName%3Dtandfuk%26publication%3DFJPS%26rpt%3Dn%26endPage%3D28%26publicationDate%3D01%252F02%252F2022"
class="rightslink" target="_blank" title="Opens new window">Permissions</a>\xa0</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
href = soup.select_one('div.tocDeliverFormatsLinks a').get('href')
print(href)
Output:
/doi/abs/10.1080/03066150.2021.1956473
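To plug this back into the loop from the question, something like the sketch below may work. The markup is a trimmed-down, hypothetical stand-in for the real page: the articleEntry wrapper class and the second, link-less entry are assumptions made for illustration, while the other class names come from the question. It also shows the slicing and find() variants, and guards against entries that have no links at all.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup. The 'articleEntry' wrapper class is an
# assumption; the remaining class names are taken from the question.
html_doc = '''
<div class="articleEntry">
  <span class="hlFld-Title">Climate change and agrarian struggles</span>
  <span class="articleEntryAuthorsLinks">Saturnino M. Borras Jr.</span>
  <div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> |
    <a class="ref nowrap full" href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a></div>
</div>
<div class="articleEntry">
  <span class="hlFld-Title">Entry without links</span>
  <span class="articleEntryAuthorsLinks">Unknown</span>
</div>
'''

soup = BeautifulSoup(html_doc, 'html.parser')
article_list_items = soup.select('div.articleEntry')

results = []
for article_entry in article_list_items:
    title_article = article_entry.find('span', class_='hlFld-Title').text
    author = article_entry.find('span', class_='articleEntryAuthorsLinks').text
    # The class sits on the <div>, so match the div and take its first <a>.
    link = article_entry.select_one('div.tocDeliverFormatsLinks a[href]')
    # select_one returns None when nothing matches, so guard before indexing.
    abstract = link['href'] if link else None
    results.append((author, title_article, abstract))
    print(author, title_article, abstract)

# Slicing variant: select() returns a list-like ResultSet, so [:1] keeps
# at most one element and never raises on an empty result.
first_links = soup.select('div.tocDeliverFormatsLinks a')[:1]
first_hrefs = [a.get('href') for a in first_links]

# find() variant: locate the div by its class first, then its first <a>.
div = soup.find('div', class_='tocDeliverFormatsLinks')
first_href = div.find('a').get('href') if div else None
```

The slice is the closest match to the 'a'[:1] idea in the question; select_one is shorter when only one element is ever needed.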