Using HTML entities in BeautifulSoup find()
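I'm doing some simple scraping in Python (using BeautifulSoup4), and I'm having trouble retrieving tags whose text contains HTML entities.
Here is a small example (only the real URLs have been removed):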
import requests
from bs4 import BeautifulSoup

start_url = "..."
next_chapter_bad = "Next Chapter ]&gt;"  # entity form, as written in the page source
next_chapter_good = "Next Chapter ]>"    # literal form

# Relevant part of the page source:
"""
<td class="comic_navi_right">
<a href="..." class="navi navi-next-chap" title="Next Chapter ]&gt;">Next Chapter ]&gt;</a>
<a href="..." class="navi comic-nav-next navi-next" title="Next Page &gt;">Next Page &gt;</a>
<a href="..." class="navi navi-last" title="Most Recent Page &gt;&gt;">Most Recent Page &gt;&gt;</a>
</td>
"""

# Excerpt; the return below implies this runs inside a function.
page = requests.get(start_url)
if page.status_code != requests.codes.ok:
    return ''
soup = BeautifulSoup(page.text)
# get the url for the "Next chapter" link
next_link = soup.find('a', href=True, string=next_chapter_bad)
print(next_link)
next_link = soup.find('a', href=True, string=next_chapter_good)
print(next_link)
The output is:
None
<a class="navi navi-next-chap" href="..." title="Next Chapter ]&gt;">Next Chapter ]&gt;</a>
Is there a way to make find() work with HTML entities?
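You have to unescape the search string, because > is written there in its escaped form, &gt;. BeautifulSoup decodes entities while parsing, so the tag's text is "Next Chapter ]>" and only the unescaped string matches it.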
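# Python 3: html.unescape replaces HTMLParser().unescape(), which was removed in Python 3.9.
# On Python 2 the equivalent was: from HTMLParser import HTMLParser; HTMLParser().unescape(...)
from html import unescape
...
soup = BeautifulSoup(page.text, 'html.parser')
# get the url for the "Next chapter" link
# unescape() turns "Next Chapter ]&gt;" into "Next Chapter ]>", which matches the parsed text
next_link = soup.find('a', href=True, string=unescape(next_chapter_bad))
print(next_link)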
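To try this without hitting the real site, here is a minimal self-contained sketch of the same idea. It assumes Python 3 and an inline copy of the navigation markup; the href values (/chapter-2, /page-2) and the helper name find_link_by_text are made up for illustration and are not from the original page or from BeautifulSoup itself.

```python
from html import unescape

from bs4 import BeautifulSoup

# Stand-in for the downloaded page; the href values are hypothetical placeholders.
SAMPLE_HTML = """
<td class="comic_navi_right">
<a href="/chapter-2" class="navi navi-next-chap" title="Next Chapter ]&gt;">Next Chapter ]&gt;</a>
<a href="/page-2" class="navi comic-nav-next navi-next" title="Next Page &gt;">Next Page &gt;</a>
</td>
"""


def find_link_by_text(soup, text):
    """Hypothetical helper: find the first <a href> whose text equals `text`
    once any HTML entities in `text` have been decoded."""
    return soup.find('a', href=True, string=unescape(text))


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The parser has already decoded &gt; to >, so the raw entity form finds nothing.
raw_hit = soup.find('a', href=True, string="Next Chapter ]&gt;")
print(raw_hit)

# After unescaping, the escaped and the literal search text find the same tag.
escaped_hit = find_link_by_text(soup, "Next Chapter ]&gt;")
literal_hit = find_link_by_text(soup, "Next Chapter ]>")
print(escaped_hit['href'])
print(literal_hit['href'])
```

Unescaping inside the helper means callers can pass either form of the text, since unescape() leaves a string without entities unchanged.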