Web scraping with Beautiful Soup gives empty ResultSet
I am experimenting with Beautiful Soup and trying to extract information from an HTML document that contains segments of the following type:
<div class="entity-body">
<h3 class="entity-name with-profile">
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&trk=manage_invitations_profile"
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&trk=manage_invitations_miniprofile"
class="miniprofile"
aria-label="View profile for Ivan Grigorov">
<span>Ivan Grigorov</span>
</a>
</h3>
<p class="entity-subheader">
Teacher
</p>
</div>
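I used the following commands:
from bs4 import BeautifulSoup as bs
# Raw string, so that "\U" in the Windows path is not read as a Unicode escape.
with open(r"C:\Users\pv\MyFiles\HTML\Invites.html", "r") as Invites:
    soup = bs(Invites, 'lxml')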
soup.title
out: <title>Sent Invites\n| LinkedIn\n</title>
invites = soup.find_all("div", class_ = "entity-body")
type(invites)
out: bs4.element.ResultSet
len(invites)
out: 0
Why does find_all return an empty ResultSet object?
Any suggestions would be appreciated.
import bs4
html = '''<div class="entity-body">
<h3 class="entity-name with-profile">
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&trk=manage_invitations_profile"
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&trk=manage_invitations_miniprofile"
class="miniprofile"
aria-label="View profile for Ivan Grigorov">
<span>Ivan Grigorov</span>
</a>
</h3>
<p class="entity-subheader">
Teacher
</p>
</div>'''
soup = bs4.BeautifulSoup(html, 'lxml')
invites = soup.find_all("div", class_ = "entity-body")
len(invites)
Output:
1
This code works fine.
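Since the snippet parses correctly in isolation, one way to narrow things down is to check whether the saved file contains the markup at all, and whether it sits inside HTML comments, where find_all would not see it as tags. A minimal sketch using only the standard library; `locate_markup` and the sample string are hypothetical and not part of the original post:

```python
import re


def locate_markup(markup, needle='class="entity-body"'):
    """Count how often `needle` occurs, and how many hits sit inside HTML comments."""
    comments = re.findall(r"<!--.*?-->", markup, flags=re.DOTALL)
    return {
        "total": markup.count(needle),
        "inside_comments": sum(c.count(needle) for c in comments),
    }


# Hypothetical sample: one live div and one that is commented out.
sample = (
    '<div class="entity-body">visible</div>\n'
    '<!-- <div class="entity-body">hidden</div> -->'
)
report = locate_markup(sample)
print(report)
```

To try it on the real page, pass the text of the saved file instead of `sample`. If `total` is 0, the saved file does not contain the divs and no parser setting will find them.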
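The problem is that the document is never read: what you pass is just a TextIOWrapper (Python 3) or file (Python 2) object. You have to read the document and pass the markup, essentially a string, to BeautifulSoup.
The correct code is: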
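from bs4 import BeautifulSoup
# Raw string, so that "\U" in the Windows path is not read as a Unicode escape.
with open(r"C:\Users\pv\MyFiles\HTML\Invites.html", "r") as Invites:
    # Read the whole file into a string before handing it to the parser.
    soup = BeautifulSoup(Invites.read(), "html.parser")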
soup.title
invites = soup.find_all("div", class_="entity-body")
len(invites)
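The same approach can be sketched as a small, self-contained script that also pulls the name and subheader out of each block. The helper names `load_soup` and `extract_invites`, the UTF-8 encoding, and the temporary sample file standing in for Invites.html are all assumptions for illustration, not part of the original post:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the saved page.
SAMPLE = """<html><head><title>Sent Invites</title></head><body>
<div class="entity-body">
  <h3 class="entity-name with-profile">
    <a class="miniprofile" href="#"><span>Ivan Grigorov</span></a>
  </h3>
  <p class="entity-subheader">
    Teacher
  </p>
</div>
</body></html>"""


def load_soup(path, parser="html.parser", encoding="utf-8"):
    """Read the whole file into a string, then parse it."""
    markup = Path(path).read_text(encoding=encoding)  # encoding is an assumption
    return BeautifulSoup(markup, parser)


def extract_invites(soup):
    """Return (name, subheader) pairs for every div with class entity-body."""
    rows = []
    for div in soup.find_all("div", class_="entity-body"):
        name = div.find("span")
        role = div.find("p", class_="entity-subheader")
        rows.append((
            name.get_text(strip=True) if name else None,
            role.get_text(strip=True) if role else None,
        ))
    return rows


# Write the sample to a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "Invites.html"
    sample_path.write_text(SAMPLE, encoding="utf-8")
    invites = extract_invites(load_soup(sample_path))

print(invites)
```

Passing a different `parser` value (for example "lxml", if it is installed) to `load_soup` makes it easy to compare how many blocks each parser finds in the same file.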