How to select the elements under a specific element in a table using Python's BeautifulSoup
I want to select the elements that follow <i>Member</i> in a table.
HTML:
<table class="table profile-table">
<td>Teams</td>
<td>
<i>Leader</i>:
<a href="/shdb-team/20-739/" class="chip team">SHDB Team</a><a href="/the-spider-society/20-490/" class="chip team">The Spider Society</a><a href="/new-warriors/20-79/" class="chip team">New Warriors</a><a href="/the-six/20-474/" class="chip team">The Six</a>
<i>Member</i>:
<a href="/the-mighty-avengers/20-384/" class="chip team">The Mighty Avengers</a><a href="/new-avengers/20-101/" class="chip team">New Avengers</a><a href="/shield/20-467/" class="chip team">S.H.I.E.L.D.</a><a href="/avengers-resistance/20-154/" class="chip team">Avengers Resistance</a><a href="/marvel-knights/20-377/" class="chip team">Marvel Knights</a><a href="/avengers/20-4/" class="chip team">Avengers</a><a href="/secret-defenders/20-96/" class="chip team">Secret Defenders</a><a href="/daily-bugle/20-216/" class="chip team">Daily Bugle</a><a href="/defenders/20-9/" class="chip team">Defenders</a>
<i>Formerly</i>:
<a href="/future-foundation/20-290/" class="chip team">Future Foundation</a><a href="/heroes-for-hire/20-5/" class="chip team">Heroes For Hire</a><a href="/fantastic-four/20-1/" class="chip team">Fantastic Four</a> </td>
How can I select only the text of the Member section, for example?
I tried:
li = bs.find('i', text = "Member")
children = li.findNextSiblings()
for child in children:
    member.append(child.text)
print(member)
But it outputs all of the results:
SHDB Team
The Spider Society
New Warriors
The Six
Member
The Mighty Avengers
New Avengers
S.H.I.E.L.D.
Avengers Resistance
Marvel Knights
Avengers
Secret Defenders
Daily Bugle
Defenders
Formerly
Future Foundation
Heroes For Hire
Fantastic Four
I only want to select the Member section.
This code lets me select everything after Member and before Formerly, but it is an inefficient solution:
teams[teams.index("Member")+1:teams.index("Formerly")]
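All of the i tags are siblings of each other inside the same td tag, and they are distinguished only by their text, so this is simple: you can use a CSS selector or slicing to select the Member section.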
from bs4 import BeautifulSoup
html = """
<table class="table profile-table">
<td>
Teams
</td>
<td>
<i>
Leader
</i>
:
<a class="chip team" href="/shdb-team/20-739/">
SHDB Team
</a>
<a class="chip team" href="/the-spider-society/20-490/">
The Spider Society
</a>
<a class="chip team" href="/new-warriors/20-79/">
New Warriors
</a>
<a class="chip team" href="/the-six/20-474/">
The Six
</a>
<i>
Member
</i>
:
<a class="chip team" href="/the-mighty-avengers/20-384/">
The Mighty Avengers
</a>
<a class="chip team" href="/new-avengers/20-101/">
New Avengers
</a>
<a class="chip team" href="/shield/20-467/">
S.H.I.E.L.D.
</a>
<a class="chip team" href="/avengers-resistance/20-154/">
Avengers Resistance
</a>
<a class="chip team" href="/marvel-knights/20-377/">
Marvel Knights
</a>
<a class="chip team" href="/avengers/20-4/">
Avengers
</a>
<a class="chip team" href="/secret-defenders/20-96/">
Secret Defenders
</a>
<a class="chip team" href="/daily-bugle/20-216/">
Daily Bugle
</a>
<a class="chip team" href="/defenders/20-9/">
Defenders
</a>
<i>
Formerly
</i>
:
<a class="chip team" href="/future-foundation/20-290/">
Future Foundation
</a>
<a class="chip team" href="/heroes-for-hire/20-5/">
Heroes For Hire
</a>
<a class="chip team" href="/fantastic-four/20-1/">
Fantastic Four
</a>
</td>
</table>
"""
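soup = BeautifulSoup(html, "html.parser")
# The second <i> inside the td is the "Member" label; walk the nodes after it.
for i in soup.select_one('.table.profile-table > td > i:nth-of-type(2)').next_siblings:
    # The next <i> ("Formerly") marks the end of the Member section.
    if i.name == 'i':
        break
    if i.name == 'a':
        print(i.get_text(strip=True))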
Output:
The Mighty Avengers
New Avengers
S.H.I.E.L.D.
Avengers Resistance
Marvel Knights
Avengers
Secret Defenders
Daily Bugle
Defenders
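The slicing idea mentioned above can also be sketched without relying on the position of the label. This is only a sketch under assumptions: it uses a shortened copy of the HTML, looks the label up by its text with find(), and uses itertools.takewhile to cut the sibling list at the next i tag. The variable names start, following and members are illustrative.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (assumption: same structure).
html = """
<table class="table profile-table">
<td>Teams</td>
<td>
<i>Leader</i>:
<a href="/shdb-team/20-739/" class="chip team">SHDB Team</a><a href="/the-six/20-474/" class="chip team">The Six</a>
<i>Member</i>:
<a href="/avengers/20-4/" class="chip team">Avengers</a><a href="/defenders/20-9/" class="chip team">Defenders</a>
<i>Formerly</i>:
<a href="/fantastic-four/20-1/" class="chip team">Fantastic Four</a> </td>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the label by its text instead of its position.
start = soup.find("i", string="Member")

# Only <a> and <i> siblings after the label, in document order.
following = start.find_next_siblings(["a", "i"])

# Keep the leading run of <a> tags; the first <i> ends the section.
members = [t.get_text(strip=True) for t in takewhile(lambda t: t.name == "a", following)]
print(members)
```

Looking the label up by text keeps the code working if the order of the sections changes, which i:nth-of-type(2) would not.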
You can iterate over the next_siblings of the element, print each sibling whose tag name is a, and break out of the loop as soon as a sibling's tag name is i:
for tag in soup.select_one('i:-soup-contains("Member")').next_siblings:
    if tag.name == 'i':
        break
    if tag.name == 'a':
        print(tag.text)
Example
html = '''
<table class="table profile-table">
<td>Teams</td>
<td>
<i>Leader</i>:
<a href="/shdb-team/20-739/" class="chip team">SHDB Team</a><a href="/the-spider-society/20-490/" class="chip team">The Spider Society</a><a href="/new-warriors/20-79/" class="chip team">New Warriors</a><a href="/the-six/20-474/" class="chip team">The Six</a>
<i>Member</i>:
<a href="/the-mighty-avengers/20-384/" class="chip team">The Mighty Avengers</a><a href="/new-avengers/20-101/" class="chip team">New Avengers</a><a href="/shield/20-467/" class="chip team">S.H.I.E.L.D.</a><a href="/avengers-resistance/20-154/" class="chip team">Avengers Resistance</a><a href="/marvel-knights/20-377/" class="chip team">Marvel Knights</a><a href="/avengers/20-4/" class="chip team">Avengers</a><a href="/secret-defenders/20-96/" class="chip team">Secret Defenders</a><a href="/daily-bugle/20-216/" class="chip team">Daily Bugle</a><a href="/defenders/20-9/" class="chip team">Defenders</a>
<i>Formerly</i>:
<a href="/future-foundation/20-290/" class="chip team">Future Foundation</a><a href="/heroes-for-hire/20-5/" class="chip team">Heroes For Hire</a><a href="/fantastic-four/20-1/" class="chip team">Fantastic Four</a> </td>
'''
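from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select_one('i:-soup-contains("Member")').next_siblings:
    # The next <i> label ends the Member section.
    if tag.name == 'i':
        break
    if tag.name == 'a':
        print(tag.text)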
Output
The Mighty Avengers
New Avengers
S.H.I.E.L.D.
Avengers Resistance
Marvel Knights
Avengers
Secret Defenders
Daily Bugle
Defenders
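If more than one section is needed, the same sibling walk can be done once over the whole td. The following is a rough sketch, not part of the answers above: group_by_label is a hypothetical helper name, the HTML is a shortened copy of the question's markup, and it assumes every a tag belongs to the most recent i label before it.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (assumption: same structure).
html = """
<table class="table profile-table">
<td>Teams</td>
<td>
<i>Leader</i>:
<a href="/shdb-team/20-739/" class="chip team">SHDB Team</a><a href="/the-six/20-474/" class="chip team">The Six</a>
<i>Member</i>:
<a href="/avengers/20-4/" class="chip team">Avengers</a><a href="/defenders/20-9/" class="chip team">Defenders</a>
<i>Formerly</i>:
<a href="/fantastic-four/20-1/" class="chip team">Fantastic Four</a> </td>
</table>
"""


def group_by_label(td):
    """Map each <i> label in td to the texts of the <a> tags that follow it."""
    groups = {}
    current = None
    for node in td.children:
        # Text nodes have no tag name, so they are skipped by both branches.
        name = getattr(node, "name", None)
        if name == "i":
            current = node.get_text(strip=True)
            groups[current] = []
        elif name == "a" and current is not None:
            groups[current].append(node.get_text(strip=True))
    return groups


soup = BeautifulSoup(html, "html.parser")
# The td that holds the labels is the parent of any of the <i> tags.
td = soup.find("i", string="Member").parent
groups = group_by_label(td)
print(groups["Member"])
```

This walks the td a single time, so selecting Leader or Formerly afterwards is just another dictionary lookup.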