Parsing unstructured HTML with BeautifulSoup
I'm trying to parse this HTML with BeautifulSoup:
<div class="container">
<strong>Monday</strong> Some info here...<br /> and then some <br />
<strong>Tuesday</strong> Some info here...<br />
<strong>Wednesday</strong> Some info here...<br />
...
</div>
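I only want the data for Tuesday: <strong>Tuesday</strong> Some info here...<br />
But since there is no wrapper div around each day, I'm having trouble getting just this data. Any suggestions?
How about this: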
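from bs4 import BeautifulSoup

html = """<div class="container">
<strong>Monday</strong> Some info here...<br /> and then some <br />
<strong>Tuesday</strong> Some info here...<br />
<strong>Wednesday</strong> Some info here...<br />
...
</div>"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')

# Find <strong>Tuesday</strong>, then take the first text node that follows it.
result = soup.find('strong', string='Tuesday').find_next_sibling(string=True)

# The text node is already a str in Python 3; strip the surrounding whitespace.
print(result.strip())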
Output:
Some info here...
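One thing the snippet above does not cover is a label that is missing from the page: `find` then returns `None`, and the chained call raises an `AttributeError`. A minimal guarded sketch follows; the function name `first_text_after` and the constant `SAMPLE_HTML` are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """<div class="container">
<strong>Monday</strong> Some info here...<br /> and then some <br />
<strong>Tuesday</strong> Some info here...<br />
<strong>Wednesday</strong> Some info here...<br />
...
</div>"""


def first_text_after(html, label):
    """Return the first text node after <strong>label</strong>, or None.

    Hypothetical helper: returns None when the label is not found or when
    no text node follows it, instead of raising AttributeError.
    """
    soup = BeautifulSoup(html, "html.parser")
    strong = soup.find("strong", string=label)
    if strong is None:
        return None
    text = strong.find_next_sibling(string=True)
    return text.strip() if text is not None else None


print(first_text_after(SAMPLE_HTML, "Tuesday"))
print(first_text_after(SAMPLE_HTML, "Friday"))
```

The first call prints the Tuesday text and the second prints `None`, so the caller can decide how to handle a day that is not on the page.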
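Update based on comments:
Basically, you can keep taking the next sibling text node after <strong>Tuesday</strong> until the element that follows that text is another <strong> element or None.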
from bs4 import BeautifulSoup

html = """<div class="container">
<strong>Monday</strong> Some info here...<br /> and then some <br />
<strong>Tuesday</strong> Some info here...<br /> and then some <br />
<strong>Wednesday</strong> Some info here...<br />
...
</div>"""

soup = BeautifulSoup(html, 'html.parser')

# Start at the first text node after <strong>Tuesday</strong>.
result = soup.find('strong', string='Tuesday').find_next_sibling(string=True)
next_sibling = result.find_next_sibling()

# Print each text node until the tag after it is another <strong> or is missing.
while next_sibling and next_sibling.name != 'strong':
    print(result.strip())
    result = next_sibling.find_next_sibling(string=True)
    if result is None:
        # No more text after the last <br />, so stop instead of failing.
        break
    next_sibling = result.find_next_sibling()
Output:
Some info here...
and then some
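If you would rather collect the pieces than print them, the same idea can be written as a loop over `.next_siblings` that stops at the next <strong>. This is only a sketch under the assumption that every day starts with a <strong> label; `texts_for_label` and `SAMPLE_HTML` are illustrative names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

SAMPLE_HTML = """<div class="container">
<strong>Monday</strong> Some info here...<br /> and then some <br />
<strong>Tuesday</strong> Some info here...<br /> and then some <br />
<strong>Wednesday</strong> Some info here...<br />
...
</div>"""


def texts_for_label(html, label):
    """Collect the non-empty text nodes between <strong>label</strong>
    and the next <strong> sibling (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    strong = soup.find("strong", string=label)
    if strong is None:
        return []

    texts = []
    for node in strong.next_siblings:
        # The next <strong> marks the start of the following day.
        if isinstance(node, Tag) and node.name == "strong":
            break
        # Keep text nodes, skipping whitespace-only ones between tags.
        if isinstance(node, NavigableString) and node.strip():
            texts.append(node.strip())
    return texts


print(texts_for_label(SAMPLE_HTML, "Tuesday"))
```

Unlike the while loop, this version also keeps trailing text that has no tag after it, so for the last day it would include the literal "..." placeholder from the sample.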