Parsing unstructured HTML with BeautifulSoup

Question

I'm trying to parse this HTML with BeautifulSoup:

<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>

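I only want the data for Tuesday: Tuesday Some info here... But since there is no wrapper div around each day, I'm having a hard time getting just this data. Any suggestions?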

Answer 1

How about this:

from bs4 import BeautifulSoup

html = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>"""
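# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')
# Find <strong>Tuesday</strong>, then take the first text node that follows it.
result = soup.find('strong', string='Tuesday').find_next_sibling(string=True)
# On Python 3 the text node is already a str, so it needs no decode().
print(result)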

Output:

 Some info here...
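
Note that the lookup above fails with an AttributeError when no matching <strong> exists, because find returns None. A minimal sketch of a guarded version follows, assuming Python 3 and a recent bs4 release. The helper name first_text_after is hypothetical, and html.parser is only one possible parser choice.

```python
from bs4 import BeautifulSoup

HTML = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br />
  <strong>Wednesday</strong> Some info here...<br />
</div>"""


def first_text_after(html, day):
    """Return the stripped text right after <strong>day</strong>, or None.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find("strong", string=day)
    if label is None:
        return None  # no <strong> with that exact text
    text = label.find_next_sibling(string=True)
    return text.strip() if text is not None else None


print(first_text_after(HTML, "Tuesday"))  # Some info here...
print(first_text_after(HTML, "Friday"))   # None
```

Stripping the result is a design choice here: the raw text node keeps the leading space that sits between </strong> and the text.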

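Update based on the comments:

Basically, you can keep taking the next text sibling after Tuesday until the next sibling element of that text is another <strong> element or None.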

from bs4 import BeautifulSoup

html = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br /> and then some <br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>"""
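# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')
# Start from the text node that directly follows <strong>Tuesday</strong>.
result = soup.find('strong', string='Tuesday').find_next_sibling(string=True)
next_tag = result.find_next_sibling()
# Print each text node until the tag after it is another <strong>, or no tag is left.
while next_tag is not None and next_tag.name != 'strong':
    print(result)
    result = next_tag.find_next_sibling(string=True)
    if result is None:
        break  # no text left inside the container
    next_tag = result.find_next_sibling()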

Output:

 Some info here...
 and then some
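
The same walk can also be written with the next_siblings generator, collecting the fragments in a list instead of printing them. This is a sketch under the same assumptions (Python 3, recent bs4, html.parser). The helper name text_for_day is hypothetical. Unlike the loop above, it stops at the next <strong> itself and drops whitespace-only fragments.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br /> and then some <br />
  <strong>Wednesday</strong> Some info here...<br />
</div>"""


def text_for_day(html, day):
    """Collect the text fragments between <strong>day</strong> and the next <strong>.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find("strong", string=day)
    if label is None:
        return []
    fragments = []
    for node in label.next_siblings:
        if isinstance(node, Tag):
            if node.name == "strong":
                break  # reached the next day's label
            continue  # skip <br /> and any other tag
        text = node.strip()
        if text:  # ignore whitespace-only text nodes
            fragments.append(text)
    return fragments


print(text_for_day(HTML, "Tuesday"))  # ['Some info here...', 'and then some']
```

Returning a list leaves it to the caller to decide how to join the lines, for example with "\n".join(...).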
