Finding end tag content in HTML with BeautifulSoup

Question

I'm using BeautifulSoup with Python 3.4 on a Windows 7 machine. I'm trying to parse the following:

<bound method Tag.find of <div class="accordion">
<p> <span style="color:039; font-size:14px; font-weight:bold">Acetohydroxamic Acid (Lithostat) Tablets</span><br/><br/>



  <strong>Status: Currently in Shortage </strong><br/><br/>



         » <strong>Date first posted</strong>: 

        07/15/2014<br/>



 » <strong>Therapeutic Categories</strong>: Renal<br/>
</p><p style="padding:10px;">
</p>
<h3>

    Mission Pharmacal  (<em>Reverified  01/21/2015</em>)

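I'm trying to get "07/15/2014" out of the line after "Date first posted". I also need to get "Renal" out. I can find all the strong tags with .findAll("strong"), but I can't figure out how to get the content after /strong>: and before the next <br/>.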

Answer 1

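Why not use the regular expression (?<=/strong>:)([^<]+)? The ?<= in the first group makes it a positive lookbehind, which means "look for this string but don't capture it." The second group means "match one or more characters other than <." Finally, strip removes any extra whitespace around the captured text.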

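import re
import requests

# url is the address of the page being scraped (not given in the question)
s = requests.get(url).text
# Capture whatever sits between "</strong>:" and the next "<", then trim it
matches = [m.strip() for m in re.findall(r'(?<=/strong>:)([^<]+)', s)]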
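
To try the pattern without fetching a page, here is a minimal sketch that runs it against a fragment of the HTML from the question. The variable names html and pattern are only illustrative.

```python
import re

# Fragment copied from the question, so no network access is needed.
html = """<p> <span style="color:039; font-size:14px; font-weight:bold">Acetohydroxamic Acid (Lithostat) Tablets</span><br/><br/>
  <strong>Status: Currently in Shortage </strong><br/><br/>
         » <strong>Date first posted</strong>: 

        07/15/2014<br/>
 » <strong>Therapeutic Categories</strong>: Renal<br/>
</p>"""

# Lookbehind for "/strong>:" and capture everything up to the next "<".
pattern = re.compile(r'(?<=/strong>:)([^<]+)')
matches = [m.strip() for m in pattern.findall(html)]
print(matches)  # ['07/15/2014', 'Renal']
```

The "Status" entry is skipped because its colon sits inside the strong tag rather than right after the closing tag. The pattern is tied to this exact markup: a space between the closing tag and the colon, for instance, would stop it from matching.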

Answer 2

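You need .next_sibling to get the node that follows each strong tag, isinstance(el, bs4.Tag) to skip siblings that are Tags (so only text nodes are kept), and finally re.sub to strip the whitespace and the ":".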

In [38]: import re

In [39]: import bs4

In [40]: from bs4 import BeautifulSoup

In [41]: soup = BeautifulSoup("""<bound method Tag.find of <div class="accordion">
   ....: <p> <span style="color:039; font-size:14px; font-weight:bold">Acetohydroxamic Acid (Lithostat) Tablets</span><br/><br/>
   ....: 
   ....: 
   ....: 
   ....:   <strong>Status: Currently in Shortage </strong><br/><br/>
   ....: 
   ....: 
   ....: 
   ....:         » <strong>Date first posted</strong>: 
   ....: 
   ....:                07/15/2014<br/>
   ....: 
   ....:     
   ....: 
   ....:  » <strong>Therapeutic Categories</strong>: Renal<br/>
   ....: </p><p style="padding:10px;">
   ....: </p>
   ....: <h3>
   ....: 
   ....:        Mission Pharmacal  (<em>Reverified  01/21/2015</em>)""")

In [42]: for strong_tag in soup.find_all('strong'):
   ....:     if not isinstance(strong_tag.next_sibling, bs4.Tag):
   ....:         print(re.sub(r'[:\s]+', '', strong_tag.next_sibling))
   ....:         
07/15/2014
Renal
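
The same idea can be sketched as a plain script that pairs each label with its value. The function name labelled_values and the trimmed HTML are made up for illustration, and the parser is set to "html.parser" explicitly, which the session above does not do.

```python
import bs4
from bs4 import BeautifulSoup

# Trimmed version of the HTML from the question.
HTML = """<div class="accordion">
<p> <span style="font-weight:bold">Acetohydroxamic Acid (Lithostat) Tablets</span><br/><br/>
  <strong>Status: Currently in Shortage </strong><br/><br/>
  » <strong>Date first posted</strong>:

        07/15/2014<br/>
  » <strong>Therapeutic Categories</strong>: Renal<br/>
</p>
</div>"""


def labelled_values(html):
    """Map each <strong> label to the text node that directly follows it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for strong in soup.find_all("strong"):
        sibling = strong.next_sibling
        # Keep only text nodes; a Tag sibling such as <br/> carries no value.
        if isinstance(sibling, bs4.NavigableString):
            value = sibling.strip().lstrip(":").strip()
            if value:
                result[strong.get_text(strip=True)] = value
    return result


values = labelled_values(HTML)
print(values)
```

Unlike re.sub(r'[:\s]+', '', ...), this only trims the leading colon and the surrounding whitespace, so a value containing spaces would keep them.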

Edit

Is there a way to get that date without using a loop?

Yes, you can pass the text argument to find.

re.sub(r'[:\s]+', '', soup.find('strong', text=re.compile('Date')).next_sibling)
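
A slightly more defensive sketch of the same single lookup follows. It assumes Beautiful Soup 4.4.0 or later, where the text argument is also available under the name string, and it checks that the label was found before touching its sibling. The names label, sibling and date are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = "<p>» <strong>Date first posted</strong>: \n\n   07/15/2014<br/></p>"
soup = BeautifulSoup(html, "html.parser")

# Find the <strong> tag whose text matches "Date"; None if there is no such tag.
label = soup.find("strong", string=re.compile("Date"))

date = None
if label is not None:
    sibling = label.next_sibling
    # NavigableString subclasses str, so this skips Tag siblings and None.
    if isinstance(sibling, str):
        date = re.sub(r"[:\s]+", "", sibling)

print(date)
```

If the label is missing or is followed by a tag instead of text, date stays None rather than raising an exception.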


python

beautifulsoup

python-3.x