Extract Text After <hr> tag in BeautifulSoup
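I have a script that extracts data from a page. I can scrape most of it, but there is some text after the "hr" tag that I'm not sure how to scrape. The HTML looks like this: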
<div itemprop="articleBody" class="article-body">
<p itemprop="immediateRelease" class="immediateRelease">IMMEDIATE RELEASE</p>
<h1 itemprop="headline">HEADLINE</h1>
<div class="hidden-lg meta">
<ul>
<li><time pubdate="" datetime="Jan. 23, 2019">Jan. 23, 2019</time></li>
<li>News Release</li>
<li>Release No: NR-014-19</li>
</ul>
</div>
<hr>
Text Text Text <br>
<br>
Text Text Text <br>
<br>
Text Text Text.<br>
<br>
Text Text Text <a href="mailto: Text Text Text " class="ApplyClass"> Text Text Text </a>.<br>
<p> </p>
<p>E Text Text Text </p>
</div>
How can I extract the text after the hr tag, up to the closing div tag? For the other elements I've used something like this:
for meta in soup.find_all('div', class_='hidden-lg meta'):
    data = meta.text.splitlines()
    d['date'] = data[2]
    d['type'] = data[3]
    d['release'] = data[4]
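As an aside, indexing into splitlines() depends on the exact whitespace in the markup. Below is a minimal sketch of a less position-sensitive variant, assuming the three <li> items always appear in this order. parse_meta is a hypothetical helper name, not part of the original script, and the HTML is a trimmed copy of the metadata block above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the metadata block from the question.
html = """<div class="hidden-lg meta">
<ul>
<li><time pubdate="" datetime="Jan. 23, 2019">Jan. 23, 2019</time></li>
<li>News Release</li>
<li>Release No: NR-014-19</li>
</ul>
</div>"""

soup = BeautifulSoup(html, 'html.parser')


def parse_meta(meta):
    """Hypothetical helper: map the <li> items to named fields by position."""
    items = [li.get_text(strip=True) for li in meta.find_all('li')]
    # Assumption: the items are always date, type, release, in that order.
    return dict(zip(('date', 'type', 'release'), items))


d = parse_meta(soup.find('div', class_='hidden-lg meta'))
print(d)
```

Reading each <li> separately means blank lines or extra indentation in the source no longer shift the indices.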
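This is a bit hacky and feels like a workaround, but it works: starting from the <hr>, you can follow the next_sibling attribute of the bs4 elements and test the type of each node.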
import bs4
html = """<div itemprop="articleBody" class="article-body">
<p itemprop="immediateRelease" class="immediateRelease">IMMEDIATE RELEASE</p>
<h1 itemprop="headline">HEADLINE</h1>
<div class="hidden-lg meta">
<ul>
<li><time pubdate="" datetime="Jan. 23, 2019">Jan. 23, 2019</time></li>
<li>News Release</li>
<li>Release No: NR-014-19</li>
</ul>
</div>
<hr>
Text Text Text <br>
<br>
Text Text Text <br>
<br>
Text Text Text.<br>
<br>
Text Text Text <a href="mailto: Text Text Text " class="ApplyClass"> Text Text Text </a>.<br>
<p> </p>
<p>E Text Text Text </p>
</div>"""
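soup = bs4.BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='article-body')
hr = div.find('hr')

# Walk every node that follows <hr> inside the same parent <div>.
parts = []
for el in hr.next_siblings:
    if isinstance(el, bs4.element.Tag):
        # Tags such as <br>, <a> and <p>: take their inner text.
        parts.append(el.get_text())
    elif isinstance(el, bs4.element.NavigableString):
        # Bare text nodes sitting directly under the <div>.
        parts.append(str(el))
text = ''.join(parts)
print(text)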
Output:
Text Text Text
Text Text Text
Text Text Text.
Text Text Text Text Text Text .
E Text Text Text
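If you want the result as a clean list of lines rather than one string with the original whitespace, the same sibling walk can be wrapped in a small function. This is only a sketch under a few assumptions: text_after_hr is a hypothetical helper name, the HTML is a shortened stand-in for the page in the question (the mailto address is a placeholder), and treating each <br> as a line break is my own choice, not something BeautifulSoup does for you.

```python
import bs4

# Shortened stand-in for the article body in the question.
html = """<div class="article-body">
<h1>HEADLINE</h1>
<hr>
Text A <br>
<br>
Text B <a href="mailto:someone@example.com">Text C</a>.<br>
<p> </p>
<p>Text D</p>
</div>"""


def text_after_hr(container):
    """Hypothetical helper: non-empty text lines following the first <hr>."""
    hr = container.find('hr')
    if hr is None:
        return []
    parts = []
    for node in hr.next_siblings:
        if isinstance(node, bs4.element.Tag):
            # Assumption: a <br> should count as a line break.
            parts.append('\n' if node.name == 'br' else node.get_text())
        elif isinstance(node, bs4.element.NavigableString):
            parts.append(str(node))
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (' '.join(line.split()) for line in ''.join(parts).splitlines())
    return [line for line in lines if line]


soup = bs4.BeautifulSoup(html, 'html.parser')
lines = text_after_hr(soup.find('div', class_='article-body'))
print(lines)
```

Returning an empty list when there is no <hr> keeps the function safe to call on pages that lack the separator.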