Getting text from br tags in BeautifulSoup
I have almost got the hang of BeautifulSoup4 in Python, but I can't seem to extract the data that follows the <br/> tag in my HTML.
Data structure:
<HTML and CSS Stuff here>
<div class="menu">
<span class="author">Bob</span>
<span class="smaller">(06 Jul at 09:21)</span>
<br/>This message is very important to extract along with the matching author and time of submit<br/>
</div>
What I'm looking for:
Author: Bob
Time: (06 Jul at 09:21)
Data: This message is very important to extract along with the matching author and time of submit
The HTML comes in via requests and that part works fine. I just haven't got the soup right.
Current code:
from bs4 import BeautifulSoup
import requests
html_doc = """
<HTML and CSS Stuff here>
<div class="menu">
<span class="author">Bob</span>
<span class="smaller">(06 Jul at 09:21)</span>
<br/>This message is very important to extract along with the matching author and time of submit<br/>
</div>
"""
# In the real script the HTML comes from a requests response `r`:
# html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
x = soup.select('div[class="menu"]')
for i in x:
    s = soup.select('span[class="author"]')
    rr = soup.select('span[class="smaller"]')
    for b in s:
        print(b)
        print(rr)
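A <br/> tag is always an empty tag; there is no text inside it.
What you have is text between two <br/> tags, which can be confusing. You could remove either tag and the markup would still be valid HTML.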
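To see this for yourself, here is a minimal sketch. It assumes a shortened copy of the markup from the question and the built-in 'html.parser' backend; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<div class="menu">
<span class="author">Bob</span>
<span class="smaller">(06 Jul at 09:21)</span>
<br/>This message is very important<br/>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
div = soup.select_one("div.menu")
first_br = div.find("br")

# A <br/> element has no children, so it has no text of its own.
br_text = first_br.get_text()
br_children = list(first_br.children)

# The message is a text node that is a direct child of the <div>,
# sitting between the two <br/> tags. Whitespace-only nodes are skipped.
text_nodes = [
    node.strip()
    for node in div.children
    if isinstance(node, NavigableString) and node.strip()
]

print(repr(br_text), br_children, text_nodes)
```

The text belongs to the div, not to either br, which is why asking the br tag for its text returns an empty string.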
You can use the .next_sibling attribute to get the text that follows a tag:
soup.select('div.menu br')[0].next_sibling
Demo:
>>> from bs4 import BeautifulSoup
>>> html_doc = """
... <HTML and CSS Stuff here>
... <div class="menu">
... <span class="author">Bob</span>
... <span class="smaller">(06 Jul at 09:21)</span>
... <br/>This message is very important to extract along with the matching author and time of submit<br/>
... </div>
... """
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> soup.select('div.menu br')[0].next_sibling
'This message is very important to extract along with the matching author and time of submit'
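One caveat worth keeping in mind: .next_sibling returns whatever node comes next, which may be another tag, a whitespace-only string, or None when the tag is the last child. The helper below is a defensive sketch; text_after is a hypothetical name, not part of BeautifulSoup, and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString


def text_after(tag):
    """Return the stripped text node directly following `tag`, or None.

    Hypothetical helper: returns None if `tag` is None, if the next
    sibling is not a text node, or if the text is only whitespace.
    """
    if tag is None:
        return None
    sibling = tag.next_sibling
    if isinstance(sibling, NavigableString):
        return sibling.strip() or None
    return None


# Made-up sample: the second div has no <br/> at all.
sample = (
    '<div class="menu"><br/> Hello there <br/></div>'
    '<div class="menu"><span>x</span></div>'
)
soup = BeautifulSoup(sample, "html.parser")

results = [text_after(menu.find("br")) for menu in soup.select("div.menu")]
print(results)
```

With this in place, a div that lacks a br tag yields None instead of raising AttributeError.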
Putting it together to extract all of the data:
for menu in soup.select('div.menu'):
    author = menu.find('span', class_='author').get_text()
    time = menu.find('span', class_='smaller').get_text()
    data = menu.find('br').next_sibling
This produces:
>>> for menu in soup.select('div.menu'):
...     author = menu.find('span', class_='author').get_text()
...     time = menu.find('span', class_='smaller').get_text()
...     data = menu.find('br').next_sibling
...     print('Author: {}\nTime: {}\nData: {}'.format(author, time, data))
...
Author: Bob
Time: (06 Jul at 09:21)
Data: This message is very important to extract along with the matching author and time of submit
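If you would rather collect the results than print them, the same loop can be wrapped in a function. This is only a sketch under a few assumptions: extract_messages is a hypothetical name, the second "Alice" entry is invented sample data, and every field falls back to None when its tag is missing.

```python
from bs4 import BeautifulSoup


def extract_messages(html):
    """Return one dict per div.menu with author, time and data keys."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for menu in soup.select("div.menu"):
        author = menu.find("span", class_="author")
        time = menu.find("span", class_="smaller")
        br = menu.find("br")
        # The message is the node right after the first <br/>, if any.
        data = br.next_sibling if br is not None else None
        records.append({
            "author": author.get_text(strip=True) if author is not None else None,
            "time": time.get_text(strip=True) if time is not None else None,
            "data": str(data).strip() if data is not None else None,
        })
    return records


# First entry is from the question; the second is invented for illustration.
sample = """
<div class="menu">
<span class="author">Bob</span>
<span class="smaller">(06 Jul at 09:21)</span>
<br/>This message is very important to extract along with the matching author and time of submit<br/>
</div>
<div class="menu">
<span class="author">Alice</span>
<span class="smaller">(06 Jul at 09:30)</span>
<br/>A second message<br/>
</div>
"""

records = extract_messages(sample)
for record in records:
    print(record)
```

When the page is fetched with requests, the text attribute of the response can be passed in as the html argument.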