How to grab item outside of tag using python+beautifulsoup

Question

Using python+beautifulsoup, suppose I have a <class 'bs4.element.Tag'> object, a:

<div class="class1"><em>text1</em> text2</div>

I can use the following command to extract "text1 text2" and put it into b:

b = a.text

I can use the following command to extract text1 and put it into c:

c = a.findAll("em")[0].text

But how can I extract only text2?

Answer 1

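I edited your HTML snippet slightly so that there is more than one word inside and outside the <em> tag. Calling getText() to extract all the text from your <div> container then gives the following output: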

'text1 foo bar text2 foobar baz'

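As you can see, this is just a string with the <em> tags removed. As far as I understand, you want to remove the content of the <em> tag from the content of the <div> container.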

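My solution is not very elegant, but it works: use .replace() to replace the contents of the <em> tag with an empty string ''. Since this can leave leading or trailing spaces, you can call .lstrip() to get rid of the leading ones (use .strip() if trailing whitespace matters too):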

#!/usr/bin/env python3
# coding: utf-8

from bs4 import BeautifulSoup

html = '<div class="class1"><em>text1 foo bar</em> text2 foobar baz</div>'
soup = BeautifulSoup(html, 'html.parser')

result = soup.getText().replace(soup.em.getText(), '').lstrip()

print(result)

Output of the print statement:

'text2 foobar baz'
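
Note that .replace() removes every occurrence of the <em> text from the string, so it can misfire if the same words also appear outside the tag. A minimal sketch of an alternative, assuming you only want the text nodes that are direct children of the <div>: iterate over .children and keep the NavigableString items. The helper name own_text is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Join the text nodes that are direct children of `tag`.

    Hypothetical helper for illustration; text nested inside child
    tags such as <em> is skipped.
    """
    return ''.join(
        child for child in tag.children
        if isinstance(child, NavigableString)
    ).strip()


html = '<div class="class1"><em>text1 foo bar</em> text2 foobar baz</div>'
soup = BeautifulSoup(html, 'html.parser')

result = own_text(soup.div)
print(result)  # text2 foobar baz
```

Unlike the .replace() version, this leaves the parsed tree unchanged and does not depend on what the <em> tag happens to contain. HTML comments sitting directly in the <div> would also be picked up, since bs4 represents them as a NavigableString subclass.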

Answer 2

You can remove all the child elements of the parent div and then get the parent's text, like this:

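>>> from bs4 import BeautifulSoup
>>> out_div = '<div class="class1"><em>text1</em> text2</div>'  # HTML from the question
>>> a = BeautifulSoup(out_div, 'html.parser')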
>>> for child in a.div.findChildren():
...     child.replace_with('')
...     
<em>text1</em>
>>> a.get_text()
u' text2'
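
The same idea can be sketched as a standalone script. This version assumes out_div holds the HTML from the question, and it swaps replace_with('') for decompose(), which removes a tag from the tree and destroys it; treat it as one possible variant rather than the answer's exact method.

```python
from bs4 import BeautifulSoup

# Assumption: out_div is the snippet from the question.
out_div = '<div class="class1"><em>text1</em> text2</div>'
a = BeautifulSoup(out_div, 'html.parser')

# find_all(True, recursive=False) returns only the tags that are
# direct children of the <div>; text nodes are left alone.
for child in a.div.find_all(True, recursive=False):
    child.decompose()  # drop the tag together with its contents

result = a.div.get_text()
print(repr(result))  # ' text2'
```

Be aware that this modifies the parsed tree in place, so parse the HTML again (or work on a copy) if you still need the <em> content afterwards. Call .strip() on the result to drop the leading space.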

python

beautifulsoup

web-scraping