Beautiful Soup: how to remove links *and* the link text from the soup

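I'm using Beautiful Soup to get some cleaned-up text from a webpage - no HTML, just the text that's shown to the user. However, I don't want the code to treat text that has a link attached as visible text. To make clear what I mean here: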

This text is the problem

The above text links to the Beautiful Soup documentation. At present I cut out the actual link, but the text 'This text is the problem' remains. Ideally I would like to remove that text as well.

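You can select the <a> tags that have an href attribute and remove them from the tree with .extract() or .decompose().

Here is the full example, first without removing anything: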

from bs4 import BeautifulSoup

html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
    </div>'''

soup = BeautifulSoup(html, 'html.parser')

p_tags = soup.find_all('p')

for each in p_tags:
    print(each.text)

Output:

I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: 
This text is the problem
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.

Then remove the link with .extract():

from bs4 import BeautifulSoup

html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
    </div>'''

soup = BeautifulSoup(html, 'html.parser')

# Detach every <a> tag that has an href, together with its text
for a in soup.find_all('a', href=True):
    a.extract()

p_tags = soup.find_all('p')

for each in p_tags:
    print(each.text)

Output:

I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: 

The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
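
As a side note, .extract() returns the tag it detaches, so the removed link text can be kept for inspection if needed. A minimal sketch of that idea, assuming a hypothetical helper named strip_links and a made-up example.com URL (neither is part of the original answer):

```python
from bs4 import BeautifulSoup


def strip_links(html):
    """Hypothetical helper: remove every <a href=...> tag and return
    the remaining text plus the text of each removed link."""
    soup = BeautifulSoup(html, 'html.parser')
    removed = []
    for a in soup.find_all('a', href=True):
        # extract() detaches the tag from the tree and returns it
        removed.append(a.extract().get_text())
    return soup.get_text(), removed


text, removed = strip_links(
    '<p>Keep this <a href="https://example.com">drop this</a> and this</p>'
)
print(repr(text))
print(removed)
```

The link text no longer appears in the remaining text, and the whitespace that surrounded the link is left as it was.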

You can also use .decompose():

from bs4 import BeautifulSoup

html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
    </div>'''

soup = BeautifulSoup(html, 'html.parser')

# soup.a is only the first <a> tag in the document; decompose() destroys it and its contents
soup.a.decompose()

p_tags = soup.find_all('p')

for each in p_tags:
    print(each.text)
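
Note that soup.a only refers to the first <a> tag, so a page with several links needs a loop. The following is a rough sketch under that assumption; the sample HTML and example.com URLs are invented for illustration, and skipping paragraphs left empty is an optional extra rather than something the question asked for:

```python
from bs4 import BeautifulSoup

# Made-up sample with two links, one standalone and one inline
html = '''<div>
<p>First paragraph.</p>
<p><a href="https://example.com/one">link one</a></p>
<p>Second <a href="https://example.com/two">link two</a> paragraph.</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Destroy every <a> tag, not just the first one
for a in soup.find_all('a'):
    a.decompose()

# Collect paragraph text, stripping each piece and joining pieces with one space
lines = [p.get_text(' ', strip=True) for p in soup.find_all('p')]
# Drop paragraphs that contained nothing but a link
lines = [line for line in lines if line]

for line in lines:
    print(line)
```

With .decompose() the removed tags are discarded, so use .extract() instead if the links need to be inspected afterwards.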