Beautiful Soup: how to remove links *and* the link text from the soup
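I'm using Beautiful Soup to get some cleaned-up text from a webpage: no HTML, just the text that's shown to the user. However, I don't really want the code to treat text that has a link attached as visible text. To make clear what I mean here:
[This text is the problem](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
The above text links to the Beautiful Soup documentation. At present I cut out the actual link, but the text 'This text is the problem' remains. Ideally, I would like to remove that text as well.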
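You can find the <a> tags that have an href attribute and remove them, together with their text, using .extract() or .decompose().
Here it is in full, starting with the output before anything is removed: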
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
Output:
I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
This text is the problem
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
Then remove the link:
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# Remove every <a> tag that has an href attribute, together with its text
for a in soup.find_all('a', href=True):
    a.extract()
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
Output:
I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
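One detail that may be useful here: .extract() returns the tag it removes, so the link text and URL are still available afterwards if you want to log or inspect them. The snippet below is only a minimal sketch of that idea; the sample HTML, the example.com URL, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question
html = '<p>Before <a href="https://example.com/">link text</a> after</p>'
soup = BeautifulSoup(html, 'html.parser')

# extract() detaches each tag from the tree and returns it,
# so the removed links can still be inspected afterwards.
removed = [a.extract() for a in soup.find_all('a', href=True)]

remaining_text = soup.get_text()
removed_links = [(a.get_text(), a['href']) for a in removed]

print(remaining_text)
print(removed_links)
```

If you never need the removed tags again, .decompose() is the alternative: it removes the tag and destroys it instead of returning it.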
You can also use .decompose():
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# Note: soup.a is only the first <a> tag in the document
soup.a.decompose()
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
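Two things to watch for with the approach above. soup.a refers only to the first <a> tag, so a page with several links needs a loop. Also, a paragraph that contained nothing but a link stays in the tree as an empty <p>, which prints as a blank line. The following is a rough sketch of one way to handle both; the sample HTML and the whitespace clean-up step are assumptions added for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with more than one link
html = '''<div>
<p>First paragraph with <a href="https://example.com/one">a link</a> inside.</p>
<p><a href="https://example.com/two">Only a link</a></p>
<p>Last paragraph.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list, so it is safe to remove tags while looping over it.
for a in soup.find_all('a'):
    a.decompose()

# Collapse runs of whitespace left behind by the removed links,
# then skip paragraphs that are now empty.
paragraphs = [' '.join(p.get_text().split()) for p in soup.find_all('p')]
paragraphs = [text for text in paragraphs if text]

print(paragraphs)
```

With this sample input, the list should contain the first and last paragraphs only, with the link text gone from the first one.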