How to remove HTML tags from the results of BeautifulSoup find_all

I need to use Python and BeautifulSoup to remove the tags and keep only the text in the output of the following code:

import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content)
print(soup.prettify())


first_header = soup.find(["h2", "h2"])

first_headers = soup.find_all(["h2", "h2"])
first_headers

To get only the text from a ResultSet, iterate over it, for example with a list comprehension, call .text on each element, and .join() all the text elements with whitespace:
' '.join([e.text for e in soup.find_all('h2')])  

Example

import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content, "html.parser")  # name the parser explicitly to avoid bs4's parser-guessing warning


first_headers = ' '.join([e.text for e in soup.find_all('h2')])

print(first_headers)

Output

Tutorials References Exercises and Quizzes HTML Tutorial HTML Forms HTML Graphics HTML Media HTML APIs HTML Examples HTML References What is HTML? A Simple HTML Document What is an HTML Element? Web Browsers HTML Page Structure HTML History Report Error Thank You For Helping Us!
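
The join approach can also be checked without a network request. Below is a minimal sketch, assuming a small inline HTML string (made up for illustration, not the actual w3schools page) and Python's built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page
html = """
<h1>HTML Introduction</h1>
<h2>What is <b>HTML</b>?</h2>
<p>Some paragraph text.</p>
<h2>Web Browsers</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet of Tag objects; .text drops the markup,
# including nested tags such as <b>
headers = [tag.text for tag in soup.find_all("h2")]

# Join the individual strings into one whitespace-separated string
joined = " ".join(headers)

print(headers)
print(joined)
```

Keeping the list (headers) around is handy if each header is needed separately later; the joined string is only useful for display.
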
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content,features="html.parser") # getting content from webpage
# retrieving all h1 and h2 tags and extracting the text from each of them
first_headers = [html.text for html in soup.find_all(["h1", "h2"])] 
print(first_headers)

I solved it in one line with a list comprehension; you can use a for loop instead:

import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content,features="html.parser")

first_headers = soup.find_all(["h1", "h2"])
for i in first_headers:
    print(i.text)

This is the output of my code:

Tutorials
References
Exercises and Quizzes
HTML Tutorial
HTML Forms
HTML Graphics
HTML Media
HTML APIs
HTML Examples
HTML References
HTML Introduction
What is HTML?
A Simple HTML Document
What is an HTML Element?
Web Browsers
HTML Page Structure
HTML History
Report Error
Thank You For Helping Us!
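
One thing to watch for on other pages: .text keeps any whitespace that sits inside the tag. Below is a minimal sketch, assuming made-up inline HTML with padded headers, that compares .text with get_text(strip=True):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with padding inside the header tags
html = "<h1>  HTML Introduction  </h1><h2>\n  HTML History\n</h2>"

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all(["h1", "h2"])

# .text returns the string exactly as it appears inside the tag
raw = [tag.text for tag in tags]

# get_text(strip=True) trims leading and trailing whitespace
clean = [tag.get_text(strip=True) for tag in tags]

print(raw)
print(clean)
```

Note that strip=True trims each text fragment separately, so for headers that contain nested tags it may be safer to call .text.strip() instead.
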