How to remove HTML tags from the result of BeautifulSoup find_all
I need to use Python and BeautifulSoup to remove the tags and keep only the text in the output of the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content)
print(soup.prettify())
first_header = soup.find(["h2", "h2"])
first_headers = soup.find_all(["h2", "h2"])
first_headers
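To get only the text from the ResultSet, iterate over it, for example with a list comprehension, call .text on each element, and combine all of the text elements with .join(), using whitespace as the separator: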
' '.join([e.text for e in soup.find_all('h2')])
Example
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content)
first_headers = ' '.join([e.text for e in soup.find_all('h2')])
print(first_headers)
Output
Tutorials References Exercises and Quizzes HTML Tutorial HTML Forms HTML Graphics HTML Media HTML APIs HTML Examples HTML References What is HTML? A Simple HTML Document What is an HTML Element? Web Browsers HTML Page Structure HTML History Report Error Thank You For Helping Us!
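The same technique can be tried without a network request. The following is only a minimal sketch: the inline HTML string is a hypothetical stand-in for the downloaded page, and the names html, headers, and joined are made up for illustration. It shows that .text also flattens nested tags inside a heading.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page
html = """
<h1>HTML Introduction</h1>
<h2>What is HTML?</h2>
<p>Some paragraph text.</p>
<h2>A Simple <em>HTML</em> Document</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet of Tag objects, not strings
headers = soup.find_all("h2")

# .text drops the markup (including the nested <em>) and keeps the text
joined = " ".join(tag.text for tag in headers)
print(joined)
```

Passing a different separator to join, for example "\n", puts each heading on its own line instead.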
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content,features="html.parser") # getting content from webpage
# retrieving all h1 and h2 tags and extracting the text from each of them
first_headers = [html.text for html in soup.find_all(["h1", "h2"])]
print(first_headers)
I solved it in one line with a list comprehension; you can use a for loop instead:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content,features="html.parser")
first_headers = soup.find_all(["h1", "h2"])
for i in first_headers:
    print(i.text)
Here is the output of my code:
Tutorials
References
Exercises and Quizzes
HTML Tutorial
HTML Forms
HTML Graphics
HTML Media
HTML APIs
HTML Examples
HTML References
HTML Introduction
What is HTML?
A Simple HTML Document
What is an HTML Element?
Web Browsers
HTML Page Structure
HTML History
Report Error
Thank You For Helping Us!
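If the headings on a page contain stray newlines or padding, .text returns them unchanged. One possible variation, shown here as a sketch on a hypothetical inline HTML string rather than the real page, is to call get_text(strip=True) on each tag; the names html, raw, and clean are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with extra whitespace inside the headings
html = """
<h1>
  HTML Introduction
</h1>
<h2>  What is HTML? </h2>
"""

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all(["h1", "h2"])

# .text keeps the surrounding whitespace exactly as it appears in the markup
raw = [tag.text for tag in tags]

# get_text(strip=True) strips leading and trailing whitespace from each string
clean = [tag.get_text(strip=True) for tag in tags]

for line in clean:
    print(line)
```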