How to loop a div and get the text in the paragraph tag only using BeautifulSoup and python?
I am crawling a web page using BeautifulSoup and Python, and I want to extract only the text from the paragraph tags on the site.
This is the page I want to crawl
I want all the text from all the paragraph tags.
Thanks in advance.
Use selenium only as a last resort, to save resources.
from selenium import webdriver

url = 'https://www.who.int/csr/disease/coronavirus_infections/faq_dec12/en/'
driver = webdriver.Chrome()
try:
    driver.get(url)
    # Grab all the visible text inside the div with id "primary"
    div_text = driver.find_element_by_id('primary').text
    with open('website_content.txt', 'w') as f:
        f.write(div_text)
except Exception as e:
    print(e)
finally:
    if driver is not None:
        driver.close()
You can achieve the same with requests and Beautiful Soup, as follows:
import requests as rq
from bs4 import BeautifulSoup

url = 'https://www.who.int/csr/disease/coronavirus_infections/faq_dec12/en/'
response = rq.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # .text collapses all the text inside the div, including non-paragraph tags
    div_text = soup.find('div', {'id': 'primary'}).text
    with open('website_content.txt', 'w') as f:
        f.write(div_text)
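Since the question asks for the text of the paragraph tags only (not everything in the div), a minimal sketch is to loop over the `<p>` tags inside the div with `find_all('p')`. This assumes the same `primary` div id used in the answer above still matches the page:

```python
import requests as rq
from bs4 import BeautifulSoup

url = 'https://www.who.int/csr/disease/coronavirus_infections/faq_dec12/en/'
response = rq.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    div = soup.find('div', {'id': 'primary'})
    # Loop over each <p> inside the div and keep only its text,
    # skipping any other tags (headings, lists, spans, ...)
    paragraphs = [p.get_text(strip=True) for p in div.find_all('p')]
    with open('website_content.txt', 'w') as f:
        f.write('\n'.join(paragraphs))
```

`get_text(strip=True)` trims surrounding whitespace from each paragraph; drop the argument if you want the raw text as-is.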